qemu-devel

Re: [Qemu-devel] [PATCH v2 00/16] Postcopy: Hugepage support


From: Dr. David Alan Gilbert
Subject: Re: [Qemu-devel] [PATCH v2 00/16] Postcopy: Hugepage support
Date: Mon, 27 Feb 2017 11:26:58 +0000
User-agent: Mutt/1.7.1 (2016-10-04)

* Alexey Perevalov (address@hidden) wrote:
> Hi David,
> 
> 
> On Tue, Feb 21, 2017 at 10:03:14AM +0000, Dr. David Alan Gilbert wrote:
> > * Alexey Perevalov (address@hidden) wrote:
> > > 
> > > Hello David,
> > 
> > Hi Alexey,
> > 
> > > On Tue, Feb 14, 2017 at 07:34:26PM +0000, Dr. David Alan Gilbert wrote:
> > > > * Alexey Perevalov (address@hidden) wrote:
> > > > > Hi David,
> > > > > 
> > > > > Thank you, now it's clear.
> > > > > 
> > > > > On Mon, Feb 13, 2017 at 06:16:02PM +0000, Dr. David Alan Gilbert 
> > > > > wrote:
> > > > > > * Alexey Perevalov (address@hidden) wrote:
> > > > > > >  Hello David!
> > > > > > 
> > > > > > Hi Alexey,
> > > > > > 
> > > > > > > I have checked your series with 1G hugepages, but only in a
> > > > > > > 1 Gbit/sec network environment.
> > > > > > 
> > > > > > Can you show the qemu command line you're using?  I'm just trying
> > > > > > to make sure I understand where your hugepages are; running 1G
> > > > > > host pages across a 1Gbit/sec network for postcopy would be pretty
> > > > > > poor - it would take ~10 seconds to transfer the page.
> > > > > 
> > > > > sure
> > > > > -hda ./Ubuntu.img -name PAU,debug-threads=on -boot d -net nic -net 
> > > > > user
> > > > > -m 1024 -localtime -nographic -enable-kvm -incoming tcp:0:4444 -object
> > > > > memory-backend-file,id=mem,size=1G,mem-path=/dev/hugepages 
> > > > > -mem-prealloc
> > > > > -numa node,memdev=mem -trace events=/tmp/events -chardev
> > > > > socket,id=charmonitor,path=/var/lib/migrate-vm-monitor.sock,server,nowait
> > > > > -mon chardev=charmonitor,id=monitor,mode=control
> > > > 
> > > > OK, it's a pretty unusual setup - a 1G page guest with 1G of guest RAM.
> > > > 
> > > > > > 
> > > > > > > I started Ubuntu with just a console interface and gave it only
> > > > > > > 1G of RAM; inside Ubuntu I started the stress command
> > > > > > 
> > > > > > > (stress --cpu 4 --io 4 --vm 4 --vm-bytes 256000000 &)
> > > > > > > In such an environment precopy live migration was impossible; it
> > > > > > > never finished, it just keeps sending pages indefinitely (it looks
> > > > > > > like the dpkg scenario).
> > > > > > > 
> > > > > > > Also I modified the stress utility
> > > > > > > (http://people.seas.harvard.edu/~apw/stress/stress-1.0.4.tar.gz)
> > > > > > > because it writes the same value `Z` into memory every time. My
> > > > > > > modified version writes a newly incremented value on each allocation.
> > > > > > 
> > > > > > I use google's stressapptest normally; although remember to turn
> > > > > > off the bit where it pauses.
> > > > > 
> > > > > I decided to use it too
> > > > > stressapptest -s 300 -M 256 -m 8 -W
> > > > > 
> > > > > > 
> > > > > > > I'm using Arcangeli's kernel only at the destination.
> > > > > > > 
> > > > > > > I got surprising results. Downtime for 1G hugepages is close to that
> > > > > > > for 2MB hugepages: it took around 7 ms (in the 2MB hugepage scenario
> > > > > > > downtime was around 8 ms).
> > > > > > > I based that on query-migrate.
> > > > > > > {"return": {"status": "completed", "setup-time": 6, "downtime": 
> > > > > > > 6, "total-time": 9668, "ram": {"total": 1091379200, 
> > > > > > > "postcopy-requests": 1, "dirty-sync-count": 2, "remaining": 0, 
> > > > > > > "mbps": 879.786851, "transferred": 1063007296, "duplicate": 7449, 
> > > > > > > "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 1060868096, 
> > > > > > > "normal": 259001}}}
> > > > > > > 
> > > > > > > The documentation says the downtime field's measurement unit is ms.
> > > > > > 
> > > > > > The downtime measurement field is pretty meaningless for postcopy;
> > > > > > it's only the time from stopping the VM until the point where we tell
> > > > > > the destination it can start running.  Meaningful measurements really
> > > > > > only come from inside the guest, or from the placement latencies.
> > > > > >
> > > > > 
> > > > > Maybe we could improve it by receiving such information from the
> > > > > destination? I'd like to do that.
> > > > > > > So I traced it (I added an additional trace into postcopy_place_page:
> > > > > > > trace_postcopy_place_page_start(host, from, pagesize); )
> > > > > > > 
> > > > > > > postcopy_ram_fault_thread_request Request for HVA=7f6dc0000000 
> > > > > > > rb=/objects/mem offset=0
> > > > > > > postcopy_place_page_start host=0x7f6dc0000000 
> > > > > > > from=0x7f6d70000000, pagesize=40000000
> > > > > > > postcopy_place_page_start host=0x7f6e0e800000 
> > > > > > > from=0x55b665969619, pagesize=1000
> > > > > > > postcopy_place_page_start host=0x7f6e0e801000 
> > > > > > > from=0x55b6659684e8, pagesize=1000
> > > > > > > several pages with 4Kb step ...
> > > > > > > postcopy_place_page_start host=0x7f6e0e817000 
> > > > > > > from=0x55b6659694f0, pagesize=1000
> > > > > > > 
> > > > > > > 4K pages, starting from address 0x7f6e0e800000 - that's
> > > > > > > vga.ram, /address@hidden/acpi/tables etc.
> > > > > > > 
> > > > > > > Frankly speaking, right now I don't have any idea why the hugepage
> > > > > > > wasn't re-sent. Maybe my expectation of it is wrong, as well as my
> > > > > > > understanding )
> > > > > > 
> > > > > > That's pretty much what I expect to see - before you get into
> > > > > > postcopy mode everything is sent as individual 4k pages (in order);
> > > > > > once we're in postcopy mode we send each page no more than once.
> > > > > > So your huge page comes across once - and there it is.
> > > > > > 
> > > > > > > The stress utility also duplicated the value for me into a file, as
> > > > > > > sec_since_epoch.microsec:value
> > > > > > > 1487003192.728493:22
> > > > > > > 1487003197.335362:23
> > > > > > > *1487003213.367260:24*
> > > > > > > *1487003238.480379:25*
> > > > > > > 1487003243.315299:26
> > > > > > > 1487003250.775721:27
> > > > > > > 1487003255.473792:28
> > > > > > > 
> > > > > > > It means rewriting 256MB of memory byte by byte took around 5 sec,
> > > > > > > but at the moment of migration it took 25 sec.
> > > > > > 
> > > > > > Right, now this is the thing that's more useful to measure.
> > > > > > That's not too surprising; when it migrates, that data is changing
> > > > > > rapidly, so it's going to have to pause and wait for that whole 1GB
> > > > > > to be transferred. Your 1Gbps network is going to take about 10
> > > > > > seconds to transfer that 1GB page - and that's if you're lucky and it
> > > > > > saturates the network. So it's going to take at least 10 seconds
> > > > > > longer than it normally would, plus any other overheads - so at least
> > > > > > 15 seconds. This is why I say it's a bad idea to use 1GB host pages
> > > > > > with postcopy. Of course it would be fun to find where the other 10
> > > > > > seconds went!
> > > > > > 
> > > > > > You might like to add timing to the tracing so you can see the time 
> > > > > > between the
> > > > > > fault thread requesting the page and it arriving.
> > > > > >
> > > > > yes, sorry I forgot about timing
> > > > > address@hidden:postcopy_ram_fault_thread_request Request for 
> > > > > HVA=7f0280000000 rb=/objects/mem offset=0
> > > > > address@hidden:qemu_loadvm_state_section 8
> > > > > address@hidden:loadvm_process_command com=0x2 len=4
> > > > > address@hidden:qemu_loadvm_state_section 2
> > > > > address@hidden:postcopy_place_page_start host=0x7f0280000000 
> > > > > from=0x7f0240000000, pagesize=40000000
> > > > > 
> > > > > 1487084823.315919 - 1487084818.270993 = 5.044926 sec.
> > > > > Machines connected w/o any routers, directly by cable.
> > > > 
> > > > OK, the fact it's only 5 seconds not 10 suggests, I think, that a lot of
> > > > the memory was all zero, so it didn't take up the whole bandwidth.
> > 
> > > I decided to measure downtime as a sum of intervals from when a fault
> > > happened until the page was loaded. I didn't rely on ordering, so I
> > > associated each interval with its fault address.
> > 
> > Don't forget the source will still be sending unrequested pages at the
> > same time as fault responses; so that simplification might be wrong.
> > My experience with 4k pages is you'll often get pages that arrive
> > at about the same time as you ask for them because of the background 
> > transmission.
> > 
> > > For a 2G RAM VM using 1G huge pages, downtime measured on dst is around
> > > 12 sec, but for the same 2G RAM VM with 2MB huge pages, downtime measured
> > > on dst is around 20 sec; 320 page faults happened and 640MB was transmitted.
> > 
> > OK, so 20/320 * 1000 = 62.5 ms/page.  That's a bit high.
> > I think it takes about 16ms to transmit a 2MB page on your 1Gbps network,
> Yes, you're right: the transfer of the first page doesn't wait for prefetched
> page transmission, and downtime for the first page was 25 ms.
> 
> Subsequently requested pages are queued (FIFO), so dst waits for all the
> prefetched pages - around 5-7 page transmissions.
> So I have a question: why not put a requested page at the head of the
> queue in that case, so that dst qemu only waits for the page that was
> already in transmission?

The problem is it's already in the source's network queue.

> Also, if I'm not wrong, commands and pages are transferred over the same
> socket. Why not use TCP OOB data for commands in this case?

My understanding was that OOB was limited to quite small transfers.
I think the right way is to use a separate FD for the requests, so I'll
do that after Juan's multifd series.
Although even then I'm not sure how it will behave; the other thing
might be to throttle the background page transfer so the FIFO isn't
as full.

> > you're probably also suffering from the requests being queued behind
> > background requests; if you try reducing your tcp_wmem setting on the
> > source it might get a bit better.  Once Juan Quintela's multi-fd work
> > goes in my hope is to combine it with postcopy and then be able to
> > avoid that type of request blocking.
> > Generally I'd recommend 10Gbps for postcopy, since it does pull down
> > the latency quite a bit.
> > 
> > > My current method doesn't take multi-core vCPUs into account. I checked
> > > only with 1 CPU, but that's not the proper case. So I think it's worth
> > > counting downtime per CPU, or calculating the overlap of the CPU downtimes.
> > > What do you think?
> > 
> > Yes; one of the nice things about postcopy is that if one vCPU is blocked
> > waiting for a page, the other vCPUs will just be able to carry on.
> > Even with 1 vCPU, if you've got multiple tasks that can run, the guest can
> > switch to a task that isn't blocked (see KVM asynchronous page faults).
> > Now, what the numbers mean when you calculate the total like that might be
> > a bit odd - for example, if you have 8 vCPUs and they're each blocked, do you
> > add the times together even though they're blocked at the same time? What
> > about if they're blocked on the same page?
> 
> I implemented downtime calculation for all CPUs; the approach is the
> following:
> 
> Initially, intervals are represented in a tree where the key is the
> page-fault address, and the values are:
>     begin - page fault time
>     end   - page load time
>     cpus  - bit mask showing the affected cpus
> 
> To calculate the overlap across all CPUs, the intervals are converted
> into an array of points in time (downtime_intervals); the size of the
> array is 2 * the number of nodes in the interval tree (2 array
> elements per interval).
> Each element is marked as the end (E) or not the end (S) of an
> interval.
> The overlap downtime will only be counted for an S..E span when the
> sequence S(0..N)E(M) covers every vCPU.
> 
> As an example, we have 3 CPUs:
>      S1        E1           S1               E1
> -----***********------------xxx***************------------------------> CPU1
> 
>             S2                E2
> ------------****************xxx---------------------------------------> CPU2
> 
>                         S3            E3
> ------------------------****xxx********-------------------------------> CPU3
>               
> We have the sequence S1,S2,E1,S3,S1,E2,E3,E1.
> S2,E1 doesn't match the condition, because the
> sequence S1,S2,E1 doesn't include CPU3;
> S3,S1,E2 - this sequence includes all CPUs, so in
> this case the overlap will be S1,E2.
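
A minimal sketch of that sweep in C; the types, names and the assumption of at most 64 vCPUs in a bitmask are just for illustration here, not code from your patches or from this series:

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    int64_t  time;     /* timestamp of the point                        */
    uint64_t cpu_mask; /* vCPUs affected by this fault (<= 64 assumed)  */
    bool     is_end;   /* true = page placed (E), false = fault hit (S) */
} DowntimePoint;

static int point_cmp(const void *a, const void *b)
{
    const DowntimePoint *pa = a, *pb = b;
    return (pa->time > pb->time) - (pa->time < pb->time);
}

/* points: two entries per fault interval (one S, one E), in any order.
 * Returns the total time during which *every* vCPU had at least one
 * outstanding fault - i.e. the overlap downtime described above. */
int64_t total_overlap_downtime(DowntimePoint *points, size_t n, int smp_cpus)
{
    int *outstanding = calloc(smp_cpus, sizeof(int));
    int64_t total = 0, overlap_start = 0;
    bool all_blocked = false;

    qsort(points, n, sizeof(*points), point_cmp);

    for (size_t i = 0; i < n; i++) {
        /* Update the per-vCPU count of outstanding faults. */
        for (int cpu = 0; cpu < smp_cpus; cpu++) {
            if (points[i].cpu_mask & (1ULL << cpu)) {
                outstanding[cpu] += points[i].is_end ? -1 : 1;
            }
        }

        /* Are all vCPUs currently blocked on at least one page? */
        bool now_blocked = true;
        for (int cpu = 0; cpu < smp_cpus; cpu++) {
            if (outstanding[cpu] == 0) {
                now_blocked = false;
                break;
            }
        }

        if (now_blocked && !all_blocked) {
            overlap_start = points[i].time;          /* an S completed the set */
        } else if (!now_blocked && all_blocked) {
            total += points[i].time - overlap_start; /* an E broke the set     */
        }
        all_blocked = now_blocked;
    }

    free(outstanding);
    return total;
}

In the three-CPU example above this accumulates exactly the S1..E2 span, since that is the only stretch where CPU1, CPU2 and CPU3 are all waiting at once.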
> 
> 
> But I'm not sending the RFC yet,
> because I've hit an issue: the kernel doesn't inform user space about the
> page's owner in handle_userfault. So it's a question for Andrea - is it
> worth adding such information?
> Frankly speaking, I don't know whether current (the task_struct) in
> handle_userfault is equal to the mm_struct's owner.

Is this so you can find which thread is waiting for it? I'm not sure it's
worth it; we don't normally need that, and anyway it doesn't help if multiple
CPUs need it, where the 2nd CPU hits it just after the 1st one.

Dave

> > 
> > > Also, I haven't yet finished the IPC to provide such information to the
> > > src host, where info_migrate is being called.
> > 
> > Dave
> > 
> > > 
> > > 
> > > > 
> > > > > > > One more request.
> > > > > > > QEMU can use mem_path on hugetlbfs together with the share key
> > > > > > > (-object
> > > > > > > memory-backend-file,id=mem,size=${mem_size},mem-path=${mem_path},share=on)
> > > > > > > and the VM in this case will start and work properly (it will
> > > > > > > allocate memory with mmap), but on the destination of a postcopy
> > > > > > > live migration the UFFDIO_COPY ioctl will fail for such a region;
> > > > > > > in Arcangeli's git tree there is a check preventing it:
> > > > > > > (if (!vma_is_shmem(dst_vma) && dst_vma->vm_flags & VM_SHARED)).
> > > > > > > Is it possible to handle such a situation in qemu?
> > > > > > 
> > > > > > Imagine that you had shared memory; what semantics would you like
> > > > > > to see?  What happens to the other process?
> > > > > 
> > > > > Honestly, initially I thought about handling such an error, but I quite
> > > > > forgot about vhost-user in ovs-dpdk.
> > > > 
> > > > Yes, I don't know much about vhost-user; but we'll have to think 
> > > > carefully
> > > > about the way things behave when they're accessing memory that's shared
> > > > with qemu during migration.  Writing to the source after we've started
> > > > the postcopy phase is not allowed.  Accessing the destination memory
> > > > during postcopy will produce pauses in the other processes accessing it
> > > > (I think) and they mustn't do various types of madvise etc - so
> > > > I'm sure there will be things we find out the hard way!
> > > > 
> > > > Dave
> > > > 
> > > > > > Dave
> > > > > > 
> > > > > > > On Mon, Feb 06, 2017 at 05:45:30PM +0000, Dr. David Alan Gilbert 
> > > > > > > wrote:
> > > > > > > > * Dr. David Alan Gilbert (git) (address@hidden) wrote:
> > > > > > > > > From: "Dr. David Alan Gilbert" <address@hidden>
> > > > > > > > > 
> > > > > > > > > Hi,
> > > > > > > > >   The existing postcopy code, and the userfault kernel
> > > > > > > > > code that supports it, only works for normal anonymous memory.
> > > > > > > > > Kernel support for userfault on hugetlbfs is working
> > > > > > > > > its way upstream; it's in the linux-mm tree.
> > > > > > > > > You can get a version at:
> > > > > > > > >    git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
> > > > > > > > > on the origin/userfault branch.
> > > > > > > > > 
> > > > > > > > > Note that while this code supports arbitrary sized hugepages,
> > > > > > > > > it doesn't make sense with pages above the few-MB region,
> > > > > > > > > so while 2MB is fine, 1GB is probably a bad idea;
> > > > > > > > > this code waits for and transmits whole huge pages, and a
> > > > > > > > > 1GB page would take about 1 second to transfer over a 10Gbps
> > > > > > > > > link - which is way too long to pause the destination for.
> > > > > > > > > 
> > > > > > > > > Dave
> > > > > > > > 
> > > > > > > > Oops I missed the v2 changes from the message:
> > > > > > > > 
> > > > > > > > v2
> > > > > > > >   Flip ram-size summary word/compare individual page size patches around
> > > > > > > >   Individual page size comparison is done in ram_load if 'advise' has been
> > > > > > > >     received rather than checking migrate_postcopy_ram()
> > > > > > > >   Moved discard code into exec.c, reworked ram_discard_range
> > > > > > > > 
> > > > > > > > Dave
> > > > > > > 
> > > > > > > Thank you; right now it's not necessary to set the
> > > > > > > postcopy-ram capability on the destination machine.
> > > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > > Dr. David Alan Gilbert (16):
> > > > > > > > >   postcopy: Transmit ram size summary word
> > > > > > > > >   postcopy: Transmit and compare individual page sizes
> > > > > > > > >   postcopy: Chunk discards for hugepages
> > > > > > > > >   exec: ram_block_discard_range
> > > > > > > > >   postcopy: enhance ram_block_discard_range for hugepages
> > > > > > > > >   Fold postcopy_ram_discard_range into ram_discard_range
> > > > > > > > >   postcopy: Record largest page size
> > > > > > > > >   postcopy: Plumb pagesize down into place helpers
> > > > > > > > >   postcopy: Use temporary for placing zero huge pages
> > > > > > > > >   postcopy: Load huge pages in one go
> > > > > > > > >   postcopy: Mask fault addresses to huge page boundary
> > > > > > > > >   postcopy: Send whole huge pages
> > > > > > > > >   postcopy: Allow hugepages
> > > > > > > > >   postcopy: Update userfaultfd.h header
> > > > > > > > >   postcopy: Check for userfault+hugepage feature
> > > > > > > > >   postcopy: Add doc about hugepages and postcopy
> > > > > > > > > 
> > > > > > > > >  docs/migration.txt                |  13 ++++
> > > > > > > > >  exec.c                            |  83 +++++++++++++++++++++++
> > > > > > > > >  include/exec/cpu-common.h         |   2 +
> > > > > > > > >  include/exec/memory.h             |   1 -
> > > > > > > > >  include/migration/migration.h     |   3 +
> > > > > > > > >  include/migration/postcopy-ram.h  |  13 ++--
> > > > > > > > >  linux-headers/linux/userfaultfd.h |  81 +++++++++++++++---
> > > > > > > > >  migration/migration.c             |   1 +
> > > > > > > > >  migration/postcopy-ram.c          | 138 +++++++++++++++++---------------------
> > > > > > > > >  migration/ram.c                   | 109 ++++++++++++++++++------------
> > > > > > > > >  migration/savevm.c                |  32 ++++++---
> > > > > > > > >  migration/trace-events            |   2 +-
> > > > > > > > >  12 files changed, 328 insertions(+), 150 deletions(-)
> > > > > > > > > 
> > > > > > > > > -- 
> > > > > > > > > 2.9.3
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > --
> > > > > > > > Dr. David Alan Gilbert / address@hidden / Manchester, UK
> > > > > > > > 
> > > > > > --
> > > > > > Dr. David Alan Gilbert / address@hidden / Manchester, UK
> > > > > > 
> > > > > 
> > > > > -- 
> > > > > 
> > > > > BR
> > > > > Alexey
> > > > --
> > > > Dr. David Alan Gilbert / address@hidden / Manchester, UK
> > > > 
> > > 
> > > -- 
> > > 
> > > BR
> > > Alexey
> > --
> > Dr. David Alan Gilbert / address@hidden / Manchester, UK
> >
> 
> BR
> Alexey
--
Dr. David Alan Gilbert / address@hidden / Manchester, UK


