Re: [Qemu-devel] [PATCH v2 00/16] Postcopy: Hugepage support


From: Alexey Perevalov
Subject: Re: [Qemu-devel] [PATCH v2 00/16] Postcopy: Hugepage support
Date: Mon, 27 Feb 2017 14:05:35 +0300
User-agent: Mutt/1.5.24 (2015-08-30)

Hi David,


On Tue, Feb 21, 2017 at 10:03:14AM +0000, Dr. David Alan Gilbert wrote:
> * Alexey Perevalov (address@hidden) wrote:
> > 
> > Hello David,
> 
> Hi Alexey,
> 
> > On Tue, Feb 14, 2017 at 07:34:26PM +0000, Dr. David Alan Gilbert wrote:
> > > * Alexey Perevalov (address@hidden) wrote:
> > > > Hi David,
> > > > 
> > > > Thank you, now it's clear.
> > > > 
> > > > On Mon, Feb 13, 2017 at 06:16:02PM +0000, Dr. David Alan Gilbert wrote:
> > > > > * Alexey Perevalov (address@hidden) wrote:
> > > > > >  Hello David!
> > > > > 
> > > > > Hi Alexey,
> > > > > 
> > > > > > I have checked your series with 1G hugepages, but only in a 1 Gbit/sec
> > > > > > network environment.
> > > > > 
> > > > > Can you show the qemu command line you're using?  I'm just trying
> > > > > to make sure I understand where your hugepages are; running 1G host
> > > > > pages across a 1Gbit/sec network for postcopy would be pretty poor -
> > > > > it would take ~10 seconds to transfer the page.
> > > > 
> > > > sure
> > > > -hda ./Ubuntu.img -name PAU,debug-threads=on -boot d -net nic -net user
> > > > -m 1024 -localtime -nographic -enable-kvm -incoming tcp:0:4444 -object
> > > > memory-backend-file,id=mem,size=1G,mem-path=/dev/hugepages -mem-prealloc
> > > > -numa node,memdev=mem -trace events=/tmp/events -chardev
> > > > socket,id=charmonitor,path=/var/lib/migrate-vm-monitor.sock,server,nowait
> > > > -mon chardev=charmonitor,id=monitor,mode=control
> > > 
> > > OK, it's a pretty unusual setup - a 1G page guest with 1G of guest RAM.
> > > 
> > > > > 
> > > > > > I started Ubuntu with just a console interface and gave it only 1G of
> > > > > > RAM; inside Ubuntu I started the stress command
> > > > > 
> > > > > > (stress --cpu 4 --io 4 --vm 4 --vm-bytes 256000000 &)
> > > > > > in such an environment precopy live migration was impossible; it never
> > > > > > finished and just kept sending pages indefinitely (it looks like the
> > > > > > dpkg scenario).
> > > > > > 
> > > > > > Also I modified the stress utility
> > > > > > http://people.seas.harvard.edu/~apw/stress/stress-1.0.4.tar.gz
> > > > > > because it wrote the same value `Z` into memory every time. My
> > > > > > modified version writes a newly incremented value on every allocation.
> > > > > 
> > > > > I use google's stressapptest normally; although remember to turn
> > > > > off the bit where it pauses.
> > > > 
> > > > I decided to use it too
> > > > stressapptest -s 300 -M 256 -m 8 -W
> > > > 
> > > > > 
> > > > > > I'm using Arcangeli's kernel only at the destination.
> > > > > > 
> > > > > > I got contradictory results. Downtime for the 1G hugepage case is
> > > > > > close to the 2MB hugepage case: it took around 7 ms (in the 2MB
> > > > > > hugepage scenario downtime was around 8 ms).
> > > > > > I based that on the output of query-migrate.
> > > > > > {"return": {"status": "completed", "setup-time": 6, "downtime": 6, 
> > > > > > "total-time": 9668, "ram": {"total": 1091379200, 
> > > > > > "postcopy-requests": 1, "dirty-sync-count": 2, "remaining": 0, 
> > > > > > "mbps": 879.786851, "transferred": 1063007296, "duplicate": 7449, 
> > > > > > "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 1060868096, 
> > > > > > "normal": 259001}}}
> > > > > > 
> > > > > > The documentation says the downtime field's measurement unit is ms.
> > > > > 
> > > > > The downtime measurement field is pretty meaningless for postcopy;
> > > > > it's only the time from stopping the VM until the point where we tell
> > > > > the destination it can start running.  Meaningful measurements are
> > > > > really only from inside the guest, or the page placement latencies.
> > > > >
> > > > 
> > > > Maybe we could improve it by receiving such information from the
> > > > destination? I would like to do that.
> > > > > > So I traced it (I added an additional trace point into
> > > > > > postcopy_place_page: trace_postcopy_place_page_start(host, from, pagesize);)
> > > > > > 
> > > > > > postcopy_ram_fault_thread_request Request for HVA=7f6dc0000000 rb=/objects/mem offset=0
> > > > > > postcopy_place_page_start host=0x7f6dc0000000 from=0x7f6d70000000, pagesize=40000000
> > > > > > postcopy_place_page_start host=0x7f6e0e800000 from=0x55b665969619, pagesize=1000
> > > > > > postcopy_place_page_start host=0x7f6e0e801000 from=0x55b6659684e8, pagesize=1000
> > > > > > several pages with 4Kb step ...
> > > > > > postcopy_place_page_start host=0x7f6e0e817000 from=0x55b6659694f0, pagesize=1000
> > > > > > 
> > > > > > 4K pages, starting from address 0x7f6e0e800000; it's
> > > > > > vga.ram, /address@hidden/acpi/tables, etc.
> > > > > > 
> > > > > > Frankly speaking, right now I don't have any idea why the hugepage
> > > > > > wasn't resent. Maybe my expectation of it is wrong, as well as my
> > > > > > understanding )
> > > > > 
> > > > > That's pretty much what I expect to see - before you get into postcopy
> > > > > mode everything is sent as individual 4k pages (in order); once we're
> > > > > in postcopy mode we send each page no more than once.  So your
> > > > > huge page comes across once - and there it is.
> > > > > 
> > > > > > The stress utility also duplicated the value for me into a file:
> > > > > > sec_since_epoch.microsec:value
> > > > > > 1487003192.728493:22
> > > > > > 1487003197.335362:23
> > > > > > *1487003213.367260:24*
> > > > > > *1487003238.480379:25*
> > > > > > 1487003243.315299:26
> > > > > > 1487003250.775721:27
> > > > > > 1487003255.473792:28
> > > > > > 
> > > > > > It means rewriting 256MB of memory byte by byte took around 5 sec,
> > > > > > but at the moment of migration it took 25 sec.
> > > > > 
> > > > > Right, now this is the thing that's more useful to measure.
> > > > > That's not too surprising; when it migrates, that data is changing
> > > > > rapidly, so it's going to have to pause and wait for that whole 1GB to
> > > > > be transferred.
> > > > > Your 1Gbps network is going to take about 10 seconds to transfer that
> > > > > 1GB page - and that's if you're lucky and it saturates the network.
> > > > > So it's going to take at least 10 seconds longer than it normally
> > > > > would, plus any other overheads - so at least 15 seconds.
> > > > > This is why I say it's a bad idea to use 1GB host pages with postcopy.
> > > > > Of course it would be fun to find where the other 10 seconds went!
> > > > > 
> > > > > You might like to add timing to the tracing so you can see the time
> > > > > between the fault thread requesting the page and it arriving.
> > > > >
> > > > yes, sorry I forgot about timing
> > > > address@hidden:postcopy_ram_fault_thread_request Request for HVA=7f0280000000 rb=/objects/mem offset=0
> > > > address@hidden:qemu_loadvm_state_section 8
> > > > address@hidden:loadvm_process_command com=0x2 len=4
> > > > address@hidden:qemu_loadvm_state_section 2
> > > > address@hidden:postcopy_place_page_start host=0x7f0280000000 from=0x7f0240000000, pagesize=40000000
> > > > 
> > > > 1487084823.315919 - 1487084818.270993 = 5.044926 sec.
> > > > The machines are connected directly by cable, without any routers.
> > > 
> > > OK, the fact that it's only 5 seconds, not 10, suggests I think that a lot
> > > of the memory was all zero, so it didn't take up the whole bandwidth.
> 
> > I decided to measure downtime as a sum of intervals from when a fault
> > happened until the page was loaded. I didn't rely on ordering, so I
> > associated each interval with its fault address.
> 
> Don't forget the source will still be sending unrequested pages at the
> same time as fault responses; so that simplification might be wrong.
> My experience with 4k pages is you'll often get pages that arrive
> at about the same time as you ask for them because of the background 
> transmission.
> 
> > For a 2G RAM VM using 1G huge pages, downtime measured on dst is around
> > 12 sec, but for the same 2G RAM VM with 2MB huge pages, downtime measured
> > on dst is around 20 sec; 320 page faults happened and 640 MB was transmitted.
> 
> OK, so 20/320 * 1000 = 62.5 msec/page.  That's a bit high.
> I think it takes about 16 ms to transmit a 2MB page on your 1Gbps network,
Yes, you're right; transfer of the first page doesn't wait for prefetched page
transmission, and the downtime for the first page was 25 ms.

Subsequently requested pages are queued (FIFO), so dst has to wait for all the
prefetched pages ahead of them, which is around 5-7 page transmissions.
So my question is: why not put a requested page at the head of the queue in
that case? Then dst QEMU would only have to wait for the page that is already
in transmission. A rough sketch of the idea is below.
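
This is only a sketch of what I mean on the source side, not the actual QEMU
migration code; the PageRequest/SendQueue names and fields are made up:

    /* Hypothetical sketch: urgent (postcopy-requested) pages jump the queue,
     * background (prefetched) pages are appended at the tail. Not QEMU code. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct PageRequest {
        uint64_t offset;            /* offset of the page inside its RAMBlock */
        bool urgent;                /* true if the destination faulted on it  */
        struct PageRequest *next;
    } PageRequest;

    typedef struct {
        PageRequest *head;          /* next page to be put on the wire        */
        PageRequest *tail;
    } SendQueue;

    /* Background (prefetched) pages keep their FIFO order at the tail. */
    static void enqueue_background(SendQueue *q, PageRequest *req)
    {
        req->urgent = false;
        req->next = NULL;
        if (q->tail) {
            q->tail->next = req;
        } else {
            q->head = req;
        }
        q->tail = req;
    }

    /* A page the destination faulted on jumps to the head, so dst only waits
     * for the page that is already being transmitted, not for the 5-7
     * prefetched ones queued ahead of it. */
    static void enqueue_requested(SendQueue *q, PageRequest *req)
    {
        req->urgent = true;
        req->next = q->head;
        q->head = req;
        if (!q->tail) {
            q->tail = req;
        }
    }

With something like that, a fault response would only ever sit behind the page
currently on the wire.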

Also, if I'm not mistaken, commands and pages are transferred over the same
socket. Why not use TCP out-of-band data for the commands in this case?

> you're probably also suffering from the requests being queued behind
> background requests; if you try reducing your tcp_wmem setting on the
> source it might get a bit better.  Once Juan Quintela's multi-fd work
> goes in my hope is to combine it with postcopy and then be able to
> avoid that type of request blocking.
> Generally I'd recommend 10Gbps for postcopy since it does pull
> down the latency quite a bit.
> 
> > My current method doesn't take multi-core vCPUs into account. I checked
> > only with 1 CPU, but that's not a proper test case. So I think it's worth
> > counting downtime per CPU, or calculating the overlap of the CPU downtimes.
> > What do you think?
> 
> Yes; one of the nice things about postcopy is that if one vCPU is blocked
> waiting for a page, the other vCPUs will just be able to carry on.
> Even with 1 vCPU if you've got multiple tasks that can run the guest can
> switch to a task that isn't blocked (See KVM asynchronous page faults).
> Now, what the numbers mean when you calculate the total like that might be
> a bit odd - for example if you have 8 vCPUs and they're each blocked do you
> add the times together even though they're blocked at the same time? What
> about if they're blocked on the same page?

I implemented downtime calculation for all CPUs; the approach is as
follows:

Initially, intervals are kept in a tree where the key is the
page fault address, and the values are:
    begin - page fault time
    end   - page load time
    cpus  - bit mask showing the affected CPUs

To calculate the overlap across all CPUs, the intervals are converted
into an array of points in time (downtime_intervals); the size of the
array is 2 * the number of nodes in the interval tree (2 array
elements per interval).
Each element is marked as the end (E) or not the end (S) of an
interval.
The overlapped downtime is accumulated for an S..E span only when the
point sequence S(0..N)E(M) covers every vCPU; a sketch of that sweep
is shown after the example below.

As an example, we have 3 CPUs:
     S1        E1           S1               E1
-----***********------------xxx***************------------------------> CPU1

            S2                E2
------------****************xxx---------------------------------------> CPU2

                        S3            E3
------------------------****xxx********-------------------------------> CPU3
                
We have the sequence S1,S2,E1,S3,S1,E2,E3,E1.
S2,E1 doesn't match the condition, because the
sequence S1,S2,E1 doesn't include CPU3.
S3,S1,E2 is a sequence that includes all CPUs; in
this case the overlap will be S1..E2.
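
A minimal sketch of that sweep (assuming the points are already sorted by time
and there are at most 64 vCPUs; DowntimePoint and the function name are made
up, this is not the patch itself):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        int64_t  time;      /* timestamp of the point, in ms                     */
        uint64_t cpu_mask;  /* vCPUs blocked by the fault this point belongs to  */
        bool     is_end;    /* true for E (page placed), false for S (fault)     */
    } DowntimePoint;

    /* Sum of the spans during which every vCPU is blocked at the same time.
     * 'points' must be sorted by time; this sketch supports smp_cpus <= 64. */
    static int64_t total_overlapped_downtime(const DowntimePoint *points,
                                             size_t npoints, unsigned smp_cpus)
    {
        const uint64_t all_mask = (smp_cpus >= 64) ? UINT64_MAX
                                                   : (1ULL << smp_cpus) - 1;
        unsigned nfaults[64] = { 0 };   /* outstanding faults per vCPU           */
        uint64_t blocked = 0;           /* vCPUs with at least one pending fault */
        int64_t  total = 0, span_start = 0;

        for (size_t i = 0; i < npoints; i++) {
            bool was_all_blocked = (blocked == all_mask);

            for (unsigned cpu = 0; cpu < smp_cpus && cpu < 64; cpu++) {
                if (!(points[i].cpu_mask & (1ULL << cpu))) {
                    continue;
                }
                if (points[i].is_end) {
                    if (nfaults[cpu] && --nfaults[cpu] == 0) {
                        blocked &= ~(1ULL << cpu);
                    }
                } else if (nfaults[cpu]++ == 0) {
                    blocked |= 1ULL << cpu;
                }
            }

            if (!was_all_blocked && blocked == all_mask) {
                span_start = points[i].time;          /* the S completing coverage */
            } else if (was_all_blocked && blocked != all_mask) {
                total += points[i].time - span_start; /* the first E breaks it     */
            }
        }
        return total;
    }

For the 3-CPU example above this accumulates exactly the S1..E2 (xxx) span.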


But I'm not sending an RFC yet, because I've run into an issue: the kernel
doesn't inform user space about the page's owner in handle_userfault. So
there is a question to Andrea: is it worth adding such information?
Frankly speaking, I don't know whether current (the task_struct) in
handle_userfault is equal to the mm_struct's owner.

> 
> > Also, I haven't yet finished the IPC to provide such information to the src
> > host, where info_migrate is called.
> 
> Dave
> 
> > 
> > 
> > > 
> > > > > > One more request.
> > > > > > QEMU can use mem_path on hugetlbfs together with the share key
> > > > > > (-object
> > > > > > memory-backend-file,id=mem,size=${mem_size},mem-path=${mem_path},share=on)
> > > > > > and the VM in this case will start and work properly (it will allocate
> > > > > > memory with mmap), but on the destination of a postcopy live migration
> > > > > > the UFFDIO_COPY ioctl will fail for such a region; in Arcangeli's git
> > > > > > tree there is a check preventing it:
> > > > > > if (!vma_is_shmem(dst_vma) && dst_vma->vm_flags & VM_SHARED).
> > > > > > Is it possible to handle such a situation in qemu?
> > > > > 
> > > > > Imagine that you had shared memory; what semantics would you like
> > > > > to see ?  What happens to the other process?
> > > > 
> > > > Honestly, I initially thought about handling such an error, but I quite
> > > > forgot about vhost-user in ovs-dpdk.
> > > 
> > > Yes, I don't know much about vhost-user; but we'll have to think carefully
> > > about the way things behave when they're accessing memory that's shared
> > > with qemu during migration.  Writing to the source after we've started
> > > the postcopy phase is not allowed.  Accessing the destination memory
> > > during postcopy will produce pauses in the other processes accessing it
> > > (I think) and they mustn't do various types of madvise etc - so
> > > I'm sure there will be things we find out the hard way!
> > > 
> > > Dave
> > > 
> > > > > Dave
> > > > > 
> > > > > > On Mon, Feb 06, 2017 at 05:45:30PM +0000, Dr. David Alan Gilbert 
> > > > > > wrote:
> > > > > > > * Dr. David Alan Gilbert (git) (address@hidden) wrote:
> > > > > > > > From: "Dr. David Alan Gilbert" <address@hidden>
> > > > > > > > 
> > > > > > > > Hi,
> > > > > > > >   The existing postcopy code, and the userfault kernel
> > > > > > > > code that supports it, only works for normal anonymous memory.
> > > > > > > > Kernel support for userfault on hugetlbfs is working
> > > > > > > > its way upstream; it's in the linux-mm tree.
> > > > > > > > You can get a version at:
> > > > > > > >    git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
> > > > > > > > on the origin/userfault branch.
> > > > > > > > 
> > > > > > > > Note that while this code supports arbitrary sized hugepages,
> > > > > > > > it doesn't make sense with pages above the few-MB region,
> > > > > > > > so while 2MB is fine, 1GB is probably a bad idea;
> > > > > > > > this code waits for and transmits whole huge pages, and a
> > > > > > > > 1GB page would take about 1 second to transfer over a 10Gbps
> > > > > > > > link - which is way too long to pause the destination for.
> > > > > > > > 
> > > > > > > > Dave
> > > > > > > 
> > > > > > > Oops I missed the v2 changes from the message:
> > > > > > > 
> > > > > > > v2
> > > > > > >   Flip ram-size summary word/compare individual page size patches around
> > > > > > >   Individual page size comparison is done in ram_load if 'advise' has been
> > > > > > >     received rather than checking migrate_postcopy_ram()
> > > > > > >   Moved discard code into exec.c, reworked ram_discard_range
> > > > > > > 
> > > > > > > Dave
> > > > > > 
> > > > > > Thank you; right now it's not necessary to set the
> > > > > > postcopy-ram capability on the destination machine.
> > > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > > Dr. David Alan Gilbert (16):
> > > > > > > >   postcopy: Transmit ram size summary word
> > > > > > > >   postcopy: Transmit and compare individual page sizes
> > > > > > > >   postcopy: Chunk discards for hugepages
> > > > > > > >   exec: ram_block_discard_range
> > > > > > > >   postcopy: enhance ram_block_discard_range for hugepages
> > > > > > > >   Fold postcopy_ram_discard_range into ram_discard_range
> > > > > > > >   postcopy: Record largest page size
> > > > > > > >   postcopy: Plumb pagesize down into place helpers
> > > > > > > >   postcopy: Use temporary for placing zero huge pages
> > > > > > > >   postcopy: Load huge pages in one go
> > > > > > > >   postcopy: Mask fault addresses to huge page boundary
> > > > > > > >   postcopy: Send whole huge pages
> > > > > > > >   postcopy: Allow hugepages
> > > > > > > >   postcopy: Update userfaultfd.h header
> > > > > > > >   postcopy: Check for userfault+hugepage feature
> > > > > > > >   postcopy: Add doc about hugepages and postcopy
> > > > > > > > 
> > > > > > > >  docs/migration.txt                |  13 ++++
> > > > > > > >  exec.c                            |  83 +++++++++++++++++++++++
> > > > > > > >  include/exec/cpu-common.h         |   2 +
> > > > > > > >  include/exec/memory.h             |   1 -
> > > > > > > >  include/migration/migration.h     |   3 +
> > > > > > > >  include/migration/postcopy-ram.h  |  13 ++--
> > > > > > > >  linux-headers/linux/userfaultfd.h |  81 +++++++++++++++++++---
> > > > > > > >  migration/migration.c             |   1 +
> > > > > > > >  migration/postcopy-ram.c          | 138 +++++++++++++++++---------------------
> > > > > > > >  migration/ram.c                   | 109 ++++++++++++++++++------------
> > > > > > > >  migration/savevm.c                |  32 ++++++---
> > > > > > > >  migration/trace-events            |   2 +-
> > > > > > > >  12 files changed, 328 insertions(+), 150 deletions(-)
> > > > > > > > 
> > > > > > > > -- 
> > > > > > > > 2.9.3
> > > > > > > > 
> > > > > > > > 
> > > > > > > --
> > > > > > > Dr. David Alan Gilbert / address@hidden / Manchester, UK
> > > > > > > 
> > > > > --
> > > > > Dr. David Alan Gilbert / address@hidden / Manchester, UK
> > > > > 
> > > > 
> > > > -- 
> > > > 
> > > > BR
> > > > Alexey
> > > --
> > > Dr. David Alan Gilbert / address@hidden / Manchester, UK
> > > 
> > 
> > -- 
> > 
> > BR
> > Alexey
> --
> Dr. David Alan Gilbert / address@hidden / Manchester, UK
>

BR
Alexey


