
From: Chunguang Li
Subject: Re: [Qemu-devel] Migration dirty bitmap: should only mark pages as dirty after they have been sent
Date: Fri, 4 Nov 2016 15:03:27 +0800 (GMT+08:00)



> -----Original Messages-----
> From: "Li, Liang Z" <address@hidden>
> Sent Time: Friday, November 4, 2016
> To: "Chunguang Li" <address@hidden>
> Cc: "Dr. David Alan Gilbert" <address@hidden>, "Amit Shah" <address@hidden>, 
> "address@hidden" <address@hidden>, "address@hidden" <address@hidden>, 
> "address@hidden" <address@hidden>, "address@hidden" <address@hidden>
> Subject: RE: RE: [Qemu-devel] Migration dirty bitmap: should only mark pages 
> as dirty after they have been sent
> 
> > > > > > > > I think this is "very" wasteful. Assume the workload dirties
> > > > > > > > pages randomly within the guest address space, and the transfer
> > > > > > > > speed is constant. Intuitively, I think nearly half of the dirty
> > > > > > > > pages produced in Iteration 1 are not really dirty. This means
> > > > > > > > Iteration 2 takes double the time needed to send only the really
> > > > > > > > dirty pages.
> > > > > > >
> > > > > > > It makes sense, can you get some perf numbers to show what
> > > > > > > kinds of workloads get impacted the most?  That would also
> > > > > > > help us to figure out what kinds of speed improvements we can
> > expect.
> > > > > > >
> > > > > > >
> > > > > > >           Amit
> > > > > >
> > > > > > I picked 6 workloads and collected the following statistics for
> > > > > > every iteration (except the last stop-copy one) during precopy.
> > > > > > These numbers are obtained with basic precopy migration, without
> > > > > > capabilities like xbzrle or compression. The migration network is
> > > > > > exclusive, with a separate network for the workloads. Both are
> > > > > > Gigabit Ethernet. I use qemu-2.5.1.
> > > > > >
> > > > > > Three of them (booting, idle, web server) converged to the
> > > > > > stop-copy phase with the given bandwidth and default downtime
> > > > > > (300ms), while the other three (kernel compilation, zeusmp,
> > > > > > memcached) did not.
> > > > > >
> > > > > > A page is "not-really-dirty" if it is written first and sent later
> > > > > > (and not written again after that) during one iteration. I guess
> > > > > > this would not happen as often during the other iterations as
> > > > > > during the 1st iteration, because all the pages of the VM are sent
> > > > > > to the dest node during the 1st iteration, while during the others
> > > > > > only part of the pages are sent. So I think the "not-really-dirty"
> > > > > > pages are produced mainly during the 1st iteration, and perhaps
> > > > > > very rarely during the other iterations.
> > > > > >
> > > > > > If we could avoid resending the "not-really-dirty" pages,
> > > > > > intuitively, I think the time spent on Iteration 2 would be halved.
> > > > > > This is a chain reaction, because the dirty pages produced during
> > > > > > Iteration 2 are then halved, which in turn halves the time spent on
> > > > > > Iteration 3, then Iteration 4, 5...
> > > > >
> > > > > Yes; these numbers don't show how many of them are false dirty,
> > > > > though.
> > > > >
> > > > > One problem is thinking about pages that have been redirtied: if the
> > > > > page is dirtied after the sync but before the network write, then
> > > > > it's the false-dirty that you're describing.
> > > > >
> > > > > However, if the page is being written a few times, so that it would
> > > > > also have been written after the network write, then it isn't a
> > > > > false-dirty.
> > > > >
> > > > > You might be able to figure that out with some kernel tracing of
> > > > > when the dirtying happens, but it might be easier to write the fix!
> > > > >
> > > > > Dave
> > > >
> > > > Hi, I have made some new progress now.
> > > >
> > > > To tell exactly how many false dirty pages there are in each
> > > > iteration, I malloc a buffer as big as the whole VM memory. When a
> > > > page is transferred to the dest node, it is copied into the buffer.
> > > > During the next iteration, if a page is transferred, it is compared
> > > > to the old copy in the buffer, and the old copy is replaced for the
> > > > next comparison if the page is really dirty. Thus we can get the
> > > > exact number of false dirty pages.
> > > >
> > > > This time, I use 15 workloads to collect the statistics. They are:
> > > >
> > > >   1. 11 benchmarks picked from the cpu2006 benchmark suite. They are
> > > >      all scientific computing workloads like Quantum Chromodynamics,
> > > >      Fluid Dynamics, etc. I pick these 11 benchmarks because, compared
> > > >      to the others, they have larger memory footprints and higher
> > > >      memory dirty rates. Thus most of them could not converge to
> > > >      stop-and-copy at the default migration speed (32MB/s).
> > > >   2. kernel compilation
> > > >   3. idle VM
> > > >   4. Apache web server serving static content
> > > >
> > > >   (the above workloads all run in a VM with 1 vcpu and 1GB memory,
> > > >    and the migration speed is the default 32MB/s)
> > > >
> > > >   5. Memcached. The VM has 6 cpu cores and 6GB memory, and 4GB are
> > > >      used as the cache. After filling up the 4GB cache, a client
> > > >      writes the cache at a constant speed during migration. This time,
> > > >      the migration speed is not limited, and goes up to the capability
> > > >      of the 1Gbps Ethernet.
> > > >
> > > > To summarize the results first (precise numbers below):
> > > >
> > > >   1. 4 of these 15 workloads have a big proportion (>60%, even >80%
> > > >      during some iterations) of false dirty pages out of all the dirty
> > > >      pages since iteration 2 (and the big proportion lasts through the
> > > >      following iterations). They are cpu2006.zeusmp, cpu2006.bzip2,
> > > >      cpu2006.mcf, and memcached.
> > > >   2. 2 workloads (idle, webserver) spend most of the migration time on
> > > >      iteration 1; even though the proportion of false dirty pages is
> > > >      big since iteration 2, the space to optimize is small.
> > > >   3. 1 workload (kernel compilation) only has a big proportion during
> > > >      iteration 2, not in the other iterations.
> > > >   4. 8 workloads (the other 8 benchmarks of cpu2006) have a small
> > > >      proportion of false dirty pages since iteration 2, so the space
> > > >      to optimize for them is small.
> > > >
> > > > Now I want to talk a little more about why false dirty pages are
> > > > produced. The first reason is what we have discussed before: the
> > > > mechanism used to track dirty pages. Then I come up with another
> > > > reason. Here is the situation: a write operation to a memory page
> > > > happens, but it doesn't change any content of the page. So it is
> > > > "written but not dirty", yet the kernel still marks the page as
> > > > dirty. One guy in our lab has done some experiments to figure out
> > > > the proportion of "write but not dirty" operations, using the
> > > > cpu2006 benchmark suite. According to his results, most workloads
> > > > have a small proportion (<10%) of "write but not dirty" out of all
> > > > write operations, while a few workloads have a higher proportion
> > > > (one even as high as 50%). We are not yet sure why "write but not
> > > > dirty" happens; it just does.
> > > >
> > > > So these two reasons contribute to the false dirty pages. To
> > > > optimize, I compute and store the SHA1 hash before transferring each
> > > > page. Next time, if a page needs retransmission, its SHA1 hash is
> > > > computed again and compared to the old hash. If the hash is the
> > > > same, it's a false dirty page and we just skip it; otherwise, the
> > > > page is transferred, and the new hash replaces the old one for the
> > > > next comparison.
> > > > The reason to use a SHA1 hash rather than byte-by-byte comparison
> > > > is the memory overhead. One SHA1 hash is 20 bytes, so we need extra
> > > > memory equal to 20/4096 (<1/200) of the whole VM memory, which is
> > > > relatively small.
> > > > As far as I know, SHA1 hashes are widely used in deduplication for
> > > > backup systems. It has been shown that the probability of a hash
> > > > collision is far smaller than that of a disk hardware fault, so it
> > > > is a secure hash: if the hashes of two chunks are the same, the
> > > > content must be the same. So I think the SHA1 hash can replace
> > > > byte-by-byte comparison in the VM memory scenario.
> > > >
> > > > Then I do the same migration experiments using the SHA1 hash. For
> > > > the 4 workloads with big proportions of false dirty pages, the
> > > > improvement is remarkable. Without the optimization, they either
> > > > cannot converge to stop-and-copy, or take a very long time to
> > > > complete. With the SHA1 hash method, all of them now complete in a
> > > > relatively short time. For the reason discussed above, the other
> > > > workloads don't get notable improvements from the optimization. So
> > > > below, I only show the exact numbers after optimization for the 4
> > > > workloads with remarkable improvements.
> > > >
> > > > Any comments or suggestions?
> > >
> > > Maybe you can compare the performance of your solution with that of
> > XBZRLE to see which one is better.
> > > The merit of using SHA1 is that it avoids the data copy that XBZRLE
> > does, and needs less buffer space.
> > > How about the overhead of calculating the SHA1? Is it faster than
> > copying a page?
> > >
> > > Liang
> > >
> > >
> > 
> > Yes, XBZRLE is able to handle the false dirty pages. However, if we want
> > to avoid transferring all of the false dirty pages using XBZRLE, we need
> > a buffer as big as the whole VM memory, while SHA1 needs a much smaller
> > buffer. Of course, with a buffer as big as the whole VM memory, XBZRLE
> > could transfer less data over the network than SHA1, because XBZRLE is
> > able to compress similar pages. In short, yes, the merit of using SHA1 is
> > that it needs a much smaller buffer, and it leads to a nice improvement
> > when there are many false dirty pages.
> > 
> 
> The current implementation of XBZRLE only begins to buffer pages from the
> second iteration. Maybe it's worth making it start from the first
> iteration, based on your findings.

Yes, I noticed that. If we make it start from the first iteration, I think 
the buffer would have to be large enough to obtain an obvious effect.
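
For clarity, the SHA1-based skipping I described above can be sketched like this (a minimal illustration in Python, not the actual QEMU patch; the names `sent_hashes` and `should_send` are mine, chosen for the example):

```python
import hashlib

PAGE_SIZE = 4096  # guest page size assumed throughout this thread

# Hypothetical per-page table: page index -> 20-byte SHA1 digest of the
# content last sent. Memory cost is 20 bytes per 4096-byte page,
# i.e. < 1/200 of guest RAM, matching the overhead quoted above.
sent_hashes = {}

def should_send(page_index, page_bytes):
    """Return True if the page content changed since its last transfer;
    False means it is a 'false dirty' page that can be skipped."""
    digest = hashlib.sha1(page_bytes).digest()
    if sent_hashes.get(page_index) == digest:
        return False  # identical content: false dirty, skip resending
    sent_hashes[page_index] = digest  # remember for the next comparison
    return True

# Demonstration: first transfer, an unchanged rewrite (false dirty),
# then a real modification.
page = bytes(PAGE_SIZE)
print(should_send(0, page))                 # True: first transfer
print(should_send(0, page))                 # False: false dirty, skipped
print(should_send(0, b"\x01" + page[1:]))   # True: really dirty
```

The same comparison could be done byte-by-byte against a full copy of the page, but then the table would be as big as guest RAM, which is exactly the XBZRLE buffer trade-off discussed here.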

> 
> > In terms of the overhead of calculating the SHA1 compared with
> > transferring a page, it depends on the CPU and network performance. In
> > my test environment (Intel Xeon E5620 @2.4GHz, 1Gbps Ethernet), I didn't
> > observe obvious extra computing overhead from calculating the SHA1,
> > because the network throughput (obtained via "info migrate") remains
> > almost the same.
> 
> You can check the CPU usage, or measure the time spent on a local live
> migration which uses SHA1/XBZRLE.

Yes, I can compare SHA1 with XBZRLE. Maybe I will post the results later.

Chunguang

> 
> Liang
> 
> 
