qemu-devel

Re: [Qemu-devel] Migration dirty bitmap: should only mark pages as dirty after they have been sent


From: Chunguang Li
Subject: Re: [Qemu-devel] Migration dirty bitmap: should only mark pages as dirty after they have been sent
Date: Fri, 4 Nov 2016 11:07:21 +0800 (GMT+08:00)



> -----Original Messages-----
> From: "Li, Liang Z" <address@hidden>
> Sent Time: Thursday, November 3, 2016
> To: "Chunguang Li" <address@hidden>, "Dr. David Alan Gilbert" <address@hidden>
> Cc: "Amit Shah" <address@hidden>, "address@hidden" <address@hidden>, 
> "address@hidden" <address@hidden>, "address@hidden" <address@hidden>, 
> "address@hidden" <address@hidden>
> Subject: RE: [Qemu-devel] Migration dirty bitmap: should only mark pages as 
> dirty after they have been sent
> 
> > > > > > I think this is "very" wasteful. Assume the workload writes the
> > > > > > pages dirty randomly within the guest address space, and the
> > > > > > transfer speed is constant. Intuitively, I think nearly half of the
> > > > > > dirty pages produced in Iteration 1 are not really dirty. This means
> > > > > > the time of Iteration 2 is double that of sending only the really
> > > > > > dirty pages.
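Roughly, under those stated assumptions: if a page's single write lands at a
uniformly random time during Iteration 1 and its transfer slot is also uniform
over the iteration, then P(write happens before the transfer) ~ 1/2, which is
where the "nearly half are not really dirty" estimate comes from.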
> > > > >
> > > > > It makes sense; can you get some perf numbers to show what kinds of
> > > > > workloads get impacted the most?  That would also help us to figure
> > > > > out what kinds of speed improvements we can expect.
> > > > >
> > > > >
> > > > >               Amit
> > > >
> > > > I have picked up 6 workloads and got the following statistics for
> > > > every iteration (except the last stop-copy one) during precopy.
> > > > These numbers are obtained with the basic precopy migration, without
> > > > capabilities like xbzrle or compression, etc. The network for the
> > > > migration is exclusive, with a separate network for the workloads.
> > > > They are both gigabit ethernet. I use qemu-2.5.1.
> > > >
> > > > Three of them (booting, idle, web server) converged to the stop-copy
> > > > phase with the given bandwidth and default downtime (300ms), while the
> > > > other three (kernel compilation, zeusmp, memcached) did not.
> > > >
> > > > One page is "not-really-dirty" if it is written first and sent later
> > > > (and not written again after that) during one iteration. I guess this
> > > > does not happen as often during the other iterations as during the 1st
> > > > iteration, because all the pages of the VM are sent to the dest node
> > > > during the 1st iteration, while during the others only part of the
> > > > pages are sent. So I think the "not-really-dirty" pages are produced
> > > > mainly during the 1st iteration, and very few during the other
> > > > iterations.
> > > >
> > > > If we could avoid resending the "not-really-dirty" pages, intuitively,
> > > > I think the time spent on Iteration 2 would be halved. This is a chain
> > > > reaction, because the dirty pages produced during Iteration 2 are then
> > > > halved as well, which halves the time spent on Iteration 3, then
> > > > Iteration 4, 5...
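In formula form, with a constant dirty rate r (pages/s) and a constant send
rate b (pages/s), the iteration times obey T_{k+1} = (r/b) * T_k; so halving
T_2 also halves T_3, T_4, and every later iteration by the same factor.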
> > >
> > > Yes; these numbers don't show how many of them are false dirty though.
> > >
> > > One problem is thinking about pages that have been redirtied: if the
> > > page is dirtied after the sync but before the network write, then it's
> > > the false-dirty that you're describing.
> > >
> > > However, if the page is being written a few times, and so it would have
> > > been written after the network write, then it isn't a false-dirty.
> > >
> > > You might be able to figure that out with some kernel tracing of when
> > > the dirtying happens, but it might be easier to write the fix!
> > >
> > > Dave
> > 
> > Hi, I have made some new progress now.
> >
> > To tell exactly how many false dirty pages there are in each iteration,
> > I malloc a buffer in memory as big as the whole VM memory. When a page
> > is transferred to the dest node, it is copied into the buffer; during
> > the next iteration, if a page is transferred, it is compared to the old
> > copy in the buffer, and the old copy is replaced for the next comparison
> > if the page is really dirty. Thus, we are now able to get the exact
> > number of false dirty pages.
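
A minimal sketch of that measurement (shadow_buf plays the role of the
malloc'ed buffer described above; the names are hypothetical, not actual
QEMU code):

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    static uint8_t *shadow_buf;         /* as large as the whole VM memory */
    static uint64_t false_dirty_pages;  /* marked dirty but content unchanged */

    /* Called for every page about to be (re)sent; returns true if the page
     * really changed since the copy that was sent last time. */
    static bool page_really_dirty(uint64_t page_index, const uint8_t *page)
    {
        uint8_t *old = shadow_buf + page_index * PAGE_SIZE;

        if (memcmp(old, page, PAGE_SIZE) == 0) {
            false_dirty_pages++;        /* false dirty: only count it here */
            return false;
        }
        memcpy(old, page, PAGE_SIZE);   /* keep the copy for the next round */
        return true;
    }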
> > 
> > This time, I use 15 workloads to get the statistics. They are:
> >
> >   1. 11 benchmarks picked from the cpu2006 benchmark suite. They are all
> >      scientific computing workloads like Quantum Chromodynamics, Fluid
> >      Dynamics, etc. I picked these 11 benchmarks because, compared to the
> >      others, they have bigger memory occupation and higher memory dirty
> >      rates; thus most of them could not converge to stop-and-copy at the
> >      default migration speed (32MB/s).
> >   2. kernel compilation
> >   3. idle VM
> >   4. Apache web server which serves static content
> >
> >   (The above workloads all run in a VM with 1 vcpu and 1GB memory, and
> >    the migration speed is the default 32MB/s.)
> >
> >   5. Memcached. The VM has 6 cpu cores and 6GB memory, and 4GB are used
> >      as the cache. After filling up the 4GB cache, a client writes the
> >      cache at a constant speed during migration. This time, the migration
> >      speed has no limit, and is up to the capability of 1Gbps Ethernet.
> > 
> > To summarize the results first (the precise numbers are below):
> >
> >   1. 4 of these 15 workloads have a big proportion (>60%, even >80% during
> >      some iterations) of false dirty pages out of all the dirty pages from
> >      iteration 2 onward (and the big proportion lasts during the following
> >      iterations). They are cpu2006.zeusmp, cpu2006.bzip2, cpu2006.mcf, and
> >      memcached.
> >   2. 2 workloads (idle, webserver) spend most of the migration time on
> >      iteration 1; even though the proportion of false dirty pages is big
> >      from iteration 2 onward, the space to optimize is small.
> >   3. 1 workload (kernel compilation) only has a big proportion during
> >      iteration 2, not in the other iterations.
> >   4. 8 workloads (the other 8 benchmarks of cpu2006) have a small
> >      proportion of false dirty pages from iteration 2 onward, so the
> >      space to optimize for them is small.
> > 
> > Now I want to talk a little more about the reasons why false dirty pages
> > are produced. The first reason is what we have discussed before: the
> > mechanism used to track the dirty pages.
> > Then I came up with another reason. Here is the situation: a write
> > operation to one memory page happens, but it doesn't change any content
> > of the page. So it is a "write but not dirty" operation, and the kernel
> > still marks the page as dirty. One guy in our lab has done some
> > experiments to figure out the proportion of "write but not dirty"
> > operations, using the cpu2006 benchmark suite. According to his results,
> > general workloads have a small proportion (<10%) of "write but not dirty"
> > out of all the write operations, while a few workloads have a higher
> > proportion (one even as high as 50%). We are not yet sure why "write but
> > not dirty" happens; it just does.
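
For illustration only (hypothetical, not from the experiments above): a store
that rewrites a value the page already holds leaves the content unchanged,
but write-granularity dirty tracking still flags the page.

    #include <stdint.h>

    /* The page content is identical before and after, yet the write fault /
     * dirty log marks the whole page dirty, producing a false dirty page. */
    static void rewrite_same_value(volatile uint8_t *guest_page)
    {
        uint8_t v = guest_page[0];
        guest_page[0] = v;
    }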
> > 
> > So these two reasons contribute to the false dirty pages. To optimize, I
> > compute and store the SHA1 hash of each page before transferring it. Next
> > time, if a page needs retransmission, its SHA1 hash is computed again and
> > compared to the old hash. If the hashes are the same, it is a false dirty
> > page and we just skip it; otherwise, the page is transferred and the new
> > hash replaces the old one for the next comparison.
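
A sketch of that per-page hash bookkeeping, using glib's GChecksum since QEMU
already links against glib (sha1_table, page_index, etc. are hypothetical
names, not actual QEMU code):

    #include <glib.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE 4096
    #define SHA1_LEN  20                /* bytes per digest */

    static uint8_t *sha1_table;         /* SHA1_LEN bytes per guest page */

    static void page_sha1(const uint8_t *page, uint8_t *digest)
    {
        GChecksum *sum = g_checksum_new(G_CHECKSUM_SHA1);
        gsize len = SHA1_LEN;

        g_checksum_update(sum, page, PAGE_SIZE);
        g_checksum_get_digest(sum, digest, &len);
        g_checksum_free(sum);
    }

    /* Returns true if the page must be sent, false if it is a false dirty
     * page whose retransmission can be skipped. */
    static bool page_changed_since_last_send(uint64_t page_index,
                                             const uint8_t *page)
    {
        uint8_t digest[SHA1_LEN];
        uint8_t *old = sha1_table + page_index * SHA1_LEN;

        page_sha1(page, digest);
        if (memcmp(old, digest, SHA1_LEN) == 0) {
            return false;               /* same hash: skip this page */
        }
        memcpy(old, digest, SHA1_LEN);  /* remember the hash just sent */
        return true;
    }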
> > The reason to use a SHA1 hash rather than byte-by-byte comparison is the
> > memory overhead. One SHA1 hash is 20 bytes, so we need extra memory of
> > about 20/4096 (<1/200) of the whole VM memory, which is relatively small.
> > As far as I know, SHA1 hashes are widely used for deduplication in backup
> > systems. There it has been argued that the probability of a hash collision
> > is far smaller than that of a disk hardware fault, so the hash is treated
> > as "secure": if the hashes of two chunks are the same, the content is
> > assumed to be the same. So I think the SHA1 hash can replace byte-by-byte
> > comparison in the VM memory scenario as well.
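
Concretely, for the 1GB guests used above: 1 GB / 4 KB = 262,144 pages, and
262,144 * 20 B of digests is about 5 MB, i.e. roughly 0.5% of guest RAM.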
> > 
> > Then I did the same migration experiments using the SHA1 hash. For the 4
> > workloads which have big proportions of false dirty pages, the improvement
> > is remarkable. Without the optimization, they either cannot converge to
> > stop-and-copy, or take a very long time to complete. With the SHA1 hash
> > method, all of them now complete in a relatively short time.
> > For the reasons discussed above, the other workloads don't get notable
> > improvements from the optimization. So below, I only show the exact
> > numbers after optimization for the 4 workloads with remarkable
> > improvements.
> > 
> > Any comments or suggestions?
> 
> Maybe you can compare the performance of your solution with that of XBZRLE
> to see which one is better.
> The merit of using SHA1 is that it can avoid the data copy that XBZRLE
> does, and needs a smaller buffer.
> How about the overhead of calculating the SHA1? Is it faster than copying
> a page?
> 
> Liang
> 
> 

Yes, XBZRLE is able to handle the false dirty pages. However, if we want to
avoid transferring all of the false dirty pages using XBZRLE, we need a cache
as big as the whole VM memory, while SHA1 needs a much smaller buffer. Of
course, if we had a buffer as big as the whole VM memory, XBZRLE could
transfer less data over the network than SHA1, because XBZRLE is able to
compress similar pages. In short, the merit of using SHA1 is that it needs
much less buffer, and it gives a nice improvement if there are many false
dirty pages.
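
As a rough size comparison for the memcached guest above (6GB of RAM, 4KB
pages): covering every page with XBZRLE would need a cache of about 6GB,
whereas the SHA1 table needs 6 GB / 4 KB * 20 B ~ 30 MB.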

In terms of the overhead of calculating the SHA1 compared with transferring a
page, it depends on the CPU and network performance. In my test environment
(Intel Xeon E5620 @ 2.4GHz, 1Gbps Ethernet), I didn't observe obvious extra
computing overhead caused by calculating the SHA1, because the throughput of
the network (obtained via "info migrate") remains almost the same.
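
A quick stand-alone micro-benchmark sketch for that question (hypothetical,
not part of the measurements above; results depend on the CPU and on compiler
optimization, which may partially elide the copy loop):

    #include <glib.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define PAGE_SIZE 4096
    #define ITERS     100000

    int main(void)
    {
        static uint8_t src[PAGE_SIZE], dst[PAGE_SIZE];
        uint8_t digest[20];
        gsize dlen;
        gint64 t0, t1, t2;
        int i;

        t0 = g_get_monotonic_time();
        for (i = 0; i < ITERS; i++) {
            memcpy(dst, src, PAGE_SIZE);                /* copy one page   */
        }
        t1 = g_get_monotonic_time();
        for (i = 0; i < ITERS; i++) {
            GChecksum *sum = g_checksum_new(G_CHECKSUM_SHA1);
            dlen = sizeof(digest);
            g_checksum_update(sum, src, PAGE_SIZE);     /* hash one page   */
            g_checksum_get_digest(sum, digest, &dlen);
            g_checksum_free(sum);
        }
        t2 = g_get_monotonic_time();

        printf("memcpy: %lld us, sha1: %lld us for %d pages (dst[0]=%u)\n",
               (long long)(t1 - t0), (long long)(t2 - t1), ITERS, dst[0]);
        return 0;
    }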

--
Chunguang Li, Ph.D. Candidate
Wuhan National Laboratory for Optoelectronics (WNLO)
Huazhong University of Science & Technology (HUST)
Wuhan, Hubei Prov., China





