Re: [Qemu-devel] Migration dirty bitmap: should only mark pages as dirty after they have been sent


From: Chunguang Li
Subject: Re: [Qemu-devel] Migration dirty bitmap: should only mark pages as dirty after they have been sent
Date: Tue, 8 Nov 2016 13:27:01 +0800 (GMT+08:00)



> -----Original Messages-----
> From: "Li, Liang Z" <address@hidden>
> Sent Time: Monday, November 7, 2016
> To: "Chunguang Li" <address@hidden>
> Cc: "Dr. David Alan Gilbert" <address@hidden>, "Amit Shah" <address@hidden>, 
> "address@hidden" <address@hidden>, "address@hidden" <address@hidden>, 
> "address@hidden" <address@hidden>, "address@hidden" <address@hidden>
> Subject: RE: [Qemu-devel] Migration dirty bitmap: should only mark pages as 
> dirty after they have been sent
> 
> > > > > > > > > > I think this is "very" wasteful. Assume the workload writes the pages
> > > > > > > > > > dirty randomly within the guest address space, and the transfer speed
> > > > > > > > > > is constant. Intuitively, I think nearly half of the dirty pages
> > > > > > > > > > produced in Iteration 1 are not really dirty. This means the time of
> > > > > > > > > > Iteration 2 is double what it would take to send only the really dirty
> > > > > > > > > > pages.
> > > > > > > > >
> > > > > > > > > It makes sense; can you get some perf numbers to show what kinds of
> > > > > > > > > workloads get impacted the most?  That would also help us figure out
> > > > > > > > > what kinds of speed improvements we can expect.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >               Amit
> > > > > > > >
> > > > > > > > I have picked 6 workloads and got the following statistics for every
> > > > > > > > iteration (except the last stop-copy one) during precopy. These numbers
> > > > > > > > are obtained with basic precopy migration, without capabilities like
> > > > > > > > xbzrle or compression. The network for the migration is exclusive, with
> > > > > > > > a separate network for the workloads; both are gigabit ethernet. I use
> > > > > > > > qemu-2.5.1.
> > > > > > > >
> > > > > > > > Three of them (booting, idle, web server) converged to the stop-copy
> > > > > > > > phase with the given bandwidth and default downtime (300ms), while the
> > > > > > > > other three (kernel compilation, zeusmp, memcached) did not.
> > > > > > > >
> > > > > > > > A page is "not-really-dirty" if it is written first and sent later (and
> > > > > > > > not written again after that) during one iteration. I guess this does
> > > > > > > > not happen as often during the other iterations as during the 1st
> > > > > > > > iteration, because all the pages of the VM are sent to the dest node
> > > > > > > > during the 1st iteration, while during the others only part of the
> > > > > > > > pages are sent. So I think the "not-really-dirty" pages are produced
> > > > > > > > mainly during the 1st iteration, and very few during the other
> > > > > > > > iterations.
> > > > > > > >
> > > > > > > > If we could avoid resending the "not-really-dirty" pages, intuitively,
> > > > > > > > I think the time spent on Iteration 2 would be halved. This is a chain
> > > > > > > > reaction: because the dirty pages produced during Iteration 2 are
> > > > > > > > halved, the time spent on Iteration 3 is also halved, then Iteration 4,
> > > > > > > > 5...
> > > > > > >
> > > > > > > Yes; these numbers don't show how many of them are false dirty though.
> > > > > > >
> > > > > > > One problem is thinking about pages that have been redirtied: if the
> > > > > > > page is dirtied after the sync but before the network write, then it's
> > > > > > > the false dirty that you're describing.
> > > > > > >
> > > > > > > However, if the page is being written a few times, so that it would also
> > > > > > > have been written after the network write, then it isn't a false dirty.
> > > > > > >
> > > > > > > You might be able to figure that out with some kernel tracing of when
> > > > > > > the dirtying happens, but it might be easier to write the fix!
> > > > > > >
> > > > > > > Dave
> > > > > >
> > > > > > Hi, I have made some new progress now.
> > > > > >
> > > > > > To tell exactly how many false dirty pages there are in each iteration, I
> > > > > > malloc a buffer as big as the whole VM memory. When a page is transferred
> > > > > > to the dest node, it is copied to the buffer; during the next iteration,
> > > > > > when a page is about to be transferred, it is compared to the old copy in
> > > > > > the buffer, and the old copy is replaced for the next comparison if the
> > > > > > page is really dirty. Thus, we are now able to get the exact number of
> > > > > > false dirty pages.
> > > > > >
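(To make the shadow-buffer accounting described above concrete, here is a minimal C sketch; the buffer, the counters and the function names are illustrative only, not the code of the actual instrumentation, and the shadow buffer is assumed to be allocated elsewhere as large as guest RAM:)

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    static uint8_t *shadow;        /* shadow copy of guest RAM, holding each page
                                      as it was last sent (allocation not shown) */
    static uint64_t false_dirty;   /* pages marked dirty but with unchanged content */
    static uint64_t really_dirty;

    /* Called for every page the migration loop is about to (re)send.
     * first_pass is true during iteration 1, when every page is sent. */
    static bool account_page(uint64_t pfn, const uint8_t *page, bool first_pass)
    {
        uint8_t *old = shadow + pfn * PAGE_SIZE;

        if (!first_pass && memcmp(old, page, PAGE_SIZE) == 0) {
            false_dirty++;              /* dirty bit set, but content unchanged */
            return false;
        }
        memcpy(old, page, PAGE_SIZE);   /* remember what was sent this time */
        really_dirty++;
        return true;
    }
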
> > > > > > This time, I use 15 workloads to get the statistics. They are:
> > > > > >
> > > > > >   1. 11 benchmarks picked from the cpu2006 benchmark suite. They are all
> > > > > >      scientific computing workloads such as Quantum Chromodynamics, Fluid
> > > > > >      Dynamics, etc. I pick these 11 benchmarks because, compared to the
> > > > > >      others, they have a bigger memory footprint and a higher memory dirty
> > > > > >      rate, so most of them could not converge to stop-and-copy at the
> > > > > >      default migration speed (32MB/s).
> > > > > >   2. kernel compilation
> > > > > >   3. idle VM
> > > > > >   4. Apache web server serving static content
> > > > > >
> > > > > >   (the above workloads all run in a VM with 1 vcpu and 1GB memory, and
> > > > > >    the migration speed is the default 32MB/s)
> > > > > >
> > > > > >   5. Memcached. The VM has 6 cpu cores and 6GB memory, and 4GB are used
> > > > > >      as the cache. After filling up the 4GB cache, a client writes the
> > > > > >      cache at a constant speed during migration. This time, the migration
> > > > > >      speed has no limit and is up to the capability of 1Gbps Ethernet.
> > > > > >
> > > > > > To summarize the results first (the precise numbers are below):
> > > > > >
> > > > > >   1. 4 of these 15 workloads have a big proportion (>60%, even >80% during
> > > > > >      some iterations) of false dirty pages out of all the dirty pages from
> > > > > >      iteration 2 on (and the big proportion lasts through the following
> > > > > >      iterations). They are cpu2006.zeusmp, cpu2006.bzip2, cpu2006.mcf,
> > > > > >      and memcached.
> > > > > >   2. 2 workloads (idle, webserver) spend most of the migration time on
> > > > > >      iteration 1; even though the proportion of false dirty pages is big
> > > > > >      from iteration 2 on, the room for optimization is small.
> > > > > >   3. 1 workload (kernel compilation) only has a big proportion during
> > > > > >      iteration 2, not in the other iterations.
> > > > > >   4. 8 workloads (the other 8 benchmarks of cpu2006) have a small
> > > > > >      proportion of false dirty pages from iteration 2 on, so the room for
> > > > > >      optimization for them is small.
> > > > > >
> > > > > > Now I want to talk a little more about the reasons why false dirty pages
> > > > > > are produced. The first reason is what we have discussed before---the
> > > > > > mechanism used to track dirty pages. Then I came across another reason.
> > > > > > Here is the situation: a write operation to a memory page happens, but it
> > > > > > doesn't change any content of the page. So it is "write but not dirty",
> > > > > > yet the kernel still marks the page as dirty. A colleague in our lab has
> > > > > > done some experiments with the cpu2006 benchmark suite to figure out the
> > > > > > proportion of "write but not dirty" operations. According to his results,
> > > > > > general workloads have a small proportion (<10%) of "write but not dirty"
> > > > > > out of all write operations, while a few workloads have a higher
> > > > > > proportion (one even as high as 50%). We are not yet sure why "write but
> > > > > > not dirty" happens; it just does.
> > > > > >
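(A trivial illustration of "write but not dirty", assuming the usual page-granularity dirty logging: the store below rewrites the value that is already there, so the page content never changes, but the page is still flagged dirty because dirty tracking records write accesses, not content changes:)

    /* The page content is unchanged, yet the write access alone is enough to
     * set the page's bit in the hypervisor's dirty log. */
    void write_but_not_dirty(volatile unsigned char *page)
    {
        unsigned char v = page[0];
        page[0] = v;    /* same value written back -> page marked dirty */
    }
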
> > > > > > So these two reasons contribute to the false dirty pages. To optimize, I
> > > > > > compute and store the SHA1 hash of each page before transferring it. Next
> > > > > > time, if a page needs retransmission, its SHA1 hash is computed again and
> > > > > > compared to the stored hash. If the hashes are the same, it is a false
> > > > > > dirty page and we just skip it; otherwise, the page is transferred and
> > > > > > the new hash replaces the old one for the next comparison.
> > > > > > The reason to use a SHA1 hash rather than byte-by-byte comparison is the
> > > > > > memory overhead. One SHA1 hash is 20 bytes, so we need extra memory of
> > > > > > only 20/4096 (<1/200) of the whole VM memory, which is relatively small.
> > > > > > As far as I know, SHA1 hashing is widely used for deduplication in backup
> > > > > > systems. It has been shown that the probability of a hash collision is
> > > > > > far smaller than that of a disk hardware fault, so it is treated as a
> > > > > > secure hash, that is, if the hashes of two chunks are the same, the
> > > > > > content is taken to be the same. So I think the SHA1 hash can replace
> > > > > > byte-by-byte comparison in the VM memory scenario.
> > > > > >
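(For reference, a minimal C sketch of the per-page hash check described above, using glib's GChecksum, which QEMU already links against; the per-page hash table and the function names are illustrative, not the actual patch, and the table is assumed to be allocated with one entry per guest page:)

    #include <glib.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE 4096
    #define SHA1_LEN  20

    /* One 20-byte digest per guest page, i.e. roughly 20/4096 of guest RAM. */
    static uint8_t (*page_hash)[SHA1_LEN];

    static void sha1_page(const uint8_t *page, uint8_t *digest)
    {
        GChecksum *cs = g_checksum_new(G_CHECKSUM_SHA1);
        gsize len = SHA1_LEN;

        g_checksum_update(cs, page, PAGE_SIZE);
        g_checksum_get_digest(cs, digest, &len);
        g_checksum_free(cs);
    }

    /* Returns true if the page really has to be sent, false if it is a false
     * dirty page whose content matches what the destination already has. */
    static bool page_needs_send(uint64_t pfn, const uint8_t *page, bool first_pass)
    {
        uint8_t digest[SHA1_LEN];

        sha1_page(page, digest);
        if (!first_pass && memcmp(page_hash[pfn], digest, SHA1_LEN) == 0) {
            return false;                         /* skip false dirty page */
        }
        memcpy(page_hash[pfn], digest, SHA1_LEN); /* remember hash of sent page */
        return true;
    }
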
> > > > > > Then I ran the same migration experiments using the SHA1 hash. For the 4
> > > > > > workloads which have big proportions of false dirty pages, the
> > > > > > improvement is remarkable. Without the optimization, they either cannot
> > > > > > converge to stop-and-copy or take a very long time to complete. With the
> > > > > > SHA1 hash method, all of them now complete in a relatively short time.
> > > > > > For the reasons discussed above, the other workloads don't get notable
> > > > > > improvements from the optimization, so below I only show the exact
> > > > > > numbers after optimization for the 4 workloads with remarkable
> > > > > > improvements.
> > > > > >
> > > > > > Any comments or suggestions?
> > > > >
> > > > > Maybe you can compare the performance of your solution with that of XBZRLE
> > > > > to see which one is better.
> > > > > The merit of using SHA1 is that it can avoid the data copy done in XBZRLE,
> > > > > and needs less buffer.
> > > > > How about the overhead of calculating the SHA1? Is it faster than copying
> > > > > a page?
> > > > >
> > > > > Liang
> > > > >
> > > > >
> > > >
> > > > Yes, XBZRLE is able to handle the false dirty pages. However, if we want to
> > > > avoid transferring all of the false dirty pages using XBZRLE, we need a
> > > > buffer as big as the whole VM memory, while SHA1 needs a much smaller
> > > > buffer. Of course, with a buffer as big as the whole VM memory, XBZRLE
> > > > could transfer less data over the network than SHA1, because XBZRLE is able
> > > > to compress similar pages. In a word, yes, the merit of using SHA1 is that
> > > > it needs a much smaller buffer, and it leads to a nice improvement if there
> > > > are many false dirty pages.
> > > >
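(To put numbers on that buffer-size difference: for the 6GB memcached guest, an XBZRLE cache covering all of RAM needs 6GB on the source, while a 20-byte SHA1 digest per 4096-byte page needs about 6GB * 20 / 4096 = 30MB, i.e. less than 1/200 of guest RAM.)
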
> > >
> > > The current implementation of XBZRLE begins to buffer pages from the second
> > > iteration. Maybe it's worth making it start to work from the first iteration,
> > > based on your finding.
> > >
> > > > In terms of the overhead of calculating the SHA1 compared with
> > > > transferring a page, it's related to the CPU and network
> > > > performance. In my test environment(Intel Xeon
> > > > E5620 @2.4GHz, 1Gbps Ethernet), I didn't observe obvious extra
> > > > computing overhead caused by calculating the SHA1, because the
> > > > throughput of network (got by "info migrate") remains almost the same.
> > >
> > > You can check the CPU usage, or measure the time spent on a local live
> > > migration which uses SHA1/XBZRLE.
> > >
> > > Liang
> > >
> > >
> > 
> > I compared SHA1 with XBZRLE. I use XBZRLE in two ways:
> > 1. beginning to buffer pages from iteration 1;
> > 2. as in the current implementation, beginning to buffer pages from iteration 2.
> > 
> > I post the results of three workloads: cpu2006.zeusmp, cpu2006.mcf, memcached.
> > I set the cache size to 256MB for zeusmp & mcf (they run in a VM with 1GB RAM),
> > and to 1GB for memcached (it runs in a VM with 6GB RAM, and memcached takes 4GB
> > as cache).
> > 
> > As you can read from the data below, beginning to buffer pages from iteration 1
> > is better than the current implementation (from iteration 2), because the total
> > migration time is shorter.
> > 
> > SHA1 is better than XBZRLE with the cache sizes I chose, because it leads to a
> > shorter migration time and consumes far less memory (<1/200 of the total VM
> > memory).
> > 
> 
> Hi Chunguang,
> 
> Have you tried using a large XBZRLE cache size equal to the guest's RAM size?
> Is SHA1 faster in that case?
> 
> Thanks!
> Liang

You can check the data below. For zeusmp and mcf, when the XBZRLE cache size
equals the guest's RAM size (in fact, the 1024MB cache is a little smaller than
the RAM size, because the guest has a little extra RAM besides the 1GB we set),
XBZRLE is faster than SHA1.

For memcached, I am not able to set the cache size to the 6GB RAM size, because
the cache size has to be a power of 2, and I am not able to set it larger than
the RAM size, because the current implementation doesn't allow that. So I set
the cache size to 4GB, and XBZRLE with this cache size is almost the same as
SHA1 in terms of migration time.

Note that XBZRLE begins to buffer pages from iteration 1.

zeusmp 1024MB cache

Iteration   1, duration:  21604 ms , transferred pages:   266450 (dup:    89509, n:   176941, x:        0) , new dirty pages:   129647 , remaining dirty pages:   129647
Iteration   2, duration:    652 ms , transferred pages:    89270 (dup:    78176, n:     1085, x:    10009) , new dirty pages:    46438 , remaining dirty pages:    46438
Iteration   3, duration:    400 ms , transferred pages:    35789 (dup:    30536, n:        0, x:     5253) , new dirty pages:    33569 , remaining dirty pages:    33569
Iteration   4, duration:    470 ms , transferred pages:    19106 (dup:    10317, n:       75, x:     8714) , new dirty pages:    39307 , remaining dirty pages:    39307
Iteration   5, duration:     72 ms , transferred pages:    17853 (dup:    15904, n:        0, x:     1949) , new dirty pages:     4078 , remaining dirty pages:     4078
Iteration   6, duration:     10 ms , transferred pages:     3280 (dup:     2910, n:        0, x:      370) , new dirty pages:      521 , remaining dirty pages:      521
Iteration   7, duration:    254 ms , transferred pages:        0 (dup:        0, n:        0, x:        0) , new dirty pages:        0 , remaining dirty pages:      521
total time: 23481 milliseconds  (v.s. 27225 milliseconds for SHA1)

mcf 1024MB cache

Iteration   1, duration:  31704 ms , transferred pages:   266450 (dup:     6794, n:   259656, x:        0) , new dirty pages:   233250 , remaining dirty pages:   233250
Iteration   2, duration:    544 ms , transferred pages:    34186 (dup:      182, n:      423, x:    33581) , new dirty pages:    32757 , remaining dirty pages:    32757
Iteration   3, duration:     67 ms , transferred pages:     8536 (dup:        0, n:        0, x:     8536) , new dirty pages:     5305 , remaining dirty pages:     5305
Iteration   4, duration:     13 ms , transferred pages:     2125 (dup:        0, n:        0, x:     2125) , new dirty pages:     1632 , remaining dirty pages:     1632
Iteration   5, duration:      9 ms , transferred pages:     1038 (dup:        0, n:        0, x:     1038) , new dirty pages:     1095 , remaining dirty pages:     1095
Iteration   6, duration:      3 ms , transferred pages:      592 (dup:        0, n:        0, x:      592) , new dirty pages:     1148 , remaining dirty pages:     1148
Iteration   7, duration:      2 ms , transferred pages:      136 (dup:        0, n:        0, x:      136) , new dirty pages:     1123 , remaining dirty pages:     1123
Iteration   8, duration:      2 ms , transferred pages:        2 (dup:        0, n:        0, x:        2) , new dirty pages:      985 , remaining dirty pages:      985
Iteration   9, duration:      2 ms , transferred pages:       14 (dup:        0, n:        0, x:       14) , new dirty pages:      640 , remaining dirty pages:      640
Iteration  10, duration:      2 ms , transferred pages:       16 (dup:        0, n:        0, x:       16) , new dirty pages:      622 , remaining dirty pages:      622
Iteration  11, duration:      1 ms , transferred pages:        1 (dup:        0, n:        0, x:        1) , new dirty pages:      693 , remaining dirty pages:      693
Iteration  12, duration:      1 ms , transferred pages:      122 (dup:        0, n:        0, x:      122) , new dirty pages:      639 , remaining dirty pages:      639
Iteration  13, duration:      2 ms , transferred pages:      475 (dup:        0, n:        0, x:      475) , new dirty pages:      522 , remaining dirty pages:      522
Iteration  14, duration:     22 ms , transferred pages:        0 (dup:        0, n:        0, x:        0) , new dirty pages:       27 , remaining dirty pages:      549
total time: 32393 milliseconds  (v.s. 97919 milliseconds for SHA1)

memcached 4096MB cache

Iteration   1, duration:  41025 ms , transferred pages:  1569059 (dup:   395085, n:  1173974, x:        0) , new dirty pages:   560788 , remaining dirty pages:   568899
Iteration   2, duration:   8218 ms , transferred pages:   300889 (dup:     3963, n:   142928, x:   153998) , new dirty pages:   158832 , remaining dirty pages:   167022
Iteration   3, duration:   2408 ms , transferred pages:    98923 (dup:      285, n:    33854, x:    64784) , new dirty pages:    68647 , remaining dirty pages:    77338
Iteration   4, duration:    869 ms , transferred pages:    43408 (dup:       64, n:    17911, x:    25433) , new dirty pages:    26087 , remaining dirty pages:    33845
Iteration   5, duration:    455 ms , transferred pages:    23048 (dup:       55, n:    10156, x:    12837) , new dirty pages:    15275 , remaining dirty pages:    16636
Iteration   6, duration:    162 ms , transferred pages:     7939 (dup:       55, n:     2425, x:     5459) , new dirty pages:     6009 , remaining dirty pages:    10051
Iteration   7, duration:     52 ms , transferred pages:     5761 (dup:      212, n:      707, x:     4842) , new dirty pages:     2204 , remaining dirty pages:     4027
Iteration   8, duration:      1 ms , transferred pages:        0 (dup:        0, n:        0, x:        0) , new dirty pages:        0 , remaining dirty pages:     4027
total time: 53255 milliseconds  (v.s. 54693 milliseconds for SHA1)

--
Chunguang Li, Ph.D. Candidate
Wuhan National Laboratory for Optoelectronics (WNLO)
Huazhong University of Science & Technology (HUST)
Wuhan, Hubei Prov., China





