Re: [Qemu-devel] [RFC 0/1] Rolling stats on colo


From: Dr. David Alan Gilbert
Subject: Re: [Qemu-devel] [RFC 0/1] Rolling stats on colo
Date: Fri, 6 Mar 2015 18:30:22 +0000
User-agent: Mutt/1.5.23 (2014-03-12)

* zhanghailiang (address@hidden) wrote:
> On 2015/3/5 21:31, Dr. David Alan Gilbert (git) wrote:
> >From: "Dr. David Alan Gilbert" <address@hidden>
> 
> Hi Dave,
> 
> >
> >Hi,
> >   I'm getting COLO running on a couple of our machines here
> >and wanted to see what was actually going on, so I merged
> >in my recent rolling-stats code:
> >
> >http://lists.gnu.org/archive/html/qemu-devel/2015-03/msg00648.html
> >
> >with the following patch, and now I get on the primary side,
> >info migrate shows me:
> >
> >capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off colo: on
> >Migration status: colo
> >total time: 0 milliseconds
> >colo checkpoint (ms): Min/Max: 0, 10000 Mean: -1.1415868e-13 (Weighted: 4.3136025e-158) Count: 4020 Values: address@hidden, address@hidden, address@hidden, address@hidden, address@hidden, address@hidden, address@hidden, address@hidden, address@hidden, address@hidden
> >colo paused time (ms): Min/Max: 55, 2789 Mean: 63.9 (Weighted: 76.243584) Count: 4019 Values: address@hidden, address@hidden, address@hidden, address@hidden, address@hidden, address@hidden, address@hidden, address@hidden, address@hidden, address@hidden
> >colo checkpoint size: Min/Max: 18351, 2.1731606e+08 Mean: 150096.4 (Weighted: 127195.56) Count: 4020 Values: address@hidden, address@hidden, address@hidden, address@hidden, address@hidden, address@hidden, address@hidden, address@hidden, address@hidden, address@hidden
> >
> >which suggests I've got a problem with the packet comparison; but that's
> >a separate issue I'll look at.
> >
> 
> There is an obvious mistake we have made in the proxy: the macro
> 'IPS_UNTRACKED_BIT' in colo-patch-for-kernel.patch should be 14,
> so please fix it before doing the following test. Sorry for this low-grade
> mistake; we should have done a full test before issuing it. ;)

No, that's OK; we all make them.

However, that didn't cure my problem; but after a bit of experimentation
I now have COLO working pretty well; thanks for the help!

   1) I had to disable IPv6 in the guest; it doesn't look like the
   conntrack is coping with ICMPv6, and on our test network
   we're getting a few tens of those each second, so it's constant
   miscompares (they seem to be neighbour broadcasts and multicast
   stuff).

   2) It looks like virtio-net is sending ARPs - possibly every time
   that a snapshot is loaded;  it's not the 'qemu' announce-self code
   (I added some debug there and it's not being called); and ARPs
   cause a miscompare - so you get a continuous stream of miscompares,
   because a miscompare triggers a new snapshot, which sends more ARPs.
   I solved this by switching to e1000.

   3) The other problem with virtio is it's occasionally triggering a
   'virtio: error trying to map MMIO memory' from qemu;  I'm not sure
   why, the state COLO sends over should always be consistent.

   4) With the e1000 setup, connections are generally fairly responsive,
   but sshing into the guest takes *ages* (tens of seconds).  I'm not sure
   why, because a curl to a web server seems OK (less than a second)
   and once the ssh is open it's pretty responsive.

   5) I've seen one instance of:
      'qemu-system-x86_64: block/raw-posix.c:836: handle_aiocb_rw: Assertion `p - buf == aiocb->aio_nbytes' failed.'
      on the primary side.
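
For anyone wanting to confirm which frames are behind the miscompares in
(1) and (2), here is a minimal user-space sketch that sorts a raw Ethernet
frame into ARP, ICMPv6, other IPv6, or anything else. It is not part of the
COLO proxy; the enum, the macro names and classify_frame() are hypothetical
names chosen for illustration. It assumes untagged Ethernet II frames and
only looks at the first IPv6 Next Header, so ICMPv6 carried behind an
extension header (as MLD normally is) is reported as "other IPv6".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical labels for this sketch; not names used by the COLO proxy. */
enum frame_kind {
    FRAME_OTHER = 0,
    FRAME_ARP,
    FRAME_ICMPV6,
    FRAME_IPV6_OTHER,
    FRAME_TOO_SHORT
};

#define SK_ETH_HDR_LEN     14      /* dst MAC + src MAC + EtherType, no VLAN tag */
#define SK_ETHERTYPE_ARP   0x0806
#define SK_ETHERTYPE_IPV6  0x86DD
#define SK_IPV6_HDR_LEN    40      /* fixed IPv6 header */
#define SK_NEXTHDR_ICMPV6  58

/* Classify one raw Ethernet II frame by EtherType and, for IPv6,
 * by the Next Header field of the fixed header. */
enum frame_kind classify_frame(const uint8_t *frame, size_t len)
{
    uint16_t ethertype;

    assert(frame != NULL);
    if (len < SK_ETH_HDR_LEN)
        return FRAME_TOO_SHORT;

    /* EtherType is big-endian at bytes 12..13 of the frame. */
    ethertype = (uint16_t)((frame[12] << 8) | frame[13]);

    if (ethertype == SK_ETHERTYPE_ARP)
        return FRAME_ARP;

    if (ethertype == SK_ETHERTYPE_IPV6) {
        if (len < SK_ETH_HDR_LEN + SK_IPV6_HDR_LEN)
            return FRAME_TOO_SHORT;
        /* Next Header is byte 6 of the fixed IPv6 header. */
        if (frame[SK_ETH_HDR_LEN + 6] == SK_NEXTHDR_ICMPV6)
            return FRAME_ICMPV6;
        return FRAME_IPV6_OTHER;
    }

    return FRAME_OTHER;
}
```

Feeding it the frames from a capture taken on the guest's tap device and
counting the results per kind should show whether ARP or ICMPv6 traffic
lines up with the checkpoint storms.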

Stats for a mostly idle guest are now showing:

colo checkpoint (ms): Min/Max: 0, 10004 Mean: 1592.1 (Weighted: 1806.214) Count: 227 Values: address@hidden, address@hidden, address@hidden, address@hidden, address@hidden, address@hidden, address@hidden, address@hidden, address@hidden, address@hidden
colo paused time (ms): Min/Max: 58, 2975 Mean: 90.3 (Weighted: 94.109752) Count: 227 Values: address@hidden, address@hidden, address@hidden, address@hidden, address@hidden, address@hidden, address@hidden, address@hidden, address@hidden, address@hidden
colo checkpoint size: Min/Max: 212252, 1.9241972e+08 Mean: 5569622.6 (Weighted: 4826386.5) Count: 227 Values: address@hidden, address@hidden, address@hidden, address@hidden, address@hidden, address@hidden, address@hidden, address@hidden, address@hidden, address@hidden

So, one checkpoint every ~1.5 seconds; that's just with an
ssh connected and a script doing a 'curl' to its http server
repeatedly.   Running 'top' on the ssh with a fast refresh
makes the checkpoints come much faster; I guess that's because
the output of top is quite random.
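
The means above can be turned into rates with a little arithmetic. The
sketch below does that; it assumes the 'colo checkpoint (ms)' mean is the
interval between checkpoints (which is how it is read above), that the
paused time falls inside that interval, and that the checkpoint size is in
bytes. None of that is stated by the stats output itself, and the struct
and function names are hypothetical.

```c
#include <assert.h>

/* Derived figures; names are illustrative only. */
struct colo_rates {
    double checkpoints_per_sec;  /* 1000 / mean interval in ms */
    double paused_fraction;      /* share of wall time the guest is paused */
    double bytes_per_sec;        /* mean checkpoint size * checkpoint rate */
};

/* Convert the three means reported by 'info migrate' into rates.
 * Assumes the size is in bytes and the pause lies within the interval. */
struct colo_rates colo_rates_from_means(double mean_interval_ms,
                                        double mean_paused_ms,
                                        double mean_size_bytes)
{
    struct colo_rates r;

    assert(mean_interval_ms > 0.0);
    r.checkpoints_per_sec = 1000.0 / mean_interval_ms;
    r.paused_fraction = mean_paused_ms / mean_interval_ms;
    r.bytes_per_sec = mean_size_bytes * r.checkpoints_per_sec;
    return r;
}
```

With the idle-guest means (1592.1, 90.3, 5569622.6) this comes out at
roughly 0.63 checkpoints per second, the guest paused for a bit under 6%
of the time, and about 3.5 MB/s of checkpoint data under those assumptions.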

> To be honest, the proxy part on github is not integrated; we have cut it down
> just to make it easy to review and understand, so there may be some mistakes.

Yes, that's OK; and I've had a few kernel crashes; normally 
when the qemu crashes, the kernel doesn't really like it;
but that's OK, I'm sure it will get better.

I added the following to make my debug easier; which is how
I found the IPv6 problem.

diff --git a/xt_PMYCOLO.c b/xt_PMYCOLO.c
index 9e50b62..13c0b48 100644
--- a/xt_PMYCOLO.c
+++ b/xt_PMYCOLO.c
@@ -1072,7 +1072,7 @@ resolve_master_ct(struct sk_buff *skb, unsigned int dataoff,
        h = nf_conntrack_find_get(&init_net, NF_CT_DEFAULT_ZONE, &tuple);
 
        if (h == NULL) {
-               pr_dbg("can't find master's ct for slaver packet\n");
+               pr_dbg("can't find master's ct for slaver packet (pf/l3num=%d protonum=%d)\n", l3num, protonum);
                return NULL;
        }
 
@@ -1092,7 +1092,7 @@ nf_conntrack_slaver_in(u_int8_t pf, unsigned int hooknum,
        /* rcu_read_lock()ed by nf_hook_slow */
        l3proto = __nf_ct_l3proto_find(pf);
        if (l3proto->get_l4proto(skb, skb_network_offset(skb), &dataoff, &protonum) <= 0) {
-               pr_dbg("slaver: l3proto not prepared to track yet or error occurred\n");
+               pr_dbg("slaver: l3proto not prepared to track yet or error occurred (pf=%d)\n", pf);
                NF_CT_STAT_INC_ATOMIC(&init_net, error);
                NF_CT_STAT_INC_ATOMIC(&init_net, invalid);
                goto out;
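
The extra debug fields are printed as plain numbers. On Linux the address
family is 2 for IPv4 and 10 for IPv6, and protocol number 58 is ICMPv6. A
small user-space sketch for decoding the values seen in the log follows; it
takes the constants from the system headers rather than hard-coding them,
and the helper names are hypothetical, not anything in xt_PMYCOLO.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/socket.h>   /* AF_INET, AF_INET6 */
#include <netinet/in.h>   /* IPPROTO_* */

/* Name the address family printed as pf / l3num. */
const char *pf_name(int pf)
{
    switch (pf) {
    case AF_INET:  return "IPv4";
    case AF_INET6: return "IPv6";
    default:       return "unknown";
    }
}

/* Name the layer-4 protocol number printed as protonum. */
const char *l4_name(int protonum)
{
    switch (protonum) {
    case IPPROTO_ICMP:   return "ICMP";
    case IPPROTO_TCP:    return "TCP";
    case IPPROTO_UDP:    return "UDP";
    case IPPROTO_ICMPV6: return "ICMPv6";
    default:             return "other";
    }
}

/* True when a logged pair is IPv6 carrying ICMPv6. */
int is_icmpv6(int l3num, int protonum)
{
    return l3num == AF_INET6 && protonum == IPPROTO_ICMPV6;
}

/* Format a logged pair as text, e.g. for annotating dmesg output.
 * Returns the snprintf() result. */
int describe_tuple(int l3num, int protonum, char *out, size_t n)
{
    assert(out != NULL && n > 0);
    return snprintf(out, n, "pf=%s proto=%s",
                    pf_name(l3num), l4_name(protonum));
}
```

A run of lookup failures that all decode to IPv6 with ICMPv6 is the kind of
pattern that points at the IPv6 problem described earlier.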

> 
> Thanks,
> zhanghailiang

Thanks,

Dave
> 
> 
> >Dave
> >
> >Dr. David Alan Gilbert (1):
> >   COLO: Add primary side rolling statistics
> >
> >  hmp.c                         | 12 ++++++++++++
> >  include/migration/migration.h |  3 +++
> >  migration/colo.c              | 15 +++++++++++++++
> >  migration/migration.c         | 30 ++++++++++++++++++++++++++++++
> >  qapi-schema.json              | 11 ++++++++++-
> >  5 files changed, 70 insertions(+), 1 deletion(-)
> >
> 
> 
--
Dr. David Alan Gilbert / address@hidden / Manchester, UK


