Re: [Qemu-devel] [RFC 0/1] Rolling stats on colo
From: Wen Congyang
Subject: Re: [Qemu-devel] [RFC 0/1] Rolling stats on colo
Date: Mon, 9 Mar 2015 10:37:00 +0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.4.0

On 03/07/2015 02:30 AM, Dr. David Alan Gilbert wrote:
> * zhanghailiang (address@hidden) wrote:
>> On 2015/3/5 21:31, Dr. David Alan Gilbert (git) wrote:
>>> From: "Dr. David Alan Gilbert" <address@hidden>
>>
>> Hi Dave,
>>
>>>
>>> Hi,
>>> I'm getting COLO running on a couple of our machines here
>>> and wanted to see what was actually going on, so I merged
>>> in my recent rolling-stats code:
>>>
>>> http://lists.gnu.org/archive/html/qemu-devel/2015-03/msg00648.html
>>>
>>> with the following patch, and now I get on the primary side,
>>> info migrate shows me:
>>>
>>> capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks:
>>> off colo: on
>>> Migration status: colo
>>> total time: 0 milliseconds
>>> colo checkpoint (ms): Min/Max: 0, 10000 Mean: -1.1415868e-13 (Weighted:
>>> 4.3136025e-158) Count: 4020 Values: address@hidden, address@hidden,
>>> address@hidden, address@hidden, address@hidden, address@hidden,
>>> address@hidden, address@hidden, address@hidden, address@hidden
>>> colo paused time (ms): Min/Max: 55, 2789 Mean: 63.9 (Weighted: 76.243584)
>>> Count: 4019 Values: address@hidden, address@hidden, address@hidden,
>>> address@hidden, address@hidden, address@hidden, address@hidden,
>>> address@hidden, address@hidden, address@hidden
>>> colo checkpoint size: Min/Max: 18351, 2.1731606e+08 Mean: 150096.4
>>> (Weighted: 127195.56) Count: 4020 Values: address@hidden, address@hidden,
>>> address@hidden, address@hidden, address@hidden, address@hidden,
>>> address@hidden, address@hidden, address@hidden, address@hidden
>>>
>>> which suggests I've got a problem with the packet comparison; but that's
>>> a separate issue I'll look at.
>>>
>>
>> There is an obvious mistake we made in the proxy: the macro
>> 'IPS_UNTRACKED_BIT' in colo-patch-for-kernel.patch should be 14,
>> so please fix it before running the following test. Sorry for this basic
>> mistake; we should have tested it fully before releasing it. ;)
>
> No, that's OK; we all make them.
>
> However, that didn't cure my problem; but after a bit of experimentation I
> now have COLO working pretty well; thanks for the help!
>
> 1) I had to disable IPv6 in the guest; it doesn't look like the
> conntrack is coping with ICMPv6, and on our test network
> we're getting a few tens of those each second, so it's constant
> miscompares (they seem to be neighbour broadcasts and multicast
> stuff).
>
> 2) It looks like virtio-net is sending ARPs - possibly every time
> that a snapshot is loaded; it's not the 'qemu' announce-self code,
> (I added some debug there and it's not being called); and ARPs
> cause a miscompare - so you get a continuous stream of miscompares
> because a miscompare triggers a new snapshot, that sends more ARPs.
> I solved this by switching to e1000.
>
> 3) The other problem with virtio is it's occasionally triggering a
> 'virtio: error trying to map MMIO memory' from qemu; I'm not sure
> why, the state COLO sends over should always be consistent.
I haven't hit this problem. Can you provide your command line?
Is it the primary or the secondary qemu that reports this error message?
>
> 4) With the e1000 setup; connections are generally fairly responsive,
> but sshing into the guest takes *ages* (tens of seconds). I'm not sure
> why, because a curl to a web server seems OK (less than a second)
> and once the ssh is open it's pretty responsive.
>
> 5) I've seen one instance of:
> 'qemu-system-x86_64: block/raw-posix.c:836: handle_aiocb_rw: Assertion
> `p - buf == aiocb->aio_nbytes' failed.'
> on the primary side.
It is a known bug in quorum. You can try this patch:
http://lists.nongnu.org/archive/html/qemu-devel/2015-01/msg04507.html
Thanks
Wen Congyang
>
> Stats for a mostly idle guest are now showing:
>
> colo checkpoint (ms): Min/Max: 0, 10004 Mean: 1592.1 (Weighted: 1806.214)
> Count: 227 Values: address@hidden, address@hidden, address@hidden,
> address@hidden, address@hidden, address@hidden, address@hidden,
> address@hidden, address@hidden, address@hidden
> colo paused time (ms): Min/Max: 58, 2975 Mean: 90.3 (Weighted: 94.109752)
> Count: 227 Values: address@hidden, address@hidden, address@hidden,
> address@hidden, address@hidden, address@hidden, address@hidden,
> address@hidden, address@hidden, address@hidden
> colo checkpoint size: Min/Max: 212252, 1.9241972e+08 Mean: 5569622.6
> (Weighted: 4826386.5) Count: 227 Values: address@hidden, address@hidden,
> address@hidden, address@hidden, address@hidden, address@hidden,
> address@hidden, address@hidden, address@hidden, address@hidden
>
> So, one checkpoint every ~1.5 seconds; that's just with an
> ssh connected and a script doing a 'curl' to its http
> repeatedly. Running 'top' on the ssh with a fast refresh
> brings the checkpoints much faster; I guess that's because
> the output of top is quite random.
>
>> To be honest, the proxy part on github is not integrated; we cut it down
>> just to make it easier to review and understand, so there may be some mistakes.
>
> Yes, that's OK; and I've had a few kernel crashes; normally
> when the qemu crashes, the kernel doesn't really like it;
> but that's OK, I'm sure it will get better.
>
> I added the following to make my debug easier; which is how
> I found the IPv6 problem.
>
> diff --git a/xt_PMYCOLO.c b/xt_PMYCOLO.c
> index 9e50b62..13c0b48 100644
> --- a/xt_PMYCOLO.c
> +++ b/xt_PMYCOLO.c
> @@ -1072,7 +1072,7 @@ resolve_master_ct(struct sk_buff *skb, unsigned int dataoff,
> h = nf_conntrack_find_get(&init_net, NF_CT_DEFAULT_ZONE, &tuple);
>
> if (h == NULL) {
> - pr_dbg("can't find master's ct for slaver packet\n");
> + pr_dbg("can't find master's ct for slaver packet (pf/l3num=%d protonum=%d)\n", l3num, protonum);
> return NULL;
> }
>
> @@ -1092,7 +1092,7 @@ nf_conntrack_slaver_in(u_int8_t pf, unsigned int hooknum,
> /* rcu_read_lock()ed by nf_hook_slow */
> l3proto = __nf_ct_l3proto_find(pf);
> if (l3proto->get_l4proto(skb, skb_network_offset(skb), &dataoff,
> &protonum) <= 0) {
> - pr_dbg("slaver: l3proto not prepared to track yet or error occurred\n");
> + pr_dbg("slaver: l3proto not prepared to track yet or error occurred (pf=%d)\n", pf);
> NF_CT_STAT_INC_ATOMIC(&init_net, error);
> NF_CT_STAT_INC_ATOMIC(&init_net, invalid);
> goto out;
>
>>
>> Thanks,
>> zhanghailiang
>
> Thanks,
>
> Dave
>>
>>
>>> Dave
>>>
>>> Dr. David Alan Gilbert (1):
>>> COLO: Add primary side rolling statistics
>>>
>>> hmp.c | 12 ++++++++++++
>>> include/migration/migration.h | 3 +++
>>> migration/colo.c | 15 +++++++++++++++
>>> migration/migration.c | 30 ++++++++++++++++++++++++++++++
>>> qapi-schema.json | 11 ++++++++++-
>>> 5 files changed, 70 insertions(+), 1 deletion(-)
>>>
>>
>>
> --
> Dr. David Alan Gilbert / address@hidden / Manchester, UK
> .
>
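
For reference, the rolling statistics quoted above report Min/Max, Mean, a Weighted mean, Count and the ten most recent Values. Below is a minimal sketch in C of how such counters could be accumulated. It is not the code from the rolling-stats patch: the names RollingStats, rolling_stats_init and rolling_stats_add are hypothetical, and the exponential weighting is only an assumption about what "Weighted" means, since the thread does not say how it is computed. The sketch also keeps only the sample values, not whatever is attached to each entry in the Values list.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Hypothetical history depth; the output above lists ten recent values. */
#define ROLLING_HISTORY 10

typedef struct RollingStats {
    double min;
    double max;
    double mean;           /* arithmetic mean of every sample seen so far */
    double weighted_mean;  /* exponentially weighted mean (assumed scheme) */
    double weight;         /* share given to the newest sample, 0 < weight <= 1 */
    unsigned long count;
    double recent[ROLLING_HISTORY];  /* ring buffer of the latest samples */
} RollingStats;

void rolling_stats_init(RollingStats *s, double weight)
{
    size_t i;

    assert(s != NULL);
    assert(weight > 0.0 && weight <= 1.0);

    s->min = DBL_MAX;
    s->max = -DBL_MAX;
    s->mean = 0.0;
    s->weighted_mean = 0.0;
    s->weight = weight;
    s->count = 0;
    for (i = 0; i < ROLLING_HISTORY; i++) {
        s->recent[i] = 0.0;
    }
}

void rolling_stats_add(RollingStats *s, double value)
{
    assert(s != NULL);

    if (value < s->min) {
        s->min = value;
    }
    if (value > s->max) {
        s->max = value;
    }

    /* Overwrite the oldest slot once the ring buffer is full. */
    s->recent[s->count % ROLLING_HISTORY] = value;
    s->count++;

    /* Incremental mean: no running sum is kept, so nothing can overflow. */
    s->mean += (value - s->mean) / (double)s->count;

    /* The first sample seeds the weighted mean; later ones pull it toward
     * the new value by the configured weight. */
    if (s->count == 1) {
        s->weighted_mean = value;
    } else {
        s->weighted_mean += s->weight * (value - s->weighted_mean);
    }
}
```

With a weight of 0.5, feeding the samples 2, 4 and 6 leaves min at 2, max at 6, count at 3, the mean at 4 and the weighted mean at 4.5; the weighted figure sits closer to the most recent sample, which is the point of reporting it next to the plain mean.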