Re: [Qemu-devel] [RFC 0/1] Rolling stats on colo

From:	Wen Congyang
Subject:	Re: [Qemu-devel] [RFC 0/1] Rolling stats on colo
Date:	Mon, 9 Mar 2015 17:01:01 +0800
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.4.0

On 03/09/2015 04:55 PM, Dr. David Alan Gilbert wrote:
> * Wen Congyang (address@hidden) wrote:
>> On 03/07/2015 02:30 AM, Dr. David Alan Gilbert wrote:
>>> * zhanghailiang (address@hidden) wrote:
>>>> On 2015/3/5 21:31, Dr. David Alan Gilbert (git) wrote:
>>>>> From: "Dr. David Alan Gilbert" <address@hidden>
>>>>
>>>> Hi Dave,
>>>>
>>>>>
>>>>> Hi,
>>>>>   I'm getting COLO running on a couple of our machines here
>>>>> and wanted to see what was actually going on, so I merged
>>>>> in my recent rolling-stats code:
>>>>>
>>>>> http://lists.gnu.org/archive/html/qemu-devel/2015-03/msg00648.html
>>>>>
>>>>> with the following patch, and now I get on the primary side,
>>>>> info migrate shows me:
>>>>>
>>>>> capabilities: xbzrle: off rdma-pin-all: off auto-converge: off 
>>>>> zero-blocks: off colo: on
>>>>> Migration status: colo
>>>>> total time: 0 milliseconds
>>>>> colo checkpoint (ms): Min/Max: 0, 10000 Mean: -1.1415868e-13 (Weighted: 
>>>>> 4.3136025e-158) Count: 4020 Values: address@hidden, address@hidden, 
>>>>> address@hidden, address@hidden, address@hidden, address@hidden, 
>>>>> address@hidden, address@hidden, address@hidden, address@hidden
>>>>> colo paused time (ms): Min/Max: 55, 2789 Mean: 63.9 (Weighted: 76.243584) 
>>>>> Count: 4019 Values: address@hidden, address@hidden, address@hidden, 
>>>>> address@hidden, address@hidden, address@hidden, address@hidden, 
>>>>> address@hidden, address@hidden, address@hidden
>>>>> colo checkpoint size: Min/Max: 18351, 2.1731606e+08 Mean: 150096.4 
>>>>> (Weighted: 127195.56) Count: 4020 Values: address@hidden, address@hidden, 
>>>>> address@hidden, address@hidden, address@hidden, address@hidden, 
>>>>> address@hidden, address@hidden, address@hidden, address@hidden
>>>>>
>>>>> which suggests I've got a problem with the packet comparison; but that's
>>>>> a separate issue I'll look at.
>>>>>
>>>>
>>>> We made an obvious mistake in the proxy: the macro
>>>> 'IPS_UNTRACKED_BIT' in colo-patch-for-kernel.patch should be 14,
>>>> so please fix it before running the following test. Sorry for this
>>>> careless mistake; we should have tested it fully before posting it. ;)
>>>
>>> No, that's OK; we all make them.
>>>
>>> However, that didn't cure my problem; but after a bit of experimentation
>>> I now have COLO working pretty well; thanks for the help!
>>>
>>>    1) I had to disable IPv6 in the guest; it doesn't look like the
>>>    conntrack is coping with ICMPv6, and on our test network
>>>    we're getting a few tens of those each second, so it's constant
>>>    miscompares (they seem to be neighbour broadcasts and multicast
>>>    stuff).
>>>
>>>    2) It looks like virtio-net is sending ARPs - possibly every time
>>>    that a snapshot is loaded;  it's not the 'qemu' announce-self code,
>>>    (I added some debug there and it's not being called); and ARPs
>>>    cause a miscompare - so you get a continuous stream of miscompares
>>>    because a miscompare triggers a new snapshot, that sends more ARPs.
>>>    I solved this by switching to e1000.
>>>
>>>    3) The other problem with virtio is it's occasionally triggering a
>>>    'virtio: error trying to map MMIO memory' from qemu;  I'm not sure
>>>    why, the state COLO sends over should always be consistent.
>>
>> I haven't hit this problem. Can you provide your command line?
>> Is it the primary or the secondary qemu that reports this error message?
> 
> It's the secondary;
> 
> ./try/bin/qemu-system-x86_64 -enable-kvm -nographic \
>      -boot c -m 2048 -smp 2 -S \
>      -netdev tap,id=hn0,script=$PWD/ifup-slave,\
> downscript=no,colo_script=$PWD/colo-proxy/colo-proxy-script.sh,colo_nicname=em4 \
>      -device virtio-net-pci,mac=52:54:64:61:05:31,id=net-pci0,netdev=hn0 \
>      -drive driver=blkcolo,export=colo1,backing.file.filename=./Fedora-x86_64-20-20140407-sda.raw,backing.driver=raw,if=virtio \
>      -incoming tcp:0:8888
> 
>>>    4) With the e1000 setup; connections are generally fairly responsive,
>>>    but sshing into the guest takes *ages* (tens of seconds).  I'm not sure
>>>    why, because a curl to a web server seems OK (less than a second)
>>>    and once the ssh is open it's pretty responsive.
>>>
>>>    5) I've seen one instance of:
>>>       'qemu-system-x86_64: block/raw-posix.c:836: handle_aiocb_rw: Assertion `p - buf == aiocb->aio_nbytes' failed.'
>>>       on the primary side.
>>
>> It is a known bug in quorum. You can try this patch:
>> http://lists.nongnu.org/archive/html/qemu-devel/2015-01/msg04507.html
> 
> OK, I'll try it; although I've only hit that bug once.

You can also use qcow2 to avoid this problem.
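
A minimal sketch of the conversion, assuming qemu-img is installed on the
host. The source path is the raw image from the command line quoted above;
the output filename is only an illustration, and the -drive options would
need adjusting to match, which is not shown here:

```shell
#!/bin/sh
# Sketch: convert the raw image from the quoted command line to qcow2.
# The destination filename is an assumption; pick whatever suits your setup.
src=./Fedora-x86_64-20-20140407-sda.raw
dst=./Fedora-x86_64-20-20140407-sda.qcow2

# Build the command once so it can be shown before it is run.
# (Relies on word splitting, so the paths must not contain spaces.)
convert_cmd="qemu-img convert -f raw -O qcow2 $src $dst"
echo "$convert_cmd"

# Only run it when qemu-img is installed and the source image exists.
if command -v qemu-img >/dev/null 2>&1 && [ -f "$src" ]; then
    $convert_cmd && qemu-img info "$dst"
fi
```

The guest should be shut down while the image is converted, and the
original raw file is left untouched, so it is easy to switch back.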

Thanks
Wen Congyang

> 
>>
>> Thanks
>> Wen Congyang
> 
> Thanks for the reply,
> 
> Dave
>>
>>>
>>> Stats for a mostly idle guest are now showing:
>>>
>>> colo checkpoint (ms): Min/Max: 0, 10004 Mean: 1592.1 (Weighted: 1806.214) 
>>> Count: 227 Values: address@hidden, address@hidden, address@hidden, 
>>> address@hidden, address@hidden, address@hidden, address@hidden, 
>>> address@hidden, address@hidden, address@hidden
>>> colo paused time (ms): Min/Max: 58, 2975 Mean: 90.3 (Weighted: 94.109752) 
>>> Count: 227 Values: address@hidden, address@hidden, address@hidden, 
>>> address@hidden, address@hidden, address@hidden, address@hidden, 
>>> address@hidden, address@hidden, address@hidden
>>> colo checkpoint size: Min/Max: 212252, 1.9241972e+08 Mean: 5569622.6 
>>> (Weighted: 4826386.5) Count: 227 Values: address@hidden, address@hidden, 
>>> address@hidden, address@hidden, address@hidden, address@hidden, 
>>> address@hidden, address@hidden, address@hidden, address@hidden
>>>
>>> So, one checkpoint every ~1.5 seconds; that's just with an
>>> ssh connected and a script doing a 'curl' to its HTTP server
>>> repeatedly.   Running 'top' on the ssh with a fast refresh
>>> makes the checkpoints come much faster; I guess that's because
>>> the output of top is quite random.
>>>
>>>> To be honest, the proxy part on GitHub is not integrated; we cut it down
>>>> just to make it easier to review and understand, so there may be some mistakes.
>>>
>>> Yes, that's OK; and I've had a few kernel crashes; normally 
>>> when the qemu crashes, the kernel doesn't really like it;
>>> but that's OK, I'm sure it will get better.
>>>
>>> I added the following to make my debug easier; which is how
>>> I found the IPv6 problem.
>>>
>>> diff --git a/xt_PMYCOLO.c b/xt_PMYCOLO.c
>>> index 9e50b62..13c0b48 100644
>>> --- a/xt_PMYCOLO.c
>>> +++ b/xt_PMYCOLO.c
>>> @@ -1072,7 +1072,7 @@ resolve_master_ct(struct sk_buff *skb, unsigned int dataoff,
>>>         h = nf_conntrack_find_get(&init_net, NF_CT_DEFAULT_ZONE, &tuple);
>>>  
>>>         if (h == NULL) {
>>> -               pr_dbg("can't find master's ct for slaver packet\n");
>>> +               pr_dbg("can't find master's ct for slaver packet (pf/l3num=%d protonum=%d)\n", l3num, protonum);
>>>                 return NULL;
>>>         }
>>>  
>>> @@ -1092,7 +1092,7 @@ nf_conntrack_slaver_in(u_int8_t pf, unsigned int hooknum,
>>>         /* rcu_read_lock()ed by nf_hook_slow */
>>>         l3proto = __nf_ct_l3proto_find(pf);
>>>         if (l3proto->get_l4proto(skb, skb_network_offset(skb), &dataoff, &protonum) <= 0) {
>>> -               pr_dbg("slaver: l3proto not prepared to track yet or error occurred\n");
>>> +               pr_dbg("slaver: l3proto not prepared to track yet or error occurred (pf=%d)\n", pf);
>>>                 NF_CT_STAT_INC_ATOMIC(&init_net, error);
>>>                 NF_CT_STAT_INC_ATOMIC(&init_net, invalid);
>>>                 goto out;
>>>
>>>>
>>>> Thanks,
>>>> zhanghailiang
>>>
>>> Thanks,
>>>
>>> Dave
>>>>
>>>>
>>>>> Dave
>>>>>
>>>>> Dr. David Alan Gilbert (1):
>>>>>   COLO: Add primary side rolling statistics
>>>>>
>>>>>  hmp.c                         | 12 ++++++++++++
>>>>>  include/migration/migration.h |  3 +++
>>>>>  migration/colo.c              | 15 +++++++++++++++
>>>>>  migration/migration.c         | 30 ++++++++++++++++++++++++++++++
>>>>>  qapi-schema.json              | 11 ++++++++++-
>>>>>  5 files changed, 70 insertions(+), 1 deletion(-)
>>>>>
>>>>
>>>>
>>> --
>>> Dr. David Alan Gilbert / address@hidden / Manchester, UK
>>> .
>>>
>>
> --
> Dr. David Alan Gilbert / address@hidden / Manchester, UK
> .
>
