
Re: [Qemu-devel] [RFC] COLO HA Project proposal


From: Dong, Eddie
Subject: Re: [Qemu-devel] [RFC] COLO HA Project proposal
Date: Fri, 4 Jul 2014 15:55:06 +0000

> > Thanks Dave:
> >     Whatever random values/branches/code paths the PVM and SVM may
> > take, it is only a performance issue. COLO never assumes the PVM and
> > SVM have the same internal machine state. From a correctness p.o.v.,
> > as long as the PVM and SVM generate identical responses, we can view
> > the SVM as a valid replica of the PVM, and the SVM can take over when
> > the PVM suffers a hardware failure. We can view the client as talking
> > to the SVM all along, with no notion of the PVM. Of course, if the
> > SVM dies, we can regenerate a copy of the PVM with a new checkpoint
> > too.
> >     The SOCC paper has the detailed recovery model :)
> 
> I've had a read; I think the bit I was asking about was what you
> labelled 'D' in that paper's fig.4 - so I think that does explain it
> for me.

Very good :)
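
To make the model concrete, the decision loop is roughly the following
(illustrative pseudo-C only -- the helper names here are made up, not
the real implementation):

/*
 * COLO model: compare the client-visible output of PVM and SVM;
 * release identical output immediately, force a checkpoint on
 * divergence.
 */
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct packet { size_t len; const unsigned char *data; };

/* Hypothetical helpers assumed to be provided by the HA framework. */
extern struct packet *pvm_next_output(void);
extern struct packet *svm_next_output(void);
extern void release_to_client(struct packet *p);
extern void checkpoint_pvm_to_svm(void);   /* full VM state copy */

static bool outputs_match(const struct packet *a, const struct packet *b)
{
    return a->len == b->len && !memcmp(a->data, b->data, a->len);
}

void colo_compare_loop(void)
{
    for (;;) {
        struct packet *p = pvm_next_output();
        struct packet *s = svm_next_output();

        if (outputs_match(p, s)) {
            /* The SVM is behaving as a valid replica: release now. */
            release_to_client(p);
        } else {
            /* Responses diverged: resynchronise the SVM with a new
             * checkpoint, then release the buffered PVM output. */
            checkpoint_pvm_to_svm();
            release_to_client(p);
        }
    }
}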

> But I also have some more questions:
> 
>   1) 5.3.3 Web server
>     a) Fig 11 shows Remus's performance dropping off with the number
>        of threads - why is that? Is it just an increase in the amount
>        of memory changed in each snapshot?

I didn't dig into the details; we just documented the throughput we
observed. It felt a bit strange to me too - the dirty page set may be
larger than in the small-connection case - but I am not sure, and that
is the data we saw :(

>     b) Are figs 11/12 measured with all of the TCP optimisations
>        shown in fig 13 turned on?

Yes.

> 
>   2) Did you manage to overcome the degradation with newer guest
>      kernels shown in 5.6 - could you just fall back to
>      micro-checkpointing if the guests diverge too quickly?

In general, I would say the COLO performance for these 2 workloads is
pretty good, and I actually didn't include subsection 5.6 initially.
It was the conference shepherd who asked me to add this paragraph to
make the paper more balanced :)

In summary, COLO can achieve very good MP-guest performance compared
with Remus, at the cost of potential optimization/modification effort
in the guest TCP/IP stack. One solution may not work for all
workloads, but it leaves a lot of room for OSVs to provide customized
solutions for specific usages -- which I think is very good for the
open source business model: make money through consulting. Huawei
Technologies Ltd. announced support for COLO in their cloud OS,
probably for a specific usage too.
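
As for the fallback Dave suggests, it could be as simple as
rate-limiting the comparison mode, roughly along these lines (a sketch
of one possible approach only, not anything we have built):

/*
 * Sketch of the suggested fallback: count divergence-forced
 * checkpoints over a sliding window; when they come too fast,
 * per-packet comparison is no longer paying off and the caller can
 * switch to fixed-interval (Remus-style) micro-checkpointing.
 */
#include <stdbool.h>
#include <time.h>

#define DIVERGE_WINDOW_SECS  10
#define DIVERGE_LIMIT        50   /* forced checkpoints per window */

static int forced_checkpoints;
static time_t window_start;

/* Call on every divergence-forced checkpoint. */
bool should_fall_back_to_periodic(void)
{
    time_t now = time(NULL);

    if (now - window_start >= DIVERGE_WINDOW_SECS) {
        window_start = now;          /* start a new window */
        forced_checkpoints = 0;
    }
    return ++forced_checkpoints > DIVERGE_LIMIT;
}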

> 
>   3) Was the link between the two servers for synchronisation a
>      low-latency dedicated connection?

We used a 10 Gbps NIC in the paper, and yes, it was a dedicated link,
but the solution itself doesn't require one.

> 
>   4) Did you try an ftp PUT benchmark using external storage - i.e.
>      one that wouldn't have the local disc synchronisation overhead?

Not yet.
External network shared storage works, but today the performance may
not be that good, because our optimization so far is still very
limited. It is just an initial effort to make the 2 common workloads
happy. We believe there is large room ahead to make the responses of
the TCP/IP stack more predictable. Once the basic COLO stuff is ready
for production and accepted by the industry, we may be able to
influence the TCP community to keep this kind of predictability in
mind for future protocol development, which would greatly help
performance.
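
As one concrete illustration of what "predictable" means here: the TCP
timestamp option stamps a per-host clock into every segment, so the
PVM and SVM emit different bytes on the wire even for identical
payloads. Disabling it inside the guest removes that one divergence
source -- just an example, not necessarily one of the fig 13
optimisations:

/*
 * Disable TCP timestamps in the guest (needs root; same effect as
 * "sysctl -w net.ipv4.tcp_timestamps=0"), removing one per-host
 * source of packet divergence between PVM and SVM.
 */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/sys/net/ipv4/tcp_timestamps", "w");

    if (!f) {
        perror("tcp_timestamps");
        return 1;
    }
    fputs("0\n", f);
    fclose(f);
    return 0;
}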


Thx Eddie


