Re: [Qemu-devel] [PATCH v3 0/4] Curling: KVM Fault Tolerance


From: Jules
Subject: Re: [Qemu-devel] [PATCH v3 0/4] Curling: KVM Fault Tolerance
Date: Wed, 23 Oct 2013 08:08:55 +0800

> On Tue, Oct 15, 2013 at 03:26:19PM +0800, Jules Wang wrote:
> > v2 -> v3:
> > * add documentation of new option in qapi-schema.
> > 
> > * long option name: ft -> fault-tolerant
> > 
> > v1 -> v2:
> > * cmdline: migrate curling:tcp:<address>:<port> 
> >        ->  migrate -f tcp:<address>:<port>
> > 
> > * sender: use QEMU_VM_FILE_MAGIC_FT as the header of the migration
> >           to indicate this is a ft migration.
> > 
> > * receiver: look for the signature: 
> >             QEMU_VM_EOF_MAGIC + QEMU_VM_FILE_MAGIC_FT(64bit total)
> >             which indicates the end of one migration.
> > --
> > Jules Wang (4):
> >   Curling: add doc
> >   Curling: cmdline interface.
> >   Curling: the sender
> >   Curling: the receiver
> 

First of all, thanks for your superb and spot-on comments.

> It would be helpful to clarify the status of Curling in the cover letter
> email so reviewers know what to expect.

OK, but I'm not sure how best to describe the status. Could you please
give me an example?
> 
> This series does not address I/O or failover.  I guess you are aware of
> the missing topics that I mentioned, here are my thoughts on them:
> 
> I/O needs to be held back until the destination host has acknowledged
> receiving the last full migration state.  The outside world cannot
> witness state changes in the guest until the migration state has been
> successfully transferred to the destination host.  Otherwise the guest
> may appear to act incorrectly when resuming execution from the last
> snapshot.
> 
> The time period used by the FT sender thread determines how much latency
> is added to I/O requests.

Yes, the added latency is unavoidable.

I guess you mean the following situation:
If a message 'hello' is sent to a chat room server just a few seconds
before the failover happens, the message may be delivered to the other
participants twice, or it may be lost.

Am I right?
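
To make the scenario concrete, here is a minimal sketch of holding output
back until the destination has acknowledged a checkpoint. It is only an
illustration in Python, not code from these patches; OutputBuffer and its
methods are hypothetical names.

```python
from collections import deque


class OutputBuffer:
    """Hypothetical buffer: output is released only after the checkpoint
    covering it has been acknowledged by the destination host."""

    def __init__(self):
        self.epoch = 0          # id of the checkpoint interval in progress
        self.pending = deque()  # (epoch, packet) pairs not yet visible outside
        self.released = []      # packets the outside world has seen

    def send(self, packet):
        # Guest output is queued, tagged with the current interval.
        self.pending.append((self.epoch, packet))

    def checkpoint(self):
        # Close the current interval and return its id; the caller would
        # transfer the migration state for this interval to the receiver.
        done = self.epoch
        self.epoch += 1
        return done

    def ack(self, epoch):
        # The receiver confirmed the state up to `epoch`: release that output.
        while self.pending and self.pending[0][0] <= epoch:
            self.released.append(self.pending.popleft()[1])
```

With this scheme 'hello' becomes visible only after the checkpoint that
contains it is acknowledged, so a failover cannot resume from a state that
predates output the outside world has already seen.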

> 
> Failover functionality is missing from these patches.  We cannot simply
> start executing on the destination host when the migration connection
> ends.  If the guest disk image is located on shared storage then
> split-brain occurs when a network error terminates the migration
> connection - 

> will both hosts begin accessing the shared disk?

Yes.

I have a simple way to handle that. In short, it relies on a third
point: the gateway.

Both the sender and the receiver check their connectivity to the gateway
every X seconds. Let A and B denote whether the sender and the receiver,
respectively, are connected to the gateway.

When the connection between the sender and the receiver is down,
A && B is false.

If A is false, the VM instance at the sender will be stopped.
If B is false, the VM instance at the receiver will not be started.

a. A false, B false: 0 VMs running
b. A false, B true:  1 VM running
c. A true,  B false: 1 VM running
d. A true,  B true:  1 VM running (normal case)
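
The four cases can be checked with a small sketch. It is only an
illustration in Python, not code from these patches; the function names are
made up. It also relies on the assumption stated above: a broken
sender-receiver link implies that A && B is false, so when both ends reach
the gateway the link is treated as up.

```python
def sender_keeps_running(a: bool) -> bool:
    # If A is false, the VM instance at the sender is stopped.
    return a


def receiver_starts(a: bool, b: bool) -> bool:
    # Assumption: the sender-receiver link counts as up when both ends
    # can reach the gateway (A && B), so no failover is needed then.
    link_up = a and b
    # If B is false, the VM instance at the receiver is not started.
    return b and not link_up


def running_vms(a: bool, b: bool) -> int:
    # Number of VM instances running for a given (A, B) combination.
    return int(sender_keeps_running(a)) + int(receiver_starts(a, b))


if __name__ == "__main__":
    for a in (False, True):
        for b in (False, True):
            print(f"A={a!s:5} B={b!s:5} -> {running_vms(a, b)} VM(s) running")
```

No combination yields two running VMs as long as that assumption holds; if
the link can fail while both ends still reach the gateway, the sketch no
longer protects against split-brain.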

It gets more complicated once we consider the transitions between
these four states.

I suggest adding this feature to libvirt instead of QEMU.


> What is your plan to address these issues?
> 
> Stefan
> 
