From: Dr. David Alan Gilbert
Subject: Re: [Qemu-devel] [RFC PATCH v2 01/12] mc: add documentation for micro-checkpointing
Date: Thu, 20 Feb 2014 16:32:57 +0000
User-agent: Mutt/1.5.21 (2010-09-15)

* Michael R. Hines (address@hidden) wrote:
> On 02/20/2014 06:09 PM, Dr. David Alan Gilbert wrote:
> >* Michael R. Hines (address@hidden) wrote:
> >>On 02/19/2014 07:27 PM, Dr. David Alan Gilbert wrote:
> >>>I was just wondering if a separate 'max buffer size' knob would allow
> >>>you to more reasonably bound memory without setting policy; I don't think
> >>>people like having potentially x2 memory.
> >>Note: Checkpoint memory is not monotonic in this patchset (which
> >>is unique to this implementation). Only if the guest actually dirties
> >>100% of its memory between one checkpoint and the next will
> >>the host experience 2x memory usage for a short period of time.
> >Right, but that doesn't really help - if someone comes along and says
> >'How much memory do I need to be able to run an mc system?' the only
> >safe answer is 2x, otherwise we're adding a reason why the previously
> >stable guest might OOM.
> 
> Yes, exactly. Running MC is expensive and probably always will be,
> to some degree. Saving memory and having 100% fault tolerance are
> sometimes mutually exclusive. Expectations have to be managed here.

I'm happy to use more memory to get FT; all I'm trying to do is see
if it's possible to put a bound lower than 2x on it while still maintaining
full FT, at the expense of performance in the case where it uses
a lot of memory.

> The bottom line is: if you put a *hard* constraint on memory usage,
> what will happen to the guest when that garbage collection you mentioned
> shows up later and runs for several minutes? How about an hour?
> Are we just going to block the guest from being allowed to start a
> checkpoint until the memory usage goes down just for the sake of avoiding
> the 2x memory usage?

Yes, or start the next checkpoint sooner than the usual N milliseconds when
we see the buffer is getting full.
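
Something like this is all I have in mind (purely a sketch; none of these
names exist in the patchset):

    #include <stdint.h>

    /* Hypothetical knobs/state - not the patchset's actual variables. */
    static uint64_t mc_buffer_bytes;               /* bytes staged for the in-progress checkpoint */
    static uint64_t mc_buffer_cap = 256ULL << 20;  /* the "max buffer size" knob, e.g. 256 MB */
    static uint64_t last_checkpoint_ms;
    static uint64_t checkpoint_interval_ms = 100;  /* the usual N milliseconds */

    static void mc_start_checkpoint(void) { /* ... build and transmit the checkpoint ... */ }

    /* Start the next checkpoint when the interval expires OR the staging
     * buffer approaches its cap, whichever happens first. */
    static void mc_maybe_checkpoint(uint64_t now_ms)
    {
        if (now_ms - last_checkpoint_ms >= checkpoint_interval_ms ||
            mc_buffer_bytes >= mc_buffer_cap) {
            mc_start_checkpoint();
            last_checkpoint_ms = now_ms;
        }
    }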

> If you block the guest from being checkpointed,
> then what happens if there is a failure during that extended period?
> We will have saved memory at the expense of availability.

If the active machine fails during this time then the secondary carries
on from its last good snapshot in the knowledge that the active
never finished the new snapshot and so never uncorked its previous packets.

If the secondary machine fails during this time then the active drops
its nascent snapshot and carries on.
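
Roughly, the failover rules I have in mind look like this (hypothetical
names, just to pin down the two branches):

    /* Sketch of the two failure branches described above. */
    enum mc_role { MC_PRIMARY, MC_SECONDARY };

    static void mc_resume_from_last_committed_checkpoint(void) { /* ... */ }
    static void mc_discard_nascent_checkpoint(void)            { /* ... */ }
    static void mc_release_buffered_packets(void)              { /* ... */ }

    static void mc_handle_peer_failure(enum mc_role my_role)
    {
        if (my_role == MC_SECONDARY) {
            /* The active died mid-checkpoint: that checkpoint was never
             * committed, so the active never uncorked the packets that
             * depended on it.  Resuming from the last committed checkpoint
             * is consistent with everything the outside world has seen. */
            mc_resume_from_last_committed_checkpoint();
        } else {
            /* The secondary died: drop the half-built checkpoint, release
             * the buffered packets and carry on unprotected. */
            mc_discard_nascent_checkpoint();
            mc_release_buffered_packets();
        }
    }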

However, what you have made me realise is that I don't have an answer
for the memory usage on the secondary; while the primary can pause
its guest until the secondary acks the checkpoint, the secondary has
to rely on the primary not to send it huge checkpoints.

> The customer that is expecting 100% fault tolerance and the provider
> who is supporting it need to have an understanding that fault tolerance
> is not free and that constraining memory usage will adversely affect
> the VM's ability to be protected.
> 
> Do I understand your expectations correctly? Is fault tolerance
> something you're willing to sacrifice?

As above: no, I'm willing to sacrifice performance but not fault tolerance.
(It is entirely possible that others would want the opposite trade-off, i.e.
anything below some minimum performance is worse than useless, so if we can't
maintain that performance then dropping FT leaves us in a more useful position.)

> >>The patch has a 'slab' mechanism built in to it which implements
> >>a water-mark style policy that throws away unused portions of
> >>the 2x checkpoint memory if later checkpoints are much smaller
> >>(which is likely to be the case if the writable working set size changes).
> >>
> >>However, to answer your question: Such a knob could be achieved, but
> >>the same could be achieved simply by tuning the checkpoint frequency
> >>itself. Memory usage would thus be a function of the checkpoint frequency.
> >>If the guest application was maniacal, banging away at all the memory,
> >>there's very little that can be done in the first place, but if the
> >>guest application was mildly busy, you don't want to throw away your
> >>ability to be fault tolerant - you would just need more frequent
> >>checkpoints to keep up with the dirty rate.
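
(To put rough, purely illustrative numbers on that relationship: a guest
dirtying 1 GB/s of memory with a 100 ms checkpoint interval stages on the
order of 1 GB/s * 0.1 s = 100 MB per checkpoint; halving the interval roughly
halves the per-checkpoint memory, at the cost of paying the checkpoint
overhead twice as often.)
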
> >I'm not convinced; I can tune my checkpoint frequency until normal operation
> >makes a reasonable trade off between mc frequency and RAM usage,
> >but that doesn't prevent it running away when a garbage collect or some
> >other thing suddenly dirties a load of ram in one particular checkpoint.
> >Some management tool that watches ram usage etc can also help tune
> >it, but in the end it can't stop it taking loads of RAM.
> 
> That's correct. See above comment....
> 
> >
> >>Once the application died down - the water-mark policy would kick in
> >>and start freeing checkpoint memory. (Note: this policy happens on
> >>both sides in the patchset because the patch has to be fully compatible
> >>with RDMA memory pinning).
> >>
> >>What is *not* exposed, however, are the watermark knobs themselves.
> >>I definitely think those need to be exposed - that would also give you
> >>a control similar to 'max buffer size' - you could place a time limit
> >>on the slab list in the patch, or something like that.
> >>
> >>
> >>>>Good question in general - I'll add it to the FAQ. The patch implements
> >>>>a basic 'transaction' mechanism in coordination with an outbound I/O
> >>>>buffer (documented further down). With these two things in place,
> >>>>split-brain is not possible because the destination is not running.
> >>>>We don't allow the source to resume execution until a committed
> >>>>transaction has been acknowledged by the destination, and only then
> >>>>do we allow any outbound network traffic to be released to the
> >>>>outside world.
> >>>Yeh I see the IO buffer, what I've not figured out is how:
> >>>   1) MC over TCP/IP gets an acknowledgement on the source to know when
> >>>      it can unplug its buffer.
> >>Only partially correct (See the steps on the wiki). There are two I/O
> >>buffers at any given time which protect against a split-brain scenario:
> >>One buffer for the current checkpoint that is being generated (running VM)
> >>and one buffer for the checkpoint that is being committed in a transaction.
> >>
> >>>   2) Let's say the MC connection fails, so that ack never arrives,
> >>>      the source must assume the destination has failed and release its
> >>>      packets and carry on.
> >>Only the packets for Buffer A are released for the current committed
> >>checkpoint after a completed transaction. The packets for Buffer B
> >>(the current running VM) are still being held up until the next
> >>transaction starts.
> >>Later once the transaction completes and A is released, B becomes the
> >>new A and a new buffer is installed to become the new Buffer B for
> >>the current running VM.
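
If I've understood that A/B rotation right, it's essentially this (my own
sketch with made-up names, not the patchset's code):

    #include <stdlib.h>

    /* Two outbound packet buffers: 'a' belongs to the checkpoint currently
     * being committed, 'b' collects output from the currently running VM. */
    struct pkt_buffer { void *pkts; size_t nbytes; };

    static struct pkt_buffer *buf_a;   /* checkpoint being committed */
    static struct pkt_buffer *buf_b;   /* currently running VM */

    static struct pkt_buffer *pkt_buffer_new(void)
    {
        return calloc(1, sizeof(struct pkt_buffer));
    }

    static void pkt_buffer_release_to_network(struct pkt_buffer *b)
    {
        /* ... actually transmit the queued packets, then free them ... */
        free(b);
    }

    /* Called once both sides have acknowledged the committed checkpoint. */
    static void mc_transaction_complete(void)
    {
        pkt_buffer_release_to_network(buf_a);  /* A's packets may now leave  */
        buf_a = buf_b;                         /* B becomes the new A        */
        buf_b = pkt_buffer_new();              /* fresh B for the running VM */
    }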
> >>
> >>
> >>>      The destination must assume the source has failed and take over.
> >>The destination must also receive an ACK. The ack goes both ways.
> >>
> >>Only once the source and destination both acknowledge a completed
> >>transaction does the source VM resume execution - and even then
> >>its packets are still being buffered until the next transaction starts.
> >>(That's why it's important to checkpoint as frequently as possible).
> >I think I understand normal operation - my question here is about failure;
> >what happens when neither side gets any ACKs.
> 
> Well, that's simple: If there is a failure of the source, the destination
> will simply revert to the previous checkpoint using the same mode
> of operation. The lost ACKs that you're curious about only
> apply to the checkpoint that is in progress. Just because a
> checkpoint is in progress does not mean that the previous checkpoint
> is thrown away - it is already loaded into the destination's memory
> and ready to be activated.

I still don't see why, if the link between them fails, the destination
doesn't fall back to its previous checkpoint AND the source carries
on running - I don't see how they can differentiate which of them has failed.

> >>>   3) If we're relying on TCP/IP timeout that's quite long.
> >>>
> >>Actually, my experience has been that TCP seems to have more than
> >>one kind of timeout - if the receiver is not responding *at all*, it
> >>seems that TCP has a dedicated timer for that. The socket API immediately
> >>returns an error code and the patchset closes the connection
> >>on the destination and recovers.
> >How did you test that?
> >My experience is that if a host knows that it has no route to the destination
> >(e.g. it has no route that matches the destination because someone
> >took the network interface away) you immediately get a 'no route to host';
> >however, if an intermediate link disappears then it takes a while to time out.
> 
> We have a script architecture (not on github) which runs MC in a tight
> loop hundreds of times, kills the source QEMU, and timestamps how quickly
> the destination QEMU loses the TCP socket connection and receives an error
> code from the kernel - every single time, the destination resumes nearly
> instantaneously. I've not empirically seen a case where the socket just
> hangs or doesn't change state.
> 
> I'm not very familiar with the internal Linux TCP/IP stack implementation
> itself, but I have not had a problem with the Linux socket layer reliably
> shutting the socket down as soon as possible.

OK, that only covers a very small range of normal failures.
When you kill the destination QEMU the host OS knows that QEMU is dead
and sends a packet back closing the socket, hence the source knows
the destination is dead very quickly.
If:
   a) the destination machine were to lose power or hang, or
   b) a network link were to fail (other than, possibly, the one attached
      to the source)

the source would have to do a full TCP timeout.

To test (a) and (b) I'd use an iptables rule somewhere to cause the packets
to be dropped (not rejected).  Stopping the QEMU in gdb might be good enough.
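
For example, something along these lines on the source host (the address and
port are placeholders for wherever the MC stream actually goes):

    # Silently drop (not reject) the MC traffic towards the destination,
    # so the source only finds out via a TCP timeout.
    iptables -A OUTPUT -d 192.0.2.10 -p tcp --dport 6666 -j DROP

    # Or, to simulate a hung destination, attach gdb to its QEMU;
    # the process stays stopped while gdb holds it.
    gdb -p "$(pidof qemu-system-x86_64)"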

> The RDMA implementation uses a manual keepalive mechanism that I had to
> write from scratch - but I never ported this to the TCP implementation
> simply because failure detection always worked fine without it.
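
FWIW, on the TCP side you can get a bounded detection time without a
hand-rolled keepalive by asking the kernel to do it - a sketch (not what
the patchset currently does), assuming 'fd' is the connected MC socket:

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Bound how long a silent peer can go unnoticed on an MC socket. */
    static int mc_socket_set_timeouts(int fd)
    {
        int on = 1;
        int idle = 1, intvl = 1, cnt = 5;     /* probe after 1s idle, 5 probes 1s apart */
        unsigned int user_timeout_ms = 5000;  /* give up on unacked data after 5s */

        if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0 ||
            setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) < 0 ||
            setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) < 0 ||
            setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt)) < 0 ||
            setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                       &user_timeout_ms, sizeof(user_timeout_ms)) < 0) {
            return -1;
        }
        return 0;
    }

That would cover the cases in (a) and (b) above much faster than the default
retransmission timeout.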

Dave
--
Dr. David Alan Gilbert / address@hidden / Manchester, UK


