qemu-devel


From: Kevin Wolf
Subject: Re: [Qemu-devel] [PATCH 4/5] disk_deadlines: add control of requests time expiration
Date: Thu, 10 Sep 2015 13:39:20 +0200
User-agent: Mutt/1.5.21 (2010-09-15)

On 10.09.2015 at 12:27, Stefan Hajnoczi wrote:
> On Tue, Sep 08, 2015 at 04:48:24PM +0200, Kevin Wolf wrote:
> > On 08.09.2015 at 16:23, Denis V. Lunev wrote:
> > > On 09/08/2015 04:05 PM, Kevin Wolf wrote:
> > > >On 08.09.2015 at 13:27, Denis V. Lunev wrote:
> > > >>Interesting point. Yes, it flushes all requests and most likely
> > > >>hangs inside, waiting for requests to complete. But fortunately
> > > >>this happens after the switch to the paused state, so
> > > >>the guest becomes paused. That's why I missed this
> > > >>fact.
> > > >>
> > > >>This could be considered a problem, but I have no good
> > > >>solution at the moment. I should think about it a bit more.
> > > >Let me suggest a radically different design. Note that I don't say this
> > > >is necessarily how things should be done, I'm just trying to introduce
> > > >some new ideas and broaden the discussion, so that we have a larger set
> > > >of ideas from which we can pick the right solution(s).
> > > >
> > > >The core of my idea would be a new filter block driver 'timeout' that
> > > >can be added on top of each BDS that could potentially fail, like a
> > > >raw-posix BDS pointing to a file on NFS. This way most pieces of the
> > > >solution are nicely modularised and don't touch the block layer core.
> > > >
> > > >During normal operation the driver would just be passing through
> > > >requests to the lower layer. When it detects a timeout, however, it
> > > >completes the request it received with -ETIMEDOUT. It also completes any
> > > >new request it receives with -ETIMEDOUT without passing the request on
> > > >until the request that originally timed out returns. This is our safety
> > > >measure against anyone seeing whether or how the timed out request
> > > >modified data.
> > > >
> > > >We need to make sure that bdrv_drain() doesn't wait for this request.
> > > >Possibly we need to introduce a .bdrv_drain callback that replaces the
> > > >default handling, because bdrv_requests_pending() in the default
> > > >handling considers bs->file, which would still have the timed out
> > > >request. We don't want to see this; bdrv_drain_all() should complete
> > > >even though that request is still pending internally (externally, we
> > > >returned -ETIMEDOUT, so we can consider it completed). This way the
> > > >monitor stays responsive and background jobs can go on if they don't use
> > > >the failing block device.
> > > >
> > > >And then we essentially reuse the rerror/werror mechanism that we
> > > >already have to stop the VM. The device models would be extended to
> > > >always stop the VM on -ETIMEDOUT, regardless of the error policy. In
> > > >this state, the VM would even be migratable if you make sure that the
> > > >pending request can't modify the image on the destination host any more.
> > > >
> > > >Do you think this could work, or did I miss something important?
> > > >
> > > >Kevin
> > > > Could I propose an even more radical solution, then?
> > > > 
> > > > My original approach was based on the fact that
> > > > this code should be maintainable out-of-stream.
> > > > If the patch is merged, this boundary condition
> > > > could be dropped.
> > > > 
> > > > Why not invent a 'terror' field in BdrvOptions
> > > > and process things in the core block layer without
> > > > a filter? The RB tree entry will simply not be created if
> > > > the policy is set to 'ignore'.
> > 
> > 'terror' might not be the most fortunate name... ;-)
> > 
> > The reason why I would prefer a filter driver is so the code and the
> > associated data structures are cleanly modularised and we can keep the
> > actual block layer core small and clean. The same is true for some other
> > functions that I would rather move out of the core into filter drivers
> > than add new cases (e.g. I/O throttling, backup notifiers, etc.), but
> > which are a bit harder to actually move because we already have old
> > interfaces that we can't break (we'll probably do it anyway eventually,
> > even if it needs a bit more compatibility code).
> > 
> > However, it seems that you are mostly touching code that is maintained
> > by Stefan, and Stefan used to be a bit more open to adding functionality
> > to the core, so my opinion might not be the last word.
> 
> I've been thinking more about the correctness of this feature:
> 
> QEMU cannot cancel I/O because there is no Linux userspace API for doing
> so.  Linux AIO's io_cancel(2) syscall is a nop since file systems don't
> implement a kiocb_cancel_fn.  Sending a signal to a task blocked in
> O_DIRECT preadv(2)/pwritev(2) doesn't work either because the task is in
> uninterruptible sleep.
> 
> The only way to make sure a request has finished is to wait for
> completion.  If we treat a request as failed/cancelled but it's actually
> still pending at a layer of the storage stack:
> 1. Read requests may modify guest memory.
> 2. Write requests may modify disk sectors.
> 
> Today the guest times out and tries to do IDE/ATA recovery, for example.
> This causes QEMU to eventually call the synchronous bdrv_drain_all()
> function and the guest hangs.  Also, if the guest mounts the file system
> read-only in response to the timeout, then game over.
> 
> The disk-deadlines feature lets QEMU detect timeouts before the guest so
> we can pause the guest.  The part I have been thinking about is that the
> only option is to wait until the request completes.
> 
> We cannot abandon the timed out request because we'll face #1 or #2
> above.  This means it doesn't make sense to retry the request like
> rerror=/werror=.  rerror=/werror= can retry safely because the original
> request has failed but that is not the case for timed out requests.
> 
> This also means that live migration isn't safe, at least if a write
> request is pending.  If the guest migrates, the pending write request on
> the source host could still complete after live migration handover,
> corrupting the disk.
> 
> Getting back to these patches: I think the implementation is correct in
> that the only policy is to wait for timed out requests to complete and
> then resume the guest.
> 
> However, these patches need to violate the constraint that guest memory
> isn't dirtied when the guest is paused.  This is an important constraint
> for the correctness of live migration, since we need to be able to track
> all changes to guest memory.
> 
> Just wanted to post this in case anyone disagrees.

You're making a few good points here.

I thought that migration with a pending write request could be safe
given some additional knowledge: if you know that the write is hanging
because the connection to the NFS server is down, and you make sure that
it remains disconnected, that would work. However, the hanging request
is already in the kernel, so you could never bring the connection up
again without rebooting the host, which is clearly not a realistic
assumption.

I hadn't thought of the constraints of live migration either, so it
seems read requests are equally problematic.

So it appears that the filter driver would have to add a migration
blocker whenever it sees any request time out, and only clear it again
when all pending requests have completed.
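
To make the bookkeeping concrete, here is a minimal sketch in plain C. It
is only an illustration under stated assumptions: the TimeoutFilter struct
and the filter_request_*() functions are hypothetical names, not existing
QEMU code, and the boolean merely stands in for registering and removing a
real migration blocker.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-filter state; these are not actual QEMU structures. */
typedef struct TimeoutFilter {
    unsigned in_flight;      /* requests passed down and not yet returned */
    bool migration_blocked;  /* stands in for a registered migration blocker */
} TimeoutFilter;

/* A request is handed to the lower layer. */
static void filter_request_submitted(TimeoutFilter *f)
{
    f->in_flight++;
}

/*
 * The timer for a request fired. The request is reported as failed to the
 * layer above, but it is still pending below us, so migration must be
 * blocked from now on.
 */
static void filter_request_timed_out(TimeoutFilter *f)
{
    f->migration_blocked = true;
}

/*
 * The lower layer really completed a request, timed out or not. The
 * blocker is only cleared once nothing is pending any more.
 */
static void filter_request_completed(TimeoutFilter *f)
{
    assert(f->in_flight > 0);
    f->in_flight--;
    if (f->in_flight == 0) {
        f->migration_blocked = false;
    }
}
```

With two requests in flight and one of them timing out, the flag stays set
after the first real completion and is only cleared after the second one.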

Kevin


