Re: [Qemu-devel] [PATCH 1/2] Fix Block Hotplug race with drive_del()


From: Markus Armbruster
Subject: Re: [Qemu-devel] [PATCH 1/2] Fix Block Hotplug race with drive_del()
Date: Thu, 11 Nov 2010 15:50:09 +0100
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.1 (gnu/linux)

Ryan Harper <address@hidden> writes:

> * Markus Armbruster <address@hidden> [2010-11-11 04:48]:
>> Ryan Harper <address@hidden> writes:
>> 
>> > * Markus Armbruster <address@hidden> [2010-11-10 11:40]:
>> >> Ryan Harper <address@hidden> writes:
>> >> 
>> >> > * Markus Armbruster <address@hidden> [2010-11-10 06:48]:
>> >> >> One real question, and a couple of nits.
>> >> >> 
>> >> >> Ryan Harper <address@hidden> writes:
>> >> >> 
>> >> >> > Block hot unplug is racy since the guest is required to acknowlege 
>> >> >> > the ACPI
>> >> >> > unplug event; this may not happen synchronously with the device 
>> >> >> > removal command
>> >> >> 
>> >> >> Well, I wouldn't call unplug "racy".  It just takes an unpredictable
>> >> >> length of time, possibly forever.  To make a race, you need to throw in
>> >> >> a client assuming (incorrectly) that unplug is instantaneous, as
>> >> >> described in your next paragraph.
>> >> >> 
>> >> >> Moreover, all PCI unplug is that way, not just block.
>> >> >> 
>> >> >> > This series aims to close a gap where by mgmt applications that 
>> >> >> > assume the
>> >> >> > block resource has been removed without confirming that the guest has
>> >> >> > acknowledged the removal may re-assign the underlying device to a 
>> >> >> > second guest
>> >> >> > leading to data leakage.
>> >> >> 
>> >> >> Yes, the incorrect assumption is a problem.  But with that fixed (in 
>> >> >> the
>> >> >> management application), we run right into the next problem: there is 
>> >> >> no
>> >> >> way for the management application to reliably disconnect the guest 
>> >> >> from
>> >> >> a block device.  And that's the problem you're fixing.
>> >> >
>> >> > Yeah, that's the right way to word it; providing a method to forcibly
>> >> > disconnect the guest from the host device.
>> >> >> 
>> >> >> > This series introduces a new montor command to decouple asynchornous 
>> >> >> > device
>> >> >> 
>> >> >> Typos "montor" and "asynchornous".  You might want to use a spell
>> >> >> checker :)
>> >> >> 
>> >> >> Lines are a bit long.  Recommend wrap at column 70.
>> >> >> 
>> >> >> > removal from restricting guest access to a block device.  We do this 
>> >> >> > by creating
>> >> >> > a new monitor command drive_del which maps to a bdrv_unplug() 
>> >> >> > command which
>> >> >> > does a qemu_aio_flush; bdrv_flush() and bdrv_close().  Once 
>> >> >> > complete, subsequent
>> >> >> > IO is rejected from the device and the guest will get IO errors but 
>> >> >> > continue to
>> >> >> > function.  In addition to preventing further IO, we clean up state 
>> >> >> > pointers
>> >> >> > between host (BlockDriverState) and guest (DeviceInfo).
>> >> >> >
>> >> >> > A subsequent device removal command can be issued to remove the 
>> >> >> > device, to which
>> >> >> > the guest may or maynot respond, but as long as the unplugged bit is 
>> >> >> > set, no IO
>> >> >> 
>> >> >> "maynot" is not a word.
>> >> >> 
>> >> >> > will be sumbitted.
>> >> >> 
>> >> >> This suggests doing drive_del before device_del, which makes the device
>> >> >> go through a "broken device" state on its way to unplug.  If the 
>> >> >> guest
>> >> >> accesses the device in that state, it gets I/O errors.  Not nice.
>> >> >> 
>> >> >> Instead, I'd recommend device_del, wait for the device to go away,
>> >> >> drive_del on timeout.  If the guest reacts to the ACPI unplug 
>> >> >> promptly,
>> >> >> it's never exposed to the "broken device" state.  Note: if the 
>> >> >> drive_del
>> >> >> fails because the device doesn't exist, we lost the race with the
>> >> >> automatic destruction, which is harmless.  Ignore that error.
>> >> >
>> >> > Honestly, other than describing what happens if you sever the connection
>> >> > when the guest isn't aware of it; I don't want to try to capture how the
>> >> > mgmt layer implements the removal.  
>> >> >
>> >> > One may want to force the disconnect before attempting to remove the
>> >> > device; or the other way around; that's really the mgmt layer's call.
>> >> 
>> >> Fair enough.
>> >> 
>> >> >> > Signed-off-by: Ryan Harper <address@hidden>
>> >> >> > ---
>> >> >> >  block.c         |    7 +++++++
>> >> >> >  block.h         |    1 +
>> >> >> >  blockdev.c      |   36 ++++++++++++++++++++++++++++++++++++
>> >> >> >  blockdev.h      |    1 +
>> >> >> >  hmp-commands.hx |   18 ++++++++++++++++++
>> >> >> >  5 files changed, 63 insertions(+), 0 deletions(-)
>> >> >> >
>> >> >> > diff --git a/block.c b/block.c
>> >> >> > index 6b505fb..c76a796 100644
>> >> >> > --- a/block.c
>> >> >> > +++ b/block.c
>> >> >> > @@ -1328,6 +1328,13 @@ void bdrv_set_removable(BlockDriverState *bs, 
>> >> >> > int removable)
>> >> >> >      }
>> >> >> >  }
>> >> >> >  
>> >> >> > +void bdrv_unplug(BlockDriverState *bs)
>> >> >> > +{
>> >> >> > +    qemu_aio_flush();
>> >> >> > +    bdrv_flush(bs);
>> >> >> > +    bdrv_close(bs);
>> >> >> > +}
>> >> >> > +
>> >> >> 
>> >> >> Unless we expect more users, I'd inline this into its only caller.
>> >> >> Matter of taste.
>> >> >
>> >> > Works for me.
>> >> >
>> >> >> 
>> >> >> >  int bdrv_is_removable(BlockDriverState *bs)
>> >> >> >  {
>> >> >> >      return bs->removable;
>> >> >> > diff --git a/block.h b/block.h
>> >> >> > index 78ecfac..581414c 100644
>> >> >> > --- a/block.h
>> >> >> > +++ b/block.h
>> >> >> > @@ -171,6 +171,7 @@ void bdrv_set_on_error(BlockDriverState *bs, 
>> >> >> > BlockErrorAction on_read_error,
>> >> >> >                         BlockErrorAction on_write_error);
>> >> >> >  BlockErrorAction bdrv_get_on_error(BlockDriverState *bs, int 
>> >> >> > is_read);
>> >> >> >  void bdrv_set_removable(BlockDriverState *bs, int removable);
>> >> >> > +void bdrv_unplug(BlockDriverState *bs);
>> >> >> >  int bdrv_is_removable(BlockDriverState *bs);
>> >> >> >  int bdrv_is_read_only(BlockDriverState *bs);
>> >> >> >  int bdrv_is_sg(BlockDriverState *bs);
>> >> >> > diff --git a/blockdev.c b/blockdev.c
>> >> >> > index 6cb179a..ee8c2ec 100644
>> >> >> > --- a/blockdev.c
>> >> >> > +++ b/blockdev.c
>> >> >> > @@ -14,6 +14,8 @@
>> >> >> >  #include "qemu-option.h"
>> >> >> >  #include "qemu-config.h"
>> >> >> >  #include "sysemu.h"
>> >> >> > +#include "hw/qdev.h"
>> >> >> > +#include "block_int.h"
>> >> >> >  
>> >> >> >  static QTAILQ_HEAD(drivelist, DriveInfo) drives = 
>> >> >> > QTAILQ_HEAD_INITIALIZER(drives);
>> >> >> >  
>> >> >> > @@ -597,3 +599,37 @@ int do_change_block(Monitor *mon, const char 
>> >> >> > *device,
>> >> >> >      }
>> >> >> >      return monitor_read_bdrv_key_start(mon, bs, NULL, NULL);
>> >> >> >  }
>> >> >> > +
>> >> >> > +int do_drive_del(Monitor *mon, const QDict *qdict, QObject 
>> >> >> > **ret_data)
>> >> >> > +{
>> >> >> > +    const char *id = qdict_get_str(qdict, "id");
>> >> >> > +    BlockDriverState *bs;
>> >> >> > +    Property *prop;
>> >> >> > +
>> >> >> > +    bs = bdrv_find(id);
>> >> >> > +    if (!bs) {
>> >> >> > +        qerror_report(QERR_DEVICE_NOT_FOUND, id);
>> >> >> > +        return -1;
>> >> >> > +    }
>> >> >> > +
>> >> >> > +    /* quiesce block driver; prevent further io */
>> >> >> > +    bdrv_unplug(bs);
>> >> >> > +
>> >> >> > +    /* clean up guest state from pointing to host resource by
>> >> >> > +     * finding and removing DeviceState "drive" property */
>> >> >> > +    for (prop = bs->peer->info->props; prop && prop->name; prop++) {
>> >> >> > +        if ((prop->info->type == PROP_TYPE_DRIVE) && 
>> >> >> > +            (*(BlockDriverState **)qdev_get_prop_ptr(bs->peer, 
>> >> >> > prop) == bs)) {
>> >> >> > +            if (prop->info->free) {
>> >> >> > +                prop->info->free(bs->peer, prop);
>> >> >> > +            }
>> >> 
>> >> Your use of prop->info->free() in this context is wrong.  More below.
>> >> 
>> >> >> 
>> >> >> Does this null the drive property?  I doubt it.  Quick check in the
>> >> >> debugger?
>> >> >> 
>> >> >> The free callbacks generally don't zap the properties, because they run
>> >> >> from qdev_free().
>> >> >
>> >> > To be honest; I didn't see anything that looked like "remove this
>> >> > property" in the qdev api.  Any pointers?
>> >> 
>> >> The closest we have is indeed the Property method free(), but that's not
>> >> quite right.  It's really only for use by qdev_free().
>> >> 
>> >> > should I be calling qdev_free() on the dev?
>> >> 
>> >> No, because then the whole device is gone, not just the property :)
>> >> 
>> >> >                                              I don't quite understand
>> >> > the distinction between the info list of properties and the device
>> >> > itself, nor specifically what we need to remove in the drive_del()
>> >> > operation versus the device_del() portion.
>> >> 
>> >> device_del / qdev_free() destroy a qdev, such as a "virtio-blk-pci"
>> >> device (C type VirtIOPCIProxy).
>> >> 
>> >> drive_del destroys something else, namely the block device host part
>> >> (BlockDriverState + DriveInfo).  Obviously, it needs to zap all
>> >> pointers to the host part along with it.  Specifically, it needs to zap
>> >> the device's pointer to it.
>> >> 
>> >> Example: if a "virtio-blk-pci" device is using drive "foo", then
>> >> "drive_del foo" needs to zap its member block.bs.
>> >> 
>> >> Complication: we don't (want to) know what kind of device exactly is
>> >> using the drive.  But we do know that a drive property must be
>> >> describing it.
>> >> 
>> >> So we search the properties (for (prop...)) for a drive property
>> >> (prop->info->type == PROP_TYPE_DRIVE) that points to this drive (... ==
>> >> bs).
>> >> 
>> >> Result:
>> >> 
>> >>     BlockDriverState *bs;
>> >>     Property *prop;
>> >>     BlockDriverState **ptr;
>> >> [...]
>> >>     for (prop = bs->peer->info->props; prop && prop->name; prop++) {
>> >>         if ((prop->info->type == PROP_TYPE_DRIVE)) {
>> >>             ptr = qdev_get_prop_ptr(dev, prop);
>> >>             if (*ptr == bs) {
>> >>                 bdrv_detach(bs, bs->peer);
>> >
>> > Invoking the free method on the drive property does do detach:
>> >
>> > free_drive
>> > {
>> >     BlockDriverState **ptr = qdev_get_prop_ptr(dev, prop);
>> >
>> >     if (*ptr) {
>> >         bdrv_detach(*ptr, dev);
>> >         blockdev_auto_del(*ptr);
>> >     }
>> > }
>> >
>> > and the bdrv_delete()
>> >
>> > takes out the bs pointer.
>> 
>> Which pointer?  Which bdrv_delete()?
>
> I suppose it's the BlockDriverState returned from bdrv_find() since I'm
> invoking bdrv_delete(bs);  
>
> And I suppose qdev_get_prop_ptr() is returning a different ptr to the
> same bs; in which case we'll still need the null you had suggested?

Yes.

qdev_get_prop_ptr() does not return a second pointer to bs.  It returns
a pointer to the qdev's own pointer to the bs you started with.

In other words:

* bs->peer points from bs to the qdev using this drive.

* The qdev state contains a pointer back to bs.

  Example: for virtio-blk-pci, that's VirtIOPCIProxy member block.bs.

* A drive property describes that pointer, and qdev_get_prop_ptr()
  returns a pointer to that pointer in the qdev state.

  Example: for a virtio-blk-pci, it returns
  &DO_UPCAST(VirtIOPCIProxy, pci_dev.qdev, bs->peer)->block.bs.

To disconnect the drive from the qdev, we need to zap both bs->peer and
the pointer to bs in the qdev state.
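
To make the two links concrete, here is a minimal standalone sketch.
It is a toy model, not QEMU code: ModelBDS, ModelDev,
model_get_drive_ptr() and model_disconnect() are hypothetical names
made up for illustration.  ModelBDS stands in for BlockDriverState,
ModelDev for a qdev such as virtio-blk-pci, and model_get_drive_ptr()
plays the role qdev_get_prop_ptr() plays for a drive property.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the real types; these are NOT QEMU's definitions. */
typedef struct ModelDev ModelDev;

typedef struct ModelBDS {
    ModelDev *peer;        /* host side -> the qdev using this drive */
} ModelBDS;

struct ModelDev {
    struct {
        ModelBDS *bs;      /* qdev state -> host side (think block.bs) */
    } block;
};

/*
 * Hypothetical analogue of qdev_get_prop_ptr() for a drive property:
 * returns the address of the pointer stored in the device state, so
 * the caller can overwrite that pointer in place.
 */
static ModelBDS **model_get_drive_ptr(ModelDev *dev)
{
    return &dev->block.bs;
}

/*
 * Sever both directions of the drive <-> qdev link.
 * Returns 0 on success, -1 if bs has no peer or the peer does not
 * point back at bs.
 */
static int model_disconnect(ModelBDS *bs)
{
    ModelDev *dev = bs->peer;
    ModelBDS **ptr;

    if (!dev) {
        return -1;
    }
    ptr = model_get_drive_ptr(dev);
    if (*ptr != bs) {
        return -1;
    }
    *ptr = NULL;           /* zap the qdev's pointer to bs */
    bs->peer = NULL;       /* zap bs's pointer to the qdev */
    return 0;
}
```

Note that clearing only one of the two pointers leaves the other one
dangling once the object it points to is destroyed, which is why both
assignments are needed.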

Clear?

[...]


