
Re: [Qemu-devel] RFC: Operation Blockers in QEMU Block Nodes


From: Jeff Cody
Subject: Re: [Qemu-devel] RFC: Operation Blockers in QEMU Block Nodes
Date: Wed, 16 Dec 2015 13:45:11 -0500
User-agent: Mutt/1.5.21 (2010-09-15)

On Wed, Dec 16, 2015 at 09:31:26AM -0700, Eric Blake wrote:
> On 12/15/2015 11:25 PM, Jeff Cody wrote:
> > Background:
> > ------------
> > Block jobs and other QAPI operations may modify the
> > BlockDriverState graph in QEMU.  To support multiple concurrent
> > operations safely, we need a mechanism to block and gate them.
> > 
> > We currently have op blockers that are attached to each BDS.
> > However, in practice we check them at the device level, rather than
> > at the granularity of an individual BDS.  Also, due to limitations
> > of the current system, we only allow a single block job at a time.
> > 
> > 
> > Proposed New Design Features:
> > ------------------------------
> > This design would supersede the current op blocker system.
> > 
> > Rather than having op blockers attached directly to each BDS, block
> > job access will be done through a separate Node Access Control system.
> > 
> > This new system will:
> > 
> >     * Allow / disallow operations on BDSs (generally initiated by QAPI
> >       actions; i.e. BDS node operations other than guest read/writes)
> 
> Doesn't suspending/resuming the guest count as a case where guest
> read/writes affect allowed operations?  That is, a VM in state RUNNING
> counts as one that Requires read-data and write-data, and once it
> transitions to paused, the Requires can go away.
>

That's a good question.  I'm not sure, and I tend to think of the VM
guest as a special case, but I don't know that everyone agrees with
that.

My reasoning on why it is special can be summed up as: the
guest is concerned with the _device_ (i.e. BlockBackend), rather than
the BDS nodes (which are invisible to the guest).

For instance, if the guest had Require/Allow flags set and we then took
a live snapshot, those flags wouldn't really belong to any one BDS; we
would want them to follow the new active layer - or, more accurately,
to stay attached to the BlockBackend device rather than to any BDS.

So if we wanted to encapsulate the guest, rather than trying to do it
via the active layer, I think we should extend this proposed NAC scheme
with a BBAC (BlockBackend Access Control) layer, along the lines of the
sketch below.
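
Something like this, hypothetically - the same Require/Allow flag
machinery, just keyed on the device rather than on a node-name (all
names here are invented for illustration):

#include <stdint.h>

/* Hypothetical BBAC entry.  A guest in state RUNNING would hold
 * Require read/write on its BlockBackend, dropped when the VM pauses. */
typedef struct BBACEntry {
    char *device;            /* BlockBackend name, as seen by the guest */
    uint32_t require_flags;  /* actions the guest requires */
    uint32_t allow_flags;    /* actions the guest tolerates in parallel */
} BBACEntry;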

> > 
> >     * Not be limited to "block jobs" (e.g. block-commit, block-stream,
> >       etc..) - will also apply to other operations, such as
> >       blockdev-add, change-backing-file, etc.
> > 
> >     * Allow granularity in options, and provide the safety needed to
> >       support multiple block jobs and operations.
> > 
> >     * Reference each BDS by its node-name
> > 
> >     * Be independent of the bdrv_states graph (i.e. does not reside in
> >       the BDS structure)
> > 
> > 
> > Node Access Control: Jobs
> > --------------------------
> > Every QAPI/HMP initiated operation is considered a "job", regardless
> > of whether it is implemented via coroutines (e.g. block jobs such as
> > block-commit) or handled synchronously in a QAPI handler (e.g.
> > change-backing-file, blockdev-add).
> > 
> > A job consists of:
> >     * Unique job ID
> >     * A list of all node-names affected by the operation
> 
> Do we need to hold some sort of lock when computing the list of
> node-names to be affected, since some of the very operations involved
> are those that would change the set of node-names impacted? That is, no
> operation can change the graph while another operation is collecting the
> list of nodes to be tied to a job.
> 

Yes, good point, I think you are right: a block-level global mutex of
sorts would be needed here.  I think we could limit its scope to graph
reconfiguration and to computing the list of node-names affected by an
operation - along the lines of the sketch below.
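
For what it's worth, the bookkeeping I have in mind looks roughly like
this (all names are hypothetical; glib primitives are used just for
illustration):

#include <glib.h>
#include <stdint.h>

typedef struct NACNodeEntry {
    char *node_name;         /* node-name of one affected BDS */
    uint32_t require_flags;  /* actions this job must be able to take */
    uint32_t allow_flags;    /* actions this job tolerates in parallel */
} NACNodeEntry;

typedef struct NACJob {
    uint64_t id;             /* unique job ID */
    GList *nodes;            /* list of NACNodeEntry */
} NACJob;

/* One global lock, taken around graph reconfiguration and around
 * computing a new job's node list, so the graph cannot change while a
 * job is still collecting the set of nodes it affects. */
static GMutex nac_graph_lock;

static void nac_job_add_node(NACJob *job, const char *node_name,
                             uint32_t require, uint32_t allow)
{
    NACNodeEntry *e = g_new0(NACNodeEntry, 1);

    e->node_name = g_strdup(node_name);
    e->require_flags = require;
    e->allow_flags = allow;
    job->nodes = g_list_append(job->nodes, e);
}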

> >     * Action flags (see below) for each node in the above list
> > 
> > 
> > Node Access Control: Action Flags
> > -----------------------------------
> > Every operation must set action flag pairs for every node affected by
> > the operation.
> > 
> > Action flags are set as a Require/Allow pair.  If the Require
> > flag is set, then the operation requires the ability to take the
> > specific action.  If the Allow flag is set, then the operation will
> >       allow other operations to perform the same action in parallel.
> > 
> > The default is to prohibit, not allow, parallel actions.
> > 
> > The proposed actions are:
> > 
> >     1. Modify - Visible Data
> >         * This is a modification that would be detected by an external
> >           read (for instance, a hypothetical qemu-io read from the
> >           specific node), inclusive of its backing files.  A classic
> >           example would be writing data into another node, as part of
> >           block-commit.
> > 
> >     2. Modify - Metadata
> >         * This is a write that is not necessarily visible to an external
> >           read, but still modifies the underlying image format behind the
> >           node.  I.e., change-backing-file to an identical backing file.
> >           (n.b.: if changing the backing file to a non-identical
> >           backing file, the "Write - Visible Data" action would also
> >           be required).
> 
> Will these tie in to Kevin's work on advisory qcow2 locking (where we
> try to detect the case of two concurrent processes both opening the same
> qcow2 file for writes)?
>

I don't think so - I think this is a level up (at the node level, not
the file level).  Also, this is all within the context of a single
running QEMU process, rather than system-level multi-process file
access protection (which is still important, but not the objective of
this proposal).

> > 
> >     3. Graph Reconfiguration
> >         * Removing, adding, or reordering nodes in the bdrv_states
> >           graph. (n.b.: Modify - Visible Data may need to be set, if
> >           appropriate)
> > 
> >     4. Read - Visible Data
> >         * Read data visible to an external read.
> > 
> >     5. Read - Metadata
> >         * Read node metadata
> > 
> > 
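
As an aside, I have been picturing these five actions as bits in a
mask, with each job carrying a Require mask and an Allow mask per node.
Something like this (names hypothetical):

/* Hypothetical encoding: one bit per proposed action, used for both the
 * Require mask and the Allow mask of a job's entry for a node. */
enum {
    NAC_MODIFY_VISIBLE_DATA = 1 << 0,  /* writes an external read would see */
    NAC_MODIFY_METADATA     = 1 << 1,  /* image-format changes not
                                        * externally visible */
    NAC_GRAPH_RECONFIG      = 1 << 2,  /* add / remove / reorder nodes */
    NAC_READ_VISIBLE_DATA   = 1 << 3,  /* externally visible reads */
    NAC_READ_METADATA       = 1 << 4,  /* read node metadata */
};
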
> > Node Access Control: Tracking Jobs
> > ----------------------------------
> > Each new NAC job will have a unique ID. Associated with that ID is
> > a list of each BDS node affected by the operation, alongside its
> > corresponding Action Flags.
> > 
> > When a new NAC job is created, a separate list of node-name flags is
> > updated to provide easy go/no-go decisions.
> > 
> > An operation is prohibited if:
> > 
> >     * A "Require" Action Flag is set, and
> >     * The logical AND of the "Allow" Action Flag of all NAC-controlled
> >       operations for that node is 0.
> > 
> >     - OR -
> > 
> >     * The operation does not set the "Allow" Action Flag, and
> >     * The logical OR of the corresponding "Require" Action Flag of all
> >       NAC-controlled operations for that node is 1.
> > 
> 
> Makes sense.
> 
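
For concreteness, the go/no-go test I have in mind would look roughly
like this (sketch only; the per-node summary masks would be maintained
as jobs are added and removed):

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-node summary:
 *   allow_all   = AND of the Allow masks of all current jobs on the node
 *                 (all ones when no job is active)
 *   require_any = OR of the Require masks of all current jobs on the node
 *                 (zero when no job is active) */
typedef struct NACNodeSummary {
    uint32_t allow_all;
    uint32_t require_any;
} NACNodeSummary;

/* True if an operation with the given masks may proceed on this node. */
static bool nac_node_check(const NACNodeSummary *s,
                           uint32_t require, uint32_t allow)
{
    /* Prohibited: we require an action some current job does not allow. */
    if (require & ~s->allow_all) {
        return false;
    }
    /* Prohibited: some current job requires an action we do not allow. */
    if (s->require_any & ~allow) {
        return false;
    }
    return true;
}
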
> > When a NAC-controlled job completes, the node-name flag list is
> > updated, and the corresponding NAC job is removed from the job list.
> > 
> > 
> > General Notes:
> > -----------------
> > * This is still an "honor" system, in that each handler / job is
> > responsible for acquiring NAC permission, and for correctly
> > identifying all nodes affected by its operation.
> > 
> > This should be done before any action is taken by the handler - that
> > is, it should be clean to abort the operation if the NAC does not give
> > consent.
> > 
> > * It may be possible to later expand the NAC system, to provide
> >   handles for use by low-level operations in block.c.
> > 
> > * Thoughts?  There are still probably some holes in the scheme that need
> > to be plugged.
> >
> 
> Looking forward to seeing some patches.
> 
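
Before I send those, this is the handler-side usage pattern I'm aiming
for, with acquisition happening before any state is touched.  The
nac_* helpers and qmp_example_op are invented for illustration; errp
follows QEMU's usual Error ** convention:

/* Hypothetical handler-side usage of the NAC API sketched above. */
void qmp_example_op(const char *node_name, Error **errp)
{
    NACJob *job = nac_job_new();

    /* Declare every affected node before touching any state. */
    nac_job_add_node(job, node_name,
                     NAC_MODIFY_METADATA,    /* Require */
                     NAC_READ_VISIBLE_DATA); /* Allow in parallel */

    if (!nac_job_acquire(job, errp)) {
        nac_job_free(job);      /* clean abort; nothing modified yet */
        return;
    }

    /* ... perform the actual operation ... */

    nac_job_release(job);       /* updates the node-name flag list */
    nac_job_free(job);
}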

Thanks for reviewing!

-Jeff


