Re: [Qemu-devel] RFC: Operation Blockers in QEMU Block Nodes


From: Jeff Cody
Subject: Re: [Qemu-devel] RFC: Operation Blockers in QEMU Block Nodes
Date: Sat, 19 Dec 2015 00:42:00 -0500
User-agent: Mutt/1.5.21 (2010-09-15)

On Fri, Dec 18, 2015 at 03:19:25PM +0100, Kevin Wolf wrote:
> Am 16.12.2015 um 07:25 hat Jeff Cody geschrieben:
> > Background:
> > ------------
> > Block jobs, and other QAPI operations, may modify and impact the
> > BlockDriverState graph in QEMU.  In order to support multiple
> > operations safely, we need a mechanism to block and gate operations.
> > 
> > We currently have op blockers that are attached to each BDS.
> > However, in practice we check this on the device level, rather than on
> > the granularity of the individual BDS.  Also, due to limitations of
> > the current system, we only allow a single block job at a time.
> 
> As I already said on IRC, I think a really important part in order to
> make proper op blockers workable is that we actually enforce that you
> have to request permission before you can operate on an image. That is,
> if you were to write to an image, but you haven't requested write
> permission from the op blocker system before, you'd get an assertion
> failure.
> 
> The reason for this is that it's a long standing issue of the current
> system that it's not only possible, but even easy to forget adding the
> blocker code or updating it when the surrounding code changes. The
> result is a mechanism that is too restrictive for many cases on the one
> hand, but fails to actually provide the protection we need on the other
> hand (for example, bs->file is still unprotected).
> 
> Blockers are subtle and at the same time pervasive enough that I'm
> almost sure we wouldn't implement them correctly if we didn't force
> ourselves to do it right by letting qemu crash whenever we don't.
>

I agree with this.  The (too brief) blurb at the end of my email was
the kernel of my thoughts:

 "It may be possible to later expand the NAC system, to provide
 handles for use by low-level operations in block.c."

I think to do enforced permissions, we would need to introduce
something like handles (or tokens) for operations, so that an
operation would fail if the handle/token was not provided (or had
insufficient permissions).
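
To make that concrete, here is a very rough sketch of what I am
imagining; none of these names (NacHandle, nac_check, the NAC_* flags)
exist today, they are just placeholders for the idea of an explicit
token that low-level operations would have to present:

    /* Hypothetical sketch only -- none of these names exist in QEMU. */

    #include <assert.h>
    #include <stdint.h>

    /* Actions an operation can Require for itself and/or Allow others. */
    typedef enum {
        NAC_MODIFY_VISIBLE  = 1 << 0,
        NAC_MODIFY_METADATA = 1 << 1,
        NAC_GRAPH_RECONFIG  = 1 << 2,
        NAC_READ_VISIBLE    = 1 << 3,
        NAC_READ_METADATA   = 1 << 4,
    } NacAction;

    /* Token handed out by the NAC when a job acquires its permissions. */
    typedef struct NacHandle {
        const char *node_name;   /* node-name this handle refers to     */
        uint32_t    require;     /* actions this user was granted       */
        uint32_t    allow;       /* actions it tolerates from others    */
    } NacHandle;

    /* Low-level operations would take the handle and assert on it, so a
     * caller that never acquired the permission crashes immediately. */
    static inline void nac_check(const NacHandle *h, NacAction action)
    {
        assert(h && (h->require & action));
    }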

I mentioned "later" because I think if the code itself is flexible to
it, it may be better to grow the block layer into it.  But you are
right, doing it right away may be best.


> > Proposed New Design Features:
> > ------------------------------
> > This design would supersede the current op blocker system.
> > 
> > Rather than have the op blockers attached directly to each BDS, Block
> > Job access will be done through a separate Node Access Control system.
> > 
> > This new system will:
> > 
> >     * Allow / Disallow operations on BDSs (generally initiated by QAPI
> >       actions; i.e. BDS node operations other than guest read/writes)
> > 
> >     * Not be limited to "block jobs" (e.g. block-commit, block-stream,
> >       etc..) - will also apply to other operations, such as
> >       blockdev-add, change-backing-file, etc.
> > 
> >     * Allow granularity in options, and provide the safety needed to
> >       support multiple block jobs and operations.
> > 
> >     * Reference each BDS by its node-name
> > 
> >     * Be independent of the bdrv_states graph (i.e. does not reside in
> >       the BDS structure)
> 
> I agree that the BDS structure is probably not the right place (except
> possibly for caching the permission bitmasks), but I also wouldn't say
> that it should be completely independent structures. Otherwise keeping
> things in sync when the graph changes might turn out hard.
> 

And having some linking into the BDS will also make it more efficient,
and allow the NAC to probe the graph for insights (such as child /
parent node relationships) when checking for adequate permissions.

I just want something that makes the NAC a distinct entity that is
not necessarily constrained by the BDS graph (especially since we can
now unambiguously address all nodes independent of the graph).
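
As a rough illustration of what I mean by "distinct but linked" (all
names here are made up, not existing code): the NAC could keep its own
per-node records keyed by node-name, with at most a cached pointer back
into the graph for convenience:

    /* Hypothetical data layout -- not existing QEMU code. */

    #include <stdint.h>

    struct BlockDriverState;              /* opaque; lives in the BDS graph */

    /* One NAC record per node, addressed by node-name rather than by
     * position in the bdrv_states graph. */
    typedef struct NacNode {
        char                    *node_name; /* unambiguous node reference   */
        struct BlockDriverState *bs;        /* cached link, only used to
                                               inspect parent/child nodes   */
        uint32_t                 in_use;    /* union of Required actions    */
        uint32_t                 allowed;   /* intersection of Allow flags  */
        struct NacNode          *next;      /* NAC's own list of records    */
    } NacNode;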


> At the moment I can see two fundamentally different kinds of operations
> that are interesting for op blockers:
> 
> 1. I/O requests
> 
>    We need a place to check I/O requests that knows both the user (and
>    the permissions it has acquired) and the node that is accessed.
> 
>    Any I/O request from outside the block layer (which includes guest
>    devices, NBD servers and block jobs) should come through a
>    BlockBackend. Even though we don't have it yet, we want to have a
>    single BB per user. These requests could be dealt with by calling
>    into the NAC whenever a BDS is attached to or detached from a BB;
>    assertions can be checked in the appropriate blk_* functions.
> 

Can you elaborate more on "we want to have a single BB per user"?

I certainly agree that for most roles and operations, it is true that
the BB should be the representation.  However, most block jobs will
benefit from specifying node-names in addition to the BB, for easier
node identification.  I'd go as far as requiring it, if we are
implementing a new set of QAPI functions to go along with NAC and
simultaneous block jobs.

And I think ultimately, it is the block job handler that is going to
know what permissions are required for the job, and this will happen
closer to the BDS layer than the BB layer.

I have a suspicion we may be in agreement here, and I am just caught
up in definitions.

>    But there are also references between BDSes in the middle of the
>    graph, like a qcow2 image referencing (and sending I/O requests to)
>    its bs->file and bs->backing. It's important that we represent the
>    requirements of such nodes in the NAC. As permissions are always for
>    a specific user for a specific node, neither the parent node nor the
>    child node are the correct place to manage the permissions. It seems
>    that instead BdrvChild might be where they belong. The NAC needs to
>    be informed whenever a child node is attached or detached from a BDS.

The terminology "parent", "child", and then "BdrvChild" can get to be
confusing.  Just to clarify, when you say...

"The NAC needs to be informed whenever a child node is attached or
detached from a BDS."

...am I correct in interpreting this as "when a BdrvChild element is
added to or removed from the 'children' or 'parents' lists in the
BDS"? I am going to operate on the assumption this is what you meant.

So, a couple of things:

Do we need something separate for that, or can we let "Modify Visible
Data" along with "Modify Metadata" cover it?  Or maybe this is
stretching the definition of "Metadata" too far.

If the change does not affect the visible data, then we can use
"Modify Metadata" only as the Require.  If it will affect visible
data, then we would need "Modify Visible Data" as well as "Modify
Metadata" Requires.  

Or maybe we can create a new flag pair for altering BdrvChild elements
in a BDS ("Modify Constituents")?  More on this at [1].

> 
>    As it happens, I've already been considering for a while to convert
>    BlockBackend to using BdrvChild as well, so we could possibly unify
>    both cases.
> 

I do think BdrvChild could be interesting.

>    The problem we have with BdrvChild is that I/O requests don't go
>    through BdrvChild functions, so there is no place to add the
>    assertions we need. We could possibly still add them to blk_*, but
>    that would leave internal I/O requests unchecked.
> 

Another clarification request:  are you talking here about individual
I/O requests (for instance, a bdrv_pwrite()), and using the proposed
NAC for those?  I'm not sure from your email or our IRC discussions if
this is the case (likely my fault, not yours).

If we are indeed talking about low-level individual writes, then that
poses an issue.  My thinking on node-level protection was for
manipulation of nodes in one form or another, by internal processes.
That is, the operation knows ahead of time how to classify the set of
low-level block calls as a whole - e.g., whether the operation will
visibly modify the node data.

If we are talking about protecting at the individual arbitrary
bdrv_pwrite() level, however, then we don't know anything about how
the data impacts the node.  We would need to somehow classify the
visibility and type of each write, to see if the caller had adequate
permission.
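
To illustrate the difference (purely hypothetical names again): if the
permissions granted at attach time were cached in the child link, the
best an individual request could assert is a coarse "may write through
this link" bit; it could not classify the write as visible-data vs.
metadata on its own:

    /* Hypothetical sketch -- BdrvChild carries no such field today. */

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    enum { NAC_PERM_WRITE = 1 << 0 };   /* coarse, unclassified write bit */

    typedef struct BdrvChildSketch {
        void     *bs;             /* the child node                       */
        uint32_t  granted_perms;  /* filled in by the NAC at attach time  */
    } BdrvChildSketch;

    static int child_pwrite(BdrvChildSketch *c, int64_t offset,
                            const void *buf, size_t bytes)
    {
        /* Crash if the parent never acquired write permission for this
         * link; we cannot tell here whether the write is "visible". */
        assert(c->granted_perms & NAC_PERM_WRITE);

        /* ... issue the actual request to c->bs ... */
        (void)offset; (void)buf; (void)bytes;
        return 0;
    }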

> 2. Graph reconfiguration
> 
>    The problematic part here is that it's generally done by users who
>    don't have a direct reference to all the nodes that are affected from
>    the changes to the graph.
> 
>    For example, the commit job wants to drop all intermediate nodes
>    after completing the copy - but even after properly converting it to
>    BlockBackend, it would still only have a BB for the top and the base
>    node, but not to any of the intermediate nodes that it drops.

That's only problematic if there are other users of the intermediate
nodes, right?

> 
>    We don't seem to have convenient places like attaching/detaching from
>    nodes where we could put the permission requests. Also we don't seem
>    to have a central place for assertions; the best we can do is adding
>    them to some functions like bdrv_drop_intermediates().
>

The other thing to keep in mind is that we will want to keep
operations (such as block jobs) transactional, so failing out in
bdrv_drop_intermediates() isn't really what we want to do during normal
operation (it is fine for asserts, though, to catch mistakes).

We should know at the beginning of a job, before we start the actual
operation, that we have the permissions and lockouts that we need to
complete the job (or at least, that we can if we block on other
outstanding operations).
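
Something along these lines is what I have in mind for job start-up
(names invented): acquire every Require/Allow pair for every node up
front, and fail the job cleanly if any of them conflicts:

    /* Hypothetical all-or-nothing acquisition; nac_try_acquire() and
     * nac_release() stand in for the (not yet existing) NAC primitives. */

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct NacRequest {
        const char *node_name;
        uint32_t    require;     /* actions the job needs               */
        uint32_t    allow;       /* actions it tolerates in parallel    */
    } NacRequest;

    bool nac_try_acquire(const NacRequest *req);
    void nac_release(const NacRequest *req);

    static bool nac_job_start(const NacRequest *reqs, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            if (!nac_try_acquire(&reqs[i])) {
                /* Roll back whatever was already granted and bail out
                 * before the job has touched any data. */
                while (i--) {
                    nac_release(&reqs[i]);
                }
                return false;
            }
        }
        return true;
    }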

> Or maybe the problem isn't really "I/O requests" vs. "reconfiguration",
> but "nodes we have a reference for" vs. "nodes without explicit
> reference".
> 
> > Node Access Control: Jobs
> > --------------------------
> > Every QAPI/HMP initiated operation is considered a "job", regardless
> > if it is implemented via coroutines (e.g. block jobs such as
> > block-commit), or handled synchronously in a QAPI handler (e.g.
> > change-backing-file, blockdev-add).
> > 
> > A job consists of:
> >     * Unique job ID
> >     * A list of all node-names affected by the operation
> >     * Action flags (see below) for each node in the above list
> > 
> > 
> > Node Access Control: Action Flags
> > -----------------------------------
> > Every operation must set action flag pairs, for every node affected by
> > the operation.
> > 
> > Action flags are set as a Require/Allow pair.  If the Require
> > flag is set, then the operation requires the ability to take the
> > specific action.  If the Allow flag is set, then the operation will
> > allow other operations to perform the same action in parallel.
> > 
> > The default is to prohibit, not allow, parallel actions.
> > 
> > The proposed actions are:
> > 
> >     1. Modify - Visible Data
> >         * This is modification that would be detected by an external
> >           read (for instance, a hypothetical qemu-io read from the
> >           specific node), inclusive of its backing files.  A classic
> >           example would be writing data into another node, as part of
> >           block-commit.
> > 
> >     2. Modify - Metadata
> >         * This is a write that is not necessarily visible to an external
> >           read, but still modifies the underlying image format behind the
> >           node.  I.e., change-backing-file to an identical backing file.
> >           (n.b.: if changing the backing file to a non-identical
> >           backing file, the "Write - Visible Data" action would also
> >           be required).
> 
> Consider the case of block-streaming from B to C:
> 
>     A <- B <- C <--- virtio-blk
> 
> The block job writes data into C, but doesn't change the visible data.
> Would this be covered by this action? If so, I think "Metadata" is a
> misnomer because we actually copy data around.
> 

Yes, but not for the reason implied.  It isn't because we don't change
data that there is a "Metadata" requirement -- I didn't mean for
"Metadata == Non-Visible Modification".  The "Metadata" would be a
requirement here because we change the metadata in (C), by means of
changing its backing file to (A) at the end of the operation.

"Modify Metadata" may also involve "Modify Visible Data", if the
metadata changed affects visible data (for instance, changing the
backing file to a non-identical backing file).  The two flags are not
meant to be mutually exclusive.

(And as a side note, this is part of the reason I used the terminology
"Modify" rather than "Write".  Some operations may not be a write, but
may also still alter the perceived data of the node).

Here is a breakdown of the actions I see in your example, in the order
they happen:


     A <- B <- C <--- virtio-blk


    1. Data is read from (B) and written into (C)

         - This is not a "Modify Visible Data" operation, because a read
           from (C) will return the same results as it did prior to the
           operation.

    2. The backing file in (C) is modified to reflect (A) as the new
       backing file

         - This is what would require "Modify Metadata".  Not because
           the operation avoids "Modify Visible Data", but because we
           change the backing file of (C).

    3. Node (B) is dropped from the internal BDS graph, and (C)
       becomes the direct overlay of (A).

          - This would require the "Graph Reconfiguration"


Here is how I would see the full set of Require / Allow flags for each
node involved in your hypothetical block job:

                     Nodes involved: (A), (B), (C).

                ----------------------------------------------
                |   Node A     |   Node B     |   Node C     |
                | Req  : Allow | Req  : Allow | Req  : Allow |
=============================================================|
Modify Visible  |   0  :  1    |   0  :  0    |   0  :  1    |
Modify Metadata |   0  :  1    |   0  :  1    |   1  :  0    |
Graph Reconfig  |   1  :  0    |   0  :  1    |   1  :  0    |
Read Visible    |   0  :  1    |   1  :  1    |   0  :  1    |
Read Metadata   |   1  :  1    |   1  :  1    |   1  :  1    |
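
For what it's worth, this is how I would picture the job declaring the
table above in code (hypothetical names, same Require/Allow bits as the
table):

    /* Same placeholder flags and NacRequest struct as in the sketches
     * above; the initializer mirrors the table exactly. */

    #include <stdint.h>

    enum {
        NAC_MODIFY_VISIBLE  = 1 << 0,
        NAC_MODIFY_METADATA = 1 << 1,
        NAC_GRAPH_RECONFIG  = 1 << 2,
        NAC_READ_VISIBLE    = 1 << 3,
        NAC_READ_METADATA   = 1 << 4,
    };

    typedef struct NacRequest {
        const char *node_name;
        uint32_t    require;
        uint32_t    allow;
    } NacRequest;

    static const NacRequest stream_job_nodes[] = {
        { "A", NAC_GRAPH_RECONFIG | NAC_READ_METADATA,
               NAC_MODIFY_VISIBLE | NAC_MODIFY_METADATA |
               NAC_READ_VISIBLE | NAC_READ_METADATA },
        { "B", NAC_READ_VISIBLE | NAC_READ_METADATA,
               NAC_MODIFY_METADATA | NAC_GRAPH_RECONFIG |
               NAC_READ_VISIBLE | NAC_READ_METADATA },
        { "C", NAC_MODIFY_METADATA | NAC_GRAPH_RECONFIG | NAC_READ_METADATA,
               NAC_MODIFY_VISIBLE | NAC_READ_VISIBLE | NAC_READ_METADATA },
    };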


> >     3. Graph Reconfiguration
> >         * Removing, adding, or reordering nodes in the bdrv_state
> >           graph. (n.b.: Write - Visible Data may need to be set, if
> >           appropriate)
> 
> We need to define this so that it refers to an operation on a specific
> node. I think we can take this as changing the child node references of
> a given BDS.
>

[1]

Maybe add a new permission for BdrvChild operations rather than Graph
Reconfiguration?  Maybe "Modify Constituents" or something similar.

My thinking is that Graph Reconfiguration covers the relationship of a
node (and all the constituents that comprise the node, such as backing
files) to other nodes in the BDS graph.  And I'm not sure that
changing (for instance) bs->file should necessarily prevent moving the
bs node around in the graph.
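
As a sketch of what I mean (again, invented names): the new pair would
be checked from the central child attach/detach path, separately from
Graph Reconfiguration:

    /* Hypothetical extra flag, distinct from moving the node itself. */

    #include <stdbool.h>
    #include <stdint.h>

    enum {
        NAC_GRAPH_RECONFIG      = 1 << 2,  /* node's place in the graph   */
        NAC_MODIFY_CONSTITUENTS = 1 << 5,  /* node's own children, i.e.
                                              bs->file, bs->backing, ...  */
    };

    bool nac_check_action(const char *node_name, uint32_t action);

    /* Called from wherever a BdrvChild is attached to or detached from
     * a BDS: only the parent's "constituents" permission is involved;
     * moving the parent around in the graph is a separate question.  */
    static bool nac_child_change_allowed(const char *parent_node_name)
    {
        return nac_check_action(parent_node_name, NAC_MODIFY_CONSTITUENTS);
    }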


> >     4. Read - Visible Data
> >         * Read data visible to an external read.
> > 
> >     5. Read - Metadata
> >         * Read node metadata
> 
> Is this a useful operation? I think everyone who does anything with an
> image is automatically reading metadata.
>

So far, I don't think it is.  I include it here for completeness, but
it is the sort of thing that may very well just fall out in
implementation because everything requires it but no one disallows it.

> 
> <thinking aloud>
> 
> I think there is something missing. Initially I thought it's just
> another action, but I'm not so sure any more. Let's check a block-commit
> job this time, commit C back into A:
> 
>          +---------- NBD server
>          v
>     A <- B <- C <--- virtio-blk
> 
> Now as we all know, committing data from C back into A makes B invalid,
> so this should be prohibited. "Read - Visible Data" could possibly
> provide this, but the bs->backing read in C would use the same action,
> so it's impossible to allow the backing file relationship, but not the
> NBD server.

In this case, to prevent the block-commit from doing something bad,
it seems like the block-commit handler should fail when trying to
obtain the Require on "Modify - Visible Data" for (B), because the NBD
server would already have a NAC id open on (B) with the Allow for
"Modify - Visible Data" set to 0.  

> This is the part that could possibly be solved if the commit
> job converted all intermediate bs->backing links to something weaker,
> like "Read Backing File", which doesn't care whether the image is
> self-consistent.
> 
> However, we could instead have two backing file relationships. Then one
> of them must be okay (the B <- C one) and the other one must block
> the commit job. And in fact, if it's the block-commit job that changes
> the link to weaker semantics, we would achieve this.
> 
> Hm... Does this make any sense? I'm not sure if this would turn out
> completely correct.
> 
> </thinking aloud>
>

There may be other cases to consider like this -- but for
block-commit, we can actually make this somewhat safe.  The idea for
this stems from an IRC conversation with Eric Blake a while back, and
I touched on it briefly at KVM Forum [2].

We introduce a new commit command ("block-safe-commit" or
"block-commit-and-stream").

The idea is that, prior to overwriting data in (A), we stream (A) into
(B).  The virtio-blk view doesn't care, because (C+B) is pushed into
(A), and (C) then just points to (A), and (B) is out of the picture.
But the NBD server view is still fine as well, because (B) now won't
see the new data in (A), as (B) is independent of (A).

If we wanted to keep (B) as sparse as possible, we could be craftier
still, and only write sectors from (A) into (B) if they are going to
be altered by (C).

[2] http://qemu.rocks/jtc-kvm2015/#/4/1
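
A rough sketch of the copy order (the helpers here are invented
placeholders, not QEMU functions): before each cluster of (A) is
overwritten by the commit, its old contents are preserved in (B) unless
(B) already has its own data there:

    /* Hypothetical commit-and-stream loop; helper names are placeholders. */

    #include <stdbool.h>
    #include <stdint.h>

    bool commit_will_overwrite(int64_t cluster);   /* cluster changes in A  */
    bool allocated_in_b(int64_t cluster);          /* B has its own copy    */
    void copy_cluster(const char *from, const char *to, int64_t cluster);
    int64_t nb_clusters(void);

    static void commit_c_into_a_keeping_b(void)
    {
        for (int64_t cl = 0; cl < nb_clusters(); cl++) {
            if (!commit_will_overwrite(cl)) {
                continue;                     /* A keeps its old data here  */
            }
            if (!allocated_in_b(cl)) {
                copy_cluster("A", "B", cl);   /* stream old data into B     */
            }
            copy_cluster("C", "A", cl);       /* then commit the new data   */
        }
        /* Afterwards (C)'s backing file is changed to (A); (B) still backs
         * onto (A), but every cluster the commit changed now has its own
         * old copy in (B), so the NBD server's view of (B) is unchanged. */
    }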




-Jeff


