Re: [Qemu-devel] [PATCH v2 03/11] block: Filtered children access functi

qemu-devel
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-devel] [PATCH v2 03/11] block: Filtered children access functi

From:	Max Reitz
Subject:	Re: [Qemu-devel] [PATCH v2 03/11] block: Filtered children access functions
Date:	Wed, 14 Nov 2018 20:52:02 +0100
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.3.0
On 12.11.18 23:17, Eric Blake wrote:
> On 8/9/18 5:31 PM, Max Reitz wrote:
>> What bs->file and bs->backing mean depends on the node.  For filter
>> nodes, both signify a node that will eventually receive all R/W
>> accesses.  For format nodes, bs->file contains metadata and data, and
>> bs->backing will not receive writes -- instead, writes are COWed to
>> bs->file.  Usually.
>>
>> In any case, it is not trivial to guess what a child means exactly with
>> our currently limited form of expression.  It is better to introduce
>> some functions that actually guarantee a meaning:
>>
>> - bdrv_filtered_cow_child() will return the child that receives requests
>>    filtered through COW.  That is, reads may or may not be forwarded
>>    (depending on the overlay's allocation status), but writes never go to
>>    this child.
>>
>> - bdrv_filtered_rw_child() will return the child that receives requests
>>    filtered through some very plain process.  Reads and writes issued to
>>    the parent will go to the child as well (although timing, etc. may be
>>    modified).
>>
>> - All drivers but quorum (but quorum is pretty opaque to the general
>>    block layer anyway) always only have one of these children: All read
>>    requests must be served from the filtered_rw_child (if it exists), so
>>    if there was a filtered_cow_child in addition, it would not receive
>>    any requests at all.
>>    (The closest here is mirror, where all requests are passed on to the
>>    source, but with write-blocking, write requests are "COWed" to the
>>    target.  But that just means that the target is a special child that
>>    cannot be introspected by the generic block layer functions, and that
>>    source is a filtered_rw_child.)
>>    Therefore, we can also add bdrv_filtered_child() which returns that
>>    one child (or NULL, if there is no filtered child).
>>
>> Also, many places in the current block layer should be skipping filters
>> (all filters or just the ones added implicitly, it depends) when going
>> through a block node chain.  They do not do that currently, but this
>> patch makes them.
> 
> The description makes sense; now on to the code.
> 
>>
>> Signed-off-by: Max Reitz <address@hidden>
>> ---
>>   qapi/block-core.json           |   4 +
>>   include/block/block.h          |   1 +
>>   include/block/block_int.h      |  33 +++++-
>>   block.c                        | 184 ++++++++++++++++++++++++++++-----
>>   block/backup.c                 |   8 +-
>>   block/block-backend.c          |  16 ++-
>>   block/commit.c                 |  36 ++++---
>>   block/io.c                     |  27 ++---
>>   block/mirror.c                 |  37 ++++---
>>   block/qapi.c                   |  26 ++---
>>   block/stream.c                 |  15 ++-
>>   blockdev.c                     |  84 ++++++++++++---
>>   migration/block-dirty-bitmap.c |   4 +-
>>   nbd/server.c                   |   8 +-
>>   qemu-img.c                     |  12 ++-
>>   15 files changed, 363 insertions(+), 132 deletions(-)
>>
>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>> index f20efc97f7..a71df88eb2 100644
>> --- a/qapi/block-core.json
>> +++ b/qapi/block-core.json
>> @@ -2248,6 +2248,10 @@
>>   # On successful completion the image file is updated to drop the
>> backing file
>>   # and the BLOCK_JOB_COMPLETED event is emitted.
> 
> Context: this is part of block-stream.
> 
>>   #
>> +# In case @device is a filter node, block-stream modifies the first
>> non-filter
>> +# overlay node below it to point to base's backing node (or NULL if
>> @base was
>> +# not specified) instead of modifying @device itself.
> 
> That is, if we have:
> 
> base <- filter1 <- active <- filter2
> 
> and request a block-stream with "top":"filter2", it is no different in
> effect than if we had requested "top":"active", since filter nodes can't
> be stream targets.  Makes sense.
> 
> What happens if we request "base":"filter1"? Do we want to require base
> to be a non-filter node?

Well, then you get this after streaming:

base <- active <- filter2

There is no good reason why you'd stream to remove filters (just doing a
reopen should be enough), but why not.  We can make the backing pointer
point to any child, so it doesn't matter what the child is.  The problem
is that we can only write backing file strings into actual COW overlay
nodes, so it does matter what the parent is.

>> +++ b/include/block/block_int.h
>> @@ -91,6 +91,7 @@ struct BlockDriver {
>>        * certain callbacks that refer to data (see block.c) to their
>> bs->file if
>>        * the driver doesn't implement them. Drivers that do not wish
>> to forward
>>        * must implement them and return -ENOTSUP.
>> +     * Note that filters are not allowed to modify data.
> 
> They can modify offsets and timing, but not data?  Even if it is an
> encryption filter?  I'm trying to figure out if LUKS behaves like a filter.

It doesn't.  It's a format.

First of all, LUKS has metadata, so it definitely is a format.

Second, even if it didn't, I think it is a very, very useful convention
to declare filters as things that do not modify data.  If a block driver
does modify data, there is absolutely no point in handling it any
different than a normal format driver.

>> +++ b/block.c
>> @@ -532,11 +532,12 @@ int bdrv_create_file(const char *filename,
>> QemuOpts *opts, Error **errp)
>>   int bdrv_probe_blocksizes(BlockDriverState *bs, BlockSizes *bsz)
>>   {
>>       BlockDriver *drv = bs->drv;
>> +    BlockDriverState *filtered = bdrv_filtered_rw_bs(bs);
> 
> Is it worth a micro-optimization of not calling this...
> 
>>         if (drv && drv->bdrv_probe_blocksizes) {
>>           return drv->bdrv_probe_blocksizes(bs, bsz);
> 
> ...until after checking drv->bdrv_probe_blocksizes?

I don't know? :-)

I wouldn't think so, as bdrv_filtered_rw_bs() is just so simple.

>> -    } else if (drv && drv->is_filter && bs->file) {
>> -        return bdrv_probe_blocksizes(bs->file->bs, bsz);
>> +    } else if (filtered) {
>> +        return bdrv_probe_blocksizes(filtered, bsz);
>>       }
> 
> But I don't mind if you leave it as written.
> 
> Is blkdebug a filter, or something else?

I would have said it's a filter.

> That's a case of something
> that DOES change block sizes in relation to the child that it is
> filtering.  If we have qcow2 -> blkdebug -> file, and the qcow2 format
> layer wants to know the blocksizes of its child, does it really always
> want the sizes of 'file' rather than the (possibly changed) sizes of
> 'blkdebug'?

Hm.  See, that's why this series is so difficult, because all these
questions keep popping up. :-)

This is a very good question indeed.  I think for all filters but
blkdebug it makes sense to just pass this through to the filtered child,
because this should fundamentally go down to the protocol layer anyway.

However, when looking at who uses this function at all, it appears that
this is just used for guest device configuration (so the guest device's
cluster size matches the hosts).  qcow2 doesn't support this at all, so
if you use qcow2, you'll just get the default of BDRV_SECTOR_SIZE.  If
you want to override the auto-detection, you can set a device-level
option.  So I don't think we need support in blkdebug to emulate a
different block size, because of two reasons:

First, it wouldn't be a test of the block layer, because the block layer
really doesn't care about this (internally).

So second, it would only be a test of a guest.  But if you want to test
that, you can always just set the device-level option.

>>         return -ENOTSUP;
>> @@ -551,11 +552,12 @@ int bdrv_probe_blocksizes(BlockDriverState *bs,
>> BlockSizes *bsz)
>>   int bdrv_probe_geometry(BlockDriverState *bs, HDGeometry *geo)
>>   {
>>       BlockDriver *drv = bs->drv;
>> +    BlockDriverState *filtered = bdrv_filtered_rw_bs(bs);
>>         if (drv && drv->bdrv_probe_geometry) {
>>           return drv->bdrv_probe_geometry(bs, geo);
>> -    } else if (drv && drv->is_filter && bs->file) {
>> -        return bdrv_probe_geometry(bs->file->bs, geo);
>> +    } else if (filtered) {
>> +        return bdrv_probe_geometry(filtered, geo);
>>       }
> 
> At least you're consistent on skipping filters.

I tried my best to come up with something that makes sense.  Sometimes
it made me nearly go insane.

>> @@ -4068,7 +4074,19 @@ BlockDriverState *bdrv_lookup_bs(const char
>> *device,
>>   bool bdrv_chain_contains(BlockDriverState *top, BlockDriverState *base)
>>   {
>>       while (top && top != base) {
>> -        top = backing_bs(top);
>> +        top = bdrv_filtered_bs(top);
>> +    }
>> +
>> +    return top != NULL;
>> +}
>> +
>> +/* Same as bdrv_chain_contains(), but skip implicitly added R/W filter
>> + * nodes and do not move past explicitly added R/W filters. */
>> +bool bdrv_legacy_chain_contains(BlockDriverState *top,
>> BlockDriverState *base)
>> +{
>> +    top = bdrv_skip_implicit_filters(top);
>> +    while (top && top != base) {
>> +        top = bdrv_skip_implicit_filters(bdrv_filtered_cow_bs(top));
>>       }
> 
> Is there a goal of getting rid of bdrv_legacy_chain_contains() in the
> future?  If so, should the commit message and/or code comments mention
> that?

The only thing that's using it is qmp_block_commit.  I think the
long-term goal is to get rid of the commit job and replace it by
blockdev-copy, which would make the use of that function moot, but I
suppose we have to keep it around as long as block-commit is there.

>>         return top != NULL;
>> @@ -4140,20 +4158,24 @@ int bdrv_has_zero_init_1(BlockDriverState *bs)
>>     int bdrv_has_zero_init(BlockDriverState *bs)
>>   {
>> +    BlockDriverState *filtered;
>> +
>>       if (!bs->drv) {
>>           return 0;
>>       }
>>         /* If BS is a copy on write image, it is initialized to
>>          the contents of the base image, which may not be zeroes.  */
>> -    if (bs->backing) {
>> +    if (bdrv_filtered_cow_child(bs)) {
>>           return 0;
> 
> Not for this patch, but should we ask the filtered_cow_child if it is
> known to be all-zero content before blindly returning 0 here? Some
> children may be able to efficiently report if they have all-zero content
> [for example, see the recent thread about NBD performace drop due to
> explicitly zeroing the remote device, which could be skipped if it is
> known that the remote device started life uninitialized]

The question is, why would you have an empty backing file?

>>       }
>>       if (bs->drv->bdrv_has_zero_init) {
>>           return bs->drv->bdrv_has_zero_init(bs);
>>       }
>> -    if (bs->file && bs->drv->is_filter) {
>> -        return bdrv_has_zero_init(bs->file->bs);
>> +
>> +    filtered = bdrv_filtered_rw_bs(bs);
>> +    if (filtered) {
>> +        return bdrv_has_zero_init(filtered);
>>       }
> 
> You argued earlier that a filter can't change contents - but is that
> just guest-visible contents? If LUKS is a filter node, then a file that
> is zero-initialized is NOT zero-initialized after passing through LUKS
> encryption (decrypting the zeros returns garbage; conversely, writing
> zeros into LUKS results in random-looking bits in the file).  I guess
> I'm leaning more and more towards LUKS is not a filter, but a format.

Yeah.  I think it's just not useful to consider LUKS a filter, because
if filters can change data -- then what good is having the "filter"
category?

>> @@ -4198,8 +4220,9 @@ int bdrv_get_info(BlockDriverState *bs,
>> BlockDriverInfo *bdi)
>>           return -ENOMEDIUM;
>>       }
>>       if (!drv->bdrv_get_info) {
>> -        if (bs->file && drv->is_filter) {
>> -            return bdrv_get_info(bs->file->bs, bdi);
>> +        BlockDriverState *filtered = bdrv_filtered_rw_bs(bs);
>> +        if (filtered) {
>> +            return bdrv_get_info(filtered, bdi);
> 
> Is this right for blkdebug?

I think it is.  If it wants to intercept this function, it's free to
implement .bdrv_get_info.

The alternative is returning -ENOTSUP, and I don't see how that's any
better than passing data through from the child.

>> @@ -5487,3 +5519,105 @@ bool
>> bdrv_can_store_new_dirty_bitmap(BlockDriverState *bs, const char *name,
>>         return drv->bdrv_can_store_new_dirty_bitmap(bs, name,
>> granularity, errp);
>>   }
>> +
>> +/*
>> + * Return the child that @bs acts as an overlay for, and from which
>> data may be
>> + * copied in COW or COR operations.  Usually this is the backing file.
>> + */
>> +BdrvChild *bdrv_filtered_cow_child(BlockDriverState *bs)
>> +{
>> +    if (!bs || !bs->drv) {
>> +        return NULL;
>> +    }
>> +
>> +    if (bs->drv->is_filter) {
>> +        return NULL;
>> +    }
> 
> Here, filters end the search...

Yes, because COW parents have is_filter == false...

>> +
>> +    return bs->backing;
>> +}
>> +
>> +/*
>> + * If @bs acts as a pass-through filter for one of its children,
>> + * return that child.  "Pass-through" means that write operations to
>> + * @bs are forwarded to that child instead of triggering COW.
>> + */
>> +BdrvChild *bdrv_filtered_rw_child(BlockDriverState *bs)
>> +{
>> +    if (!bs || !bs->drv) {
>> +        return NULL;
>> +    }
>> +
>> +    if (!bs->drv->is_filter) {
>> +        return NULL;
>> +    }
> 
> ...while here, non-filters end the search. I think I follow your
> semantics (we were abusing bs->backing for filters, and your code is now
> trying to distinguish what was really meant)

...while R/W filter parents have is_filter == true.  So that's why it's
the opposite.

>> +
>> +    return bs->backing ?: bs->file;
>> +}
>> +
>> +/*
>> + * Return any filtered child, independently on how it reacts to write
> 
> s/on/of/

Indeed.

>> + * accesses and whether data is copied onto this BDS through COR.
>> + */

[...]

>> +++ b/block/io.c
>> @@ -120,6 +120,7 @@ static void bdrv_merge_limits(BlockLimits *dst,
>> const BlockLimits *src)
>>   void bdrv_refresh_limits(BlockDriverState *bs, Error **errp)
>>   {
>>       BlockDriver *drv = bs->drv;
>> +    BlockDriverState *cow_bs = bdrv_filtered_cow_bs(bs);
>>       Error *local_err = NULL;
>>         memset(&bs->bl, 0, sizeof(bs->bl));
>> @@ -148,13 +149,13 @@ void bdrv_refresh_limits(BlockDriverState *bs,
>> Error **errp)
>>           bs->bl.max_iov = IOV_MAX;
>>       }
>>   -    if (bs->backing) {
>> -        bdrv_refresh_limits(bs->backing->bs, &local_err);
>> +    if (cow_bs) {
>> +        bdrv_refresh_limits(cow_bs, &local_err);
>>           if (local_err) {
>>               error_propagate(errp, local_err);
>>               return;
>>           }
>> -        bdrv_merge_limits(&bs->bl, &bs->backing->bs->bl);
>> +        bdrv_merge_limits(&bs->bl, &cow_bs->bl);
> 
> Is this doing the right things with blkdebug?

First, blkdebug doesn't have a COW child, does it?

Second, we still always invoke the driver's implementation (if there is
one).  All of the code at the beginning of the function just chooses
some defaults.  So blkdebug can still override everything.

But there is indeed something wrong here.  And that is: What is with R/W
filter drivers that use bs->backing?  After this patch, they won't get
any defaults.

So I think the change that is needed is:
- The bs->file branch should be transformed into a bdrv_storage_bs()
  branch (this is done by the next patch already, good)
- The bs->backing branch should be transformed into a bdrv_filtered_bs()
  branch

Then we have the following cases:
- R/W filters will go into the second branch rather than the first, but
  that's OK, because the code is the same anyway.
  (But all filters that used bs->backing already did go into the second
  branch, so...)
- COW nodes (with both a storage child and a filtered child) will
  continue to go into both branches and get a joined result.
- Non-COW format nodes will continue to go into the first branch.

Before we have bs_storage_bs() (that is, before the next patch), I think
it's OK to make filter nodes that use bs->file go into both branches
(because bs->file is set for them, so they'll go into the first branch,
and then, as filters, they'll go into the second branch).


So I think all that's needed is s/cow_bs/filtered_bs/ and
s/bdrv_filtered_cow_bs/bdrv_filtered_bs/.

Max
signature.asc
Description: OpenPGP digital signature
[Prev in Thread]
Current Thread
[Next in Thread]
Re: [Qemu-devel] [PATCH v2 03/11] block: Filtered children access functions, Eric Blake, 2018/11/12
- Re: [Qemu-devel] [PATCH v2 03/11] block: Filtered children access functions, Max Reitz <=
Prev by Date: Re: [Qemu-devel] [PATCH] cpus: run work items for all vCPUs if single-threaded
Next by Date: Re: [Qemu-devel] [PATCH RFC 3/6] qapi: rewrite string-input-visitor
Previous by thread: Re: [Qemu-devel] [PATCH v2 03/11] block: Filtered children access functions
Next by thread: Re: [Qemu-devel] [PATCH v2 04/11] block: Storage child access function
Index(es):
- Date
- Thread