qemu-block
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-block] [Qemu-devel] [PATCH v4 21/23] block: Align block status


From: Eric Blake
Subject: Re: [Qemu-block] [Qemu-devel] [PATCH v4 21/23] block: Align block status requests
Date: Mon, 2 Oct 2017 18:51:21 -0500
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.3.0

On 10/02/2017 03:24 PM, John Snow wrote:
> 
> 
> On 09/13/2017 12:03 PM, Eric Blake wrote:
>> Any device that has request_alignment greater than 512 should be
>> unable to report status at a finer granularity; it may also be
>> simpler for such devices to be guaranteed that the block layer
>> has rounded things out to the granularity boundary (the way the
>> block layer already rounds all other I/O out).  Besides, getting
>> the code correct for super-sector alignment also benefits us
>> for the fact that our public interface now has byte granularity,
>> even though none of our drivers have byte-level callbacks.
>>
>> Add an assertion in blkdebug that proves that the block layer
>> never requests status of unaligned sections, similar to what it
>> does on other requests (while still keeping the generic helper
>> in place for when future patches add a throttle driver).  Note
>> that iotest 177 already covers this (it would fail if you use
>> just the blkdebug.c hunk without the io.c changes).  Meanwhile,
>> we can drop assertions in callers that no longer have to pass
>> in sector-aligned addresses.
>>
>> There is a mid-function scope added for 'int count', for a
>> couple of reasons: first, an upcoming patch will add an 'if'
>> statement that checks whether a driver has an old- or new-style
>> callback, and can conveniently use the same scope for less
>> indentation churn at that time.  Second, since we are trying
>> to get rid of sector-based computations, wrapping things in
>> a scope makes it easier to group and see what will be deleted
>> in a final cleanup patch once all drivers have been converted
>> to the new-style callback.
>>
>> Signed-off-by: Eric Blake <address@hidden>
>>

>> @@ -1815,28 +1816,45 @@ static int64_t coroutine_fn 
>> bdrv_co_block_status(BlockDriverState *bs,
>>      }
>>
>>      bdrv_inc_in_flight(bs);
>> -    /*
>> -     * TODO: Rather than require aligned offsets, we could instead
>> -     * round to the driver's request_alignment here, then touch up
>> -     * count afterwards back to the caller's expectations.
>> -     */
>> -    assert(QEMU_IS_ALIGNED(offset | bytes, BDRV_SECTOR_SIZE));
>> -    bytes = MIN(bytes, BDRV_REQUEST_MAX_BYTES);
>> -    ret = bs->drv->bdrv_co_get_block_status(bs, offset >> BDRV_SECTOR_BITS,
>> -                                            bytes >> BDRV_SECTOR_BITS, 
>> &count,
>> -                                            &local_file);
>> -    if (ret < 0) {
>> -        *pnum = 0;
>> -        goto out;
>> +
>> +    /* Round out to request_alignment boundaries */
>> +    align = MAX(bs->bl.request_alignment, BDRV_SECTOR_SIZE);
> 
> There's something funny to me about an alignment request getting itself
> aligned...

Pre-patch, we are asserting that all callers are passing in
sector-aligned requests (even though we've switched the interface to
allow byte-based granularity, none of the callers are yet taking
advantage of that), then passing on sector-aligned requests to the
driver (regardless of whether the driver has 1-byte alignment, like
posix-file, or is using 4k alignment, like some block devices).
Post-patch, we are going with the larger of the driver's preferred
alignment, and our minimum of 512 (mainly because until series 4, we
have no way to pass byte values on to the driver, even if the driver
otherwise supports smaller alignments).  This MAX() disappears later in
series 4, once the driver callback is made byte-based, first by becoming
conditional on whether the driver is old sector-based or new byte-based
callback:
 https://lists.gnu.org/archive/html/qemu-devel/2017-09/msg03814.html
and then altogether when the sector-based is deleted:
 https://lists.gnu.org/archive/html/qemu-devel/2017-09/msg03833.html

So, this patch, coupled with the new driver callback in series 4, is
what lets us introduce clients that are able to finally pass in values
that are not sector-aligned; and the rounding up of alignment here is a
stop-gap measure to keep things working until the transition is complete.

> 
>> +    aligned_offset = QEMU_ALIGN_DOWN(offset, align);
>> +    aligned_bytes = ROUND_UP(offset + bytes, align) - aligned_offset;
>> +
>> +    {
>> +        int count; /* sectors */
>> +
>> +        assert(QEMU_IS_ALIGNED(aligned_offset | aligned_bytes,
>> +                               BDRV_SECTOR_SIZE));
>> +        ret = bs->drv->bdrv_co_get_block_status(
>> +            bs, aligned_offset >> BDRV_SECTOR_BITS,
>> +            MIN(INT_MAX, aligned_bytes) >> BDRV_SECTOR_BITS, &count,
> 
> I guess under the belief that INT_MAX will be strictly less than
> BDRV_REQUEST_MAX_BYTES, or some other reason I'm missing for the change?

INT_MAX is larger than BDRV_REQUEST_MAX_BYTES.  aligned_bytes, however,
may be larger than BDRV_REQUEST_MAX_BYTES (or even larger than INT_MAX).
 Once series 4 introduces the driver callback with a 64-bit length, then
we can pass in aligned_bytes as-is; but until then, while we are stuck
with a 32-bit sector length callback, we are artificially capping the
user's request to not exceed what we can reliably expect to work through
the driver callback.

It's not much of a real loss - our interface is already "tell me the
status of this offset, and of as many subsequent offsets that you can
easily check have the same status, up to my limit, and return the number
of like-typed offsets in pnum".  The caller can't tell the difference
between a driver that can give a full answer all the way to the
requested limit, and a driver that caps all answers at a single cluster
per call (we already document that a caller must be prepared to see the
same status for two subsequent calls in a row, rather than a single call
with a pnum covering both offsets).

> 
>> +            &local_file);
>> +        if (ret < 0) {
>> +            *pnum = 0;
>> +            goto out;
>> +        }
>> +        *pnum = count * BDRV_SECTOR_SIZE;
> 
> Is it asking for trouble to be updating pnum here before we undo our
> alignment corrections? For readability reasons and preventing an
> accidental context-based oopsy-daisy.

As in, write the code to make all calculations in a temporary, and then
assign *pnum only at the end?  I suppose I can tweak the code along
those lines, but I'm not sure it will make the end result any more legible.

> 
>> +    }
>> +
>> +    /* Clamp pnum and ret to original request */
>> +    assert(QEMU_IS_ALIGNED(*pnum, align));
> 
> Oh, do we guarantee this? I guess we do..

My overriding argument here is that a driver should never expose block
status that changes mid-alignment (for drivers that support an alignment
of 1, that's trivially true; but for drivers that refuse to read or
write anything smaller than 512 bytes at a time, it also stands to
reason that the driver won't report allocation status that differs
smaller than 512 bytes at a time).

> 
>> +    *pnum -= offset - aligned_offset;
> 
> can pnum prior to adjustment ever be less than offset - aligned_offset?
> i.e., can this underflow?

No, underflow should not be possible.  The difference between offset and
aligned_offset is at most (align - 1), and we already argued that the
driver cannot be returning results that lie in the middle of 'align'.
We have the corner case of a caller requesting status on 0 bytes, but
that should already be handled at the front of the function before we
call the driver, so a driver should never be called with a limit smaller
than align, and should never return success unless pnum is at least align.

Maybe an assertion would help, though.

> 
> (Can we fail to actually inquire about the range the caller was
> interested in by aligning down too much and observing a difference in
> allocation status between the alignment pre-range and the actual range?)

Again, the argument of alignment is that we are widening from the
caller's area of interest into the granularity supported by the driver;
since the driver can't report two different block status for different
portions of the granularity, we should never be in the situation where
aligning down too much sees the wrong status of the pre-range area that
differs from the status of the area in question.

> 
>> +    if (aligned_offset >> BDRV_SECTOR_BITS != offset >> BDRV_SECTOR_BITS &&
>> +        ret & BDRV_BLOCK_OFFSET_VALID) {
>> +        ret += QEMU_ALIGN_DOWN(offset - aligned_offset, BDRV_SECTOR_SIZE);
>> +    }
> 
> Alright, and if the starting sectors are different (Wait, why is it
> sectors now instead of the requested alignment? Is this safe for all
> formats?) we adjust the return value forward a little bit to match the
> difference.

The inherent problem with the bdrv_co_get_status() interface is that we
CANNOT report finer granularity than 512-byte mapping, even though we
now take byte offset/length input (we have only 64 bits to report an
answer, and those bits must include the upper 55 bits of the offset and
a lower 9 bits that are used as flags - we don't have the luxury of
reporting a full 64-bit mapping).  The solution is that we document that
the answer is an offset modulo 512: if we return an offset, it is the
offset of the start of the sector matching the byte in question.  (If I
ask for the status of byte 2, and the answer includes offset 0x1000,
then I know that I read offset 0x1002 to get the contents of the byte I
asked about)

So that answers what happens for sub-sector results.  The question then
is what do we do with drivers that require larger-than-512 alignment,
such as block devices that require 4k-byte allocation.

Pre-patch, we don't care about the driver's granularity - our question
was always 512-byte aligned, so our answer is also, and we don't have to
fudge anything back into place here in io.c (but we DO have to do the
fudging elsewhere; for example, qcow2.c:qcow2_co_get_block_status()
calls qcow2_get_cluster_offset on a number that is rounded down to a
cluster boundary, and has to 'cluster_offset |= (index_in_cluster <<
BDRV_SECTOR_BITS)' to get back to the right sector boundary).
Post-patch, because we know obey the driver's bs->bl.request_alignment,
a driver that requires 4k questions is in the same boat where we may
have rounded down, and then have to add back to the answer to get to the
right sector (the driver's answer already means we have pnum contiguous
bytes starting at the aligned offset).


> 
>> +    if (*pnum > bytes) {
>> +        *pnum = bytes;
>>      }
> 
> Assuming this clamps the aligned_bytes range down to the bytes range, in
> case it's contiguous beyond what the caller asked for.
> 
>> -    *pnum = count * BDRV_SECTOR_SIZE;
>>
>>      if (ret & BDRV_BLOCK_RAW) {
>>          assert(ret & BDRV_BLOCK_OFFSET_VALID && local_file);
>>          ret = bdrv_co_block_status(local_file, mapping,
>> -                                   ret & BDRV_BLOCK_OFFSET_MASK,
>> +                                   (ret & BDRV_BLOCK_OFFSET_MASK) |
>> +                                   (offset & ~BDRV_BLOCK_OFFSET_MASK),
>>                                     *pnum, pnum, &local_file);
>> -        assert(QEMU_IS_ALIGNED(*pnum, BDRV_SECTOR_SIZE));
>>          goto out;
>>      }
>>
>> @@ -1860,7 +1878,8 @@ static int64_t coroutine_fn 
>> bdrv_co_block_status(BlockDriverState *bs,
>>          int64_t file_pnum;
>>
>>          ret2 = bdrv_co_block_status(local_file, mapping,
>> -                                    ret & BDRV_BLOCK_OFFSET_MASK,
>> +                                    (ret & BDRV_BLOCK_OFFSET_MASK) |
>> +                                    (offset & ~BDRV_BLOCK_OFFSET_MASK),
>>                                      *pnum, &file_pnum, NULL);
>>          if (ret2 >= 0) {
>>              /* Ignore errors.  This is just providing extra information, it
>> diff --git a/block/blkdebug.c b/block/blkdebug.c
>> index 46e53f2f09..f54fe33cae 100644
>> --- a/block/blkdebug.c
>> +++ b/block/blkdebug.c
>> @@ -628,6 +628,17 @@ static int coroutine_fn 
>> blkdebug_co_pdiscard(BlockDriverState *bs,
>>      return bdrv_co_pdiscard(bs->file->bs, offset, bytes);
>>  }
>>
>> +static int64_t coroutine_fn blkdebug_co_get_block_status(
>> +    BlockDriverState *bs, int64_t sector_num, int nb_sectors, int *pnum,
>> +    BlockDriverState **file)
>> +{
>> +    assert(QEMU_IS_ALIGNED(sector_num | nb_sectors,
>> +                           DIV_ROUND_UP(bs->bl.request_alignment,
>> +                                        BDRV_SECTOR_SIZE)));
>> +    return bdrv_co_get_block_status_from_file(bs, sector_num, nb_sectors,
>> +                                              pnum, file);
>> +}
>> +
>>  static void blkdebug_close(BlockDriverState *bs)
>>  {
>>      BDRVBlkdebugState *s = bs->opaque;
>> @@ -897,7 +908,7 @@ static BlockDriver bdrv_blkdebug = {
>>      .bdrv_co_flush_to_disk  = blkdebug_co_flush,
>>      .bdrv_co_pwrite_zeroes  = blkdebug_co_pwrite_zeroes,
>>      .bdrv_co_pdiscard       = blkdebug_co_pdiscard,
>> -    .bdrv_co_get_block_status = bdrv_co_get_block_status_from_file,
>> +    .bdrv_co_get_block_status = blkdebug_co_get_block_status,
>>
>>      .bdrv_debug_event           = blkdebug_debug_event,
>>      .bdrv_debug_breakpoint      = blkdebug_debug_breakpoint,
>>
> 
> Looks good overall but I have some comprehension issues in my own head
> about the adjustment math and why the various alignments are safe.

I may add a couple more asserts and/or comments in the next spin,
documenting what conditions hold each step along the way.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

Attachment: signature.asc
Description: OpenPGP digital signature


reply via email to

[Prev in Thread] Current Thread [Next in Thread]