qemu-block
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Outreachy project task: Adding QEMU block layer APIs resembling Linu


From: Sam Li
Subject: Re: Outreachy project task: Adding QEMU block layer APIs resembling Linux ZBD ioctls.
Date: Thu, 2 Jun 2022 18:28:45 +0800

Stefan Hajnoczi <stefanha@gmail.com> 于2022年6月2日周四 16:05写道:
>
> On Thu, 2 Jun 2022 at 06:43, Sam Li <faithilikerun@gmail.com> wrote:
> >
> > Hi Stefan,
> >
> > Stefan Hajnoczi <stefanha@gmail.com> 于2022年6月1日周三 19:43写道:
> > >
> > > On Wed, 1 Jun 2022 at 06:47, Damien Le Moal
> > > <damien.lemoal@opensource.wdc.com> wrote:
> > > >
> > > > On 6/1/22 11:57, Sam Li wrote:
> > > > > Hi Stefan,
> > > > >
> > > > > Stefan Hajnoczi <stefanha@gmail.com> 于2022年5月30日周一 19:19写道:
> > > > >
> > > > >
> > > > >>
> > > > >> On Mon, 30 May 2022 at 06:09, Sam Li <faithilikerun@gmail.com> wrote:
> > > > >>>
> > > > >>> Hi everyone,
> > > > >>> I'm Sam Li, working on the Outreachy project which is to add zoned
> > > > >>> device support to QEMU's virtio-blk emulation.
> > > > >>>
> > > > >>> For the first goal, adding QEMU block layer APIs resembling Linux 
> > > > >>> ZBD
> > > > >>> ioctls, I think the naive approach would be to introduce a new 
> > > > >>> stable
> > > > >>> struct zbd_zone descriptor for the library function interface. More
> > > > >>> specifically, what I'd like to add to the BlockDriver struct are:
> > > > >>> 1. zbd_info as zone block device information: includes numbers of
> > > > >>> zones, size of logical blocks, and physical blocks.
> > > > >>> 2. zbd_zone_type and zbd_zone_state
> > > > >>> 3. zbd_dev_model: host-managed zbd, host-aware zbd
> > > > >>> With those basic structs, we can start to implement new functions as
> > > > >>> bdrv*() APIs for BLOCK*ZONE ioctls.
> > > > >>>
> > > > >>> I'll start to finish this task based on the above description. If
> > > > >>> there is any problem or something I may miss in the design, please 
> > > > >>> let
> > > > >>> me know.
> > > > >>
> > > > >> Hi Sam,
> > > > >> Can you propose function prototypes for the new BlockDriver callbacks
> > > > >> needed for zoned devices?
> > > > >
> > > > > I have made some modifications based on Damien's device in design part
> > > > > 1 and added the function prototypes in design part 2. If there is any
> > > > > problem or part I missed, please let me know.
> > > > >
> > > > > Design of Block Layer APIs in BlockDriver:
> > > > > 1. introduce a new stable struct zbd_zone descriptor for the library
> > > > > function interface.
> > > > >   a. zbd_info as zone block device information: includes numbers of
> > > > > zones, size of blocks, write granularity in byte(minimal write size
> > > > > and alignment
> > > > >     - write granularity: 512e SMRs: writes in units of physical block
> > > > > size, 4096 bytes; NVMe ZNS write granularity is equal to the block
> > > > > size.
> > > > >     - zone descriptor: start, length, capacity, write pointer, zone 
> > > > > type
> > > > >   b. zbd_zone_type
> > > > >     - zone type: conventional, sequential write required, sequential
> > > > > write preferred
> > > > >   c. zbd_dev_model: host-managed zbd, host-aware zbd
> > > >
> > > > This explanation is a little hard to understand. It seems to be mixing 
> > > > up
> > > > device level information and per-zone information. I think it would be a
> > > > lot simpler to write a struct definition to directly illustrate what you
> > > > are planning.
> > > >
> > > > It is something like this ?
> > > >
> > > > struct zbd_zone {
> > > >         enum zone_type  type;
> > > >         enum zone_cond  cond;
> > > >         uint64_t        start;
> > > >         uint32_t        length;
> > > >         uint32_t        cap;
> > > >         uint64_t        wp;
> > > > };
> > > >
> > > > strcut zbd_dev {
> > > >         enum zone_model model;
> > > >         uint32_t        block_size;
> > > >         uint32_t        write_granularity;
> > > >         uint32_t        nr_zones
> > > >         struct zbd_zone *zones; /* array of zones */
> > > > };
> > > >
> > > > If yes, then my comments are as follows.
> > > >
> > > > For the device struct: It may be good to have also the maximum number of
> > > > open zones and the maximum number of active zones.
> > > >
> > > > For the zone struct: You may need to add a read-write lock per zone to 
> > > > be
> > > > able to write lock zones to ensure a sequential write pattern (virtio
> > > > devices can be multi-queue and so writes may be coming in from different
> > > > contexts) and to correctly emulate zone append operations with an atomic
> > > > update of the wp field.
> > > >
> > > > These need to be integrated into the generic block driver interface in
> > > > include/block/block_int-common.h or include/block/block-common.h.
> > >
> > > QEMU's block layer has a few ways of exposing information about block 
> > > devices:
> > >
> > >     int (*bdrv_get_info)(BlockDriverState *bs, BlockDriverInfo *bdi);
> > >     ImageInfoSpecific *(*bdrv_get_specific_info)(BlockDriverState *bs,
> > > Error **errp);
> > >
> > > These fetch information from the BlockDriver and are good when a small
> > > amount of data is reported occassionally and consumed by the caller.
> > >
> > > For data that is continuously accessed or that could be large, it may
> > > be necessary for the data to reside inside BlockDriverState so that it
> > > can be accessed in place (without copying):
> > >
> > >     void (*bdrv_refresh_limits)(BlockDriverState *bs, Error **errp);
> > >
> > > QEMU uses this for the BlockLimits struct (BlockDriverState::bl) that
> > > is continuously accessed by the block layer while processing I/O
> > > requests. The "refresh" function updates the data in case the
> > > underlying storage device has changed somehow. If no update function
> > > is necessary then data can simply be populated during .bdrv_open() and
> > > no new BlockDriver callback needs to be added.
> > >
> > > So in the simplest case BlockDriverState can be extended with a struct
> > > zbd_dev field that is populated during .bdrv_open(). If the
> > > BlockDriver doesn't support zones then the zbd_dev.nr_zones field is 0
> > > or the model field indicates that this is not a zoned storage device.
> > >
> > > However, a BlockBackend (not BlockDriverState!) API will be needed to
> > > expose this data to users like the hw/block/virtio-blk.c emulation
> > > code or the qemu-io-cmds.c utility that can be used for testing. A
> > > BlockBackend has a root pointer to a BlockDriverState graph (for
> > > example, qcow2 on top of file-posix). It will be necessary to
> > > propagate zoned storage information from the leaf BlockDriverState all
> > > the way up to the BlockBackend. In simple cases the BB root points
> > > directly to the file-posix BDS that has Linux ZBD support but the
> > > design needs to account for additional BDS graph nodes.
> >
> > I think a simple way to think BlockBackend APIs is to use following
> > callbacks: blk_aio_zone_mgmt() -> blk_aio_prwv() +
> > blk_aio_zone_mgmt_entry() -> blk_co_do_zone_mgmt() -> blk_zone_mgmt().
> > The last function call will call bdrv_co_zone_mgmt() in
> > block/file-posix.c. If I understand the additional case correctly, the
> > BlockBackend API can expose the zone information to the virtio-blk
> > emulation now.
>
> Yes!
>
> block/raw-format.c also needs to implement .bdrv_co_zone_mgmt() by
> calling bdrv_co_zone_mgmt(bs->file, ...). This is because the
> raw-format.c driver usually sits on top of file-posix.c and has to
> pass through requests.
>
> There are filter block drivers like block/throttle.c,
> block/blkdebug.c, etc (git grep is_filter block/) that will also need
> to be modified to pass through requests in the same way.
>

Are the filter block drivers also on top of file-posix.c but below
block-backend.c? I read that the filter block drivers, and formats are
designed to be manageable pieces so as to make block device
configuration easier and clearer.

> Based on what I've read in Dmitry's virtio-blk spec proposal, the
> BlockBackend API could look something like:
>
> typedef struct { ...model, zone_sectors, max_open_zones, etc... } 
> BlockZoneInfo;
> void blk_get_zone_info(BlockBackend *blk, BlockZoneInfo *info);
>
> virtio-blk.c calls this to fill out configuration space fields and
> determine whether the BlockBackend is a zoned device.
>
> Then there are 3 commands that happen in the I/O code path:
>
> typedef struct { ... } BlockZoneDescriptor;
> BlockAIOCB *blk_aio_zone_report(BlockBackend *blk, int64_t offset,
> BlockZoneDescriptor *zones, size_t max_zones, BlockCompletionFunc *cb,
> void *opaque);
>
> typedef enum { ... } BlockZoneMgmtCmd;
> BlockAIOCB *blk_aio_zone_mgmt_send(BlockBackend *blk, int64_t offset,
> BlockZoneMgmtCmd cmd, bool all, BlockCompletionFunc *cb, void
> *opaque);
>
> typedef void BlockZoneAppendCompletionFunc(void *opaque, int ret,
> int64_t new_wp);
> BlockAIOCB *blk_aio_zone_append(BlockBackend *blk, int64_t offset,
> QEMUIOVector *qiov, BlockZoneAppendCompletionFunc *cb, void *opaque);
>
> > Besides, comparing blk_aio_flush() with blk_flush() in block-backend.c
> > which lead to include/block/block-io.h, we may need consider the case
> > when calling block layer API from non-coroutine context. Meanwhile,
> > using bdrv_co_writev()/bdrv_co_readv() instead of read-write lock per
> > zone may be a option too.
>
> Yes, device emulation code usually uses the aio versions of the
> BlockBackend I/O functions (read, write, flush). The QEMU block layer
> runs the aio I/O request inside a coroutine and usually also exposes
> coroutine versions of the same functions. For example, block jobs
> (e.g. storage mirroring, backup, and migration background tasks)
> usually call the coroutine versions of the BlockBackend APIs instead
> of the aio ones.
>
> qemu-io-cmds.c will want synchronous versions of the aio commands
> (blk_zone_report(), blk_zone_mgmt_send(), blk_zone_append()) that
> block until the command completes. This is because the qemu-io utility
> typically executes one command at a time and it's written mostly in
> blocking style rather than async callbacks or coroutines.
> docs/devel/block-coroutine-wrapper.rst describes how to generate
> synchronous versions of coroutine functions.
>
> Do you want to start implementing blk_get_zone_info()? This will
> require blk_*(), bdrv_*(), and BlockDriver (block/file-posix.c)
> functions.

I want to implement the smallest part that can be tested first and
then move on to the next part. And I want to test zone report
operation first. Does the qemu io-test require the following part to
work: bdrv_co_zone_report in file-pisix.c, blk_get_zone_info() in
block-backend.c, blk_aio_zone_report() in io code path and modify some
test in test/qemu-iotests? If it does, then yes.

Have a nice day!

Sam
>
> Stefan



reply via email to

[Prev in Thread] Current Thread [Next in Thread]