Re: [PATCH v7 3/4] qcow2: add zoned emulation capability
From: Sam Li
Subject: Re: [PATCH v7 3/4] qcow2: add zoned emulation capability
Date: Mon, 23 Sep 2024 15:40:54 +0200
Hi Damien,
Damien Le Moal <dlemoal@kernel.org> wrote on Mon, Sep 23, 2024 at 15:22:
>
> On 2024/09/23 13:06, Sam Li wrote:
>
> [...]
>
> >>> @@ -2837,6 +3180,19 @@ qcow2_co_pwritev_part(BlockDriverState *bs, int64_t offset, int64_t bytes,
> >>> qiov_offset += cur_bytes;
> >>> trace_qcow2_writev_done_part(qemu_coroutine_self(), cur_bytes);
> >>> }
> >>> +
> >>> + if (bs->bl.zoned == BLK_Z_HM) {
> >>> + index = start_offset / zone_size;
> >>> + wp = &bs->wps->wp[index];
> >>> + if (!QCOW2_ZT_IS_CONV(*wp)) {
> >>> + /* Advance the write pointer when the write completes */
> >>
> >> Updating the write pointer after I/O does not prevent other write
> >> requests from beginning at the same offset as this request. Multiple
> >> write request coroutines can run concurrently and only the first one
> >> should succeed. The others should fail if they are using the same
> >> offset.
> >>
> >> The comment above says "Real drives change states before it can write to
> >> the zone" and I think it's appropriate to update the write pointer
> >> before performing the write too. The qcow2 zone emulation code is
> >> different from the file-posix.c passthrough code. We are responsible for
> >> maintaining zoned metadata state and cannot wait for the result of the
> >> I/O to tell us what happened.
>
> Yes, correct. The wp MUST be updated when issuing the IO, with the assumption
> that the write IO will succeed (errors are rare!).
>
> > The problem with updating the write pointer before IO completion is the
> > failure case. It cannot be predicted in advance whether an IO will fail.
> > When a write I/O fails, the wp should not be updated.
>
> Correct, if an IO fails, the wp should not be updated. However, that is not
> difficult to deal with:
> 1) under the zone lock, advance the wp position when issuing the write IO
> 2) When the write IO completes with success, nothing else needs to be done.
> 3) When *any* write IO completes with error you need to:
> - Lock the zone
> - Do a report zone for the target zone of the failed write to get the
>   current wp location
> - Update bs->wps->wp[index] using that current wp location
> - Unlock the zone
>
> With that, one may get a few errors if multiple async writes are being issued,
> but that behavior is consistent with the same happening with a real drive. So
> no issue. And since the report zones gets you the current wp location, the
> user can restart writing from that location once it has dealt with all the
> previous write failures.
I see. To allow concurrent writes, the lock will then only be taken briefly
when advancing the wp at submission, and on the failure path while processing
append writes.
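
Something like this, I suppose -- a quick untested sketch just to check my
understanding of the flow. The Zone struct, backend_write() and the pthread
lock below are made-up stand-ins for illustration, not the real qcow2/QEMU
structures:

#include <inttypes.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical emulated zone: a cached write pointer plus a lock. */
typedef struct {
    pthread_mutex_t lock;
    uint64_t wp;           /* cached wp, advanced when a write is issued  */
    uint64_t backend_wp;   /* "real" wp, only moves when a write succeeds */
} Zone;

/* Pretend backend write: succeed while the zone has room, else fail. */
static int backend_write(Zone *z, uint64_t offset, uint64_t len)
{
    if (offset + len > (1u << 20)) {
        return -1;                 /* models a write error */
    }
    z->backend_wp = offset + len;  /* the data made it to "disk" */
    return 0;
}

/* Steps 1)-3) above: advance the wp at submission, repair it on error. */
static int zoned_write(Zone *z, uint64_t len)
{
    uint64_t offset;
    int ret;

    pthread_mutex_lock(&z->lock);      /* 1) lock, take offset, advance wp */
    offset = z->wp;
    z->wp += len;                      /*    assume the write will succeed */
    pthread_mutex_unlock(&z->lock);

    ret = backend_write(z, offset, len);

    if (ret < 0) {                     /* 3) failure: zone report + fix wp */
        pthread_mutex_lock(&z->lock);
        z->wp = z->backend_wp;         /*    stand-in for a zone report    */
        pthread_mutex_unlock(&z->lock);
    }
    return ret;                        /* 2) success: nothing else to do   */
}

int main(void)
{
    static Zone z = { .lock = PTHREAD_MUTEX_INITIALIZER };
    int ret;

    ret = zoned_write(&z, 1u << 19);
    printf("first write:  %d, wp=%" PRIu64 "\n", ret, z.wp);

    ret = zoned_write(&z, 1u << 20);   /* fails, wp rolls back to 1 << 19 */
    printf("second write: %d, wp=%" PRIu64 "\n", ret, z.wp);
    return 0;
}
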
>
> > The alternative way is to hold the wps lock as is also required for wp
> > accessing. Therefore only one of multiple concurrent write requests
> > will succeed.
>
> That is a very simple solution that avoids the above error recovery, but that
> would be very bad for performance (especially for a pure sequential write
> workload, as we would limit IOs to queue depth 1). So if we can avoid this
> simple approach, that would be a lot better.
Yeah, I'll drop this approach. It reminds me of how the file-posix driver
emulates zone_append, though: it holds the lock whenever accessing wps. Does
that limit IOs to QD 1 too? If so, it can be improved.
-- one zone_append starts
   wp_lock()
     IO processing
     wp_update
   wp_unlock()
-- ends
https://github.com/qemu/qemu/blob/master/block/file-posix.c#L2492
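
If so, maybe the lock scope can be narrowed to just the wp reservation,
roughly like below (again an untested sketch reusing the made-up
Zone/backend_write stand-ins from the sketch above, not the actual file-posix
code). The append offset is decided under the lock, the I/O itself runs
unlocked, and the error path falls back to a zone report as above:

/* Hypothetical append: reserve the offset under the lock, do the I/O
 * unlocked, and tell the caller where the data landed. Several appends
 * can then be in flight at once instead of being serialized for the
 * whole duration of the I/O (queue depth 1). */
static int zoned_append(Zone *z, uint64_t len, uint64_t *appended_at)
{
    uint64_t offset;
    int ret;

    pthread_mutex_lock(&z->lock);
    offset = z->wp;                /* the append lands at the current wp */
    z->wp += len;                  /* reserve the range before the I/O   */
    pthread_mutex_unlock(&z->lock);

    ret = backend_write(z, offset, len);
    if (ret < 0) {
        pthread_mutex_lock(&z->lock);
        z->wp = z->backend_wp;     /* stand-in for a fresh zone report */
        pthread_mutex_unlock(&z->lock);
        return ret;
    }

    *appended_at = offset;         /* zone append reports the write offset */
    return 0;
}
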
Sam
>
>
> --
> Damien Le Moal
> Western Digital Research