[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH v2 1/2] block: zero data data corruption using prealloc-filte
From: |
Andrey Drobyshev |
Subject: |
Re: [PATCH v2 1/2] block: zero data data corruption using prealloc-filter |
Date: |
Tue, 16 Jul 2024 17:27:48 +0300 |
User-agent: |
Mozilla Thunderbird |
On 7/16/24 4:32 PM, Denis V. Lunev wrote:
> On 7/12/24 13:55, Vladimir Sementsov-Ogievskiy wrote:
>> On 12.07.24 12:46, Andrey Drobyshev wrote:
>>> From: "Denis V. Lunev" <den@openvz.org>
>>>
>>> We have observed that some clusters in the QCOW2 files are zeroed
>>> while preallocation filter is used.
>>>
>>> We are able to trace down the following sequence when prealloc-filter
>>> is used:
>>> co=0x55e7cbed7680 qcow2_co_pwritev_task()
>>> co=0x55e7cbed7680 preallocate_co_pwritev_part()
>>> co=0x55e7cbed7680 handle_write()
>>> co=0x55e7cbed7680 bdrv_co_do_pwrite_zeroes()
>>> co=0x55e7cbed7680 raw_do_pwrite_zeroes()
>>> co=0x7f9edb7fe500 do_fallocate()
>>>
>>> Here coroutine 0x55e7cbed7680 is being blocked waiting while coroutine
>>> 0x7f9edb7fe500 will finish with fallocate of the file area. OK. It is
>>> time to handle next coroutine, which
>>> co=0x55e7cbee91b0 qcow2_co_pwritev_task()
>>> co=0x55e7cbee91b0 preallocate_co_pwritev_part()
>>> co=0x55e7cbee91b0 handle_write()
>>> co=0x55e7cbee91b0 bdrv_co_do_pwrite_zeroes()
>>> co=0x55e7cbee91b0 raw_do_pwrite_zeroes()
>>> co=0x7f9edb7deb00 do_fallocate()
>>>
>>> The trouble comes here. Coroutine 0x55e7cbed7680 has not advanced
>>> file_end yet and coroutine 0x55e7cbee91b0 will start fallocate() for
>>> the same area. This means that if (once fallocate is started inside
>>> 0x7f9edb7deb00) original fallocate could end and the real write will
>>> be executed. In that case write() request is handled at the same time
>>> as fallocate().
>>>
>>> The patch moves s->file_lock assignment before fallocate and that is
>>
>> text need to be updated
>>
>>> crucial. The idea is that all subsequent requests into the area
>>> being preallocation will be issued as just writes without fallocate
>>> to this area and they will not proceed thanks to overlapping
>>> requests mechanics. If preallocation will fail, we will just switch
>>> to the normal expand-by-write behavior and that is not a problem
>>> except performance.
>>>
>>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>>> Tested-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
>>> ---
>>> block/preallocate.c | 8 +++++++-
>>> 1 file changed, 7 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/block/preallocate.c b/block/preallocate.c
>>> index d215bc5d6d..ecf0aa4baa 100644
>>> --- a/block/preallocate.c
>>> +++ b/block/preallocate.c
>>> @@ -383,6 +383,13 @@ handle_write(BlockDriverState *bs, int64_t
>>> offset, int64_t bytes,
>>> want_merge_zero = want_merge_zero && (prealloc_start <= offset);
>>> + /*
>>> + * Assign file_end before making actual preallocation. This will
>>> ensure
>>> + * that next request performed while preallocation is in
>>> progress will
>>> + * be passed without preallocation.
>>> + */
>>> + s->file_end = prealloc_end;
>>> +
>>> ret = bdrv_co_pwrite_zeroes(
>>> bs->file, prealloc_start, prealloc_end - prealloc_start,
>>> BDRV_REQ_NO_FALLBACK | BDRV_REQ_SERIALISING |
>>> BDRV_REQ_NO_WAIT);
>>> @@ -391,7 +398,6 @@ handle_write(BlockDriverState *bs, int64_t
>>> offset, int64_t bytes,
>>> return false;
>>> }
>>> - s->file_end = prealloc_end;
>>> return want_merge_zero;
>>> }
>>
>>
>> Hmm. But this way we set both s->file_end and s->zero_start prior to
>> actual write_zero operation. This means that next write-zero operation
>> may go fast-path (see preallocate_co_pwrite_zeroes()) and return
>> success, even before actual finish of preallocation write_zeroes
>> operation (which may also fail). Seems we need to update logic around
>> s->zero_start too.
>>
> Yes. This is not a problem at all. We go fast path and this new
> fast-pathed write request will stuck on overlapped request check.
> This if fine on success path.
>
> But error path is a trickier question.
>
> iris ~/src/qemu $ cat 1.c
> #include <stdio.h>
> #include <unistd.h>
> #include <string.h>
> #include <fcntl.h>
>
> int main()
> {
> int fd = open("file", O_RDWR | O_CREAT);
> char buf[4096];
>
> memset(buf, 'a', sizeof(buf));
> pwrite(fd, buf, sizeof(buf), 4096);
>
> return 0;
> }
> iris ~/src/qemu $
>
> This works just fine, thus error path would be also fine.
>
> Den
This would also be relatively easy to check by modifying our original
test case in 298 so that half of aio_write requests write actual data,
and other half cause write_zeroes operations. It doesn't seem to fail
with this v2 patch.
I'll modify the testcase accordingly in v3.