
Re: [Qemu-block] RFC: some problems with bdrv_get_block_status


From: Denis V. Lunev
Subject: Re: [Qemu-block] RFC: some problems with bdrv_get_block_status
Date: Fri, 28 Apr 2017 22:38:06 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0

On 04/28/2017 09:31 PM, Eric Blake wrote:
> On 04/28/2017 11:17 AM, Denis V. Lunev wrote:
>> Hello, All!
>>
>> Recently we have experienced problems with a very slow
>> bdrv_get_block_status call, which is called massively, e.g.
>> from mirror_run().
>>
>> The problem was narrowed down to slow lseek64() system
>> call, which can take 1-2-3 seconds.
> I'm guessing you meant one-to-three (the range), not one-two-three
> (three separate digits), and just had an unfortunate abbreviation of
> 'to' turning into the phonetically-similar '2'.
First of all, Eric, thank you very much for the prompt answer.

Yes, I meant 1 to 3 seconds. Thank you for clarifying that.

>> address@hidden ~]# strace -f -p 77048 -T  -e lseek
>> Process 77048 attached with 14 threads
>> [pid 77053] lseek(23, 6359613440, SEEK_DATA) = 6360276992 <1.250323>
> That sounds like a bug in your choice of filesystem.  It's been
> mentioned before that lseek has some pathetically poor behavior (I think
> tmpfs was one of the culprits), but I maintain that it is better to
> hammer on the kernel folks to fix the poor behavior than it is to have
> to implement user-space workarounds in every single affected program.
>
> That said, a workaround of being able to request the avoidance of lseek
> has been brought up on this list before.
I have found that letter (Feb 2017). The problem there was with ZFS.
I have the problem with EXT4. It does not show up immediately,
but after some time under load.

This call will also be costly in the case of NFSv4, which supports
SEEK_DATA and SEEK_HOLE, as it requires an additional network
round trip. Thus this is not cheap in at least some cases, and it
is dangerous to keep following the current approach. Maybe other
filesystems are also affected.

>
>> The problem comes from this branch of the code
>>
>> bdrv_co_get_block_status
>>     .......
>>     if (bs->file &&
>>         (ret & BDRV_BLOCK_DATA) && !(ret & BDRV_BLOCK_ZERO) &&
>>         (ret & BDRV_BLOCK_OFFSET_VALID)) {
>>         int file_pnum;
>>
>>         ret2 = bdrv_co_get_block_status(bs->file, ret >> BDRV_SECTOR_BITS,
>>                                         *pnum, &file_pnum);
>>         if (ret2 >= 0) {
>>             /* Ignore errors.  This is just providing extra information, it
>>              * is useful but not necessary.
>>              */
>>             if (!file_pnum) {
>>                 /* !file_pnum indicates an offset at or beyond the EOF; it is
>>                  * perfectly valid for the format block driver to point to
>>                  * such offsets, so catch it and mark everything as zero */
>>                 ret |= BDRV_BLOCK_ZERO;
>>             } else {
>>                 /* Limit request to the range reported by the protocol
>>                  * driver */
>>                 *pnum = file_pnum;
>>                 ret |= (ret2 & BDRV_BLOCK_ZERO);
>>             }
>>         }
>>     }
> I don't see anything wrong with this code. It's nice to know that we
> have data (qcow2 says we have allocated bytes due to this layer, so
> don't refer to the backing files), but when the underlying file can ALSO
> tell us that the underlying protocol is sparse and we are on a hole,
> then we know that we have BDRV_BLOCK_ZERO condition which can in turn
> enable other optimizations in quite a bit of the stack.  It IS valid to
> return BDRV_BLOCK_DATA | BDRV_BLOCK_ZERO (we don't have to limit
> ourselves to BDRV_BLOCK_DATA), and we should probably do so any time
> that such computation is cheap.
>
> I agree with your analysis that on a poor filesystem where lseek()
> behavior sucks that it is no longer cheap to determine where the holes are.
>
> But the proper workaround for those filesystems should NOT be made by
> undoing this part of bdrv_co_get_block_status(), but should rather be
> fixed in file-posix.c by the addition of a user-controllable knob on
> whether to skip lseek().  In other words, if we're going to work around
> the poor filesystem performance, the workaround should live in
> file-posix.c, not in the generic block/io.c.  Once the workaround is
> enabled, the 'ret2 = bdrv_co_get_block_status()' nested call should be
> lightning fast (because file-posix.c would no longer be using lseek when
> you've requested the workaround).
I understand that it is very good to know, but what benefit will we
really get:
- backup skips a cluster only if the cluster is all zeroes
- mirror skips a cluster only if it is all zeroes
- the same applies to 'qemu-img convert'

I think we should somehow measure the benefit against the loss.
We have at least two (probably three) different filesystems which
do not work well with this.

I have seen the suggestion to try disabling 'detect-zeroes', and
that looks like a worthy option to gate the original behavior on,
i.e. we could perform lseek(SEEK_DATA) only if the 'detect-zeroes'
option is on. That matches the meaning of the option and saves
the flag namespace a bit.

Another option would be to split this API into one rough call that
works with whole blocks and one that provides full details as now.

>> Frankly speaking, this optimization should not give much.
>> If the upper layer format (QCOW2) says that we have data
>> here, then nowadays in 99.9% of cases we do have the data.
> You are correct that we have BDRV_BLOCK_DATA; but it IS useful to ALSO
> know if that data reads back as all zeros (so we can skip reading it).
> Just because qcow2 reports something as data does NOT rule out whether
> the data still reads as zeros.
>
>> Meanwhile this branch can cause problems. We would need the
>> block to be cleared entirely to get the benefit in most cases
>> of mirror and backup operations.
>>
>> In my opinion it is worth dropping this entirely.
>>
>> Guys, do you have any opinion?
> Again, my opinion is to not change this part of block/io.c.  Rather,
> work should be expended on individual protocol drivers to quit wasting
> unnecessary time on reporting BDRV_BLOCK_ZERO if the time spent to
> determine that is not worth the effort (as it is when using lseek() on
> inefficient filesystems).  It is always safe for a protocol driver to
> report BDRV_BLOCK_DATA instead of BDRV_BLOCK_ZERO, if asking the
> question is too expensive (treating holes as data may pessimize other
> operations, but that's okay if the penalty for asking is worse than what
> optimizations are able to regain).
>
>> Den
>>
>> P.S. The kernel is one based on RedHat 3.10.0-514. The same
>>       problem was observed in 3.10.0-327 too.
> Red Hat is always two words (I'm allowed to point that out, as they
> employ me ;).  But if it really is a Red Hat kernel problem, be sure to
> use your support contract to point out to the kernel developers that they
> really need to fix lseek() - and you'll need to give details on which
> filesystem you're using (as not all filesystems have that abysmal
> performance).
>
OK, I will file the Bugzilla for this case, with the possible
workaround we are working on right now. No problem. I am just
writing this because, as of now, this knowledge (about zeroes/not
zeroes) is not really used much.

Den


