Re: [Qemu-devel] [PATCH 09/19] Introduce event-tap.


From: Yoshiaki Tamura
Subject: Re: [Qemu-devel] [PATCH 09/19] Introduce event-tap.
Date: Thu, 20 Jan 2011 22:50:23 +0900

2011/1/20 Kevin Wolf <address@hidden>:
> Am 20.01.2011 11:39, schrieb Yoshiaki Tamura:
>> 2011/1/20 Kevin Wolf <address@hidden>:
>>> Am 20.01.2011 06:19, schrieb Yoshiaki Tamura:
>>>>>>>> +        return;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +        bdrv_aio_writev(bs, blk_req->reqs[0].sector, blk_req->reqs[0].qiov,
>>>>>>>> +                    blk_req->reqs[0].nb_sectors, blk_req->reqs[0].cb,
>>>>>>>> +                    blk_req->reqs[0].opaque);
>>>>>>>
>>>>>>> Same here.
>>>>>>>
>>>>>>>> +    bdrv_flush(bs);
>>>>>>>
>>>>>>> This looks really strange. What is this supposed to do?
>>>>>>>
>>>>>>> One point is that you write it immediately after bdrv_aio_write, so you
>>>>>>> get an fsync for which you don't know if it includes the current write
>>>>>>> request or if it doesn't. Which data do you want to get flushed to the 
>>>>>>> disk?
>>>>>>
>>>>>> I was expecting to flush the aio request that was just initiated.
>>>>>> Am I misunderstanding the function?
>>>>>
>>>>> Seems so. The function names don't use really clear terminology either,
>>>>> so you're not the first one to fall in this trap. Basically we have:
>>>>>
>>>>> * qemu_aio_flush() waits for all AIO requests to complete. I think you
>>>>> wanted to have exactly this, but only for a single block device. Such a
>>>>> function doesn't exist yet.
>>>>>
>>>>> * bdrv_flush() makes sure that all successfully completed requests are
>>>>> written to disk (by calling fsync)
>>>>>
>>>>> * bdrv_aio_flush() is the asynchronous version of bdrv_flush, i.e. run
>>>>> the fsync in the thread pool
>>>>
>>>> Then what I wanted to do is call qemu_aio_flush first, then
>>>> bdrv_flush.  It should be like live migration.
>>>
>>> Okay, that makes sense. :-)
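
In other words, the sequence being agreed on here, as a minimal sketch
against the current block-layer calls (the helper name is made up for
illustration, it is not an existing function):

/* Sketch only, not part of the patch: drain in-flight AIO, then force
 * completed writes to disk. */
static void drain_and_flush(BlockDriverState *bs)
{
    qemu_aio_flush();   /* wait for all pending AIO requests to complete
                           (global, not per-device) */
    bdrv_flush(bs);     /* fsync: completed requests reach the disk */
}
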
>>>
>>>>>>> The other thing is that you introduce a bdrv_flush for each request,
>>>>>>> basically forcing everyone to something very similar to writethrough
>>>>>>> mode. I'm sure this will have a big impact on performance.
>>>>>>
>>>>>> The reason is to avoid inversion of queued requests.  Although
>>>>>> processing one-by-one is heavy, wouldn't having requests flushed
>>>>>> to disk out of order break the disk image?
>>>>>
>>>>> No, that's fine. If a guest issues two requests at the same time, they
>>>>> may complete in any order. You just need to make sure that you don't
>>>>> call the completion callback before the request really has completed.
>>>>
>>>> We need to flush requests, meaning aio and fsync, before sending
>>>> the final state of the guests, to make sure we can switch to the
>>>> secondary safely.
>>>
>>> In theory I think you could just re-submit the requests on the secondary
>>> if they had not completed yet.
>>>
>>> But you're right, let's keep things simple for the start.
>>>
>>>>> I'm just starting to wonder if the guest won't timeout the requests if
>>>>> they are queued for too long. Even more, with IDE, it can only handle
>>>>> one request at a time, so not completing requests doesn't sound like a
>>>>> good idea at all. In what intervals is the event-tap queue flushed?
>>>>
>>>> The requests are flushed once each transaction completes.  So
>>>> it's not at specific intervals.
>>>
>>> Right. So when is a transaction completed? This is the time that a
>>> single request will take.
>>
>> The transaction is completed when the vm state is sent to the
>> secondary, and the primary receives the ack to it.  Please let me
>> know if the answer is too vague.  What I can tell is that it
>> can't be super fast.
>>
>>>>> On the other hand, if you complete before actually writing out, you
>>>>> don't get timeouts, but you signal success to the guest when the request
>>>>> could still fail. What would you do in this case? With a writeback cache
>>>>> mode we're fine, we can just fail the next flush (until then nothing is
>>>>> guaranteed to be on disk and order doesn't matter either), but with
>>>>> cache=writethrough we're in serious trouble.
>>>>>
>>>>> Have you thought about this problem? Maybe we end up having to flush the
>>>>> event-tap queue for each single write in writethrough mode.
>>>>
>>>> Yes, and that's what I'm trying to do at this point.
>>>
>>> Oh, I must have missed that code. Which patch/function should I look at?
>>
>> Maybe I mis-answered your question.  The device may receive
>> timeouts.
>
> We should pay attention that the guest does not see timeouts. I'm not
> expecting that I/O will be super fast, and as long as it is only a
> performance problem we can live with it.
>
> However, as soon as the guest gets timeouts it reports I/O errors and
> eventually offlines the block device. At this point it's not a
> performance problem any more, but also a correctness problem.
>
> This is why I suggested that we flush the event-tap queue (i.e. complete
> the transaction) immediately after an I/O request has been issued
> instead of waiting for other events that would complete the transaction.

Right.  event-tap doesn't queue requests at a specific interval.  It
schedules the transaction as a bh as soon as events are tapped.  The
purpose of the queue is to store requests initiated while the
transaction is in progress.  So I believe the current implementation
should be doing what you're expecting.  However, if the guest dirtied a
huge amount of RAM and then initiated block requests, we may get
timeouts even though we started the transaction right away.

Yoshi

>> If timeouts didn't happen, the requests are flushed
>> one-by-one in writethrough because we're calling qemu_aio_flush
>> and bdrv_flush together.
>
> I think this is what we must do.
>
> Kevin
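
Applied per request, as discussed for cache=writethrough, that would
look something like this (again a sketch; only the bdrv_aio_writev
arguments mirror the quoted hunk, everything else is illustrative):

/* In writethrough mode, flush each tapped write before signalling
 * completion to the guest. */
static void event_tap_replay_one(BlockDriverState *bs,
                                 EventTapBlkReq *blk_req)
{
    bdrv_aio_writev(bs, blk_req->reqs[0].sector, blk_req->reqs[0].qiov,
                    blk_req->reqs[0].nb_sectors, blk_req->reqs[0].cb,
                    blk_req->reqs[0].opaque);

    qemu_aio_flush();   /* wait for the write just submitted (and any
                           other in-flight AIO) to finish */
    bdrv_flush(bs);     /* then make sure it is on disk */
}
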


