qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH v6 6/7] block/nbd-client: nbd reconnect


From: Kevin Wolf
Subject: Re: [Qemu-devel] [PATCH v6 6/7] block/nbd-client: nbd reconnect
Date: Fri, 7 Jun 2019 15:26:52 +0200
User-agent: Mutt/1.11.3 (2019-02-01)

Am 07.06.2019 um 14:02 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 07.06.2019 11:06, Kevin Wolf wrote:
> > Am 07.06.2019 um 05:17 hat Eric Blake geschrieben:
> >> On 4/11/19 12:27 PM, Vladimir Sementsov-Ogievskiy wrote:
> >>> +static coroutine_fn void nbd_reconnect_loop(NBDConnection *con)
> >>> +{
> >>> +    NBDClientSession *s = nbd_get_client_session(con->bs);
> >>> +    uint64_t start_time_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
> >>> +    uint64_t delay_ns = s->reconnect_delay * 1000000000UL;
> >>
> >> Do we have a #define constant for nanoseconds in a second to make this
> >> more legible than counting '0's?
> >>
> >>> +    uint64_t timeout = 1000000000UL; /* 1 sec */
> >>> +    uint64_t max_timeout = 16000000000UL; /* 16 sec */
> >>
> >> 1 * constant, 16 * constant
> >>
> >>> +
> >>> +    nbd_reconnect_attempt(con);
> >>> +
> >>> +    while (nbd_client_connecting(s)) {
> >>> +        if (s->state == NBD_CLIENT_CONNECTING_WAIT &&
> >>> +            qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - start_time_ns > 
> >>> delay_ns)
> >>> +        {
> >>> +            s->state = NBD_CLIENT_CONNECTING_NOWAIT;
> >>> +            qemu_co_queue_restart_all(&s->free_sema);
> >>> +        }
> >>> +
> >>> +        bdrv_dec_in_flight(con->bs);
> >>> +        qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, timeout);
> >>
> >> Another place where I'd like someone more familiar with coroutines to
> >> also have a look.
> > 
> > What's the exact question you'd like me to answer?
> > 
> > But anyway, bdrv_dec/inc_in_flight() around the sleep looks wrong to me.
> > Either the operation must be waited for in drain, then you can't
> > decrease the counter even during the sleep. Or drain doesn't have to
> > consider it, then why is the counter even increased in the first place?
> > 
> > The way it currently is, drain can return assuming that there are no
> > requests, but after the timeout the request automatically comes back
> > while the drain caller assumes that there is no more activity. This
> > doesn't look right.
> 
> Hmm.
> 
> This ind/dec around all lifetime of connection coroutine you added in
> 
> commit 5ad81b4946baf948c65cf7e1436d9b74803c1280
> Author: Kevin Wolf <address@hidden>
> Date:   Fri Feb 15 16:53:51 2019 +0100
> 
>      nbd: Restrict connection_co reentrance
> 
> 
> And now I try to connect in endless loop, with qemu_co_sleep_ns() inside.
> I need to get a change to bdrv_drain to complete, so I have to sometimes
> drop this in_flight request. But I understand your point.

Ah, right, I see. I think it's fine to add a second point where we
decrease the counter (though I would add a comment telling why we do
this) as long as the right conditions are met.

The right conditions are probably something like: Once drained, we
guarantee that we don't call any callbacks until the drained section
ends. In nbd_read_eof() this is true because we can't get an answer if
we had no request running.

Or actually... This assumes a nice compliant server that doesn't just
arbitrarily send unexpected messages. Is the existing code broken if we
connect to a malicious server?

> Will the following work better?
> 
> bdrv_dec_in_flight(con->bs);
> qemu_co_sleep_ns(...);
> while (s->drained) {
>    s->wait_drained_end = true;
>    qemu_coroutine_yield();
> }
> bdrv_inc_in_flight(con->bs);
> 
> ...
> .drained_begin() {
>     s->drained = true;
> }
> 
> .drained_end() {
>     if (s->wait_drained_end) {
>        s->wait_drained_end = false;
>        aio_co_wake(s->connection_co);
>     }
> }

This should make sure that we don't call any callbacks before the drain
section has ended.

We probably still have a problem because nbd_client_attach_aio_context()
reenters the coroutine with qemu_aio_coroutine_enter(), which will cause
an assertion failure if co->scheduled is set. So this needs to use a
version that can cancel a sleep.

I see that your patch currently simply ignores attaching a new context,
but then the coroutine stays in the old AioContext. Did I miss some
additional code that moves it to the new context somehow or will it just
stay where it was if the coroutine happens to be reconnecting when the
AioContext was supposed to change?

Kevin



reply via email to

[Prev in Thread] Current Thread [Next in Thread]