qemu-block
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-block] [PATCH 07/18] nbd: Minimal structured read for client


From: Eric Blake
Subject: Re: [Qemu-block] [PATCH 07/18] nbd: Minimal structured read for client
Date: Tue, 7 Feb 2017 14:14:08 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.7.0

On 02/03/2017 09:47 AM, Vladimir Sementsov-Ogievskiy wrote:
> Minimal implementation: always send DF flag, to not deal with fragmented
> replies.

This works well with your minimal server implementation, but I worry
that it will cause us to fall over when talking to a fully-compliant
server that chooses to send EOVERFLOW errors for any request larger than
64k when DF is set; it also makes it impossible to benefit from sparse
reads.  I guess that means we need to start thinking about followup
patches to flush out our implementation.  But maybe I can live with this
patch as is, since the goal of your series was not so much the full
power of structured reads, but getting to a point where we could use
structured reply for block status, even if it means your client can only
communicate with qemu-nbd as server for now, as long as we do get to the
rest of the patches for a full-blown structured read.

> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <address@hidden>
> ---
>  block/nbd-client.c  |  47 +++++++++++----
>  block/nbd-client.h  |   2 +
>  include/block/nbd.h |  15 +++--
>  nbd/client.c        | 170 
> ++++++++++++++++++++++++++++++++++++++++++++++------
>  qemu-nbd.c          |   2 +-
>  5 files changed, 203 insertions(+), 33 deletions(-)

Hmm - no change to the testsuite. Structured reads seems like the sort
of thing that it would be nice to test with some canned server replies,
particularly with server behavior that is permitted by the NBD protocol
but does not happen by default in qemu-nbd.

> 
> diff --git a/block/nbd-client.c b/block/nbd-client.c
> index 3779c6c999..ff96bd1635 100644
> --- a/block/nbd-client.c
> +++ b/block/nbd-client.c
> @@ -180,13 +180,20 @@ static void nbd_co_receive_reply(NBDClientSession *s,
>      *reply = s->reply;
>      if (reply->handle != request->handle ||
>          !s->ioc) {
> +        reply->simple = true;
>          reply->error = EIO;

I don't think this is quite right - by setting reply->simple to true,
you are forcing the caller to treat this as the final packet related to
this request->handle, even though that might not be the case.

As it is, I wonder if this code is correct, even before your patch - the
server is allowed to give responses out-of-order (if we request multiple
reads without waiting for the first response) - I don't see how setting
reply->error to EIO if the request->handle indicates that we are
receiving an out-of-order response to some other packet, but that our
request is still awaiting traffic.

>      } else {
> -        if (qiov && reply->error == 0) {
> -            ret = nbd_wr_syncv(s->ioc, qiov->iov, qiov->niov, request->len,
> -                               true);
> -            if (ret != request->len) {
> -                reply->error = EIO;
> +        if (qiov) {
> +            if ((reply->simple ? reply->error == 0 :
> +                         reply->type == NBD_REPLY_TYPE_OFFSET_DATA)) {
> +                ret = nbd_wr_syncv(s->ioc, qiov->iov, qiov->niov, 
> request->len,
> +                                   true);

This works only because you used the DF flag.  If we allow fragmenting,
then you have to be careful to write the reply into the correct offset
of the iov.

> +                if (ret != request->len) {
> +                    reply->error = EIO;
> +                }
> +            } else if (!reply->simple &&
> +                       reply->type == NBD_REPLY_TYPE_OFFSET_HOLE) {
> +                qemu_iovec_memset(qiov, 0, 0, request->len);
>              }

Up to here, you didn't do any inspection for NBD_REPLY_FLAG_DONE (so you
don't know if this is the last packet the server is sending for this
reqeust->handle), and didn't do any special casing for
NBD_REPLY_TYPE_NONE or for the various error replies.  I'm not sure if
this will always do what you want.  In fact, I'm not even sure if
reply->error is set correctly for all structured packets.

>          }
>  
> @@ -227,6 +234,7 @@ int nbd_client_co_preadv(BlockDriverState *bs, uint64_t 
> offset,
>          .type = NBD_CMD_READ,
>          .from = offset,
>          .len = bytes,
> +        .flags = client->structured_reply ? NBD_CMD_FLAG_DF : 0,
>      };
>      NBDReply reply;
>      ssize_t ret;
> @@ -237,12 +245,30 @@ int nbd_client_co_preadv(BlockDriverState *bs, uint64_t 
> offset,
>      nbd_coroutine_start(client, &request);
>      ret = nbd_co_send_request(bs, &request, NULL);
>      if (ret < 0) {
> -        reply.error = -ret;
> -    } else {
> -        nbd_co_receive_reply(client, &request, &reply, qiov);
> +        goto out;
>      }
> +
> +    nbd_co_receive_reply(client, &request, &reply, qiov);
> +    if (reply.error != 0) {
> +        ret = -reply.error;
> +    }
> +
> +    if (!reply.simple) {
> +        while (!(reply.flags & NBD_REPLY_FLAG_DONE)) {
> +            nbd_co_receive_reply(client, &request, &reply, qiov);
> +            if (reply.error != 0) {
> +                ret = -reply.error;
> +            }
> +            if (reply.simple) {

Hmm. It looks like this part of the loop is only triggered if
nbd_co_receive_reply() detects a handle mismatch and slams reply.simple
to true.  As long as we use the DF flag, it looks like the server should
never send an error packet followed by a data packet, and your
particular server implementation always set the DONE flag on the error
packet, so it got past your testing.  But if we don't rely on the DF
flag, a server could reasonable send an ERROR_OFFSET packet for half the
buffer, followed by a data packet for the other half of the buffer,
which may wipe out reply.error from the error packet.

> +                ret = -EIO;
> +                goto out;
> +            }
> +        }
> +    }
> +
> +out:
>      nbd_coroutine_end(client, &request);
> -    return -reply.error;
> +    return ret;
>  }
>  
>  int nbd_client_co_pwritev(BlockDriverState *bs, uint64_t offset,
> @@ -408,7 +434,8 @@ int nbd_client_init(BlockDriverState *bs,
>                                  &client->nbdflags,
>                                  tlscreds, hostname,
>                                  &client->ioc,
> -                                &client->size, errp);
> +                                &client->size,
> +                                &client->structured_reply, errp);
>      if (ret < 0) {
>          logout("Failed to negotiate with the NBD server\n");
>          return ret;
> diff --git a/block/nbd-client.h b/block/nbd-client.h
> index f8d6006849..cba1f965bf 100644
> --- a/block/nbd-client.h
> +++ b/block/nbd-client.h
> @@ -32,6 +32,8 @@ typedef struct NBDClientSession {
>      NBDReply reply;
>  
>      bool is_unix;
> +
> +    bool structured_reply;
>  } NBDClientSession;
>  
>  NBDClientSession *nbd_get_client_session(BlockDriverState *bs);
> diff --git a/include/block/nbd.h b/include/block/nbd.h
> index 58b864f145..dae2e4bd03 100644
> --- a/include/block/nbd.h
> +++ b/include/block/nbd.h

Can you add the use of an order file to list .h files first in your
diffs?  See
https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg00288.html for
tips.

> @@ -57,11 +57,16 @@ struct NBDRequest {
>  };
>  typedef struct NBDRequest NBDRequest;
>  
> -struct NBDReply {
> +typedef struct NBDReply {
> +    bool simple;
>      uint64_t handle;
>      uint32_t error;
> -};
> -typedef struct NBDReply NBDReply;
> +
> +    uint16_t flags;
> +    uint16_t type;
> +    uint32_t length;
> +    uint64_t offset;
> +} NBDReply;

I don't know if this is the best way to represent things; I might have
used a union type, since not all fields are valid in all reply packets.

>  
>  struct NBDSimpleReply {
>      /* uint32_t NBD_SIMPLE_REPLY_MAGIC */
> @@ -169,10 +174,10 @@ ssize_t nbd_wr_syncv(QIOChannel *ioc,
>  int nbd_receive_negotiate(QIOChannel *ioc, const char *name, uint16_t *flags,
>                            QCryptoTLSCreds *tlscreds, const char *hostname,
>                            QIOChannel **outioc,
> -                          off_t *size, Error **errp);
> +                          off_t *size, bool *structured_reply, Error **errp);
>  int nbd_init(int fd, QIOChannelSocket *sioc, uint16_t flags, off_t size);
>  ssize_t nbd_send_request(QIOChannel *ioc, NBDRequest *request);
> -ssize_t nbd_receive_reply(QIOChannel *ioc, NBDReply *reply);
> +int nbd_receive_reply(QIOChannel *ioc, NBDReply *reply);
>  int nbd_client(int fd);
>  int nbd_disconnect(int fd);
>  
> diff --git a/nbd/client.c b/nbd/client.c
> index 1c274f3012..9225f7e30d 100644
> --- a/nbd/client.c
> +++ b/nbd/client.c
> @@ -472,11 +472,10 @@ static QIOChannel *nbd_receive_starttls(QIOChannel *ioc,
>      return QIO_CHANNEL(tioc);
>  }
>  
> -
>  int nbd_receive_negotiate(QIOChannel *ioc, const char *name, uint16_t *flags,
>                            QCryptoTLSCreds *tlscreds, const char *hostname,
>                            QIOChannel **outioc,
> -                          off_t *size, Error **errp)
> +                          off_t *size, bool *structured_reply, Error **errp)
>  {
>      char buf[256];
>      uint64_t magic, s;
> @@ -584,6 +583,12 @@ int nbd_receive_negotiate(QIOChannel *ioc, const char 
> *name, uint16_t *flags,
>              if (nbd_receive_query_exports(ioc, name, errp) < 0) {
>                  goto fail;
>              }
> +
> +            if (structured_reply != NULL) {
> +                *structured_reply =
> +                    nbd_receive_simple_option(ioc, NBD_OPT_STRUCTURED_REPLY,
> +                                              false, NULL) == 0;

Okay, you're allowing the server to reject the option, in which case we
set structured_reply to false.  But re-reading patch 4/18,
nbd_receive_simple_option() can return -1 for multiple reasons, some of
them where it is still in sync (for sending more options), but others
where it is out of sync (such as failure to write, in which case the
connection MUST be dropped rather than trying to carry on).  I don't
think this handles errors correctly, and therefore I'm not even sure
that the refactoring in 4/18 is correct.

I think you may be better off with nbd_receive_simple_option() in 4/18
being tri-state: return -1 if the connection is unrecoverable (such as
after a write or read error, where we must not send or receive any more
data), 0 if the server replied with an error but the connection is still
in sync for trying something else, and 1 if the server replies with
success.  Then this code should check if the return is < 0 (kill
negotiation), == 0 (*structured_reply = false), or == 1
(*structured_reply = true).

> +            }
>          }
>          /* write the export name request */
>          if (nbd_send_option_request(ioc, NBD_OPT_EXPORT_NAME, -1, name,
> @@ -603,6 +608,14 @@ int nbd_receive_negotiate(QIOChannel *ioc, const char 
> *name, uint16_t *flags,
>              goto fail;
>          }
>          be16_to_cpus(flags);
> +
> +        if (!!structured_reply && *structured_reply &&

Why do you need !! to coerce to bool, when && also coerces to bool?

> +            !(*flags & NBD_CMD_FLAG_DF))
> +        {
> +            error_setg(errp, "Structured reply is negotiated, "
> +                             "but DF flag is not.");

No trailing '.' for error_setg() messages.

Also, I'm not quite sure the NBD protocol allows an implementation that
supports structured read but does not support the DF flag.  Maybe that's
an NBD spec bug that we should get clarified.  (Ideally, the server
always advertises the DF flag if structured replies are negotiated,
because the common implementation of user-space handshake followed by
kernel transmission phase works best when the already-existing
ioctl(NBD_SET_FLAGS) can then be used to tell the kernel to use/expect
structure replies.)

> +            goto fail;
> +        }
>      } else if (magic == NBD_CLIENT_MAGIC) {
>          uint32_t oldflags;
>  
> @@ -790,20 +803,33 @@ ssize_t nbd_send_request(QIOChannel *ioc, NBDRequest 
> *request)
>      return 0;
>  }
>  
> -ssize_t nbd_receive_reply(QIOChannel *ioc, NBDReply *reply)
> +static inline int read_sync_check(QIOChannel *ioc, void *buffer, size_t size)
>  {
> -    uint8_t buf[NBD_REPLY_SIZE];
> -    uint32_t magic;
>      ssize_t ret;
>  
> -    ret = read_sync(ioc, buf, sizeof(buf));
> +    ret = read_sync(ioc, buffer, size);
>      if (ret < 0) {
>          return ret;
>      }
> -
> -    if (ret != sizeof(buf)) {
> +    if (ret != size) {
>          LOG("read failed");
> -        return -EINVAL;
> +        return -EIO;
> +    }
> +
> +    return 0;
> +}
> +
> +/* nbd_receive_simple_reply
> + * Read simple reply except magic field (which should be already read)
> + */
> +static int nbd_receive_simple_reply(QIOChannel *ioc, NBDReply *reply)
> +{
> +    uint8_t buf[NBD_REPLY_SIZE - 4];
> +    ssize_t ret;
> +
> +    ret = read_sync_check(ioc, buf, sizeof(buf));

This patch is fairly long; I might have split the creation and initial
use of the read_sync_check() function into its own patch.

> +    if (ret < 0) {
> +        return ret;
>      }
>  
>      /* Reply
> @@ -812,9 +838,124 @@ ssize_t nbd_receive_reply(QIOChannel *ioc, NBDReply 
> *reply)
>         [ 7 .. 15]    handle
>       */
>  
> -    magic = ldl_be_p(buf);
> -    reply->error  = ldl_be_p(buf + 4);
> -    reply->handle = ldq_be_p(buf + 8);
> +    reply->error  = ldl_be_p(buf);
> +    reply->handle = ldq_be_p(buf + 4);
> +
> +    return 0;
> +}
> +
> +/* nbd_receive_structured_reply_chunk
> + * Read structured reply chunk except magic field (which should be already 
> read)
> + * Data for NBD_REPLY_TYPE_OFFSET_DATA is not read too.
> + * Length field of reply out parameter corresponds to unread part of reply.

Awkward wording; maybe:

Read the header portion of a structured reply chunk (except for magic,
which should already be read).  On success, the length field of reply
corresponds to number of bytes in the reply that still need to be read.

> + */
> +static int nbd_receive_structured_reply_chunk(QIOChannel *ioc, NBDReply 
> *reply)
> +{
> +    NBDStructuredReplyChunk chunk;
> +    ssize_t ret;
> +    uint16_t message_size;
> +
> +    ret = read_sync_check(ioc, (uint8_t *)&chunk + sizeof(chunk.magic),

This looks like some ugly pointer math.  I wonder if you'd be better
served by having a separate packed struct without the magic field,
particularly since you have to read the magic in first to decide whether
to dispatch to simple/structured for the rest of the structure.  In
other words, note how patch 2/18 omits a uint32_t magic in
NBDSimpleReply; maybe patch 3/18 could do the same with
NBDStructuredReplyChunk.  Then again, that would impact how you code
things for subcasses like NBDStructuredRead, so I don't know how much
churn to request here.

> +                          sizeof(chunk) - sizeof(chunk.magic));
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    reply->flags = be16_to_cpu(chunk.flags);
> +    reply->type = be16_to_cpu(chunk.type);
> +    reply->handle = be64_to_cpu(chunk.handle);
> +    reply->length = be32_to_cpu(chunk.length);
> +
> +    switch (reply->type) {
> +    case NBD_REPLY_TYPE_NONE:
> +        break;
> +    case NBD_REPLY_TYPE_OFFSET_DATA:
> +    case NBD_REPLY_TYPE_OFFSET_HOLE:
> +        ret = read_sync_check(ioc, &reply->offset, sizeof(reply->offset));
> +        if (ret < 0) {
> +            return ret;
> +        }
> +        be64_to_cpus(&reply->offset);
> +        reply->length -= sizeof(reply->offset);
> +        break;
> +    case NBD_REPLY_TYPE_ERROR:
> +    case NBD_REPLY_TYPE_ERROR_OFFSET:

Here, I think that you want:

default:
if (reply->type & 0x8000) {

rather than specific NBD_REPLY_TYPE_ERROR labels, so that you can
gracefully handle ALL error packets sent by a server even if you haven't
explicitly coded for them (a good server should probably not be sending
an error packet you don't recognize, but the protocol also went to good
efforts to describe a common behavior of all error packets).

/me reads on...

> +        ret = read_sync_check(ioc, &reply->error, sizeof(reply->error));
> +        if (ret < 0) {
> +            return ret;
> +        }
> +        be32_to_cpus(&reply->error);
> +
> +        ret = read_sync_check(ioc, &message_size, sizeof(message_size));
> +        if (ret < 0) {
> +            return ret;
> +        }
> +        be16_to_cpus(&message_size);
> +
> +        if (message_size > 0) {
> +            /* TODO: provide error message to user */
> +            ret = drop_sync(ioc, message_size);
> +            if (ret < 0) {
> +                return ret;
> +            }
> +        }
> +
> +        if (reply->type == NBD_REPLY_TYPE_ERROR_OFFSET) {
> +            /* drop 64bit offset */
> +            ret = drop_sync(ioc, 8);
> +            if (ret < 0) {
> +                return ret;
> +            }
> +        }
> +        break;
> +    default:
> +        if (reply->type & (1 << 15)) {

...oh, you did, mostly.

> +            /* unknown error */
> +            ret = drop_sync(ioc, reply->length);

Still, even if it is an unknown error, you should still be able to
populate reply->error, rather than ignoring it.

> +            if (ret < 0) {
> +                return ret;
> +            }
> +
> +            reply->error = NBD_EINVAL;

Meanwhile, you should also probably ensure that reply->error is non-zero
even if the server was buggy and sent an error flag without telling you
an error value (your choice of EINVAL seems reasonable).

> +            reply->length = 0;
> +        } else {
> +            /* unknown non-error reply type */
> +            return -EINVAL;
> +        }
> +    }
> +
> +    return 0;
> +}
> +
> +int nbd_receive_reply(QIOChannel *ioc, NBDReply *reply)
> +{
> +    uint32_t magic;
> +    int ret;
> +
> +    ret = read_sync_check(ioc, &magic, sizeof(magic));
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    be32_to_cpus(&magic);
> +
> +    switch (magic) {
> +    case NBD_SIMPLE_REPLY_MAGIC:
> +        reply->simple = true;
> +        ret = nbd_receive_simple_reply(ioc, reply);
> +        break;
> +    case NBD_STRUCTURED_REPLY_MAGIC:
> +        reply->simple = false;
> +        ret = nbd_receive_structured_reply_chunk(ioc, reply);
> +        break;
> +    default:
> +        LOG("invalid magic (got 0x%" PRIx32 ")", magic);
> +        return -EINVAL;

This part looks good.

> +    }
> +
> +    if (ret < 0) {
> +        return ret;
> +    }
>  
>      reply->error = nbd_errno_to_system_errno(reply->error);
>  
> @@ -827,10 +968,5 @@ ssize_t nbd_receive_reply(QIOChannel *ioc, NBDReply 
> *reply)
>            ", handle = %" PRIu64" }",
>            magic, reply->error, reply->handle);
>  
> -    if (magic != NBD_SIMPLE_REPLY_MAGIC) {
> -        LOG("invalid magic (got 0x%" PRIx32 ")", magic);
> -        return -EINVAL;
> -    }
>      return 0;
>  }
> -
> diff --git a/qemu-nbd.c b/qemu-nbd.c

Why are you deleting a blank line from the end of the file? Is that
rebase churn that should have been avoided in an earlier patch?  </me
goes and reads> - oh, it's been like that since commit 798bfe000 created
the file.  Thanks for cleaning it up (and I'm surprised checkpatch
doesn't flag it).

> index c734f627b4..de0099e333 100644
> --- a/qemu-nbd.c
> +++ b/qemu-nbd.c
> @@ -272,7 +272,7 @@ static void *nbd_client_thread(void *arg)
>  
>      ret = nbd_receive_negotiate(QIO_CHANNEL(sioc), NULL, &nbdflags,
>                                  NULL, NULL, NULL,
> -                                &size, &local_error);
> +                                &size, NULL, &local_error);

We could reasonably allow new-style structured reads when we're handling
the entire client side; but I agree that when qemu-nbd is used as the
handshaking phase before handing things over to the kernel that we can't
request structured reads, at least not until a) the kernel nbd module
implements structured read supports, and b) we have a way to tell that
the kernel is willing to accept structured read negotiation (and that's
where my earlier comments about the DF flag being a witness of
structured reads comes into play).

Umm, how does this patch compile?  You changed the signature of
nbd_receive_negotiate() to add a parameter, but did not modify the call
in block/nbd-client.c to pass the new parameter.  (So far, I've been
reviewing based just on email content; I also plan on applying and
testing your patches before all is said and done, but I sometimes
surprise myself with my ability to do a quality read-only review even
without spending time compiling things).

>      if (ret < 0) {
>          if (local_error) {
>              error_report_err(local_error);
> 

We'll definitely need a v2, but I'm grateful that you've tackled this work.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

Attachment: signature.asc
Description: OpenPGP digital signature


reply via email to

[Prev in Thread] Current Thread [Next in Thread]