qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [Nbd] [PATCH 3/1] doc: Propose Structured Replies exten


From: Eric Blake
Subject: Re: [Qemu-devel] [Nbd] [PATCH 3/1] doc: Propose Structured Replies extension
Date: Tue, 29 Mar 2016 08:21:15 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.7.1

On 03/29/2016 02:24 AM, Alex Bligh wrote:
> 
> On 29 Mar 2016, at 04:56, Eric Blake <address@hidden> wrote:
> 
>> The existing transmission phase protocol is difficult to sniff,
>> because correct interpretation of the server stream requires
>> context from the client stream (or risks false positives if
>> data payloads happen to contain the protocol magic numbers).  It
>> also prohibits the ability to do short reads, or to return a
>> read error without also sending length bytes of data.
>>

>> +* `NBD_FLAG_C_STRUCTURED_REPLY` (bit 2)
>> +
>> +  This is a client flag; MUST NOT be set if the server did not set
> 
> *it* MUST not be set ...

Copy-and-paste from above; I can correct the place I copied from.

> 
>> +  `NBD_FLAG_STRUCTURED_REPLY`. If set, the server must use structured
>> +  replies to the `NBD_CMD_READ` command.
>> +
>> +* Transmission phase
>> +
>> +  The transmission phase includes a third message type: the structured
>> +  reply.  If `NBD_FLAG_C_STRUCTURED_REPLY` was negotiated, then the
>> +  normal server reply will never contain a data payload, and all
>> +  server replies that send data will be in the following form:
>> +
>> +  S: 32 bits, 0x668e33ef, magic (`NBD_STRUCTURED_REPLY_MAGIC`)  
>> +  S: 32 bits, error  
>> +  S: 64 bits, handle  
>> +  S: 32 bits, length of payload (unsigned)  
>> +  S: *length* bytes of payload data
> 
> Unless I'm missing something this doesn't entirely solve the problem.
> Imagine you are implementing NBD_CMD_READ (with structured reply)
> and are asked to read 4G of data. 1G in you find an error. You can't
> set the error ahead of time as you don't know (yet) there is an
> error. By the time you discover, you've already streamed 1G of data.
> What do you do now?

As the server, you can now either send 3G of (bogus) data followed by
the concluding normal response with error set, or you can immediately
send the normal response with error set and skip sending the remaining
3G of data.

I guess what I need to add is that in transmission phase, most commands
have exactly one response per request; but commands may document
scenarios where there will be multiple responses to a single request.
NBD_CMD_READ uses the multiple responses to make partial read and error
handling possible.

> 
> And this section seems at odds with the section below (starting
> "Conversely" which describes an offset/length scheme not detailed
> above.
> 
> I think you are saying that there can be one or more of these
> structured replies in reference to any request, in which
> case it should be made clearer, and not wait until NBD_CMD_READ.

Not for ANY request, but only for commands that document it.  I envision
that our proposal for NBD_CMD_GET_LBA_STATUS (or whatever we name it)
will always use a single structured reply (if error is set, the payload
is empty; otherwise, the payload is variable-sized but the length is
part of the structured reply header).  So far, only NBD_CMD_READ has a
reason for multiple replies.


>> +* `NBD_CMD_READ`
>> +
>> +  If `NBD_FLAG_C_STRUCTURED_REPLY` was not negotiated, then a read
>> +  request MUST always be answered by a single response (with magic
>> +  0x67446698 `NBD_REPLY_MAGIC`); the response MUST include length
>> +  bytes of data according to the client's request, although those
>> +  bytes MAY be invalid if an error is returned, and the connection
>> +  MUST if an error occurs after a header claiming no error.
> 
> Word(s) apparently missing after "MUST" on last line. "MUST be
> closed by the server" I think.

Yep. Will fix in v2.


>> +
>> +  The server MUST ensure that each read chunk lies within the original
>> +  offset and length of the original client request, MUST NOT send read
>> +  chunks that would cover the same offset more than once, and MUST
>> +  send at least one byte of data in addition to the offset field of
>> +  each read chunk.  The server MAY send read chunks out of order, and
>> +  may interleave other responses between read replies.
> 
> I can see why it's attractive from the server's point of view to
> support out of order replies - for instance an NBD server with a back
> end like Ceph could use this to launch requests simultaneously to
> multiple backend stores.
> 
> However, theoretically a server can now send single one byte packets
> back, which the client would have to reconstruct.

Yeah, but the reconstruction is easy; naively:

while response_magic == structured:
    copy len-8 bytes of data from response to given offset
response_magic == normal, read is complete

Detecting overlap or incomplete reads would requires more complexity in
the client, but I don't know that a client has to care (the protocol is
specifically written that a client MAY detect bad servers, but not MUST;
a client that assumes the server is well-behaved is still compliant).

However, you DO have a point that the server SHOULD send data in
reasonable-size chunks; and maybe I should propose a parallel extension
where, when negotiated between client and server, the server will
advertise minimum and preferred I/O sizes in response to the export name
request (for example, a server backed by a real block device may have a
minimum of 512 bytes or 4096 bytes, and a preferred size of 64k; while a
server backed by a normal file system may have a minimum of 1 byte);
then put in restrictions that a server SHOULD reject read/write requests
where offset and length are not multiples of the minimum, and that the
server SHOULD send read chunks aligned to the preferred size (with
exceptions for the head and tail of a larger buffer that meets minimum
alignment but not preferred alignment).

> 
> Also, given new commands aren't available unless you support structured
> replies, you now have to support reassembly of replies (if you want
> to use new features) even if all your reads are (e.g.) 1k.

Are you arguing that there should be a flag that controls whether reads
must be in-order vs. reassembled?  Reassembly must happen either way,
the question is whether having a way to allow out-of-order for
efficiency, vs. defaulting to in-order for easier computation, is worth it.

> 
> Can I suggest that the client should be able to specify a minimum
> 'power of 2' chunk boundary, e.g. if the client says '1k', and its
> requests do not cross a 1k boundary, the server will guarantee to
> return them as a single chunk?

If we want to negotiate minimum and preferred transaction sizes, it
should probably be done in a separate proposal.  For this proposal, I
think the best we can do is merely suggest that the server SHOULD keep
read chunks suitably blocked (larger blocks lead to less overhead).

> 
>>  The server
>> +  MUST NOT set the error field of a read chunk; if an error occurs, it
>> +  MAY immediately end the sequence of structured response messages,
>> +  MUST send the error in the concluding normal response, and SHOULD
>> +  keep the connection open.  The final non-structured response MUST
>> +  set an error unless the sum of data sent by all read chunks totals
>> +  the original client length request.
> 
> add "and data for the entire range requested has been supplied." (I
> know this is technically implied by the fact data cannot be duplicated).

Sure.  But keep in mind that if (when?) we add a flag for allowing the
server to skip read chunks on holes, we'll have to tweak the wording to
allow the server to send fewer chunks than the client's length, where
the client must then assume zeroes for all chunks not received.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

Attachment: signature.asc
Description: OpenPGP digital signature


reply via email to

[Prev in Thread] Current Thread [Next in Thread]