Re: [Qemu-devel] scsi-generic and max request size


From: Hannes Reinecke
Subject: Re: [Qemu-devel] scsi-generic and max request size
Date: Wed, 22 Dec 2010 14:54:54 +0100
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.16) Gecko/20101125 SUSE/3.0.11 Thunderbird/3.0.11

On 12/21/2010 11:05 PM, Benjamin Herrenschmidt wrote:
>>> So back to square 1 ... my vscsi (and virtio-blk too btw) can
>>> technically pass a max size to the guest, but we don't have a way to
>>> interrogate scsi-generic (and the underlying block driver) which is the
>>> main issue (that plus the fact that the ioctl seems to be broken in
>>> "compat" mode for /dev/sg specifically)...
>>>
>> Ah, the warm and fuzzy feeling of knowing I'm not alone in this ...
>>
>> This is basically the same issue I brought up with the first
>> submission round of my megasas emulation.
> 
> heh.
> 
>> As we're passing scatter-gather lists directly to the underlying
>> device we might end up sending a request which is improperly
>> formatted. The Linux block layer has three limits which a
>> request has to conform to:
>> - Max length of the scatter-gather list (max_sectors)
>> - Max overall request size (max_segments)
> 
> Didn't you swap the 2 above ? max_sectors is the max overall req. size
> and max_segments the max number of SG elements afaik :-)
> 
Yeah, could be. 'twas only meant for illustration anyway.

>> - Max length of individual sg elements (max_segment_size)
> 
>> Newer kernels export these limits via sysfs; they have been exported
>> since commit c77a5710b7e23847bfdb81fcaa10b585f65c960a.
>> For older kernels, however, we're left in the dark here.
> 
> Well, first of all, "sg" is not there so that doesn't help with the
> scsi-generic problem much, then parsing sysfs... yuck.
> 
Well, sort of. 'sg' doesn't have any block queue limits directly as the
block queue is attached to the block device (surprise, surprise :-).
But nevertheless any commands sent via SG_IO are placed on the
block queue, hence the same limits apply here, too.
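
To illustrate what "checking the block queue limits" could look like on a
kernel that exports them, here is a minimal sketch. It assumes the sg node
has already been mapped to its block device name (e.g. "sda"); that mapping
is not shown. The helper name read_queue_limits and the sysfs_root parameter
are made up for this example; only the three sysfs attribute names are real.

```python
import os

def read_queue_limits(dev, sysfs_root="/sys/block"):
    """Read block queue limits for `dev` (e.g. "sda") from sysfs.

    Attributes that are missing or unreadable (as on older kernels)
    are reported as None, so the caller has to fall back to a guess.
    """
    queue_dir = os.path.join(sysfs_root, dev, "queue")
    limits = {}
    for name in ("max_sectors_kb", "max_segments", "max_segment_size"):
        try:
            with open(os.path.join(queue_dir, name)) as f:
                limits[name] = int(f.read().strip())
        except (OSError, ValueError):
            limits[name] = None
    return limits
```

max_sectors_kb is the overall request size in KiB, max_segments the number
of sg elements, and max_segment_size the per-element size in bytes.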

>> So on newer kernels we probably could be doing a quick check on the
>> block queue limits and reformat the I/O if required.
> 
> Maybe but then, "sg" isn't there. We "could" I suppose use "sr" as an
> indication tho when we know it's a cdrom.
> 
If it were me I would be using
>> Instead of reformatting we could be sending each element of an sg
>> list individually. Thereby we would be introducing some slowdown as
>> the sg lists have to be reassembled again by the lower layers, but
>> we would be insulated from any sg list mismatch.
>> However, this won't cover requests with too large sg elements.
>> For those we could probably use some simple divide-by-two algorithm
>> on the element to make them fit.
> 
> How can we ? We need a single request to match a single sg list anyways
> no ?
> 
Yes, true. That's what I was trying to illustrate here.
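
For illustration only, the divide-by-two idea for oversized sg elements
could be sketched as below. The (address, length) tuple representation and
the name halve_until_fits are assumptions for this example, and sector
alignment is ignored. Note that it only addresses max_segment_size: the
result has more elements than the input, so it can still run into
max_segments, and it does nothing about the overall request size.

```python
def halve_until_fits(sg_list, max_segment_size):
    """Split every (addr, length) element in half until each piece
    is no longer than max_segment_size. Order is preserved."""
    if max_segment_size < 1:
        raise ValueError("max_segment_size must be positive")
    out = []
    # Work stack, reversed so that pop() yields elements in order.
    stack = list(reversed(sg_list))
    while stack:
        addr, length = stack.pop()
        if length <= max_segment_size:
            out.append((addr, length))
        else:
            half = length // 2
            # Push the second half first so the first half is handled next.
            stack.append((addr + half, length - half))
            stack.append((addr, half))
    return out
```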

> Let's say you get a READ10 from the guest for 200KB and your underlying
> max_sectors is 128KB. How do you want to "break that up" ? The only way
> would be to make it two different READ10's and that's a can of worms
> especially if you start putting tags into the picture...
> 
Precisely. Hence I didn't try to implement anything in that area :-)
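
Just to make the arithmetic of that example concrete, a sketch of what
"two different READ10's" would mean. This is not something that exists in
Qemu; split_read10 is a hypothetical helper, a 512-byte logical block size
is assumed, and tags, ordering and error handling (the actual can of worms)
are ignored entirely.

```python
import struct

READ_10 = 0x28  # READ(10) opcode

def split_read10(lba, nblocks, max_sectors_kb, block_size=512):
    """Return a list of READ(10) CDBs, each within max_sectors_kb."""
    max_blocks = (max_sectors_kb * 1024) // block_size
    if max_blocks < 1:
        raise ValueError("limit is smaller than one block")
    cdbs = []
    while nblocks > 0:
        chunk = min(nblocks, max_blocks)
        # CDB layout: opcode, flags, 32-bit LBA, group number,
        # 16-bit transfer length, control (all big-endian).
        cdbs.append(struct.pack(">BBIBHB", READ_10, 0, lba, 0, chunk, 0))
        lba += chunk
        nblocks -= chunk
    return cdbs
```

With a 200KB read (400 blocks) and a 128KB limit this gives one command
for 256 blocks and a second one for the remaining 144 blocks.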

>> But seeing we have to split the I/O requests anyway we might as well
>> use the divide-by-two algorithm for the sg lists, too.
>>
>> Easiest would be if we could just transfer the available bits and
>> push the request back to the guest as a partial completion.
>> Sadly the I/O stack on the guest will choose to interpret this as an
>> I/O error instead of retrying the remainder :-(
>>
>> So in the long run I fear we have to implement some sort of I/O
>> request splitting in Qemu, using the values from sysfs.
> 
> So in my case, I'm happy for the time being to continue doing bounce
> buffering and so my only problem at the moment is the max request size
> (aka max_sectors). Also I -can- tell the guest what my limitation is,
> it's part of the vscsi login protocol. I can look into doing DMA
> directly to the guest SG lists later maybe.
> 
> However, I can't quite figure out how to reliably obtain that
> information in my driver since on one hand, the ioctl doesn't seem to
> work in mixed 32/64-bit environments, and on the other hand, sysfs
> doesn't seem to have anything for "sg" in /sys/class/block... Besides,
> those are both Linux-isms... so we'd have to be extra careful there too.
> 
Yes. I've been bashing my head against this, too.

IMO the whole problem arises from the fact that we're deliberately
destroying information here.
Most modern HBAs are using separate codepaths for streaming/block I/O
anyway, but when using 'scsi-generic' we are forced to discard this
information. We have to fake a SCSI READ/WRITE command, and send it via
SG_IO to the underlying device and keep fingers crossed that we're not
exceeding any device limitations.

The whole problem would just go away if we could use the standard block
read()/write() calls here. Then the iovec would be placed _as
scatter-gather list_ on the request-queue and the block layer would take
care of the whole issue.
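
As a rough userspace illustration of that path (not Qemu code), a vectored
read hands the whole buffer list to the kernel in one call and leaves any
splitting to the layers below. The helper name read_scattered is made up
for this sketch; os.preadv is only available on platforms that provide
preadv(), e.g. Linux.

```python
import os

def read_scattered(path, buffers, offset):
    """Fill the writable buffers in `buffers`, in order, with data read
    from `path` starting at `offset`, using a single preadv() call.
    Returns the number of bytes read."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.preadv(fd, buffers, offset)
    finally:
        os.close(fd)
```

The caller never sees max_sectors or max_segments here; for a block device
the request is cut to fit before it reaches the driver.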

I've tried to advocate this approach once, but (again) was being told
that it's a misuse of scsi-generic and I should be using scsi-disk instead.

However, since Alex Graf is facing similar problems with his AHCI HBA,
maybe we could try again ...

Cheers,

Hannes
--
Dr. Hannes Reinecke                   zSeries & Storage
address@hidden                        +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)


