[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-devel] [PATCH v3 4/4] virtio-blk: introduce multiread
From: |
Peter Lieven |
Subject: |
Re: [Qemu-devel] [PATCH v3 4/4] virtio-blk: introduce multiread |
Date: |
Fri, 30 Jan 2015 22:05:32 +0100 |
User-agent: |
Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.2.0 |
Am 30.01.2015 um 18:16 schrieb Max Reitz:
> On 2015-01-30 at 09:33, Peter Lieven wrote:
>> this patch finally introduces multiread support to virtio-blk. While
>> multiwrite support was there for a long time, read support was missing.
>>
>> The complete merge logic is moved into virtio-blk.c which has
>> been the only user of request merging ever since. This is required
>> to be able to merge chunks of requests and immediately invoke callbacks
>> for those requests. Secondly, this is required to switch to
>> direct invocation of coroutines which is planned at a later stage.
>>
>> The following benchmarks show the performance of running fio with
>> 4 worker threads on a local ram disk. The numbers show the average
>> of 10 test runs after 1 run as warmup phase.
>>
>> | 4k | 64k | 4k
>> MB/s | rd seq | rd rand | rd seq | rd rand | wr seq | wr rand
>> --------------+--------+---------+--------+---------+--------+--------
>> master | 1221 | 1187 | 4178 | 4114 | 1745 | 1213
>> multiread | 1829 | 1189 | 4639 | 4110 | 1894 | 1216
>>
>> Signed-off-by: Peter Lieven <address@hidden>
>> ---
>> hw/block/dataplane/virtio-blk.c | 8 +-
>> hw/block/virtio-blk.c | 288
>> +++++++++++++++++++++++++++-------------
>> include/hw/virtio/virtio-blk.h | 19 +--
>> trace-events | 1 +
>> 4 files changed, 210 insertions(+), 106 deletions(-)
>>
>> diff --git a/hw/block/dataplane/virtio-blk.c
>> b/hw/block/dataplane/virtio-blk.c
>> index 39c5d71..93472e9 100644
>> --- a/hw/block/dataplane/virtio-blk.c
>> +++ b/hw/block/dataplane/virtio-blk.c
>> @@ -96,9 +96,7 @@ static void handle_notify(EventNotifier *e)
>> event_notifier_test_and_clear(&s->host_notifier);
>> blk_io_plug(s->conf->conf.blk);
>> for (;;) {
>> - MultiReqBuffer mrb = {
>> - .num_writes = 0,
>> - };
>> + MultiReqBuffer mrb = {};
>> int ret;
>> /* Disable guest->host notifies to avoid unnecessary vmexits */
>> @@ -120,7 +118,9 @@ static void handle_notify(EventNotifier *e)
>> virtio_blk_handle_request(req, &mrb);
>> }
>> - virtio_submit_multiwrite(s->conf->conf.blk, &mrb);
>> + if (mrb.num_reqs) {
>> + virtio_submit_multireq(s->conf->conf.blk, &mrb);
>> + }
>> if (likely(ret == -EAGAIN)) { /* vring emptied */
>> /* Re-enable guest->host notifies and stop processing the
>> vring.
>> diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
>> index e04adb8..77890a0 100644
>> --- a/hw/block/virtio-blk.c
>> +++ b/hw/block/virtio-blk.c
>> @@ -34,6 +34,7 @@ VirtIOBlockReq *virtio_blk_alloc_request(VirtIOBlock *s)
>> req->dev = s;
>> req->qiov.size = 0;
>> req->next = NULL;
>> + req->mr_next = NULL;
>> return req;
>> }
>> @@ -84,20 +85,29 @@ static int virtio_blk_handle_rw_error(VirtIOBlockReq
>> *req, int error,
>> static void virtio_blk_rw_complete(void *opaque, int ret)
>> {
>> - VirtIOBlockReq *req = opaque;
>> + VirtIOBlockReq *next = opaque;
>> - trace_virtio_blk_rw_complete(req, ret);
>> + while (next) {
>> + VirtIOBlockReq *req = next;
>> + next = req->mr_next;
>> + trace_virtio_blk_rw_complete(req, ret);
>> - if (ret) {
>> - int p = virtio_ldl_p(VIRTIO_DEVICE(req->dev), &req->out.type);
>> - bool is_read = !(p & VIRTIO_BLK_T_OUT);
>> - if (virtio_blk_handle_rw_error(req, -ret, is_read))
>> - return;
>> - }
>> + if (req->qiov.nalloc != -1) {
>> + qemu_iovec_destroy(&req->qiov);
>> + }
>> - virtio_blk_req_complete(req, VIRTIO_BLK_S_OK);
>> - block_acct_done(blk_get_stats(req->dev->blk), &req->acct);
>> - virtio_blk_free_request(req);
>> + if (ret) {
>> + int p = virtio_ldl_p(VIRTIO_DEVICE(req->dev), &req->out.type);
>> + bool is_read = !(p & VIRTIO_BLK_T_OUT);
>> + if (virtio_blk_handle_rw_error(req, -ret, is_read)) {
>> + continue;
>> + }
>> + }
>> +
>> + virtio_blk_req_complete(req, VIRTIO_BLK_S_OK);
>> + block_acct_done(blk_get_stats(req->dev->blk), &req->acct);
>> + virtio_blk_free_request(req);
>> + }
>> }
>> static void virtio_blk_flush_complete(void *opaque, int ret)
>> @@ -291,24 +301,125 @@ static void virtio_blk_handle_scsi(VirtIOBlockReq
>> *req)
>> }
>> }
>> -void virtio_submit_multiwrite(BlockBackend *blk, MultiReqBuffer *mrb)
>> +static inline void
>
> Is the "inline" really worth it? This function is rather long...
I tried really hard to not add a regression for random reads and the difference
between good and bad here is sometimes just a matter of a few small changes.
If gcc inlines on its own anyway would you mind to keep the inline? The function
is actually only called from 2 places.
>
> (To my surprise, gcc actually does inline the function)
>
>> +virtio_submit_multireq2(BlockBackend *blk, MultiReqBuffer *mrb,
>
> Hm, that's not a very descriptive name... Maybe "submit_merged_requests" or
> something like that (it's static, so you can drop the virtio_ prefix if you
> want to) would be better?
Sure, you are right.
>
>> + int start, int num_reqs, int niov)
>> {
>> - int i, ret;
>> + QEMUIOVector *qiov = &mrb->reqs[start]->qiov;
>> + int64_t sector_num = mrb->reqs[start]->sector_num;
>> + int nb_sectors = mrb->reqs[start]->qiov.size / BDRV_SECTOR_SIZE;
>> + bool is_write = mrb->is_write;
>> +
>> + if (num_reqs > 1) {
>> + int i;
>> + struct iovec *_iov = qiov->iov;
>> + int _niov = qiov->niov;
>
> Identifiers with leading underscores are considered reserved, I'd rather
> avoid using them.
Okay. I will use tmp_iov and tmp_niov instead.
>
>> +
>> + qemu_iovec_init(qiov, niov);
>> +
>> + for (i = 0; i < _niov; i++) {
>> + qemu_iovec_add(qiov, _iov[i].iov_base, _iov[i].iov_len);
>> + }
>> - if (!mrb->num_writes) {
>> + for (i = start + 1; i < start + num_reqs; i++) {
>> + qemu_iovec_concat(qiov, &mrb->reqs[i]->qiov, 0,
>> + mrb->reqs[i]->qiov.size);
>> + mrb->reqs[i - 1]->mr_next = mrb->reqs[i];
>> + nb_sectors += mrb->reqs[i]->qiov.size / BDRV_SECTOR_SIZE;
>> + }
>> + assert(nb_sectors == qiov->size / BDRV_SECTOR_SIZE);
>> +
>> + trace_virtio_blk_submit_multireq(mrb, start, num_reqs, sector_num,
>> + nb_sectors, is_write);
>> + block_acct_merge_done(blk_get_stats(blk),
>> + is_write ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ,
>> + num_reqs - 1);
>> + }
>> +
>> + if (is_write) {
>> + blk_aio_writev(blk, sector_num, qiov, nb_sectors,
>> + virtio_blk_rw_complete, mrb->reqs[start]);
>> + } else {
>> + blk_aio_readv(blk, sector_num, qiov, nb_sectors,
>> + virtio_blk_rw_complete, mrb->reqs[start]);
>> + }
>> +}
>> +
>> +static int virtio_multireq_compare(const void *a, const void *b)
>> +{
>> + const VirtIOBlockReq *req1 = *(VirtIOBlockReq **)a,
>> + *req2 = *(VirtIOBlockReq **)b;
>> +
>> + /*
>> + * Note that we can't simply subtract sector_num1 from sector_num2
>> + * here as that could overflow the return value.
>> + */
>> + if (req1->sector_num > req2->sector_num) {
>> + return 1;
>> + } else if (req1->sector_num < req2->sector_num) {
>> + return -1;
>> + } else {
>> + return 0;
>> + }
>> +}
>> +
>> +void virtio_submit_multireq(BlockBackend *blk, MultiReqBuffer *mrb)
>> +{
>> + int i = 0, start = 0, num_reqs = 0, niov = 0, nb_sectors = 0,
>> + max_xfer_len = 0;
>
> Albeit not necessary, starting a new "int" line won't take any more space
> (and would look nicer to me, probably).
okay, I don't mind.
>
>> + int64_t sector_num = 0;
>> +
>> + if (mrb->num_reqs == 1) {
>> + virtio_submit_multireq2(blk, mrb, 0, 1, -1);
>> + mrb->num_reqs = 0;
>> return;
>> }
>> - ret = blk_aio_multiwrite(blk, mrb->blkreq, mrb->num_writes);
>> - if (ret != 0) {
>> - for (i = 0; i < mrb->num_writes; i++) {
>> - if (mrb->blkreq[i].error) {
>> - virtio_blk_rw_complete(mrb->blkreq[i].opaque, -EIO);
>> + max_xfer_len = blk_get_max_transfer_length(mrb->reqs[0]->dev->blk);
>> + max_xfer_len = MIN_NON_ZERO(max_xfer_len, INT_MAX);
>> +
>> + qsort(mrb->reqs, mrb->num_reqs, sizeof(*mrb->reqs),
>> + &virtio_multireq_compare);
>> +
>> + for (i = 0; i < mrb->num_reqs; i++) {
>> + VirtIOBlockReq *req = mrb->reqs[i];
>> + if (num_reqs > 0) {
>> + bool merge = true;
>> +
>> + /* merge would exceed maximum number of IOVs */
>> + if (niov + req->qiov.niov + 1 > IOV_MAX) {
>
> Hm, why the +1?
A really good question. I copied this piece from the old merge routine. It seems
definetely wrong.
>
>> + merge = false;
>> + }
>> +
>> + /* merge would exceed maximum transfer length of backend device
>> */
>> + if (req->qiov.size / BDRV_SECTOR_SIZE + nb_sectors >
>> max_xfer_len) {
>> + merge = false;
>> + }
>> +
>> + /* requests are not sequential */
>> + if (sector_num + nb_sectors != req->sector_num) {
>> + merge = false;
>> + }
>> +
>> + if (!merge) {
>> + virtio_submit_multireq2(blk, mrb, start, num_reqs, niov);
>> + num_reqs = 0;
>> }
>> }
>> +
>> + if (num_reqs == 0) {
>> + sector_num = req->sector_num;
>> + nb_sectors = niov = 0;
>> + start = i;
>> + }
>> +
>> + nb_sectors += req->qiov.size / BDRV_SECTOR_SIZE;
>> + niov += req->qiov.niov;
>> + num_reqs++;
>> }
>> - mrb->num_writes = 0;
>> + virtio_submit_multireq2(blk, mrb, start, num_reqs, niov);
>> + mrb->num_reqs = 0;
>> }
>> static void virtio_blk_handle_flush(VirtIOBlockReq *req, MultiReqBuffer
>> *mrb)
>> @@ -319,7 +430,9 @@ static void virtio_blk_handle_flush(VirtIOBlockReq *req,
>> MultiReqBuffer *mrb)
>> /*
>> * Make sure all outstanding writes are posted to the backing device.
>> */
>> - virtio_submit_multiwrite(req->dev->blk, mrb);
>> + if (mrb->is_write && mrb->num_reqs > 0) {
>> + virtio_submit_multireq(req->dev->blk, mrb);
>> + }
>> blk_aio_flush(req->dev->blk, virtio_blk_flush_complete, req);
>> }
>> @@ -329,6 +442,9 @@ static bool virtio_blk_sect_range_ok(VirtIOBlock *dev,
>> uint64_t nb_sectors = size >> BDRV_SECTOR_BITS;
>> uint64_t total_sectors;
>> + if (nb_sectors > INT_MAX) {
>> + return false;
>> + }
>> if (sector & dev->sector_mask) {
>> return false;
>> }
>> @@ -342,60 +458,6 @@ static bool virtio_blk_sect_range_ok(VirtIOBlock *dev,
>> return true;
>> }
>> -static void virtio_blk_handle_write(VirtIOBlockReq *req, MultiReqBuffer
>> *mrb)
>> -{
>> - BlockRequest *blkreq;
>> - uint64_t sector;
>> -
>> - sector = virtio_ldq_p(VIRTIO_DEVICE(req->dev), &req->out.sector);
>> -
>> - trace_virtio_blk_handle_write(req, sector, req->qiov.size / 512);
>> -
>> - if (!virtio_blk_sect_range_ok(req->dev, sector, req->qiov.size)) {
>> - virtio_blk_req_complete(req, VIRTIO_BLK_S_IOERR);
>> - virtio_blk_free_request(req);
>> - return;
>> - }
>> -
>> - block_acct_start(blk_get_stats(req->dev->blk), &req->acct,
>> req->qiov.size,
>> - BLOCK_ACCT_WRITE);
>> -
>> - if (mrb->num_writes == VIRTIO_BLK_MAX_MERGE_REQS) {
>> - virtio_submit_multiwrite(req->dev->blk, mrb);
>> - }
>> -
>> - blkreq = &mrb->blkreq[mrb->num_writes];
>> - blkreq->sector = sector;
>> - blkreq->nb_sectors = req->qiov.size / BDRV_SECTOR_SIZE;
>> - blkreq->qiov = &req->qiov;
>> - blkreq->cb = virtio_blk_rw_complete;
>> - blkreq->opaque = req;
>> - blkreq->error = 0;
>> -
>> - mrb->num_writes++;
>> -}
>> -
>> -static void virtio_blk_handle_read(VirtIOBlockReq *req)
>> -{
>> - uint64_t sector;
>> -
>> - sector = virtio_ldq_p(VIRTIO_DEVICE(req->dev), &req->out.sector);
>> -
>> - trace_virtio_blk_handle_read(req, sector, req->qiov.size / 512);
>> -
>> - if (!virtio_blk_sect_range_ok(req->dev, sector, req->qiov.size)) {
>> - virtio_blk_req_complete(req, VIRTIO_BLK_S_IOERR);
>> - virtio_blk_free_request(req);
>> - return;
>> - }
>> -
>> - block_acct_start(blk_get_stats(req->dev->blk), &req->acct,
>> req->qiov.size,
>> - BLOCK_ACCT_READ);
>> - blk_aio_readv(req->dev->blk, sector, &req->qiov,
>> - req->qiov.size / BDRV_SECTOR_SIZE,
>> - virtio_blk_rw_complete, req);
>> -}
>> -
>> void virtio_blk_handle_request(VirtIOBlockReq *req, MultiReqBuffer *mrb)
>> {
>> uint32_t type;
>> @@ -430,11 +492,54 @@ void virtio_blk_handle_request(VirtIOBlockReq *req,
>> MultiReqBuffer *mrb)
>> type = virtio_ldl_p(VIRTIO_DEVICE(req->dev), &req->out.type);
>> - if (type & VIRTIO_BLK_T_FLUSH) {
>> + switch (type & 0xff) {
>
> Somehow I'd prefer type & ~VIRTIO_BLK_T_BARRIER, but this is correct, too.
The problem is that type encodes operations and flags and there is no official
mask to distinguish them in
the specs defined. If anytime a new flag is defined it will break the switch
statement. I could add a macro
for the 0xff in the headers if you like?
>
>> + case VIRTIO_BLK_T_IN:
>> + case VIRTIO_BLK_T_OUT:
>> + {
>> + bool is_write = type & VIRTIO_BLK_T_OUT;
>> + req->sector_num = virtio_ldq_p(VIRTIO_DEVICE(req->dev),
>> + &req->out.sector);
>> +
>> + if (!virtio_blk_sect_range_ok(req->dev, req->sector_num,
>> + req->qiov.size)) {
>> + virtio_blk_req_complete(req, VIRTIO_BLK_S_IOERR);
>> + virtio_blk_free_request(req);
>> + return;
>> + }
>> +
>> + if (is_write) {
>> + qemu_iovec_init_external(&req->qiov, iov, out_num);
>> + trace_virtio_blk_handle_write(req, req->sector_num,
>> + req->qiov.size /
>> BDRV_SECTOR_SIZE);
>> + } else {
>> + qemu_iovec_init_external(&req->qiov, in_iov, in_num);
>> + trace_virtio_blk_handle_read(req, req->sector_num,
>> + req->qiov.size / BDRV_SECTOR_SIZE);
>> + }
>
> Before this patch, req->qiov (and subsequently req->qiov.size) was
> initialized before virtio_blk_sect_range_ok() was called (with req->qiov.size
> as an argument). Is it fine to change that?
Oops, thats definetly a mistake. Half of the checks in virtio_blk_sect_range_ok
won't work with the size set.
>
>> +
>> + block_acct_start(blk_get_stats(req->dev->blk),
>> + &req->acct, req->qiov.size,
>> + is_write ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ);
>> +
>> + /* merge would exceed maximum number of requests or IO direction
>> + * changes */
>> + if (mrb->num_reqs > 0 && (mrb->num_reqs ==
>> VIRTIO_BLK_MAX_MERGE_REQS ||
>> + is_write != mrb->is_write)) {
>> + virtio_submit_multireq(req->dev->blk, mrb);
>> + }
>> +
>> + mrb->reqs[mrb->num_reqs++] = req;
>
> Somehow I'd like an assert(mrb->num_reqs < VIRTIO_BLK_MAX_MERGE_REQS) before
> this. Feel free to ignore my request.
This can't happen and the check for this is right above. Of what scenario do
you think of that could overflow num_reqs?
>
>> + mrb->is_write = is_write;
>> + break;
>> + }
>> + case VIRTIO_BLK_T_FLUSH:
>
> I think VIRTIO_BLK_T_FLUSH_OUT is missing (that's the reason it was "type &
> VIRTIO_BLK_T_..." before instead of "type == VIRTIO_BLK_T_...").
VIRTIO_BLK_T_FLUSH_OUT is deprecated and not defined in the headers of qemu
anymore. I can add it, but then I would also add VIRTIO_BLK_T_SCSI_CMD_OUT.
>
>> virtio_blk_handle_flush(req, mrb);
>> - } else if (type & VIRTIO_BLK_T_SCSI_CMD) {
>> + break;
>> + case VIRTIO_BLK_T_SCSI_CMD:
>
> And VIRTIO_BLK_T_SCSI_CMD_OUT here.
See above. I will define them in the headers.
>
>
> The overall logic of this patch seems good to me (although I had some trouble
> following which QIOVs had which purpose, which were allocated and which were
> not, and so on...).
If you tell me at which point it was most difficult I can try to add comments.
Thanks for reviewing,
Peter