
Re: [Qemu-devel] Poor 8K random IO performance inside the guest


From: Stefan Hajnoczi
Subject: Re: [Qemu-devel] Poor 8K random IO performance inside the guest
Date: Mon, 17 Jul 2017 11:13:57 +0100
User-agent: Mutt/1.8.0 (2017-02-23)

On Fri, Jul 14, 2017 at 04:28:12AM +0000, Nagarajan, Padhu (HPE Storage) wrote:
> During an 8K random-read fio benchmark, we observed poor performance inside 
> the guest in comparison to the performance seen on the host block device. The 
> table below shows the IOPS on the host and inside the guest with both 
> virtioscsi (scsimq) and virtioblk (blkmq).
> 
> -----------------------------------
> config        | IOPS  | fio gst hst
> -----------------------------------
> host-q32-t1   | 79478 | 401     271

hst->fio adds ~130-270 microseconds of latency per request (401 vs. 271
here, 559 vs. 291 further down)?  That seems very high.

> scsimq-q8-t4  | 45958 | 693 639 351
> blkmq-q8-t4   | 49247 | 647 589 308

The gst->fio overhead here (~54-58 microseconds) is much lower than the
hst->fio overhead in the host-q32-t1 case.  Strange, unless the physical
HBA driver is very slow or you have an md or device-mapper configuration
on the host but not in the guest.

What is the storage configuration (guest, host, and hardware)?
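
For example, the host-side output of something like the following would
help (sdc is the passthrough device from your command line; adjust as
needed):

    lsblk -o NAME,MODEL,SIZE,ROTA /dev/sdc
    cat /sys/block/sdc/queue/scheduler
    dmsetup ls

plus the HBA model and driver in use.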

Please also look at the latency percentiles in the fio output.  It's
possible that the latency distribution is very different from a normal
distribution and the mean latency isn't very meaningful.
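
The default fio output already contains a section along these lines (the
bracketed values below are placeholders, not real numbers):

    clat percentiles (usec):
     |  1.00th=[ ...],  5.00th=[ ...], 10.00th=[ ...], 50.00th=[ ...],
     | 90.00th=[ ...], 99.00th=[ ...], 99.90th=[ ...], 99.99th=[ ...]

If your fio build supports it, --percentile_list=50:90:99:99.9:99.99 lets
you pick exactly which points get reported.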

> -----------------------------------
> host-q48-t1   | 85599 | 559     291
> scsimq-q12-t4 | 50237 | 952 807 358
> blkmq-q12-t4  | 54016 | 885 786 329
> -----------------------------------
> fio gst hst => latencies in usecs, as
>                seen by fio, guest and
>                host block layers.
> q8-t4 => qdepth=8, numjobs=4
> host  => fio run directly on the host
> scsimq,blkmq => fio run inside the guest
> 
> Shouldn't we get a much better performance inside the guest ?
> 
> When fio inside the guest was generating 32 outstanding IOs, iostat on the 
> host shows avgqu-sz of only 16. For 48 outstanding IOs inside the guest, 
> avgqu-sz on the host was only marginally better.

The latency numbers you posted support the avgqu-sz result.  The fio
latency minus the hst latency is roughly equal to hst itself.  If the
software overhead is ~50% of the entire request duration, then it makes
sense that the host queue depth is only 50% of the queue depth the
benchmark is trying to sustain.
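
A back-of-the-envelope check with Little's law (average queue depth ~=
IOPS x mean latency), using the scsimq-q8-t4 row:

    host side: 45958 IOPS x 351 us ~= 16 requests in flight
    fio side:  45958 IOPS x 693 us ~= 32 requests in flight (qdepth 8 x 4 jobs)

So roughly half of each request's lifetime is spent above the host block
layer, which is why the host never sees more than ~16 outstanding
requests.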

> 
> qemu command line: qemu-system-x86_64 -L /usr/share/seabios/
> -name node1,debug-threads=on -name node1 -S -machine pc,accel=kvm,usb=off
> -cpu SandyBridge -m 7680 -realtime mlock=off -smp 4,sockets=4,cores=1,threads=1
> -object iothread,id=iothread1 -object iothread,id=iothread2
> -object iothread,id=iothread3 -object iothread,id=iothread4
> -uuid XX -nographic -no-user-config -nodefaults
> -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/node1.monitor,server,nowait
> -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew
> -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -boot strict=on
> -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2
> -device lsi,id=scsi0,bus=pci.0,addr=0x6
> -device virtio-scsi-pci,ioeventfd=on,num_queues=4,iothread=iothread2,id=scsi1,bus=pci.0,addr=0x7
> -device virtio-scsi-pci,ioeventfd=on,num_queues=4,iothread=iothread2,id=scsi2,bus=pci.0,addr=0x8
> -drive file=rhel7.qcow2,if=none,id=drive-virtio-disk0,format=qcow2
> -device virtio-blk-pci,ioeventfd=on,num-queues=4,iothread=iothread1,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
> -drive file=/dev/sdc,if=none,id=drive-virtio-disk1,format=raw,cache=none,aio=native
> -device virtio-blk-pci,ioeventfd=on,num-queues=4,iothread=iothread1,iothread=iothread1,scsi=off,bus=pci.0,addr=0x17,drive=drive-virtio-disk1,id=virtio-disk1
> -drive file=/dev/sdc,if=none,id=drive-scsi1-0-0-0,format=raw,cache=none,aio=native
> -device scsi-hd,bus=scsi1.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi1-0-0-0,id=scsi1-0-0-0
> -netdev tap,fd=24,id=hostnet0,vhost=on,vhostfd=25
> -device virtio-net-pci,netdev=hostnet0,id=net0,mac=XXX,bus=pci.0,addr=0x2
> -netdev tap,fd=26,id=hostnet1,vhost=on,vhostfd=27
> -device virtio-net-pci,netdev=hostnet1,id=net1,mac=YYY,bus=pci.0,multifunction=on,addr=0x15
> -netdev tap,fd=28,id=hostnet2,vhost=on,vhostfd=29
> -device virtio-net-pci,netdev=hostnet2,id=net2,mac=ZZZ,bus=pci.0,multifunction=on,addr=0x16
> -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0
> -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3 -msg timestamp=on

Are you pinning vcpus and iothreads so that the physical HBA interrupts
are processed by the same host CPU as the vcpu/iothread?
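
A rough sketch with libvirt (the CPU numbers are placeholders; pick them
to match your topology and the HBA's IRQ affinity):

    virsh vcpupin node1 0 2        # pin vcpu 0 to host CPU 2
    virsh iothreadpin node1 1 2    # pin iothread 1 to the same CPU
    echo 2 > /proc/irq/<hba-irq>/smp_affinity_list

Alternatively, pin the QEMU threads directly with taskset -pc <cpu> <tid>,
using the thread IDs reported by the query-cpus and query-iothreads QMP
commands.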

> fio command line: /tmp/fio --time_based --ioengine=libaio --randrepeat=1 
> --direct=1 --invalidate=1 --verify=0 --offset=0 --verify_fatal=0 
> --group_reporting --numjobs=$jobs --name=randread --rw=randread 
> --blocksize=8K --iodepth=$qd --runtime=60 --filename={/dev/vdb or /dev/sda}
> 
> # qemu-system-x86_64 --version
> QEMU emulator version 2.8.0(Debian 1:2.8+dfsg-3~bpo8+1)
> Copyright (c) 2003-2016 Fabrice Bellard and the QEMU Project developers
> 
> The guest was running RHEL 7.3 and the host was Debian 8.
