
Re: [Qemu-devel] Following up questions related to QEMU and I/O Thread


From: Wei Li
Subject: Re: [Qemu-devel] Following up questions related to QEMU and I/O Thread
Date: Mon, 22 Apr 2019 21:21:53 -0700
User-agent: Microsoft-MacOutlook/10.15.0.190115

Hi Stefan,

I did some investigation per your advice; please see inline for the details
and questions.
 
       1. Compare "iostat -dx 1" inside the guest and host.  Are the I/O
           patterns comparable?  blktrace(8) can give you even more detail on
           the exact I/O patterns.  If the guest and host have different I/O
           patterns (blocksize, IOPS, queue depth) then request merging or
           I/O scheduler effects could be responsible for the difference.

[wei]: That's a good point. I compared "iostat -dx 1" between the guest and
the host, but I did not find any obvious difference between them that could
be responsible for the performance gap.
        
        2. kvm_stat or perf record -a -e kvm:\* counters for vmexits and
           interrupt injections.  If these counters vary greatly between queue
           sizes, then that is usually a clue.  It's possible to get higher
           performance by spending more CPU cycles although your system doesn't
           have many CPUs available, so I'm not sure if this is the case.

[wei]: vmexits look like a likely reason. I am using the fio tool to
read/write block storage via the sample command below. Interestingly, the
kvm:kvm_exit count decreased from 846K to 395K after I increased num_queues
from 2 to 4 while the vCPU count stayed at 2.
           1). Does this mean that using more queues than the vCPU count may
increase IOPS by spending more CPU cycles?
           2). Could you please help me better understand how more queues are
able to spend more CPU cycles? Thanks!
           FIO command: fio --filename=/dev/sdb --direct=1 --rw=randrw --bs=4k 
--ioengine=libaio --iodepth=64 --numjobs=4 --time_based --group_reporting 
--name=iops --runtime=60 --eta-newline=1
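
           (For reference, counters like these can be sampled system-wide
while fio runs; the 60-second window below simply matches the fio runtime:)
           # count KVM tracepoint hits across all CPUs during the run
           perf stat -a -e kvm:kvm_exit,kvm:kvm_vcpu_wakeup -- sleep 60
           # or watch the same counters update live
           kvm_stat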

        3. Power management and polling (kvm.ko halt_poll_ns, tuned profiles,
           and QEMU iothread poll-max-ns).  It's expensive to wake a CPU when it
           goes into a low power mode due to idle.  There are several features
           that can keep the CPU awake or even poll so that request latency is
           reduced.  The reason why the number of queues may matter is that
           kicking multiple queues may keep the CPU awake more than batching
           multiple requests onto a small number of queues.
[wei]: CPU wakeups could be another reason. I noticed that the
kvm:kvm_vcpu_wakeup count decreased from 151K to 47K after I increased
num_queues from 2 to 4 while the vCPU count is 2.
           1). Does this mean that more queues keep the CPU busier and awake,
which reduces the number of vCPU wakeups?
           2). If using more num_queues than the vCPU count yields higher
IOPS in this case, is it safe to use 4 queues with only 2 vCPUs, or is there
any concern or impact from running more queues than vCPUs that I should keep
in mind?
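
           For the polling knobs mentioned in point 3, these are the places I
understand they are set (the values below are illustrative, not tuned
recommendations):
           # host: KVM halt-polling window in nanoseconds (0 disables it)
           cat /sys/module/kvm/parameters/halt_poll_ns
           echo 400000 > /sys/module/kvm/parameters/halt_poll_ns
           # QEMU: adaptive polling limit on the iothread serving the queues
           -object iothread,id=iothread0,poll-max-ns=200000
           -device virtio-scsi-pci,iothread=iothread0,num_queues=4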

In addition, does virtio-scsi support a batch I/O submission feature which
might increase IOPS by reducing the number of system calls?
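
(On the benchmark side, my understanding is that fio's libaio engine can
already batch several requests into one io_submit() syscall, which might show
whether fewer syscalls help here; the batch size of 16 is arbitrary:)
           fio --filename=/dev/sdb --direct=1 --rw=randrw --bs=4k \
               --ioengine=libaio --iodepth=64 --iodepth_batch_submit=16 \
               --numjobs=4 --time_based --runtime=60 --group_reporting \
               --name=iops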

Thanks,
Wei

On 4/16/19, 6:42 PM, "Wei Li" <address@hidden> wrote:

    Thanks Stefan and Dongli for your feedback and advice!
    
    I will do further investigation per your advice and get back to you
later on.
    
    Thanks, 
    -Wei
    
    On 4/16/19, 2:20 AM, "Stefan Hajnoczi" <address@hidden> wrote:
    
        On Tue, Apr 16, 2019 at 07:23:38AM +0800, Dongli Zhang wrote:
        > 
        > 
        > On 4/16/19 1:34 AM, Wei Li wrote:
        > > Hi @Paolo Bonzini & @Stefan Hajnoczi,
        > > 
        > > Would you please help confirm whether @Paolo Bonzini's multiqueue
        > > feature change will benefit virtio-scsi or not? Thanks!
        > > 
        > > @Stefan Hajnoczi,
        > > I also spent some time exploring the virtio-scsi multi-queue
        > > feature via the num_queues parameter; here is what we found:
        > > 
        > > 1. Increasing the number of queues from one to the number of
        > > vCPUs gives a good IOPS increase.
        > > 2. Increasing the number of queues to a number (e.g. 8) larger
        > > than the number of vCPUs (e.g. 2) gives an even bigger IOPS
        > > increase.
        > 
        > As mentioned in the link below, when the number of hw queues is
        > larger than nr_cpu_ids, the blk-mq layer will limit it and use at
        > most nr_cpu_ids queues (e.g., /sys/block/sda/mq/).
        > 
        > That is, when num_queues=4 while there are 2 vcpus, there should be
        > only 2 queues available in /sys/block/sda/mq/
        > 
        > https://lore.kernel.org/lkml/address@hidden/
        > 
        > I am just curious how increasing num_queues from 2 to 4 would
        > double the IOPS, while there are only 2 vcpus available...
        
        I don't know the answer.  It's especially hard to guess without seeing
        the benchmark (fio?) configuration and QEMU command-line.
        
        Common things to look at are:
        
        1. Compare "iostat -dx 1" inside the guest and host.  Are the I/O
           patterns comparable?  blktrace(8) can give you even more detail on
           the exact I/O patterns.  If the guest and host have different I/O
           patterns (blocksize, IOPS, queue depth) then request merging or
           I/O scheduler effects could be responsible for the difference.
        
        2. kvm_stat or perf record -a -e kvm:\* counters for vmexits and
           interrupt injections.  If these counters vary greatly between queue
           sizes, then that is usually a clue.  It's possible to get higher
           performance by spending more CPU cycles although your system doesn't
           have many CPUs available, so I'm not sure if this is the case.
        
        3. Power management and polling (kvm.ko halt_poll_ns, tuned profiles,
           and QEMU iothread poll-max-ns).  It's expensive to wake a CPU when it
           goes into a low power mode due to idle.  There are several features
           that can keep the CPU awake or even poll so that request latency is
           reduced.  The reason why the number of queues may matter is that
           kicking multiple queues may keep the CPU awake more than batching
           multiple requests onto a small number of queues.
        
        Stefan
        
    




