qemu-devel

Re: [PATCH v2 2/4] virtio-scsi: default num_queues to -smp N


From: Michael S. Tsirkin
Subject: Re: [PATCH v2 2/4] virtio-scsi: default num_queues to -smp N
Date: Mon, 3 Feb 2020 07:53:20 -0500

On Mon, Feb 03, 2020 at 12:39:49PM +0100, Sergio Lopez wrote:
> On Mon, Feb 03, 2020 at 10:57:44AM +0000, Daniel P. Berrangé wrote:
> > On Mon, Feb 03, 2020 at 11:25:29AM +0100, Sergio Lopez wrote:
> > > On Thu, Jan 30, 2020 at 10:52:35AM +0000, Stefan Hajnoczi wrote:
> > > > On Thu, Jan 30, 2020 at 01:29:16AM +0100, Paolo Bonzini wrote:
> > > > > On 29/01/20 16:44, Stefan Hajnoczi wrote:
> > > > > > On Mon, Jan 27, 2020 at 02:10:31PM +0100, Cornelia Huck wrote:
> > > > > >> On Fri, 24 Jan 2020 10:01:57 +0000
> > > > > >> Stefan Hajnoczi <address@hidden> wrote:
> > > > > >>> @@ -47,10 +48,15 @@ static void 
> > > > > >>> vhost_scsi_pci_realize(VirtIOPCIProxy *vpci_dev, Error **errp)
> > > > > >>>  {
> > > > > >>>      VHostSCSIPCI *dev = VHOST_SCSI_PCI(vpci_dev);
> > > > > >>>      DeviceState *vdev = DEVICE(&dev->vdev);
> > > > > >>> -    VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(vdev);
> > > > > >>> +    VirtIOSCSIConf *conf = &dev->vdev.parent_obj.parent_obj.conf;
> > > > > >>> +
> > > > > >>> +    /* 1:1 vq to vcpu mapping is ideal because it avoids IPIs */
> > > > > >>> +    if (conf->num_queues == VIRTIO_SCSI_AUTO_NUM_QUEUES) {
> > > > > >>> +        conf->num_queues = current_machine->smp.cpus;
> > > > > > >> This now maps the request vqs 1:1 to the vcpus. What about the fixed
> > > > > > >> vqs? If they don't really matter, amend the comment to explain that?
> > > > > > > The fixed vqs don't matter.  They are typically not involved in the data
> > > > > > > path, only the control path where performance doesn't matter.
> > > > > 
> > > > > Should we put a limit on the number of vCPUs?  For anything above ~128
> > > > > the guest is probably not going to be disk or network bound.
> > > > 
> > > > Michael Tsirkin pointed out there's a hard limit of VIRTIO_QUEUE_MAX
> > > > (1024).  We need to at least stay under that limit.
> > > > 
> > > > Should the guest have >128 virtqueues?  Each virtqueue requires guest
> > > > RAM and 2 host eventfds.  Eventually these resource requirements will
> > > > become a scalability problem, but how do we choose a hard limit and what
> > > > happens to guest performance above that limit?
> > > 
> > > From the UX perspective, I think it's safer to use a rather low upper
> > > limit for the automatic configuration.
> > > 
> > > Users of large VMs (>=32 vCPUs) aiming for optimal performance
> > > already need to manually tune (or rely on software to tune for
> > > them) other aspects of the VM, like vNUMA, IOThreads and
> > > CPU pinning, so I don't think we should focus on this group.
> > 
> > Whether they're tuning manually, or relying on software to tune for
> > them, we (QEMU maintainers) still need to provide credible guidance
> > on what to do with tuning for large CPU counts. Without clear info
> > from QEMU, it just descends into hearsay and guesswork, both of which
> > leave QEMU looking bad.
> 
> I agree. Good documentation, ideally with some benchmarks, and safe
> defaults sound like a good approach to me.
> 
> > So I think we need to, at the very least, make a clear statement here
> > about what tuning approach should be applied when the vCPU count gets
> > high, and probably even apply that as the default out-of-the-box approach.
> 
> In general, I would agree, but in this particular case the
> optimization has an impact on something outside QEMU's control (the
> host's resources), so we lack the information needed to make a proper guess.
> 
> My main concern here is users upgrading QEMU and hitting some kind of
> crash or performance issue, without having touched their VM config. And
> let's not forget that Stefan said in the cover letter that this amounts
> to a 1-4% improvement on 4k operations on an SSD, and I guess that's with
> iodepth=1. I suspect with a larger block size and/or higher iodepth
> the improvement will be barely noticeable, which means it'll only have
> a positive impact on users running DB/OLTP or similar workloads on
> dedicated, directly attached, low-latency storage.
> 
> But don't get me wrong, this is a *good* optimization. It's just I
> think we should play safe here.
> 
> Sergio.

Yeah, I think a bit more benchmarking, with more than 4 vcpus
so that we can at least see the trend, can't hurt.
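
For illustration only, and not part of the patch: a minimal C sketch of how
the automatic default discussed above could be clamped. VIRTIO_QUEUE_MAX
(1024) and the cost of 2 host eventfds per virtqueue come from this thread.
The soft cap of 32, the count of 2 fixed vqs, and the names
AUTO_NUM_QUEUES_CAP, FIXED_VQS, auto_num_queues() and host_eventfds() are
assumptions made up for this example, not QEMU code.

```c
/* Hard per-device virtqueue limit mentioned in the thread. */
#define VIRTIO_QUEUE_MAX 1024

/* Assumption: virtio-scsi has 2 fixed vqs (control and event). */
#define FIXED_VQS 2

/* Hypothetical soft cap for the automatic default; 32 is only a placeholder. */
#define AUTO_NUM_QUEUES_CAP 32

/*
 * Pick the number of request vqs for the "auto" setting:
 * one per vCPU, limited by the soft cap and by the hard limit
 * once the fixed vqs are accounted for, and never fewer than 1.
 */
static unsigned auto_num_queues(unsigned smp_cpus)
{
    unsigned n = smp_cpus;

    if (n > AUTO_NUM_QUEUES_CAP) {
        n = AUTO_NUM_QUEUES_CAP;
    }
    if (n > VIRTIO_QUEUE_MAX - FIXED_VQS) {
        n = VIRTIO_QUEUE_MAX - FIXED_VQS;
    }
    if (n == 0) {
        n = 1;
    }
    return n;
}

/* Host eventfds consumed by one device: 2 per virtqueue, fixed vqs included. */
static unsigned host_eventfds(unsigned num_request_queues)
{
    return 2 * (num_request_queues + FIXED_VQS);
}
```

With these placeholder values a 4-vCPU guest keeps the 1:1 mapping, while a
128-vCPU guest is held at the cap. Where the cap should actually sit is
exactly what the extra benchmarking would need to show.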



