From: Vincenzo Maffione
Subject: Re: [Qemu-devel] [PATCH v2 0/3] virtio: proposal to optimize accesses to VQs
Date: Wed, 30 Dec 2015 17:45:01 +0100

2015-12-16 16:46 GMT+01:00 Paolo Bonzini <address@hidden>:
>
>
> On 16/12/2015 15:25, Vincenzo Maffione wrote:
>>> vhost-net actually had better performance, so virtio-net dataplane
>>> was never committed.  As Michael mentioned, in practice on Linux you
>>> use vhost, and non-Linux hypervisors you do not use QEMU. :)
>>
>> Yes, I understand. However, another possible use-case would be using
>> QEMU + virtio-net + netmap backend + Linux (e.g. for QEMU-sandboxed
>> packet generators or packet processors, where very high packet rates
>> are common), where it is not possible to use vhost.
>
> Yes, of course.  That was tongue in cheek.  Another possibility for your
> use case is to interface with netmap through vhost-user, but I'm happy
> if you choose to improve virtio.c instead!
>
>>> The main optimization that vring.c has is to cache the translation of
>>> the rings.  Using address_space_map/unmap for rings in virtio.c would be
>>> a noticeable improvement, as your numbers for patch 3 show.  However, by
>>> caching translations you also conveniently "forget" to promptly mark the
>>> pages as dirty.  As you pointed out this is obviously an issue for
>>> migration.  You can then add a notifier for runstate changes.  When
>>> entering RUN_STATE_FINISH_MIGRATE or RUN_STATE_SAVE_VM the rings would
>>> be unmapped, and then remapped the next time the VM starts running again.
>>
>> Ok so it seems feasible with a bit of care. The numbers we've been
>> seeing in various experiments have always shown that this optimization
>> could easily double the 2 Mpps packet rate bottleneck.
>
> Cool.  Bonus points for nicely abstracting it so that virtio.c is just a
> user.
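
For the record, this is roughly what I have in mind for the cached
mapping plus the runstate notifier. It is only a sketch, not actual
virtio.c code: the struct and function names are invented, and the
details of where the cache lives would of course change.

/* Hypothetical cache of a vring translation (names are made up). */
typedef struct VRingMapCache {
    AddressSpace *as;
    void *ptr;        /* cached host pointer to the ring */
    hwaddr len;       /* length actually mapped */
} VRingMapCache;

static void *vring_map_cached(VRingMapCache *c, AddressSpace *as,
                              hwaddr pa, hwaddr len)
{
    if (!c->ptr) {
        c->as = as;
        c->len = len;
        /* is_write=true so the pages are marked dirty when unmapped */
        c->ptr = address_space_map(as, pa, &c->len, true);
    }
    return c->ptr;
}

static void vring_unmap_cached(VRingMapCache *c)
{
    if (c->ptr) {
        address_space_unmap(c->as, c->ptr, c->len, true, c->len);
        c->ptr = NULL;
    }
}

/* Drop the cached mapping when the VM stops (FINISH_MIGRATE, SAVE_VM,
 * ...), so that the dirty bitmap seen by migration is up to date; the
 * ring is remapped lazily the next time the VM runs and the device
 * touches it. */
static void vring_vm_state_change(void *opaque, int running, RunState state)
{
    VRingMapCache *c = opaque;

    if (!running) {
        vring_unmap_cached(c);
    }
}

/* at device init: qemu_add_vm_change_state_handler(vring_vm_state_change, c); */
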
>
>>> You also guessed right that there are consistency issues; for these you
>>> can add a MemoryListener that invalidates all mappings.
>>
>> Yeah, but I don't know exactly what kind of inconsistencies there can
>> be. Maybe the memory we are mapping may be hot-unplugged?
>
> Yes.  Just blow away all mappings in the MemoryListener commit callback.
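
Ok, something like this, I guess (again just a sketch;
virtio_invalidate_ring_mappings() is a hypothetical helper that would
unmap every cached ring translation):

/* Hypothetical: drop every cached ring translation whenever the guest
 * memory map changes (e.g. memory hot-unplug), from the MemoryListener
 * commit callback. */
static void virtio_map_cache_commit(MemoryListener *listener)
{
    virtio_invalidate_ring_mappings();  /* hypothetical helper: unmap all */
}

static MemoryListener virtio_map_cache_listener = {
    .commit = virtio_map_cache_commit,
};

/* at init: memory_listener_register(&virtio_map_cache_listener,
 *                                   &address_space_memory); */
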
>
>>> That said, I'm wondering where the cost of address translation lies---is
>>> it cache-unfriendly data structures, locked operations, or simply too
>>> much code to execute?  It was quite surprising to me that on virtio-blk
>>> benchmarks we were spending 5% of the time doing memcpy! (I have just
>>> extracted from my branch the patches to remove that, and sent them to
>>> qemu-devel).
>>
>> I feel it's just too much code (but I may be wrong).
>
> That is likely to be a good guess, but notice that the fast path doesn't
> actually have _that much_ code, because a lot of the "if"s are almost
> always false.
>
> Looking at a profile would be useful.  Is it flat, or does something
> (e.g. address_space_translate) actually stand out?

I'm so sorry, I forgot to answer this.

This is what perf top shows while doing the experiment

  12.35%  qemu-system-x86_64       [.] address_space_map
  10.87%  qemu-system-x86_64       [.] vring_desc_read.isra.0
   7.50%  qemu-system-x86_64       [.] address_space_lduw_le
   6.32%  qemu-system-x86_64       [.] address_space_translate
   5.84%  qemu-system-x86_64       [.] address_space_translate_internal
   5.75%  qemu-system-x86_64       [.] phys_page_find
   5.74%  qemu-system-x86_64       [.] qemu_ram_block_from_host
   4.04%  qemu-system-x86_64       [.] address_space_stw_le
   4.02%  qemu-system-x86_64       [.] address_space_write
   3.33%  qemu-system-x86_64       [.] virtio_should_notify

So it seems that most of the time is spent doing address translations.

>
>> I'm not sure whether you are thinking that 5% is too much or too little.
>> To me it's too little, showing that most of the overhead is somewhere
>> else (e.g. memory translation, or backend processing). In an ideal
>> transmission system, most of the overhead should be spent on copying,
>> because it means that you successfully managed to suppress
>> notifications and translation overhead.
>
> On copying data, though---not on copying virtio descriptors.  5% for
> those is entirely wasted time.
>
> Also, note that I'm looking at disk I/O rather than networking, where
> there should be no copies at all.

In the experiment I'm doing there is a per-packet copy from the guest
memory to the netmap backend.

Cheers,
  Vincenzo

>
> Paolo
>
>>> Examples of missing optimizations in exec.c include:
>>>
>>> * caching enough information in RAM MemoryRegions to avoid the calls to
>>> qemu_get_ram_block (e.g. replace mr->ram_addr with a RAMBlock pointer);
>>>
>>> * adding a MRU cache to address_space_lookup_region.
>>>
>>> In particular, the former should be easy if you want to give it a
>>> try---easier than caching ring translations in virtio.c.
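
The MRU idea would look something like this, I guess (only a sketch: the
mru_section field is invented and I haven't checked how it would really
fit into exec.c; the fallback is meant to be the existing
address_space_lookup_region()/phys_page_find() path):

/* Hypothetical MRU cache in front of the section lookup: remember the
 * last MemoryRegionSection returned and reuse it when the next physical
 * address falls inside it, skipping the radix-tree walk. */
static MemoryRegionSection *lookup_region_mru(AddressSpaceDispatch *d,
                                              hwaddr addr)
{
    MemoryRegionSection *section = d->mru_section;  /* hypothetical field */

    if (section && section->mr &&
        addr >= section->offset_within_address_space &&
        addr - section->offset_within_address_space <
            int128_get64(section->size)) {
        return section;                 /* MRU hit: no phys_page_find() */
    }

    /* miss: fall back to the existing slow-path lookup and remember it */
    section = address_space_lookup_region(d, addr, false);
    d->mru_section = section;
    return section;
}
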
>>
>> Thank you so much for the insights :)
>



-- 
Vincenzo Maffione


