Re: [Qemu-devel] Towards an ivshmem 2.0?


From: Jan Kiszka
Subject: Re: [Qemu-devel] Towards an ivshmem 2.0?
Date: Tue, 17 Jan 2017 10:46:17 +0100
User-agent: Mozilla/5.0 (X11; U; Linux i686 (x86_64); de; rv:1.8.1.12) Gecko/20080226 SUSE/2.0.0.12-1.1 Thunderbird/2.0.0.12 Mnenhy/0.7.5.666

On 2017-01-17 10:13, Wang, Wei W wrote:
> Hi Jan,
> 
> On Monday, January 16, 2017 9:10 PM, Jan Kiszka wrote:
>> On 2017-01-16 13:41, Marc-André Lureau wrote:
>>> On Mon, Jan 16, 2017 at 12:37 PM Jan Kiszka <address@hidden> wrote:
>>>     some of you may know that we are using a shared memory device similar to
>>>     ivshmem in the partitioning hypervisor Jailhouse [1].
>>>
>>>     We started out compatible with the original ivshmem that QEMU
>>>     implements, but we quickly deviated in some details, and even more so
>>>     in recent months. Some of the deviations are about keeping the
>>>     implementation simple - the new ivshmem takes <500 LoC, as Jailhouse is
>>>     aiming at safety-critical systems and, therefore, a small code base.
>>>     Other changes address deficits in the original design, like missing
>>>     life-cycle management.
>>>
>>>     Now the question is if there is interest in defining a common new
>>>     revision of this device and maybe also of some protocols used on top,
>>>     such as virtual network links. Ideally, this would enable us to share
>>>     Linux drivers. We will definitely go for upstreaming at least a network
>>>     driver such as [2], a UIO driver and maybe also a serial port/console.
>>>
>>>
>>> This sounds like duplicating efforts done with virtio and vhost-pci.
>>> Have you looked at Wei Wang proposal?
>>
>> I didn't follow it recently, but the original concept was about
>> introducing an IOMMU model to the picture, and that's complexity-wise a
>> no-go for us (we can do this whole thing in less than 500 lines, even
>> virtio itself is more complex). IIUC, the alternative to an IOMMU is
>> mapping the whole frontend VM memory into the backend VM - that's
>> security/safety-wise an absolute no-go.
> 
> Though the virtio-based solution might be complex for you, a big advantage
> is that we have lots of people working to improve virtio. For example, the
> upcoming virtio 1.1 brings vring improvements, so we can easily upgrade all
> the virtio-based solutions, such as vhost-pci, to take advantage of them.
> From a long-term perspective, I think this kind of complexity is worthwhile.

We will adopt virtio 1.1 ring formats. That's one reason why there is
also still a bidirectional shared memory region: to host the new
descriptors (while keeping the payload safely in the unidirectional
regions).
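
Roughly, the resulting region layout looks like this (just a sketch with
made-up names, not a spec):

  /* Sketch only - names are illustrative, not part of any spec. */
  #include <stddef.h>

  struct ivshmem2_regions {
          /* read/write for both peers: hosts the shared control
           * structures, e.g. the virtio 1.1 descriptor rings */
          void   *common_rw;
          size_t  common_size;

          /* writable locally, read-only for the remote peer:
           * payload of outgoing messages */
          void   *tx_payload;
          size_t  tx_size;

          /* read-only locally, writable by the remote peer:
           * payload of incoming messages */
          void   *rx_payload;
          size_t  rx_size;
  };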

> 
> We further have security features (e.g. vIOMMU) that can be applied to
> vhost-pci.

As pointed out, this is way too complex for us. A complete vIOMMU model
would easily add a few thousand lines of code to a hypervisor that tries
to stay below 10k LoC. Each line costs a lot of money when going for
certification. Plus I'm not even sure that there will always be
performance benefits, but that remains to be seen once both solutions
have matured.

> 
>>>
>>>     Deviations from the original design:
>>>
>>>     - Only two peers per link
>>>
>>>
>>> sounds sane, that's also what vhost-pci aims for afaik
>>>
>>>
>>>       This simplifies the implementation and also the interfaces (think of
>>>       life-cycle management in a multi-peer environment). Moreover, we do
>>>       not have an urgent use case for multiple peers, thus also no
>>>       reference for a protocol that could be used in such setups. If someone
>>>       else happens to share such a protocol, it would be possible to discuss
>>>       potential extensions and their implications.
>>>
>>>     - Side-band registers to discover and configure shared memory
>>>       regions
>>>
>>>       This was one of the first changes: We removed the memory regions from
>>>       the PCI BARs and gave them special configuration space registers. By
>>>       now, these registers are embedded in a PCI capability. The reasons are
>>>       that Jailhouse does not allow relocating the regions in guest address
>>>       space (but other hypervisors may, if they like) and that we now have
>>>       up to three of them.
>>>
>>>
>>>  Sorry, I can't comment on that.
>>>
>>>
>>>     - Changed PCI base class code to 0xff (unspecified class)
>>>
>>>       This allows us to define our own sub classes and interfaces. That is
>>>       now exploited for specifying the shared memory protocol the two
>>>       connected peers should use. It also allows the Linux drivers to match
>>>       on that.
>>>
>>>
>>> Why not, but it worries me that you are going to invent protocols
>>> similar to the virtio ones, aren't you?
>>
>> That partly comes with the desire to simplify the transport (pure shared
>> memory). With ivshmem-net, we are at least reusing virtio rings and will
>> try to do this with the new (and faster) virtio ring format as well.
>>
>>>
>>>
>>>     - INTx interrupt support is back
>>>
>>>       This is needed on target platforms without MSI controllers, i.e.
>>>       without the required guest support. Namely some PCI-less ARM SoCs
>>>       required the reintroduction. While doing this, we also took care of
>>>       keeping the MMIO registers free of privileged controls so that a
>>>       guest OS can map them safely into a guest userspace application.
>>>
>>>
>>> Right, it's not completely removed from ivshmem qemu upstream,
>>> although it should probably be allowed to set up a doorbell-ivshmem
>>> with msi=off (this may be quite trivial to add back)
>>>
>>>
>>>     And then there are some extensions of the original ivshmem:
>>>
>>>     - Multiple shared memory regions, including unidirectional ones
>>>
>>>       It is now possible to expose up to three different shared memory
>>>       regions: The first one is read/writable for both sides. The second
>>>       region is read/writable for the local peer and read-only for the
>>>       remote peer (useful for output queues). And the third is read-only
>>>       locally but read/writable remotely (i.e. for input queues).
>>>       Unidirectional regions prevent the receiver of some data from
>>>       interfering with the sender while it is still building the message -
>>>       a property that is useful not only for safety-critical communication,
>>>       we are sure.
>>>
>>>
>>> Sounds like a good idea, and something we may want in virtio too
> 
> Can you please explain more about the process of transferring a packet using 
> the three different memory regions?
> In the kernel implementation, the sk_buf can be allocated anywhere.

With shared-memory-backed communication, you will obviously have to copy
to, and sometimes also from, the communication regions. But you no
longer have to flip any mappings (or even give up on secure isolation).

That is why we have up to three regions: two unidirectional ones for the
payload and one for shared control structures or custom protocols. See
also above.
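
For illustration, a much-simplified sketch of the transmit path in such a
setup (the names and the fixed-slot layout below are made up, not the
actual ivshmem-net code):

  /* Much-simplified transmit sketch - not the actual ivshmem-net code.
   * tx_payload is the locally writable region, the descriptor ring
   * lives in the common read/write region, and kick() rings the
   * device's doorbell to notify the peer. */
  #include <stdint.h>
  #include <string.h>

  struct desc {
          uint32_t offset;        /* payload offset in tx_payload */
          uint32_t len;
  };

  extern uint8_t *tx_payload;     /* region 2: local R/W, remote RO */
  extern struct desc *ring;       /* in region 1 (common R/W)       */
  extern unsigned int ring_size;  /* number of slots in the ring    */
  extern unsigned int head;       /* next free slot                 */
  extern void kick(void);         /* hit the doorbell register      */

  static void send_packet(const void *data, uint32_t len)
  {
          uint32_t off = head * 2048;     /* fixed 2 KiB slots */

          /* Copy the payload out of the sender's private memory (e.g.
           * an sk_buff) into the unidirectional region. The receiver
           * only has a read-only mapping of it, so it cannot interfere
           * with the sender. */
          memcpy(tx_payload + off, data, len);

          /* Publish a descriptor in the shared ring and notify the
           * peer; the receiver then copies the data out on its side. */
          ring[head] = (struct desc){ .offset = off, .len = len };
          head = (head + 1) % ring_size;
          kick();
  }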

> 
> Btw, this looks similar to the memory access protection mechanism using EPTP 
> switching:
> Slide 25 
> http://www.linux-kvm.org/images/8/87/02x09-Aspen-Jun_Nakajima-KVM_as_the_NFV_Hypervisor.pdf
> The missing right side of the figure is an alternative EPT, which gives
> full access permission to the small piece of security code.

EPTP switching might be a nice optimization for scenarios where you have to
switch (but are its security problems resolved by now?), but a) we can
avoid switching and b) it's Intel-only while we need a generic solution
for all archs.

> 
>>>
>>>
>>>     - Life-cycle management via local and remote state
>>>
>>>       Each device can now signal its own state, in the form of a value, to
>>>       the remote side, which triggers an event there. Moreover, state changes
>>>       done by the hypervisor to one peer are signalled to the other side.
>>>       And we introduced a write-to-shared-memory mechanism for the
>>>       respective remote state so that guests do not have to issue an MMIO
>>>       access in order to check the state.
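
To make this concrete, a rough sketch from the guest's point of view
(register offset and names are invented for illustration, not the actual
layout): the guest publishes its own state through an MMIO register,
while the hypervisor mirrors the peer's state into a plain word in
shared memory, so checking it is an ordinary memory read.

  /* Sketch only - the offset and names are made up. */
  #include <stdint.h>

  #define IVSHMEM2_REG_STATE  0x10        /* hypothetical MMIO register */

  extern volatile uint32_t *mmio;             /* mapped register page */
  extern volatile const uint32_t *peer_state; /* word in shared memory,
                                                 kept up to date by the
                                                 hypervisor */

  /* Publish our own state; the hypervisor forwards it to the peer
   * and triggers an event there. */
  static inline void set_local_state(uint32_t state)
  {
          mmio[IVSHMEM2_REG_STATE / 4] = state;
  }

  /* Check the peer's state without issuing an MMIO access (and thus
   * without a VM exit): just read the mirrored word. */
  static inline uint32_t get_peer_state(void)
  {
          return *peer_state;
  }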
>>>
>>>
>>> There is also ongoing work to better support disconnect/reconnect in
>>> virtio.
>>>
>>>
>>>
>>>     So, this is our proposal. It would be great to hear some opinions on
>>>     whether you see value in adding support for such an "ivshmem 2.0" device
>>>     to QEMU as well and expanding its ecosystem towards Linux upstream, maybe
>>>     also DPDK again. If you see problems in the new design w.r.t. what QEMU
>>>     provides so far with its ivshmem device, let's discuss how to resolve
>>>     them. Looking forward to any feedback!
>>>
>>>
>>> My feeling is that ivshmem is not being actively developed in qemu,
>>> but rather virtio-based solutions (vhost-pci for vm2vm).
>>
>> As pointed out, for us it's most important to keep the design simple -
>> even at the price of "reinventing" some drivers for upstream (at least,
>> we do not need two sets of drivers because our interface is fully
>> symmetric). I don't see yet how vhost-pci could achieve the same, but
>> I'm open to learning more!
> 
> Maybe I didn’t fully understand this - "we do not need two sets of drivers 
> because our interface is fully symmetric"?

We have no backend/frontend drivers. While vhost-pci can reuse virtio
frontend drivers, it still requires new backend drivers. We use the same
drivers on both sides - it's just symmetric. That also simplifies
arguing about non-interference because both sides have equal
capabilities.
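
To illustrate what "symmetric" means for the drivers (the values below
are placeholders, not the final encoding): because the device identifies
the shared memory protocol via the vendor-defined class code, the very
same Linux driver matches and binds on both peers - there is no frontend
or backend variant.

  /* Sketch of a symmetric driver match - class/sub-class values are
   * placeholders, not the final encoding. */
  #include <linux/module.h>
  #include <linux/pci.h>

  #define IVSHMEM2_PROTO_NET  0x01   /* hypothetical sub-class for the
                                        virtio-ring based network protocol */

  static const struct pci_device_id ivshmem2_net_ids[] = {
          /* base class 0xff (unspecified), sub class = protocol,
           * any programming interface */
          { PCI_DEVICE_CLASS(0xff0000 | (IVSHMEM2_PROTO_NET << 8), 0xffff00) },
          { }
  };
  MODULE_DEVICE_TABLE(pci, ivshmem2_net_ids);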

> 
> The vhost-pci driver is a standalone network driver from the local guest's
> point of view - it's no different from any other network driver in the
> guest. In terms of usage, it's used together with another VM's virtio
> device - would this be the "two sets of drivers" that you meant? I think
> this is pretty natural and reasonable, as it is essentially vm-to-vm
> communication. Furthermore, we are able to dynamically create/destroy and
> hot-plug in/out a vhost-pci device based on runtime requests.

Hotplugging works with shared memory devices as well. We don't use it
during runtime of the hypervisor due to safety constraints, but devices
show up and disappear in the root cell (the primary Linux) as the
hypervisor starts or stops.

Jan

-- 
Siemens AG, Corporate Technology, CT RDA ITP SES-DE
Corporate Competence Center Embedded Linux


