From: Jag Raman
Subject: Re: [Qemu-devel] [multiprocess RFC PATCH 36/37] multi-process: add the concept description to docs/devel/qemu-multiprocess
Date: Tue, 11 Jun 2019 11:53:05 -0400
User-agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.9.1



On 5/23/2019 6:40 AM, Stefan Hajnoczi wrote:
On Tue, May 07, 2019 at 03:00:52PM -0400, Jag Raman wrote:
Hi Stefan,

Thank you very much for your feedback. Following is a summary of the
discussions our team had regarding your feedback.

On 4/25/2019 11:44 AM, Stefan Hajnoczi wrote:

Can multiple LSI SCSI controllers be launched such that each process
only has access to a subset of disk images?  Or is the disk image label
per-VM so that there is no isolation between LSI SCSI controller
processes for that VM?

Yes, it is possible to provide each process with access to a subset of
disk images. The Orchestrator (libvirt, etc.) assigns a set of MCS
Categories to each VM, then device instances can be isolated by being
assigned a subset of the VM’s Categories.
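
To make the category idea concrete, here is a minimal sketch of the MCS dominance check, assuming the usual rule that a process may access a file only when it holds every category on the file's label. The category names (c10, c20) and the helper can_access() are hypothetical and only model the rule; they are not an SELinux or libvirt API.

```python
# Toy model of MCS category dominance; not an SELinux API.
def can_access(process_categories, file_categories):
    """A process may access a file only if it holds every category on the file."""
    return set(file_categories) <= set(process_categories)


# Hypothetical labels: the orchestrator assigns the VM two categories,
# and each device process is given a subset of them.
vm_categories = {"c10", "c20"}
lsi_a = {"c10"}   # first LSI SCSI controller process
lsi_b = {"c20"}   # second LSI SCSI controller process
disk_a = {"c10"}  # image served by the first controller
disk_b = {"c20"}  # image served by the second controller

print(can_access(lsi_a, disk_a))          # True
print(can_access(lsi_a, disk_b))          # False
print(can_access(vm_categories, disk_b))  # True
```

In this model each controller process reaches only its own image, while a process holding all of the VM's categories can reach both.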


My concern with this overall approach is the practicality vs its
benefits.  Regarding practicality, each emulated device needs to be
proxied separately.  The QEMU subsystem used by the device also needs to
be proxied.  Global state, monitor commands, and live migration all
require code changes to support proxied operation.  This is very
invasive.

Then each emulated device needs an SELinux policy to achieve the
benefits of confinement.  I have no idea how to correctly write a policy
like this and it's likely that developers who contribute a single new
device will not be proficient in it either.  Writing these policies is a
rare thing and few people will be good at this.  It also makes me worry
about how we test and review them.

We also think that having an SELinux policy per device would become
complicated. Our proposal, therefore, is to define SELinux policies for
each device class - viz. disk, network, console, graphics, etc.
The upstream "fedora-selinux" repo [1] will contain these policies, so the
device developer doesn't have to worry about defining new policies for
each device. This proposal would diminish the complexity of SELinux
policies.

Have you considered using Linux namespaces?  I'm beginning to think that
SELinux becomes less relevant with pid and mount namespaces to isolate
processes.  The advantage of namespaces is that they are easy to
understand and can be expressed in code instead of a policy file in a
separate package.  This is the approach we're taking with virtiofsd
(vhost-user device backend for virtio-fs).


Despite the efforts required in making this work, all processes still
effectively have full access to the guest since they can access guest
RAM.  What I mean is that the device is actually not confined to its
host process (e.g. LSI SCSI controller process) because it can write
code to executable guest RAM pages.  The guest will then execute that
code and therefore all guest I/O (networking, disk, etc) is still
available indirectly to the "confined" processes.  They are not really
sandboxed from the outside world, regardless of how strict the SELinux
policy is :(.

There are performance issues due to proxying as well, but let's ignore
them for now and focus on security.

We are also focusing on performance. Please take a look at the following
blog for an initial report on performance. The results are for an iSCSI
backend in Oracle Cloud. We are working on collecting data on a much
heavier IOPS workload like an NVMe backend.

https://blogs.oracle.com/linux/towards-a-more-secure-qemu-hypervisor%2c-part-3-of-3-v2

Hard to reach a conclusion without also looking at CPU utilization.
IOPS alone don't tell the story.

If the system had spare CPU cycles then the performance results between
built-in LSI and separate LSI will be similar but the efficiency
(IOPS/CPU%) has actually decreased due to the extra CPU cycles required
to forward the hardware register access to the device emulation process.

If you rerun on a system without spare CPU cycles then IOPS degradation
would become apparent.  I'm not saying this is necessarily the case,
maybe the overhead really doesn't have a significant effect, but the
graph shown in the blog post isn't enough to draw a conclusion either
way.
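
The efficiency argument above can be illustrated with a small calculation. The numbers below are made up purely for illustration and are not measurements from the blog post or from any system; efficiency() is a hypothetical helper.

```python
def efficiency(iops, cpu_percent):
    """IOPS delivered per percentage point of host CPU consumed."""
    if cpu_percent <= 0:
        raise ValueError("cpu_percent must be positive")
    return iops / cpu_percent


# Made-up numbers, for illustration only; these are not measurements.
# Both configurations reach the same IOPS because the host has spare CPU,
# but the separate process burns extra cycles forwarding register accesses.
builtin = efficiency(iops=50_000, cpu_percent=40.0)
separate = efficiency(iops=50_000, cpu_percent=50.0)

print(builtin)   # 1250.0
print(separate)  # 1000.0
```

With identical IOPS the two setups look the same on an IOPS-only graph, yet the IOPS/CPU% figure differs, which is why CPU utilization has to be reported alongside IOPS.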

Hi Stefan,

We are working on getting a better idea about the CPU utilization while
the performance test is running. We're looking forward to discussing
this during the forthcoming KVM meeting.

Thank you!
--
Jag


Regarding the proposed QEMU bypass, these already exist in some form via
kvm.ko's ioeventfd and coalesced MMIO features.

Today ioeventfd is only used for performance-critical hardware
registers, so kvm.ko doesn't use a sophisticated dispatch mechanism.  If
you want to use it for all hardware register accesses handled by a
separate process then ioeventfd probably needs to be tweaked somewhat to
make it more scalable for that case.

Coalesced MMIO is also cool.  kvm.ko can accumulate guest MMIO writes in
a buffer that is only collected at a later point in time.  This improves
performance for devices that require multiple hardware register writes
to kick off an I/O operation (only the last one really needs to be
trapped by the device emulation code!).  This sounds similar to an MMIO
access shared ring buffer.
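
As a rough illustration of the coalescing idea, here is a toy model in which register writes are buffered and handed to the device emulation code in one batch when a trapping ("doorbell") write arrives. This is only a sketch under assumed names (CoalescedMmio, trapping_addrs, the register offsets); it does not reflect kvm.ko's actual data structures or ioctl interface.

```python
class CoalescedMmio:
    """Toy model: buffer register writes, deliver them on a trapping write."""

    def __init__(self, trapping_addrs, handler, capacity=64):
        self.trapping_addrs = set(trapping_addrs)
        self.handler = handler    # called with a list of (addr, value) pairs
        self.capacity = capacity  # flush early if the buffer fills up
        self.pending = []
        self.exits = 0            # how many times device code was invoked

    def write(self, addr, value):
        self.pending.append((addr, value))
        # Flush when the write must be trapped or the buffer is full.
        if addr in self.trapping_addrs or len(self.pending) >= self.capacity:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        batch, self.pending = self.pending, []
        self.exits += 1
        self.handler(batch)


# Hypothetical register layout: 0x0C is the doorbell that starts the I/O.
seen = []
mmio = CoalescedMmio(trapping_addrs={0x0C}, handler=seen.append)
mmio.write(0x00, 0x1000)  # buffered
mmio.write(0x04, 512)     # buffered
mmio.write(0x08, 1)       # buffered
mmio.write(0x0C, 1)       # trapped: the whole batch is delivered

print(mmio.exits)    # 1
print(len(seen[0]))  # 4
```

Four register writes cost a single trip into the device emulation code here, instead of four.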


How do the benefits compare against today's monolithic approach?  If the
guest exploits monolithic QEMU it has full access to all host files and
APIs available to QEMU.  However, these are largely just the resources
that belong to the guest anyway - not resources we are trying to keep
away from the guest.  With multi-process QEMU each process still has
access to all guest interfaces via the code injection I mentioned above,
but the SELinux policy could restrict access to some resources.  But
this benefit is really small in my opinion, given that the resources
belong to the guest anyway and the guest can already access them.

The primary focus of our project is to defend the host from a malicious
guest. The code injection problem you outlined above involves part of
the guest attacking itself, but not the host. Therefore, this wouldn't
compromise our objective.

As you know, there are some parts of QEMU which are not directly
accessible from the guest (via drivers, etc.), which we prefer to call
the control plane. It executes ioctls to the host kernel and has access
to a broader set of syscalls, which the device emulation code doesn’t
need. We want to protect the control plane from emulated devices. In the
case where a device injects code into the RAM to attack another device
on the same VM, the control plane would still be protected.

Are you aware of any cases where the syscall attack surface led to an
exploitable bug in QEMU?  Any proof-of-concept exploit code or a CVE?

Another benefit of the project is the detection and reporting of
failures in the emulated devices. For instance, in cases like
CVE-2018-18849, where an emulated device hangs or crashes, the failure
wouldn't directly crash the QEMU process as well. QEMU could detect the
failure, log the problem and exit, instead of hanging or dumping core.

Debugging is a lot easier with a coredump though :).  I would rather
have a coredump than a nice message that says "LSI died".


I think you can implement this for a handful of devices as a one-time
thing, but the invasiveness and the impracticality of getting wide
coverage of QEMU make this approach questionable.

Am I mistaken about the invasiveness or impracticality?

We are not planning to implement this for all devices since it would be
impractical. But the project adds a framework for implementing more
devices in the future.

One other thing we would like to bring your attention to is that the
project doesn't affect the current usage. The same devices could still
be used as part of monolithic QEMU if the user chooses to do so.

I don't follow, to me this proposal seems extremely invasive and
requires awareness from all developers.

QEMU contains global state (like net/net.c:net_clients or
block.c:all_bdrv_states) and QMP commands that access global state.  All
of this needs to be carefully proxied to avoid losing functionality as
fundamental as the QMP monitor.

This is what worries me about this project.  There are amazing niche
features like record/replay that have been integrated into QEMU without
requiring all developers to be aware of how they work.  If you can
achieve this then I would have no reservations.

Right now I don't see that this will be possible and that's why I'm
challenging you to justify that the reduction in system call attack
surface is actually worth the invasive changes required.

Do you see a way to solve the issues I've mentioned?

Stefan



