
Re: [Qemu-devel] Block I/O optimizations


From: Stefan Hajnoczi
Subject: Re: [Qemu-devel] Block I/O optimizations
Date: Thu, 28 Feb 2013 15:43:04 +0100
User-agent: Mutt/1.5.21 (2010-09-15)

On Wed, Feb 27, 2013 at 05:25:49PM +0200, Abel Gordon wrote:
> 
> 
> Stefan Hajnoczi <address@hidden> wrote on 26/02/2013 06:45:30 PM:
> 
> 
> > > But is this significantly different than any other security bug in
> > > the host, qemu, kvm...? If you perform the I/O virtualization in a
> > > separate (not qemu) process, you have a significantly smaller,
> > > self-contained and bounded trusted computing base (TCB) from a
> > > source code perspective, as opposed to a single huge user-space
> > > process where it's very difficult to define boundaries and find
> > > potential security holes.
> >
> > I disagree here.
> >
> > The QEMU process is no more privileged than guest ring 0.  It can only
> > mess with resources that the guest itself has access to (CPU, disk,
> > network).
> >
> > The QEMU process cannot access other guests.  SELinux locks it down so
> > it cannot access host files or other resources.
> 
> I see your point, but the shared process only needs access to
> the virtio rings/buffers (not necessarily the entire memory of
> all the guests), the network sockets, and the image files opened by
> all the qemu user-space processes. So, if you have a security hole,
> an attacker can gain access only to these resources.

You must never be able to get access to other VMs' disk/memory/network.
That is game over:
 * Disk - you can steal their data or tamper with it.
 * Memory - same as disk really, because you can inject code to do
   anything you want.
 * Network - you can spoof the guest or monitor its traffic.  Although
   due to crypto this is the least dangerous of the three resources.

> With the traditional model (not shared thread), if you have a security
> hole in qemu then an attacker will be able to exploit exactly the same
> security hole to obtain access to the resources "all the qemu instances"
> have access to. I don't see why a security hole in qemu would work only
> for VM1 and not VM2...they are hosted using exactly the same qemu code.

Most QEMU security holes will not be remotely exploitable.  Even if an
attacker has access to VM1 (they are renting a VM on your cloud), they
cannot get access to VM2 due to isolation.

> If you move the virtio back-end from qemu to a different user-space
> process, it will be easier to analyze and maintain the code, and to
> detect security bugs.

I agree with this to some extent.  It's the micro-kernel vs monolithic
kernel debate.  In theory micro-kernel is a nicer design.

> Maybe you can also use this model to improve security:
> you can give access to the network/disk only to the shared virtio
> back-end process and not to the qemu processes...

That's not possible to achieve as long as the QEMU process has the guest
memory and can control guest execution.  QEMU could inject guest code
that accesses the disk.

> > > Sounds interesting... however, once the userspace thread runs, the
> > > driver loses control (assuming you don't have spare cores).
> > > I mean, a userspace I/O thread will probably consume all
> > > its time slice while the driver may prefer to assign less (or more)
> > > cycles to a specific I/O thread based on the ongoing activity of
> > > all the VMs.
> > >
> > > Using a shared thread, you can optimize how virtual/emulated I/O is
> > > scheduled while you actually don't modify the kernel scheduler
> > > code.
> >
> > Can you explain details on fine-grained I/O scheduling or post some
> > code?
> 
> OK, I'll try with simple pseudocode to exemplify the idea.
> If you use a thread per device, the tx/request path for each
> thread (a different qemu process) will probably look like:
> 
> while (!stop) {
>    wait_for_queue_data() /* for a specific virtual device of a VM */
>    while (queue_has_data() && !stop) {
>       request = dequeue_request()
>       process(request)
>    }
> }
> 
> 
> Now, if you use a shared thread with fine-grained I/O scheduling, the
> code will look like:
> 
> while (!stop) {
>    /* select among all the virtual devices of all the VMs */
>    queue = select_queue_to_process()
>    while (queue_has_data(queue) && !should_change_queue()) {
>       request = dequeue_request(queue)
>       process(request)
>    }
> }
> 
> The should_change_queue function will return true based on:
> (1) the number of requests the thread handled for the latest processed
> queue
> (2) the number of requests pending in all the other queues
> (3) the age of the oldest/newest request in each of the other queues
> (4) priorities between queues
> 
> The select_queue_to_process function will select a queue based on:
> (1) the age of the oldest/newest request in each queue
> (2) priorities between queues
> (3) the average throughput/latency per queue
> 
> The logic that selects which queue should be processed, and how many
> requests should be processed from this specific queue, is implemented
> in user-space and depends on the ongoing I/O activity in all the
> queues. Note that with this model you can process many queues in less
> than one scheduler time slice.
> With the traditional thread-per-device model, it is actually the Linux
> scheduler that decides which queue will be processed and the number of
> requests that will be processed for this specific queue (cycles the
> thread runs). Note that Linux has no information about the ongoing
> activity and status of the queues. The scheduler only knows if a
> thread is waiting (empty queue) or is ready to run (queue has data =
> event signaled).
> 
> Finally, with the shared-thread model you have significantly fewer
> thread/process context switches compared to one I/O thread per qemu
> process.

This final point is the only one that I can 100% agree with.
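
For concreteness, that quoted pseudocode might be fleshed out roughly
as below.  This is only a sketch of the idea, not QEMU code: every type
and helper name (io_queue, dequeue, wait_for_any_queue, ...) is
illustrative, and the throughput/latency feedback of criterion (3) of
select_queue_to_process is left out.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BUDGET_PER_QUEUE 32   /* hard cap on requests per queue per turn */
#define MAX_QUEUES       64

struct io_request;

struct io_queue {
    int prio;            /* static priority between queues */
    uint64_t oldest_ns;  /* arrival time of the oldest pending request */
    size_t pending;      /* number of requests waiting in this queue */
};

static struct io_queue queues[MAX_QUEUES];
static size_t nr_queues;
static volatile bool stop;

/* Assumed to exist elsewhere: dequeue() updates q->pending/q->oldest_ns,
 * wait_for_any_queue() blocks until some device signals work. */
extern struct io_request *dequeue(struct io_queue *q);
extern void process(struct io_request *req);
extern void wait_for_any_queue(void);

/* Prefer higher-priority queues; break ties by oldest pending request. */
static struct io_queue *select_queue_to_process(void)
{
    struct io_queue *best = NULL;
    size_t i;

    for (i = 0; i < nr_queues; i++) {
        struct io_queue *q = &queues[i];
        if (!q->pending) {
            continue;
        }
        if (!best || q->prio > best->prio ||
            (q->prio == best->prio && q->oldest_ns < best->oldest_ns)) {
            best = q;
        }
    }
    return best;
}

/* Yield once the budget is spent or a higher-priority queue has work. */
static bool should_change_queue(struct io_queue *cur, size_t handled)
{
    size_t i;

    if (handled >= BUDGET_PER_QUEUE) {
        return true;
    }
    for (i = 0; i < nr_queues; i++) {
        struct io_queue *q = &queues[i];
        if (q != cur && q->pending && q->prio > cur->prio) {
            return true;
        }
    }
    return false;
}

void shared_io_thread(void)
{
    while (!stop) {
        struct io_queue *q = select_queue_to_process();
        if (!q) {
            wait_for_any_queue();   /* all queues empty: block */
            continue;
        }
        size_t handled = 0;
        while (q->pending && !should_change_queue(q, handled)) {
            process(dequeue(q));
            handled++;
        }
    }
}

The point of the sketch is that all of the policy lives in one
user-space loop, so many queues can be served within a single scheduler
time slice.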

Everything else can be handled by a system that is designed to use the
Linux scheduler rather than bypass it:

1. Use a budget to set a hard limit on the amount of resources to expend
   per queue per iteration.  (We don't do this today.)
2. Use an I/O resource controller (cgroups blkio controller or QEMU I/O
   throttling, which are both supported today) to set a per-guest
   quality of service (max IOPS, max bandwidth, priorities).  Note that
   this isn't about CPU scheduling, it's about I/O request scheduling.
3. Choose an appropriate I/O scheduler on the host (e.g. deadline) to
   meet your requirements.  This is possible today.
4. Use thread priorities to favor specific guests.  (We don't do this
   today; a sketch of what it could look like follows below.)
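
As a rough illustration of point 4 (and of leaving the real I/O
scheduling to the host, as in points 2 and 3), a per-guest I/O thread
could set its own CPU nice value and block-layer priority with the
standard Linux interfaces setpriority(2) and ioprio_set(2); the
per-guest policy values are invented for the example:

#define _GNU_SOURCE
#include <sys/resource.h>
#include <sys/syscall.h>
#include <unistd.h>

/* ioprio_set(2) has no glibc wrapper, so call it via syscall(2). */
#define IOPRIO_WHO_PROCESS 1
#define IOPRIO_CLASS_BE    2    /* best-effort scheduling class */
#define IOPRIO_CLASS_SHIFT 13

/* Call from a guest's I/O thread: a favored guest gets a lower nice
 * value (CPU scheduler) and a better best-effort I/O level, 0..7,
 * where 0 is the highest within the class (block I/O scheduler). */
static int apply_guest_io_policy(int nice_value, int be_level)
{
    /* On Linux, PRIO_PROCESS with a TID adjusts just this thread. */
    if (setpriority(PRIO_PROCESS, syscall(SYS_gettid), nice_value) < 0) {
        return -1;
    }
    return syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
                   (IOPRIO_CLASS_BE << IOPRIO_CLASS_SHIFT) | be_level);
}

With CFQ on the host, that I/O priority feeds straight into the
kernel's own request scheduling, which is exactly the "use Linux"
approach.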

I think extending and tuning the existing mechanisms is the way to go.
I don't see obvious advantages other than reducing context switches.
You also lose out on the I/O scheduler since everything is being
submitted by a single shared thread.

Since I don't see a killer advantage for either approach, both can
achieve good results.  KVM's philosophy is to make use of Linux instead
of duplicating its functionality, so using the scheduler is in the
spirit of that.

Stefan


