
Re: [Qemu-devel] RFC: How to make seccomp reliable and useful ?


From: Daniel P. Berrange
Subject: Re: [Qemu-devel] RFC: How to make seccomp reliable and useful ?
Date: Thu, 16 Feb 2017 09:33:16 +0000
User-agent: Mutt/1.7.1 (2016-10-04)

On Thu, Feb 16, 2017 at 12:36:51AM +0100, Eduardo Otubo wrote:
> On Wed, Feb 15, 2017 at 06:27:32PM +0000, Daniel P. Berrange wrote:
> > The current impl of seccomp in QEMU is intentionally allowing a huge range
> > of system calls to be executed. The goal was that running '-sandbox on'
> > should never break any feature of QEMU, so naturally any syscall that can
> > be executed on any codepath QEMU takes must be allowed.
> > 
> > This is good for usability because users don't need to understand the
> > technical details of the sandbox technology, they merely say "on" and it
> > "just works". Conversely though, this is bad for security because QEMU has
> > to allow a huge range of system calls to be used due to its broad
> > functionality.
> > 
> > During initial discussions for seccomp back in 2012 it was suggested that
> > there might be alternate policies developed for QEMU which deny some
> > features, but improve security overall. To the best of my knowledge, this
> > has never been discussed again since then.
> > 
> > 
> > In addition, since initially merging, there has been a steady stream of
> > patches to whitelist further syscalls that were missing. Some of these
> > were missing due to newly added functionality in QEMU since the original
> > seccomp impl, while others have been missing since day 1. It is reasonable
> > to expect that there are still many syscalls missing in the whitelist. In
> > just a couple of minutes of comparing the whitelist vs global syscall list
> > it was possible to identify two further missing syscalls. The
> > '-netdev bridge,br=virbr0' network backend fails because setuid is
> > blocked, preventing execution of the qemu-bridge-helper program. If built
> > against glibc < 2.9, or running on kernel < 2.6.27 it will fail to call
> > eventfd() because we only permit the eventfd2() syscall, not the older
> > eventfd() syscall used on older Linux. Some ifup scripts used with the
> > -netdev arg may also break due to lack of chmod, flock, getxattr
> > permissions. This risk of missing syscalls is why -sandbox defaults to
> > off, and we've never considered defaulting it to on.
> > 
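The whitelist-vs-needed comparison described above can be sketched as a simple set difference. This is only an illustration: the `whitelist` contents are a hypothetical excerpt, not QEMU's real list, and the feature-to-syscall mapping just restates the examples from the mail.

```python
# Hypothetical excerpt of a seccomp whitelist (NOT QEMU's actual list).
whitelist = {"read", "write", "ioctl", "eventfd2"}

# Syscalls needed by the features mentioned in the mail.
needed_by_feature = {
    "netdev bridge helper": {"setuid"},
    "eventfd on old glibc/kernel": {"eventfd"},
    "ifup scripts": {"chmod", "flock", "getxattr"},
}


def audit(allowed, needed):
    """Return, per feature, the sorted syscalls that the whitelist lacks."""
    gaps = {}
    for feature, syscalls in needed.items():
        missing = syscalls - allowed
        if missing:
            gaps[feature] = sorted(missing)
    return gaps


gaps = audit(whitelist, needed_by_feature)
for feature, missing in sorted(gaps.items()):
    print(f"{feature}: missing {', '.join(missing)}")
```

Such an audit only catches features someone has already thought to list, which is the limitation the mail goes on to describe.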
> > 
> > The fundamental problem is that building a whitelist of syscalls used by
> > QEMU emulators is an intractable problem. QEMU on my system links to 183
> > different shared libraries and there is no way in the world that anyone
> > can figure out which code paths QEMU triggers in these libraries and thus
> > identify which syscalls will be genuinely needed.
> > 
> > Thus a whitelist based approach for QEMU is doomed to always be missing some
> > syscalls, resulting in unnecessary aborts of QEMU when it tickles some edge
> > case. If you are lucky the abort() happens at startup so you see it quickly
> > and can address it. If you are unlucky the abort() happens after your VM has
> > been running for days/weeks/months and you lose data.
> > 
> > IOW, seccomp integration as it currently exists today in QEMU offers minimal
> > security benefits, while at the same time causing spurious crashes which may
> > cause user data loss from aborting a running VM, discouraging users from
> > using even the minimal protection it offers.
> > 
> > I think we need to rework our seccomp support so that we can have a high
> > enough level of confidence in it that it could be enabled by default. At
> > the same time we need to make it do something more tangibly useful from a
> > security POV.
> > 
> > 
> > First we need to admit that whitelisting is a failed approach, and switch to
> > using blacklisting. Unless we do this, we'll never have high enough
> > confidence to enable it by default - something that's never turned on might
> > as well not exist at all.
> > 
> > 
> > There is a reasonably easily identifiable set of syscalls that QEMU should
> > never be permitted to use, no matter what configuration it is in, what
> > helpers it spawns, or what libraries it links to. eg reboot, swapon,
> > swapoff, syslog, mount, umount, kexec_*, etc - any syscall that affects
> > global system state, rather than process local state should be forbidden.
> > 
> > There are some syscalls that are simply hardcoded to return ENOSYS which can
> > be trivially blacklisted. afs_syscall, break, fattach, ftime, etc (see the
> > man page 'unimplemented(2)').
> > 
> > There are some syscalls which are considered obsolete - they were previously
> > useful, but no modern code would call them, as they have been superseded.
> > For example, readdir replaced by getdents. We could blacklist these by
> > default but provide a way to allow use of obsolete syscalls if running on
> > older systems. e.g. '-sandbox on,obsolete=allow'. They might be obsolete
> > enough that we decide to just block them permanently with no opt in - would
> > need to analyse when their replacements appeared in widespread use.
> > 
> > There might be a few more syscalls which we can determine are never valid to
> > use in QEMU or any library or helper program it might run. I expect this
> > list to be very small though, given the impossibility of auditing code
> > paths through millions of lines of code QEMU links to.
> > 
> > Everything else should be allowed.
> > 
> > At this point we have a highly reliable "-sandbox on" which we're not having
> > to constantly patch.
> > 
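A minimal sketch of the default-allow blacklist described above, assuming the syscall groups named in the mail. The set contents are illustrative and incomplete, and `decide` is a hypothetical helper for this sketch, not a QEMU or libseccomp function.

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    DENY = "deny"


# Illustrative groups taken from the examples in the mail (not exhaustive).
GLOBAL_STATE = frozenset({
    "reboot", "swapon", "swapoff", "syslog", "mount", "umount", "umount2",
    "kexec_load", "kexec_file_load",
})
UNIMPLEMENTED = frozenset({"afs_syscall", "break", "fattach", "ftime"})
OBSOLETE = frozenset({"readdir"})


def decide(syscall, allow_obsolete=False):
    """Deny only what is known to be never valid; allow everything else."""
    if syscall in GLOBAL_STATE or syscall in UNIMPLEMENTED:
        return Action.DENY
    if syscall in OBSOLETE and not allow_obsolete:
        return Action.DENY
    return Action.ALLOW


for name in ("reboot", "readdir", "openat"):
    print(name, decide(name).value)
```

The `allow_obsolete` flag stands in for the proposed '-sandbox on,obsolete=allow'. A syscall nobody anticipated falls through to ALLOW, so it cannot abort a running VM.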
> > 
> > From here we need a way to allow a user to opt-in to more restrictive
> > policies, accepting that it will block certain features. For example,
> > there should be a way to disable any means to elevate privileges from QEMU
> > or things it spawns. e.g. '-sandbox on,elevateprivileges=deny'.
> > 
> > This would not only block the various set*uid|gid functions via seccomp,
> > but should also call prctl(PR_SET_NO_NEW_PRIVS). This would allow the user
> > to opt in to a restrictive world if they know they'll not require things
> > like the setuid bridge helper.
> > 
> > Similarly there should be a '-sandbox on,spawn=deny' which prevents the
> > ability to fork/exec processes at all, whether privileged or not. This
> > would block features like the qemu bridge helper, SMB server, ifup/down
> > scripts, migration exec: protocol. These are all rarely used features
> > though, so an opt-in to block their use is reasonable & desirable.
> > 
> > There could also be a '-sandbox on,resourcecontrol=deny', which prevents
> > QEMU from setting stuff like process affinity, scheduler priority, etc.
> > Some uses of QEMU might need them, but normally such controls are left to
> > the mgmt app above QEMU to set prior to the exec() of QEMU.
> > 
> > 
> > 
> > The key is that these are *not* low level knobs controlling system calls,
> > but moderately high level knobs controlling general concepts. This is a
> > high enough level of abstraction to enable libvirt to automatically turn
> > them on/off based on guest config, without libvirt having to know anything
> > detailed about QEMU code impl for the features.
> > 
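To illustrate how such high-level knobs could map onto syscall groups, here is a rough sketch. The knob names come from the proposal above, but the parser, the `KNOB_SYSCALLS` table and the syscalls in each group are assumptions made for this example and do not reflect any existing QEMU option handling.

```python
# Illustrative mapping from proposed knobs to syscall groups (assumed, partial).
KNOB_SYSCALLS = {
    "elevateprivileges": frozenset({
        "setuid", "setgid", "setreuid", "setregid", "setresuid", "setresgid",
    }),
    "spawn": frozenset({"fork", "vfork", "execve"}),
    "resourcecontrol": frozenset({
        "sched_setaffinity", "sched_setscheduler", "setpriority",
    }),
}


def parse_sandbox(arg):
    """Parse e.g. 'on,spawn=deny' into (enabled, {knob: 'allow'|'deny'})."""
    head, *rest = arg.split(",")
    if head not in ("on", "off"):
        raise ValueError(f"expected 'on' or 'off', got {head!r}")
    knobs = {}
    for item in rest:
        name, sep, value = item.partition("=")
        if not sep or name not in KNOB_SYSCALLS or value not in ("allow", "deny"):
            raise ValueError(f"bad sandbox option {item!r}")
        knobs[name] = value
    return head == "on", knobs


def extra_denied(enabled, knobs):
    """Syscalls denied on top of the default blacklist for the given knobs."""
    if not enabled:
        return frozenset()
    denied = set()
    for name, value in knobs.items():
        if value == "deny":
            denied |= KNOB_SYSCALLS[name]
    return frozenset(denied)


enabled, knobs = parse_sandbox("on,elevateprivileges=deny,spawn=deny")
print(sorted(extra_denied(enabled, knobs)))
```

A management layer such as libvirt would only choose knob values from the guest config; the syscall lists stay an internal detail. Note that a real 'spawn=deny' would also have to deal with clone without breaking thread creation, which this sketch ignores.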
> > 
> > Finally, for avoidance of doubt, I'm *not* actually proposing to implement
> > this myself any time in the foreseeable future. This mail came about from
> > the fact that many people have questioned whether current seccomp code is
> > anything other than "security theatre". I tend to agree with such an
> > assessment myself, and was initially intending to just send a patch to
> > remove seccomp, to stimulate some discussion. Instead, however, I decided
> > to write this mail to see if we can identify a way forward to make seccomp
> > both reliable and useful. If QEMU had the kind of approach outlined above,
> > with a default blacklist instead of whitelist, and some opt-ins for
> > stricter lists, it is something I think libvirt would be reasonably happy
> > to enable out of the box. That would be a step forward from today where
> > libvirt would never consider turning seccomp on by default.
> > 
> > Perhaps this re-working could be a GSoC idea for some interested student...
> > 
> 
> I'm not a student, and thus not eligible for GSoC, but I would be more
> than grateful to take this initiative of yours and turn it into some
> patches so we can make this feature something really useful and
> reliable.

Sure, I just threw GSoC out there as one possible idea. If you or anyone
else has time to work on it, that's great too.

> Perhaps now is not the right time to give terse comments on every idea
> you gave; I agree with most of them. I wrote the whole implementation of
> this feature but actually became the maintainer because people approving
> syscalls and sending pull-requests were too busy, and I could do it. But
> to be completely honest I had a few poor ideas on how to improve it and
> almost no time to actually do it in the past. Time passed by and all I
> did was approve new syscalls and turn them into pull-requests.
> 
> Let's spin up these ideas and hopefully incorporate them into QEMU. As a
> next step I'm going to dig into every topic and draft a little more. I
> guess we can keep on this thread, or perhaps in separate ones. From there
> I can start to write some code.

ok

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://entangle-photo.org       -o-    http://search.cpan.org/~danberr/ :|


