Re: [Qemu-devel] [RFC] Disk integrity in QEMU


From: Mark Wagner
Subject: Re: [Qemu-devel] [RFC] Disk integrity in QEMU
Date: Sat, 11 Oct 2008 20:43:00 -0400
User-agent: Thunderbird 2.0.0.12 (X11/20080226)

Anthony Liguori wrote:

Note:

I think that there are two distinct arguments going on here. My main concern is
that I don't think this is a simple "what do we make the default cache policy
be" issue. I think that regardless of the cache policy, if something in the
guest requests O_DIRECT, the host must honor that and not cache the data.

So in the discussion below, the question of what the default cache
flag should be and the question of the host needing to honor O_DIRECT in a
guest are somewhat intermingled...

Mark Wagner wrote:
Avi Kivity wrote:

I think one of the main things to be considered is the integrity of the
actual system call.  The Linux manpage for open() states the following
about the use of the O_DIRECT flag:

O_DIRECT (Since Linux 2.6.10)
Try to minimize cache effects of the I/O to and from this file. In
general this will degrade performance, but it is useful in special
situations, such as when applications do their own caching. File
I/O is done directly to/from user space buffers. The I/O is
synchronous, that is, at the completion of a read(2) or write(2),
data is guaranteed to have been transferred. Under Linux 2.4,
transfer sizes, and the alignment of user buffer and file offset
must all be multiples of the logical block size of the file system.
Under Linux 2.6 alignment to 512-byte boundaries suffices.
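
For illustration, here is a minimal userspace sketch of what those alignment
rules look like in practice (the file name and sizes are arbitrary, and error
handling is abbreviated):

    /* open with O_DIRECT; buffer, length and offset are 512-byte aligned */
    #define _GNU_SOURCE             /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        void *buf;
        int fd = open("data.bin", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0)
            return 1;

        /* the user buffer must be aligned for O_DIRECT (512 bytes on 2.6) */
        if (posix_memalign(&buf, 512, 4096))
            return 1;
        memset(buf, 0, 4096);

        /* when this returns, the data has been handed to the storage
         * subsystem, which is not the same thing as being on disk */
        if (write(fd, buf, 4096) != 4096)
            return 1;

        free(buf);
        close(fd);
        return 0;
    }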


If I focus on the sentence "The I/O is synchronous, that is, at
the completion of a read(2) or write(2), data is guaranteed to have
been transferred. ",

It's extremely important to understand what the guarantee is. The guarantee is that upon completion of a write(), the data will have been reported as written by the underlying storage subsystem. This does *not* mean that the data is on disk.

I apologize if I worded it poorly; I assume that the guarantee is that
the data has been sent to the storage controller and said controller
sent an indication that the write has completed.  This could mean
multiple things, like it's in the controller's cache, on the disk, etc.

I do not believe that this means that the data is still sitting in the
host cache.  I realize it may not yet be on a disk, but, at a minimum,
I would expect that it has been sent to the storage controller.  Do you
consider the host's cache to be part of the storage subsystem?


If you have a normal laptop, your disk has a cache. That cache does not have a battery backup. Under normal operations, the cache is acting in write-back mode, and when you do a write, the disk will report the write as completed even though it is not actually on disk. If you really care about the data being on disk, you have to either use a disk with a battery-backed cache (much more expensive) or enable write-through caching (which will significantly reduce performance).
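
To make that concrete, here is a rough sketch (the helper name is mine, not
anything from QEMU) of the usual way an application tries to push data past
those volatile caches from userspace: write() followed by fsync(). Whether the
drive's own write-back cache is also flushed depends on the kernel's
barrier/flush support for that device.

    #include <fcntl.h>
    #include <unistd.h>

    /* Write a buffer and ask the kernel to flush it to the device.
     * A successful fsync() still does not prove the data is on the
     * platter if the drive's volatile cache is not flushed. */
    int write_and_flush(const char *path, const void *data, size_t len)
    {
        int fd = open(path, O_WRONLY | O_CREAT, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
            close(fd);
            return -1;
        }
        return close(fd);
    }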



We are testing things on the big side: systems with 32 GB of memory and
2 TB of enterprise storage (MSA, EVA, etc.).  There is a write cache with
battery backup on the storage controllers.  We understand the trade-offs
between the lifetime of the battery and the potential data loss because
they are well documented and we can make informed decisions because we
know they are there.

I think that people are too quickly assuming that because an IDE drive
will cache your writes *if you let it*, then it's clearly OK for the host
to lie to the guests when they request O_DIRECT and cache whatever the
developers feel like.  I think the leap from the write cache on
an IDE drive to "it's OK to cache whatever we want on the host" is huge,
and deadly.

Keep in mind, the disk on a laptop is not caching GBs worth of data
like the host can. The impact is that while there is a chance of data
loss with my laptop if I leave the disk cache on, the amount of data is
much smaller and the time it takes to flush the disk's cache is also
much smaller than for a multi-GB cache on my host.

In the case of KVM, even using write-back caching with the host page cache, we are still honoring the guarantee of O_DIRECT. We just have another level of caching that happens to be write-back.

I still don't get it.  If something running on the host opens a file
with O_DIRECT, do you still consider it not to be a violation of
the system call if that data ends up in the host cache instead of being
sent to the storage controller?

If you do think it violates the terms of the call, then what is the
difference between the host and a guest in this situation?
QEMU is clearly not a battery-backed storage controller.



I think there is a bug here. If I open a
file with the O_DIRECT flag and the host reports back to me that
the transfer has completed when in fact it is still in the host cache,
that is a bug: it violates the open()/write() call and there is no
guarantee that the data will actually be written.

This is very important: O_DIRECT does *not* guarantee that data actually resides on disk. There are many possible places where it can be cached (in the storage controller, in the disks themselves, in a RAID controller).

I don't believe I said it was on the disk, just that the host indicated
to the guest that the write had completed. Everything you mentioned
could be considered external to the OS. You didn't mention the host
page cache; is it allowed there or not?

So I guess the real issue isn't what the default should be (although
the performance team at Red Hat would vote for cache=off),

The consensus so far has been that we want to still use the host page cache, but use it in write-through mode. This would mean that the guest would only see the write as complete when the host's storage subsystem reports it as having completed. This is not the same as cache=off, but I think it gives the real effect that is desired.
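
To picture what write-through via the host page cache means at the syscall
level, a rough sketch is below; whether QEMU implements its write-through mode
exactly this way is an implementation detail of the patches being discussed,
so take the choice of flag here as an assumption.

    #include <fcntl.h>

    /* Write-through via the page cache: the data still lands in the
     * host page cache (so re-reads are cheap), but write() does not
     * return until the storage subsystem has reported the write as
     * complete.  Contrast with O_DIRECT, which bypasses the cache. */
    int open_writethrough(const char *path)
    {
        return open(path, O_RDWR | O_DSYNC);
    }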

Do you have another argument for using cache=off?

That's not the argument I'm trying to make.

Well, I guess I still didn't make my point clearly. cache=off seems to be
a band-aid for the fact that the host is not honoring the O_DIRECT flag.
I can easily see a malicious use of the cache=on flag to inject something
into the data stream or hijack said stream from a guest app that
requested O_DIRECT. While this is also possible in many other ways, in this
particular case it is enabled via the config option in QEMU.  I can easily
see something as simple as setting a large page cache, configuring the guests
to use cache=on, and then messing with the caches every second in order to
cause data corruption (wonder if "echo 1 > /proc/sys/vm/drop_caches" will
do the trick?).  From the guests' perspective, they have been guaranteed
that their data is secure, but it really isn't.

We are testing with Oracle right now. Oracle assumes it has control of the
storage and does lots of things assuming direct I/O. However, I can configure
cache=on for the storage presented to the guest, and Oracle really won't have
direct control because there is a host cache in the way.

If I run the same Oracle config on bare metal, it does have direct control
because the OS knows that the host cache must be bypassed.

The end result is that the final behavior of the guest OS is drastically
different from that of the same OS running directly on the host, because I can
configure QEMU to hijack the data underneath the actual call and, at a minimum,
delay it from going to the external storage subsystem where the application
expects it to be. The impact of this decision is that it makes QEMU unreliable
for any type of use that requires data integrity and unsuitable for any
type of enterprise deployment.

-mark

Regards,

Anthony Liguori

the real
issue is that we need to honor the system call from the guest. If
the file is opened with O_DIRECT on the guest, then the host needs
to honor that and do the same.

-mark
