qemu-devel

Re: [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specifica


From: Avi Kivity
Subject: Re: [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification
Date: Tue, 12 Oct 2010 12:25:29 +0200
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.9) Gecko/20100921 Fedora/3.1.4-1.fc13 Lightning/1.0b3pre Thunderbird/3.1.4

 On 10/11/2010 06:10 PM, Anthony Liguori wrote:
On 10/11/2010 11:02 AM, Avi Kivity wrote:
 On 10/11/2010 05:49 PM, Anthony Liguori wrote:
On 10/11/2010 09:58 AM, Avi Kivity wrote:
A leak is unacceptable. It means an image can grow to an unbounded size. If you are a server provider offering multitenancy, then a malicious guest can potentially grow the image beyond its allotted size, causing a Denial of Service attack against another tenant.


This particular leak cannot grow, and is not controlled by the guest.

As the image gets moved from hypervisor to hypervisor, it can keep growing if given a chance to fill up the disk, then trim it all away.

In a mixed hypervisor environment, it just becomes a numbers game.

I don't see how it can grow. Both the freelist and the clusters it points to consume space, which becomes a leak once you move it to a hypervisor that doesn't understand the freelist. The older hypervisor then allocates new blocks. As soon as it performs a metadata scan (if ever), the freelist is reclaimed.

Assume you don't ever do a metadata scan (which is really our design point).

What about crashes?


If you move to a hypervisor that doesn't support it, then move to a hypervisor that does, you create a brand new freelist and start leaking more space. This isn't a contrived scenario if you have a cloud environment with a mix of hosts.

It's only a leak if you don't do a metadata scan.


You might not be able to get a ping-pong every time you provision, but with enough effort, you could create serious problems.

It's really an issue of correctness. Making correctness trade-offs for the purpose of compatibility is a policy decision and not something we should bake into an image format. If a tool feels strongly that it's a reasonable trade-off to make, it can always fudge the feature bits itself.

I think the effort here is reasonable, clearing a bit on startup is not that complicated.


A potential solution here is to treat TRIM a little differently than we've been discussing.

When TRIM happens, don't immediately write an unallocated cluster entry for the L2. Leave the L2 entry intact. Don't actually write a UCE to the L2 until you actually allocate the block.

This implies a cost because you'll need to do metadata syncs to make this work. However, that eliminates leakage.

The information is lost on shutdown; and you can have a large number of unallocated-in-waiting clusters (like a TRIM issued by mkfs, or a user expecting a visit from RIAA).

A slight twist on your proposal is to have an allocated-but-may-drop bit in an L2 entry. TRIM or zero detection sets the bit (leaving the cluster number intact). A following write to the cluster needs to clear the bit; if we reallocate the cluster we need to replace it with a ZCE.
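
For illustration only, the three transitions described above (TRIM sets the bit, a write clears it, reallocation leaves a ZCE behind) can be sketched as follows. The bit position, the ZCE encoding, the cluster size and all function names are assumptions made for this sketch; none of them come from the QED specification or from this thread.

```python
# Hypothetical L2 entry layout: cluster-aligned offset plus flag bits in the
# low bits that alignment leaves unused. All constants are assumptions.
CLUSTER_SIZE = 64 * 1024
OFFSET_MASK = ~(CLUSTER_SIZE - 1)
ZERO_CLUSTER = 0x1   # assumed ZCE encoding: entry reads back as zeroes
MAY_DROP = 0x2       # assumed "allocated-but-may-drop" flag


def trim(l2, index):
    """TRIM or zero detection: set the flag, keep the cluster number."""
    if l2[index] & OFFSET_MASK:
        l2[index] |= MAY_DROP


def guest_write(l2, index):
    """A later write to an allocated cluster makes the data live again."""
    l2[index] &= ~MAY_DROP


def reclaim(l2, index):
    """Reuse the cluster elsewhere: return its offset and leave a ZCE."""
    entry = l2[index]
    if not entry & MAY_DROP:
        raise ValueError("cluster is still in use")
    l2[index] = ZERO_CLUSTER
    return entry & OFFSET_MASK


l2_table = [0x10000]
trim(l2_table, 0)          # entry keeps its offset, flag is set
guest_write(l2_table, 0)   # flag cleared, cluster is live again
trim(l2_table, 0)
freed = reclaim(l2_table, 0)   # offset handed back, entry becomes a ZCE
```

Because the cluster number stays in the L2 entry until the cluster is actually reused, losing a free list in this model loses no space: the marked entries can be found again by reading the L2 tables.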

Yeah, this is sort of what I was thinking. You would still want a free list but it becomes totally optional because if it's lost, no data is leaked (assuming that the older version understands the bit).

I was suggesting that we store that bit in the free list though because that lets us support having older QEMUs with absolutely no knowledge still work.

It doesn't - on rewrite an old qemu won't clear the bit, so a newer qemu would think it's still free.

The autoclear bit solves it nicely - the old qemu automatically drops the allocated-but-may-drop bits, undoing any TRIMs (which is unfortunate) but preserving consistency.
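
A rough sketch of how that could play out, under stated assumptions: the image is modelled as a dict, the header carries an autoclear feature mask, and one bit in it (hypothetical here) vouches for the may-drop marks in the L2 entries. The names, bit values and the flat L2 list are invented for illustration, and the old implementation is assumed to ignore the low flag bits of an L2 entry.

```python
AUTOCLEAR_MAY_DROP = 0x1   # hypothetical autoclear feature bit in the header
MAY_DROP = 0x2             # hypothetical flag in an L2 entry


def open_image(image, known_autoclear):
    """Open an image with an implementation that knows `known_autoclear`."""
    # Every implementation clears the autoclear bits it does not understand.
    image["autoclear_features"] &= known_autoclear

    if known_autoclear & AUTOCLEAR_MAY_DROP:
        if not image["autoclear_features"] & AUTOCLEAR_MAY_DROP:
            # An older implementation may have rewritten clusters without
            # clearing their marks, so none of the marks can be trusted.
            image["l2"] = [entry & ~MAY_DROP for entry in image["l2"]]
            image["autoclear_features"] |= AUTOCLEAR_MAY_DROP


image = {"autoclear_features": AUTOCLEAR_MAY_DROP, "l2": [0x10002, 0x20000]}
open_image(image, known_autoclear=0)                   # old qemu: bit dropped
open_image(image, known_autoclear=AUTOCLEAR_MAY_DROP)  # new qemu: marks dropped
```

The TRIMs recorded before the old implementation touched the image are lost, as noted above, but a cluster it rewrote can no longer be mistaken for free.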



--
error compiling committee.c: too many arguments to function



