
Re: [Qemu-devel] DMG chunk size independence


From: John Snow
Subject: Re: [Qemu-devel] DMG chunk size independence
Date: Mon, 17 Apr 2017 16:29:24 -0400
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0


On 04/15/2017 04:38 AM, Ashijeet Acharya wrote:
> Hi,
> 
> Some of you are already aware but for the benefit of the open list,
> this mail is regarding the task mentioned
> Here -> http://wiki.qemu-project.org/ToDo/Block/DmgChunkSizeIndependence
> 

OK, so the idea here is that we should be able to read portions of
chunks instead of buffering entire chunks, because chunks can be quite
large and an unverified DMG file should not be able to cause QEMU to
allocate large portions of memory.

Currently, QEMU has a maximum chunk size and it will not open DMG files
that have chunks that exceed that size, correct?

> I had a chat with Fam regarding this and he suggested a solution where
> we fix the output buffer size to a max of, say, 64K and keep inflating
> until we reach the end of the input stream. We extract the required
> data when we enter the desired range and discard the rest. Fam,
> however, termed this only a "quick fix".
> 
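As a rough illustration of that quick fix -- the names and the Python/zlib
setting here are mine, not anything from the DMG driver, and a real
implementation would be C against the block layer -- inflating through a
fixed-size output buffer and discarding everything outside the requested
range looks something like:

```python
import zlib

def read_range_from_chunk(compressed, offset, length, bufsize=64 * 1024):
    # Inflate the whole chunk through a fixed-size output buffer,
    # keeping only the bytes that fall inside [offset, offset+length).
    d = zlib.decompressobj()
    out = bytearray()
    pos = 0                       # current position in the decompressed stream
    buf = compressed
    while not d.eof and pos < offset + length:
        piece = d.decompress(buf, bufsize)   # at most bufsize bytes at a time
        buf = d.unconsumed_tail              # input not yet consumed
        if not piece and not buf:
            break                            # truncated input: nothing left
        end = pos + len(piece)
        lo, hi = max(pos, offset), min(end, offset + length)
        if lo < hi:                          # overlap with the wanted range
            out += piece[lo - pos:hi - pos]
        pos = end
    return bytes(out)
```

Memory stays bounded by bufsize, but every read still pays for inflating
everything before the requested offset.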

So it looks like your problem now is how to allow reads to subsets while
tolerating zipped chunks, right?

We can't predict where the data we want will appear mid-stream. I'm not
that familiar with the DMG format, though -- what does the data look
like, and how do we seek to it in general?

We've got the mish blocks stored inside of the resource fork (right?), and
each mish block contains one or more chunk records. So given any offset
into the virtual file, we at least know which chunk it belongs to, but
thanks to zlib, we can't just read the bits we care about.

(Correct so far?)
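If so, the offset-to-chunk step is just a search over the chunk records'
decompressed start offsets. A hypothetical sketch (the record table below
is invented for illustration, not real mish data):

```python
import bisect

def find_chunk(chunk_starts, offset):
    # chunk_starts: sorted decompressed start offset of each chunk record,
    # as it would be parsed out of the mish blocks.
    i = bisect.bisect_right(chunk_starts, offset) - 1
    return i, offset - chunk_starts[i]   # (chunk index, offset within chunk)
```

So locating the chunk is cheap; the expensive part is the inflate that
follows.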

> The ideal fix would obviously be if we can somehow predict the exact
> location inside the compressed stream relative to the desired offset
> in the output decompressed stream, such as a specific sector in a
> chunk. Unfortunately this is not possible without doing a first pass
> over the decompressed stream as answered on the zlib FAQ page
> Here -> http://zlib.net/zlib_faq.html#faq28
> 

Yeah, I think you need to start reading the data from the beginning of
each chunk -- but it depends on the zlib data. It COULD be broken up
into different pieces, but there's no way to know without scanning it in
advance.

(Unrelated:

Do we have a zlib format driver?

It might be cute to break up such DMG files and offload zlib
optimization to another driver, like this:

[dmg]-->[zlib]-->[raw]

And we could pretend that each zlib chunk in this file is virtually its
own zlib "file" and access it with modified offsets as appropriate.

Any optimizations we make could just apply to this driver.

[anyway...])


Pre-scanning for these sync points is probably a waste of time, as
there's no way to know (*I THINK*) how big each sync-block would be once
decompressed, so this still doesn't help you seek within a compressed
block...

> AFAICT after reading the zran.c example in zlib, the above mentioned
> ideal fix would ultimately lead us to decompress the whole chunk in
> steps at least once to maintain an access point lookup table. This
> solution is better if we get several random access requests over
> different read requests, otherwise it ends up being equal to the fix
> suggested by Fam plus some extra effort needed in building and
> maintaining access points.
> 
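The zran-style access-point table can be sketched in Python, whose
zlib.Decompress.copy() exists precisely to save inflate state for later
random seeks into a stream; everything else here (names, span size, the
in-memory setting) is illustrative, not the zran.c mechanism itself:

```python
import zlib

def build_access_points(compressed, span=64 * 1024, bufsize=16 * 1024):
    # First pass over the chunk: save a checkpoint (a copy of the
    # decompressor state plus the input not yet consumed) roughly
    # every `span` output bytes.
    d = zlib.decompressobj()
    points = [(0, d.copy(), compressed)]
    pos, buf, next_point = 0, compressed, span
    while not d.eof:
        piece = d.decompress(buf, bufsize)
        buf = d.unconsumed_tail
        if not piece and not buf:
            break
        pos += len(piece)
        if pos >= next_point:
            points.append((pos, d.copy(), buf))
            next_point = pos + span
    return points

def read_at(points, offset, length, bufsize=16 * 1024):
    # Resume from the last checkpoint at or before `offset` and inflate
    # forward, keeping only [offset, offset+length).
    pos, d, buf = max((p for p in points if p[0] <= offset),
                      key=lambda p: p[0])
    d = d.copy()                  # keep the stored checkpoint reusable
    out = bytearray()
    while not d.eof and pos < offset + length:
        piece = d.decompress(buf, bufsize)
        buf = d.unconsumed_tail
        if not piece and not buf:
            break
        end = pos + len(piece)
        lo, hi = max(pos, offset), min(end, offset + length)
        if lo < hi:
            out += piece[lo - pos:hi - pos]
        pos = end
    return bytes(out)
```

The first pass still inflates the whole chunk once, which is exactly the
cost being weighed above.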

Yeah, probably not worth it overall... I have to imagine that most uses
of DMG files are iso-like cases -- installers where accesses are going to
be sequential (or mostly sequential) and most data will not be read
twice.

I could be wrong, but that's my hunch.

Maybe you can cache the state of the INFLATE process such that, once the
cache is filled with data, we can simply resume the INFLATE procedure
when the guest almost inevitably asks for the next subsequent bytes.

That'd probably be efficient /enough/ in most cases without having to
worry about a metadata cache for zlib blocks or a literal data cache for
inflated data.
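That resumable-INFLATE cache could look something like this hypothetical
sketch (Python's zlib stands in for the real thing; a backward seek just
restarts the stream from the start of the chunk):

```python
import zlib

class ChunkReader:
    # Keep the inflate state alive between reads, so a mostly-sequential
    # guest access pattern resumes decompression instead of restarting it.
    def __init__(self, compressed, bufsize=64 * 1024):
        self.compressed = compressed
        self.bufsize = bufsize
        self._reset()

    def _reset(self):
        self.d = zlib.decompressobj()
        self.buf = self.compressed   # input not yet fed to the decompressor
        self.pos = 0                 # decompressed offset the state is at

    def read(self, offset, length):
        if offset < self.pos:
            self._reset()            # backward seek: restart from chunk start
        while self.pos < offset and not self.d.eof:
            # Skip forward, discarding output, until we reach `offset`.
            piece = self.d.decompress(self.buf,
                                      min(self.bufsize, offset - self.pos))
            self.buf = self.d.unconsumed_tail
            if not piece:
                break
            self.pos += len(piece)
        out = bytearray()
        while len(out) < length and not self.d.eof:
            piece = self.d.decompress(self.buf, length - len(out))
            self.buf = self.d.unconsumed_tail
            if not piece:
                break
            out += piece
            self.pos += len(piece)
        return bytes(out)
```

Sequential reads then cost only the bytes actually requested; only a
backward seek pays the restart penalty.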

Or maybe I'm full of crap, I don't know -- I'd probably try a few
approaches and see which one empirically worked better.

> I have not explored the bzip2-compressed chunks yet, but have naively
> assumed that we will face the same situation there?
> 

Not sure.

> I would like the community's opinion on this, and any suggestions that
> might give me some new thinking points.
> 
> Thanks
> Ashijeet
> 


