Re: [Qemu-devel] Adding a persistent writeback cache to qemu


From: Liu Yuan
Subject: Re: [Qemu-devel] Adding a persistent writeback cache to qemu
Date: Fri, 21 Jun 2013 23:18:07 +0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130510 Thunderbird/17.0.6

On 06/20/2013 11:58 PM, Sage Weil wrote:
> On Thu, 20 Jun 2013, Stefan Hajnoczi wrote:
>>> The concrete problem here is that flashcache/dm-cache/bcache don't
>>> work with the rbd (librbd) driver, as flashcache/dm-cache/bcache
>>> cache access to block devices (in the host layer), and with rbd
>>> (for instance) there is no access to a block device at all. block/rbd.c
>>> simply calls librbd which calls librados etc.
>>>
>>> So the context switches etc. I am avoiding are the ones that would
>>> be introduced by using kernel rbd devices rather than librbd.
>>
>> I understand the limitations with kernel block devices - their
>> setup/teardown is an extra step outside QEMU and privileges need to be
>> managed.  That basically means you need to use a management tool like
>> libvirt to make it usable.
>>
>> But I don't understand the performance angle here.  Do you have profiles
>> that show kernel rbd is a bottleneck due to context switching?
>>
>> We use the kernel page cache for -drive file=test.img,cache=writeback
>> and no one has suggested reimplementing the page cache inside QEMU for
>> better performance.
>>
>> Also, how do you want to manage QEMU page cache with multiple guests
>> running?  They are independent and know nothing about each other.  Their
>> process memory consumption will be bloated and the kernel memory
>> management will end up having to sort out who gets to stay in physical
>> memory.
>>
>> You can see I'm skeptical of this and think it's premature optimization,
>> but if there's really a case for it with performance profiles then I
>> guess it would be necessary.  But we should definitely get feedback from
>> the Ceph folks too.
>>
>> I'd like to hear from Ceph folks what their position on kernel rbd vs
>> librados is.  Which one do they recommend for QEMU guests and what are the
>> pros/cons?
> 
> I agree that a flashcache/bcache-like persistent cache would be a big win 
> for qemu + rbd users.  
> 
> There are a few important issues with librbd vs kernel rbd:
> 
>  * librbd tends to get new features more quickly than the kernel rbd 
>    (although now that layering has landed in 3.10 this will be less 
>    painful than it was).
> 
>  * Using kernel rbd means users need bleeding edge kernels, a non-starter 
>    for many orgs that are still running things like RHEL.  Bug fixes are 
>    difficult to roll out, etc.
> 
>  * librbd has an in-memory cache that behaves similarly to an HDD's cache 
>    (e.g., it forces writeback on flush).  This improves performance 
>    significantly for many workloads.  Of course, having a bcache-like 
>    layer mitigates this..
> 
> I'm not really sure what the best path forward is.  Putting the 
> functionality in qemu would benefit lots of other storage backends, 
> putting it in librbd would capture various other librbd users (xen, tgt, 
> and future users like hyper-v), and using new kernels works today but 
> creates a lot of friction for operations.
> 

I think I can share some implementation details about a persistent cache
for the guest, because 1) Sheepdog has a persistent object cache exactly
like what Alex described, 2) Sheepdog and Ceph's RADOS both provide
volumes on top of an object store, and 3) Sheepdog chose a persistent
cache on local disk while Ceph chose an in-memory cache approach.

The main motivation for the object cache is to reduce network traffic and
improve performance. The cache can be seen as the equivalent of a hard
disk's internal write cache, which modern kernels handle well.

As background, Sheepdog's object cache works similarly to the kernel's
page cache, except that we cache 4M objects of a volume on disk while the
kernel caches 4K pages of a file in memory. We use a per-volume LRU list
for reclaim and a dirty list to track dirty objects for writeback. We
always read ahead a whole object if it is not yet cached.
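
To make that concrete, here is a minimal sketch of the per-volume
metadata in plain C. The names (cache_object, volume_cache, cache_touch,
cache_mark_dirty) and the use of sys/queue.h lists are hypothetical and
not taken from the Sheepdog source; the point is only to show the
LRU/dirty-list bookkeeping described above.

/* Hypothetical sketch of the per-volume object cache metadata. */
#include <stdbool.h>
#include <stdint.h>
#include <sys/queue.h>

#define OBJECT_SIZE (4 * 1024 * 1024)     /* cache granularity: one 4M object */

struct cache_object {
    uint64_t oid;                         /* object index within the volume */
    bool dirty;                           /* needs writeback to the store */
    TAILQ_ENTRY(cache_object) lru;        /* position in the per-volume LRU */
    TAILQ_ENTRY(cache_object) dirty_link; /* position in the dirty list */
};

struct volume_cache {
    TAILQ_HEAD(, cache_object) lru_list;   /* reclaim victims come from the tail */
    TAILQ_HEAD(, cache_object) dirty_list; /* objects waiting for writeback */
};

/* On a cache hit, move the object to the front of the LRU. */
static void cache_touch(struct volume_cache *vc, struct cache_object *obj)
{
    TAILQ_REMOVE(&vc->lru_list, obj, lru);
    TAILQ_INSERT_HEAD(&vc->lru_list, obj, lru);
}

/* On a guest write, mark the object dirty so the writeback path finds it. */
static void cache_mark_dirty(struct volume_cache *vc, struct cache_object *obj)
{
    if (!obj->dirty) {
        obj->dirty = true;
        TAILQ_INSERT_TAIL(&vc->dirty_list, obj, dirty_link);
    }
}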

The benefits of a disk cache over a memory cache, in my opinion, are:
1) the VM gets smoother performance, because the cache does not consume
memory (when memory hits the high watermark, guest IO latency becomes
very high);
2) a smaller memory footprint, leaving all the memory to the guest;
3) objects from the base image can be shared by all of its snapshots and
clones;
4) a more efficient reclaim algorithm, because the sheep daemon knows the
workload better than the kernel's dm-cache/bcache/flashcache;
5) it can easily take advantage of an SSD as the cache backend.

There are no migration problems for Sheepdog with the client cache,
because we can release the cache during migration.

If QEMU had a persistent cache built into a generic layer, say the block
layer, Sheepdog's object cache code could be removed. Building this cache
into QEMU would also have advantages beyond code reduction; for example,
we could teach QEMU to connect to more than one sheep daemon for better
HA without having to worry about cache consistency.
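
As a rough, hypothetical sketch of what such a generic layer could look
like (none of these names are QEMU APIs, it only illustrates the
write/flush split): guest writes land in a local cache file and complete
immediately, and a guest flush pushes the dirty data down to whatever
protocol driver sits underneath (sheepdog, rbd, ...).

/* Hypothetical sketch, not QEMU code: a generic persistent writeback
 * cache between the guest and an arbitrary protocol driver. */
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

struct backend_ops {
    /* the real protocol driver underneath, e.g. librbd or a sheep daemon */
    int (*pwrite)(void *opaque, uint64_t off, const void *buf, size_t len);
    int (*flush)(void *opaque);
};

struct pcache {
    int cache_fd;                      /* local cache file or SSD partition */
    const struct backend_ops *backend;
    void *backend_opaque;
};

/* Guest write: persist to the local cache only and complete right away. */
static int pcache_write(struct pcache *c, uint64_t off, const void *buf,
                        size_t len)
{
    /* would also record the range in a dirty list (omitted) */
    return pwrite(c->cache_fd, buf, len, off) == (ssize_t)len ? 0 : -1;
}

/* Guest flush: write dirty ranges back to the real backend, then flush it. */
static int pcache_flush(struct pcache *c)
{
    /* iterate the dirty list, read each range from cache_fd and call
     * c->backend->pwrite() for it (omitted for brevity) */
    return c->backend->flush(c->backend_opaque);
}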

I believe Sheepdog and RBD could share much of the persistent cache code,
but currently only RBD and Sheepdog use an object store to provide
volumes; other formats/protocols use a file abstraction, so it is hard to
reuse the code for them. Maybe we could provide a VFS-like layer to
accommodate all the block storage systems, whether they sit on top of an
object store or a file store. This is tough work, but worth a try.
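
As a sketch of that VFS-like idea, the cache core might only need a small
table of operations from each backend: object-store drivers (rbd,
sheepdog) would implement them per object, and file-based drivers per
extent. Again, these names are made up purely for illustration.

/* Hypothetical operations a backend would supply to a shared cache core. */
#include <stddef.h>
#include <stdint.h>

struct cache_store_ops {
    /* read one cache unit (an object, or a file extent) from the backend */
    int (*read_unit)(void *opaque, uint64_t idx, void *buf, size_t len);
    /* write one dirty cache unit back to the backend */
    int (*write_unit)(void *opaque, uint64_t idx, const void *buf, size_t len);
    /* make previously written units durable (guest flush) */
    int (*sync)(void *opaque);
    /* unit size: the 4M object size for sheepdog/rbd, configurable for files */
    size_t unit_size;
};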

Thanks,
Yuan


