Re: [Qemu-devel] Adding a persistent writeback cache to qemu
From: Stefan Hajnoczi
Subject: Re: [Qemu-devel] Adding a persistent writeback cache to qemu
Date: Mon, 24 Jun 2013 11:31:35 +0200
User-agent: Mutt/1.5.21 (2010-09-15)
On Fri, Jun 21, 2013 at 11:18:07PM +0800, Liu Yuan wrote:
> On 06/20/2013 11:58 PM, Sage Weil wrote:
> > On Thu, 20 Jun 2013, Stefan Hajnoczi wrote:
> >>> The concrete problem here is that flashcache/dm-cache/bcache don't
> >>> work with the rbd (librbd) driver, as flashcache/dm-cache/bcache
> >>> cache access to block devices (in the host layer), and with rbd
> >>> (for instance) there is no access to a block device at all. block/rbd.c
> >>> simply calls librbd which calls librados etc.
> >>>
> >>> So the context switches etc. I am avoiding are the ones that would
> >>> be introduced by using kernel rbd devices rather than librbd.
> >>
> >> I understand the limitations with kernel block devices - their
> >> setup/teardown is an extra step outside QEMU and privileges need to be
> >> managed. That basically means you need to use a management tool like
> >> libvirt to make it usable.
> >>
> >> But I don't understand the performance angle here. Do you have profiles
> >> that show kernel rbd is a bottleneck due to context switching?
> >>
> >> We use the kernel page cache for -drive file=test.img,cache=writeback
> >> and no one has suggested reimplementing the page cache inside QEMU for
> >> better performance.
> >>
> >> Also, how do you want to manage QEMU page cache with multiple guests
> >> running? They are independent and know nothing about each other. Their
> >> process memory consumption will be bloated and the kernel memory
> >> management will end up having to sort out who gets to stay in physical
> >> memory.
> >>
> >> You can see I'm skeptical of this and think it's premature optimization,
> >> but if there's really a case for it with performance profiles then I
> >> guess it would be necessary. But we should definitely get feedback from
> >> the Ceph folks too.
> >>
> >> I'd like to hear from Ceph folks what their position on kernel rbd vs
> >> librados is. Why one do they recommend for QEMU guests and what are the
> >> pros/cons?
> >
> > I agree that a flashcache/bcache-like persistent cache would be a big win
> > for qemu + rbd users.
> >
> > There are a few important issues with librbd vs kernel rbd:
> >
> > * librbd tends to get new features more quickly than kernel rbd
> > (although now that layering has landed in 3.10 this will be less
> > painful than it was).
> >
> > * Using kernel rbd means users need bleeding edge kernels, a non-starter
> > for many orgs that are still running things like RHEL. Bug fixes are
> > difficult to roll out, etc.
> >
> > * librbd has an in-memory cache that behaves similarly to an HDD's cache
> > (e.g., it forces writeback on flush). This improves performance
> > significantly for many workloads. Of course, having a bcache-like
> > layer mitigates this..
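
[Editor's note: the librbd in-memory cache Sage describes is enabled through ceph.conf. A minimal sketch, assuming the standard rbd cache option names; the sizes here are illustrative, not recommendations:]

```ini
[client]
# enable the librbd in-memory writeback cache
rbd cache = true
# behave like a disk's write cache: stay writethrough until the
# guest issues its first flush, then switch to writeback
rbd cache writethrough until flush = true
# illustrative sizing: 32M cache, writeback forced at 24M dirty
rbd cache size = 33554432
rbd cache max dirty = 25165824
```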
> >
> > I'm not really sure what the best path forward is. Putting the
> > functionality in qemu would benefit lots of other storage backends,
> > putting it in librbd would capture various other librbd users (xen, tgt,
> > and future users like hyper-v), and using new kernels works today but
> > creates a lot of friction for operations.
> >
>
> I think I can share some implementation details about a persistent cache
> for guests because 1) Sheepdog has a persistent object-oriented cache,
> exactly as Alex described, 2) Sheepdog and Ceph's RADOS both provide
> volumes on top of an object store, and 3) Sheepdog chose a persistent
> cache on local disk while Ceph chose an in-memory cache approach.
>
> The main motivation of the object cache is to reduce network traffic and
> improve performance; the cache can be seen as a hard disk's internal
> write cache, which modern kernels support well.
>
> For background, Sheepdog's object cache works similarly to the
> kernel's page cache, except that we cache a 4M object of a volume on
> disk while the kernel caches 4k pages of a file in memory. We use an
> LRU list per volume for reclaim and a dirty list to track dirty objects
> for writeback. We always read ahead a whole object if it is not cached.
>
> The benefits of a disk cache over a memory cache, in my opinion, are
> 1) the VM gets smoother performance because the cache doesn't consume
> memory (if memory is at the high watermark, guest I/O latency becomes
> very high)
> 2) a smaller memory requirement, leaving more memory to the guest
> 3) objects from the base can be shared by all of its child snapshots &
> clones
> 4) a more efficient reclaim algorithm, because the sheep daemon knows
> better than the kernel's dm-cache/bcache/flashcache
> 5) it can easily take advantage of an SSD as a cache backend
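
[Editor's note: the per-volume cache described above (LRU reclaim, a dirty list, writeback forced on flush) can be sketched in a few lines. This is a minimal illustration, not Sheepdog's actual code; `ObjectCache`, `backend_read`, and `backend_write` are hypothetical names:]

```python
from collections import OrderedDict

OBJECT_SIZE = 4 * 1024 * 1024  # Sheepdog caches whole 4M objects


class ObjectCache:
    """Sketch of a per-volume object cache: an LRU list for reclaim
    plus a dirty set so writeback can be deferred until flush."""

    def __init__(self, capacity, backend_read, backend_write):
        self.capacity = capacity        # max cached objects per volume
        self.lru = OrderedDict()        # object_id -> data, in LRU order
        self.dirty = set()              # object ids awaiting writeback
        self.backend_read = backend_read
        self.backend_write = backend_write

    def read(self, object_id):
        if object_id not in self.lru:
            # read ahead the whole object on a miss
            if len(self.lru) >= self.capacity:
                self._reclaim()
            self.lru[object_id] = self.backend_read(object_id)
        self.lru.move_to_end(object_id)  # mark most recently used
        return self.lru[object_id]

    def write(self, object_id, data):
        if object_id not in self.lru and len(self.lru) >= self.capacity:
            self._reclaim()
        self.lru[object_id] = data
        self.lru.move_to_end(object_id)
        self.dirty.add(object_id)        # defer writeback until flush

    def flush(self):
        # a guest flush forces writeback of all dirty objects,
        # like a hard disk's internal write cache
        for object_id in sorted(self.dirty):
            self.backend_write(object_id, self.lru[object_id])
        self.dirty.clear()

    def _reclaim(self):
        # evict the least recently used object, writing it back first
        victim, data = self.lru.popitem(last=False)
        if victim in self.dirty:
            self.backend_write(victim, data)
            self.dirty.discard(victim)
```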
It sounds like the cache is in the sheep daemon and therefore has a
global view of all volumes being accessed from this host. That way it
can do things like share the cached snapshot data between volumes.
This is what I was pointing out about putting the cache in QEMU - you
only know about this QEMU process, not all volumes being accessed from
this host.
Even if Ceph and Sheepdog don't share code, it sounds like they have a
lot in common and it's worth looking at the Sheepdog cache before adding
one to Ceph.
Stefan