
Re: [Qemu-devel] [libvirt] IO accounting overhaul


From: Kevin Wolf
Subject: Re: [Qemu-devel] [libvirt] IO accounting overhaul
Date: Mon, 8 Sep 2014 11:12:15 +0200
User-agent: Mutt/1.5.21 (2010-09-15)

On 08.09.2014 at 09:12, Markus Armbruster wrote:
> Kevin Wolf <address@hidden> writes:
> 
> > On 01.09.2014 at 13:41, Markus Armbruster wrote:
> >> Benoît Canet <address@hidden> writes:
> >> 
> >> > On Monday 01 Sep 2014 at 11:52:00 (+0200), Markus Armbruster wrote:
> >> >> Cc'ing libvirt following Stefan's lead.
> >> >> 
> >> >> Benoît Canet <address@hidden> writes:
> >> >> > /* the following would compute latencies for slices of 1 second,
> >> >> >  * then toss the result and start a new slice. A weighted
> >> >> >  * summation of the instant latencies could help to implement
> >> >> >  * this.
> >> >> >  */
> >> >> > 1s_read_average_latency
> >> >> > 1s_write_average_latency
> >> >> > 1s_flush_average_latency
> >> >> >
> >> >> > /* the former three numbers could be used to further compute a 1
> >> >> > minute slice value */
> >> >> > 1m_read_average_latency
> >> >> > 1m_write_average_latency
> >> >> > 1m_flush_average_latency
> >> >> >
> >> >> > /* the former three numbers could be used to further compute a 1 hour
> >> >> > slice value */
> >> >> > 1h_read_average_latency
> >> >> > 1h_write_average_latency
> >> >> > 1h_flush_average_latency
> >> >> 
> >> >> This is something like "what we added to total_FOO_time in the last
> >> >> completed 1s / 1m / 1h time slice divided by the number of additions".
> >> >> Just another way to accumulate the same raw data, thus no worries.
> >> >> 
> >> >> > /* 1 second average number of requests in flight */
> >> >> > 1s_read_queue_depth
> >> >> > 1s_write_queue_depth
> >> >> >
> >> >> > /* 1 minute average number of requests in flight */
> >> >> > 1m_read_queue_depth
> >> >> > 1m_write_queue_depth
> >> >> >
> >> >> > /* 1 hour average number of requests in flight */
> >> >> > 1h_read_queue_depth
> >> >> > 1h_write_queue_depth
> >
> > I don't think I agree with putting fixed time periods like 1 s/min/h
> > into qemu. Fixed periods are policy, and we should probably make
> > them configurable.
> >
> > Do we need accounting for multiple time periods at the same time or
> > would it be enough to have one and make its duration an option?
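
To illustrate, a rough sketch of what a configurable accumulator could
look like -- none of these names exist in QEMU, it's just one
accumulator per request type with the slice duration as a parameter
instead of hard-coded 1s/1m/1h:

#include <stdint.h>

typedef struct TimedStats {
    int64_t slice_ns;        /* configured window length */
    int64_t slice_start_ns;  /* start of the current window */
    int64_t total_time_ns;   /* latency accumulated in this window */
    uint64_t nr_ops;         /* completions in this window */
    double last_avg_ns;      /* average of the last finished window */
} TimedStats;

static void timed_stats_account(TimedStats *s, int64_t now_ns,
                                int64_t latency_ns)
{
    if (now_ns - s->slice_start_ns >= s->slice_ns) {
        /* window is over: publish its average, start a new one */
        s->last_avg_ns = s->nr_ops
                       ? (double)s->total_time_ns / s->nr_ops : 0.0;
        s->slice_start_ns = now_ns;
        s->total_time_ns = 0;
        s->nr_ops = 0;
    }
    s->total_time_ns += latency_ns;
    s->nr_ops++;
}

Running several of these in parallel would give you the 1s/1m/1h set
Benoît describes, but the duration stays policy, not code.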
> >
> >> > Optionally collecting the same data for each BDS of the graph.
> >> 
> >> If that's the case, keeping the shared infrastructure in the block layer
> >> makes sense.
> >> 
> >> BDS member acct then holds I/O stats for the BDS.  We currently use it
> >> for something else: I/O stats of the device model backed by this BDS.
> >> That needs to move elsewhere.  Two places come to mind:
> >> 
> >> 1. BlockBackend, when it's available (I resumed working on it last week
> >>    for a bit).  Superficially attractive, because it's close to what we
> >>    have now, but then we have to deal with what to do when the backend
> >>    gets disconnected from its device model, then connected to another
> >>    one.
> >> 
> >> 2. The device models that actually implement I/O accounting.  Since
> >>    query-blockstats names a backend rather than a device model, we need
> >>    a BlockDevOps callback to fetch the stats.  Fetch fails when the
> >>    callback is null.  Lets us distinguish "no stats yet" and "device
> >>    model can't do stats", thus permits a QMP interface that doesn't lie.
> >> 
> >> Right now, I like (2) better.
> >
> > So let's say I have some block device, which is attached to a guest
> > device for a while, but then I detach it and continue using it in a
> > different place (maybe another guest device or a block job). Should we
> > really reset all counters in query-blockstats to 0?
> >
> > I think as a user I would be surprised by this, because I still refer
> > to it by the same name (the device_name, which will be in the BB), so
> > it's the same thing for me, and the total requests include everything
> > that was ever issued against it.
> 
> In my opinion, what's wrong here is the user interface: query-blockstats
> lets you query device model I/O statistics, but they're reported for the
> backend rather than the device model.  This is confusing.
> 
> Once you accept that these statistics measure device model behavior,
> it's no longer surprising they go away on device model destruction.
>
> We may want to measure BDS behavior, too.  But that's a separate set of
> stats.

If they measure device model behaviour rather than backend behaviour
(which results in the same numbers as long as you keep them attached),
then the API is broken because it should be using the qdev ID to
identify the device, and not the backend's device_name.

So I would argue that we're measuring the backend (even though as an
implementation detail we don't measure _in_ the backend), and if we
want to measure the device model, too, _that_ is what needs a new
interface.
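
If we do add such a device model interface later, I could imagine
something along these lines -- a sketch only, none of these names
exist today:

#include <stdbool.h>
#include <stdint.h>

typedef struct DeviceModelStats {
    uint64_t rd_bytes, wr_bytes;
    uint64_t rd_ops, wr_ops, flush_ops;
    uint64_t failed_ops;    /* failures as seen by the guest */
} DeviceModelStats;

typedef struct DeviceModelStatsOps {
    /* NULL when the device model doesn't implement accounting */
    void (*get_stats)(void *opaque, DeviceModelStats *stats);
} DeviceModelStatsOps;

static bool fetch_stats(const DeviceModelStatsOps *ops, void *opaque,
                        DeviceModelStats *stats)
{
    if (!ops->get_stats) {
        return false;       /* "device model can't do stats" */
    }
    ops->get_stats(opaque, stats);
    return true;
}

The NULL check is what would let a QMP interface distinguish "device
model can't do stats" from real numbers, as Markus wants.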

> >> > API-wise, I think about adding
> >> > bdrv_acct_invalid() and
> >> > bdrv_acct_failed() and systematically issuing a bdrv_acct_start() asap.
> >> 
> >> Complication: partial success.  Example:
> >> 
> >> 1. Guest requests a read of N sectors.
> >> 
> >> 2. Device model calls
> >>    bdrv_acct_start(s->bs, &req->acct, N * BDRV_SECTOR_SIZE, BDRV_ACCT_READ)
> >> 
> >> 3. Device model examines the request, and deems it valid.
> >> 
> >> 4. Device model passes it to the block layer.
> >> 
> >> 5. Block layer does its thing, but for some reason only M < N sectors
> >>    can be read.  Block layer returns M.
> >
> > No, it returns -errno.
> 
> Really?
> 
> Consider a device that can arrange for a DMA of multiple sectors
> into/from guest memory, where partial success can happen, and the device
> can tell the OS how much I/O succeeded then.

I don't think we have any such device and can't think of any that we
might want to implement. But let's assume the existence of such a device
for the sake of the argument.

> Now let's build a device model.  Consider a read.  The guest passes some
> guest memory to fill to the device model.  The device model passes it on
> to the block layer.  The block layer succeeds only partially.  Now the
> device model needs to figure out how much succeeded, so it can tell the
> guest OS.  How does it do that?

Currently it doesn't. The block layer doesn't have a concept of
"succeeding partially". A request either succeeds fully, or it fails. If
you want to expose partial success to the guest, you would have to
fundamentally change the block layer read/write APIs (probably it's
"just" allowing short reads/writes, but today everyone relies on the
fact that those don't happen).
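
To make the contrast concrete, simplified signatures (the second one
is hypothetical):

/* today's contract: all nb_sectors succeed, or -errno */
int bdrv_read(BlockDriverState *bs, int64_t sector_num,
              uint8_t *buf, int nb_sectors);

/* a short-read variant would return the number of sectors actually
 * read (0 <= ret <= nb_sectors) or -errno -- and every caller that
 * assumes all-or-nothing would need fixing */
int bdrv_read_short(BlockDriverState *bs, int64_t sector_num,
                    uint8_t *buf, int nb_sectors);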

In practice, we would probably just never emulate partial success but
signal failure even to a device that could handle partial success.
Trying to be clever in corner cases is hardly ever a good idea anyway.

> >> 6. What's the device model to do now?  Both bdrv_acct_failed() and
> >>    bdrv_acct_done() would be wrong.
> >> 
> >>    Should the device model account for a read of size M?  This ignores
> >>    the partial failure.
> >> 
> >>    Should it split the read into a successful and a failed part for
> >>    accounting purposes?  This miscounts the number of reads.
> >
> > I think we should simply account it as a failed request because this is
> > what it will look like for the guest. If you want the partial data that
> > was internally issued, you need to look at different statistics
> > (probably those of bs->file).
> 
> I doubt "complete success" and "complete failure" are all guests could
> ever see.

If you define everything that isn't a complete success as a failure,
then you've covered all cases.
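
In code, the device model's completion path then stays trivial -- a
sketch using Benoît's proposed bdrv_acct_failed(), with made-up
request and completion helper names:

typedef struct MyReq {
    BlockDriverState *bs;
    BlockAcctCookie acct;   /* filled by bdrv_acct_start() on submit */
} MyReq;

static void my_request_cb(void *opaque, int ret)
{
    MyReq *req = opaque;

    if (ret < 0) {
        /* the block layer never returns partial success, so any
         * error is accounted as a completely failed request */
        bdrv_acct_failed(req->bs, &req->acct);
    } else {
        bdrv_acct_done(req->bs, &req->acct);
    }
    complete_to_guest(req, ret);   /* made-up guest completion */
}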

Kevin


