qemu-devel
Re: [Qemu-devel] [PATCH v6 01/22] instrument: Add documentation


From: Emilio G. Cota
Subject: Re: [Qemu-devel] [PATCH v6 01/22] instrument: Add documentation
Date: Fri, 29 Sep 2017 13:59:43 -0400
User-agent: Mutt/1.5.24 (2015-08-30)

On Fri, Sep 29, 2017 at 16:16:41 +0300, Lluís Vilanova wrote:
> Lluís Vilanova writes:
> [...]
> > This was working on a much older version of instrumentation for QEMU, but I can
> > implement something that does the first use-case point above and some filtering
> > example (second use-case point) to see what's the performance difference.
> 
> Ok, so here's some numbers for the discussion (booting Emilio's ARM full system
> image that immediately shuts down):
> 
> * Without instrumentation
> 
>   real        0m10,099s
>   user        0m9,876s
>   sys         0m0,128s
> 
> * Count number of memory access writes, by instrumenting only when they are
>   executed
> 
>   real        0m15,896s
>   user        0m15,752s
>   sys         0m0,108s
> 
> * Counting same, but the filtering is done at translation time (i.e., not
>   generate an execute-time callback if it's not a write)
> 
>   real        0m11,084s
>   user        0m10,880s
>   sys         0m0,112s
> 
>   As Peter said, the filtering can be added into the API to take advantage of
>   this "speedup", without exposing translation vs execution time callbacks.

I'm not sure I understand this concept of filtering. Are you saying that in
the first case all memory accesses are instrumented, and the "access helper"
then calls the user's callback only if the access is a write? And that in
the second case we simply generate a "write helper" instead of an
"access helper"? Am I understanding this correctly?
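
For concreteness, the two readings can be sketched as a toy C model. This
is only an illustration under stated assumptions: MemAccess, Stats,
MAX_ACCESSES and both function names are hypothetical, nothing here is
QEMU or instrumentation API, and "translation" is reduced to a loop that
decides once per access whether a helper gets emitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_ACCESSES 64 /* arbitrary bound for this toy model */

/* Hypothetical stand-in for one guest memory access. */
typedef struct {
    bool is_write;
} MemAccess;

typedef struct {
    unsigned long helper_calls; /* helpers invoked at execution time */
    unsigned long writes;       /* writes counted by the user callback */
} Stats;

/* Variant 1: an "access helper" runs for every access and filters there. */
static Stats count_writes_exec_filter(const MemAccess *acc, size_t n)
{
    Stats s = { 0, 0 };

    for (size_t i = 0; i < n; i++) {
        s.helper_calls++;          /* helper call for reads and writes */
        if (acc[i].is_write) {
            s.writes++;            /* user callback only for writes */
        }
    }
    return s;
}

/* Variant 2: a "write helper" is generated only for writes. */
static Stats count_writes_translate_filter(const MemAccess *acc, size_t n)
{
    bool helper_emitted[MAX_ACCESSES];
    Stats s = { 0, 0 };

    assert(n <= MAX_ACCESSES);

    /* "Translation": decide once per access whether to emit a helper. */
    for (size_t i = 0; i < n; i++) {
        helper_emitted[i] = acc[i].is_write;
    }
    /* "Execution": reads carry no helper, so they cost nothing extra. */
    for (size_t i = 0; i < n; i++) {
        if (helper_emitted[i]) {
            s.helper_calls++;
            s.writes++;
        }
    }
    return s;
}
```

Both variants report the same number of writes; they differ only in how
many helper calls happen at execution time, which is where the difference
in the quoted timings would come from if this reading is right.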

> * Counting number of executed instructions, by instrumenting the beginning of
>   each one of them
> 
>   real        0m24,583s
>   user        0m24,352s
>   sys         0m0,184s
> 
> * Counting same, but per-TB numbers are collected at translation-time, and we
>   only generate a per-TB execution time callback to add the corresponding number
>   of instructions for that TB
> 
>   real        0m11,151s
>   user        0m10,952s
>   sys         0m0,092s
> 
>   This really needs to expose translation vs execution time callbacks to take
>   advantage of this "speedup".

Clearly instrumenting per-TB is a significant net gain. I think we should
definitely allow instrumenters to use this option.
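
A toy C model of why the per-TB scheme is cheaper, as a sketch only:
ToyTB, Count and both function names are hypothetical and unrelated to
any QEMU structure, and the execution trace is just an array of TB
indices. The instruction count per TB is assumed to have been collected
at translation time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical translated block; n_insns is known at translation time. */
typedef struct {
    unsigned n_insns;
} ToyTB;

typedef struct {
    unsigned long insns;     /* executed guest instructions */
    unsigned long callbacks; /* execution-time callbacks invoked */
} Count;

/* One execution-time callback at the beginning of every instruction. */
static Count count_per_insn(const ToyTB *tbs, size_t n_tbs,
                            const size_t *trace, size_t trace_len)
{
    Count c = { 0, 0 };

    for (size_t i = 0; i < trace_len; i++) {
        assert(trace[i] < n_tbs);
        for (unsigned j = 0; j < tbs[trace[i]].n_insns; j++) {
            c.callbacks++;
            c.insns++;
        }
    }
    return c;
}

/* One execution-time callback per TB, adding the precomputed count. */
static Count count_per_tb(const ToyTB *tbs, size_t n_tbs,
                          const size_t *trace, size_t trace_len)
{
    Count c = { 0, 0 };

    for (size_t i = 0; i < trace_len; i++) {
        assert(trace[i] < n_tbs);
        c.callbacks++;
        c.insns += tbs[trace[i]].n_insns;
    }
    return c;
}
```

Both functions arrive at the same instruction total; the per-TB variant
simply pays for one callback per executed TB rather than one per
executed instruction.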

FWIW my experiments so far show similar numbers for instrumenting each
instruction (I haven't done the per-TB case yet). The difference is that I'm
exposing to instrumenters a copy of the guest instructions (const void *data,
size_t size). These copies are kept around until TBs are flushed.
Luckily, keeping them around seems to cost very little apart from the
memory itself: the necessary allocations do not induce significant
performance overhead.
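
The copy-until-flush idea can be sketched in plain C as below. This is
not the code from the experiments: InsnCopy, CopyList, copy_insn() and
flush_copies() are hypothetical names, and a simple linked list stands
in for whatever bookkeeping is actually used. Only the (const void *data,
size_t size) shape comes from the description above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record holding a private copy of one guest instruction. */
typedef struct InsnCopy {
    void *data;
    size_t size;
    struct InsnCopy *next;
} InsnCopy;

typedef struct {
    InsnCopy *head;
    size_t n; /* number of live copies */
} CopyList;

/* Copy the guest bytes; the returned pointer stays valid until flush. */
static const void *copy_insn(CopyList *l, const void *data, size_t size)
{
    InsnCopy *c;

    assert(size > 0);
    c = malloc(sizeof(*c));
    if (!c) {
        return NULL;
    }
    c->data = malloc(size);
    if (!c->data) {
        free(c);
        return NULL;
    }
    memcpy(c->data, data, size);
    c->size = size;
    c->next = l->head;
    l->head = c;
    l->n++;
    return c->data;
}

/* Release every copy at once, as would happen when TBs are flushed. */
static void flush_copies(CopyList *l)
{
    while (l->head) {
        InsnCopy *c = l->head;

        l->head = c->next;
        free(c->data);
        free(c);
    }
    l->n = 0;
}
```

In this sketch the cost is two allocations and one memcpy per
instruction at translation time, and nothing at execution time.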

Thanks,

                Emilio


