
Re: [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG.


From: Frederic Konrad
Subject: Re: [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG.
Date: Tue, 11 Aug 2015 08:27:23 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.5.0

On 11/08/2015 08:15, Benjamin Herrenschmidt wrote:
On Mon, 2015-08-10 at 17:26 +0200, address@hidden wrote:
From: KONRAD Frederic <address@hidden>

This is the 7th round of the MTTCG patch series.


It can be cloned from:
address@hidden:fkonrad/mttcg.git branch multi_tcg_v7.

This patch set tries to address the different issues in the global picture of
MTTCG, presented on the wiki.

== Needed patch for our work ==

Some preliminaries are needed for our work:
  * current_cpu doesn't make sense in mttcg so a tcg_executing flag is added to
    the CPUState.
Can't you just make it a TLS ?

True, that can be done as well. But the tcg_exec_flags has a second meaning, saying
"you can't start executing code right now because I want to do a safe_work".

  * We need to run some work safely when all VCPUs are outside their execution
    loop. This is done with the async_run_safe_work_on_cpu function introduced
    in this series.
  * QemuSpin lock is introduced (POSIX only for now) to allow faster handling of
    atomic instructions.
How do you handle the memory model ? I.e., ARM and PPC are OO (out of order)
while x86 is (mostly) in order, so emulating ARM/PPC on x86 is fine but
emulating x86 on ARM or PPC will lead to problems unless you generate memory
barriers with every load/store ..

For the moment we are trying to do the first case.

At least on POWER7 and later on PPC we have the possibility of setting
the attribute "Strong Access Ordering" with mremap/mprotect (I don't
remember which one) which gives us x86-like memory semantics...

I don't know if ARM supports something similar. On the other hand, when
emulating ARM on PPC or vice-versa, we can probably get away with no
barriers.

Do you expose some kind of guest memory model info to the TCG backend so
it can decide how to handle these things ?
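
To make the question above concrete, here is a minimal sketch of the decision a
backend could take if it were told the guest's memory model. The MemOrder enum
and backend_needs_barriers() are hypothetical names, not something from the
series, and a two-level weak/strong split is a deliberate simplification of
real memory models.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical two-level classification; real memory models are finer-grained. */
typedef enum {
    MEM_ORDER_WEAK,   /* e.g. ARM, PPC */
    MEM_ORDER_STRONG, /* e.g. x86 (mostly in order) */
} MemOrder;

/*
 * Barriers are only needed when the guest promises stronger ordering
 * than the host provides by itself.
 */
bool backend_needs_barriers(MemOrder guest, MemOrder host)
{
    return guest > host;
}
```

With this, ARM on x86 and ARM on PPC need no extra barriers, while x86 on ARM
or PPC does (unless something like Strong Access Ordering is enabled on the
host, which this sketch does not model).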

== Code generation and cache ==

As QEMU stands, there is no protection at all against two threads attempting to
generate code at the same time or modifying a TranslationBlock.
The "protect TBContext with tb_lock" patch addresses the issue of code generation
and makes all the tb_* functions thread safe (except tb_flush).
This raised the question of one or multiple caches. We chose to use one
unified cache because it's easier as a first step and since the structure of
QEMU effectively has a ‘local’ cache per CPU in the form of the jump cache, we
don't see the benefit of having two pools of tbs.
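
A rough sketch of the arrangement described above: one shared pool of
TranslationBlocks protected by a single lock, with a lock-free per-CPU jump
cache in front of it. This is not the code from the series. The structures are
stripped-down stand-ins, the sizes and the tb_find() shape are assumptions, and
calloc() merely stands in for code generation.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define JMP_CACHE_SIZE 64   /* hypothetical sizes */
#define TB_HASH_SIZE   256
#define NR_PCS         32

typedef struct TranslationBlock {
    uint64_t pc;
    struct TranslationBlock *hash_next;
} TranslationBlock;

typedef struct CPUState {
    TranslationBlock *jmp_cache[JMP_CACHE_SIZE]; /* private to one VCPU */
} CPUState;

static TranslationBlock *tb_hash[TB_HASH_SIZE];  /* the single shared pool */
static pthread_mutex_t tb_lock = PTHREAD_MUTEX_INITIALIZER;

TranslationBlock *tb_find(CPUState *cpu, uint64_t pc)
{
    TranslationBlock *tb = cpu->jmp_cache[pc % JMP_CACHE_SIZE];

    if (tb && tb->pc == pc) {
        return tb;                     /* fast path: no lock taken */
    }
    pthread_mutex_lock(&tb_lock);      /* slow path: lookup or "generate" */
    for (tb = tb_hash[pc % TB_HASH_SIZE]; tb; tb = tb->hash_next) {
        if (tb->pc == pc) {
            break;
        }
    }
    if (!tb) {
        tb = calloc(1, sizeof(*tb));   /* stand-in for code generation */
        if (tb) {
            tb->pc = pc;
            tb->hash_next = tb_hash[pc % TB_HASH_SIZE];
            tb_hash[pc % TB_HASH_SIZE] = tb;
        }
    }
    pthread_mutex_unlock(&tb_lock);
    if (tb) {
        cpu->jmp_cache[pc % JMP_CACHE_SIZE] = tb;
    }
    return tb;
}

typedef struct Worker {
    CPUState cpu;
    TranslationBlock *seen[NR_PCS];
} Worker;

static void *worker_fn(void *arg)
{
    Worker *w = arg;

    for (int i = 0; i < NR_PCS; i++) {
        w->seen[i] = tb_find(&w->cpu, 0x1000 + 4 * (uint64_t)i);
    }
    return NULL;
}

/* Returns 0 if two VCPU threads end up sharing one TB per guest pc. */
int tb_cache_demo(void)
{
    static Worker a, b;                /* zero-initialized */
    pthread_t ta, tb;

    if (pthread_create(&ta, NULL, worker_fn, &a) != 0) {
        return -1;
    }
    if (pthread_create(&tb, NULL, worker_fn, &b) != 0) {
        pthread_join(ta, NULL);
        return -1;
    }
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    for (int i = 0; i < NR_PCS; i++) {
        if (!a.seen[i] || a.seen[i] != b.seen[i]) {
            return 1;
        }
    }
    return 0;
}
```

Calling tb_cache_demo() from a main() exercises it. The sketch never
invalidates or flushes a block, which is exactly the hard part discussed in the
next section.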

== Dirty tracking ==

Protecting the IOs:
To allow all VCPU threads to run at the same time we need to drop the
global_mutex as soon as possible. I/O accesses need to take the mutex. This is
likely to change when http://thread.gmane.org/gmane.comp.emulators.qemu/345258
is upstreamed.

Invalidation of TranslationBlocks:
We can have all VCPUs running during an invalidation. Each VCPU is able to clean
its jump cache itself as it is in CPUState so that can be handled by a simple
call to async_run_on_cpu. However tb_invalidate also writes to the
TranslationBlock which is shared as we have only one pool.
Hence this part of invalidate requires all VCPUs to exit before it can be done,
and async_run_safe_work_on_cpu is introduced to handle this case.
What about the host MMU emulation ? Is that multithreaded ? It has
potential issues when doing things like dirty bit updates into guest
memory, those need to be done atomically. Also TLB invalidations on ARM
and PPC are global, so they will need to invalidate the remote SW TLBs
as well.

Do you have a mechanism to synchronize with another thread ? IE, make it
pop out of TCG if already in and prevent it from getting in ? That way
you can "remotely" invalidate its TLB...
Yes, that's what the safe_work is doing. Ask everybody to exit, prevent VCPUs
from resuming (tcg_exec_flag) and do the work when everybody is outside cpu-exec.
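
The three steps just described can be sketched with a mutex and a condition
variable. This is only an illustration under assumptions, not the
async_run_safe_work_on_cpu implementation: all function and variable names are
made up, the VCPUs here leave their execution slice on their own instead of
being kicked out, and the work runs synchronously in the caller.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int nr_executing;        /* VCPUs currently inside their exec loop */
static bool safe_work_pending;  /* "you can't start executing right now" */

void vcpu_exec_enter(void)
{
    pthread_mutex_lock(&lock);
    while (safe_work_pending) {          /* prevented from resuming */
        pthread_cond_wait(&cond, &lock);
    }
    nr_executing++;
    pthread_mutex_unlock(&lock);
}

void vcpu_exec_exit(void)
{
    pthread_mutex_lock(&lock);
    nr_executing--;
    pthread_cond_broadcast(&cond);       /* a requester may be waiting */
    pthread_mutex_unlock(&lock);
}

/* Must be called from outside the exec loop, or it would wait on itself. */
void run_safe_work(void (*fn)(void *), void *opaque)
{
    pthread_mutex_lock(&lock);
    while (safe_work_pending) {          /* one safe work at a time */
        pthread_cond_wait(&cond, &lock);
    }
    safe_work_pending = true;            /* block new entries */
    while (nr_executing > 0) {           /* wait for everybody to exit */
        pthread_cond_wait(&cond, &lock);
    }
    fn(opaque);                          /* everybody is outside: do the work */
    safe_work_pending = false;
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&lock);
}

static atomic_int in_slice;              /* independent check for the demo */
static int violations, work_runs;

static void check_quiescent(void *opaque)
{
    (void)opaque;
    if (atomic_load(&in_slice) != 0) {
        violations++;
    }
    work_runs++;
}

static void *vcpu_fn(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000; i++) {
        vcpu_exec_enter();
        atomic_fetch_add(&in_slice, 1);  /* translated code would run here */
        atomic_fetch_sub(&in_slice, 1);
        vcpu_exec_exit();
    }
    return NULL;
}

/* Returns 0 if every safe work ran with no VCPU inside its slice. */
int safe_work_demo(void)
{
    enum { NR_CPUS = 4, NR_WORKS = 10 };
    pthread_t t[NR_CPUS];
    int started = 0;

    violations = work_runs = 0;
    while (started < NR_CPUS &&
           pthread_create(&t[started], NULL, vcpu_fn, NULL) == 0) {
        started++;
    }
    for (int i = 0; i < NR_WORKS; i++) {
        run_safe_work(check_quiescent, NULL);
    }
    for (int i = 0; i < started; i++) {
        pthread_join(t[i], NULL);
    }
    return violations + (work_runs != NR_WORKS) + (started != NR_CPUS);
}
```

Here safe_work_pending plays the role of the second meaning of the flag: a VCPU
that is outside its loop cannot get back in until the work has been done.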


== Atomic instruction ==

For now only ARM on x64 is supported, by using a cmpxchg instruction.
Specifically the limitation of this approach is that it is harder to support
64-bit ARM on a host architecture that is multi-core, but only supports 32-bit
cmpxchg (we believe this could be the case for some PPC cores).
Right, on the other hand 64-bit will do fine. But then x86 has 2-value
atomics nowadays, doesn't it ? And that will be hard to emulate on
anything. You might need to have some kind of global hashed lock list
used by atomics (hash the physical address) as a fallback if you don't
have a 1:1 match between host and guest capabilities.
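
A minimal sketch of such a hashed lock list, emulating a two-value (128-bit)
compare-and-swap with plain mutexes. Everything here is an assumption made for
illustration: the names, the table size, the hash, and the use of the host
address where a real implementation would hash the guest physical address. It
is only atomic with respect to other accesses that go through the same table.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NR_ATOMIC_LOCKS 64  /* hypothetical table size */

typedef struct U128 {
    uint64_t lo, hi;
} U128;

static pthread_mutex_t atomic_locks[NR_ATOMIC_LOCKS];
static pthread_once_t atomic_locks_once = PTHREAD_ONCE_INIT;

static void atomic_locks_init(void)
{
    for (int i = 0; i < NR_ATOMIC_LOCKS; i++) {
        pthread_mutex_init(&atomic_locks[i], NULL);
    }
}

/* Hash the address (16-byte granularity) to pick one lock from the list. */
static pthread_mutex_t *atomic_lock_for(uintptr_t addr)
{
    pthread_once(&atomic_locks_once, atomic_locks_init);
    return &atomic_locks[(addr >> 4) % NR_ATOMIC_LOCKS];
}

/*
 * Compare *ptr with *expected; store desired on a match, otherwise copy
 * the current value back into *expected. Returns true on success.
 */
bool emulated_cmpxchg128(U128 *ptr, U128 *expected, U128 desired)
{
    pthread_mutex_t *m = atomic_lock_for((uintptr_t)ptr);
    bool ok;

    pthread_mutex_lock(m);
    ok = ptr->lo == expected->lo && ptr->hi == expected->hi;
    if (ok) {
        *ptr = desired;
    } else {
        *expected = *ptr;
    }
    pthread_mutex_unlock(m);
    return ok;
}

static U128 counter;

static void *inc_fn(void *arg)
{
    (void)arg;
    for (int i = 0; i < 10000; i++) {
        U128 old = { 0, 0 }, next;

        do {                             /* classic CAS retry loop */
            next.lo = old.lo + 1;
            next.hi = 2 * next.lo;       /* both halves must stay in sync */
        } while (!emulated_cmpxchg128(&counter, &old, next));
    }
    return NULL;
}

/* Returns 0 if no increment was lost and both halves stayed consistent. */
int hashed_lock_demo(void)
{
    enum { NR_THREADS = 4 };
    pthread_t t[NR_THREADS];
    int started = 0;

    counter = (U128){ 0, 0 };
    while (started < NR_THREADS &&
           pthread_create(&t[started], NULL, inc_fn, NULL) == 0) {
        started++;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(t[i], NULL);
    }
    return !(counter.lo == 10000u * (uint64_t)started &&
             counter.hi == 2 * counter.lo && started == NR_THREADS);
}
```

The cost is that unrelated addresses hashing to the same slot contend on one
lock, and every plain guest store to such a location would have to take the
lock too, which is why this only makes sense as a fallback.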
VOS did a "Slow path for atomic instruction translation" series you can find here:
https://lists.gnu.org/archive/html/qemu-devel/2015-08/msg00971.html

Which will be used in the end.

Thanks,
Fred

Cheers,
Ben.






