Re: [Qemu-devel] [RFC 00/10] MultiThread TCG.

From: Alex Bennée
Subject: Re: [Qemu-devel] [RFC 00/10] MultiThread TCG.
Date: Thu, 23 Apr 2015 16:44:28 +0100
Frederic Konrad <address@hidden> writes:
> On 10/04/2015 18:03, Frederic Konrad wrote:
>> On 30/03/2015 23:46, Peter Maydell wrote:
>>> On 30 March 2015 at 07:52, Mark Burton <address@hidden>
>>> wrote:
>>>> So - Fred is unwilling to send the patch set as it stands, because
>>>> frankly this part is totally broken.
>>>>
>>>> There is an independent patch set that needs splitting out which
>>>> deals with just the atomic instruction issue - specifically for ARM
>>>> (though I guess it’s applicable across the board)…
>>>>
>>>> So - in short - I HOPE to get the patch set onto the reflector
>>>> sometime next week, and I’m sorry for the delay.
>>> What I really want to see is not so much the patch set
>>> but the design sketch I asked for that lists the
>>> various data structures and indicates which ones
>>> are going to be per-cpu, which ones will be shared
>>> (and with what locking), etc.
>>>
>>> -- PMM
>
> Does that make sense?
>
> BTW here is the repository:
> git clone address@hidden:fkonrad/mttcg.git -b multi_tcg_v4
Is there a non-authenticated read-only http or git:// access to this repo?
>
> Thanks,
> Fred
>
>> Hi everybody,
>> Hi Peter,
>>
>> I tried to recap what we did, how it "works", and what the status is:
>>
>> All the mechanisms are basically unchanged.
>>
>> A lot of TCG structures are not thread-safe, yet all TCG threads can run
>> at the same time and sometimes want to generate code at the same time.
>>
>> Translation block related structure:
>>
>> struct TBContext {
>>
>>     TranslationBlock *tbs;
>>     TranslationBlock *tb_phys_hash[CODE_GEN_PHYS_HASH_SIZE];
>>     int nb_tbs;
>>     /* any access to the tbs or the page table must use this lock */
>>     QemuMutex tb_lock;
>>
>>     /* statistics */
>>     int tb_flush_count;
>>     int tb_phys_invalidate_count;
>>
>>     int tb_invalidated_flag;
>> };
>>
>> This structure is used in TCGContext: TBContext tb_ctx;
>>
>> "tbs" is basically where the translated blocks are stored, and
>> tb_phys_hash is a hash table to find them quickly.
>>
>> There are two solutions to prevent thread issues:
>> A/ Just have two tb_ctx.
>> B/ Share it between CPUs and protect the tb_ctx access.
>>
>> We took the second solution so all CPUs can benefit from the translated
>> TBs.
>> TBContext is written to almost everywhere in translate-all.c.
>> When there are too many TBs, a tb_flush occurs and destroys the array. We
>> don't handle this case right now.
>> tb_lock is already used by user-mode code, so we just converted it to a
>> QemuMutex so we can reuse it in system-mode.
>>
>> struct TCGContext {
>>     uint8_t *pool_cur, *pool_end;
>>     TCGPool *pool_first, *pool_current, *pool_first_large;
>>     TCGLabel *labels;
>>     int nb_labels;
>>     int nb_globals;
>>     int nb_temps;
>>
>>     /* goto_tb support */
>>     tcg_insn_unit *code_buf;
>>     uintptr_t *tb_next;
>>     uint16_t *tb_next_offset;
>>     uint16_t *tb_jmp_offset; /* != NULL if USE_DIRECT_JUMP */
>>
>>     /* liveness analysis */
>>     uint16_t *op_dead_args; /* for each operation, each bit tells if the
>>                                corresponding argument is dead */
>>     uint8_t *op_sync_args;  /* for each operation, each bit tells if the
>>                                corresponding output argument needs to be
>>                                sync to memory. */
>>
>>     /* tells in which temporary a given register is. It does not take
>>        into account fixed registers */
>>     int reg_to_temp[TCG_TARGET_NB_REGS];
>>     TCGRegSet reserved_regs;
>>     intptr_t current_frame_offset;
>>     intptr_t frame_start;
>>     intptr_t frame_end;
>>     int frame_reg;
>>
>>     tcg_insn_unit *code_ptr;
>>     TCGTemp temps[TCG_MAX_TEMPS]; /* globals first, temps after */
>>     TCGTempSet free_temps[TCG_TYPE_COUNT * 2];
>>
>>     GHashTable *helpers;
>>
>> #ifdef CONFIG_PROFILER
>>     /* profiling info */
>>     int64_t tb_count1;
>>     int64_t tb_count;
>>     int64_t op_count; /* total insn count */
>>     int op_count_max; /* max insn per TB */
>>     int64_t temp_count;
>>     int temp_count_max;
>>     int64_t del_op_count;
>>     int64_t code_in_len;
>>     int64_t code_out_len;
>>     int64_t interm_time;
>>     int64_t code_time;
>>     int64_t la_time;
>>     int64_t opt_time;
>>     int64_t restore_count;
>>     int64_t restore_time;
>> #endif
>>
>> #ifdef CONFIG_DEBUG_TCG
>>     int temps_in_use;
>>     int goto_tb_issue_mask;
>> #endif
>>
>>     uint16_t gen_opc_buf[OPC_BUF_SIZE];
>>     TCGArg gen_opparam_buf[OPPARAM_BUF_SIZE];
>>
>>     uint16_t *gen_opc_ptr;
>>     TCGArg *gen_opparam_ptr;
>>     target_ulong gen_opc_pc[OPC_BUF_SIZE];
>>     uint16_t gen_opc_icount[OPC_BUF_SIZE];
>>     uint8_t gen_opc_instr_start[OPC_BUF_SIZE];
>>
>>     /* Code generation.  Note that we specifically do not use tcg_insn_unit
>>        here, because there's too much arithmetic throughout that relies
>>        on addition and subtraction working on bytes.  Rely on the GCC
>>        extension that allows arithmetic on void*.  */
>>     int code_gen_max_blocks;
>>     void *code_gen_prologue;
>>     void *code_gen_buffer;
>>     size_t code_gen_buffer_size;
>>     /* threshold to flush the translated code buffer */
>>     size_t code_gen_buffer_max_size;
>>     void *code_gen_ptr;
>>
>>     TBContext tb_ctx;
>>
>>     /* The TCGBackendData structure is private to tcg-target.c.  */
>>     struct TCGBackendData *be;
>> };
>>
>> This structure is used to translate the TBs.
>> The easiest solution was to protect code generation so that only one CPU
>> can generate code at a time. This is reasonable, as we don't want
>> duplicate translated TBs in the pool anyway. This is achieved with the
>> tb_lock mentioned above.
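The scheme described above, a shared TB cache where tb_lock serializes both lookup and generation, can be sketched as a lookup-or-translate pattern. This is an illustrative miniature, not QEMU's actual code: all names here (mini_tb, tb_find_or_create, the pool sizes) are made up for the example.

```c
#include <pthread.h>
#include <stdint.h>

#define HASH_SIZE 256

struct mini_tb {
    uint64_t pc;               /* guest PC this block translates */
    struct mini_tb *next;      /* hash-chain link */
};

struct mini_tb *tb_hash[HASH_SIZE];    /* stands in for tb_phys_hash */
struct mini_tb tb_pool[1024];          /* stands in for tbs */
int nb_tbs;
pthread_mutex_t tb_lock = PTHREAD_MUTEX_INITIALIZER;

/* Look up a translated block; "generate" it under tb_lock if missing.
 * Because lookup and insertion happen under the same lock, two vCPUs
 * that miss on the same PC simultaneously cannot both insert it, so
 * no duplicate TB ever enters the pool. */
struct mini_tb *tb_find_or_create(uint64_t pc)
{
    unsigned h = pc % HASH_SIZE;
    struct mini_tb *tb;

    pthread_mutex_lock(&tb_lock);
    for (tb = tb_hash[h]; tb; tb = tb->next) {
        if (tb->pc == pc) {            /* already translated by some vCPU */
            pthread_mutex_unlock(&tb_lock);
            return tb;
        }
    }
    tb = &tb_pool[nb_tbs++];           /* "generate" a new block */
    tb->pc = pc;
    tb->next = tb_hash[h];
    tb_hash[h] = tb;
    pthread_mutex_unlock(&tb_lock);
    return tb;
}
```

The same lock then naturally serializes code generation: whichever thread wins the lock translates, and every other thread finds the finished block on its next lookup.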
>>
>> TLB:
>>
>> The TLB is per-CPU, so it is not really a problem, as in our
>> implementation one CPU = one pthread. But sometimes a CPU wants to flush
>> TLBs, for example through an instruction, and it is very likely that
>> another CPU in another thread is executing code at the same time. That's
>> why we chose to create a tlb_flush mechanism:
>> when a CPU wants to flush, it asks all CPUs to exit TCG, waits for them,
>> and then exits itself. This can be reused for tb_invalidate and/or
>> tb_flush as well.
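One way such a rendezvous could be sketched is with a count of vCPUs currently inside TCG plus a condition variable: the requester raises a flag, waits for the count to reach zero, flushes, and releases everyone. All names here (cpu_exec_checkpoint, request_global_flush, NR_CPUS) are hypothetical, not QEMU's.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS 4

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
int cpus_running = NR_CPUS;   /* vCPUs currently inside TCG */
bool exit_request = false;    /* a flush has been requested */
int flush_count = 0;          /* how many global flushes happened */
atomic_bool done = false;     /* demo shutdown flag */

/* Called by each vCPU at a safe point in its execution loop. */
void cpu_exec_checkpoint(void)
{
    pthread_mutex_lock(&lock);
    while (exit_request) {
        cpus_running--;                  /* we are now outside TCG */
        pthread_cond_broadcast(&cond);
        while (exit_request)             /* park until the flush is done */
            pthread_cond_wait(&cond, &lock);
        cpus_running++;                  /* re-enter TCG */
    }
    pthread_mutex_unlock(&lock);
}

/* Called by the vCPU that executed, e.g., a TLB-flush instruction. */
void request_global_flush(void)
{
    pthread_mutex_lock(&lock);
    exit_request = true;
    cpus_running--;                      /* the requester exits TCG too */
    while (cpus_running > 0)             /* wait for everyone else */
        pthread_cond_wait(&cond, &lock);
    flush_count++;                       /* safe: no vCPU is executing */
    cpus_running++;
    exit_request = false;
    pthread_cond_broadcast(&cond);       /* let the other vCPUs continue */
    pthread_mutex_unlock(&lock);
}

static void *vcpu_thread(void *arg)
{
    (void)arg;
    while (!atomic_load(&done))
        cpu_exec_checkpoint();
    pthread_mutex_lock(&lock);           /* this vCPU leaves for good */
    cpus_running--;
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Demo: three worker vCPUs loop at their checkpoints while the main
 * thread, playing CPU 0, requests one global flush. */
int run_flush_demo(void)
{
    pthread_t th[NR_CPUS - 1];
    for (int i = 0; i < NR_CPUS - 1; i++)
        pthread_create(&th[i], NULL, vcpu_thread, NULL);
    request_global_flush();
    atomic_store(&done, true);
    for (int i = 0; i < NR_CPUS - 1; i++)
        pthread_join(th[i], NULL);
    return flush_count;
}
```

As the text notes, the same exit-everyone-then-do-safe-work pattern would serve for tb_flush and tb_invalidate.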
>>
>> Atomic instructions:
>>
>> Atomic instructions are quite hard to implement.
>> The TranslationBlock implementing an atomic instruction must not be
>> interrupted during its execution (e.g. by an interrupt or a signal); a
>> cmpxchg64 helper is used for that.
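A hedged sketch of what such a helper might look like, using host C11 atomics so the update cannot be torn by another thread; the helper name and signature are illustrative, not QEMU's actual helper.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Emulate a guest 64-bit compare-and-exchange with a single host
 * atomic operation, so no other vCPU thread can observe a half-done
 * update. Returns the value previously in memory, as the guest
 * instruction would. (Illustrative signature only.) */
uint64_t helper_cmpxchg64(_Atomic uint64_t *addr,
                          uint64_t expected, uint64_t newval)
{
    uint64_t old = expected;
    atomic_compare_exchange_strong(addr, &old, newval);
    return old;   /* == expected iff the exchange happened */
}

/* A guest read-modify-write (e.g. an LDREX/STREX-style sequence)
 * could then be lowered to a retry loop around the helper. */
void atomic_add64(_Atomic uint64_t *addr, uint64_t delta)
{
    uint64_t old, seen;
    do {
        old = atomic_load(addr);
        seen = helper_cmpxchg64(addr, old, old + delta);
    } while (seen != old);    /* retry if another vCPU intervened */
}
```

The point of the compare-and-swap form is that the whole "read, check, write" step is one indivisible host operation, which is exactly the property an interrupt or signal must not be able to break.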
>>
>> QEMU's global lock:
>>
>> TCG threads take the lock during code execution. This is not acceptable
>> for multi-threading, because it means only one thread can run at a time.
>> That's why we took Jan's patch to allow TCG to run without the lock and
>> take it only when needed.
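The shape of that change can be sketched as follows: each vCPU thread executes with no global lock held and only takes it around accesses to shared (device-like) state. All names are illustrative; this is not Jan's patch, just the locking pattern it enables.

```c
#include <pthread.h>

pthread_mutex_t qemu_global_mutex = PTHREAD_MUTEX_INITIALIZER;
long shared_device_reg;   /* stands in for shared device state */

static void *vcpu_loop(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        /* "code execution": no global lock held, so threads here
         * really do run in parallel */
        if (i % 1000 == 0) {
            /* occasional MMIO-style access to shared state: take the
             * global lock only for this short critical section */
            pthread_mutex_lock(&qemu_global_mutex);
            shared_device_reg++;
            pthread_mutex_unlock(&qemu_global_mutex);
        }
    }
    return NULL;
}

/* Run n vCPU threads to completion; each performs 100 locked
 * accesses, so the final register value is 100 * n. */
long run_vcpus(int n)
{
    pthread_t th[8];
    if (n > 8) n = 8;
    for (int i = 0; i < n; i++)
        pthread_create(&th[i], NULL, vcpu_loop, NULL);
    for (int i = 0; i < n; i++)
        pthread_join(th[i], NULL);
    return shared_device_reg;
}
```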
>>
>> What is the status:
>>
>> * We can start a vexpress-a15 simulation with two A15 cores and run two
>> dhrystone benchmarks at a time; performance is improved and it's quite
>> stable.
>>
>> What is missing:
>>
>> * tb_flush is not implemented correctly.
>> * The PageDesc structure is not protected; the patch which introduced a
>> first_tb array was not the right approach and has been removed. This
>> implies that tb_invalidate is broken.
>>
>> For both issues we plan to use the same mechanism as for tlb_flush:
>> exit all the CPUs, do the flush or invalidate, and let them continue. A
>> generic mechanism must be implemented for that.
>>
>> Known issues:
>>
>> * The GDB stub is broken because it uses tb_invalidate, which we haven't
>> implemented yet, and there are probably other issues.
>> * SMP > 2 crashes, probably because of tb_invalidate as well.
>> * We don't know the status of the user-mode code, which is probably
>> broken by our changes.
>>
--
Alex Bennée