From: Mark Cave-Ayland
Subject: Re: [Qemu-devel] Debian 7.8.0 SPARC64 on qemu - anything i can do to speedup the emulation?
Date: Sun, 02 Aug 2015 14:11:54 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Icedove/31.8.0

On 31/07/15 16:43, Aurelien Jarno wrote:

> On 2015-07-31 17:31, Artyom Tarasenko wrote:
>> On Thu, Jul 30, 2015 at 5:50 PM, Aurelien Jarno <address@hidden> wrote:
>>> On 2015-07-30 10:55, Aurelien Jarno wrote:
>>>> On 2015-07-30 10:16, Dennis Luehring wrote:
>>>>> Am 30.07.2015 um 09:52 schrieb Aurelien Jarno:
>>>>>> On 2015-07-30 05:52, Dennis Luehring wrote:
>>>>>>> Am 29.07.2015 um 17:01 schrieb Aurelien Jarno:
>>>>>>>> The point is that emulation has a cost, and it's quite difficult
>>>>>>>> to lower it and thus improve the emulation speed.
>>>>>>>
>>>>>>> so it's just not strange for you to see 1/100th to 1/200th of the
>>>>>>> native x64 speed under qemu/SPARC64?
>>>>>>> I hoped that someone would jump up and shout "it's impossible - it has
>>>>>>> to be a bug" ...sadly not
>>>>>>
>>>>>> Overall the ratio is more around 10, but in some specific cases, where
>>>>>> the TB cache is inefficient and TBs can't be linked, or with an
>>>>>> inefficient MMU, a ratio of 100 is possible.
>>>>>
>>>>>
>>>>> sysbench (0.4.12) --num-threads=1 --test=cpu --cpu-max-prime=2000 run
>>>>>    Host x64    :   1.3580s
>>>>>    Qemu SPARC64: 184.2532s
>>>>>
>>>>> sysbench shows a ratio of nearly 200
>>>>
>>>> Note that when you say SPARC64 here, it's actually only the kernel; you
>>>> are using a 32-bit userland. And that makes a difference. Here are my
>>>> tests:
>>>>
>>>> host (x86-64)                    0.8976s
>>>> sparc32 guest (sparc64 kernel)  99.6116s
>>>> sparc64 guest (sparc64 kernel)   4.4908s
>>>>
>>>> So it looks like the 32-bit code is not QEMU-friendly. I haven't looked
>>>> at it yet, but I guess it might be due to dynamic jumps, so that TBs
>>>> can't be chained.
>>>
>>> This is the corresponding C code from sysbench, which is run 10000
>>> times.
>>>
>>> | int cpu_execute_request(sb_request_t *r, int thread_id)
>>> | {
>>> |   unsigned long long c;
>>> |   unsigned long long l,t;
>>> |   unsigned long long n=0;
>>> |   log_msg_t           msg;
>>> |   log_msg_oper_t      op_msg;
>>> |
>>> |   (void)r; /* unused */
>>> |
>>> |   /* Prepare log message */
>>> |   msg.type = LOG_MSG_TYPE_OPER;
>>> |   msg.data = &op_msg;
>>> |
>>> |   /* So far we're using very simple test prime number tests in 64bit */
>>> |   LOG_EVENT_START(msg, thread_id);
>>> |
>>> |   for(c=3; c < max_prime; c++)
>>> |   {
>>> |     t = sqrt(c);
>>> |     for(l = 2; l <= t; l++)
>>> |       if (c % l == 0)
>>> |         break;
>>> |     if (l > t )
>>> |       n++;
>>> |   }
>>> |
>>> |   LOG_EVENT_STOP(msg, thread_id);
>>> |
>>> |   return 0;
>>> | }
>>>
>>> This is a very simple test, which is probably not a good representation
>>> of CPU performance, even more so when emulated by QEMU. In addition to
>>> that, given that it mostly uses 64-bit integers, it's kind of expected
>>> that the 32-bit version is slower.
>>>
>>> Anyway I have extracted this code into a C file (see attached file) that
>>> can be more easily compiled to 32 or 64 bit using -m32 or -m64. I observe
>>> the same behavior as sysbench, even with qemu-user (which is not
>>> surprising, as the above code doesn't really put pressure on the MMU).
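
[The attached file is not preserved in this archive. What follows is a
minimal sketch of such a standalone extraction, assuming max_prime matches
the --cpu-max-prime=2000 invocation above and the 10000-iteration request
count mentioned earlier; the file name prime.c is invented.]

  /* Hypothetical standalone extraction of the sysbench loop quoted above.
   * Build 32-bit: gcc -O2 -m32 prime.c -o prime32 -lm
   * Build 64-bit: gcc -O2 -m64 prime.c -o prime64 -lm
   */
  #include <math.h>
  #include <stdio.h>

  static const unsigned long long max_prime = 2000; /* --cpu-max-prime=2000 */

  int main(void)
  {
      unsigned long long c, l, t, n = 0;
      int i;

      for (i = 0; i < 10000; i++) {  /* sysbench runs the request 10000 times */
          n = 0;
          for (c = 3; c < max_prime; c++) {
              t = sqrt(c);
              for (l = 2; l <= t; l++)
                  if (c % l == 0)
                      break;
              if (l > t)
                  n++;
          }
      }

      printf("%llu primes below %llu\n", n, max_prime);
      return 0;
  }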
>>>
>>> Running it, I get the following times:
>>> x86-64 host       0.877s
>>> sparc guest -m32  1m39s
>>> sparc guest -m64   3.5s
>>> opensparc T1 -m32 1m59s
>>> opensparc T1 -m64 1m12s
>>>
>>> So overall QEMU is faster than some not-so-old real hardware. That said,
>>> looking at it quickly, it seems that some of the FP instructions are
>>> actually trapped and emulated by the kernel on the OpenSPARC T1.
>>>
>>> Now coming back to the QEMU problem, the issue is that the 64-bit code
>>> is using the udivx instruction to compute the modulo, while the 32-bit
>>> code calls the __umoddi3 GCC helper.
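
[To make the difference concrete: the hot operation in the inner loop is a
64-bit unsigned modulo. A minimal illustration follows; the instruction
sequences in the comment restate the analysis above and are not verified
disassembly.]

  /* The operation whose code generation differs between the two models. */
  unsigned long long mod64(unsigned long long c, unsigned long long l)
  {
      return c % l;  /* -m64: inline udivx-based sequence;
                        -m32 (V8+): call to libgcc's __umoddi3 */
  }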
>>
>> Actually this looks like a bug/missing feature in gcc. Why doesn't it use
>> the udivx instruction in "SPARC32PLUS, V8+ Required" code?
> 
> No idea.
> 
>>> It uses a lot of integer instructions based on CPU flags, so most of
>>> the time is spent computing them in helper_compute_psr.
>>
>> I wonder if this can be optimized. I guess most RISC CPUs would have a
>> similar problem. Unlike on x86, compilers usually optimize
>> instruction selection based on flag usage. If there is a flag-modifying
>> instruction in the code, the flags will be used for sure, so it probably
>> makes little sense to postpone the flag computation?
> 
> Indeed. ARM and SH4 use one TCG temp per flag, and they can be computed
> one by one using setcond. The optimizer and the liveness analysis then
> get rid of the unused computations. However, while this allows intra-TB
> optimization, it prevents any other flag optimization. Therefore the
> only way to know whether it is a good idea is to implement it and
> benchmark it, using a bit more than a single biased benchmark like
> the one from sysbench.
> 
> Also note that the current implementation predates the introduction of
> setcond, which is necessary to be able to compute the flags using TCG
> code.
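
[For illustration, a minimal sketch of the per-flag setcond approach
described above, in the style of a QEMU TCG frontend. The function name
gen_addcc_flags and the cpu_psr_* globals are invented for this sketch;
this is not actual target-sparc code.]

  /* Hypothetical eager per-flag computation for an ADDcc-style instruction.
   * Each flag lives in its own TCG temp, so liveness analysis can discard
   * any flag that no later code in the same TB consumes.
   */
  static void gen_addcc_flags(TCGv dst, TCGv src1, TCGv src2)
  {
      TCGv zero = tcg_const_tl(0);

      tcg_gen_add_tl(dst, src1, src2);
      /* Z: result == 0 */
      tcg_gen_setcond_tl(TCG_COND_EQ, cpu_psr_z, dst, zero);
      /* N: result is negative */
      tcg_gen_setcond_tl(TCG_COND_LT, cpu_psr_n, dst, zero);
      /* C: unsigned carry out, i.e. the result wrapped below src1 */
      tcg_gen_setcond_tl(TCG_COND_LTU, cpu_psr_c, dst, src1);
      /* V (signed overflow) omitted for brevity */

      tcg_temp_free(zero);
  }

[The trade-off described above is visible here: unused setconds vanish
within a TB, but computing each flag eagerly still forecloses cross-TB
optimizations such as the lazy evaluation helper_compute_psr provides.]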

Aurelien - just to say thank you for looking into this. My focus for
SPARC64, as time allows, has been more on the emulation side, i.e. getting
to the point where it can start to run more OSes, which is gradually
happening over time. Once the basic emulation is complete, trying to
improve performance is definitely something I would like to work on,
although I will likely have many questions :)


ATB,

Mark.



