qemu-devel



From: Artyom Tarasenko
Subject: Re: [Qemu-devel] Debian 7.8.0 SPARC64 on qemu - anything i can do to speedup the emulation?
Date: Fri, 31 Jul 2015 17:31:26 +0200

On Thu, Jul 30, 2015 at 5:50 PM, Aurelien Jarno <address@hidden> wrote:
> On 2015-07-30 10:55, Aurelien Jarno wrote:
>> On 2015-07-30 10:16, Dennis Luehring wrote:
>> > Am 30.07.2015 um 09:52 schrieb Aurelien Jarno:
>> > >On 2015-07-30 05:52, Dennis Luehring wrote:
>> > >> Am 29.07.2015 um 17:01 schrieb Aurelien Jarno:
>> > >> >The point is that emulation has a cost, and it's quite difficult to
>> > >> >lower it and thus improve the emulation speed.
>> > >>
>> > >> so it's just not strange for you to see a 1/100...200 of the native x64
>> > >> speed under qemu/SPARC64
>> > >> i hoped that someone would jump up and shout "it's impossible - it needs
>> > >> to be a bug" ...sadly not
>> > >
>> > >Overall the ratio is more around 10, but in some specific cases where
>> > >the TB cache is inefficient and TB can't be linked or with an
>> > >inefficient MMU, a ratio of 100 is possible.
>> >
>> >
>> > sysbench (0.4.12) --num-threads=1 --test=cpu --cpu-max-prime=2000 run
>> >    Host x64    :   1.3580s
>> >    Qemu SPARC64: 184.2532s
>> >
>> > sysbench shows a ratio of nearly 200
>>
>> Note that when you say SPARC64 here, it's actually only the kernel, you
>> are using a 32-bit userland. And that makes a difference. Here are my
>> tests:
>>
>> host (x86-64)                    0.8976s
>> sparc32 guest (sparc64 kernel)  99.6116s
>> sparc64 guest (sparc64 kernel)   4.4908s
>>
>> So it looks like the 32-bit code is not QEMU friendly. I haven't looked
>> at it yet, but I guess it might be due to dynamic jumps, so that TB
>> can't be chained.
>
> This is the corresponding C code from sysbench, which is run 10000
> times.
>
> | int cpu_execute_request(sb_request_t *r, int thread_id)
> | {
> |   unsigned long long c;
> |   unsigned long long l,t;
> |   unsigned long long n=0;
> |   log_msg_t           msg;
> |   log_msg_oper_t      op_msg;
> |
> |   (void)r; /* unused */
> |
> |   /* Prepare log message */
> |   msg.type = LOG_MSG_TYPE_OPER;
> |   msg.data = &op_msg;
> |
> |   /* So far we're using very simple test prime number tests in 64bit */
> |   LOG_EVENT_START(msg, thread_id);
> |
> |   for(c=3; c < max_prime; c++)
> |   {
> |     t = sqrt(c);
> |     for(l = 2; l <= t; l++)
> |       if (c % l == 0)
> |         break;
> |     if (l > t )
> |       n++;
> |   }
> |
> |   LOG_EVENT_STOP(msg, thread_id);
> |
> |   return 0;
> | }
>
> This is a very simple test, which is probably not a good representation
> of CPU performance, even more so when emulated by QEMU. In addition to
> that, given it mostly uses 64-bit integers, it's kind of expected that
> the 32-bit version is slower.
>
> Anyway I have extracted this code into a C file (see attached file) that
> can more easily be compiled to 32 or 64 bit using -m32 or -m64. I observe
> the same behavior as with sysbench, even with qemu-user (which is not
> surprising as the above code doesn't really put pressure on the MMU).
>
> Running it I get the following times:
> x86-64 host       0.877s
> sparc guest -m32  1m39s
> sparc guest -m64   3.5s
> opensparc T1 -m32 1m59s
> opensparc T1 -m64 1m12s
>
> So overall QEMU is faster than not-so-old real hardware. That said
> looking at it quickly it seems that some of the FP instructions are
> actually trapped and emulated by the kernel on the opensparc T1.
>
> Now coming back to the QEMU problem, the issue is that the 64-bit code
> is using the udivx instruction to compute the modulo, while the 32-bit
> code calls the __umoddi3 GCC helper.

Actually this looks like a bug/missing feature in gcc. Why doesn't it use the
udivx instruction in "SPARC32PLUS, V8+ Required" code?
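
To make the comparison easy to reproduce, here is a minimal sketch of the hot loop as a standalone function. It is not the file attached earlier in the thread. The name count_primes is made up for illustration, and the sqrt() bound is replaced by the equivalent test l * l <= c so that the sketch does not need libm. The 64-bit modulo is the only operation of interest.

```c
#include <stdio.h>

/*
 * Hypothetical stand-in for the inner loop of sysbench's cpu test.
 * Counts the primes in [3, max_prime) by trial division.
 */
unsigned long long count_primes(unsigned long long max_prime)
{
    unsigned long long c, l, n = 0;

    for (c = 3; c < max_prime; c++) {
        /* l * l <= c is equivalent to l <= floor(sqrt(c)) here */
        for (l = 2; l * l <= c; l++)
            if (c % l == 0)   /* the 64-bit unsigned modulo under discussion */
                break;
        if (l * l > c)        /* no divisor found: c is prime */
            n++;
    }
    return n;
}
```

Calling this from a small main() and building it once with -m32 and once with -m64 should show the difference described above in the disassembly: a udivx in the 64-bit build and a call to __umoddi3 in the 32-bit one, assuming the compiler behaves as reported in this thread.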

> It uses a lot of integer functions
> based on CPU flags, so most of the time is spent computing them in
> helper_compute_psr.

I wonder if this can be optimized. I guess most RISC CPUs would have a
similar problem. Unlike on x86, the compilers usually emit flag-setting
instructions only where the flags are actually used. If there is an
instruction modifying flags in the code, the flags will be used for sure,
so it probably makes little sense to postpone the flag computation?
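
To illustrate the trade-off, here is a rough sketch of deferred ("lazy") flag computation. It is not QEMU's actual code: every name in it (struct cpu, emu_subcc, emu_addcc, compute_flags, the FLAG_* bits) is invented for the example, and it only models 32-bit add and subtract.

```c
#include <stdint.h>

/* Hypothetical flag bits and state; not QEMU's real data structures. */
#define FLAG_C 1u
#define FLAG_V 2u
#define FLAG_Z 4u
#define FLAG_N 8u

enum cc_op { CC_OP_FLAGS, CC_OP_ADD, CC_OP_SUB };

struct cpu {
    uint32_t cc_src, cc_src2, cc_dst; /* operands and result of last flag-setting op */
    enum cc_op cc_op;                 /* which op produced them */
    uint32_t flags;                   /* materialized N/Z/V/C bits */
};

/* Flag-setting ops only record their operands; no flag is computed yet. */
uint32_t emu_addcc(struct cpu *s, uint32_t a, uint32_t b)
{
    s->cc_src = a; s->cc_src2 = b; s->cc_dst = a + b; s->cc_op = CC_OP_ADD;
    return s->cc_dst;
}

uint32_t emu_subcc(struct cpu *s, uint32_t a, uint32_t b)
{
    s->cc_src = a; s->cc_src2 = b; s->cc_dst = a - b; s->cc_op = CC_OP_SUB;
    return s->cc_dst;
}

/* Called when a branch or carry-consuming op actually needs the flags. */
uint32_t compute_flags(struct cpu *s)
{
    uint32_t a = s->cc_src, b = s->cc_src2, r = s->cc_dst, f = 0;

    if (s->cc_op == CC_OP_FLAGS)
        return s->flags;              /* already materialized */
    if (r == 0)
        f |= FLAG_Z;
    if (r & 0x80000000u)
        f |= FLAG_N;
    if (s->cc_op == CC_OP_SUB) {
        if (a < b)
            f |= FLAG_C;              /* borrow */
        if (((a ^ b) & (a ^ r)) & 0x80000000u)
            f |= FLAG_V;              /* signed overflow on subtract */
    } else {
        if (r < a)
            f |= FLAG_C;              /* carry out */
        if ((~(a ^ b) & (a ^ r)) & 0x80000000u)
            f |= FLAG_V;              /* signed overflow on add */
    }
    s->flags = f;
    s->cc_op = CC_OP_FLAGS;
    return f;
}
```

Deferral pays off only when several flag-setting operations run before anything reads the flags. If nearly every emu_addcc or emu_subcc is immediately followed by compute_flags, which is what the argument above suggests for compiler-generated RISC code, the recording step is pure overhead compared with computing the flags on the spot.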

Artyom

-- 
Regards,
Artyom Tarasenko

SPARC and PPC PReP under qemu blog: http://tyom.blogspot.com/search/label/qemu


