octave-maintainers

Re: JIT test crash


From: Max Brister
Subject: Re: JIT test crash
Date: Fri, 3 Aug 2012 14:28:08 -0500

On Fri, Aug 3, 2012 at 1:30 PM, Daniel J Sebald <address@hidden> wrote:
> On 08/03/2012 01:13 PM, Max Brister wrote:
>>
>> On Fri, Aug 3, 2012 at 1:05 PM, Michael Goffioul
>> <address@hidden>  wrote:
>>>
>>> On Thu, Aug 2, 2012 at 8:54 PM, Michael Goffioul
>>> <address@hidden>  wrote:
>>>>
>>>>
>>>> On Thu, Aug 2, 2012 at 6:42 PM, Max Brister<address@hidden>  wrote:
>>>>>
>>>>>
>>>>> On Thu, Aug 2, 2012 at 8:36 AM, Michael Goffioul
>>>>> <address@hidden>  wrote:
>>>>>>
>>>>>> On Thu, Aug 2, 2012 at 1:57 PM, Max Brister<address@hidden>  wrote:
>>>>>>>
>>>>>>> [snip]
>>>>>
>>>>>
>>>>>>>
>>>>>>> The output with OCTAVE_JIT_DEBUG looks correct to me.
>>>>>>>
>>>>>>> I have attached the patch for llvm 3.1.
>>>>>>
>>>>>>
>>>>>>
>>>>>> I applied it, but it didn't change anything (the generated assembly
>>>>>> looks exactly the same). If I'm reading this [1] correctly (X86Subtarget
>>>>>> constructor), the stack alignment was already set to 4 anyway. And in
>>>>>> [2], in the X86_32TargetMachine constructor, the native stack alignment
>>>>>> is also specified as 4 bytes (trailing "-S32" at line 45).
>>>>>>
>>>>>> Michael.
>>>>>>
>>>>>> [1]
>>>>>>
>>>>>>
>>>>>> https://github.com/earl/llvm-mirror/blob/master/lib/Target/X86/X86Subtarget.cpp
>>>>>> [2]
>>>>>>
>>>>>>
>>>>>> https://github.com/earl/llvm-mirror/blob/master/lib/Target/X86/X86TargetMachine.cpp
>>>>>
>>>>>
>>>>> Actually, that makes sense. In order to use the SSE instructions, I
>>>>> think we really want the stack to be 16-byte aligned. Can you try
>>>>> changing the stack alignment to 16 bytes instead of 4?
>>>>
>>>>
>>>>
>>>> No luck. I've modified your patch to read:
>>>>
>>>> opts.StackAlignmentOverride = 16
>>>>
>>>> For your information, I've attached the generated assembly for the
>>>> 4-byte and 16-byte cases. The code still crashes, but at an earlier
>>>> location. Now it crashes at the MOVAPD instruction (address 02D300BC).
>>>> If you compare with the 4-byte case, the latter uses MOVUPD instead, so
>>>> it doesn't crash there. Also, if you compare the two files, you see that
>>>> in the 16-byte case all stack offsets are multiples of 16 bytes, but I
>>>> don't see any code to realign the stack on a 16-byte boundary.
>>>>
>>>> The bottom line is: within the generated code, the stack is kept aligned
>>>> on 16 bytes, but as there's no forced realignment, it entirely depends
>>>> on the stack alignment on function entry.
>>>
>>>
>>>
>>> Any update, ideas or suggestions?
>>>
>>> Michael.
>>>
>>
>> Michael,
>>
>> This definitely looks like a bug in LLVM to me. I'll bring it up with
>> the LLVM people. In the meantime, I'm thinking of not using the SSE
>> instructions for complex operations. I'm not sure how much benefit
>> there is, considering complex numbers only have two values.
>
>
> SSE works in groups of four, if I remember correctly.  As an alternative to
> vectorizing within a single complex number, another way to use SSE might be
> parallel operations such as vector/matrix operations.  For example, say two
> 9x1 complex vectors are multiplied.  It would be
>
> 4 real x real mult
> 4 imag x imag mult
> 4 real add
> 4 real sub
> 4 real x imag mult
> 4 imag x real mult
> 4 imag add
> 4 imag add
> -
> 4 real x real mult
> 4 imag x imag mult
> 4 real add
> 4 real sub
> 4 real x imag mult
> 4 imag x real mult
> 4 imag add
> 4 imag add
> -
> 1 real x real mult (3 bogus)
> 1 imag x imag mult (3 bogus)
> 1 real add (3 bogus)
> 1 real sub (3 bogus)
> 1 real x imag mult (3 bogus)
> 1 imag x real mult (3 bogus)
> 1 imag add (3 bogus)
> 1 imag add (3 bogus)
>
> That would speed things up by a factor of four when it is really needed,
> e.g., large matrix multiplies.
>
> Dan

Actually, I think SSE2 only supports 2 doubles per register, as the xmm<n>
registers are only 128 bits wide [1]. AVX adds support for 256-bit registers
[2]. (I'm by no means an expert on SSE and AVX, though.)
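
To make the register-width point concrete, here is a minimal C++ sketch of a
single complex multiply in which each operand fills exactly one 128-bit
register (two doubles). It is an illustration only, not the code the JIT or
LLVM generates; the names complex_mul and complex_mul_demo are hypothetical,
and it assumes a compiler that defines __SSE2__ when SSE2 is available
(otherwise it falls back to scalar code).

```cpp
#include <cassert>
#include <complex>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Multiply two complex doubles. With SSE2 each operand fills one xmm
// register: low lane = real part, high lane = imaginary part.
inline std::complex<double> complex_mul(const std::complex<double>& x,
                                        const std::complex<double>& y) {
#if defined(__SSE2__)
    // Unaligned loads (MOVUPD), so no alignment assumption is made.
    const __m128d vx = _mm_loadu_pd(reinterpret_cast<const double*>(&x)); // [a, b]
    const __m128d vy = _mm_loadu_pd(reinterpret_cast<const double*>(&y)); // [c, d]
    const __m128d cc = _mm_unpacklo_pd(vy, vy);    // [c, c]
    const __m128d dd = _mm_unpackhi_pd(vy, vy);    // [d, d]
    const __m128d ba = _mm_shuffle_pd(vx, vx, 1);  // [b, a]
    const __m128d t1 = _mm_mul_pd(vx, cc);         // [a*c, b*c]
    __m128d t2 = _mm_mul_pd(ba, dd);               // [b*d, a*d]
    // Flip the sign of the low lane only: [-b*d, a*d].
    t2 = _mm_xor_pd(t2, _mm_set_pd(0.0, -0.0));
    double out[2];
    _mm_storeu_pd(out, _mm_add_pd(t1, t2));        // [a*c - b*d, b*c + a*d]
    return std::complex<double>(out[0], out[1]);
#else
    return x * y;  // scalar fallback when SSE2 is not available
#endif
}

// Usage check: (1 + 2i)(3 + 4i) = -5 + 10i.
inline void complex_mul_demo() {
    const std::complex<double> x(1.0, 2.0);
    const std::complex<double> y(3.0, 4.0);
    assert(complex_mul(x, y) == std::complex<double>(-5.0, 10.0));
}
```

Note that the two lanes are already used up by one complex value, so any
four-at-a-time scheme would need wider registers or single precision.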

The problem we are running into is that LLVM seems to be generating
incorrect code on a 32-bit Windows platform when using vectorized
instructions. I would rather err on the side of being a little
slower if it means generating correct code on all platforms.
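
As a rough illustration of the alignment issue discussed in this thread, the
C++ sketch below checks whether an address is 16-byte aligned and only then
uses the aligned load (which maps to MOVAPD and faults on a misaligned
address); otherwise it uses the unaligned load (MOVUPD). This is a sketch
under stated assumptions, not what LLVM emits: is_aligned, DoublePair and
load_pair are hypothetical names, and __SSE2__ is assumed to be defined when
SSE2 is available.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// True when p sits on an n-byte boundary (n must be a power of two).
inline bool is_aligned(const void* p, std::size_t n) {
    assert(n != 0 && (n & (n - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (n - 1)) == 0;
}

struct DoublePair { double lo, hi; };

// Load two adjacent doubles, choosing the load that is safe for p.
inline DoublePair load_pair(const double* p) {
#if defined(__SSE2__)
    // _mm_load_pd requires 16-byte alignment; _mm_loadu_pd does not.
    const __m128d v = is_aligned(p, 16) ? _mm_load_pd(p) : _mm_loadu_pd(p);
    alignas(16) double out[2];
    _mm_store_pd(out, v);  // out is explicitly aligned, so this is safe
    return DoublePair{out[0], out[1]};
#else
    return DoublePair{p[0], p[1]};  // scalar fallback
#endif
}
```

The runtime check costs a little, which is the kind of trade-off I mean:
slightly slower, but it does not depend on the alignment at function entry.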

[1] http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions#Registers
[2] http://en.wikipedia.org/wiki/Advanced_Vector_Extensions

Max Brister

