Re: Optionally using more advanced CPU features
From: Ludovic Courtès
Subject: Re: Optionally using more advanced CPU features
Date: Mon, 28 Aug 2017 15:48:00 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/25.2 (gnu/linux)
Hi Dave,
Dave Love <address@hidden> skribis:
> address@hidden (Ludovic Courtès) writes:
[...]
>> To some extent, I think this is a compiler/OS/upstream issue. By that I
>> mean that the best way to achieve use of extra CPU features is by using
>> the “IFUNC” feature of GNU ld.so, which is what libc does (it has
>> variants of strcmp etc. tweaked for various CPU extensions like SSE, and
>> the right one gets picked up at load time.) Software like GMP, Nettle,
>> or MPlayer also does this kind of selection at run time, but using
>> custom mechanisms.
>
> That may be the best way to handle it, but it's not widely available,
> and isn't possible generally (as far as I know), e.g. for Fortran code.
> See also below. This issue surfaced again recently in Fedora.
Right. Do you have examples of Fortran packages in mind?
> In cases that don't dispatch on cpuid (or whatever), I think the
> relevant missing OS/tool support is SIMD-specific hwcaps in the loader.
> Hwcaps seem to be essentially undocumented, but there is, or has been,
> support for instruction set capabilities on some architectures, just not
> x86_64 apparently. (An ancient example was for missing instructions on
> some SPARC systems which greatly affected crypto operations in ssh et
> al.)
But that sounds similar to IFUNC in that application code would need to
actually use hwcap info to select the right implementation at load time,
right?
>> There’s probably scientific software out there that can benefit from
>> using the latest SSE/AVX/whatever extension, and yet doesn’t use any of
>> the tricks above. When we find such a piece of software, I think we
>> should investigate and (1) see whether it actually benefits from those
>> ISA extensions, and (2) see whether it would be feasible to just use
>> ‘target_clones’ or similar on the hot spots.
>
> One example which has been investigated, and you can't, is BLIS. You
(Why “you can’t”?  It’s free software AFAICS on
<https://github.com/flame/blis/tree/master>.)
> need it for vaguely competitive avx512 linear algebra. (OpenBLAS is
> basically fine for previous Intel and AMD SIMD.) See, e.g.,
> <https://github.com/xianyi/OpenBLAS/issues/991#issuecomment-273631173>
> et seq. I don't know if there's any good reason to, but if you want
> ATLAS you have the same issue -- along with extra issues building it.
ATLAS is a problem because it does build-time ISA selection (and maybe
profile-guided optimization?).
> Related, I argue, as on the Fedora list, that libraries like BLAS (and
> LAPACK) should be handled the way they are in Debian, with shared
> libraries built
> compatibly with the reference BLAS. They should be selectable at run
> time, typically according to compute node type by flipping the ld.so
> search path; you should be able to substitute BLIS or a GPU
> implementation for OpenBLAS. That likely applies in other cases, but
> I'm most familiar with the linear algebra ones.
I sympathize with the idea of having several ABI-compatible BLAS
implementations for the reasons you give. That somewhat conflicts with
the idea of reproducibility, but after all we can have our cake and eat
it too: the user can decide to have LD_LIBRARY_PATH point to an
alternate ABI-compatible BLAS, or they can keep using the one that
appears in RUNPATH.
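(Concretely, since ld.so consults LD_LIBRARY_PATH before DT_RUNPATH,
the substitution needs no rebuild; paths below are hypothetical:)

```shell
# Hypothetical paths.  The application's RUNPATH records an OpenBLAS;
# pointing LD_LIBRARY_PATH at an ABI-compatible BLIS build overrides it:
export LD_LIBRARY_PATH=/opt/blis/lib
./my-application        # libblas.so.3 now resolves to the BLIS build

unset LD_LIBRARY_PATH
./my-application        # back to the OpenBLAS recorded in RUNPATH
```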
Thoughts?
Ludo’.