Re: PyTorch with ROCm
From: David Elsing
Subject: Re: PyTorch with ROCm
Date: Wed, 03 Apr 2024 22:21:43 +0000
Hello,
Ludovic Courtès <ludovic.courtes@inria.fr> writes:
> Yeah, we could think about a transformation option. Maybe
> ‘--with-configure-flags=python-pytorch=-DAMDGPU_TARGETS=xyz’ would work,
> and if not, we can come up with a specific transformation and/or a
> procedure that takes a list of architectures and returns a package.
I think that would work for python-pytorch itself, but it would need to
be set for all ROCm dependencies as well. It would be good to make sure
that the targets for a package are a subset of the intersection of the
targets specified for its dependencies.
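The check could look roughly like the following. This is only a sketch in Python rather than Guile, not existing Guix code; the function names, the package names, and the target lists are hypothetical.

```python
def allowed_targets(dependency_targets):
    """Return the intersection of the GPU targets of all dependencies.

    `dependency_targets` maps a dependency name to the set of targets it
    was built for.  Returns None when there are no dependencies, meaning
    there is no constraint.
    """
    target_sets = [set(targets) for targets in dependency_targets.values()]
    if not target_sets:
        return None
    return set.intersection(*target_sets)


def check_targets(package_targets, dependency_targets):
    """Raise ValueError if some package target is missing from a dependency."""
    allowed = allowed_targets(dependency_targets)
    if allowed is None:
        return
    missing = set(package_targets) - allowed
    if missing:
        raise ValueError(
            f"targets not built by all dependencies: {sorted(missing)}")


# Hypothetical example data, for illustration only.
deps = {
    "rocblas": {"gfx900", "gfx90a", "gfx1030"},
    "miopen": {"gfx90a", "gfx1030"},
}
check_targets({"gfx90a"}, deps)  # fine: gfx90a is built by both dependencies
```

Requesting gfx900 with these example dependencies would raise an error, because the hypothetical miopen entry was not built for it.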
>>>> - Many tests assume a GPU to be present, so they need to be disabled.
>>>
>>> Yes. I/we’d like to eventually support that. (There’d need to be some
>>> annotation in derivations or packages specifying what hardware is
>>> required, and ‘cuirass remote-worker’, ‘guix offload’, etc. would need
>>> to honor that.)
>>
>> That sounds like a good idea, could this also include CPU ISA
>> extensions, such as AVX2 and AVX-512?
>
> That’d be great, yes. Don’t hold your breath though as I/we haven’t
> scheduled work on this yet. If you’re interested in working on it, we
> can discuss it of course.
I am definitely interested, but am not familiar with Cuirass. Would
this also require support by the build daemon to determine which
hardware is available?
>> I think the issue is simply that elf-file? just checks the magic bytes
>> and has-elf-header? checks for the entire header. If the former returns
>> #t and the latter #f, an error is raised by parse-elf in guix/elf.scm.
>> It seems some ROCm (or tensile?) ELF files have another header format.
>
> Uh, never came across such a situation. What’s so special about those
> ELF files? How are they created?
After checking again, I noticed that the error actually only occurs for
rocblas. :)
Here, the problematic ELF files are generated by Tensile [1], and are
installed in lib/rocblas/library (by library/src/CMakeLists.txt, which
calls a CMake function from the Tensile package). They are shared object
libraries for the GPU architecture(s) [2]. Tensile uses
`clang-offload-bundler` (from rocm-toolchain) to create the files, and
it seems to me that the "ELF" header comes from there, but I don't know
why it is special.
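To illustrate the kind of mismatch described above, where a file passes a magic-byte check but fails a stricter header check, here is a minimal Python sketch. It is not a port of elf-file? or has-elf-header?. The fields validated by the stricter function are an assumption based on the generic ELF identification bytes, and whether any of them is what actually differs in the Tensile files is not established here.

```python
ELF_MAGIC = b"\x7fELF"
EI_NIDENT = 16  # size of the e_ident array at the start of an ELF file


def has_elf_magic(data: bytes) -> bool:
    """Loose check: only the first four bytes are compared."""
    return data[:4] == ELF_MAGIC


def has_plausible_ident(data: bytes) -> bool:
    """Stricter check (assumed fields): also validate class, data, version."""
    if len(data) < EI_NIDENT or not has_elf_magic(data):
        return False
    ei_class, ei_data, ei_version = data[4], data[5], data[6]
    return (ei_class in (1, 2)      # ELFCLASS32 or ELFCLASS64
            and ei_data in (1, 2)   # little- or big-endian
            and ei_version == 1)    # EV_CURRENT


# Synthetic 16-byte identification blocks, not real files.
ordinary = ELF_MAGIC + bytes([2, 1, 1]) + bytes(9)
unusual = ELF_MAGIC + bytes([2, 1, 0]) + bytes(9)  # unexpected version byte

print(has_elf_magic(ordinary), has_plausible_ident(ordinary))
print(has_elf_magic(unusual), has_plausible_ident(unusual))
```

A file like the second one would be accepted by the loose check and rejected by the strict one, which is the combination that leads to the parse-elf error.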
Thanks,
David
[1] https://github.com/ROCm/Tensile/blob/17df881bde80fc20f997dfb290f4bb4b0e05a7e9/Tensile/TensileCreateLibrary.py#L283
[2] https://github.com/ROCm/Tensile/wiki/TensileCreateLibrary#code-object-libraries