For some reason, with this, clinfo just segfaults on my Artix install. Luckily, I was able to get the LD_DEBUG output (https://sperg.funny.cl/cdn/lddebug.txt), but it still segfaults even then. Seems like it's loading all the correct libraries. ICYW, here's the lddebug.txt from a chroot into my Arch install: https://sperg.funny.cl/cdn/arch_lddebug.txt (could boot into it directly and get it but cba rn)
The error fish gives specifically is:
fish: “LD_DEBUG=all clinfo 2> lddebug.…” terminated by signal SIGSEGV (Address boundary error)
So yeah, I don't know what's going on. However, one thing I'm particularly worried might be the problem is that nothing in the rocm-device-libs archive gets installed. I tried installing it to the corresponding directory myself, but that didn't seem to work, which means it's probably something else, or something with my Artix install. But I've got all the necessary stuff installed: numactl, ocl-icd, etc.
Also @gsus those are warnings, not errors. I also get those for all of the libamd* libraries, on both my Artix and Arch installs, so you don't need to set the execution bit. In fact, you usually shouldn't do that for a library, and it wouldn't really do much since it's basically just a bunch of symbols and function definitions.
edit: The last line of the lddebug.txt says this: binding file /usr/lib/libamdocl64.so [0] to /usr/lib/libstdc++.so.6 [0]: normal symbol '_ZNSt8ios_baseD2Ev'
I can't tell if this specifically is crashing it, but it might be. Looks like it's trying to link/bind ios_base or something.
edit2: dmesg log says this: [ 5148.574644] clinfo[24744]: segfault at 8 ip 00007f33d50462b4 sp 00007fffa86b0f90 error 4 in libamdocl64.so[7f33d4faa000+db000]
So the error is in libamdocl64.so. Maaaayyybeee a missing library? I'd at least expect that to say "failed to open shared object" or something.
Directly after this message, it also says: [ 5321.488236] Code: 3b 44 24 08 73 08 89 5c 24 0c 89 44 24 08 8d 43 01 48 8b 55 00 48 89 c3 48 3b 04 24 72 a8 8b 44 24 0c 48 8d 04 40 48 8d 14 c2 <48> 8b 42 08 48 8d 0d 81 f5 07 00 4c 8b 0a 49 89 87 f0 05 00 00 48
which I don't understand in the slightest. Clearly it's some hex, maybe the <48> is like a bad instruction or byte or something, since it's the only one in <>? But there ARE other 48's there, so I've got no clue...
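For what it's worth, as far as I know the kernel wraps the byte sitting at the faulting instruction pointer in <>, so the marker shows where in the dump the crash happened rather than flagging a "bad" byte. A minimal Python sketch to find how far into the dump that byte is (find_marked_byte is just a made-up helper name, not a real tool):

```python
# Hex dump copied verbatim from the dmesg "Code:" line above.
code_line = (
    "3b 44 24 08 73 08 89 5c 24 0c 89 44 24 08 8d 43 01 48 8b 55 00 "
    "48 89 c3 48 3b 04 24 72 a8 8b 44 24 0c 48 8d 04 40 48 8d 14 c2 "
    "<48> 8b 42 08 48 8d 0d 81 f5 07 00 4c 8b 0a 49 89 87 f0 05 00 00 48"
)

def find_marked_byte(dump):
    """Return (index, value) of the byte wrapped in <> in a Code: dump."""
    for i, tok in enumerate(dump.split()):
        if tok.startswith("<") and tok.endswith(">"):
            return i, int(tok[1:-1], 16)
    raise ValueError("no <..> marker found")

index, value = find_marked_byte(code_line)
print(f"marked byte 0x{value:02x} is {index} (0x{index:x}) bytes into the dump")
```

So the dump would start that many bytes before the instruction that faulted, which helps when searching for the same byte sequence inside the library.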
edit3: That last message is some hex in the libamdocl64 library. Specifically, it starts at offset 0x0B128A, and the <48> thing is at offset 0x0B12B4. That's the only place it occurs. So yeah, I think it's the area where it's segfaulting. Later, I might try to grab the assembly and see what's there? Maybe that'll help? Who knows
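Another way to cross-check the location is to subtract the mapping base from the ip in the dmesg segfault line. A rough Python sketch follows; the regex is my own guess at the line format based on the one line quoted above, not an official parser:

```python
import re

# The segfault line from dmesg, copied from edit2.
line = ("clinfo[24744]: segfault at 8 ip 00007f33d50462b4 sp 00007fffa86b0f90 "
        "error 4 in libamdocl64.so[7f33d4faa000+db000]")

m = re.search(r"ip ([0-9a-f]+) .* in (\S+)\[([0-9a-f]+)\+([0-9a-f]+)\]", line)
ip = int(m.group(1), 16)
lib = m.group(2)
base = int(m.group(3), 16)   # start address of the mapped region
size = int(m.group(4), 16)   # length of the mapped region

# Offset of the faulting instruction relative to the start of the mapping.
offset_in_mapping = ip - base
assert 0 <= offset_in_mapping < size

# Difference from the file offset found by searching the library for the bytes.
delta = 0x0B12B4 - offset_in_mapping

print(f"{lib}: ip is at mapping offset {offset_in_mapping:#x}")
print(f"difference from file offset 0xb12b4: {delta:#x}")
```

The result is relative to the mapped region, not necessarily to the start of the file. The difference it prints is page-aligned, which would be consistent with the executable segment being mapped from a non-zero file offset, but that part is an assumption; comparing against the program headers from readelf -l would be the way to confirm it.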
edit4: Here's the assembly file: http://sperg.funny.cl/cdn/libamdocl64.asm
Also, here's a snippet from that for 0x0B128A to 0x0B12B4:
b128a: 3b 44 24 08 cmp 0x8(%rsp),%eax
b128e: 73 08 jae b1298 <clCreateContextFromType@@OPENCL_1.0+0x3ac28>
b1290: 89 5c 24 0c mov %ebx,0xc(%rsp)
b1294: 89 44 24 08 mov %eax,0x8(%rsp)
b1298: 8d 43 01 lea 0x1(%rbx),%eax
b129b: 48 8b 55 00 mov 0x0(%rbp),%rdx
b129f: 48 89 c3 mov %rax,%rbx
b12a2: 48 3b 04 24 cmp (%rsp),%rax
b12a6: 72 a8 jb b1250 <clCreateContextFromType@@OPENCL_1.0+0x3abe0>
b12a8: 8b 44 24 0c mov 0xc(%rsp),%eax
b12ac: 48 8d 04 40 lea (%rax,%rax,2),%rax
b12b0: 48 8d 14 c2 lea (%rdx,%rax,8),%rdx
b12b4: 48 8b 42 08 mov 0x8(%rdx),%rax
meaning that if my "theory" is correct, mov 0x8(%rdx),%rax is what's segfaulting. Fish mentioned address boundaries, so it might be trying to read a value from an unallocated RAM address or something, or one it can't access?
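The "segfault at 8" and "error 4" parts of the dmesg line can narrow this down a bit. On x86 the page fault error code is a bit field: bit 0 set means the page was present (a protection violation), bit 1 set means a write, bit 2 set means the access came from user mode. A small Python sketch of the decoding (decode_pf_error is a hypothetical helper name, and only the low three bits are handled):

```python
def decode_pf_error(code):
    """Decode the low three bits of an x86 page fault error code."""
    return {
        "page_present": bool(code & 0x1),  # 0 means the page was not mapped
        "write": bool(code & 0x2),         # 0 means the access was a read
        "user_mode": bool(code & 0x4),     # 1 means the fault came from user space
    }

info = decode_pf_error(4)   # "error 4" from the dmesg line
fault_addr = 0x8            # "segfault at 8"
displacement = 0x8          # the 0x8 in mov 0x8(%rdx),%rax

# If that mov is the faulting instruction, %rdx + 0x8 must equal the fault address.
implied_rdx = fault_addr - displacement

print(info)
print(f"implied %rdx: {implied_rdx:#x}")
```

If this reading is right, it was a user-mode read from an unmapped page at address 8, which would fit mov 0x8(%rdx),%rax running with %rdx equal to 0, i.e. a NULL pointer being dereferenced rather than a missing library. That is an inference from the log, not something confirmed.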
Pinned Comments
nho1ix commented on 2023-12-29 08:43 (UTC) (edited on 2024-02-10 07:13 (UTC) by nho1ix)
Note for anyone who has a Polaris GPU (Radeon RX 5xx) debugging issues with this package: packages that use OpenCL, like clinfo or davinci-resolve-studio, will need you to downgrade opencl-amd to 1:5.7.1-1 as well as amdgpu-pro-oglp to 23.10_1620044-1 to avoid coredumps & segfaults.
DVR (DaVinci Resolve) would not open unless these 2 packages were downgraded (along with their dependencies). Had to figure it out the hard way after hours of using valgrind and rebooting over and over. Hopefully someone else will not have to pull their hair out trying to resolve their issue.