How to transpose a 8x8 int64 matrix with AVX512
Consider 8 AVX512 registers containing rows of a matrix, so that each 64-bit lane is a cell of an 8x8 matrix. How to transpose such a matrix in C/C++?
What I have tried so far: 8 _mm512_i32scatter_epi64() intrinsics.
It's
extremely slow and uses the cache heavily so that other CPU cores can't get
enough
L3 bandwidth.
1 Answer
If the values are originally in RAM, try _mm512_i32gather_epi64
instead.
Simultaneous reading from cache could lead to less stalling than writing to it.
If the destination is memory, dummy writing zeroes to a cache-line-aligned destination area could ensure that the instructions don't stall waiting on cache. With the scatter instructions the CPU doesn't know that whole destination area will be overwritten, so it has to wait for loading of the previous values.
If the values are already in registers and the goal is to keep them in registers, _mm512_permutex2var_pd seems like the most useful instruction. It lets to pick individual elements by index from two sources. But because values are distributed across so many registers, shuffling them into place will take a lot of operations.
I'll mark the matrix rows A-H and columns 0-7. Initially each register will
contain
one row, which I'll mark e.g. A0-7. Goal is to have each register
contain one
column, marked e.g. A-H0.
A0 .. A7
.. ..
H0 .. H7
The best sequence I found is still 24 shuffles:
-
Do 8 shuffles on pairs of registers in blocks of 4:
- Combine
A0-7andB0-7toA-B0-3 - Combine
A0-7andB0-7toA-B4-7 - ...
- Combine
G0-7andH0-7toG-H4-7
- Combine
-
Do 8 shuffles in blocks of 2:
- Combine
A-B0-3andC-D0-3toA-D0-1 - Combine
A-B0-3andC-D0-3toA-D2-3 - ...
- Combine
E-F4-7andG-H4-7toE-H6-7
- Combine
-
Do 8 shuffles in blocks of 1:
- Combine
A-D0-1andE-H0-1toA-H0 - Combine
A-D0-1andE-H0-1toA-H1 - ...
- Combine
A-D6-7andE-H6-7toA-H7
- Combine
I think this is the minimum possible number of shuffles when operating only in registers. Each AVX512 instruction takes at most two registers as input, and this sequence takes full advantage of the inputs and outputs to gather data from the 8 source registers.
_mm512_permutex2var_epi64()intrinsic something? Serge Rogatch2 hours ago