What it takes to transpose a matrix
Table of Contents
Introduction
Setting
Naive implementation [3.90]
Read stream
Write stream
Combining read and write streams together
Dealing with cache aliasing
Reversing the order [2.61]
Exploring block structure [1.46]
Software prefetching [1.35]
64-bit SIMD [0.74]
256-bit SIMD [0.49]
Buffering output [0.35]
Conclusion

Andrei Gudkov <gudokk@gmail.com>

Introduction

Classical CPU architecture is a poor choice for performing matrix-oriented computations.