All vector examples come from the algorithms in an H.265 video encoder using the AVX2 instruction set. I’m using several posts because there’s a lot of text.
Example 1: intra prediction on a block of 4×4 8-bit pixels.
ijklmnopq h.... g.... f.... e.... d c b a
The pixels shown as ‘.’ are predicted from 17 neighbor pixels ‘a-q’ and a projection angle. There are 33 projection angles defined, ranging from bottom-left to top-right.
Opereation 1: linearize the neighbors.
Given the chosen prediction angle, linearize the neighbors for the projection. For example, if the pixels are projected toward the bottom and a little to the right, the neighbor pixels could be linearized as follow.
abcdefghijklmnopq => fhijklm
This operation is best done by a shuffle operation. Load the neighbors in a register, compute the offsets in another register (per byte), and reorder (PSHUFB).
Operation 2: filter the neighbors.
For some angles, the neighbors are filtered to smooth the projection. Essentially, for 3 contiguous neighbors ‘abc’, the operation is
Note that the pixels are temporarily expanded to 16-bit during this computation to avoid overflows.
This can be done by shuffling the neighbors in a register to align them, or by loading the neighbors from memory at an offset of +-1.
Operation 3: project the neighbors.
For each pixel of the 4×4 block, compute the offset(s) of the corresponding neighbor pixel(s). If the projection does not align cleanly on a neighbor position, the two closest pixels are weighted with two i/32, (32-i)/32 fractions as 16-bit.
The projection algorithm depends on the angle direction. If the direction is mostly vertical, then the neighbors can be shuffled within a register or loaded as a group from memory, with an offset specific to each row. Shuffling is more efficient and avoids failed store forwarding issues.
If the direction is horizontal, one way to do it is to pretend the angle is vertical, then transpose the matrix at the end (I discuss this in the other examples). This can also be done more efficiently by shuffling at each row.
In practice, the pixel blocks range from 4×4 to 32×32 pixels. The optimal algorithm differs considerably by block size and depends on the vector register size since that determines the effective shuffle range.