Example 2: motion estimation using the sum of absolute differences (SAD).
The motion estimation is done by repeatedly comparing a block of pixels in a reference frame with a source block of pixels, e.g. a block of 16×16 pixels. For each reference (‘a’) and source pixel (‘b’) of the block, we compute |a-b|, then sum the values (e.g. 256 values for a 16×16 block), yielding the SAD.
On AVX2, there is a SAD instruction that computes |a-b| between two registers and then sums 8 contiguous values, yielding 1 8-byte value per group of 8-bytes in the register. Without that instruction, we’d have to subtract with saturation as 8-bit (MAX(|a-b|, |b-a|)).
The natural processing is to loop per row. For each loop iteration, load the two rows, compute the SAD “vertically” (i.e. per pixel or pixel group), and sum in an accumulator register. This part of the computation follows the natural alignment and can be done efficiently.
In the last part, we must “fold” the results in a register horizontally into a single value. With X values in the register, there are lg(X) folding operations (e.g. 8 => 4 => 2 => 1). This can done in two ways. We can add the values horizontally within the register (e.g. PHADD on AVX2). Or, we can shuffle to align the values, then sum. On Intel, it is actually faster to shuffle because the PHADD instruction does 2 slow shuffle uops internally.
This type of processing is encountered often. Fast vertical processing, followed by none-too-fast horizontal processing.