I’m not sure that’s exactly what we want. To get maximum performance, we should probably use this algorithm:
1: for d = 1 to log2 n do 2: for all k in parallel do 3: if k >= 2^d then 4: x[k] = x[k – 2^(d-1)] + x[k] (from NVIDIA'as GPU Gems page)
Alternate would be useful for d = 3, but the other steps require a shuffle.
By the way, I would be personally excited for faster cumulative sum because it is useful for implementing fast gaussian blurs.