## Forum Replies Created

- AuthorPosts
I think it’s worth to mention that progressive sum (or scan / prefix sum as it’s often called) is a very useful building block for a number of parallel algorithms, often ones that involve mapping each input element into 0 to N outputs, with the output amount varying for each input.

One interesting case is using it to do radix sorting, where you first loop through a large array of integers, counting histograms for each of your radix “bins”, then compute base offsets for each bin using prefix sum, and finally scatter the array elements into their new places using those base offsets.

Having a native instruction (or an IR instruction that is specialized to machine code) to compute native vector-wide prefix sums would be very nice, since longer prefix sums can easily be built in terms of smaller ones.

Of course, prefix sum can always be implemented manually with shuffles, but I could imagine that it might be inconvenient if you don’t know the hardware vector size at compile time (you need log N steps, but you need to know what N is, and I understood the hardware vector size varies between Mill family members).

- AuthorPosts