I’d like to request you guys add an operation (if you haven’t already).
I want to explain the operation, and then explain why I think it’d be
useful. The operation is smear-sum. It takes a vector of small (8-bit,
maybe 16-bit) integers an produces a running sum, and in addition
produces a second value which is the final sum. So, if given the vector
[ 1, 3, 10, 7, 4, 0, 2, 9 ] it produces the vector [ 1, 4, 14, 21, 25,
25, 27, 36 ] and the value 36. An alternative would smear-sumx, which
produces the sum excluding the current value, so the vector [ 1, 3, 10,
7, 4, 0, 2, 9 ] would return the vector [ 0, 1, 4, 14, 21, 25, 25, 27 ]
and the value 36. NaRs and Nones are propagated, but count as 0 for the
purposes of the sum, so the vector [ 1, 3, 10, NaR, None, 0, 2, 9 ]
returns the vector [ 1, 4, 14, NaR, None, 14, 16, 25 ] and the value 25.

Why this instruction would be useful: a lot of languages now use copying
GC. One of the advantages of copying GC is that the heap is alway
contiguous, so allocation is bumping a pointer. The allocation routine
then looks like this:

extern size_t * heap_top;
extern size_t * heap_limit;

size_t * alloc(size_t num_words) {
register size_t * rval = heap_top;
heap_top += num_words;
if (heap_top > heap_limit) {
/* This is a rarely taken branch- less than 1% of the time */
do_a_gc();
/* do_a_gc() resets heap_top and heap_limit so we can now
* succeed at the allocation.
*/
rval = heap_top;
heap_top += num_words;
}
return rval;
}

Similar pointer-bump allocations also show up in languages without
garbage collection- for example, when adding items to a vector. Where
smear-sum comes into play is when you want to vectorize a loop that
contains a pointer-bump allocation. You’d simply do a smear-sum to
calculate what the offsets of each iteration’s allocation are from the
current top pointer, then do a vector add of the offsets to the current
top pointer to get their real address, and then off you go. A vector
compare and a normal smear catches the (presumptively rare) case where
an allocation would trigger a gc (or vector resize).

The problem with this instruction is, of course, the deep dependency
chain between the adds. You can’t produce the nth value until you know
the (n-1)th value. Even limiting it to small 8-bit adds, I doubt that
this instruction would be a 1-clock “fast” instruction. It’ll probably
be a 2- or 3-clock instruction. And that should be sufficient.