Mill Computing, Inc. Forums The Mill Architecture Metadata Reply To: Metadata

Post count: 5

I’d like to request you guys add an operation (if you haven’t already).
I want to explain the operation, and then explain why I think it’d be
useful. The operation is smear-sum. It takes a vector of small (8-bit,
maybe 16-bit) integers an produces a running sum, and in addition
produces a second value which is the final sum. So, if given the vector
[ 1, 3, 10, 7, 4, 0, 2, 9 ] it produces the vector [ 1, 4, 14, 21, 25,
25, 27, 36 ] and the value 36. An alternative would smear-sumx, which
produces the sum excluding the current value, so the vector [ 1, 3, 10,
7, 4, 0, 2, 9 ] would return the vector [ 0, 1, 4, 14, 21, 25, 25, 27 ]
and the value 36. NaRs and Nones are propagated, but count as 0 for the
purposes of the sum, so the vector [ 1, 3, 10, NaR, None, 0, 2, 9 ]
returns the vector [ 1, 4, 14, NaR, None, 14, 16, 25 ] and the value 25.

Why this instruction would be useful: a lot of languages now use copying
GC. One of the advantages of copying GC is that the heap is alway
contiguous, so allocation is bumping a pointer. The allocation routine
then looks like this:

extern size_t * heap_top;
extern size_t * heap_limit;

size_t * alloc(size_t num_words) {
register size_t * rval = heap_top;
heap_top += num_words;
if (heap_top > heap_limit) {
/* This is a rarely taken branch- less than 1% of the time */
/* do_a_gc() resets heap_top and heap_limit so we can now
* succeed at the allocation.
rval = heap_top;
heap_top += num_words;
return rval;

Similar pointer-bump allocations also show up in languages without
garbage collection- for example, when adding items to a vector. Where
smear-sum comes into play is when you want to vectorize a loop that
contains a pointer-bump allocation. You’d simply do a smear-sum to
calculate what the offsets of each iteration’s allocation are from the
current top pointer, then do a vector add of the offsets to the current
top pointer to get their real address, and then off you go. A vector
compare and a normal smear catches the (presumptively rare) case where
an allocation would trigger a gc (or vector resize).

The problem with this instruction is, of course, the deep dependency
chain between the adds. You can’t produce the nth value until you know
the (n-1)th value. Even limiting it to small 8-bit adds, I doubt that
this instruction would be a 1-clock “fast” instruction. It’ll probably
be a 2- or 3-clock instruction. And that should be sufficient.

You may be, likely are, well ahead of me here, and have already added
the capability to do this. But just in case you haven’t, I thought I’d
put the request in.