Metadata

David

Yes, I agree with the tradeoffs you've chosen. It does have some impedance mismatch with what Lisp would ideally want, but that's already the case in x86 (even the slab allocation pointer is in thread-local storage, not a register), and it's still speed-competitive there.

Like I said, it's an interesting issue to wipe the slate of optimization assumptions clean and look at how the top-level goals for this class of language compilers will be accomplished on a new architecture.

ivan

p.s. after some more thought

The variable returns won't work on the belt, and there are no shared registers or scratchpad. In fact, the only shared thing is memory. So what happens if the results (after the first) are communicated in memory?

If the shared-register approach works on a conventional, then sharing pseudo-registers at static addresses (w/r/t the program load image) should work too, albeit not as clean as something like a Forth double-ended stack. The cache line containing the pseudo-registers (PR hereafter) will always be hot and in the top-level cache. If ~14 (32-bit?) registers is enough, then the 16 32-bit values in a line should also be enough, although a 64-bit Lisp might want two lines. If the reserved addresses were placed at the bottom of the dp space then the address offsets would always encode in one byte for the accessing load/store operations. As the l/s opcode and other info are going to be around 16 bits, it means an access will be 24 or so bits of entropy in the encoding, favorable compared to 32-bit RISC codes and probably comparable to x86 (I am not an x86 expert - how many bytes in a load/store if the offset is one byte?) However, the l/s ops occupy flow slots, retire stations, and d$1 bandwidth. This is not free, compared to returning results on the belt.

The stores, inside the called function, are fire-and-forget. However, there is load latency to pick the values up from the PRs. The code doesn't want to wait for a D$1 cycle before looking at the call result, although I suppose it could overlap the testing for the presence of the extra results, and probably some use of the prime result.

However, there's another way: issue the load of the result before making the call, with the retire delay set so the load will retire in the instruction after the call. That load can be hoisted arbitrarily high, but if LISP overhead guarantees that any call will last longer than d$1 then hoisting is not necessary.

Unfortunately, this doesn't buy anything when there actually is a new result to load. The load will allocate a retire station and go to cache for the value and then wait out the call. At the end of the call the store will be detected by the retire station, which will then re-issue the load request. But as the load is supposed to retire at once, and the second request to the d$1 is not instantaneous, the retire station must stall the machine until it gets the just-stored data, which is the same time that would have been if the load were issued after the call rather than before it :-(

All in all, with the present Mill definition, it looks like extra results would pass in PRs in memory, and not be available until d$1 latency after the call. This works, but is unattractive.

I'll keep the issue in mind and will report if illumination strikes.

Will_Edwards

If the caller knows whether it is interested in more than the first result, Lisps can have a calling convention where they tell the callee when they call, perhaps?

So single interesting return values, presumably the common case, are fast pathed.

And debuggers can patch this as they step through code, or they can just go the route of native debuggers and see what's happening live even if there is only ever one result.

ivan

The Mill call operation knows how many results it expects, but this value is not currently available to the callee. It would be easy enough to stick it in a call-saved specReg with rd() access - would that be useful?

infogulch

If the return types are all homogeneous, could the caller always return a vector of maximum size with None filled in for omitted values?

ivan

Returning a vector would certainly work, and the Nones are very Mill-ish. But smaller members have only 8-byte vectors. It would also cost some code to pack the vector and unpack it again, which might be worse than going via memory. The optimal might be to return one scalar and the vector; in the common case the vector would be all Nones, available as a popCon from rd() cheaply and ignored by the caller. In the uncommon case you'd pay for the pack/unpack. If exactly two results were common then return two scalars for that case and the caller can test whether it has a scalar or a vector second result.

However, because the code depends on the size of a vector then your Lisp becomes member-dependent, which we really want to avoid. :-(

David

Knowing the expectant number of return values via hardware would help, as far as what I know now. It would at least eliminate having to keep an extra count parameter live from the function entry all the way through to where it's used at the very end.

I presume that all return values are in a single return instruction, that you can't stack values up separately in a loop and then have a single shared return? It sounds like it would have to switch/case between "return val", "return val,nil", "return val,nil,nil" etc if it's all set up by the return instruction itself. I'm not sure if there are other better ways of expressing this, but if it's easy to toss in a specReg, it would definitely open up a decent option.

So what happens if the results (after the first) are communicated in memory? If the shared-register approach works on a conventional, then sharing pseudo-registers at static addresses (w/r/t the program load image) should work too

They'd have to be thread-local, or potentially stack-allocated, not statically addressed. But yes, memory would be the default go-to for passing parameters around,.

(I am not an x86 expert – how many bytes in a load/store if the offset is one byte?)

Looking at some disasms on x86-64, all the ones I'm seeing are 5 bytes long. Opinions about the x86 architecture are not likely to change given this information. ;-)

That load can be hoisted arbitrarily high, but if LISP overhead guarantees that any call will last longer than d$1 then hoisting is not necessary.

In existing implementations, the only Lisp overhead in returns is the callee either clearing carry, or setting it & the count of returned values. The caller has no overhead before accessing at least the first 3 returned values.

Now, I haven't even gotten into passing parameters _into_ a call, which is much more complex. ;-) There are optional parameters, order-independent optional keyword parameters, raising the trailing set of parameters into a list object, referencing the entire parameter list as well as its parts, etc. There is an 'apply' function for dynamic buildup or pass-through of parameter lists. Current implementations set a register to list the count of parameters, similar to multi-value returns. A lot of the same issues all come up here, too, and having a specReg for the count of incoming parameters would avert some overhead as well. At least there are no older values on the function's belt in this case.

David

Return types are rarely if ever homogeneous.

ivan

Return takes the same belt-argument list that call (and a few other operations line conform and rescue) do. You can't build it up piecemeal. The arguments don't have to be adjacent on the belt, and the same belt position can be used for more than argument. There's no "nil" argument to return (or call etc.); you have to put a nil (whatever that is in the app) on the belt and then pass it as many times as you want.

Thread-local is possible. It is not in general possible for a callee to address the caller's stack without action on the part of the caller, which would be more trouble than TLS.

Sounds like a null function would be in and out before a d$1 latency, arguing again for putting the loads after the call rather than in-flight over the call.

Varargs calls in the belt are a problem; by the time you have figured out how many arguments you have they may well have fallen off the belt. In C only the fixed arguments pass in the belt and the variable part is passed in the stack, but C VARARGS is rare enough that it doesn't matter. Sounds like every call in Lisp is facing the issue, which would matter.

I think the in-memory solution, while it might be better than other CPUs, is fundamentally inapt for the Mill function model. Needs thought.

infogulch

Well then that idea is out.

How big are lisp objects, anyways? If they're too big you might have to deal with memory anyways. E.g. ruby is moving from 5 word objects to 6 or 8 word objects (because they fit more evenly into cache lines). Yes, you read that correctly, that's 8 words of 8 bytes each or 64 bytes per object, which pushes the total belt size on small mills already, let alone returning half a dozen of them.

David

Joe: Lisp heap objects are not of fixed size. Pointers to objects contain tag bits, with some of the common types completely contained within in the tag. So cons cells (the basic 2-tuple linked list element) are literally 2 bare words in memory with zero overhead. The fact that it exists as a cell is purely held in the tag bits of pointers reaching that memory location, not on the object itself. More complicated objects obviously have some actualized overhead, generally one word. Of course, objects themselves have no inherent mutexes, hashes, etc, that languages like Java do. Those would be user-managed fields.

Ivan: While this has been focused on Lisp, I think it is a reasonable concern in today's market to ensure that Javascript, Python, and Java can achieve great performance as well, without hackish workarounds inside their JITs. Javascript and Python especially have a lot of Lisp-like features, with lots of similarities on their input parameter passing. Inline slab allocators, multiple stacks, etc, are common on all these types of systems. Having the hardware track dirty old-generation writes for the garbage collector without needing to trap & interrupt can save a ton of time.

Just something to toss in with the "needs thought" pile. Some of these likely do not affect the core functionality, but might be orthogonal add-ons of varying intensity.

David

Sorry, I've been using the wrong term. Instead of "slab allocator", I've been actually meaning "tlab allocator". Nursery allocation with simple inline pointer bumping, each thread having its own preallocated buffer. Many systems try to keep such a pointer in a global register.

ivan

Transcribed from comp.arch
===================================================================

On 1/18/2014 12:19 PM, Stephen Fuld wrote:
> On 1/7/2014 10:37 AM, Stephen Fuld wrote:
>> I watched the talk, and after some time thinking about it, I have a
>> few questions.
>
>
> Another one.
>
>
> If this would be better after the talks on software pipelining and or
> "phasing", just say so.
>
> I have been thinking about your loop parallelization mechanism. It
> seems from what you have presented so far, the degree of
> parallelization possible is the number of elements in a vector for
> the particular version of the Mill you are running on. But since you
> have so many FUs available, is there a way to "link" multiple vectors
> together in order to gain increased parallelism? e.g. if you could
> somehow process two vectors worth of bytes in parallel, (with the
> associated controls to prevent stores on the second vector), you
> would double the speed of the strcopy example you presented. I am
> not sure what I am asking for here but I could see some possible ways
> to do it.
>
>
>
>

If there are enough FUs then the compiler can unroll the vectorized loop sufficiently to soak them up. In the strcpy example, the ops would simply be issued twice, plus you need a bit extra to compute the None mask for the second store and for the branch condition. Roughly:

   load1

   load2

   eql1

   eql2

   smear1

   smear2

   pick1

   pick2a(done1, smear2_mask, Nones)

   pick2b(pick2a, load2, Nones)

   store1

   store2

   or(smear1_done, smear2_done)

   branch

The extras are straightforward. The extra "or" is oring the "are we done" results of the two smears so that the loop exits if either thought it was finished.

The pick2a and pick2b build a mask for the second store. Pick2a uses the "are we done" bool from the first smear to produce either the mask from smear2 or all None. That is then used as the control to mask None into the loaded value; if there was a null in the first load then the pick2b will have an all-None control and will yield all None, whereas if there was no null then pick2b has the usual smear mask for control and works like in the talk.

The result is that the unrolled loop is twice as long plus two more ops. It could be fully pipelined for a throughput of 2x vector size per cycle. The pipeline startup time to first store would be the same as for the not-unrolled loop; load->eql->smear->pick->store. Unrolling needs five flow slots (2x(load, store), branch); three exu slots (2xeql, or); and three writer slots (3xpick). Four load/store function units means a pretty high-end Mill.

Some Fortran compilers can do this kind of unrolled vectorization, so I don't see any reason why a Mill compiler could not find this code. Not high priority work though because of the limited utility in medium and smaller family members.

Symmetry

How is the metadata perserved when the belts are spilled to memory? Is there generally an extra byte for each belt entry or is there some sort of coelescing that happens?

Also, is it just me or do the values on the belt with their metadata function as monads? In fact, it looks just like Maybe in Haskell except instead of being
Maybe a = Just a | Nothing instead it's Belt a = Just a | Nothing | Error and binding gives Error precedence over Nothing. When I was watching the talk I thought to myself "Oh, it's just like Haskell" in the same way the Belt talk made me think "Oh, its just like Static Single Assignment form"

Will_Edwards

The scratch and spill preserves metadata.

They are dealing with belt items, and not naked bytes, so just take the extra bits needed to maintain all this item state.

The belt width is model specific, but completely known to the hardware and any software that interacts with it, obviously, so its easy to take care of.

And yes, IMO these parallels with SSA and monads are appropriate :)

ivan

"Nothing new under the sun"? That what you get with a compiler guy doing architecture :-)

Will's reply is correct. The actual memory format is member-specific (software gets an API) and will be the same as is used by extended-scratchpad. We need to have a scratchpad talk :-(

Will_Edwards

Maybe a = Just a | Nothing instead it’s Belt a = Just a | Nothing | Error and binding gives Error precedence over Nothing

Small clarification, Ivan to correct me if I'm wrong, but my understanding is that None has precedence over NaR; a NaR and None is None.

ivan

Actually NaR and None together are impossible. NaRs carry a "kind" indicating the nature of the initial problem, and a None is encoded as a flavor of NaR. That way the None/NaR propagation logic in the speculable FUs doesn't have to test, and only the non-speculable ops (like store) need to care. As often happens, at the program level these are different but at the machine level they are blended to optimize the hardware.

Will_Edwards

Yes, I can see how None is a kind of not a result, but what I think we meant by None taking precedence over NaR was None taking precedence over other kinds of NaR.

If two different NaR kinds are operands to an arithmetic operation, what is the output NaR type?

For example if you multiply the two vectors:

2NaRNaRNone and
NoneNone2NaR, do you get:
NoneNoneNaRNone ?

bhurt

I'd like to request you guys add an operation (if you haven't already).
I want to explain the operation, and then explain why I think it'd be
useful. The operation is smear-sum. It takes a vector of small (8-bit,
maybe 16-bit) integers an produces a running sum, and in addition
produces a second value which is the final sum. So, if given the vector
[ 1, 3, 10, 7, 4, 0, 2, 9 ] it produces the vector [ 1, 4, 14, 21, 25,
25, 27, 36 ] and the value 36. An alternative would smear-sumx, which
produces the sum excluding the current value, so the vector [ 1, 3, 10,
7, 4, 0, 2, 9 ] would return the vector [ 0, 1, 4, 14, 21, 25, 25, 27 ]
and the value 36. NaRs and Nones are propagated, but count as 0 for the
purposes of the sum, so the vector [ 1, 3, 10, NaR, None, 0, 2, 9 ]
returns the vector [ 1, 4, 14, NaR, None, 14, 16, 25 ] and the value 25.

Why this instruction would be useful: a lot of languages now use copying
GC. One of the advantages of copying GC is that the heap is alway
contiguous, so allocation is bumping a pointer. The allocation routine
then looks like this:

extern size_t * heap_top;
extern size_t * heap_limit;

size_t * alloc(size_t num_words) {
register size_t * rval = heap_top;
heap_top += num_words;
if (heap_top > heap_limit) {
/* This is a rarely taken branch- less than 1% of the time */
do_a_gc();
/* do_a_gc() resets heap_top and heap_limit so we can now
* succeed at the allocation.
*/
rval = heap_top;
heap_top += num_words;
}
return rval;
}

Similar pointer-bump allocations also show up in languages without
garbage collection- for example, when adding items to a vector. Where
smear-sum comes into play is when you want to vectorize a loop that
contains a pointer-bump allocation. You'd simply do a smear-sum to
calculate what the offsets of each iteration's allocation are from the
current top pointer, then do a vector add of the offsets to the current
top pointer to get their real address, and then off you go. A vector
compare and a normal smear catches the (presumptively rare) case where
an allocation would trigger a gc (or vector resize).

The problem with this instruction is, of course, the deep dependency
chain between the adds. You can't produce the nth value until you know
the (n-1)th value. Even limiting it to small 8-bit adds, I doubt that
this instruction would be a 1-clock "fast" instruction. It'll probably
be a 2- or 3-clock instruction. And that should be sufficient.

You may be, likely are, well ahead of me here, and have already added
the capability to do this. But just in case you haven't, I thought I'd
put the request in.

« Previous Page Next Page »