Mill Computing, Inc. Forums › The Mill Architecture › Pipelining › Reply To: Pipelining

Ivan Godard
Keymaster
Post count: 689

This reply addresses LarryP’s question about “baked-in” (as opposed to member-dependent) aspects of the Mill.

Similarly, I’m curious about what other parameters are “baked” into the Mill architecture. If I recall correctly, the bit-width of the address space is another such. (64-bit pointers, with 60 bits of true address plus four bits reserved for garbage collection and forking.) Since pointers must fit in single belt positions, this seems to require a minimum height of 8 bytes for all Mill family members. The shortest belt length mentioned to date is 8, for the Tin. I suspect that a shorter belt length (e.g. 4) would effectively rule out wide issue, since results need to live on the belt for at least the spill/fill latency.

Very little is baked-in. Cache-in-virtual is, so the address space must be big enough to hold the space of all processes without overlap, which implies 64-bit pointers for general-purpose work. Extending shared addresses off-chip might require 128-bit pointers in some future supercomputer, but that’s not GP. Likewise, single-process apps (some embedded for example) might fit in 32 bits or smaller, but that’s not GP either.

Belt size is not baked; it is set so that spill of single-use operands is rare. Eight might be too small even for a Tin, but we might leave Tin somewhat hamstrung with a too-small belt for market segmentation reasons anyway. All this tuning waits on real data.

Similarly, two streams of half-bundles (at least until we get that hypercube memory ;-) ) and three blocks to decode within each half-bundle. Other baked-in Mill-architecture parameters?

This is not really baked in; you could add a clock to either side’s decode and get two more blocks if you had a use for them. Earlier Mills had four blocks on the exu side, but we merged two of them to save the clock. If, on some member, we find that a decode stage is clock-limiting the rest of the machine, then we might split and rearrange the existing blocks, costing us another clock in decode but speeding up the overall clock rate. Whatever sim says works best overall.