Reply To: Control flow divergence and The Belt
There are several possible implementations for the remapper. I’m not a hardware guy, but I’m told that there’s no one-size-fits-all: some approaches work well for short belts but don’t scale to very long belts, which need a different approach, and where the cutover(s) between implementations fall also depends on the process and the clock.
One representative approach is to attach a physical number (physnum) to each operand when it is created, using a simple take-a-number scheme. Each consumer knows which physnum it wants and broadcasts that to all the sources; the source holding an operand with a matching physnum responds with the operand. This is a straightforward M×N crossbar, and it accounts for a large part of the area and power on the larger members. A conventional genreg machine has a similar crossbar running from all registers to all FUs. I’m told that there are well-known ways to use hardware cleverness to reduce the cost of such crossbars, but they are not well known to me.
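To make the take-a-number and broadcast steps concrete, here is a minimal software sketch. It is only a toy model, not how the hardware is built: the `Crossbar` class, its method names, and the choice to model each source as a dictionary are all hypothetical, and the eviction of a stale operand when a physnum is reused is my own assumption.

```python
class Crossbar:
    """Toy model of take-a-number tagging plus broadcast matching."""

    def __init__(self, n_sources, physnum_bits=4):
        # Each source is modeled as a dict: physnum -> operand value.
        self.sources = [dict() for _ in range(n_sources)]
        self.modulus = 1 << physnum_bits
        self.next_physnum = 0

    def create(self, source_index, value):
        """Tag a new operand with the next physnum and park it in a source."""
        physnum = self.next_physnum
        self.next_physnum = (physnum + 1) % self.modulus
        # Assumption: a reused physnum evicts whatever stale operand held it.
        for held in self.sources:
            held.pop(physnum, None)
        self.sources[source_index][physnum] = value
        return physnum

    def fetch(self, physnum):
        """Broadcast a physnum; the single source holding a match responds."""
        hits = [held[physnum] for held in self.sources if physnum in held]
        if len(hits) != 1:
            raise LookupError(f"physnum {physnum}: {len(hits)} responders")
        return hits[0]


xbar = Crossbar(n_sources=3)
a = xbar.create(0, "sum")       # operand produced by source 0
b = xbar.create(2, "product")   # operand produced by source 2
print(xbar.fetch(b))            # every source is asked; only source 2 answers
```

In software the broadcast is a loop over the sources; in the crossbar described above all sources would compare against the requested physnum at once.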
The crossbar works in physnums; it is the job of the remapper to convert from the belt position numbers (beltnums) to physnums. The mapper (for an example 16-position belt) is a circular array of 4-bit adders, where each adder holds the physnum corresponding to the beltnum index in the array. Translation is just indexing into the array based off the current origin. Advancing the belt is just bumping the origin with wraparound. Dropping new physnums is taking the operation’s drop count and the statically defined drop order to get a zero-origin sequence of dropping physnums, which is biased by the origin and used to replace the content of the affected adders. Bulk remap (including branch) just builds a new zero-origin drop list by using the beltnums from the op to index the adders (this is just like indexing for the arguments of any op) and then drops the whole thing normally. There is lots of time in the cycle to do a 4-bit index for a 4-bit result and a couple of 4-bit adds in time for the physnums to be available for broadcast.
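The mapper's behavior can be sketched in a few lines of software. This is a hedged illustration only: the `BeltMapper` class and its method names are hypothetical, the slots are plain list entries rather than adders, and the convention that the first entry of a drop list lands at belt position 0 is an assumption I picked for the example, not something stated above.

```python
class BeltMapper:
    """Toy beltnum -> physnum mapper built on a circular array."""

    def __init__(self, length=16):
        self.length = length
        self.slots = [None] * length  # each slot holds one physnum
        self.origin = 0               # slot index of belt position 0

    def lookup(self, beltnum):
        """Translate a beltnum by indexing relative to the origin."""
        return self.slots[(self.origin + beltnum) % self.length]

    def drop(self, physnums):
        """Advance the belt and install a zero-origin drop list."""
        if len(physnums) > self.length:
            raise ValueError("drop list longer than the belt")
        # Bumping the origin (with wraparound) ages every existing entry.
        self.origin = (self.origin - len(physnums)) % self.length
        # Bias each zero-origin position by the origin and overwrite the slot.
        for i, physnum in enumerate(physnums):
            self.slots[(self.origin + i) % self.length] = physnum

    def remap(self, beltnums):
        """Bulk remap: look up the named operands, then drop them normally."""
        self.drop([self.lookup(b) for b in beltnums])


belt = BeltMapper()
belt.drop([1])        # belt: 1
belt.drop([2, 3])     # belt: 2, 3, 1
belt.remap([2, 0])    # belt: 1, 2, 2, 3, 1
print([belt.lookup(i) for i in range(5)])  # prints [1, 2, 2, 3, 1]
```

Note that `remap` reads all of its source positions before any slot is overwritten, which matches the description of building the new drop list first and then dropping it as a unit.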
An alternative useful for the larger members does not go from beltnum to physnum to broadcast, but instead maps directly from beltnum to residence number. Each possible location that can hold an operand gets a resnum; there are a lot of locations on the bigger members, so resnums are 6 or 7 bits. The mapper works the same way, but the adders respond to the index step with a resnum, not a physnum, and the drops are resnums rather than physnums too. The resnums directly address the datapaths feeding the crossbar, so there’s no broadcast, although the crossbar itself remains. However, when an operand is moved from one residence location to another, the mapper adder content must be updated with the new resnum; this is in effect a drop. In the broadcast approach the physnum of an operand never changes, so operands can be moved without updating the mapper.
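The extra bookkeeping in the resnum scheme can be sketched the same way. Again this is a toy under stated assumptions: `ResidenceMapper`, `relocate`, and `move_operand` are hypothetical names, the residence locations are modeled as a dictionary, and rewriting every slot that names the old resnum is my guess at what "in effect a drop" has to accomplish.

```python
class ResidenceMapper:
    """Toy beltnum -> resnum mapper; slots name storage locations directly."""

    def __init__(self, length=16):
        self.length = length
        self.slots = [None] * length  # each slot holds one resnum
        self.origin = 0               # slot index of belt position 0

    def lookup(self, beltnum):
        return self.slots[(self.origin + beltnum) % self.length]

    def drop(self, resnums):
        self.origin = (self.origin - len(resnums)) % self.length
        for i, resnum in enumerate(resnums):
            self.slots[(self.origin + i) % self.length] = resnum

    def relocate(self, old_resnum, new_resnum):
        """Patch every slot that still names the old location (assumption)."""
        for i, resnum in enumerate(self.slots):
            if resnum == old_resnum:
                self.slots[i] = new_resnum


def move_operand(residences, mapper, old_resnum, new_resnum):
    """Move an operand between locations and keep the mapper in step."""
    residences[new_resnum] = residences.pop(old_resnum)
    mapper.relocate(old_resnum, new_resnum)


residences = {40: "x", 41: "y"}   # resnum -> operand value
mapper = ResidenceMapper()
mapper.drop([40, 41])             # belt position 0 -> resnum 40
move_operand(residences, mapper, 40, 97)
print(residences[mapper.lookup(0)])  # direct addressing, no broadcast
```

The lookup result addresses the storage directly, which is the point of the scheme; the cost is that every move has to touch the mapper, something the physnum version avoids.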
Or so I’m told. It’s all transparent to the compiler level that I work at.