Reply To: spiller work optimisation

Ivan Godard

Exactly. However, while we describe the belt and the spiller as two different physical units for clarity, in reality they are just two different ways of looking at the same collection of hardware. The rarely mentioned guts are actually the skid buffers and their replay mechanism. It’s not possible to stall a function unit in mid-pipe, so either you throw away what it does and re-issue, or you catch what it does and replay the result, in effect re-retiring it.

Replay is hard if an operation can fault or if there are ordering hazards. In a conventional machine this is handled by OOO renaming, at some cost, but without OOO the hazards pretty much dictate use of re-issue, and even OOO machines use re-issue, I suppose as much out of tradition as anything. The Mill’s NaRs and SSA belt eliminate the fault and hazard issues and we have static ordering, so we can use replay. Replay is much cheaper than re-issue, especially with long pipes. However, replay means we have to catch results produced during a stall, and for that we need skid buffers, and the ability to feed buffer contents to function units as we come back up from stalls.
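To make the cost difference concrete, here is a toy model in Python. It is only a sketch under simplifying assumptions, not a description of real Mill hardware: one function unit, one issue per cycle, a fixed latency, and a single stall window. The names `run`, `skid`, `in_flight` and the tuple layout are invented for illustration. With `replay=True`, results that emerge during the stall are caught and re-retired afterwards; with `replay=False`, in-flight work is thrown away and executed again.

```python
from collections import deque

def run(ops, latency, stall_start, stall_len, replay):
    """Toy single-FU pipeline. Returns (retired results, executions)."""
    pending = deque(ops)   # (name, fn, arg) in program order
    in_flight = deque()    # (emerge_cycle, op, value)
    skid = deque()         # (name, value) caught during the stall
    retired = []           # results in the order they are retired
    executions = 0
    cycle = 0
    while pending or in_flight or skid:
        stalled = stall_start <= cycle < stall_start + stall_len
        if stalled and not replay:
            # Re-issue: discard in-flight work and queue it to run again,
            # keeping program order at the front of the pending queue.
            pending.extendleft(op for _, op, _ in reversed(in_flight))
            in_flight.clear()
        if not stalled:
            # Replay: re-retire caught results before anything newer.
            retired.extend(skid)
            skid.clear()
        # The FU cannot be stopped mid-pipe, so results emerge on schedule.
        while in_flight and in_flight[0][0] == cycle:
            _, (name, _, _), value = in_flight.popleft()
            (skid if stalled else retired).append((name, value))
        if not stalled and pending:
            op = pending.popleft()
            executions += 1
            in_flight.append((cycle + latency, op, op[1](op[2])))
        cycle += 1
    return retired, executions

square = lambda x: x * x
ops = [("a", square, 1), ("b", square, 2), ("c", square, 3), ("d", square, 4)]
print(run(ops, latency=3, stall_start=3, stall_len=2, replay=True))
print(run(ops, latency=3, stall_start=3, stall_len=2, replay=False))
```

Both runs retire the same results in the same order, but in this toy the replay run executes each operation once (4 executions) while the re-issue run repeats the three operations that were in flight when the stall hit (7 executions). The longer the pipe, the more work is in flight to be thrown away.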

Given that machinery, it’s only a small step to providing the paths by which the skid buffer contents can be moved to SRAM, in addition to the paths to the FUs. That’s the top of the spiller. The bottom of the spiller, which moves content from SRAM to the memory hierarchy, is largely independent.

Skid buffer replay must work at core-clock and full machine width, which is a huge amount of data with very tight timing. Consequently it doesn’t have time to decide if something is really needed as it comes out the back of the FU; a buffer has to catch it whether it is there or not. This leads to holes in the buffers, of course. Those holes get squeezed out later in the bucket brigade that leads eventually to DRAM. Exactly where “later” falls is up to the implementation.
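The catch-everything-then-squeeze idea can be sketched in a few lines of Python. This is purely illustrative: the function names, the use of `None` to mark a hole, and the `(cycle, slot)` tags are my assumptions for the example, not how any Mill implementation identifies buffered values.

```python
def catch_cycle(fu_outputs):
    """Fast path: capture one entry per FU slot, with no time to test validity.
    A slot that produced nothing this cycle is recorded as None (a hole)."""
    return list(fu_outputs)

def squeeze(rows):
    """Slow path, somewhere later in the bucket brigade: drop the holes,
    keeping the surviving values in order and tagged with their origin."""
    return [(cycle, slot, value)
            for cycle, row in enumerate(rows)
            for slot, value in enumerate(row)
            if value is not None]

# Three cycles on a hypothetical 3-wide machine; most slots are empty.
rows = [catch_cycle([1, None, 2]),
        catch_cycle([None, None, None]),
        catch_cycle([None, 3, None])]
print(squeeze(rows))  # [(0, 0, 1), (0, 2, 2), (2, 1, 3)]
```

The point of the split is that the fast path does the same fixed amount of work every cycle regardless of what the FUs produced, and the data-dependent work of compaction is deferred to a stage with looser timing.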