Mill Computing, Inc

Keymaster

March 28, 2017 at 11:53 am

That was pretty much what our initial design did, except that it was a real switch op, not just an indexed branch, and bounds-checked the argument into the default case.

The mucking arose because the Mill code-side is both heavily streamed and heavily pipelined. It’s a fire-hose and the code side has to do all it can to keep everything out of the way. The whole code side is done on a whole-cache-line basis. Recall that each instruction is variable length and may be up to a whole line long and may cross a line boundary. The instruction header says how long it is, so the front end of decode is an instruction shifter that is two lines wide and cracks the line stream into an instruction stream.
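To make the cracking step concrete, here is a toy software model of a two-line-wide shifter. It is only a sketch under loud assumptions: the 8-byte line size, the one-byte length header, and the name `crack` are all invented for illustration and are not the Mill's actual encoding.

```python
LINE_BYTES = 8  # hypothetical line size, kept tiny for readability


def crack(lines):
    """Split a stream of fixed-size lines into variable-length instructions.

    Assumed encoding (not the Mill's): byte 0 of every instruction holds its
    total length in bytes, header included, between 1 and LINE_BYTES.
    """
    window = b""  # models the shifter; it never holds two full lines
    instructions = []
    for line in lines:
        if len(line) != LINE_BYTES:
            raise ValueError("the code side only moves whole lines")
        window += line
        # Peel off every instruction that lies completely inside the window.
        while window:
            length = window[0]
            if not 1 <= length <= LINE_BYTES:
                raise ValueError("bad length header")
            if length > len(window):
                break  # instruction crosses into the next line; wait for it
            instructions.append(window[:length])
            window = window[length:]
    return instructions
```

Because an instruction is at most one line long, the leftover after each pass is always shorter than a line, which is why a window two lines wide is enough in this model.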

The cracker line fetch sequence is controlled by the predictor, so a predicted taken transfer causes the lines containing the target to replace the lines containing the transfer in the cracker. After the cracker the line relationship is lost. A miss discards the inflight instruction stream and starts the cracker with a new line-pair just as if it had acted on a predicted transfer.

In those streams there’s no place for an arbitrarily large data structure like a jump table; it won’t be fetched as part of the line stream, and whatever part of it was in the remainder of the line containing the switch op would have been discarded as the cracker was refilled from the predicted switch target. Moreover, we don’t even know that we have a switch until two cycles after the line-cracker, and the cracker is quite likely to have followed one or two predictions before we know – the table is probably no longer in decode at all, and we have to be prepared to get it, or at least the desired address, from cache.

Consequently, other than the potential value of maybe part of the table having been brought to the icache along with the instruction containing the switch, there’s no value in having the table adjacent to the switch op except that it gives a cheap way to find the address of the table.

Now that we know the address of the table (having decoded the switch op enough to discover we need an address), we have to do an indexed load and pull up eight bytes that we can compare against the prediction we followed a couple of cycles ago to see if we went where we should have. That’s a round-trip to the cache. However, the icache is set up to do whole lines – and nothing but whole lines. So to get the case target address we have to pull up a line (needs an adder somewhere to add the table address and the case index), and isolate eight bytes out of that line, which means there’s a shifter somewhere too. Can’t use the ALUs or L/D address adders for the add; they are busy working down the predicted path. Can’t use the ALU shifters for the isolate; they aren’t line wide and are also busy. So we need an adder in the branch unit and an isolate shifter in the icache, equivalent to what is on the front of a dcache, and both adder and isolater are used for nothing but the rare case of a switch. And you have several more data paths to mux, and another command signal or packet for the icache to deal with.
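The add, whole-line fetch, isolate and compare sequence can be sketched in software as below. This is an illustrative model only: the 64-byte line, the little-endian entry layout, the 8-byte table alignment, and all function names are assumptions rather than anything stated about the real hardware.

```python
LINE_BYTES = 64   # hypothetical line size
ENTRY_BYTES = 8   # the eight-byte case target address


def fetch_line(memory, line_number):
    """The only access the modelled icache offers: one whole line."""
    start = line_number * LINE_BYTES
    return memory[start:start + LINE_BYTES]


def switch_target(memory, table_addr, case_index):
    # The adder: table base plus the scaled case index.
    entry_addr = table_addr + case_index * ENTRY_BYTES
    # Whole-line fetch. Assumes the table is 8-byte aligned, so an entry
    # never straddles two lines.
    line = fetch_line(memory, entry_addr // LINE_BYTES)
    # The isolate shifter: pick eight bytes out of the line.
    offset = entry_addr % LINE_BYTES
    return int.from_bytes(line[offset:offset + ENTRY_BYTES], "little")


def prediction_confirmed(memory, table_addr, case_index, predicted_target):
    # The success comparison against the target we already followed.
    return switch_target(memory, table_addr, case_index) == predicted_target
```

Each helper stands in for a piece of hardware that would exist only to serve switches, which is the cost being complained about.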

The latency is bad too. We don’t know we have a switch op until D2, plus a cycle for the add, plus maybe two or three cycles to get the line from the i$1 (the table won’t be in the i$0 microcache – I don’t even want to think about that), plus a cycle for the isolater, plus a cycle back and a cycle for the success comparison, just to confirm that you did indeed predict correctly. Then what happens if the switch target (as predicted) immediately does a switch too? Is switch handling fully pipelined? Probably not – you will stall, but then you will be stalling anyway while you figure out the address.
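Tallying the per-step figures quoted above gives a rough confirmation latency. The numbers come from the post itself (the i$1 fetch is the hedged "maybe two or three"); the stage labels are mine.

```python
# (stage, fewest cycles, most cycles), counted from the point in D2 where
# the switch op is first recognised.
stages = [
    ("address add", 1, 1),
    ("line fetch from i$1", 2, 3),
    ("isolate", 1, 1),
    ("trip back", 1, 1),
    ("compare with prediction", 1, 1),
]

best = sum(low for _, low, _ in stages)
worst = sum(high for _, _, high in stages)
print(f"{best} to {worst} cycles after D2")  # prints "6 to 7 cycles after D2"
```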

Muck.

Nowadays the jump table (when we get around to it) is on the data side, which already has the address-adder for an indexed load and the isolaters in the data path from the D$1, so no logic need be added to support switches. The ISA has dynamic branches that take the target address from the belt, for which the prediction confirmation takes zero latency, just like any other branch. The indexed load is separate from the branch and so can be hoisted as far ahead of the branch as the switch argument value is known. With any luck the pointer will already be on the belt by the time we need it in the dynamic branch, and if there’s a data dependence that prevents hoisting the load then we are still no worse off than a D$1 latency.
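As a loose software analogy for splitting the switch into an indexed load plus a dynamic branch, consider the sketch below. Everything in it is hypothetical: Python functions stand in for code addresses, a local variable stands in for the belt, and the bounds check into a default target is borrowed from the original switch-op design rather than stated for the current one.

```python
def load_target(jump_table, default_target, arg):
    """Indexed load on the data side.

    Assumption: out-of-range arguments select the default target.
    """
    if 0 <= arg < len(jump_table):
        return jump_table[arg]
    return default_target


def run_switch(arg, other_work):
    def case_zero():
        return "zero"

    def case_one():
        return "one"

    def default():
        return "default"

    jump_table = [case_zero, case_one]

    # The load is issued as soon as the switch argument is known ...
    target = load_target(jump_table, default, arg)
    # ... unrelated work overlaps the load latency ...
    other_work()
    # ... and the dynamic branch simply consumes the already-loaded target.
    return target()
```

For example, `run_switch(1, lambda: None)` returns `"one"` and `run_switch(5, lambda: None)` returns `"default"`. The point of the shape is that the load and the branch are separate operations, so the load can move up while the branch stays put.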

No added logic, no added data paths, no added muxes, no high fixed latency, no stalls (if hoistable).

No muck.

Reply To: switches