I’ve been thinking a bit about how to do interpreters in general on the Mill, if you are not doing a JIT. The ideas are applicable to Forth, although they might be easier in a one-stack model.
The actual execution part of a code will tend to be only a few and frequently only one instruction on a wide Mill. Consequently the bulk of the cost will be in the dispatch overhead. Of course, some codes may take much longer, but the fast-exec codes should dominate.
The dispatch overhead in turn will be dominated by the cost of the dynamic jump in the direct-threaded code (or the switch, if that is used), and icache churn if the engine doesn’t fit. However, at least for Forth I think 2X32k should hold the entire interpreter. At 3 bytes/op (generous, even on big Mills) you have 20k ops available for the interpreter; the Forth code itself is on the data side and doesn’t collide.
The problem with that dynamic jump is that it is megamorphic and will miss all the time. The next exec is not likely to be in the i$0 microcache, so besides the 4 cycle miss we have a i$1 fetch to do, say 7-8 cycles total depending on banking and cache implementation details. When the exec is only 1-2 cycles you can see where the time goes.
So the question is how to make the indirect jump go away. Let’s assume that we know a priori that all ops take two exec cycles and the Forth code has no branches. If we do:
then we stall in every dispatch. But the Mill supports deferred branches: compute the address and predicate now, but do the transfer then. See First Winner Rule in the docs. This would permit us to dispatch early:
<- transfer of dispatc0 happens here exec0 dispatch3 <- transfer of dispatch1 happens here exec1 dispatch4 <- transfer of dispatch2 happens here exec2 dispatch 5
Because we are computing the actual target, the true target will override a (bogus) prediction in the decoder and we won’t miss on the transfers; essentially we have pipelined the execution.
There are issues. Mill deferred branches take a static deferral value; we can’t calculate how long we should be waiting to get all the in-between execs done. So the amount of the deferral has to be selected to fit an optimal number of dispatches and typical execs in the number of cycles it will take for the prediction/decode hardware to have the target instruction ready to go; that is a very member dependent number, but readily tunable.
Then there are the execs that don’t fit in the allotted standard cycle count. Those can be done by a call out to the full exec routine; branch deferral is frame-local on the Mill, so it doesn’t matter how long the out-of-line exec routine takes, the deferred transfer will be waiting to take effect after the return. The call itself should always predict, because it is static.
The biggest problem is what to do when the Forth etc. codes themselves do a branch, because we will have already dispatched (and have transfers waiting for) the codes down the fall-through path. In essence the deferred dispatch is a form of branch prediction of the Forth code, and the prediction can miss at the Forth level even though there are no misses at the interpreter level. When we do miss we need to get rid of those queued-up transfers.
One way is to use the Mill inner op and then exit the putative loop when we execute what turns out to have been a Forth branch code that didn’t fall through. The exit discards all in-flights, both data in FUs and FWR transfers pending. We will then need to do a new inner and a few dispatches without execs to get the pipeline restarted at the correct Forth target, but that shouldn’t be hard. However, inner and its exit are spiller-active and have a power cost of roughly a call; it’s not clear whether this matters in the event though.
A second approach would be to put an explicit “branch coming up” code in the Forth code, which would cause the dispatch logic to stop putting out deferred branches and let the pipeline drain to empty when the exec reached the branch. The branch exec would then restart the pipe down whichever path was correct.
Which leads to a final approach: add deferred branching to the Forth codeset. With that, the dispatch pipeline always knows where the next real code will be and can do a Mill deferred branch to have it ready for exec at the right time. The drawback is that frequently there is nothing useful to do between the time the Forth predicate is known and when we have to defer to to avoid a Mill miss; welcome to no-ops (or take the Mill miss and stall).
Whether any of this pays depends on the dynamic frequency of branches among Forth codes. Unless you add a Forth-level predictor you will be taking a 7-8 cycle hit at every branch to restart the pipeline, and a predictor is asynchronous enough to the execution cycle that I don’t think it can be coded into the static code of the Mill along with the dispatch and exec. The rule of thumb for common languages is one branch every 5-8 instructions, so pipelining the dispatch should be a win: a 6-instruction basic block with a two-cycle dispatch/exec becomes 5×2+8 or 18 cycles, whereas without the dispatch pipeline it is 6×8 or 48 cycles. Factors of 2.5 are not to be sneezed at.