You wander a bit, but I’ll try to address some of your points.
The belt is not equivalent to a multi-port register file because there’s no update; everything is SSA, so if anything it acts like a RF with only read ports and no write ports. There is an all-source-to-all-sink distribution problem, conceptually akin to a crossbar matrix, but one of the patents explains how we can do 30X40 and have it time like 8×20. However, the distribution is still the scaling limit for the belt.
Phasing is valuable in open code because of the shape of the typical dataflow graph: a few long chains, with local short business at each node of the long chain. We’re not often executing two long chains simultaneously, but we are often doing several bush pieces at once and feeding that to the long chain. The bush fragments tend to be short (1-3 ops) and to fall naturally into the three cycles given by phasing, so we can overlap the later bush phases of this long-chain node with the earlier phases of the bush of the next node. It sounds a bit like breadth-first search, but the presence of the long chains means that the placement algorithm is different.
We get long open code because we do a lot of speculative execution. It’s a little like trace scheduling.
Some years ago we had vectors ultra-wide operands (UWOs). UWOs make sense on a machine with a dedicated UW RF, but not on the belt where scalars are mixed in, so we replaced UWOs with our skyline vectors.
About seeing silicon: go to https://millcomputing.com/investor-list/
Predication: we have predicated forms of (nearly) all the operations that cannot be speculated, notably all control flow and stores. There’s no point in wasting entropy and compiler monkey-puzzle on the rest.
Managed systems: we have some gc-centric facilities, We also expect that a Mill JIT would take advantage of the regularity and simplicity of the ISA to generate to the machine rather than to a memory-to-memory legacy abstraction. The big Mill wins in a managed environment will not come from the execution core however; it will come from the reduced memory traffic, cheap coherence, and ultra-fast task switch.