A full crossbar connects every source to every sink simultaneously. The Mill contains a full crossbar, and indeed each sink can obtain an operand from any source, even when all sinks take their operands from a single source. On larger Mill family members the crossbar is whopping big, although it is smaller than the corresponding structure on an out-of-order (OOO) machine because it has fewer sources: there are no rename registers. There is no stall.
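As a rough illustration of what "full" means here, the toy model below gives each sink its own selector that may name any source. Nothing can conflict, so nothing stalls, and several sinks may pick the same source. The function names and the source/sink counts are hypothetical and chosen only for illustration. They are not Mill parameters.

```python
# Toy model of a full crossbar: every sink has its own selector that can pick
# any source, so routing never conflicts and never stalls.
# Names and sizes here are hypothetical, for illustration only.

def crossbar_route(sources, selects):
    """Return the value delivered to each sink.

    sources: values available on the source side this cycle
    selects: one source index per sink (repeats are allowed)
    """
    return [sources[s] for s in selects]


def crosspoint_count(num_sources, num_sinks):
    # A full crossbar needs a path from every source to every sink,
    # so its size grows with sources * sinks.
    return num_sources * num_sinks


if __name__ == "__main__":
    srcs = ["a", "b", "c", "d"]
    print(crossbar_route(srcs, [2, 2, 2]))  # every sink reads source 2
    print(crossbar_route(srcs, [0, 3, 1]))  # arbitrary permutation
    # Hypothetical counts: fewer sources (e.g. no rename registers) means fewer crosspoints.
    print(crosspoint_count(64, 32), crosspoint_count(48, 32))
```

The size function only shows the scaling argument: for a fixed number of sinks, removing sources shrinks the structure proportionally.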
The time to transit a crossbar depends on the fanout, i.e. the number of possible sources that can feed a sink; this is a natural consequence of the mux-tree implementation. This latency has a significant impact on cycle time, which must include a crossbar transit. Uniquely, the Mill splits the crossbar into a small crossbar (the fastpath) that contains only a small fraction of the total sources, and a big one (the slowpath) that contains the rest. The fastpath crossbar is sized so as to have no cycle-time impact. The slowpath does have cycle-time impact, or would have if it had to fit in a cycle. However, the Mill is organized so that everything using the slowpath is already multi-cycle, and we simply accept the added slowpath latency for those already-slower operations.
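To make the mux-tree argument concrete, the sketch below counts how many levels of k-to-1 multiplexers are needed to select one of N sources; each level adds gate delay, so transit time grows with the logarithm of the number of selectable sources. The 2-to-1 default and the 8/120 split between fastpath and slowpath sources are assumptions for illustration, not actual Mill figures.

```python
# Depth of a mux tree built from k-to-1 multiplexers that selects one of
# `fanout` sources. Each level adds roughly one mux delay to the transit.
# The mux width and source counts below are hypothetical.

def mux_tree_depth(fanout, mux_inputs=2):
    """Number of mux levels needed to choose among `fanout` sources."""
    if fanout <= 1:
        return 0  # a single source needs no selection
    depth = 0
    reach = 1  # sources selectable with `depth` levels
    while reach < fanout:
        reach *= mux_inputs
        depth += 1
    return depth


if __name__ == "__main__":
    total, fast = 128, 8  # hypothetical source counts
    print("single crossbar:", mux_tree_depth(total))
    print("fastpath:", mux_tree_depth(fast))
    print("slowpath:", mux_tree_depth(total - fast))
```

With these made-up numbers the fastpath needs 3 levels of 2-to-1 muxes while the slowpath still needs 7, which is why only the small crossbar can be made to fit inside the cycle.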
Of course, many multi-cycle operations do not fill their last cycle, and can use the slowpath without adding a cycle of latency. Others don't quite fit: an op that would take three cycles if it could use the fastpath takes four when it has to use the slowpath. However, over the entire workload, letting the popular ops use the fastpath is a winner.
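The rounding effect described above can be shown with simple arithmetic. The sketch assumes a simplified model in which an op's latency is its compute time plus its crossbar transit, rounded up to whole cycles; the time units, cycle length, and transit times are all hypothetical.

```python
# Simplified latency model (an assumption, not the Mill's actual timing):
# latency in cycles = ceil((compute time + crossbar transit) / cycle time).
# All times are in hypothetical integer units.

CYCLE = 100     # hypothetical cycle time
FASTPATH = 10   # hypothetical fastpath transit
SLOWPATH = 30   # hypothetical slowpath transit


def op_latency_cycles(compute_time, transit_time, cycle_time=CYCLE):
    # Integer ceiling division avoids floating-point rounding issues.
    return -(-(compute_time + transit_time) // cycle_time)


if __name__ == "__main__":
    # An op with slack in its last cycle absorbs the slowpath: 3 cycles either way.
    print(op_latency_cycles(250, FASTPATH), op_latency_cycles(250, SLOWPATH))
    # An op that barely fits goes from 3 cycles (fastpath) to 4 (slowpath).
    print(op_latency_cycles(280, FASTPATH), op_latency_cycles(280, SLOWPATH))
```

Whether a given op pays the extra cycle depends only on how much slack remains in its final cycle, which is why the fastpath is best reserved for the ops where that cycle matters most.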