Try and use one question per post. It’s easier for the reply and the readers.
What are the advantages of using the belt for addressing over direct access to the output registers? Is this purely an instruction density thing?
What’s an output register?
Why does the split mux tree design prevent you from specifying latency-1 instructions with multiple drops? Couldn’t you just have a FU with multiple output registers feeding into the latency-1 tree? I’m not able to visualize what makes it difficult.
Hardware fanout and clock constraints. Lat-1 is clock critical and the number of sources (and the muxes for them) add latency to the lat-1 FUs. Lettng lat-1s drop two results would double the number of sources for the belt crossbar, and it’s not worth it. Lat-2 and up have much more clock room.
For that matter, how does the second-level mux tree know how to route if the one-cycle mux tree only knows a cycle in advance? It seemed to me like either both mux trees would be able to see enough to route the info, or neither would. Does this have to do with the logical-to-physical belt mapping, because that’s the only thing I can think of that the second-level mux tree would have over the one-cycle tree.
There’s no problem with the routing itself; everything is known for all latencies when the L2P mapping is done. The problem is in the physical realization of that routing in minimal time. A one-level tree would have so many sources that we’d need another pipe cycle in there to keep the clock rate up.