More than two retires of latency-N ops per cycle: how do the crossbars handle this case?
In the belt lecture, in the section on data forwarding (slides 50 and 51, around 42 minutes into the talk), the latency-N crossbar is shown with only two outputs, both feeding into the latency-1 crossbar. In the accompanying description, Ivan says words to the effect that:
everybody else [results from lat-N ops] goes through a separate crossbar…. that winnows all those inputs down to two inputs, which are in turn routed to the latency-one crossbar….
If the (only?) outputs of the lat-N crossbar are the two winnowed values that feed into the lat-1 crossbar, how can the results of more than two lat-N operations retiring in the same cycle be forwarded for use by ops in the following cycle? (Or even make it to the belt itself?) Since family members such as Gold have FU populations that can execute many more than two latency-N operations per instruction, I suspect there must be a way for the rest of those lat-N results to be forwarded, although exactly how isn’t clear to me.
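To make the concern concrete, here is a toy Python sketch of how I'm reading the slides. Everything in it (the `winnow` function, the `LAT_N_OUTPUTS` constant, the result names, and the first-come selection) is my own invention for illustration, not anything from the talk, and it says nothing about how the real hardware picks or routes results.

```python
# Toy model of the question only; NOT the real Mill forwarding network.
# Assumption: the lat-N crossbar has exactly two outputs, as drawn on the slides.
LAT_N_OUTPUTS = 2


def winnow(retiring, outputs=LAT_N_OUTPUTS):
    """Split the lat-N results retiring in one cycle into those that fit
    through the crossbar outputs and those left without a path.

    The first-come selection here is arbitrary (hypothetical).
    """
    forwarded = retiring[:outputs]
    stranded = retiring[outputs:]
    return forwarded, stranded


# Four hypothetical lat-N results retiring in the same cycle.
forwarded, stranded = winnow(["mul0", "mul1", "mul2", "mul3"])
print(forwarded)  # ['mul0', 'mul1']
print(stranded)   # ['mul2', 'mul3']
```

In this reading, the two "stranded" results are exactly the ones I can't see a path for.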
I’ll suppress my urge to speculate on how this is/could be done, and hope that the answer doesn’t require stalls.