The abstraction is accurate. For a belt with a nominal 32 entries, any data sink can take its data from any position. There are actually twice as many physical entries (64) as nominal ones, to allow for phasing, but each phase can see only a nominal-sized window into the physical belt space. Without that, we would have to renumber the belt at each phase boundary, which is not practical.
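As a rough illustration of the nominal-window idea, here is a minimal Python sketch. It assumes the belt can be modeled as a circular buffer with a moving front pointer, with position 0 being the most recent drop. The class and method names are hypothetical, and the sketch does not model phasing or the real hardware; it only shows that readers address a 32-entry window over 64 physical entries without any entry being moved or renumbered.

```python
class Belt:
    """Toy model: a circular buffer read through a nominal-size window.

    Hypothetical illustration only; not the actual hardware organization.
    """

    def __init__(self, nominal=32, physical=64):
        assert physical >= nominal
        self.nominal = nominal
        self.physical = physical
        self.slots = [None] * physical
        self.front = 0  # physical index of the most recent drop

    def drop(self, value):
        # Advance the front pointer instead of shifting existing entries.
        self.front = (self.front + 1) % self.physical
        self.slots[self.front] = value

    def read(self, pos):
        # Position 0 is the newest value; only the nominal window is visible.
        if not 0 <= pos < self.nominal:
            raise IndexError("belt position outside the nominal window")
        return self.slots[(self.front - pos) % self.physical]


belt = Belt()
for v in range(40):
    belt.drop(v)

newest = belt.read(0)    # the value dropped last
oldest = belt.read(31)   # the oldest value still visible in the window
```

Older values remain in the physical buffer for a while after they leave the window, but a reader using nominal positions cannot reach them.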
There is a crossbar, but it is split into a narrow fastpath and a wide slowpath. One-cycle latency results (a maximum of one per slot) go directly to the fastpath, while two values (for each consumer) are selected from the slowpath and passed to the fastpath. This is explained in greater detail in the Belt talk.
However, although there are many possible sources for the crossbar, there are many fewer sinks than you assume. In particular, slots accept at most two input arguments from the belt, and quite a few (everything in the Reader block) accept none at all. The cost of the crossbar is determined almost entirely by the number of consumers, which is one reason why Gold has only 8 exu slots.
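To make the scaling argument concrete, here is a back-of-the-envelope sketch in Python. It assumes a deliberately simplified model in which every sink needs a selector over every source, so the cost is counted as sources times sinks. The function name and the source count of 32 are illustrative assumptions, not figures for any real family member; only the 8 slots and two inputs per slot come from the text above.

```python
def crossbar_crosspoints(sources, slots, inputs_per_slot=2):
    """Count crosspoints in a simplified full crossbar (hypothetical model).

    Each slot contributes at most `inputs_per_slot` sinks, and each sink
    is assumed to select from every source.
    """
    sinks = slots * inputs_per_slot
    return sinks * sources


# Assumed source count of 32, chosen only for illustration.
eight_slots = crossbar_crosspoints(sources=32, slots=8)
sixteen_slots = crossbar_crosspoints(sources=32, slots=16)
```

In this model, doubling the number of consuming slots doubles the crosspoint count for a fixed set of sources, which is the sense in which the consumer count drives the cost.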
Similarly, while a slot can concurrently drop results from differing-latency ops issued in that slot, the long-term steady-state peak throughput is ~one result per slot, which determines how we size the spiller.
We describe the belt implementation as in effect a tagged distributed CAM, because that is easy to explain, and for many people a concrete realization is more understandable than a more abstract description. However, the implementation of the belt is invisible to the program model, and the hardware guys are free to come up with bright ideas if they can, which will likely differ in different family members.
There is expected to be a talk this Fall giving the present most-recent final word about the gate-level hardware of the first Belt implementation.