At the end of decode0 (D0) all the exu-side reader block has been decoded. This is important because we can start the scratchpad fill operations in D1, with the data available at the end of D2 (I0, X-1) for use by the adds in X0. Two cycles to read the scratchpad SRAM is plenty.
We also have the flow-side flow block decoded at the end of D0, but that doesn’t help because we need the flow-side extension and manifest blocks (which are really just data arrays) to make sense out of those decodes. The extension and manifest blocks are interleaved into the long (very long) constant in D1 and the selectors set up to extract the bit fields belonging to each op (based on the extCount/conCount fields in the already-decoded flow block) at the end of D1. The selectors extract the constants of any con ops and have them available at the end of D2, ready for the adds in X0 again.
The clock-critical part of all this is the priority-encodes of the extcount/conCount fields that set up the selectors. Priority encode is a linear parse, and with as many as eight flow slots that’s a lot to figure out. The priority encode and the actual selection can blur over the two cycles D1/D2, but this may constrain clock rate at high clock on very wide Mills, forcing us to add another decode cycle. A Gold at 4GHz is not for the faint of heart! It’s all heavily process dependent of course. Best guess is that the width vs clock constraints will pinch in the fastpath belt crossbar before they pinch in the con operation decode, but we don’t know.