In more detail: a retire station is a dedicated, blocking, non-hardware-pipelined device, much like a classic divide unit, which can have only one divide in flight at a time. Contrast this with a hardware-pipelined device such as a multiplier, where you can start a new multiply every cycle and have several in flight at different stages of execution.
Now apply that information to the scheduler in the specializer. A software pipeline is constrained by the resources needed by the code it executes, so if (for example) the loop body contains six adds and the target member has four ALUs, then the software pipeline must be at least two cycles long. The scheduler determines this minimum by considering blocking ranges as it schedules. For an ordinary, hardware-pipelined operation like an add, the blocking range is the one cycle for issue, regardless of the latency of the operation; you can issue an add every cycle. In contrast, a 14-cycle blocking divide has a 14-cycle blocking range, so if the loop body has a divide and the hardware has one divide unit, then the software pipeline must be at least 14 cycles long, even if it has to fill those cycles with no-ops.
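That resource-constrained lower bound can be sketched in a few lines. This is illustrative arithmetic, not the real specializer; the function name and the dict shapes are mine, and the op counts are the examples from the text.

```python
from math import ceil

def min_pipeline_cycles(ops, units):
    """Lower bound on software-pipeline length, in cycles.

    ops:   {op_name: (count_per_iteration, blocking_range_cycles)}
    units: {op_name: number_of_functional_units}
    """
    # An op occupies one unit for its whole blocking range, so each
    # iteration consumes count * blocking unit-cycles per op kind,
    # while each pipeline cycle supplies `units[op]` of them.
    return max(ceil(count * blocking / units[op])
               for op, (count, blocking) in ops.items())

# Six one-cycle adds on four ALUs: at least two cycles.
print(min_pipeline_cycles({"add": (6, 1)}, {"add": 4}))    # 2

# One 14-cycle blocking divide on one divide unit: at least 14 cycles.
print(min_pipeline_cycles({"div": (1, 14)}, {"div": 1}))   # 14
```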
Now to apply that to loads. The amount of deferral of a load is up to the code, up to a maximum number of cycles determined by the representation (for count-deferred loads) or with no upper limit (for tag-deferred loads). In particular, the deferral can be zero, so that a load issued in this cycle can retire in the next issue cycle, albeit not in the next clock cycle. A load uses a retire station as a blocking resource, but the blocking range covers only the period from issue to retire of that load. Consequently, if loads use zero deferral, the limit on parallelism is the lesser of the number of load units (i.e. the number of loads that can be decoded) and the number of retire stations. All Mill configurations will have more retire stations than load units (otherwise deferral would be pointless), so in practice the constraint for non-deferred loads is the number of load units. Thus non-deferred loads can be pipelined just like adds, because they have a one-cycle blocking range just like an add.
Now introduce deferral. If there are eight retire stations, then we can have eight loads in flight at any one time. If there are four loads in the loop body, then each load can have a blocking range of two, which means the loop can be software-pipelined with each of the four loads specifying a one-cycle deferral (i.e. a two-cycle flight time). This constraint is static, and the blocking range can be figured as soon as you know what the loop does and have the specs for the target, i.e. at specialize time.
Now it happens that stalls are expensive, because modern hardware doesn't stop on a dime, nor start immediately either. A two-cycle deferral will always cause a stall, even if the data is in the top-level cache, because the cache access latency is longer than that. So it is better to have a deferral that is at least the d$1 latency, assumed to be three cycles for initial Mills.
So for performance the specializer (when scheduling a pipelined loop) wants to use a deferral of at least three. That means it cannot issue all four loads every cycle, because that would require 12 loads in flight and there are only eight retire stations. Consequently the scheduler must add extra cycles to the software pipeline, just as it does with a blocking divide. The resulting code will contain no-ops.
Of course, if the schedule does not run out of retire stations even with a deferral of three, it can use a larger deferral for some or all of the loads, until the cumulative blocking region exceeds the available stations. The same algorithm is used for all ops with a blocking region greater than one; the only thing different about loads is that the size of the blocking region is determined by the code rather than by the hardware spec.
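The trade-off in the last two paragraphs can be sketched in both directions: stretch the pipeline until the loads' cumulative blocking region fits the stations, or, given a pipeline length, grow the deferral until the stations run out. Same hedges as before: the names and the station-per-blocking-cycle accounting are mine, not the Mill's.

```python
from math import ceil

def min_cycles_for_loads(loads_per_iter, blocking, stations):
    # Each iteration consumes loads * blocking station-cycles; each
    # pipeline cycle supplies `stations` of them.
    return ceil(loads_per_iter * blocking / stations)

def max_blocking_for_loads(loads_per_iter, stations, cycles):
    # Largest blocking region per load that still fits the stations.
    return (stations * cycles) // loads_per_iter

# Four loads each blocking for three cycles against eight stations:
# 12 station-cycles per iteration, so the pipeline stretches to 2 cycles.
print(min_cycles_for_loads(4, 3, 8))    # 2

# And once the pipeline is 2 cycles long, the deferral could grow until
# each load blocks for four cycles (16 station-cycles fit in 2 * 8).
print(max_blocking_for_loads(4, 8, 2))  # 4
```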
Incidentally, loads are uncommon in Mill software-pipelines, for reasons that are NYF 🙂