In a sense nearly all Mill operations are speculative; only those (like store) that change globally visible state cannot be speculated. So you can issue as many loads as there are functional units to handle the issue and retire stations to hold the values. An OOO machine has a similar constraint: it cannot have more loads in flight than it has issue and buffer capacity to handle. Consequently, in this respect the Mill and an OOO machine are equally constrained for a given amount of hardware resources.
Now to your example. You assume that the Mill will issue four loads and stall picking up the first one, while the OOO can issue six more. The problem with the example is: why didn’t the Mill issue all ten loads before it tried the first pickup? If there is code after a load that doesn’t depend on that load, the compiler is supposed to hoist that code either to before the load issue or to between the load issue and the load pickup. If the compiler did not hoist that code, then there is a compiler bug, not an architecture problem. We’ll have compiler bugs for sure (I write compilers), but we fix them.
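To make the hoisting point concrete, here is a rough toy model in Python. It is only a sketch under loud assumptions: one operation per cycle, a single made-up load latency of 10 cycles, and no limit on issue width or retire stations. None of those numbers or names (`stall_cycles`, `unhoisted`, `hoisted`) come from the Mill; they are hypothetical and exist only to compare the two schedules.

```python
def stall_cycles(schedule, latency):
    """Count stall cycles in a toy in-order machine with deferred loads.

    schedule is a list of (op, tag) pairs, executed one per cycle:
      ("load", tag)    issue a load; its value arrives `latency` cycles later
      ("pickup", tag)  consume the value; stalls until it has arrived
    """
    cycle = 0
    ready_at = {}   # tag -> cycle at which the loaded value is available
    stalls = 0
    for op, tag in schedule:
        if op == "pickup":
            wait = max(0, ready_at[tag] - cycle)
            stalls += wait
            cycle += wait
        elif op == "load":
            ready_at[tag] = cycle + latency
        cycle += 1  # every operation occupies one issue cycle
    return stalls


LATENCY = 10  # hypothetical latency, chosen only for illustration

# The schedule from the example: four loads, an early pickup, then six more.
unhoisted = (
    [("load", i) for i in range(4)]
    + [("pickup", 0)]
    + [("load", i) for i in range(4, 10)]
    + [("pickup", i) for i in range(1, 10)]
)

# What the compiler is supposed to produce: all ten loads hoisted
# above the first pickup.
hoisted = (
    [("load", i) for i in range(10)]
    + [("pickup", i) for i in range(10)]
)

print("unhoisted stalls:", stall_cycles(unhoisted, LATENCY))
print("hoisted stalls:  ", stall_cycles(hoisted, LATENCY))
```

With these assumed numbers the unhoisted schedule stalls and the hoisted one does not, which is the whole argument: the stall comes from the schedule, not from the hardware.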
Incidentally, the purpose of the Mill deferred load facility is to mask the L1 and L2 latencies, not to mask the DRAM latency; masking the DRAM latency would require being able to hoist the issue an impractical distance ahead of the pickup. Instead there are other (NYF) mechanisms to mask the DRAM latency.