You can pipeline the loads as far ahead as you want, up to the maximum deferral expressible in the encoding or, equivalently, the maximum number of distinct pick-tokens expressible.
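As a rough illustration of that limit, here is a toy model (not Mill code) of the stall a single load sees when it is issued some number of loop iterations ahead of its use. The function name, the parameters, and every number in the usage lines are hypothetical, chosen only to show the effect of the cap; in particular the value 31 is not the Mill's actual maximum deferral.

```python
def stall_cycles(latency, lookahead_iters, cycles_per_iter, max_deferral):
    """Cycles a load's consumer waits in this toy model.

    latency         -- cycles until the loaded value is available
    lookahead_iters -- how many iterations ahead of its use the load is issued
    cycles_per_iter -- cycles taken by one loop iteration
    max_deferral    -- largest deferral the encoding can express (assumed value)
    """
    # Deferral the schedule would like, capped by what the encoding allows.
    requested = lookahead_iters * cycles_per_iter
    deferral = min(requested, max_deferral)
    # Any latency not covered by the deferral shows up as a stall.
    return max(0, latency - deferral)


# A short (cache-hit-like) latency is fully hidden by one iteration of lookahead.
hit_stall = stall_cycles(latency=3, lookahead_iters=1,
                         cycles_per_iter=3, max_deferral=31)

# A long (DRAM-like) latency is only partly hidden, and extra lookahead
# stops helping once the requested deferral exceeds the cap.
miss_stall_small = stall_cycles(latency=300, lookahead_iters=4,
                                cycles_per_iter=3, max_deferral=31)
miss_stall_capped = stall_cycles(latency=300, lookahead_iters=100,
                                 cycles_per_iter=3, max_deferral=31)
```

In this sketch, lookahead past the cap buys nothing, which is why the encoding limit is the bound on how far ahead the loads can be pipelined.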
OOO can potentially do a better job than the Mill at hiding DRAM latency for simple loads; hiding DRAM latency is not the purpose of deferral, although deferral does help somewhat. There are even some cases where OOO can do a better job of hiding cache latency; those cases mostly involve cascaded loads. We believe such cases to be insignificant, but as yet we have no numbers to back that belief.
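For readers unfamiliar with the term, the sketch below assumes "cascaded" means that the address of each load depends on the value returned by the previous one, as in a pointer chase. It is a generic timing model, not a model of the Mill or of any particular OOO machine, and the function names and numbers are hypothetical. It only shows why such chains are hard for any scheme to hide: the loads cannot overlap one another.

```python
def independent_loads_cycles(n, latency):
    """Cycles to complete n independent loads, issuing one per cycle.

    All loads overlap, so the last one completes `latency` cycles
    after it issues.
    """
    if n <= 0:
        return 0
    return (n - 1) + latency


def cascaded_loads_cycles(n, latency):
    """Cycles to complete n loads where each address depends on the
    previous load's result, so each must wait for the one before it."""
    if n <= 0:
        return 0
    return n * latency


# Four loads at a 10-cycle latency: overlapped versus chained.
overlapped = independent_loads_cycles(4, 10)
chained = cascaded_loads_cycles(4, 10)
```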
The Mill does extensive prefetch on the code side; the mechanism was explained in the Prediction talk. What it does on the data side must await a future talk.