Very clear, thanks.

The only drawback I see to pipelining deeper than needed to mask the L1 latency is the extra overhead it incurs, which may or may not be significant depending on the number of loop iterations.
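
To put a rough number on that trade-off, here is a minimal sketch. It assumes the overhead in question is the pipeline fill/drain cost (prologue and epilogue), and it uses a simple model in which a loop pipelined over `stages` stages takes `iterations + stages - 1` initiation intervals in total. Both the model and the name `pipeline_overhead_fraction` are my own assumptions for illustration, not something taken from the discussion above.

```python
def pipeline_overhead_fraction(iterations: int, stages: int) -> float:
    """Fraction of total loop time spent filling and draining the pipeline.

    Assumed model (hypothetical, for illustration only): a software-pipelined
    loop with `stages` stages and a fixed initiation interval needs
    (iterations + stages - 1) intervals in total, of which (stages - 1)
    are fill/drain overhead.
    """
    if iterations < 1 or stages < 1:
        raise ValueError("iterations and stages must be >= 1")
    return (stages - 1) / (iterations + stages - 1)


if __name__ == "__main__":
    # Compare shallow and deep pipelining for short and long loops.
    for n in (8, 64, 1024):
        for s in (2, 4, 8):
            frac = pipeline_overhead_fraction(n, s)
            print(f"iterations={n:5d} stages={s}: overhead={frac:.1%}")
```

Under this model, 8 stages on an 8-iteration loop spend 7 of 15 intervals on fill/drain, while 4 stages on a 1024-iteration loop spend well under 1%. So the extra depth only really hurts when the trip count is small.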