Last question, and then I should probably stop there:
– 19) Pipelining loads
The Mill’s stated strategy for hiding load latencies is to give them a long delay and fit some work in between. If you have a pipelined loop, that means the loop reaches its steady state once its first load has retired, and the following loads can reasonably be expected to retire in sequence.
If your steady state is one cycle long (e.g. the same instruction repeated over and over), that means your loop will have LOAD_COUNT * LOAD_DELAY loads in flight at any given time. E.g. if your loop does three loads per cycle with a delay of 5 cycles, you can expect your loop to have 15 loads in flight during its steady state.
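To make that arithmetic concrete, here is a rough sketch of the model I have in mind. It assumes every load retires exactly LOAD_DELAY cycles after it issues and that the loop issues the same number of loads every cycle. The function names are mine and purely illustrative, not anything from the Mill toolchain.

```python
from collections import deque

def loads_in_flight(loads_per_cycle: int, load_delay: int) -> int:
    """Steady-state count: every load issued in the last
    `load_delay` cycles is still pending."""
    return loads_per_cycle * load_delay

def simulate_peak_in_flight(loads_per_cycle: int, load_delay: int,
                            cycles: int = 100) -> int:
    """Toy simulation (assumed model): issue `loads_per_cycle` loads
    each cycle, each retiring exactly `load_delay` cycles later."""
    pending = deque()  # retire cycle of each outstanding load, oldest first
    peak = 0
    for cycle in range(cycles):
        # Retire every load whose delay has elapsed.
        while pending and pending[0] <= cycle:
            pending.popleft()
        # Issue this cycle's loads.
        for _ in range(loads_per_cycle):
            pending.append(cycle + load_delay)
        peak = max(peak, len(pending))
    return peak

print(loads_in_flight(3, 5))          # three loads per cycle, delay of 5
print(simulate_peak_in_flight(3, 5))  # peak occupancy seen by the simulation
```

Both calls agree on the 15 loads from the example above: the closed form is just the occupancy the simulation settles into once the first loads start retiring.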
The Silver Core will supposedly have 16 retire stations; in your talks, you say that you expect L2 hits to have about 10 cycles of latency. That means if you want your pipeline to run smoothly and you expect your data won’t be in the L1 cache, you can’t have more than a single load per cycle in your loop (two loads per cycle with a 10-cycle delay would need 20 retire stations). So you can’t, say, iterate over two arrays and return the sum of their pairs.
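Spelling out the budget, as a sketch only: this assumes each in-flight load holds one retire station for its whole delay, and it takes the 16 stations and the roughly 10-cycle L2 latency at face value. The helper names are hypothetical.

```python
def stations_needed(loads_per_cycle: int, load_delay: int) -> int:
    """Retire stations occupied in steady state, assuming one station
    per in-flight load for the full delay."""
    return loads_per_cycle * load_delay

def max_loads_per_cycle(retire_stations: int, load_delay: int) -> int:
    """Largest whole number of loads per cycle that fits the budget."""
    return retire_stations // load_delay

RETIRE_STATIONS = 16  # figure quoted for Silver (unconfirmed)
L2_DELAY = 10         # approximate L2 hit latency from the talks

print(max_loads_per_cycle(RETIRE_STATIONS, L2_DELAY))  # loads per cycle that fit
print(stations_needed(2, L2_DELAY))                    # a[i] + b[i]: two loads per iteration
```

Under those assumptions the pairwise-sum loop wants 20 stations against the 16 available, which is the mismatch I’m asking about.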
There are some obvious mitigations (including, as we’ve discussed, software prefetching and stride prefetching). Generally speaking, how do you expect compilers to address that problem?
(I’m guessing the answer will start with “streamers” and end with “NYF”)