Tagged: cache latency
- mhkool (Participant) — November 4, 2015 at 3:58 pm — Post count: 7
In one of the talks it was mentioned that the L1 data cache has a latency of 3 cycles, which is better than Haswell's but might still cause stalls.
On Haswell one can load 32 bytes into a ymm register (4 cycles) and extract individual bytes from the ymm register into a general-purpose register (1 cycle).
In my problem domain I work a lot with loops over character arrays. Given that the Mill does not have long-lived registers, I was wondering whether it is worth the extra work and logic to have some kind of L0 data cache with a very small size (between 1 and 8 cache lines?) and a latency of 1 cycle, similar to Haswell's timings. The intention is to eliminate stalls inside the loops.
- Ivan Godard (Keymaster) — November 4, 2015 at 4:31 pm — Post count: 689
Several comments. First, the 3-cycle latency is a reasonable approximation for use in sims when we cannot yet measure the Verilog timing. The actual number will be heavily member dependent, and on process and clock rate too. Don’t count on that number, except for ball-park figuring.
Second, code using the vector registers is usually loop code, and the Mill's software pipelining of loops hides the actual latency if the data is in the cache.
Third, there is in fact a D$0 capability, although it is not a cache in the conventional sense the way that the D$1 is. The filing is in, but it needs pictures to explain. Sorry.
Fourth, on a well-configured Mill member the load bandwidth is enough to keep up with the compute capacity, so if you are doing scalar compute then doing scalar loads doesn't slow you down. The only real advantage of doing a vector load followed by an explode and scalar compute is that it might save a little power compared to doing scalar loads and scalar compute without the explode. Of course, if you can do vector compute then you certainly should do vector loads, and we support that.