Several comments. First, the 3-cycle latency is a reasonable approximation for use in sims when we cannot yet measure the Verilog timing. The actual number will depend heavily on the member, and on process and clock rate too. Don’t count on that number, except for ballpark figuring.
Second, code using the vector registers is usually loop code, and the Mill’s software pipelining of loops hides the actual latency if the data is in the cache.
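To put rough numbers on the latency-hiding point, here is a toy cycle count, not a model of any real Mill member. It assumes every load hits in cache, one cycle of compute per iteration, and the 3-cycle figure used as a ballpark; the function names are made up for illustration.

```python
def serial_cycles(n, latency):
    """Each iteration waits out the full load latency, then computes for one cycle."""
    return n * (latency + 1)


def pipelined_cycles(n, latency):
    """Loads are issued `latency` iterations ahead of their use.

    The latency is paid once to fill the pipeline; after that one
    iteration retires per cycle.
    """
    return 0 if n == 0 else latency + n


if __name__ == "__main__":
    n, latency = 1000, 3  # assumed values, ballpark only
    print(serial_cycles(n, latency), pipelined_cycles(n, latency))
```

Under these assumptions the latency shows up once per loop rather than once per iteration, which is why the exact figure matters little for loop code.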
Third, there is in fact a D$0 capability, although it is not a cache in the conventional sense the way that the D$1 is. The filing is in, but it needs pictures to explain. Sorry.
Fourth, on a well-configured Mill member the load bandwidth is enough to keep up with the compute capacity, so if you are doing scalar compute then doing scalar loads doesn’t slow you down. The only real advantage to doing a vector load followed by an explode and scalar compute is that it might save a little power compared to doing scalar loads and scalar compute without the explode. Of course, if you can do vector compute then you certainly should do vector loads, and we support that.
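The two scalar-compute alternatives can be sketched in ordinary Python to show what actually differs between them. This is only an illustration: the vector width of 4, the function names, and the use of a load count as a stand-in for power are all assumptions, and nothing here reflects real Mill encodings or timings.

```python
def scalar_loads(memory, f):
    """One load operation per element, followed by scalar compute."""
    loads = 0
    results = []
    for i in range(len(memory)):
        x = memory[i]          # scalar load
        loads += 1
        results.append(f(x))   # scalar compute
    return results, loads


def vector_load_explode(memory, f, width=4):
    """One load operation per `width` elements, exploded into scalars."""
    loads = 0
    results = []
    for base in range(0, len(memory), width):
        vec = memory[base:base + width]   # vector load (the last one may be partial)
        loads += 1
        for x in vec:                     # explode into scalar lanes
            results.append(f(x))          # scalar compute, same as above
    return results, loads


if __name__ == "__main__":
    data = list(range(10))
    square = lambda x: x * x
    print(scalar_loads(data, square)[1], vector_load_explode(data, square)[1])
```

Both paths do the same scalar compute and produce the same results; the only difference is how many load operations are issued, which is where any power saving would come from.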