It’s not clear to me why it is taking so long on the Intel.
The data seems small enough that the V.j will stay in the D$1, as will the N.i; there are only 12 lines for N.i and probably only one for the V.j if I have understood the problem. Hence the inner loop has a load (of N.i) and 12 compares and a branch and some overhead, but it shouldn’t be over 64 instructions. That makes 5×2^30 operations, which with a 2.5GHz clock should be under a second.
Hence I have not understood the problem, or you are getting something odd.
First, are you repeating the same N.i for each iteration (as I assume above), or are you following a different linked list each time? If different then you will be memory bound and the timing you give is plausible.
However, the fact that you can cut the time in half doing the compares in SIMD argues that you are not memory bound. One possibility is that you are thrashing in the D$1 or the TLB. If all the N.i addresses are in the same D$1 cache way then you will be using the D$2 cache because of thrashing. Are the addresses used for the N.i commensurate with some stride that might put them all in the same cache way?
Thrashing in the TLB is a different problem. You say that the twelve are in different pages, so you will be touching at least 12 TLB entries, plus whatever gets used for stack and the V.j etc. I don’t know how many TLB entries your Intel has, but if you are using more than are there then you will thrash in the TLB replacement algorithm, which is a known performance problem. You can test this hypothesis by increasing the size N of the N.i list over several runs; if there’s a sudden knee drop in performance at some point going from N to N+1 then you have likely hit the size of the TLB.
Now, to your question about the Mill:
- If (as I surmise) you are reusing the same list for each of the 2**24 runs then the data will be brought in only once from memory and will be in cache thereafter. Mill deferred loads (see the Memory talk) will hide some of that first fetch, as will the Intel OOO, but that first run will be at DRAM speed for both machines.
- Both machines can thrash in the D$1 if the data are all in the same cache way on the particular machine. Way mappings vary, and most machines (including the Mill) skew the mapping so hitting a bad stride mapping would be just bad luck. The chance of it happening is roughly equal on the two machines.
- The TLB issue is a major difference between Mill and Intel architectures. Intel has to check the TLB for every N.i fetch even when the data is in cache; the Mill does not. This is covered in the Memory talk. If the slowness is due to TLB thrash then the Mill will run *much* faster than the Intel. The Mill does have to check protection in the PLB, but the PLB entries cover an entire protection domain, typically the whole program, rather than needing an entry per page, so you won’t thrash.
The speed of the comparison loop on a Mill depends on the vector element count, which in turn depends on the data size (which you don’t give) and the Mill family member involved. Assuming 4-byte data and a mid-range Mill then the vectors will be 16 or 32 bytes, so either 4 or 8 elements, giving three or two iterations for the 12 comparisons. I am assuming that the code doesn’t know there are 12 V.j, but is iterating through a V.j array; if the code statically knows that there are only 12 then you (or rather the compiler) would open-code the comparisons. With software pipelining and again a midrange Mill each iteration will be two cycles. However the iteration count is very short so startup time would be important. At a reasonable guess each set of 12 comparisons using 4-element vectors would take 9 cycles as a loop, and 7 cycles as open code; with 8-element vectors that becomes 7 and 5.
The fetch of the N.i in the outer loop can be fully pipelined, so overall the outer loop iterations should be less than 10 cycles, so say 2^30 cycles for everything. At a 1GHz clock rate (expected for the initial Mills; faster will come later), that’s 250 milliseconds.
This analysis is making rash assumptions about the nature of your problem. I don’t understand why Intel isn’t doing better; the Intel cores are much narrower than a Mill, but they have 2.5X the clock rate which should make up for a lot. TLB thrashing is a plausible explanation but needs to be confirmed.