The most painful aspect of running this code on a Mill is the pipeline startup cost when the iteration count is so small: half of the total time is spent starting the inner loop. Change the counts to have more vantages and the efficiency improves. I don’t know enough about x86 to say whether it would scale similarly.
O.K., could you please suggest a larger number of comparisons (currently 144) that would better address the startup cost, if possible one whose only prime factors are 2, 3, 5 (perhaps 7)? Would 360 or 2520 do it? (I’d reduce the number of runs against the same data accordingly.) Then I can run the experiment again, pairing smaller inner with larger outer element counts and vice versa (as before).
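For picking the counts, a rough sketch like the following may help. It lists every inner/outer split of a candidate total and checks its prime factors. The function names are made up for illustration. `steady_state_fraction` is only a toy model that assumes the startup cost is a fixed number of iteration-equivalents per inner loop; it is not a measurement of the Mill or of x86.

```python
def prime_factors(n):
    """Return the prime factors of n in ascending order, with multiplicity."""
    factors = []
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors.append(p)
            n //= p
        p += 1
    if n > 1:
        factors.append(n)
    return factors


def factor_pairs(total):
    """Return every (inner, outer) split with inner * outer == total."""
    return [(d, total // d) for d in range(1, total + 1) if total % d == 0]


def steady_state_fraction(inner, startup):
    """Toy model (assumption): share of inner-loop time not spent on startup,
    with startup expressed in iteration-equivalents."""
    return inner / (inner + startup)


if __name__ == "__main__":
    for total in (144, 360, 2520):
        pairs = factor_pairs(total)
        print(total, prime_factors(total), len(pairs), "inner/outer splits")
```

Under that toy model, "half of the total time is spent starting the inner loop" corresponds to a startup cost roughly equal to the inner iteration count, so lengthening the inner loop is what raises the fraction. A total with many divisors simply offers more inner/outer pairings to try.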