Mill Computing, Inc

Keymaster

January 13, 2014 at 7:50 pm

Post count: 689

OK, some kind folks on comp.arch have reproduced your result, getting 10 second times in assembler but on a faster x86. So what would you get on a Mill?

I’ll describe what I would do if I were writing asm; I *think* the compiler would find all this, but it’s not certain. I will also assume that the target member is not width-constrained, and will see what width is necessary for best code.

First, the structure is a three-deep loop:
for each many times {for each route node { for each vantage{}}}.
The vantages can be vectorized on a Mill despite being a while-loop. That makes the code look like:

middle loop:
   load route node
   splat node to vector
   init inner counter
   init vantage pointer
   spill middle counter
   spill node pointer
   inner loop:
      load vantage vector
      equal vantage/node vectors
      bool reduce equal vector
      exit if found
      increment vantage pointer
      decrement inner counter
      equal inner counter/zero
      repeat inner loop if not done
   fill middle counter
   fill node pointer
   decrement middle counter
   advance node pointer
   equal middle counter/zero
   repeat middle loop if not done

The outer loop is similar but won’t impact the timing.

The inner loop takes two cycles on a Mill exclusive of the load; the limiting data path is the vector equal and the Boolean reduction, which are one cycle ops but dependent. However, the loop would be pipelined to a single cycle, with five exu (ALU) ops and three flow (mem/control) ops. The dataflow dependency and the latency of the load don’t disappear though, they wind up in the pipeline startup time. The minimum latency for the load is three (hit in D$1) plus the two cycles for the dataflow, means that it takes five cycles for the first iterations and one each thereafter. If we assume a 16-byte vector size (four elements) and 12 vantages then there are three iterations of the inner loop. Hence the whole inner loop takes 8 cycles.

The middle loop must save its loop control data over the execution of the inner loop, spilling the data to scratchpad and then filling it back. It should also keep the initialization of the inner loop around for reuse in the same way. That means 4 spills and 4 fills, plus its own decrement/test/branch and the load of the node. This also pipelines, but not to one cycle, because part must be before and part after the inner loop. However, both the front and back parts become single cycles.

The width required for the front part is four writers (the spills), two flow ops (the load and a conform to start the inner loop), and one exu (the decrement). For the back part it’s four readers (the fills), one exu (the equal), and one flow (the branch). The middle loop takes one cycle (the front), 8 cycles (the inner loop) and one cycle (the back), for 10 cycles, but needs a pipeline startup time of 4 cycles. It executes 12 times, so the total time for the middle loop is 124 cycles. Note that 64 cycles of that are pipeline startup costs 🙁

The outer loop will add a couple of cycles to that on each iteration, but to a first approximation the overall time should be 2^7 cycles per outer iteration, and 5×2^31 for the whole thing. With a 1GHZ clock that is 10 seconds, remarkably consistent with the x86 time on the faster machines reported on comp.arch. The maximal width required is 4 readers, 4 writers, 5 exu and 4 flow, which is roughly a Silver.

There are a few rash assumptions in this analysis:

Everything is in top-level cache after the first iteration. This is a determining factor for both Mill and x86 on this code.
The loop predictor can predictor the inner and middle branch every time. This also is necessary for both Mill and x86 on this code.
The Mill tool chain can figure out how to wrap the nested loop. This is believed true, but hasn’t been proven yet. As a fallback, the inner loop could be wrapped in a function that the middle loop calls, which works in this code but not in arbitrary code.

The most painful aspect of doing this code on a Mill is the pipeline startup costs when the iteration count is so small – half of the total time is spent starting the inner loop. Change the counts to have more vantages and the efficiency improves. I don’t know enough about x86 to know if it would scale similarly.

There are some aspects of the Mill that are relevant here but I haven’t used because Not Filed Yet.

Overall: Silver (a mid-range Mill) performance on this code is roughly equal to a high-end x86 with three times the clock rate and a much larger power consumption.

This reply was modified 10 years, 6 months ago by Ivan Godard.

Reply To: Searching