Mill Computing, Inc. › Forums › The Mill › Tools › Applications › Searching

- AuthorPosts
- #461 |
This is a reposted email from Klaus Witzel:

==========================================================

Hello,my background is in software production for large institutes (financial industry, manufacturers, military) who seek to harness their Big Data. Can I ask how the general search problem is addressed by the Mill, which parts of it are involved and how. I’ve seen the videos about the Belt and must say I’m very impressed.

The general search problem can be illustrated as a route from A to Z with intermediate stops, e.g. A -> B -> … Y -> Z. The task at hand is to compare a list of *known* vintage points V.j for a match with one of the nodes N.i on the *unknown* route. The following pseudo code does this job:

– while (N.i not at end of route)

– if (any of V.j = current N.i) then exit (found a match).

– else N.i < = next node on the routeThe problem with present CPU architecture is that the route is a linked list and each of its nodes N.i can be on a different memory page. The plain list of vintage points V.j is small but each node N.i has to be compared to each V.j vintage point.I've performed experiments on Intel(R) Core(TM) architecture, on my cheap M480 2.67 Ghz it takes ~60secs with SIMD instructions and ~30secs with SISD instructions for doing the search 5 x 2^24 times on 12 nodes and 12 vintage points (144 comparisons per loop).I'd be delighted if you'd shed some light on how the Mill can attack the general search problem. Companies like Facebook, Google, etc, with their huge graphs of nodes and vintage points have this Big Data search problem.With kind regards,Klaus D. Witzel It’s not clear to me why it is taking so long on the Intel.

The data seems small enough that the V.j will stay in the D$1, as will the N.i; there are only 12 lines for N.i and probably only one for the V.j if I have understood the problem. Hence the inner loop has a load (of N.i) and 12 compares and a branch and some overhead, but it shouldn’t be over 64 instructions. That makes 5×2^30 operations, which with a 2.5GHz clock should be under a second.

Hence I have not understood the problem, or you are getting something odd.

First, are you repeating the same N.i for each iteration (as I assume above), or are you following a different linked list each time? If different then you will be memory bound and the timing you give is plausible.

However, the fact that you can cut the time in half doing the compares in SIMD argues that you are not memory bound. One possibility is that you are thrashing in the D$1 or the TLB. If all the N.i addresses are in the same D$1 cache way then you will be using the D$2 cache because of thrashing. Are the addresses used for the N.i commensurate with some stride that might put them all in the same cache way?

Thrashing in the TLB is a different problem. You say that the twelve are in different pages, so you will be touching at least 12 TLB entries, plus whatever gets used for stack and the V.j etc. I don’t know how many TLB entries your Intel has, but if you are using more than are there then you will thrash in the TLB replacement algorithm, which is a known performance problem. You can test this hypothesis by increasing the size N of the N.i list over several runs; if there’s a sudden knee drop in performance at some point going from N to N+1 then you have likely hit the size of the TLB.

Now, to your question about the Mill:

- If (as I surmise) you are reusing the same list for each of the 2**24 runs then the data will be brought in only once from memory and will be in cache thereafter. Mill deferred loads (see the Memory talk) will hide some of that first fetch, as will the Intel OOO, but that first run will be at DRAM speed for both machines.
- Both machines can thrash in the D$1 if the data are all in the same cache way on the particular machine. Way mappings vary, and most machines (including the Mill) skew the mapping so hitting a bad stride mapping would be just bad luck. The chance of it happening is roughly equal on the two machines.
- The TLB issue is a major difference between Mill and Intel architectures. Intel has to check the TLB for every N.i fetch even when the data is in cache; the Mill does not. This is covered in the Memory talk. If the slowness is due to TLB thrash then the Mill will run *much* faster than the Intel. The Mill does have to check protection in the PLB, but the PLB entries cover an entire protection domain, typically the whole program, rather than needing an entry per page, so you won’t thrash.

The speed of the comparison loop on a Mill depends on the vector element count, which in turn depends on the data size (which you don’t give) and the Mill family member involved. Assuming 4-byte data and a mid-range Mill then the vectors will be 16 or 32 bytes, so either 4 or 8 elements, giving three or two iterations for the 12 comparisons. I am assuming that the code doesn’t know there are 12 V.j, but is iterating through a V.j array; if the code statically knows that there are only 12 then you (or rather the compiler) would open-code the comparisons. With software pipelining and again a midrange Mill each iteration will be two cycles. However the iteration count is very short so startup time would be important. At a reasonable guess each set of 12 comparisons using 4-element vectors would take 9 cycles as a loop, and 7 cycles as open code; with 8-element vectors that becomes 7 and 5.

The fetch of the N.i in the outer loop can be fully pipelined, so overall the outer loop iterations should be less than 10 cycles, so say 2^30 cycles for everything. At a 1GHz clock rate (expected for the initial Mills; faster will come later), that’s 250 milliseconds.

This analysis is making rash assumptions about the nature of your problem. I don’t understand why Intel isn’t doing better; the Intel cores are much narrower than a Mill, but they have 2.5X the clock rate which should make up for a lot. TLB thrashing is a plausible explanation but needs to be confirmed.

I don’t know if it was a typo, but he actually said SIMD *doubled* the time it took.

So he did. I’m mystified.

As I read the problem description, N is in a set of 2^24 records of 12 members each, so that data clearly would not fit in cache. If N is in a linked list with scattered order, cache thrashing could be fierce, with near every new N hitting the full latency to RAM. ~5M cycles total estimate using 300 cycles/RAM access. The size of an individual member isn’t given, guessing 32bits. I’m also figuring 156 loads from cache (small register set) to perform a set of 144 comparisons between members of N.i and V.j in the SISD case. ~18M cycles total estimate. Still a log way from accounting for the time reported.

Another email from the OP, now containing source code which I have put in DropBox at https://www.dropbox.com/s/9jggxl33qu5wrny/Implementors.c. I will try to improve my Mill guesstimate in another post after looking at the code more closely.

=============================================================================Hi Ivan,

I’m not the expert for discussing TLB thrashing or thrashing in the D$1 in a forum, and therefore take the liberty and send the small source code file of the experiment together with my observations. All parameters (# of nodes, dword distance/page(s) between nodes, # of dword vantage points, # of repetitions of the run against the same data) can be configured by passing named parameters through the compiler. The experiment is compiled with the -m32 option and -O3 for gcc, on Vista and on Ubuntu (gcc versions are in the .c source).

There appears no noticeable difference between taking a) 1 dwords, b) 3 dword, c) etc, k) 8191 dwords of distance between the linked nodes, so I cannot say that anything was observable regarding the page size or TLB.

The experiment was run for combinations of 144 comparisons, divided into n = 3,4,6,8,12,18,24,36,48 vantage points and the corresponding m = 48,36,24,18,12,8,6,4,3 entries in the linked list, that is: n*m = 144 was kept constant. There appears a rather small (< 1/3) difference and this could be attributed to the variable cost of loop control; but YMMV.A completely difference problem space is to increase the number of comparisons, since this grows quadratic. I simply have no tool for measuring what effect was caused by what, and therefore leave it at that.I will study your detailed response in the forum for which thank your very much.With kind regards,Klaus D. Witzel

OK, some kind folks on comp.arch have reproduced your result, getting 10 second times in assembler but on a faster x86. So what would you get on a Mill?

I’ll describe what I would do if I were writing asm; I *think* the compiler would find all this, but it’s not certain. I will also assume that the target member is not width-constrained, and will see what width is necessary for best code.

First, the structure is a three-deep loop:

`for each many times {for each route node { for each vantage{}}}.`

The vantages can be vectorized on a Mill despite being a while-loop. That makes the code look like:`middle loop: load route node splat node to vector init inner counter init vantage pointer spill middle counter spill node pointer inner loop: load vantage vector equal vantage/node vectors bool reduce equal vector exit if found increment vantage pointer decrement inner counter equal inner counter/zero repeat inner loop if not done fill middle counter fill node pointer decrement middle counter advance node pointer equal middle counter/zero repeat middle loop if not done`

The outer loop is similar but won’t impact the timing.

The inner loop takes two cycles on a Mill exclusive of the load; the limiting data path is the vector equal and the Boolean reduction, which are one cycle ops but dependent. However, the loop would be pipelined to a single cycle, with five exu (ALU) ops and three flow (mem/control) ops. The dataflow dependency and the latency of the load don’t disappear though, they wind up in the pipeline startup time. The minimum latency for the load is three (hit in D$1) plus the two cycles for the dataflow, means that it takes five cycles for the first iterations and one each thereafter. If we assume a 16-byte vector size (four elements) and 12 vantages then there are three iterations of the inner loop. Hence the whole inner loop takes 8 cycles.

The middle loop must save its loop control data over the execution of the inner loop, spilling the data to scratchpad and then filling it back. It should also keep the initialization of the inner loop around for reuse in the same way. That means 4 spills and 4 fills, plus its own decrement/test/branch and the load of the node. This also pipelines, but not to one cycle, because part must be before and part after the inner loop. However, both the front and back parts become single cycles.

The width required for the front part is four writers (the spills), two flow ops (the load and a conform to start the inner loop), and one exu (the decrement). For the back part it’s four readers (the fills), one exu (the equal), and one flow (the branch). The middle loop takes one cycle (the front), 8 cycles (the inner loop) and one cycle (the back), for 10 cycles, but needs a pipeline startup time of 4 cycles. It executes 12 times, so the total time for the middle loop is 124 cycles. Note that 64 cycles of that are pipeline startup costs ๐

The outer loop will add a couple of cycles to that on each iteration, but to a first approximation the overall time should be 2^7 cycles per outer iteration, and 5×2^31 for the whole thing. With a 1GHZ clock that is 10 seconds, remarkably consistent with the x86 time on the faster machines reported on comp.arch. The maximal width required is 4 readers, 4 writers, 5 exu and 4 flow, which is roughly a Silver.

There are a few rash assumptions in this analysis:

- Everything is in top-level cache after the first iteration. This is a determining factor for both Mill and x86 on this code.
- The loop predictor can predictor the inner and middle branch every time. This also is necessary for both Mill and x86 on this code.
- The Mill tool chain can figure out how to wrap the nested loop. This is believed true, but hasn’t been proven yet. As a fallback, the inner loop could be wrapped in a function that the middle loop calls, which works in this code but not in arbitrary code.

The most painful aspect of doing this code on a Mill is the pipeline startup costs when the iteration count is so small – half of the total time is spent starting the inner loop. Change the counts to have more vantages and the efficiency improves. I don’t know enough about x86 to know if it would scale similarly.

There are some aspects of the Mill that are relevant here but I haven’t used because Not Filed Yet.

Overall: Silver (a mid-range Mill) performance on this code is roughly equal to a high-end x86 with three times the clock rate and a much larger power consumption.

- This reply was modified 7 years, 4 months ago by Ivan Godard.

The most painful aspect of doing this code on a Mill is the pipeline startup costs when the iteration count is so small โ half of the total time is spent starting the inner loop. Change the counts to have more vantages and the efficiency improves. I donโt know enough about x86 to know if it would scale similarly.

O.K. please you make a suggestion for a larger number of comparisons (currently 144) which can better address the startup cost, if possible containing just prime numbers 2,3,5 (perhaps 7). Would one of 360, 2520 do it (I’d reduce the number of runs against the same data appropriately). Then I can run the experiment again, pairing smaller inner with larger outer element counts and vice versa (as before).

No need to rerun your tests; assuming four-element vectors the Mill inner loop will be ceil(N/4) + 5 cycles. The “5” is only significant with N < ~40. The next pressure point will occur when the vantage array no longer fits in the D$1 cache, which will happen (for a 32K cache) around N > ~8000.

- AuthorPosts

You must be logged in to reply to this topic.