Forum Replies Created
Another email from the OP, now containing source code which I have put in DropBox at https://www.dropbox.com/s/9jggxl33qu5wrny/Implementors.c. I will try to improve my Mill guesstimate in another post after looking at the code more closely.
=============================================================================
Hi Ivan,
I’m not the expert for discussing TLB thrashing or thrashing in the D$1 in a forum, and therefore take the liberty of sending the small source code file of the experiment together with my observations. All parameters (# of nodes, dword distance/page(s) between nodes, # of dword vantage points, # of repetitions of the run against the same data) can be configured by passing named parameters to the compiler. The experiment is compiled with the -m32 option and -O3 for gcc, on Vista and on Ubuntu (gcc versions are in the .c source).
There appears to be no noticeable difference between using a) 1 dword, b) 3 dwords, etc., up to k) 8191 dwords of distance between the linked nodes, so I cannot say that anything was observable regarding the page size or TLB.
The experiment was run for combinations of 144 comparisons, divided into n = 3, 4, 6, 8, 12, 18, 24, 36, 48 vantage points and the corresponding m = 48, 36, 24, 18, 12, 8, 6, 4, 3 entries in the linked list; that is, n*m = 144 was kept constant. There appears to be a rather small (< 1/3) difference, and this could be attributed to the variable cost of loop control; but YMMV.

A completely different problem space is to increase the number of comparisons, since that grows quadratically. I simply have no tool for measuring what effect was caused by what, and therefore leave it at that.

I will study your detailed response in the forum, for which thank you very much.

With kind regards,
Klaus D. Witzel
OK, some kind folks on comp.arch have reproduced your result, getting 10 second times in assembler but on a faster x86. So what would you get on a Mill?
I’ll describe what I would do if I were writing asm; I *think* the compiler would find all this, but it’s not certain. I will also assume that the target member is not width-constrained, and will see what width is necessary for best code.
First, the structure is a three-deep loop:
for each many times {for each route node { for each vantage{}}}.
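In C terms the shape is roughly the sketch below; the type and variable names are hypothetical and only illustrate the structure, so treat the OP's Implementors.c as the authoritative version.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical node/vantage types; the real definitions are in the OP's
   Implementors.c. This only shows the three-deep loop shape. */
struct node { uint32_t key; struct node *next; };

static int run_experiment(const struct node *head,
                          const uint32_t *vantage, size_t nvantage,
                          unsigned long repetitions)
{
    int hits = 0;
    for (unsigned long r = 0; r < repetitions; r++) {        /* outer: many times   */
        for (const struct node *n = head; n; n = n->next) {  /* middle: route nodes */
            for (size_t j = 0; j < nvantage; j++) {          /* inner: vantages     */
                if (vantage[j] == n->key) { hits++; break; } /* exit if found       */
            }
        }
    }
    return hits;
}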
The vantages can be vectorized on a Mill despite being a while-loop. That makes the code look like:

middle loop:
    load route node
    splat node to vector
    init inner counter
    init vantage pointer
    spill middle counter
    spill node pointer
    inner loop:
        load vantage vector
        equal vantage/node vectors
        bool reduce equal vector
        exit if found
        increment vantage pointer
        decrement inner counter
        equal inner counter/zero
        repeat inner loop if not done
    fill middle counter
    fill node pointer
    decrement middle counter
    advance node pointer
    equal middle counter/zero
    repeat middle loop if not done
The outer loop is similar but won’t impact the timing.
The inner loop takes two cycles on a Mill exclusive of the load; the limiting data path is the vector equal and the Boolean reduction, which are one-cycle ops but dependent. However, the loop would be pipelined to a single cycle, with five exu (ALU) ops and three flow (mem/control) ops. The dataflow dependency and the latency of the load don’t disappear though; they wind up in the pipeline startup time. The minimum latency for the load is three cycles (hit in D$1) plus the two cycles for the dataflow, which means five cycles of pipeline startup before the first iteration completes and one cycle per iteration thereafter. If we assume a 16-byte vector size (four elements) and 12 vantages then there are three iterations of the inner loop. Hence the whole inner loop takes 8 cycles.
The middle loop must save its loop control data over the execution of the inner loop, spilling the data to scratchpad and then filling it back. It should also keep the initialization of the inner loop around for reuse in the same way. That means 4 spills and 4 fills, plus its own decrement/test/branch and the load of the node. This also pipelines, but not to one cycle, because part must be before and part after the inner loop. However, both the front and back parts become single cycles.
The width required for the front part is four writers (the spills), two flow ops (the load and a conform to start the inner loop), and one exu (the decrement). For the back part it’s four readers (the fills), one exu (the equal), and one flow (the branch). The middle loop takes one cycle (the front), 8 cycles (the inner loop) and one cycle (the back), for 10 cycles, but needs a pipeline startup time of 4 cycles. It executes 12 times, so the total time for the middle loop is 124 cycles. Note that 64 cycles of that are pipeline startup costs 🙁
The outer loop will add a couple of cycles to that on each iteration, but to a first approximation the overall time should be 2^7 cycles per outer iteration, and 5×2^31 for the whole thing. With a 1 GHz clock that is 10 seconds, remarkably consistent with the x86 time on the faster machines reported on comp.arch. The maximal width required is 4 readers, 4 writers, 5 exu and 4 flow, which is roughly a Silver.
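A quick arithmetic check of those figures, using only the numbers quoted in this post (the 5× factor in the total is taken from the post, not re-derived from the source code):

#include <stdio.h>

int main(void)
{
    /* Figures from the post: 4-element vectors, 12 vantages, 12 route
       nodes, 1 GHz clock. */
    int inner_iters   = 12 / 4;                  /* 3 iterations                     */
    int inner_cycles  = 5 + inner_iters;         /* 5-cycle startup + 1/iter = 8     */
    int middle_cycles = 1 + inner_cycles + 1;    /* front + inner + back = 10        */
    int middle_total  = 12 * middle_cycles + 4;  /* 12 nodes + 4-cycle startup = 124 */

    /* The post rounds this to ~2^7 per outer iteration and quotes 5*2^31 total. */
    double total_cycles = 5.0 * (1UL << 31);
    printf("inner = %d, per middle iteration = %d, middle loop = %d cycles\n",
           inner_cycles, middle_cycles, middle_total);
    printf("whole run ~ %.1f seconds at 1 GHz\n", total_cycles / 1e9);
    return 0;
}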
There are a few rash assumptions in this analysis:
- Everything is in top-level cache after the first iteration. This is a determining factor for both Mill and x86 on this code.
- The loop predictor can predict the inner and middle branch every time. This also is necessary for both Mill and x86 on this code.
- The Mill tool chain can figure out how to wrap the nested loop. This is believed true, but hasn’t been proven yet. As a fallback, the inner loop could be wrapped in a function that the middle loop calls, which works in this code but not in arbitrary code.
The most painful aspect of doing this code on a Mill is the pipeline startup costs when the iteration count is so small – half of the total time is spent starting the inner loop. Change the counts to have more vantages and the efficiency improves. I don’t know enough about x86 to know if it would scale similarly.
There are some aspects of the Mill that are relevant here but that I haven’t used because they are Not Filed Yet.
Overall: Silver (a mid-range Mill) performance on this code is roughly equal to a high-end x86 with three times the clock rate and a much larger power consumption.
It’s not clear to me why it is taking so long on the Intel.
The data seems small enough that the V.j will stay in the D$1, as will the N.i; there are only 12 lines for N.i and probably only one for the V.j if I have understood the problem. Hence the inner loop has a load (of N.i) and 12 compares and a branch and some overhead, but it shouldn’t be over 64 instructions. That makes 5×2^30 operations, which with a 2.5GHz clock should be under a second.
Hence I have not understood the problem, or you are getting something odd.
First, are you repeating the same N.i for each iteration (as I assume above), or are you following a different linked list each time? If different then you will be memory bound and the timing you give is plausible.
However, the fact that you can cut the time in half doing the compares in SIMD argues that you are not memory bound. One possibility is that you are thrashing in the D$1 or the TLB. If all the N.i addresses are in the same D$1 cache way then you will be using the D$2 cache because of thrashing. Are the addresses used for the N.i commensurate with some stride that might put them all in the same cache way?
Thrashing in the TLB is a different problem. You say that the twelve are in different pages, so you will be touching at least 12 TLB entries, plus whatever gets used for stack and the V.j etc. I don’t know how many TLB entries your Intel has, but if you are using more than are there then you will thrash in the TLB replacement algorithm, which is a known performance problem. You can test this hypothesis by increasing the size N of the N.i list over several runs; if there’s a sudden knee drop in performance at some point going from N to N+1 then you have likely hit the size of the TLB.
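A minimal sketch of that kind of probe, assuming a 4 KiB page and one list node per page; this is not the OP's Implementors.c, just an illustration of sweeping N and watching for the knee:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define PAGE 4096   /* assumed page size */

int main(int argc, char **argv)
{
    size_t n = (argc > 1) ? strtoul(argv[1], NULL, 10) : 16;
    unsigned char *buf = malloc(n * PAGE);
    if (!buf) return 1;

    /* Build a linked list with one node per page. */
    for (size_t i = 0; i < n; i++)
        *(void **)(buf + i * PAGE) = (i + 1 < n) ? buf + (i + 1) * PAGE : NULL;

    /* Walk it many times; run with increasing N and look for a sudden
       slowdown going from some N to N+1. Adjust repetitions to taste. */
    clock_t t0 = clock();
    volatile void *sink = NULL;
    for (long rep = 0; rep < 100000; rep++)
        for (void **p = (void **)buf; p; p = *p)
            sink = p;
    (void)sink;
    printf("N = %zu: %.3f s\n", n, (double)(clock() - t0) / CLOCKS_PER_SEC);
    free(buf);
    return 0;
}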
Now, to your question about the Mill:
- If (as I surmise) you are reusing the same list for each of the 2**24 runs then the data will be brought in only once from memory and will be in cache thereafter. Mill deferred loads (see the Memory talk) will hide some of that first fetch, as will the Intel OOO, but that first run will be at DRAM speed for both machines.
- Both machines can thrash in the D$1 if the data are all in the same cache way on the particular machine. Way mappings vary, and most machines (including the Mill) skew the mapping so hitting a bad stride mapping would be just bad luck. The chance of it happening is roughly equal on the two machines.
- The TLB issue is a major difference between Mill and Intel architectures. Intel has to check the TLB for every N.i fetch even when the data is in cache; the Mill does not. This is covered in the Memory talk. If the slowness is due to TLB thrash then the Mill will run *much* faster than the Intel. The Mill does have to check protection in the PLB, but the PLB entries cover an entire protection domain, typically the whole program, rather than needing an entry per page, so you won’t thrash.
The speed of the comparison loop on a Mill depends on the vector element count, which in turn depends on the data size (which you don’t give) and the Mill family member involved. Assuming 4-byte data and a mid-range Mill then the vectors will be 16 or 32 bytes, so either 4 or 8 elements, giving three or two iterations for the 12 comparisons. I am assuming that the code doesn’t know there are 12 V.j, but is iterating through a V.j array; if the code statically knows that there are only 12 then you (or rather the compiler) would open-code the comparisons. With software pipelining and again a midrange Mill each iteration will be two cycles. However the iteration count is very short so startup time would be important. At a reasonable guess each set of 12 comparisons using 4-element vectors would take 9 cycles as a loop, and 7 cycles as open code; with 8-element vectors that becomes 7 and 5.
The fetch of the N.i in the outer loop can be fully pipelined, so overall the outer loop iterations should be less than 10 cycles, so say 2^30 cycles for everything. At a 1GHz clock rate (expected for the initial Mills; faster will come later), that’s 250 milliseconds.
This analysis is making rash assumptions about the nature of your problem. I don’t understand why Intel isn’t doing better; the Intel cores are much narrower than a Mill, but they have 2.5X the clock rate which should make up for a lot. TLB thrashing is a plausible explanation but needs to be confirmed.
No need to rerun your tests; assuming four-element vectors the Mill inner loop will be ceil(N/4) + 5 cycles. The “5” is only significant with N < ~40. The next pressure point will occur when the vantage array no longer fits in the D$1 cache, which will happen (for a 32K cache) around N > ~8000.
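The rule of thumb above as a one-liner, for experimenting with different N (four-element vectors assumed, as in the post):

/* Estimated Mill inner-loop cycles for N vantages with 4-element vectors:
   ceil(N/4) iterations plus ~5 cycles of pipeline startup. */
static unsigned inner_loop_cycles(unsigned n)
{
    return (n + 3) / 4 + 5;
}

For N = 12 this gives 8 cycles, matching the earlier estimate; the startup term stops mattering once N gets much past ~40.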
Does the calling code always know how many results it is expecting, or does it look at one result and then figure out whether it wants any more?
The Mill call operation explicitly states how many belt results it is expecting; the wrong number returned is a fault. The need for this is a consequence of the belt actually being a renaming mechanism rather than a shift register; the hardware needs to know how many drops will happen (i.e. how much to advance the names) before the actual drops happen if it is to be able to advance the belt without a delay cycle.
So if the caller always knows, and the callee always does what is expected, then there’s no problem if different calls to the same function expect (and receive) different counts. But if it is genuinely dynamic then I see no alternative in the present design to always returning a result count as a second result (which is no problem) and returning the excess in some agreed location.
If varying returns are frequently used then this implementation would be unfortunate and we should consider extending the call protocol to deal with it. If however they are only as common as (for example) VARARGS calls in C then the overhead of returning the (mostly unused) count is minor. In what percentage of all calls does the caller care but not know how many results it is expecting?
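At the C level, the fallback described above (a count returned as a second result, with the excess in an agreed location) might look roughly like the following sketch; the struct and all names are hypothetical.

#include <stddef.h>

/* Hypothetical convention: the callee always returns a count alongside a
   fixed in-band result; anything beyond that goes to a caller-agreed buffer. */
struct dyn_ret { size_t count; int first; };

static struct dyn_ret callee(int *excess, size_t cap)
{
    struct dyn_ret r = { 3, 10 };          /* this call happens to produce 3 results */
    if (cap >= 1) excess[0] = 20;          /* 2nd result to the agreed location      */
    if (cap >= 2) excess[1] = 30;          /* 3rd result to the agreed location      */
    return r;
}

static size_t caller(void)
{
    int excess[8];
    struct dyn_ret r = callee(excess, sizeof excess / sizeof excess[0]);
    return r.count;                        /* caller checks the count it received */
}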
- in reply to: Many core mill (GPU) #485
Standard-conforming FP addition can’t be as fast as integer (one cycle) unless the integer add is slowed way down. Normalization and rounding must be done before (and after) the integer add that is inside the FP add. Some GPUs and other special-purpose engines just do everything in FP, and an integer is just an FP number with a zero exponent; for those machines int and float trivially take the same time, but that’s not something you’d do in a general-purpose machine.
- in reply to: Many core mill (GPU) #466
Reply to bhurt #464
For graphics-like loads you would configure a Mill member that was narrow (6-8 total slots) but very high (perhaps 64-byte vector size) so you would have 16-element single-precision SIMD in each of possibly two arithmetic slots. That would give you a respectable number of shaders, but the problem is the load on the memory hierarchy. Each one of those vectors is a cache line, so to saturate the function units you are pulling four and pushing two lines every cycle. Granted, everything used for the drawing is going to live in a whopping big LLC, but the sheer bandwidth at the top is going to be hard.
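As a back-of-the-envelope illustration of that bandwidth pressure (the 1 GHz clock is an assumption, taken from the rate quoted elsewhere in this thread for initial Mills):

#include <stdio.h>

int main(void)
{
    /* 64-byte vectors are a full cache line; four lines pulled and two
       pushed per cycle, at an assumed 1 GHz clock. */
    double line_bytes = 64.0, lines_per_cycle = 4.0 + 2.0, clock_hz = 1e9;
    printf("top-of-hierarchy bandwidth ~ %.0f GB/s\n",
           line_bytes * lines_per_cycle * clock_hz / 1e9);
    return 0;
}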
There are ways to handle this – don’t use cache for the data, but configure NUMA in-core memory for example and push the problem to the software. But the result is pretty special-purpose; a chip with one of those and a handful of regular Mill cores is possible, and would do fine for less graphics-intensive work. Nevertheless, for Call of Duty go to Nvidia.
In general all ops are fully pipelined, so each slot can issue one op per cycle. Consecutive ops in a pipeline can be the same or different, it doesn’t matter. Ops do not overtake one another in the pipeline, but they do differ in latency so a shorter latency op can retire before a longer-latency op that was issued earlier.
There are a few cases of intra-pipeline hazards that code gen must be aware of and statically avoid. An example is the FMA (fused multiply-add, sometimes called MAC for multiply-accumulate). If, for example, a regular multiply takes four cycles and a regular add takes two for some operand width, it is possible to do an FMA in five, not six, and with better precision than with separate operations. Hence, as the intermediate result is coming out of the multiplier in cycle 4 it needs to be fed directly into the adder. However, the code could be issuing an add operation in cycle 4 that is also going to try to use the adder, and they can’t both use it.
As a result, while the FPU is pipelined and you can issue one FMA per cycle, or one add per cycle, with no problem, the right combination of FMA and add causes a pipeline hazard. The compiler should avoid this, and the hardware will throw a fault if it detects a hazard violation.
Some implementations of FMA have all their own hardware and don’t share the regular adder. These have no hazard, but do have more costly hardware; it’s all in the specification.
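With the example latencies above (4-cycle multiply, 2-cycle add, FMA in five with the shared adder used in its final cycle), the check the code generator has to make can be modeled very roughly as follows; the helper is purely illustrative, and the real occupancy rules come from the member specification.

#include <stdbool.h>

/* Simplified model: an FMA issued at cycle t needs the shared adder at
   cycle t + 4; a plain add issued at cycle t needs it at cycle t. */
static bool fma_add_hazard(int fma_issue, int add_issue)
{
    return add_issue == fma_issue + 4;
}

So an FMA issued in cycle 0 and an independent add issued in cycle 4 collide, which is exactly the case described above.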
- in reply to: How can I get involved? #454
LLVM is on our list.
There are issues though, which makes a Mill backend a non-trivial undertaking. Essentially the whole backend will be unique to the Mill, because LLVM assumes a register-based target. There are issues in the front and middle ends too – LLVM assumes that pointers are integers, for example, and it can only vectorize counting loops whereas the Mill does while-loops too.
Lastly, the Mill family member operation sets are built from the individual specification of each member; nobody manually assigns bits to fields, it’s all done by configuration software. Consequently the backend will be generated in large part, the same way the present asm and sim are, which requires an understanding of the generation system. There’s no instruction set manual to work from; the bit layouts change from member to member, slot to slot, and day to day as we work on it.
We would very much like to talk with people who have already done significant work in LLVM. It’s not a port for a compiler beginner though.
- in reply to: How can I get involved? #452
Reply to roliver #451 –
Thank you. You may be an undergraduate now, but it won’t be forever, no matter how it seems 🙂
We expect to have a white paper on the belt implementation fairly soon, for some value of “soon”.
Ivan
“It’s also used for bit-packing with shifts, but of course is limited to 1-bit sizes when relying on Carry.”
There are bit test/set/clear/flip ops for both dynamic and immediate bit numbers. You no longer need to use shifts.
“Marshaling and [de]serialization is one aspect that’s always a thorn in the side of execution speed, so being able to directly shift various N-bit-wide values in & out of wider numbers with under/overflow detection would be nice. Of course, if the individual steps in such an operation can be automatically executed in parallel, then the need for more powerful instructions goes away.”
A full funnel shift is complicated on any machine; we have worked on it, but are not yet satisfied with the general case of parsing a stream of varying-length items. For static fields (such as bit-fields in C) there is a merge operation that is a bitwise “pick” (take a bit from src0 or src1 based on the corresponding bit in a mask src2), which replaces (a&m)|(b& ~m). However, it’s not clear whether a compiler will recognize that idiom. We have gone back and forth on whether there should be ops for field extract (otherwise it’s two shifts) and field insert (otherwise rotate/mask/or/rotate). The problem is that there can be at most two belt inputs to an operation per slot.
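For reference, the merge/pick idiom and the field insert it replaces, written in plain C (not Mill operations); insert32 is a hypothetical helper assuming pos + len <= 32:

#include <stdint.h>

/* Bitwise "pick": take each bit from a where the mask bit is 1, else from b.
   This is the (a & m) | (b & ~m) idiom mentioned above. */
static uint32_t merge32(uint32_t a, uint32_t b, uint32_t m)
{
    return (a & m) | (b & ~m);
}

/* Field insert built on it: place val into word at bit offset pos, width len. */
static uint32_t insert32(uint32_t word, uint32_t val, unsigned pos, unsigned len)
{
    uint32_t m = ((len >= 32) ? 0xFFFFFFFFu : ((1u << len) - 1u)) << pos;
    return merge32(val << pos, word, m);
}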
“There are quite a few ABIs that use Carry as an in/out boolean flag, or side-band error return indicator. This is, of course, a result of registers being “expensive”, due to fixed register files. If belt slots are “cheaper”, then those can become normal parameters. However, there still is pressure on spilling that comes with this.”
Those ABIs assume one operation at a time execution. When there could be half-a-dozen ops trying to use that same carry flag… 🙂
“One interesting scenario that comes to mind is the SBCL compiler for Common Lisp. When calling a function in Lisp, it is unknown how many values it will return. In SBCL x86-64, a single valued return returns with carry clear. For multi-valued return, carry is set and the count of returned values is held in RCX (which is a scratch register otherwise).”
For the Mill I’d guess you would simply return the count as a second result in all cases. Is the value returned (when more than one) a list, so the flag is really saying whether the result is a primitive or a collection?
“While it’s probably a more general question than metadata, how would you address older values on the belt after a function call which returns a variable/unknown number of return values?”
In general you cannot do so; you would have to run everything you wanted to save off to scratchpad, and even that wouldn’t work because you would have no way to know how many results you got if you wanted to save them.
I confess my ignorance of Lisp implementations; how does the receiver of multiple results discover that this happened, and what would it want to do with them when it does? I suspect that the Mill would handle this at a higher level, but I need to understand the use-case rather than the conventional implementation for that case.
The FP error flags are still live until realized by a non-speculative operation (or until they fall off the belt), so it is not possible to reuse them for integer. Consider:
float f = 10.0;
int i = (int)(f / 3.0) + 0x7fffffff;
The float divide will set the inexact flag, while the add will overflow the integer. Neither is real until out of speculative context, so both must be preserved. There are many (including me) who feel that the inexact flag is bogus, should never have been introduced, and should be deprecated now, but as a commercial product we have to live with the standards we have, and in any case code could have set the other flags instead.
As for “large enough”, a major focus of the Mill design is to avoid corner cases, especially those that get fobbed off on the compiler. We have tried hard to make everything in the operation set uniform. You have identified one case where we failed: the FP flags only exist for FP-sized values, so code like:
float f = ..; uint8_t b = f < 0.0;
is a problem, because the float flags propagate through the eql operation and subsequent narrowing to a one-byte value. This is not a problem in scalar, because there is a flag-set in the operand even though it is small. But it is a problem with vector narrowing, because we couldn’t see paying for five bits of flags in every byte of a vector. Hence we gritted our teeth and defined vector narrow to be a realizing operation that cannot be speculated. 🙁
It would certainly be possible to have a carry metabit in each byte, and we considered it. After all, there is already a NaR bit in each byte. However, we could not find enough usage for such a design to justify the cost. Perhaps we overlooked some important use for carry – did you have a use-case in mind?
There are no condition codes as such; it’s kind of hard to do when you can have half-a-dozen CC-setting operations in each instruction, although some architectures try, using multiple CC registers.
To have the effect of a carry bit, if you know that no argument was a NaR and you do an addux and get a NaR out, then you must have gotten an overflow, but that’s pretty kludgy. Or you could do an adduw and test the first bit in the extended data, but that’s pretty kludgy too.
However, there is an alternative: the current operation set includes addc and subc, with two results: a wrapped value and a carry/borrow. It is not clear whether these ops will stay around though; the only uses are in arbitrary-precision libraries and for quad arithmetic on members that don’t support native quad. Neither of these seems that common, and it’s not clear that the ops (which would have two-cycle latency) are worth it.
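For context, this is what an arbitrary-precision library does with an add-with-carry primitive, written in portable C rather than with addc itself; the helper names are hypothetical.

#include <stddef.h>
#include <stdint.h>

/* Portable equivalent of an addc-style op: wrapped sum plus carry-out. */
static uint32_t addc32(uint32_t a, uint32_t b, unsigned cin, unsigned *cout)
{
    uint32_t s = a + b;
    unsigned c = (s < a);      /* carry from a + b           */
    s += cin;
    c |= (s < (uint32_t)cin);  /* carry from adding carry-in */
    *cout = c;
    return s;
}

/* Multi-word addition built on it: r = x + y, returns the final carry. */
static unsigned bignum_add(uint32_t *r, const uint32_t *x, const uint32_t *y,
                           size_t words)
{
    unsigned carry = 0;
    for (size_t i = 0; i < words; i++)
        r[i] = addc32(x[i], y[i], carry, &carry);
    return carry;
}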
We expect a general triage of the operation set later in the process, as part of an FPGA implementation.
- in reply to: Site-related issues (problems, suggestions) #413
A reply to post #408