Belt saturation in short belts
Spill arises for two reasons: long-lived values, where belt churn between production and consumption will force an operand off the end of the belt, and belt saturation, in which there are more live operands, even if short-lived, than will fit on the belt. Long-lived operands are not really much of a problem regardless of belt length: you produce the value, you spill it, and later you fill and consume it.
Belt saturation is a much larger problem. If you reduce the number of live operands by spilling some, you will eventually have to fill them again, and the fills themselves also drop to the belt. The fill drops in turn increase the live count at the point of fill, which can force something else out. If the lifetimes are short (so spill and fill would be close together) the situation can degenerate into a spillstorm, in which you get only one op through per cycle and are executing out of scratchpad the way an accumulator machine would. There’s an analogous situation in genreg machines, which is usually avoided by requiring a register population at least 4x the maximal issue rate. “Usually” is the operative word here.
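To make the saturation arithmetic concrete, here is a minimal sketch in plain C (not Mill code; BELT_LEN and LIVE are made-up parameters) that models the belt as a fixed-length FIFO and counts how many still-live operands fall off the end, and so would have to be spilled, when more values are live than the belt can hold:

/* Minimal sketch, plain C rather than Mill code. */
#include <stdio.h>

#define BELT_LEN 8      /* belt length of the target in question */
#define LIVE     9      /* simultaneously live operands */

int main(void) {
    int belt[BELT_LEN]; /* belt[0] is the newest drop, belt[BELT_LEN-1] the oldest */
    int held = 0, spills = 0;

    for (int v = 0; v < LIVE; v++) {
        if (held == BELT_LEN)
            spills++;                        /* oldest live value falls off: spill it */
        for (int i = (held < BELT_LEN ? held : BELT_LEN - 1); i > 0; i--)
            belt[i] = belt[i - 1];           /* everything else ages one position */
        belt[0] = v;                         /* the new result drops at the front */
        if (held < BELT_LEN)
            held++;
    }
    printf("%d live values on a belt of %d: at least %d spill(s)\n",
           LIVE, BELT_LEN, spills);
    return 0;
}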
Here’s a test of our handling of belt saturation. In the target configuration the belt is eight long and the maximal issue rate is two: one exu op (an add in this case) and one flow op (a call in this case). There’s enough scratchpad bandwidth that that’s not an issue. The test case:
int foo();
int bar(int);

int _start (int argc, char** argv) {
    int i1 = foo();
    int i2 = foo();
    int i3 = foo();
    int i4 = foo();
    int i5 = foo();
    int i6 = foo();
    int i7 = foo();
    int i8 = foo();
    int i9 = foo();
    return bar(i1)+bar(i2)+bar(i3)+bar(i4)+bar(i5)+
           bar(i6)+bar(i7)+bar(i8)+bar(i9);
}
Because the function bodies are not visible to the compiler, the calls must be strictly ordered; you (and LLVM) cannot hoist bar calls over foo calls. Consequently, at the first bar there are nine foo results on an eight-long belt, and we have belt saturation. You can use this test model on any belt length (or genreg count) by adjusting the number of foo calls; a throwaway generator for that is sketched below. Try it on your own machines to see how good your register allocator is.
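The generator is my own convenience sketch, not part of the Mill tool chain; it just emits the test above for whatever call count you pass as argv[1]:

/* Throwaway generator: emit the saturation test for any number of foo() calls
   so the experiment can be re-run against a different belt length or register
   count. The call count is argv[1]; default is nine, as above. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int n = (argc > 1) ? atoi(argv[1]) : 9;
    printf("int foo(); int bar(int);\n");
    printf("int _start(int argc, char** argv) {\n");
    for (int i = 1; i <= n; i++)
        printf("  int i%d = foo();\n", i);
    printf("  return bar(i1)");
    for (int i = 2; i <= n; i++)
        printf(" + bar(i%d)", i);
    printf(";\n}\n");
    return 0;
}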
The above code gives us this genAsm out of LLVM into the specializer:
define external function @_start w (in w, in d) locals($1, %28, &0, ^27) {
label $0:
    %2  = call w(@foo, ... ) ^0;
    %3  = call w(@foo, ... ) ^1;
    %4  = call w(@foo, ... ) ^2;
    %5  = call w(@foo, ... ) ^3;
    %6  = call w(@foo, ... ) ^4;
    %7  = call w(@foo, ... ) ^5;
    %8  = call w(@foo, ... ) ^6;
    %9  = call w(@foo, ... ) ^7;
    %10 = call w(@foo, ... ) ^8;
    %11 = call w(@bar, in w %2) ^9;
    %12 = call w(@bar, in w %3) ^10;
    %13 = add(%12, %11) ^11;
    %14 = call w(@bar, in w %4) ^12;
    %15 = add(%13, %14) ^13;
    %16 = call w(@bar, in w %5) ^14;
    %17 = add(%15, %16) ^15;
    %18 = call w(@bar, in w %6) ^16;
    %19 = add(%17, %18) ^17;
    %20 = call w(@bar, in w %7) ^18;
    %21 = add(%19, %20) ^19;
    %22 = call w(@bar, in w %8) ^20;
    %23 = add(%21, %22) ^21;
    %24 = call w(@bar, in w %9) ^22;
    %25 = add(%23, %24) ^23;
    %26 = call w(@bar, in w %10) ^24;
    %27 = add(%25, %26) ^25;
    retn(%27) ^26;
}
which is a commonsense SSA version of the source.
The specializer is the first point in the tool chain that knows the target belt length. Because the belt has saturated there will necessarily be at least some spills. There is also a second effect: the still-live foo results age toward the end of the belt while dead operands occupy positions in front of them, so the belt becomes sparse, and later add results would push those early live values off the end. Sparsity can be dealt with by the “rescue” op, which compacts the belt by moving all live values to the front. Note that there is no actual data movement in rescue; it’s all just renaming.
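To illustrate what “just renaming” means, here is a minimal sketch in plain C (my own model, not the hardware): the belt is a small map from belt position to an entry in a physical store, and rescue rewrites the map so that live operands occupy the front positions while the operand data itself is never copied.

/* Sketch of rescue as pure renaming; plain C model, not the hardware. */
#include <stdio.h>
#include <stdbool.h>

#define BELT_LEN 8

typedef struct {
    int  phys[BELT_LEN];   /* phys[pos]: index of this operand in the physical store */
    bool live[BELT_LEN];   /* is the operand at this position still needed? */
    int  count;            /* occupied positions */
} Belt;

static void rescue(Belt *b) {
    int dst = 0;
    for (int src = 0; src < b->count; src++)
        if (b->live[src]) {
            b->phys[dst] = b->phys[src];   /* copy a small index, i.e. a rename */
            b->live[dst] = true;
            dst++;
        }
    b->count = dst;                        /* belt is dense again */
}

int main(void) {
    Belt b = { .phys  = { 10, 11, 12, 13, 14, 15 },
               .live  = { true, false, true, false, true, true },
               .count = 6 };
    rescue(&b);
    for (int i = 0; i < b.count; i++)
        printf("b%d -> physical entry %d\n", i, b.phys[i]);
    return 0;
}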
Here is the resulting conAsm machine code out of the specializer for this target:
F("_start") %0 %1; call1("foo") %2; call1("foo") %3; call1("foo") %4; call1("foo") %5; call1("foo") %6, scratchf(s[7]), spill(s[4], b1 %5), spill(s[0], b2 %4); call1("foo") %7; call1("foo") %8; call1("foo") %9; call1("foo") %10; call1("bar", b8 %2) %11; call1("bar", b8 %3) %12; add(b1 %12, b2 %11) %13, call1("bar", b1 %30) %14, fill(s[0]) %30; add(b2 %13, b1 %14) %15, call1("bar", b1 %31) %16, fill(s[4]) %31; rescue(b12 %6, b11 %7, b10 %8, b9 %9, b8 %10, b1 %15, b0 %16) %32 %33 %34 %35 %36 %37 %38; add(b1 %37, b0 %38) %17, call1("bar", b7 %32) %18; add(b1 %17, b0 %18) %19, call1("bar", b8 %33) %20; rescue(b1 %19, b0 %20, b8 %34, b7 %35, b6 %36) %39 %40 %41 %42 %43; add(b4 %39, b3 %40) %21, call1("bar", b3 %41) %22; add(b1 %21, b0 %22) %23, call1("bar", b4 %42) %24; add(b1 %23, b0 %24) %25, call1("bar", b5 %43) %26; add(b1 %25, b0 %26) %27, retn(b0 %27);
There are two spills (and two fills) and two rescues. The code is flow-side-bound because of the calls, the return, and the rescues; there’s a lot of exu capacity left unused after the adds. We also don’t strictly need the two scratchpad ports, although both are used (the two spills land in the same instruction); if the config had only one port the specializer would produce a schedule of the same length but put the spill ops in different instructions.
Eight bytes of scratch are used. The code occupies 21 instructions, the minimum possible because there are 21 flow ops and only one flow slot in the hardware. There are 33 operations total. The program runs in 21 cycles exclusive of the bodies of foo and bar.
For contrast, here’s the same genAsm scheduled for Silver with a belt of 16 and more slots:
F("_start") %0 %1; call1("foo") %2, call1("foo") %3; call1("foo") %4, call1("foo") %5, call1("foo") %6; call1("foo") %7, call1("foo") %8, call1("foo") %9; call1("foo") %10, call1("bar", b8 %2) %11, call1("bar", b8 %3) %12; add(b0 %12, b1 %11) %13, call1("bar", b9 %4) %14; add(b1 %13, b0 %14) %15, call1("bar", b10 %5) %16; add(b1 %15, b0 %16) %17, call1("bar", b11 %6) %18; add(b1 %17, b0 %18) %19, call1("bar", b12 %7) %20; add(b1 %19, b0 %20) %21, call1("bar", b13 %8) %22; add(b1 %21, b0 %22) %23, call1("bar", b14 %9) %24; add(b1 %23, b0 %24) %25, call1("bar", b15 %10) %26; add(b1 %25, b0 %26) %27, retn(b0 %27);
There are no spills or rescues, and the program schedules in 12 instructions and runs in 12 cycles exclusive of the foo/bar bodies. The difference between 21 and 12 cycles is why the Mill is a configurable family.
It seems to me that these calls would likely get reordered using link-time optimization. Even then, unless I’m missing something, and assuming they have to be in that order because of side effects, it feels like a very unusual case. Do you feel there are more cases, less obvious to the eye, where the lower-end belts suffer from this saturation?
It is good to see some code generation getting done though. When can we expect to be able to play with the toolchain?

Reordering to mix foo() and bar() calls would be a no-brainer in a strictly functional programming language. In that case the code could be reworked to
return bar(foo()) + ...
But in C a function may have side effects, and the order in which calls are made may affect the results, so the compiler must issue the calls in the order given in the source code to ensure correctness.

Even in C, when LLVM has the bodies visible it marks functions as pure or read-only as applicable, which lets the specializer reorder them. Of course, separate compilation hoses that, unless the specializer is run after an LTO pass. We have not yet integrated LTO into the tool chain; most reports say it has little value except for inlining, which we do in the specializer anyway.
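As an aside, the source-level counterpart of that marking is the usual GCC/Clang attribute extension; the snippet below is only an illustration with a hypothetical foo (the foo in the saturation test above cannot be marked this way, which is the whole point of the test):

/* Illustration only, using the GCC/Clang attribute extension. If foo were
   known to touch no memory, its calls could be reordered (or even merged,
   since both calls must return the same value). bar stays opaque and may
   have side effects, so its calls remain in source order. */
int foo(void) __attribute__((const));   /* reads and writes no memory */
int bar(int);                           /* opaque; may have side effects */

int mixed(void) {
    int i1 = foo();          /* may be hoisted, sunk, or merged with the next call */
    int i2 = foo();
    return bar(i1) + bar(i2);
}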
This is a test case to confirm that the specializer can handle belt overflow. It is deliberately simplified to the minimum necessary to expose the event of interest. Belt overflow also occurs in real codes, but real codes have so much else going on that it’s hard even for us to understand the behavior.
When can you play with it? Waiting is!
Very interesting, Ivan. Could you elaborate on how it is that you can issue multiple function calls in the same instruction? When you return from one function, does it jump straight into the next function?
The call op is in the flow block on the flow side of the decoders. That block parses in D0, so the presence of calls is known at the start of the D1 cycle, and there is all of D1 and D2 to get organized. When there are cascaded calls the hardware connects the return of the first to the entry of the second and so on; there is no cycle in between. You can think of it as hardware tail recursion removal. It’s not hard on a Mill because there cannot be any hazard between the two calls; on a genreg machine you’d have to check whether the first function did something nasty to the caller’s registers or stack, and even without that you still would have inter-call rename, something I don’t want to think about.
Art claims that he can cascade the trailing call with a branch or return too, so long as the instruction does not have any pick ops. I’m not sure I believe him.
Here’s an example of the kind of code you get for cascaded calls. Given source:
void foo1();
void foo2();
void foo3();

void bar(int i) {
    if (i == 0)
        foo1();
    else
        foo2();
    foo3();
}
on Silver (with three branch units) the code is:
F("bar") %0; eqlb(b0 %0, 0) %1, calltr0(b0 %1, "foo1"); callfl0(b0 %1, "foo2"), call0("foo3"), retn();