The Compiler

ivan

Generally close to the consumer is best because it reduces belt pressure; you comput a value early and it may have fallen off the belt before it gets used, forcing spill/fill ops to retain it. The exception is loads, which should be issued as early as possible while still retiring as late as possible; maximizing load deferral gives the memory hierarchy the most time to come up with a value.

Because the inputs are themselves computed as late as possible, usually you don't have the inputs waiting around on the belt either. There are exceptions: when the inputs are call arguments, or when an input is used more than once, with an early consumer and a late consumer. However, all of these cases are examples of a general pattern: inputs must be produced early for outside reasons, and outputs must be consumed late also for outside reasons, so there is a range between production of input and consumption of output in which the op could be placed. It turns out that where you place the op makes no difference in this pattern; if you place it early then its outputs may have to spilled, but if you place it late then its inputs may have to be spilled; the impact on belt pressure is roughly the same, assuming that the number of inputs and outputs are roughly the same. Because we are placing as late as possible in general, we place this pattern late too, just for simplicity even though it could have been placed earlier without affecting overall latency.

Incidentally, operation placement is equivalent to the NP bin-packing problem, so it's a game of heuristics because optimal placement is impractical.

Power is a different issue that our hardware team will address.

akahlich

If so, is there an advantage to issuing instructions late as opposed to issuing them as soon as their arguments are available? In this particular case one could, at least naïvely, think that larger part of the logic could have been clock gated on cycles 3 and 2, possibly leading to power savings if ‘sub’ and ‘mul’ would have been issued together.

The TL;DR answer: there is no power difference as long as the number of belt value spills/fills remains the same.

As far as power savings due to clock gating is concerned, there are 2 major factors:

The power consumed in performing the operation itself (the add, sub, mul, etc.)

The power consumed in maintaining operands on the belt

The power consumed by the operation itself is independent of when the operation is performed. It is what it is, the same every time for a particular set of input values. When an operation is not being performed by a particular functional unit its clock is gated off. When the clock is gated off, the functional unit power consumption is only that due to static leakage current.

The power consumed in maintaining an operand on the belt is nearly constant and depends greatly upon the number of new result values arriving each clock times the number of potential destinations for each result.

The conclusion is that the biggest factor in reducing power is the number of belt spills/fills that must performed. The lower the number of spills/fills, the lower the power.

Witold Baryluk

I think the talk could be a bit cleaner.

Ivan used word "schedule" in few places to talk about about very different things.

For example, the schedule at the bottom of slides 16 to 21, in "scheduling pass". It is not a schedule, nor a scheduling pass. I would call it "prioritization pass", and result a "operation priority list". This is an input to a next pass, creation of a tablau, that is actually a scheduling pass and the result is a schedule (aka the list of instructions and ops in their intended order). The intent of the prioritization pass, is to schedule ops within highest length chain first, so they are going to use FUs preferentially. This works, because the other ops, that are not in the highest length chain, can be moved around (scheduled later, in the sense of being considered later during scheduling pass, they can be schedule later or sooner in the actual tablau), without changing the total execution time, otherwise they would be in the longest chain by definition.

As for the number of in-flight loads, Ivan kind of answered it, but not extremely clean. The number of retire stations can be smaller or bigger than maxLatency or the size of the belt. I.e. you can have 2 load units, 32 belt positions, maxLatency 32, and yet have 10 retire station, or 2, or 50, or 64. Having less than 1 per load unit would be extremely stupid, and probably actually hard to implement, so 2 would be minimal. As for the maximum, in theory it can be even higher than 64 (2*32), in the range of 80-100, but it would be hard to actually use due to the way the delays are expressed. (From my benchmarks and tests on some Intel hardware Sandy Bridge era core has ability to do about 70 loads in-flight, even issued by a single core, but some of load queues are shared by all cores, in the uncore and memory controllers, so there are limits, and you can't do more loads in parallel just by using more cores. also there are limits in how many loads requests you can actually send to DRAM/DDR chips, but it helps with accessing caches).

Scratchpad. The "static" part of the size of the scratchpad, is that each function (frame) is analyzed by compiler, and on the entry to the function, the known constant value (for this function) is supplied to the CPU core. That is on the entry to the function foo, CPU is informed that foot needs a scratchpad of size 17 bytes, that is it wants to be able to use up to 17 bytes of memory for its spill and fill operations (many functions will use 0 bytes of scratchpad). The byte-addressable part means, that "spill b1, 3", for example will put value from b1 (which can be more than one byte) into address 3 of this function scratchpad. The actual position IN the scratchpad hardware buffer, will depend on the context, and the offset 3. I.e. it can be 170 (determined on the entry to the function by the hardware), and upper bound of 187 (170 + 17), and offset 3 will put data in 173 in SRAM. This obviously ignores the fact that there are other facilities in the Mill, to also spill to DRAM, and to make OS do store and restore of scratchpad for context switching. The main idea for the static size, (like 17 here, that will be determined by compiler, and made as small as possible), is that it allows for cheap execution of function calls and interrupts from within another functions, i.e. nested call of a new function, will get a new scratchpad starting at position 187. If we do deeply recursive call, we might end up going into the end of the SRAM buffer, (lets say 8kB), and what will happen it will be acted like a circular buffer, the values from the start of the circular buffer, will be asynchronously be copied to DRAM by separate piece of hardware (even before we actually reach end of the buffer), and scratchpad addresses will wrap around SRAM addresses, and core will start using the start of this SRAM as a new scratchpads. On returns back from functions, the opposite will be happening, and hardware will be loading back data asynchronously as we moved toward the beginning of the SRAM. One can think of it as a two level circular buffer, with in and out queues, and some logic to make this operations basically have zero impact on latency of spill and fill.

For the question "are there optimizations on tablau done by specializer?". Well, no, and there is no need for that. The specializer and the scheduling to a target machine, produces nearly optimal code, and optimal schedule in a lot of cases. So by construction the tablau is already optimized. As for the other optimizations, there is no need to do them. Specializers job is not to optimize the code (that is a job of the compiler and middle end), but to generate the code for specific operations. I.e. if middle end creates some redundant operations, or operations that are ultimately ignored and not used by anything, specializer is not going to care, and will still schedule it with all operations in the end code still present. And this is fine too, because this makes specializer faster, and whatever is feed into specializer will be produced by compiler that already get rid of such stuff in practice.

As for the other questions and answers, it is obvious for Ivan, because he sits in this every day, so sometimes it might be hard for him to know that other people have problems with terminology or even most obvious answers. Usually giving more context, is the best options, not just simply giving yes/no answer.

ivan

Please accept my apologies, folks. Somehow a recent update of the Forum software removed me from the notify list, so I was unaware of the discussion. I’ll be working through the backlog promptly.
Ivan

ivan

Scheduler:
As the specializer is now implemented, the prioritization and the scheduling are not split into two passes. Instead the prioritization is done on the fly during scheduling.

Scheduling itself is fairly straightforward; dependencies and latency determines a target cycle, and the op is placed therein if a supporting slot is free. If not then we search for one, up or down the tableau depending on the op. Prioritization is a more complicated set of heuristics, that use both a priority among different categories of ops without consideration of dataflow, and among ops in the same category based on dataflow and belt size considerations. For more detail you'd need to work through the code, and the heuristics are likely to change as we gain experience.

Retire stations:
As with other aspects of family member configuration, the number of stations will be set based of extensive tuning and measurement for workloads characteristic of the intended market. As a rough starting point, a station count equal to the belt length seems to work well.

Scratchpad:
That's an excellent description of scratchpad behavior - would you be interested in writing a white paper? The only change needed: the OS doesn't spill the scratchpad on task switch. We just note where in the circular buffer the task switch occurred, so the buffer can be being unloaded into one task's save space while it is simultaneously being loaded or used by a different task from its own save space. The only thing that gets changed at task switch is the address that the hardware uses for load/unload.

Questions and responses:
Yes, you are right: it is very hard to set aside one's own familiarity and answer (or prepare a talk) that addresses the level of knowledge and experience of the audience. Especially so when the audience is quite varied and it's necessary to neither bore those familiar with the subject nor lose those not so familiar. We do our best.

« Previous Page