I think the talk could be a bit cleaner.
Ivan used the word “schedule” in a few places to talk about very different things.
For example, the thing at the bottom of slides 16 to 21, in the “scheduling pass”, is not a schedule, nor a scheduling pass. I would call it a “prioritization pass”, and its result an “operation priority list”. That list is the input to the next pass, the creation of the tableau, which is the actual scheduling pass; its result is a schedule (i.e. the list of instructions and ops in their intended order). The intent of the prioritization pass is to schedule the ops on the longest dependency chain first, so that they get to use the FUs preferentially. This works because the other ops, the ones not on the longest chain, can be moved around (considered later during the scheduling pass, and placed earlier or later in the actual tableau) without changing the total execution time; otherwise they would be on the longest chain by definition.
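To make the two-pass split concrete, here is a minimal sketch of critical-chain prioritization followed by greedy list scheduling onto a fixed number of FUs. The function names, the toy DAG, and the single-cycle-latency assumption are mine for illustration; this is not Mill code, just the general technique being described.

```python
def prioritize(deps):
    """Prioritization pass: rank each op by the length of the longest
    dependency chain it starts (its 'height' down to the last consumer).
    deps: op -> list of ops it depends on."""
    succs = {op: [] for op in deps}
    for op, ds in deps.items():
        for d in ds:
            succs[d].append(op)
    height = {}
    def h(op):
        if op not in height:
            height[op] = 1 + max((h(s) for s in succs[op]), default=0)
        return height[op]
    for op in deps:
        h(op)
    return height

def schedule(deps, prio, num_fus=2):
    """Scheduling pass: each cycle, issue the highest-priority ready ops
    onto the available FUs (assumes 1-cycle latency for simplicity).
    The result is the 'tableau': one list of issued ops per cycle."""
    done, tableau = set(), []
    while len(done) < len(deps):
        ready = [op for op in deps if op not in done
                 and all(d in done for d in deps[op])]
        ready.sort(key=lambda op: prio[op], reverse=True)
        issued = ready[:num_fus]
        tableau.append(issued)
        done.update(issued)
    return tableau

# a -> b -> d is the longest chain; c is independent and can slide around
# without changing the total execution time.
deps = {"a": [], "b": ["a"], "c": [], "d": ["b"]}
tableau = schedule(deps, prioritize(deps), num_fus=2)
```

With this toy DAG, `a` gets the highest priority (chain length 3), so it is issued in the first cycle alongside the independent `c`; moving `c` to any later cycle would not change the 3-cycle total.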
As for the number of in-flight loads, Ivan kind of answered it, but not very cleanly. The number of retire stations can be smaller or bigger than maxLatency or the size of the belt. I.e. you can have 2 load units, 32 belt positions, a maxLatency of 32, and yet have 10 retire stations, or 2, or 50, or 64. Having fewer than 1 per load unit would make no sense, and would probably be hard to implement, so 2 would be the minimum here. As for the maximum, in theory it can be even higher than 64 (2*32), in the range of 80-100, but that would be hard to actually use, due to the way the delays are expressed. (From my benchmarks and tests on Intel hardware, a Sandy Bridge era core can keep about 70 loads in flight, even issued from a single core; but some of the load queues are shared by all cores, in the uncore and the memory controllers, so there are limits, and you can’t do more loads in parallel just by using more cores. There are also limits on how many load requests you can actually send to the DRAM/DDR chips, but the parallelism still helps with accessing the caches.)
Scratchpad. The “static” part of the scratchpad size is that each function (frame) is analyzed by the compiler, and on entry to the function a known constant value (for that function) is supplied to the CPU core. That is, on entry to the function foo, the CPU is informed that foo needs a scratchpad of size 17 bytes, i.e. it wants to be able to use up to 17 bytes of memory for its spill and fill operations (many functions will use 0 bytes of scratchpad). The byte-addressable part means that “spill b1, 3”, for example, will put the value from b1 (which can be more than one byte) at address 3 of this function’s scratchpad. The actual position in the scratchpad hardware buffer depends on the context plus the offset 3. I.e. the frame base can be 170 (determined on entry to the function by the hardware), with an upper bound of 187 (170 + 17), and offset 3 will then put the data at 173 in the SRAM. This obviously ignores the fact that there are other facilities in the Mill to also spill to DRAM, and to let the OS save and restore the scratchpad for context switching. The main idea of the static size (like the 17 here, which is determined by the compiler and made as small as possible) is that it allows cheap execution of function calls and interrupts from within other functions: a nested call to a new function will get a new scratchpad starting at position 187. If we make deeply recursive calls, we might run into the end of the SRAM buffer (let’s say 8 kB); what happens then is that it acts like a circular buffer: the values from the start of the buffer are asynchronously copied out to DRAM by a separate piece of hardware (even before we actually reach the end of the buffer), the scratchpad addresses wrap around the SRAM addresses, and the core starts using the start of the SRAM for new scratchpads.
On returns from functions, the opposite happens, and the hardware loads the data back asynchronously as we move toward the beginning of the SRAM. One can think of it as a two-level circular buffer, with in and out queues, and some logic to make these operations have essentially zero impact on the latency of spill and fill.
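A toy model of the static frame allocation and the SRAM wraparound may help, using the same numbers as above (frame base 170, static size 17, offset 3 landing at SRAM address 173). The class and method names are mine, and the asynchronous copy-out-to-DRAM machinery is deliberately elided; this only sketches the addressing scheme, not the actual Mill hardware.

```python
class Scratchpad:
    """Toy scratchpad: per-call frames of compiler-determined static size,
    allocated bump-style in a ring of SRAM. The background hardware that
    asynchronously spills/fills the ring to DRAM is NOT modeled here."""

    def __init__(self, sram_bytes=8192):
        self.sram = bytearray(sram_bytes)
        self.cursor = 0          # where the next frame will start
        self.frames = []         # stack of (base, size), one per active call

    def enter_function(self, size):
        """On call: the compiler-supplied static size reserves a new frame."""
        base = self.cursor
        self.frames.append((base, size))
        self.cursor = (self.cursor + size) % len(self.sram)  # wrap like a ring
        return base

    def leave_function(self):
        """On return: pop the frame; its SRAM range becomes reusable."""
        base, _ = self.frames.pop()
        self.cursor = base

    def spill(self, offset, data):
        """Write bytes at a frame-relative, byte-addressable offset."""
        base, size = self.frames[-1]
        assert offset + len(data) <= size, "offset outside static frame size"
        for i, b in enumerate(data):
            self.sram[(base + offset + i) % len(self.sram)] = b

    def fill(self, offset, length):
        """Read bytes back from a frame-relative offset."""
        base, _ = self.frames[-1]
        return bytes(self.sram[(base + offset + i) % len(self.sram)]
                     for i in range(length))

sp = Scratchpad()
sp.cursor = 170              # pretend earlier calls left the cursor at 170
sp.enter_function(17)        # foo's static scratchpad size
sp.spill(3, b"\x2a")         # "spill b1, 3": lands at SRAM address 173
```

A nested call made while foo is active would get its frame base at 187 (170 + 17), exactly as in the description above; deep recursion just keeps bumping the cursor around the ring.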
For the question “are there optimizations on the tableau done by the specializer?”: no, and there is no need for that. The specializer, scheduling for a target machine, produces nearly optimal code, and in a lot of cases an optimal schedule; so by construction the tableau is already optimized. As for other optimizations, there is no need to do them. The specializer’s job is not to optimize the code (that is the job of the compiler’s middle end), but to generate code for the specific target. I.e. if the middle end creates some redundant operations, or operations whose results are ultimately ignored and not used by anything, the specializer is not going to care, and will still schedule them, with all those operations still present in the final code. And this is fine, because it makes the specializer faster, and whatever is fed into the specializer will in practice have been produced by a compiler that already got rid of such stuff.
As for the other questions and answers: these are obvious to Ivan, because he is immersed in this every day, so it can sometimes be hard for him to see that other people struggle with the terminology, or even with the most obvious answers. Usually giving more context is the better option, rather than simply giving a yes/no answer.