Mill Computing, Inc. Forums › The Mill › Tools › Compilers › Member-independent form vs. member-specific optimizations, such as Mach 3?

  • LarryP
    Participant
    Post count: 78
    #1114

    If I understand correctly, Mill Computing’s compiler strategy is to compile to a member-independent intermediate form, which is then converted to a particular Mill target’s binary by the specializer. How does this approach accommodate important model-specific optimizations, such as ganging? Does the specializer perform such optimizations?

    In the execution talk, Ivan describes several methods for improving IPC, such as ganging and the (almost free, via phasing) pick operation. However, ganging depends not only on the target Mill’s functional unit population, but also on the slot ordering of those functional units, both of which can vary significantly across the Mill family. Similarly, the number of simultaneous adds or picks (for example) depends on how many of the corresponding functional units the target has. To achieve “Mach 3” in anything other than hand-written Mill assembly code, it seems to me that some tool needs to know both the functional unit mix and the slot order on the target machine (since ganging is restricted to adjacent slots). If the compiler generates a member-independent intermediate form, is it left to the specializer to identify and make use of optimizations that depend on functional unit population and ordering? Or are such member-dependent optimizations handled in some other manner?

    Similarly, if the latency of some operations is model-specific, as might happen from trading off time against chip area (and/or power) in implementing multiplication, then operation scheduling must also be target-specific. This suggests that the specializer needs to do considerably more than simply translate the compiler’s output into a specific target’s binary encoding. It appears that the specializer has to (re)schedule operations for the target’s specific functional unit population, slot order, and operation latencies. Is this the case, or have I missed something?
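
    A toy sketch of what I mean (my own invented latencies and a trivial ASAP scheduler, nothing Mill-specific): the same dependency chain lands on different issue cycles once the member’s multiply latency is known.

        # Toy ASAP scheduling of a small dependency chain; all numbers invented.
        def asap_schedule(ops, deps, latency):
            """Earliest issue cycle for each op, given dep edges and op latencies."""
            kind_of = dict(ops)
            issue = {}
            for op, kind in ops:                      # ops listed in dependency order
                ready = 0
                for pred in deps.get(op, []):
                    ready = max(ready, issue[pred] + latency[kind_of[pred]])
                issue[op] = ready
            return issue

        ops = [("a", "load"), ("b", "mul"), ("c", "add")]   # c depends on b, b on a
        deps = {"b": ["a"], "c": ["b"]}

        # Two hypothetical members: a fast 3-cycle multiplier vs. a smaller 5-cycle one.
        print(asap_schedule(ops, deps, {"load": 3, "mul": 3, "add": 1}))  # {'a': 0, 'b': 3, 'c': 6}
        print(asap_schedule(ops, deps, {"load": 3, "mul": 5, "add": 1}))  # {'a': 0, 'b': 3, 'c': 8}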

  • Will_Edwards
    Moderator
    Post count: 98

    “If the compiler generates a member-independent intermediate form, is it left to the specializer to identify and make use of optimizations that depend on functional unit population and ordering?”

    Yes, this is how responsibilities are split between compiler and specializer.

    The compiler targets an abstract Mill – infinite belt, etc. – and serializes its AST to a file for distribution. It’s not actually a single AST; it’s a forest of candidate control-flow graphs, so the compiler can provide alternatives for the specializer to choose between.

    The scheduling of operations, scratchpad allocation, and the like is performed by the specializer, which knows the parameters of the target Mill.

    Whilst non-trivial to think about, the specializer’s scheduling isn’t really comparable to the heavyweight optimisations that a compiler makes. The specializer has to be very fast, since it is used for JIT as well as AOT generation.
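
    To give a rough sense of why that pass can be cheap, here is a deliberately simplified sketch (the slot population, op kinds and greedy walk are all invented for illustration, and it ignores dependencies, latencies and everything else a real specializer handles):

        # Greedy, single-pass slot assignment against a made-up target description.
        target_slots = ["alu", "alu", "mul", "flow"]       # invented slot population

        def assign(ops_in_dep_order):
            """Pack ops into successive instructions; per-op work is a small scan."""
            schedule, cycle = [], [None] * len(target_slots)
            for op, kind in ops_in_dep_order:
                slot = next((s for s, fu in enumerate(target_slots)
                             if fu == kind and cycle[s] is None), None)
                if slot is None:                           # no free matching slot: new cycle
                    schedule.append(cycle)
                    cycle = [None] * len(target_slots)
                    slot = target_slots.index(kind)
                cycle[slot] = op
            schedule.append(cycle)
            return schedule

        print(assign([("x", "alu"), ("y", "alu"), ("z", "alu"), ("w", "mul")]))
        # [['x', 'y', None, None], ['z', None, 'w', None]]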

    I dug up this excellent post by Ivan that covers this in more detail: http://millcomputing.com/topic/introduction-to-the-mill-cpu-programming-model-2/#post-889

  • Ivan Godard
    Keymaster
    Post count: 689

    Specific to the question of ganging:

    All ganged operations are bound; the two (or more) parts must be in adjacent slots, in order. The hardware slot population is defined to ensure this. Thus, for example, the FMA op (which is a gang because it needs three inputs) may be specified so that gang[0] is in slot 1 and gang[1] is in slot 2, or in any other adjacent pair, or even in all adjacent pairs if the specification is spendthrift of hardware, but it cannot have gang[0] in slot 1 and gang[1] in slot 5. This makes scheduling gangs no more difficult than scheduling non-gang ops.
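
    A sketch of what that adjacency rule means for the scheduler (the slot capability sets below are invented; the point is only that the legality check is “look at slot s and slot s+1”):

        # Find the adjacent slot pairs that can host a two-part gang such as FMA.
        slot_caps = [
            {"add", "fma0"},           # slot 0 (capabilities invented)
            {"add", "mul", "fma1"},    # slot 1
            {"mul"},                   # slot 2
            {"add"},                   # slot 3
        ]

        def gang_pairs(part0, part1):
            """Adjacent (s, s+1) pairs where slot s can take part0 and slot s+1 part1."""
            return [(s, s + 1) for s in range(len(slot_caps) - 1)
                    if part0 in slot_caps[s] and part1 in slot_caps[s + 1]]

        print(gang_pairs("fma0", "fma1"))   # [(0, 1)] -- the only legal placement here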

    As Will explains, the compiler does not know the FU and slot layout of the target member, and the same compiled code may be targeted at quite different members. Only the specializer knows the specific target, and it knows everything about that target, including such things as latency (you are right that that varies too). All of that is dealt with during scheduling, which is done entirely in the specializer. The compiler output is a dataflow dependency forest, structured for easy and fast scheduling, but nothing is scheduled until the specializer runs.

    The specializer performs three main tasks: substituting emulation graphs for ops that are not native on the target, scheduling, and emitting binary.
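
    As a hypothetical end-to-end skeleton of those three tasks (toy data structures and op names; nothing here resembles the real encodings):

        # Invented target capability and emulation table, just to show the pipeline shape.
        NATIVE_OPS = {"add", "mul", "shift"}
        EMULATION = {"div": ["shift", "mul", "add"]}           # made-up expansion

        def substitute(ops):
            """Task 1: replace non-native ops with their emulation sequences."""
            out = []
            for op in ops:
                out.extend(EMULATION[op] if op not in NATIVE_OPS else [op])
            return out

        def schedule(ops, width=2):
            """Task 2: pack ops into instructions of `width` slots (toy scheduler)."""
            return [ops[i:i + width] for i in range(0, len(ops), width)]

        def encode(instructions):
            """Task 3: emit a fake textual 'binary' so the example runs end to end."""
            return ["; ".join(inst) for inst in instructions]

        print(encode(schedule(substitute(["add", "div", "mul"]))))
        # ['add; shift', 'mul; add', 'mul']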
