Mill Computing, Inc

Keymaster

June 5, 2014 at 8:30 am

Specific to the question of ganging:

All ganged operations are bound; the two (or more) parts must be in adjacent slots, in order. The hardware slot population is defined to ensure this. Thus, for example, the FMA op (which is a gang because it needs three inputs) may be specified so that gang[0] is in slot 1 and gang[1] is in slot 2, or in any other adjacent pair, or even in all adjacent pairs if the specification is spendthrift of hardware; but it cannot have gang[0] in slot 1 and gang[1] in slot 5. This makes scheduling gangs no more difficult than scheduling non-gang ops.
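To make the adjacency rule concrete, here is a minimal Python sketch of placing a two-part gang. It is only an illustration and not the specializer's actual code: the names `adjacent_pairs` and `place_gang`, and the idea of describing a member's slot population as a list of allowed (gang[0], gang[1]) slot pairs, are assumptions made for the example.

```python
from typing import Iterable, List, Optional, Tuple

SlotPair = Tuple[int, int]


def adjacent_pairs(num_slots: int) -> List[SlotPair]:
    """Every (s, s+1) pair: the 'spendthrift' population that allows all adjacent pairs."""
    return [(s, s + 1) for s in range(num_slots - 1)]


def place_gang(free_slots: Iterable[int],
               allowed_pairs: Iterable[SlotPair]) -> Optional[SlotPair]:
    """Return the first allowed pair whose two slots are both free, else None.

    Hypothetical helper: a pair that is not adjacent and in order is rejected,
    mirroring the rule that gang[1] must sit in the slot right after gang[0].
    """
    free = set(free_slots)
    for first, second in allowed_pairs:
        if second != first + 1:
            raise ValueError(f"gang slots must be adjacent and in order: {(first, second)}")
        if first in free and second in free:
            return (first, second)
    return None
```

With a population that allows only the pair (1, 2), `place_gang({1, 2, 5}, [(1, 2)])` returns `(1, 2)`, while `place_gang({1, 5}, [(1, 2)])` returns `None`. Because the pair is fixed by the specification, placing a gang is a lookup over a short list, which is the sense in which it is no harder than placing a single op.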

As Will explains, the compiler does not know the FU and slot layout of the target member, and the same compiled code may be targeted at quite different members. Only the specializer knows the specific target, and it knows everything about that target, including such things as latency (you are right that that varies too). All of that is dealt with during scheduling, which is done entirely in the specializer. The compiler output is a dataflow dependency forest graph, structured for easy and fast scheduling, but it is not scheduled until the specializer runs.

The specializer does three main tasks: substitute emulation graphs for ops that are not native on the target; schedule; and emit binary.
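The three tasks can be sketched as a toy pipeline in Python. Everything here is hypothetical and greatly simplified: the `Op` and `Target` types, the opcode names, the latencies, and the `muladd` expansion are invented for illustration, the "binary" is just a sorted list of tuples, and the scheduler ignores slots and FU counts entirely. It only shows how the same unscheduled graph can yield different schedules on different members.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple


@dataclass(frozen=True)
class Op:
    result: str                 # name of the value this op produces
    opcode: str
    inputs: Tuple[str, ...]     # names of consumed values (op results or live-ins)


@dataclass
class Target:
    native: Set[str]                                  # opcodes implemented in hardware
    latency: Dict[str, int]                           # cycles from issue to result
    emulation: Dict[str, Callable[[Op], List[Op]]]    # expansions for non-native opcodes


def substitute(ops: List[Op], target: Target) -> List[Op]:
    """Task 1: replace each non-native op with its emulation sequence."""
    out: List[Op] = []
    for op in ops:
        if op.opcode in target.native:
            out.append(op)
        else:
            out.extend(target.emulation[op.opcode](op))
    return out


def schedule(ops: List[Op], target: Target) -> Dict[str, int]:
    """Task 2: issue each op as soon as its inputs are ready (ops are in dependency order)."""
    ready: Dict[str, int] = {}   # value name -> cycle at which it becomes available
    issue: Dict[str, int] = {}   # op result -> issue cycle
    for op in ops:
        cycle = max((ready.get(name, 0) for name in op.inputs), default=0)
        issue[op.result] = cycle
        ready[op.result] = cycle + target.latency[op.opcode]
    return issue


def emit(ops: List[Op], issue: Dict[str, int]) -> List[Tuple[int, str, str]]:
    """Task 3: stand-in for binary emission, ordered by issue cycle."""
    return sorted((issue[op.result], op.opcode, op.result) for op in ops)


def specialize(ops: List[Op], target: Target) -> List[Tuple[int, str, str]]:
    native_ops = substitute(ops, target)
    return emit(native_ops, schedule(native_ops, target))


def expand_muladd(op: Op) -> List[Op]:
    """Hypothetical emulation: split a three-input muladd into mul followed by add."""
    a, b, c = op.inputs
    tmp = op.result + ".t"
    return [Op(tmp, "mul", (a, b)), Op(op.result, "add", (tmp, c))]


# Two made-up members: one without a native muladd, one with it.
small = Target({"mul", "add"}, {"mul": 3, "add": 1}, {"muladd": expand_muladd})
big = Target({"mul", "add", "muladd"}, {"mul": 3, "add": 1, "muladd": 3}, {})

# Member-independent input: x = muladd(a, b, c); y = add(x, d)
graph = [Op("x", "muladd", ("a", "b", "c")), Op("y", "add", ("x", "d"))]
```

Under these assumed latencies, `specialize(graph, small)` gives `[(0, 'mul', 'x.t'), (3, 'add', 'x'), (4, 'add', 'y')]`, while `specialize(graph, big)` gives `[(0, 'muladd', 'x'), (3, 'add', 'y')]`: the same compiler output, with substitution and scheduling decided per member.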

Reply To: Member-independent form vs. member-specific optimizations, such as Mach 3?