The load format knows that it is a Mill and has done operation selection under the assumption that the full abstract Mill operation set is available. The specializer replaces missing ops with emulation subgraphs, often calls but sometimes in-line, that do the same job. The canonical example is floating point, which some Mill family members lack completely.
The load format also knows the widths of each operation. Even though width is not in in the final encoding (it’s represented in the metadata), the compiler that generates load format knows the widths of the arguments, so load format distinguishes add-byte from add-half, and so on. The specializer also substitutes emulation for missing widths, notably on members that lack hardware quad precision.
Load format itself is a serialized forest of data- and control-flow graphs. Emulation substitution is O(N) in program size, and is done on the fly as the graphs are de-serialized.
The major task of the specializer, and the time critical one, is operation and instruction scheduling. We use the classic VLIW time-reversed deterministic scheduler, which is O(N) for feasible schedules. Not all schedules are feasible because we can run out of belt and must spill to scratchpad. In such a case, the specializer modifies the graph by inserting spill-fill ops and reschedules, repeating if necessary. A feasible schedule is guaranteed eventually, because the algorithm reduces to sequential code which is feasible by hardware definition. The time depends on the number of iterations before feasibility, which is at most N giving an O(N^2) worst case. However, studies show O(N^1.1) is typical, i.e. few EBBs need spill, and more than one iteration is essentially never seen.
That’s a big-O analysis, and the constants matter too in the actual speed of the specializer. However, the constants depend on the horsepower of the Mill doing the specialization, and we don’t have sims yet for anything that complex. We don’t expect speed to be an issue though; disk latency is more likely to dominate.
There are a few cases in which best member-dependent code is not just a matter of schedule. An example is the decision of whether a given short-count loop is worth vectorizing, which decision depends on the vector size of the member. The compiler normally does the vectorization, and the only part the specializer plays is to supply a few constants, such as the amount to advance addresses by with each iteration. However, in some cases it is better not to vectorize at all, and the vector and scalar loops are quite different.
In such cases the load format gives alternate code choices and a predicate that the specializer evaluates. The chosen code is then integrated with the rest of the program in the same way that emulation substitution is done. We want to compiler to be sparing in its use of this facility though, to keep the load-format file from blowing up too big.
I expect that there will be a Specializer talk, but not until the new LLVM-based compiler is at least up enough for public view.