Thank you for your very detailed answer.
You flatter me with your offer to contribute on the LLVM side but I’m really not so deep into compilers, I’m a hardware designer with just a basic course in compilers back at unitversity.
During the last few weeks I was peeking into the LLVM optimization stages to find out what is there already and what’s missing from my POV. I have a completely different microarchitecture in mind but it shares some of the problems of producing optimal code with Mill (in fact only the loop portion, I handle open code differently).
As to your comments:
I understand that most SWP code generation steps must happen in the specializer. Here’s a bunch of comments to your answer:
Why do you need the heuristic? If you compare latency of code scheduled following the dataflow with the latency the SWP schedule gives you, you can select the best option every time. Is it to save compile time?
If I were to generate the packed schedule, I’d formulate it as a MIP (Mixed Integer Program) and let the solver do the hard part for me. IMO, using a greedy algorithm at the prototype stage is fine, but it has to be replaced by an optimized algorithm most of the cases anyway.
In your algorithm to determine spills and fills you only guarantee a feasible schedule in the degenerated case. If there are too many inter-loop dependencies it might get spiller bandwidth limited. In these cases it might me worthwhile to additionally use the caches on top of the scratchpad. Of course this introduces the well-known problems with caches.
The None-insertion matches what I figured very closely. I was thinking about using a proof-engine to check reachability before inserting the recurs statements.