> It seems to me that the compiler will consider most loops pipeline-able for the *abstract* Mill and thus emit its intermediate representation for a fully-software-pipelined loop — only to have the specializer potentially need to (partially, or worse fully) “de-pipeline” it, due to the target’s constraint on the number of simultaneous in-flight loads it can do.
I’m only speculating here, but I get the impression that this resource allocation problem is of the same class as the varying number of belt slots on the different cores, and that it will be solved the same way: the compiler emits code as if there were no restriction at all, i.e. as if there were an unlimited number of belt slots and load/retire stations. It doesn’t even know about the spill and fill instructions, for example, since it wouldn’t know where to place them.
The specializer knows the exact limits and schedules or inserts loads, stores, spills and fills to fit those limits at the appropriate places. Figuring out how many parallel loads and stores you can have, and how much pipelining/unrolling of loops you can do on a core, is pretty much the same as figuring out how many belt slots you have and thus how to link consumers and producers together, except that in one case you emit spills and fills at the limits and in the other you emit branches at the limits. The loads and stores in both cases are just placed according to best latency-hiding behavior.
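
To make the belt half of that analogy concrete, here is a toy sketch in Python. It is only my guess at the shape of the problem, not how the real specializer works: the `specialize` function, the `(result, operands)` op format and the printed `spill`/`fill` pseudo-ops are all made up for illustration. The "compiler" side lists ops as if the belt were unlimited, and the "specializer" side inserts spills and fills once it knows the actual belt length.

```python
from collections import deque


def specialize(ops, belt_len):
    """Toy model (hypothetical): fit belt-agnostic ops onto a fixed-length belt.

    ops is a list of (result, operands) pairs in program order; every op
    drops one result at the front of the belt. Returns the op list as
    strings, with 'spill x' / 'fill x' pseudo-ops inserted where needed.
    """
    # Index of the last op that reads each value, so dead values are never spilled.
    last_use = {}
    for i, (_, operands) in enumerate(ops):
        for v in operands:
            last_use[v] = i

    belt = deque()   # left end is the newest value, right end the oldest
    spilled = set()  # values that already have a copy off the belt
    out = []

    def drop(value, text, next_reader):
        # If the belt is full, the oldest value falls off; save it first
        # when some op at index >= next_reader still needs it.
        if len(belt) == belt_len:
            oldest = belt.pop()
            if last_use.get(oldest, -1) >= next_reader and oldest not in spilled:
                out.append(f"spill {oldest}")
                spilled.add(oldest)
        out.append(text)
        belt.appendleft(value)

    for i, (result, operands) in enumerate(ops):
        if len(set(operands)) > belt_len:
            raise ValueError("belt too short to hold this op's operands")
        # Bring back operands that have fallen off until all are on the belt.
        while any(v not in belt for v in operands):
            missing = next(v for v in operands if v not in belt)
            if missing not in spilled:
                raise ValueError(f"{missing} is used before it is produced")
            drop(missing, f"fill {missing}", i)
        drop(result, f"{result} = op({', '.join(operands)})", i + 1)
    return out


# The same "abstract" code, as the compiler would emit it for any target.
abstract = [("a", ()), ("b", ()), ("c", ()), ("d", ("a", "c"))]

for line in specialize(abstract, belt_len=2):
    print(line)
```

With `belt_len=2` the value `a` would fall off before `d` reads it, so a spill and a fill appear around it; with `belt_len=4` the same input comes back unchanged. The load/retire-station case would presumably look similar, with the loop being cut into smaller pipelined pieces instead of spills and fills being inserted.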