Whilst conform is an extra instruction, it’s arguably doing what a conventional register machine has to do anyway, except that there the cost is spread out across the instructions of the loop. Ideally, such as when there is only a single place in the code that can branch to a block, you don’t need it, as you can set up the receiving block based on the incoming belt layout. But when you do need it, you are just bringing back the overhead that a conventional register machine has in the encoding of the output registers of its instructions.
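To make that concrete, here is a toy Python model of a belt with a conform-style reorder. It is only a sketch of the idea: the `Belt` class, its method names and the belt length are made up for illustration and are not the real Mill ISA or encoding. Two predecessor blocks arrive with different belt layouts, and each reorders its belt so the shared successor can find `x` and `y` at fixed positions.

```python
from collections import deque


class Belt:
    """Toy fixed-length belt (hypothetical model, not the real Mill).

    New results drop at position 0; the oldest value falls off the far end.
    """

    def __init__(self, length=8):
        self.slots = deque(maxlen=length)

    def drop(self, value):
        # appendleft on a full bounded deque discards the rightmost (oldest) item.
        self.slots.appendleft(value)

    def conform(self, positions):
        # Rebuild the belt so new position i holds the old value at positions[i].
        # Anything not named in `positions` is discarded.
        self.slots = deque(
            (self.slots[p] for p in positions), maxlen=self.slots.maxlen
        )


# Predecessor A computed x, then y: its belt is [y, x].
path_a = Belt()
path_a.drop("x")
path_a.drop("y")

# Predecessor B computed y, then x, then a temporary: its belt is [tmp, x, y].
path_b = Belt()
path_b.drop("y")
path_b.drop("x")
path_b.drop("tmp")

# The join block is assumed to expect x at position 0 and y at position 1.
path_a.conform([1, 0])
path_b.conform([1, 2])
```

After the two conform calls both belts hold `x` then `y`, so the successor block can be compiled against a single layout. With only one predecessor you would skip the reorder and compile the successor against whatever layout arrives.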
As for the idea of a conflict, I’m not sure where you got that from. It might be worth going back to the execution talk. The Mill is a statically scheduled VLIW architecture, so micro-parallelism is handled by an explicit ordering of instructions and statically defined latencies (fixed for a specific family member; see the specification talk for how this is handled). That allows the specializer to dispatch multiple opcodes per instruction and know exactly when each will drop its result onto the belt. Thus very few instructions can actually stall, pickup being one of the few, and there’s very little you can do if you are stalled on a pickup.
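A rough sketch of why fixed latencies make this work, again in toy Python. The latency numbers, opcode names and the `drop_schedule` helper are all invented for illustration; real values come from the specification of a given family member. The point is just that once the issue cycle and latency of every operation are known ahead of time, the cycle each result drops on is plain arithmetic, with nothing left to arbitrate at run time.

```python
# Assumed, made-up latencies in cycles (not real Mill figures).
LATENCY = {"add": 1, "mul": 3, "load": 4}


def drop_schedule(program):
    """Return (drop_cycle, name) pairs in the order results reach the belt.

    `program` is a list of (issue_cycle, opcode, result_name) tuples, as a
    static scheduler might lay them out. Several entries may share an issue
    cycle, modelling multiple operations packed into one wide instruction.
    """
    drops = [(cycle + LATENCY[op], name) for cycle, op, name in program]
    return sorted(drops)


program = [
    (0, "mul", "p"),   # issued in cycle 0, drops in cycle 3
    (0, "load", "q"),  # issued in cycle 0, drops in cycle 4
    (1, "add", "r"),   # issued in cycle 1, drops in cycle 2
]

schedule = drop_schedule(program)
```

Here `r` reaches the belt first even though it was issued last, and the scheduler knows that before the code ever runs, so it can place the consumers of `p`, `q` and `r` in exactly the right cycles.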
As for context switches, the amount of data depends on the family member, but it is handled in the background by the spiller. In any case, the register file/belt on a modern processor is a tiny part of the cost of a context switch compared with cache flushing, TLB flushing and, these days, flushing the speculative execution pipeline. So the differences in the Mill’s memory system will hide any impact the belt has on context switches.