The penny drops. I hadn’t fully appreciated how the physical belt works in relation to the logical belt. Thanks.
It also makes sense that multiple call ops in the same instruction are not always cascaded. I misunderstood what was claimed in the talk, and your explanation of conditional returns makes that clear. With a little more thought I would probably have seen it myself. For example, you can’t logically cascade something like F(G(a+b) * c). The mul has to take place after G returns and before F is called, so the two calls can’t be cascaded (and, I suppose, the call ops can’t be in the same instruction either).
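To spell out the dependency I have in mind, here is a small Python sketch. F and G are made-up stand-ins with arbitrary bodies, and the temporaries t0..t2 are only my way of naming intermediate values; nothing here is meant to describe how the Mill actually schedules the ops.

```python
# Hypothetical stand-ins for the two callees; the bodies are arbitrary.
def G(v):
    return v + 1


def F(v):
    return v * 2


def nested(a, b, c):
    # The expression as written in the post.
    return F(G(a + b) * c)


def sequenced(a, b, c):
    # The same expression with each step named, to show the ordering.
    t0 = a + b   # does not depend on any call, so it can run first
    t1 = G(t0)   # first call
    t2 = t1 * c  # needs G's result, so it runs in the caller between the calls
    return F(t2)  # second call: its argument is t2, not G's return value


print(nested(1, 2, 3), sequenced(1, 2, 3))
```

Because F's argument is the product rather than G's result itself, some work has to happen in the caller between the two calls, which is why I would not expect them to be cascadable.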
Is the decision whether to cascade done by the specializer or by the core on the fly?
Also, are there possibly timing issues if the function is quick to execute? Take, for example, F(G(), x * y). Suppose G simply returns a literal value (one supported by cons, so no load is needed). The call takes a cycle, the cons takes a cycle, and the return presumably also takes a cycle, for three cycles total. If x * y is a multiply that takes longer than that (an fmul, for example, taking more than the three cycles accounted for above) and the compiler didn’t schedule the multiply early enough to retire at the right spot (silly compiler), would the cascaded call simply stall waiting for x * y to be computed? Would the specializer know enough not to cascade the call in the first place? If the decision is made by the core, does it have better information to make it?
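To make the arithmetic in my question concrete, here is a rough Python sketch of the timing model I am assuming. The one-cycle costs are the guesses from the paragraph above, the multiply latency and the `issue_lead` parameter are hypothetical, and the "stall" is only what I imagine might happen, not a claim about what the hardware does.

```python
# Guessed per-op costs for the trivial callee G (assumptions, not Mill facts).
CALL_CYCLES = 1     # the call into G
LITERAL_CYCLES = 1  # producing the literal return value
RETURN_CYCLES = 1   # the return from G


def callee_cycles():
    # Total cycles from issuing the call to G until its result is available.
    return CALL_CYCLES + LITERAL_CYCLES + RETURN_CYCLES


def stall_cycles(mul_latency, issue_lead=0):
    """Cycles the call to F would wait for x * y under this toy model.

    mul_latency: assumed latency of the multiply, in cycles.
    issue_lead:  how many cycles before the call to G the multiply was issued.
    """
    # Cycles after the call to G is issued at which x * y becomes available.
    mul_ready = mul_latency - issue_lead
    return max(0, mul_ready - callee_cycles())


print(stall_cycles(4))                # multiply issued alongside the call
print(stall_cycles(4, issue_lead=1))  # multiply issued one cycle earlier
```

Under these assumptions, a four-cycle multiply issued in the same cycle as the call to G leaves F waiting one cycle, while issuing it one cycle earlier hides the latency completely. That gap is what my question is about.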
I apologize if these questions seem too basic, and I appreciate the answers.