That’s an interesting paper; thank you for posting it.
We looked at splitting branches into a herald (their “bb” instruction) and a transfer, essentially as described in the paper, though without their attention to Spectre. After all, it is a natural companion to our split loads, which also count out an instruction-supplied cycle count. And we ultimately decided against.
The authors make a reasonable case that they will find a big enough latency gap between resolution of the predicate (the branch condition) and the transfer to save a few cycles. But their study deals with a single-issue, in-order architecture, which a Mill most emphatically is not. Instead it is a wide-issue, and a single instruction bundle can (and often does) have several branch instructions. The multiple branches would muck up the author’s scheme, but not irretrievably.
But the wide-issue mix of compute and control flow in the same bundle obviates their approach. In open code, the predicates are commonly computed by the same bundle that contains the branch. Even legacy ISAs commonly have “compare-and-branch” ops that combine determining the predicate and taking the transfer. With no advance notice of the predicate it’s not possible to gain any static cycles on the transfer. Of course, in loops (“closed code”) you can roll forward the predicate, potentially arbitrarily far at the cost of extending the loop prefix, but loop branches predict extremely well.
What really decided us against a variable delay slot was the Mill replacement of branch prediction with exit prediction. Exit prediction lets the predictor run arbitrarily far ahead of execution, prefetching as it goes but without needing to examine the fetched code. The real win with a predictor is not in avoiding miss rewinds (at least on a Mill where a miss is five cycles) which the authors scheme helps with, it’s in moving instruction lines up the memory hierarchy.
Of course, there will be blocks with a single exit (avoiding the multi-branch problem) whose predicate is known early (avoiding the lacking latency problem). The question is whether their frequency would justify the feature, and we decided that no, they would not.