Rather than manual annotation, why not just run the program a few times and let it train?
Modifying the binary, such as by adding bits to the branch ops, hits the problem that the decoders cannot know they have a branch until after they have already committed to taking or not taking the transfer. This is a consequence of pipelining the decode logic: the line fetch has to be issued at least a cache latency ahead of getting the line, and it takes a couple more cycles before decode has figured out that a transfer is to be executed. There are two possible resolutions: predictive fetch or retarded execution. We use the former, and have generalized predictive fetch to achieve arbitrary run-ahead. OOO machines use retarded execution: they delay operations that depend on control flow until the control flow is resolved, and rely on the OOO hardware to keep the execute units busy once the pipe has started up.
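To put rough numbers on the timing problem, here is a minimal sketch of how far ahead of the transfer the fetch decision has to be made. The function name and the cycle counts are made up for illustration; they are not Mill figures.

```python
def fetch_lead_cycles(cache_latency: int, decode_cycles: int) -> int:
    """Cycles between issuing a line fetch and decode recognizing a transfer.

    The fetch must go out a cache latency before the line arrives, and decode
    needs a few more cycles to see that the line contains a transfer.
    """
    return cache_latency + decode_cycles


# Hypothetical values, chosen only to show the arithmetic.
ASSUMED_CACHE_LATENCY = 3   # cycles from fetch issue to line arrival
ASSUMED_DECODE_CYCLES = 2   # "a couple of cycles" to recognize the transfer

lead = fetch_lead_cycles(ASSUMED_CACHE_LATENCY, ASSUMED_DECODE_CYCLES)
print(f"fetch must be decided {lead} cycles before decode sees the branch")
```

Any hint bit stored in the branch op itself only becomes visible at the end of that window, which is too late to steer the fetch that started it.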
The two approaches can approximate each other in the steady state, but predictive fetch has much less startup cost than retarded execution, so we have a five-cycle mispredict penalty rather than the 15 or so typical of retarded OOO. Predictive fetch does require more table state than retarded execution for equivalent accuracy, but the Mill avoids that issue through our trainable prediction, which in effect gives us arbitrarily large tables in small, fixed hardware.
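As a back-of-the-envelope illustration of what the penalty difference means, the sketch below computes expected cycles lost to mispredicts. The 5 and 15 cycle penalties are the figures quoted above; the miss rates are assumed values for illustration, not measurements of any machine.

```python
def stall_cycles(branches: int, miss_rate: float, penalty_cycles: int) -> float:
    """Expected cycles lost to mispredicts over a number of executed branches."""
    return branches * miss_rate * penalty_cycles


PREDICTIVE_PENALTY = 5   # five-cycle mispredict penalty quoted above
RETARDED_PENALTY = 15    # "15 or so" quoted above for retarded OOO

# Assumed miss rates, for illustration only.
for miss_rate in (0.01, 0.05, 0.10):
    predictive = stall_cycles(1000, miss_rate, PREDICTIVE_PENALTY)
    retarded = stall_cycles(1000, miss_rate, RETARDED_PENALTY)
    print(f"miss rate {miss_rate:.0%}: "
          f"predictive {predictive:.0f} cycles, retarded {retarded:.0f} cycles "
          f"per 1000 branches")
```

Under this simple model the predictive design loses a third as many cycles at equal accuracy, or equivalently breaks even with three times the miss rate.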
There may be a third approach to dealing with pipelined decoder timing beyond predictive fetch and retarded execution, but I don’t know of one.