This goes about half-way to what the Mill does. You are correct that the usual legacy branch-predictor doesn’t know its got a branch until too late, after it has decoded the branch. There are known improvements, such as marking branches in the I$1 cache lines, which lets you know you have a branch coming when you fetch the line. However this is also usually too late in branch-rich code. Still more complex approaches use the regular predictor to build branch-free code sequences, often already partly decoded; this can save a lot in loops but doesn’t help in open code.
The need to look at “here” before you can realize that you will need to go “there” introduces potential memory delays in prefetch. Adding an explicit prefetch operation doesn’t really help, because you have to decode and execute the prefetch op and generally that comes too late too. A prefetch…load sequence would hide the I$1 latency, but not a 300 cycle DRAM latency – the compiler has no idea where it will be 300 cycles in the future. However, runtime often can have a pretty good idea, and we use that.
The Mill moves the whole predict and fetch machinery away from the code into a side structure with its own engine unrelated to the decoders. That engine can run arbitrarily far ahead of decode; in particular, it can run more than a memory-fetch-time ahead.
Your approach of a second buffer might help on a legacy single-issue architecture, but not on a Mill where an instruction can contain several branches, only one (or none) taken; a side buffer for each would blow out the hardware, especially as the following instruction can have several too – see the code example in the “Switches” talk. There is an I$0 microcache in the Mill, which is vaguely similar to multiple side buffers, but this is to cut the power/bandwidth demand in loops more than to cover mispredictions.
But even if you keep a side buffer per instruction the side buffers greatly increase the bandwidth demand into the decoders; instruction bandwidth is a limiting factor for high-end Mills. The Mill only moves the predicted path to the decoders; the alternate paths are prefetched to the caches but not fetched to the decoder. That lets prefetch be slow and relatively rare (and so not in the way), with the run-ahead prediction giving the necessary advance notice. We still have to get a missed target from the caches, but we avoid the great bulk of the cost of getting from memory.
To answer your final question: no, branch prediction is not actually necessary, and Mill uses exit prediction instead. But so long as memory latency is much longer than a compiler can see, and path fanout remains larger than hardware can handle in parallel, prediction of some kind will be necessary for performance.