So, the loads will still pollute the cache, and a timing attack can be performed.
I read somewhere that one way around the issue is to have some amount of “speculative” cache: data loaded during speculation doesn’t get stored in the normal cache and can only be accessed by the speculatively executing code. When that speculative code accesses data that’s already in the normal cache, the access doesn’t affect whether that data gets evicted. If the CPU runs out of “speculative” cache, speculative execution stops pending resolution of the branch.
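To make that concrete, here is a rough toy model of the idea in Python. It is only a sketch under my own assumptions, not how any real CPU does it: the class name `ToyCache`, the LRU replacement policy, the line counts, and especially the choice to copy speculative lines into the normal cache once the branch resolves as correctly predicted are all made up for illustration.

```python
from collections import OrderedDict


class ToyCache:
    """Toy model: a normal LRU cache plus a small speculative-only buffer."""

    def __init__(self, main_lines, spec_lines):
        self.main = OrderedDict()   # addr -> None, least recently used first
        self.main_lines = main_lines
        self.spec = set()           # lines visible only to speculative code
        self.spec_lines = spec_lines

    def _fill_main(self, addr):
        # Insert (or refresh) a line, evicting the LRU line if over capacity.
        self.main[addr] = None
        self.main.move_to_end(addr)
        while len(self.main) > self.main_lines:
            self.main.popitem(last=False)

    def load(self, addr, speculative=False):
        """Return "hit", "miss", or "stall" (speculative buffer is full)."""
        if addr in self.main:
            if not speculative:
                self.main.move_to_end(addr)  # only real loads touch LRU state
            return "hit"
        if not speculative:
            self._fill_main(addr)
            return "miss"
        if addr in self.spec:
            return "hit"
        if len(self.spec) >= self.spec_lines:
            return "stall"  # wait for the branch to resolve
        self.spec.add(addr)
        return "miss"

    def resolve(self, predicted_correctly):
        # Assumption: on a correct prediction the lines are committed to the
        # normal cache; on a misprediction they vanish without a trace.
        if predicted_correctly:
            for addr in sorted(self.spec):
                self._fill_main(addr)
        self.spec.clear()
```

With this model, a mispredicted path that loads `0x100` leaves nothing behind: after `resolve(False)`, a later normal load of `0x100` is still a miss, so there is nothing for the attacker to time.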
I think you could also fix it by making all* of the cache one physical level, which the CPU could then dynamically partition into per-process private caches and multiple logical shared caches. A process could tell the CPU that it wants a larger private cache, or request access to one (or more, I suppose) of the logical shared caches if it’s a single app that needs to share data across multiple concurrent threads. Since cache wouldn’t be shared between unrelated processes, malicious code couldn’t observe any changes to other processes’ cache, and non-malicious code couldn’t accidentally leak data by changing malicious code’s timing. Dunno what that’d cost in transistors, though, nor do I know what the performance implications of having only one physical cache level would be.
*Except maybe some amount per-core, small enough that it could be flushed along with the belt every context switch.
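Here is an equally rough sketch of the partitioning idea above, again in Python and again purely hypothetical: `PartitionedCache`, its `create` and `load` methods, the integer process IDs, and the per-partition LRU policy are all names and choices I invented to illustrate the point, not anything a real CPU exposes.

```python
from collections import OrderedDict


class PartitionedCache:
    """Toy model: one physical pool of lines split into logical partitions."""

    def __init__(self, total_lines):
        self.free = total_lines  # lines not yet assigned to any partition
        self.capacity = {}       # partition name -> number of lines
        self.lines = {}          # partition name -> OrderedDict in LRU order
        self.members = {}        # partition name -> set of allowed process IDs

    def create(self, part, size, pids):
        # Carve a partition out of the free pool. One pid gives a private
        # cache, several pids give a logical shared cache.
        if size > self.free:
            raise ValueError("not enough free cache lines")
        self.free -= size
        self.capacity[part] = size
        self.lines[part] = OrderedDict()
        self.members[part] = set(pids)

    def load(self, pid, part, addr):
        """Return "hit" or "miss"; only members may touch a partition."""
        if pid not in self.members[part]:
            raise PermissionError("process has no access to this partition")
        lines = self.lines[part]
        if addr in lines:
            lines.move_to_end(addr)
            return "hit"
        lines[addr] = None
        if len(lines) > self.capacity[part]:
            lines.popitem(last=False)  # evicts only within this partition
        return "miss"
```

In this model, if an attacker process primes its own two-line partition and a victim process then streams through its own partition, the attacker's lines are all still hits afterwards, because evictions never cross partition boundaries. That is the property the scheme relies on; the sketch says nothing about the transistor or performance cost.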