The largest difference is that the Mill is intended to be a commercial product and so must adapt to commercial realities. Thus (for example) we must use commodity memory, so WaveScalar-style processor-in-memory techniques are off the table.
The classical problem with dataflow, since the MU-5 days, has been the match-box, where retiring operations are matched with those awaiting arguments to mark them as ready for issue. The match-box is functionally the same as the scoreboard or similar device in an OOO. It is in effect a CAM, and multi-hit CAMs (a single retire can enable arbitrary numbers of other operations) scale poorly, which is in part why OOOs don’t have more pipelines. One-or-none CAMs, like those in caches, scale fairly well, but multi-hit has been a problem.
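To make the multi-hit point concrete, here is a toy software model of a match-box. It is only a sketch: the `MatchBox` class, the operation names and the operand tags are all made up for illustration, and a real match-box is an associative hardware structure, not a pair of dictionaries.

```python
from collections import defaultdict


class MatchBox:
    """Toy model: waiting operations are looked up by the operand tags they need."""

    def __init__(self):
        self.missing = {}                # op name -> operand tags still outstanding
        self.waiters = defaultdict(set)  # operand tag -> ops waiting on that tag

    def add_waiting(self, op, needs):
        """Park an operation until every tag in `needs` has been retired."""
        self.missing[op] = set(needs)
        for tag in needs:
            self.waiters[tag].add(op)

    def retire(self, tag):
        """Broadcast one retired result; return (ops that matched, ops now ready)."""
        hits = sorted(self.waiters.pop(tag, ()))
        ready = []
        for op in hits:
            self.missing[op].discard(tag)
            if not self.missing[op]:     # last outstanding operand just arrived
                del self.missing[op]
                ready.append(op)
        return hits, ready


mb = MatchBox()
mb.add_waiting("add1", ["a", "b"])
mb.add_waiting("mul1", ["a"])
mb.add_waiting("sub1", ["a", "c"])

hits_a, ready_a = mb.retire("a")   # one retire matches three waiting ops
hits_b, ready_b = mb.retire("b")   # this one matches a single op
print(hits_a, ready_a)
print(hits_b, ready_b)
```

The single retire of `a` has to update three entries at once, which is the multi-hit case. In software that is just a loop; in a CAM every one of those hits has to be resolved in the same cycle, and that is the part that does not scale.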
These research machines attempted to get around the match-box problem by changing the granularity of the match: from single operations (as in the MU designs) to whole subgraphs, where the subgraph was located by a compiler. You still have multi-hit, but if the subgraphs are big enough to keep the FUs busy for a few cycles, your required match rate could drop low enough to handle the multi-hits. That was the idea, anyway.
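A rough way to see why coarser granularity lowers the match rate is to count only the dependency edges that cross a group boundary. The sketch below assumes, purely for illustration, that edges inside a compiler-chosen subgraph are wired up statically and never visit the match-box; the graph, the grouping and the `match_events` helper are hypothetical.

```python
def match_events(edges, group_of):
    """Count producer->consumer edges that cross a group boundary.

    Assumption for this sketch: only those crossing edges need a dynamic
    match; edges inside one group are resolved by the compiler.
    """
    return sum(1 for src, dst in edges if group_of[src] != group_of[dst])


# Hypothetical 8-op graph: two 4-op chains joined by a single edge (3 -> 4).
edges = [(0, 1), (1, 2), (2, 3), (4, 5), (5, 6), (6, 7), (3, 4)]

per_op = {op: op for op in range(8)}             # every op is its own group
per_subgraph = {op: op // 4 for op in range(8)}  # compiler found two subgraphs

print(match_events(edges, per_op))        # every edge needs a match
print(match_events(edges, per_subgraph))  # only the joining edge does
```

With per-operation matching all seven edges go through the match-box; with two subgraphs only one does. How far that ratio can be pushed depends entirely on how large the subgraphs are that the compiler can find.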
However, that calls for a smart compiler to isolate the subgraphs. Research has shown that the average local ILP (i.e. what a dataflow machine needs) is 2; Mill phasing gets that to 6, essentially by overlapping instructions, but it is still a very small number for the kind of massive machines these projects anticipate. Research also shows, though, that there is very high remote ILP available: 10k instructions and more, far more than will keep a machine busy. The problem is that languages and programs make it very hard to isolate subgraphs that are spread far enough over the program to pick up the remote ILP.
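The local versus remote distinction can be illustrated by measuring ILP as instruction count divided by critical-path length, first inside a small window and then over the whole program. This is a sketch on a synthetic dependency graph, not a reproduction of the research mentioned above: the program shape (100 independent blocks of 20 instructions, each block two interleaved chains) and the helper names are assumptions chosen so that the numbers come out round.

```python
def ilp(deps, start, size):
    """Instruction count / critical-path length for the window [start, start+size).

    deps[i] lists the earlier instructions that instruction i depends on.
    Producers outside the window are treated as already available.
    """
    end = min(start + size, len(deps))
    depth = {}
    for i in range(start, end):
        depth[i] = 1 + max((depth[p] for p in deps[i] if p in depth), default=0)
    return (end - start) / max(depth.values())


def windowed_ilp(deps, size):
    """Average ILP over consecutive, non-overlapping windows of `size` instructions."""
    values = [ilp(deps, s, size) for s in range(0, len(deps), size)]
    return sum(values) / len(values)


BLOCK, BLOCKS = 20, 100
# Synthetic program: inside each block, instruction i depends on i - 2,
# giving two interleaved chains; no dependencies cross block boundaries.
deps = [[i - 2] if i % BLOCK >= 2 else [] for i in range(BLOCK * BLOCKS)]

local_ilp = windowed_ilp(deps, BLOCK)   # what a 20-instruction window can see
remote_ilp = ilp(deps, 0, len(deps))    # what a whole-program view can see
print(local_ilp, remote_ilp)
```

On this made-up program the small window sees an ILP of 2, while the whole-program view sees 200, because the blocks are independent of one another. The parallelism is there, but only for something that can look across the whole program at once.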
I personally have long felt that dataflow machines are well worth re-exploring using modern fabrication tech; the MU-5 was from the discrete-component generation. If I worked in academia, that’s where I would put some of my research effort. But the Mill’s goal is to sell chips, not sheepskins, and so we must content ourselves with the available local ILP in unchanged codes written in unmodified languages. So no Mill will give a 1000X performance boost the way a dataflow machine might if the problems could be fixed.
In some pipelinable loops we can get 20X performance, but over the breadth of general-purpose code we are happy to see a 5X average, because there’s just no more there regardless of how many function units are present in the hardware. A good OOO can also see better-than-two ILP, although because it does so at runtime it suffers from the limited number of instructions it can keep in flight and match for issue; scaling again. Still, a high-end OOO can see the same ILP numbers our lower-mid-range sees, and in that performance region we must compete on power and area. Only at the higher end do we have performance headroom that an OOO cannot reach.