wren6991
At 33:24 in the video, it seems like both decoders will see non-zero lag counts at the same time: 1 on the left side, and 2 on the right side. Since there are no explicit no-ops here, each of these non-zero-lag instructions must immediately follow the previous instructions in memory.
So how does the machine figure out that the right side NOPs first, and the left side NOPs second? Is there some canonical order? Is this encoded? Or is it something clever? :-)
ivan
The running lag on each side is maintained in an internal register on that side. The two registers are counted down independently each issue clock and are set back up as added lag is decoded. The difference in the contents of the lag registers tells which side restarts first, or rather that is the effect, because each side is independent and restarts when its private lag runs out. Of course, if the registers have the same value they restart together. In your example, because both are non-zero they both stall, and the left counts to zero and the right to one. The next cycle the left (at zero) issues and the right counts to zero. The cycle after that they both issue.
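As a rough illustration of the countdown described above, here is a minimal Python sketch of two independent lag registers. The function name `simulate` and the "issue"/"stall" labels are made up for this example, and the model ignores any new lag decoded from later instructions. It is a toy, not a description of the actual hardware.

```python
def simulate(left_lag, right_lag, cycles):
    """Toy model: return a list of (left_action, right_action) per issue cycle.

    Assumption: a side with a non-zero lag register stalls and counts down by
    one; a side whose register is zero issues. No new lag is decoded here.
    """
    lags = {"left": left_lag, "right": right_lag}
    trace = []
    for _ in range(cycles):
        actions = {}
        for side in ("left", "right"):
            if lags[side] == 0:
                actions[side] = "issue"
            else:
                actions[side] = "stall"
                lags[side] -= 1  # each side counts down independently
        trace.append((actions["left"], actions["right"]))
    return trace


# The example from the question: lag 1 on the left, lag 2 on the right.
for cycle, (left, right) in enumerate(simulate(1, 2, 3)):
    print(cycle, left, right)
# 0 stall stall
# 1 issue stall
# 2 issue issue
```

Neither side needs to know about the other's register; the ordering falls out of the two counts running down separately.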
This of course leaves the question of how to know the running lags at some random point of the code. The call operation saves the lags and return restores them, which takes care of calls. However, the debugger is more of a problem. The compiler statically knows when each half-instruction issues relative to the half-instruction on the other side and can put the relative lag in the debug info. However, if both are lagging, there's no way (at present) to tell the debugger/sim to start up (or examine) one, two, ... cycles before lag-out and issue resumption. Instead, if the point you select is in the middle of bilateral lagging, you see the program as of the first actual issue after that point. We could add a command to the UI to let the user set the lag point, but it doesn't seem to be actually useful and we haven't.
goldbug
So I have been taking a class on computer architecture (I am a software guy). The more I learn the more in awe I am with the beauty of the Mill instruction encoding and other features.
CISC sucks. It needs millions of transistors just to decode 6 instructions.
RISC is a clear improvement, but the superscalar OoO design is ridiculously complicated. As I learn about the Tomasulo algorithm, wide issue/decode, and speculative execution, I can't help but think "this is insane, there has to be a better way". It feels like the wrong path.
VLIW seems like a more reasonable approach. I know binary compatibility problems and stalls have been a challenge for VLIW architectures.
The Mill is just beautiful: it has a sane encoding and the simplicity of a VLIW. But phasing and the double instruction stream really take it to the next level.
The separate load issue and retire is in hindsight the obvious way to solve the stalls due to memory latency that is so common in VLIW.
The branch predictor is so cool too: you can predict several EBBs in advance, even before you start execution. Mainstream predictors have to wait until they get to the branch instruction.
The specializer is a neat solution to binary compatibility.
I really hope to see this CPU make it to silicon.
davidm
As a programmer of many decades, the parts I like most about what I've learned about the Mill are the innovations around memory. The implicit zero for stack frames is a thing of beauty. You get a strong guarantee of initialization that's actually faster for the hardware.
Pushing the TLB to the periphery is also genius. A 64-bit address space really is a tremendous amount. We all "know" that statements like "640k is enough for anyone" are laughably short-lived, but that's only a joke until it isn't. If "enough is as good as a feast", then an exponential number of feasts must truly be enough for anyone. That one restriction of living in a single 64-bit address space yields so many benefits if you take advantage of it in the right way. You just have to have a wider perspective and a willingness to rethink past decisions (i.e. separate address spaces for each process).
That's just a few of my favorites (NaR-bit? Implicitly loading zero? Static scheduling with the performance of OoO?). There have been so many times when learning about the Mill architecture that I've had that a-ha moment. Once a problem is framed correctly the solutions seem obvious. It reminds me of the joke about the two mathematicians who meet at a conference and are both working on the same problem. They get to talking in the bar at night and start working through it together. They work until the bar closes and then continue to work in the lobby until well into the next day. Finally, they make a breakthrough and one says to the other "Oh! I see! It's obvious!"
ivan
So do we :-)
Would your instructor welcome a class presentation by you about the Mill?
goldbug
No, I am long out of school.
I just take online courses (Udemy) on things that I find interesting.