Forum Replies Created
- David McCammond-Watts (Participant), January 25, 2021 at 10:57 pm, Post count: 13
As a programmer of many decades, what I like most about what I’ve learned of the Mill are the innovations around memory. The implicit zero for stack frames is a thing of beauty. You get a strong guarantee of initialization that’s actually faster for the hardware.
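For fellow software folks, here’s a toy sketch of my own (not Mill internals, and the class name is made up) of what that guarantee means at the semantic level: a freshly allocated frame reads as zero everywhere until written, with no explicit clearing pass.

```python
# Toy model (my own sketch, not actual Mill hardware) of the "implicit
# zero" guarantee: every slot of a newly allocated stack frame reads as
# zero until it is written, and no stale data from a prior frame at the
# same address is ever visible.

class ImplicitZeroFrame:
    """A stack frame whose unwritten slots always read as zero."""

    def __init__(self, size):
        self.size = size
        self.written = {}  # only slots actually written are stored

    def load(self, offset):
        assert 0 <= offset < self.size
        # Reading a never-written slot yields zero -- defined behavior,
        # with no up-front memset over the whole frame.
        return self.written.get(offset, 0)

    def store(self, offset, value):
        assert 0 <= offset < self.size
        self.written[offset] = value


frame = ImplicitZeroFrame(64)
print(frame.load(16))   # 0 -- uninitialized, but defined
frame.store(16, 42)
print(frame.load(16))   # 42
```

The point of the sketch is that the zero comes from bookkeeping (which slots are valid) rather than from writing zeros, which is why it can be faster for the hardware than an explicit clear.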
Pushing the TLB to the periphery is also genius. A 64-bit address space really is a tremendous amount. We all “know” that statements like “640k is enough for anyone” are laughably short-lived, but that’s only a joke until it isn’t. If “enough is as good as a feast”, then an exponential number of feasts must truly be enough for anyone. That one restriction of living in a single 64-bit address space yields so many benefits if you take advantage of it in the right way. You just have to have a wider perspective and a willingness to rethink past decisions (i.e. separate address spaces for each process).
That’s just a few of my favorites (the NaR bit? Implicitly loading zero? Static scheduling with the performance of OoO?). There have been so many times while learning about the Mill architecture that I’ve had that a-ha moment. Once a problem is framed correctly, the solutions seem obvious. It reminds me of the joke about the two mathematicians who meet at a conference and are both working on the same problem. They get to talking in the bar at night and start working through it together. They work until the bar closes and then continue in the lobby until well into the next day. Finally, they make a breakthrough and one says to the other, “Oh! I see! It’s obvious!”
- David McCammond-Watts (Participant), August 22, 2016 at 9:45 am, Post count: 13
A couple of questions–sorry if they’re basic:
1. You mention that when the return operation cuts back a stack, it clears the valid bits on the stack frame’s cache lines. Does the clearing of the valid bits have to cascade to all levels of cache?
2. Unless I’m mistaken, the TLB is a cache of PTEs and might not contain all the PTEs in the system (i.e. it’s a cache over operating system tables, right?). You mention in the talk that during a load that misses in cache and then also misses in the TLB, the TLB directly returns a zero, without having to go to main memory. Wouldn’t the TLB have to go to main memory for PTEs at least some of the time, even if it doesn’t have to go to main memory for the actual value to be returned? Are you using a data structure that makes this unlikely (i.e. you can answer “not found” queries without having the whole set of PTEs in the TLB), or is it just that you have a large TLB and the well-known-region registers cover a lot of what would otherwise be PTEs, making it likely that all remaining PTEs fit in the TLB?
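To make question 2 concrete, here’s a toy model of my own (hypothetical names and structure, nothing Mill-specific) of the behavior I’m asking about: the TLB caches PTEs kept in OS tables in memory, and a load whose page has no translation anywhere returns zero.

```python
# Toy model of my question (my own hypothetical sketch, not Mill
# internals).  The TLB is a cache over OS page tables held in main
# memory; a load to a page with no PTE anywhere returns zero.

class ToyTLB:
    def __init__(self, os_page_tables):
        self.os_page_tables = os_page_tables  # PTEs "in memory": page -> frame
        self.cache = {}                       # the TLB proper: a subset of PTEs

    def translate(self, page):
        if page in self.cache:                # TLB hit
            return self.cache[page]
        if page in self.os_page_tables:       # TLB miss: walk tables in memory
            self.cache[page] = self.os_page_tables[page]
            return self.cache[page]
        return None                           # no PTE anywhere: "backless"

def load(tlb, memory, addr, page_size=4096):
    frame = tlb.translate(addr // page_size)
    if frame is None:
        return 0                              # backless load returns zero
    return memory.get((frame, addr % page_size), 0)


tables = {0x10: 0xAA}                         # one mapped page
tlb = ToyTLB(tables)
memory = {(0xAA, 8): 99}
print(load(tlb, memory, 0x10 * 4096 + 8))     # 99, via a table walk
print(load(tlb, memory, 0x999 * 4096))        # 0, no PTE anywhere
```

In this toy model the “no PTE anywhere” check is a free dictionary lookup; in hardware, that check is exactly what I’d naively expect to require a trip to memory, which is the crux of my question.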
Thanks for the answers. I’m a software guy and not a hardware guy, so I’m sorry if the questions betray a lack of understanding.
- David McCammond-Watts (Participant), April 6, 2014 at 9:36 pm, Post count: 13
In the talk you mention that if you have multiple calls in an instruction, the return from one call goes directly to the subsequent called function. You give the example of F(G(a+b)). The instruction would be “add-call-call”, where the add is in op phase and the calls are in call phase. A couple of questions:
1. Is this true if the code were G(a+b); F(c+d); (i.e. does the retn in G go directly to F)?
2. How are the belt and stack set up for F? Presumably G’s stack must be cut back and F’s stack established (although with IZ, this should be a simpler process on the Mill), but F’s belt must be established as well. Is this also information stored by the spiller (in addition to return addresses, as described in the security talk)? Does the FU that processes the retn have access to that saved state (something like the belt list you’d pass with a call) to set up the next call (i.e. the call to F)?
Thanks in advance for your answers.
- David McCammond-Watts (Participant), April 6, 2014 at 1:19 pm, Post count: 13
I was wondering what happens to FUs that are occupied at the time of a call. Do they complete while the call is executing and simply output their value to an output buffer for the slot (potentially to get saved and restored by the spiller), or is their internal execution state saved and restored?
Suppose you have a mul and a call in the same instruction, and that the mul takes 3 cycles. With phasing, the mul will have completed 1 cycle’s worth of work prior to the call. Will the mul complete its work (in two more cycles) during the called function’s cycles, or would its work be stopped and its internal state saved (again, to the spiller?) and then restored on return? If the mul completes during the called function, then it could show up in the caller’s belt prematurely (unless that’s handled by a naming scheme?). If the state is saved and restored, then potentially more information would need to be spilled and filled, and the mechanism for doing such a thing must be costly in complexity (wires, area, power, etc.).
- David McCammond-Watts (Participant), March 27, 2014 at 5:29 pm, Post count: 13
How in the heck were you able to keep these advances to yourselves for so long? I’m reminded of the experience of reading a particularly clever proof–that “oh of course that’s how you do that!” feeling. So many “obvious” (after the fact) enhancements. Keeping all this quiet for so long, well that’s willpower.
I have a question about threading. In the security talk you mention that threads are “…the same familiar thread you’ve sworn at so many times in your programming career.” I thought I’d read on comp.arch that the Mill won’t support preemptive multitasking, even though it would support having multiple threads each running on a separate core. Did I get that wrong or can you have multiple preemptively switched (i.e. not “green”) threads each getting a time slice on the same core?
- David McCammond-Watts (Participant), April 7, 2014 at 8:41 am, Post count: 13
The penny drops. I wasn’t fully appreciating how the physical belt worked in relation to the logical. Thanks.
It also makes sense that multiple call ops in the same instruction are not always cascaded. I misunderstood what was claimed in the talk. Your explanation of conditional returns makes that clear. If I’d thought just a little more about it, I think it would have been obvious. For example, you can’t logically cascade something like F(G(a+b) * c). The mul would have to take place after G returns, so the calls couldn’t be cascaded (and the call ops couldn’t be in the same instruction, either, I suppose).
Is the decision whether to cascade done by the specializer or by the core on the fly?
Also, are there possibly timing issues if the called function is quick to execute? Take, for example, F(G(), x * y). Suppose G simply returns a literal value (one supported by cons, so no load is needed). The call takes a cycle, the cons takes a cycle, and the return presumably also takes a cycle, for three cycles total. If x * y is a multiply that takes more cycles than the call (an fmul, for example, taking more than the 3 cycles accounted for above) and the compiler didn’t schedule the multiply early enough to retire at the right spot (silly compiler), would the cascaded call simply stall waiting for x*y to be computed? Would the specializer know enough simply not to cascade the call in the first place? If the decision is made by the core, does it have better information on which to base it?
I apologize if these questions seem too basic, and I appreciate the answers.
- David McCammond-Watts (Participant), April 6, 2014 at 11:06 pm, Post count: 13
Yes, I was thinking about C++ style exceptions, not hardware faults. In particular, how does this all work if your exception handler (i.e. catch) is several stack frames up from the place the exception was thrown? I doubt that’s implemented as a normal branch or return (tell me if I’m wrong). How are all the intervening stack frames unwound, along with the internal spiller-space state (including the separate return-address stack) and any not-yet-executed “tail calls”? What happens to in-flight computation results that should be discarded? I imagine there must be a way for the compiler to tell the core to unwind all of that. Perhaps something similar to a facility for setjmp/longjmp (but not identical, I would imagine)?
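For reference, the software-level behavior I have in mind is just ordinary language semantics (shown here in Python; C++ with destructors is analogous), nothing Mill-specific: a throw several frames down unwinds every intervening frame, running per-frame cleanup, before the handler sees control.

```python
# Ordinary exception semantics: the handler is several frames up from
# the throw, and every intervening frame is unwound with its cleanup
# run.  My question is what the core/spiller must do to discard the
# belt, return-address, and in-flight state for those frames.

unwound = []

def leaf():
    raise ValueError("thrown three frames down")

def middle():
    try:
        leaf()
    finally:
        unwound.append("middle")   # runs during unwinding, like a destructor

def outer():
    try:
        middle()
    finally:
        unwound.append("outer")

try:
    outer()
except ValueError:
    unwound.append("handler")

print(unwound)  # ['middle', 'outer', 'handler']
```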
I understand if this has to wait for a future talk, but it’s hard to resist asking questions. It’s like having access to your favorite author who has yet to publish the last volume of a trilogy. How do you not ask?
- David McCammond-Watts (Participant), April 6, 2014 at 9:16 pm, Post count: 13
Thanks for the explanations. That makes a lot of sense. Just to make sure I have this right: if you have a “mul-call” instruction followed by an “add”, then the mul will complete during the called function (presuming it’s more than a couple of cycles long) before the add is issued. Even though the mul ends up being computed before the add (because the mul’s result will show up in an output buffer during the cycles of the called function), it will show up in the caller’s belt at the appropriate time (via naming), *after* the add. The mul’s result might get saved and restored by the spiller if the called function is of sufficient length (or itself calls functions of sufficient length, etc.), but that is all transparent to the software. An interrupt is handled essentially the same way (via result replay). Honestly, that sounds pretty straightforward, given the alternatives. Let me know if that’s not correct.
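A toy model of my understanding (my own sketch with made-up names; the real mechanism is renaming in hardware): results may *finish* out of order, but they are named onto the belt in their scheduled retire order, so that is the only order software ever observes.

```python
# Toy model (my sketch, not Mill internals) of the point above: a result
# may be computed early -- e.g. a mul finishing during a called function
# -- but it lands on the belt at its scheduled retire slot, so programs
# only ever observe retire order.

from collections import deque

class ToyBelt:
    def __init__(self, length=8):
        self.slots = deque(maxlen=length)  # oldest values fall off the end

    def retire(self, completed_results):
        # completed_results: (retire_slot, value) pairs, possibly finished
        # out of order; the belt inserts them by retire slot, not by when
        # the FU happened to produce them.
        for _, value in sorted(completed_results):
            self.slots.appendleft(value)   # slot b0 holds the newest result


belt = ToyBelt()
# The mul (retire slot 2) finished before the add (retire slot 1) because
# it ran during the called function, but the add still retires first.
belt.retire([(2, "mul result"), (1, "add result")])
print(list(belt.slots))  # ['mul result', 'add result'] -- mul is newest
```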
I’m curious to see how exceptions play into all of this. I know I’ll have to wait a while, but that must be an interesting story to tell.
In any event, thanks, again!
- David McCammond-Watts (Participant), March 27, 2014 at 6:20 pm, Post count: 13
Thanks for the answer. Based on the Security talk that’s what I expected.
I can imagine all kinds of novel uses for the “protection environment switch via portal” (i.e. essentially a double function-call sans prediction). “Green” thread implementations will rock on the Mill, as will simple system calls. Simply elegant.