Forum Replies Created
SAS means there are no address homonyms on the Mill. Fair enough. However, what about address synonyms? Memory with copy-on-write semantics is obviously not an issue, so this has no effect on `fork()`, but I seem to recall someone mentioning that all of physical RAM is mapped at startup, and there are definitely a few neat tricks that rely on doubly-mapped pages (such as fast circular buffers and heap compaction without pointer changes).
- in reply to: Layout can dramatically affect performance #3935
My guess would be that the Mill is significantly less sensitive to this kind of thing, almost by accident.
Firstly, laying code out to avoid I$1 conflicts is certainly something that could be implemented, but the specializer would need to know which functions usually get called at the same time. However, exit-prediction-driven prefetch means that a much higher percentage of transfers land in cache anyway, so the impact of not doing that is mitigated.
Secondly, the Mill inherently stores much less data on the application-visible stack, because register saving is handled by the hardware and huge amounts of transient data can live in the Scratchpad. This makes stack data more compact, which inherently gives better locality.
Thirdly, the Mill is smart about stack frames. A stack frame that was evicted doesn’t necessarily incur a cache miss when it is used again, since the Mill tracks stack frames across function call boundaries. In particular, accesses to a freshly allocated frame never go to memory at all: new frames read as implicit zero.
Fourthly, as a SAS-with-fork()-hack processor, the Mill will only hit the TLB if it’s going to memory anyway. Permission checks have to be done on each load, store, branch, and call, but the PLB is effectively fully associative and only stores one entry per range of bytes with a given permission, making it completely insensitive to both code and data layout changes that don’t cross protection boundaries.
Contrary to what everyone else is saying, this isn’t possible on the Mill as we know it.
However, it seems likely that it could be made possible. Unfortunately, it may be more trouble than it’s worth…maybe it’ll be in the genASM but never actually implemented, like Decimal. Anyway, here’s my take:
- Add a new type of data that can be operated on alongside float, integer, and pointer: the int60. The int60 is like a pointer in that ops on it are only specified at pointer width.
- To use int60, a SpecReg has to be set selecting which three-bit pattern in the reserved bits marks a value as an int60 (default 000).
- SpecRegs also set the handlers for int60 overflow ops. There are ADD, SUB, MUL, DIV, MOD, DIVMOD, AND, OR, NOT, XOR, SHL, SHR, TOINT (with output width, overflow behavior, and output signedness arguments), FROMINT (with input signedness arguments), and DROP (explicitly-called destructor op).
- Basically: when the reserved bits match the three-bit “this is actually an integer” pattern, treat the value as a 60-bit int; otherwise treat it as a pointer and call the configured handler routine. A sketch of that dispatch follows this list.
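To make the dispatch concrete, here’s a rough C sketch of the fast path / handler split for ADD. Everything here (the bit positions, the handler name, the exact range check) is my assumption, not anything specified for the Mill:

```
#include <stdint.h>

/* Assumptions for this sketch: the three reserved bits sit at the top
 * of the 64-bit word, the payload is the low 60 bits (bit 60 is spare
 * here), and the tag pattern is the default 000. */
#define TAG_SHIFT 61
#define TAG_MASK  (UINT64_C(7) << TAG_SHIFT)
#define INT60_TAG (UINT64_C(0) << TAG_SHIFT)

static int is_int60(uint64_t word) {
    return (word & TAG_MASK) == INT60_TAG;
}

/* Sign-extend the 60-bit payload into a full int64_t. */
static int64_t int60_value(uint64_t word) {
    return (int64_t)(word << 4) >> 4;
}

/* Hypothetical SpecReg-installed fallback handler for ADD: called on
 * overflow or when either operand is really a pointer. */
extern uint64_t int60_add_handler(uint64_t a, uint64_t b);

static uint64_t int60_add(uint64_t a, uint64_t b) {
    if (is_int60(a) && is_int60(b)) {
        int64_t sum = int60_value(a) + int60_value(b);
        /* Re-tag if the sum still fits in 60 signed bits. */
        if (sum >= -(INT64_C(1) << 59) && sum < (INT64_C(1) << 59))
            return ((uint64_t)sum & ~TAG_MASK) | INT60_TAG;
    }
    return int60_add_handler(a, b);
}
```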
A couple of questions/thoughts.
What are the advantages of using the belt for addressing over direct access to the output registers? Is this purely an instruction density thing?
Why does the split mux tree design prevent you from specifying latency-1 instructions with multiple drops? Couldn’t you just have an FU with multiple output registers feeding into the latency-1 tree? I’m not able to visualize what makes it difficult.
For that matter, how does the second-level mux tree know how to route if the one-cycle mux tree only knows a cycle in advance? It seemed to me like either both mux trees would be able to see enough to route the info, or neither would. Does this have to do with the logical-to-physical belt mapping? That’s the only thing I can think of that the second-level tree would have that the one-cycle tree doesn’t.
- in reply to: Memory level parallelism and HW scouts #4001
Maybe it was poorly worded. What I meant was that an OoO core can find a way to overlap stalls where static instruction bundles cannot.
Suppose the scheduling is:

```
load_offset(a, i, delay), load_offset(b, j, delay);
…
// a[i], b[j] drop
con(member_offset), load_offset(b0, b2, delay), load_offset(b1, b2, delay);
```

Imagine a[i] hits but b[j] misses: now we have no choice but to stall. Then imagine a[i]->member misses but b[j]->member hits: now we have no choice but to stall again. An OoO processor, by contrast, can see that a[i] hit and issue the load for a[i]->member before the load for b[j] retires.
- in reply to: Fractional byte packing? #3993
Like, if you were to pack five six-bit values and two boolean flags into a 32-bit word, is there a clean way to extract them without clogging the ALU pipelines with a bunch of intermediate shift-and-mask ops?
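For concreteness, here’s a plain-C sketch of the extraction sequence I mean; the field layout is just my assumption for illustration:

```
#include <stdint.h>

/* Assumed layout (illustration only): five 6-bit fields in bits 0..29
 * and two flag bits in bits 30..31 of a 32-bit word. */
typedef struct {
    uint8_t v[5];   /* each 0..63 */
    uint8_t flag_a; /* 0 or 1 */
    uint8_t flag_b; /* 0 or 1 */
} fields;

static fields unpack(uint32_t w) {
    fields f;
    for (int i = 0; i < 5; i++)
        f.v[i] = (w >> (6 * i)) & 0x3F; /* one shift + one mask per field */
    f.flag_a = (w >> 30) & 1;
    f.flag_b = (w >> 31) & 1;
    return f;
}
```

That’s seven shift/mask pairs through the ALU pipes on a conventional machine; the question is whether the Mill can do better than issuing them one by one.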
- in reply to: Is the Mill ISA classic virtualizable? #3669
Your reply makes me wonder what it would take to make a fully position-independent operating system that can be started in either the ALL turf or a smaller turf under a “hypervisor” in ALL. There’s certainly no advantage to doing it that way, and in fact at that point you might genuinely be in danger of running out of address space, but it’s an interesting thought experiment.
- in reply to: MILL and OSS #3247
So I assume that means that the open-source specializer generator will itself be written in ordinary C++ and be able to generate x86 or ARM code that can do the specialization, right? Because if the only way to generate conASM for a specializer is the closed-source specializer, you’re going to have an obvious trusting trust problem.
Of course, adding complex, self-hiding backdoor code to the specializer and debugger would probably make specialization and debugging very slow, which would make most genASM JITs very slow, and as you often note, you’re in it for the money.
- in reply to: Meltdown and Spectre #3239
Well, I guess the remaining hazard is mostly optimizing compilers that do if-conversion on bounds-check code when generating genASM. Since most bounds checks are going to pass, if-conversion looks profitable, but it’s only a valid optimization if the index doesn’t come from untrusted input. Of course, I suppose you can’t really stupid-proof an architecture against a compiler.
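To pin down the transformation I mean, here’s a minimal C rendering (the names are made up, and the second function is source-level shorthand for what the compiler would actually emit as genASM):

```
#include <stddef.h>
#include <stdint.h>

extern const uint8_t table[];
extern const size_t  table_len;

/* What the programmer wrote: the load is control-dependent on the check. */
uint8_t lookup(size_t i) {
    if (i < table_len)
        return table[i];
    return 0;
}

/* What an if-converting compiler may emit: the load becomes speculable
 * and executes regardless of the check, and only the result is selected
 * afterwards. Fine when i is trusted; a cache side channel when i is
 * attacker-controlled, because the out-of-bounds line gets touched
 * unconditionally. */
uint8_t lookup_ifconverted(size_t i) {
    uint8_t loaded = table[i];            /* unconditional (speculative) load */
    return (i < table_len) ? loaded : 0;  /* branchless select */
}
```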
> There’s no problem with the routing itself; everything is known for all latencies when the L2P mapping is done. The problem is in the physical realization of that routing in minimal time. A one-level tree would have so many sources that we’d need another pipe cycle in there to keep the clock rate up.
So, to clarify: the problem is not that the routing information isn’t known far enough in advance, but that the results for the one-cycle latency ops don’t exist that far in advance? And anything specified as latency 2 or more will exist a full cycle in advance?
Sorry. Anyway, I meant that instead of an entire belt, the code would specify sources as the latency-1/2/3/4 results of each FU, plus the actual spiller registers. Since a specializer can know statically where the spiller puts overwritten results, that’s not a problem in conASM. Is belt renaming really that important?
- in reply to: Pipelining #3226
So, Rust? I mean, I think traits are a little less powerful than full typeclasses with all the bells and whistles enabled, but they’re pretty good for most work. Shame that the Rust guys decided to keep the angle-bracket syntax, though.
- in reply to: Prediction #3223
Don’t some Mills also have deferred branches? Since multiple branches that retire in the same cycle are defined to resolve to the first address in issue+slot order, this has some potential to lock in a code path for prediction: issue a deferred branch some cycles ahead, and if it’s known taken, we know enough to lock in that address and start the I$ load immediately. Of course, that mechanism works better for open code, since most Mill loops will be single-cycle pipelined loops that can’t schedule the exit branch as anything other than immediate.