Forum Replies Created
- NXTangl (Participant), March 4, 2019 at 11:17 am. Post count: 14.
Contrary to what everyone else is saying, this isn’t possible on the Mill as we know it.
However, it seems likely that it could be made possible. Unfortunately, it may be more trouble than it’s worth; maybe it’ll be in the GenASM but never actually implemented, like Decimal. Anyway, here’s my take:
- Have a new type of data that can be operated on, aside from float, integer, and pointer: the int60. The int60 is like a pointer in that ops on it are specified only at pointer width.
- To use int60, a SpecReg has to be set indicating one of the three-bit patterns in the reserved bits as the int60 pattern (default 000).
- SpecRegs also set the handlers for int60 overflow ops. There are ADD, SUB, MUL, DIV, MOD, DIVMOD, AND, OR, NOT, XOR, SHL, SHR, TOINT (with output width, overflow behavior, and output signedness arguments), FROMINT (with input signedness arguments), and DROP (explicitly-called destructor op).
- Basically: treat the value as a 60-bit int when its flag bits match the three-bit “this is actually an integer” pattern, and otherwise call the handler routines, treating it as a pointer.
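The tag test and one handler-dispatching op can be sketched in C. This is a toy model of the scheme above, not Mill hardware: the bit layout (tag in bits 63..61, payload in the low 60 bits, bit 60 unused) and the handler signature are my assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout, not the Mill spec: bits 63..61 hold the 3-bit
 * reserved pattern, the low 60 bits hold the integer payload. */
#define TAG_SHIFT  61
#define TAG_MASK   (UINT64_C(7) << TAG_SHIFT)
#define INT60_MASK ((UINT64_C(1) << 60) - 1)

/* SpecReg-selected pattern marking "this is actually an integer";
 * default 000 as in the proposal above. */
static uint64_t int60_tag = 0;

static int is_int60(uint64_t w) {
    return ((w & TAG_MASK) >> TAG_SHIFT) == int60_tag;
}

/* ADD op: do 60-bit arithmetic when both operands carry the int60
 * tag; on wrap out of bit 59, defer to the SpecReg-installed
 * overflow handler (a hypothetical signature). */
static uint64_t int60_add(uint64_t a, uint64_t b,
                          uint64_t (*on_overflow)(uint64_t, uint64_t)) {
    assert(is_int60(a) && is_int60(b));
    uint64_t sum = (a & INT60_MASK) + (b & INT60_MASK);
    if (sum >> 60)                  /* carried out of the 60-bit field */
        return on_overflow(a, b);
    return (int60_tag << TAG_SHIFT) | sum;
}
```

The untagged path (pattern mismatch) would instead route to the pointer-style handler routines, which this sketch leaves out.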
- NXTangl (Participant), February 25, 2018 at 8:00 pm
A couple of questions/thoughts.
What are the advantages of using the belt for addressing over direct access to the output registers? Is this purely an instruction density thing?
Why does the split mux tree design prevent you from specifying latency-1 instructions with multiple drops? Couldn’t you just have a FU with multiple output registers feeding into the latency-1 tree? I’m not able to visualize what makes it difficult.
For that matter, how does the second-level mux tree know how to route if the one-cycle mux tree only knows a cycle in advance? It seemed to me that either both mux trees would be able to see enough to route the info, or neither would. Does this have to do with the logical-to-physical belt mapping? That’s the only thing I can think of that the second-level mux tree would have over the one-cycle tree.
- NXTangl (Participant), February 19, 2018 at 11:36 am
On at least some architectures, you could use the spiller together with the prefetching service to mitigate (not eliminate, but reduce the reliability of) cache-snooping techniques. Since fast PLB lookups would drop NaRs on the belt before the cache gets touched at all, the most reliable way to snoop would be to fill the cache, portal out (to the task switcher or the victim code), and then time loads when the portal call returns.
However, there’s nothing stopping you from spilling the cache-line base addresses across calls, so you can reload the old cache data in advance of a predicted return. Of course, that reduces other processes’ ability to snoop.
Not sure how viable this is; regardless, it’s offered as an open-source technique, free of charge.
I came up with this while thinking about how to potentially fix Spectre on OOO machines. My idea was to eliminate the side channel by tagging every cache line with the hardware turf that originated it, so that IPC could avoid a full cache invalidation. But then I realized that the attacker could still snoop by observing which of their own lines were evicted, so we’d need a full cache invalidation to hide that anyway.
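The eviction leak that defeats the turf-tagging idea can be shown with a toy model. This is my own illustrative sketch, not Mill behavior: a tiny direct-mapped cache where each set records its owner turf, an attacker primes its own lines, and a victim access in another turf still betrays itself by evicting one of them (classic prime-and-probe).

```c
#include <assert.h>
#include <stdint.h>

/* Toy direct-mapped cache: each set holds one (turf, line) owner.
 * SETS and the struct are illustrative, not any real geometry. */
#define SETS 8
typedef struct { int turf; uint64_t line; int valid; } CacheSet;
static CacheSet cache[SETS];

/* Touching a line installs it in its set, evicting the previous
 * occupant regardless of which turf owned it. */
static void access_line(int turf, uint64_t line) {
    CacheSet *s = &cache[line % SETS];
    s->turf = turf;
    s->line = line;
    s->valid = 1;
}

/* Attacker-side probe: turf tagging stops it from reading another
 * turf's data, but it can still see whether ITS OWN line is still
 * resident -- that residency bit is the side channel. */
static int probe(int turf, uint64_t line) {
    const CacheSet *s = &cache[line % SETS];
    return s->valid && s->turf == turf && s->line == line;
}
```

If the attacker (turf 1) primes sets 0..3 and the victim (turf 2) then touches line 11 (which maps to set 3), the attacker’s probe of its own line 3 misses: the eviction pattern leaks the victim’s access even though no cross-turf data was ever readable.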
- NXTangl (Participant), January 21, 2021 at 2:49 pm
Your reply makes me wonder what it would take to make a fully position-independent operating system that can be started in either the ALL turf or a smaller turf under a “hypervisor” in ALL. There’s certainly no advantage to doing it that way, and in fact at that point you might genuinely be in danger of running out of address space, but it’s an interesting thought experiment.
- NXTangl (Participant), February 26, 2018 at 2:03 pm
So I assume that means that the open-source specializer generator will itself be written in ordinary C++ and be able to generate x86 or ARM code that can do the specialization, right? Because if the only way to generate conASM for a specializer is the closed-source specializer, you’re going to have an obvious trusting trust problem.
Of course, adding complex, self-hiding backdoor code to the specializer and debugger would probably make specialization and debugging very slow, which would make most genASM JIT very slow; and as you often note, you’re in it for the money.
- NXTangl (Participant), February 26, 2018 at 12:37 pm
Well, I guess it’s mostly proof against optimizing compilers that do if-conversion on bounds check code when generating genASM. Since most bounds checks are going to pass, this is a valid optimization IFF the index doesn’t come from untrusted input. Of course, I suppose you can’t really stupid-proof an architecture against a compiler.
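What if-conversion does to a bounds check can be shown in two lines of C. This is my own illustration, with hypothetical function names; the second form is what an if-converting compiler might emit, where the load is hoisted above the check and only the result is selected. Note that in plain C the speculative load is undefined behavior when `i` is out of bounds; on a Mill-like machine it would yield a NaR, which is exactly why a compiler may consider the transform legal even though it touches the untrusted address.

```c
#include <stddef.h>

/* Guarded form: the load never executes when the check fails. */
int get_checked(const int *a, size_t len, size_t i) {
    if (i < len)
        return a[i];
    return -1;
}

/* If-converted form: both arms are evaluated and one result is
 * selected, so a[i] is loaded even when i is out of bounds. That
 * unconditional touch of an attacker-controlled address is what
 * defeats the bounds check against untrusted input. */
int get_if_converted(const int *a, size_t len, size_t i) {
    int speculated = a[i];   /* load hoisted above the check (UB in
                                plain C for out-of-range i) */
    return (i < len) ? speculated : -1;
}
```

Both functions compute the same value for in-range indices; the difference is purely in which memory gets touched on the failing path.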
- NXTangl (Participant), February 26, 2018 at 12:21 pm
There’s no problem with the routing itself; everything is known for all latencies when the L2P mapping is done. The problem is in the physical realization of that routing in minimal time. A one-level tree would have so many sources that we’d need another pipe cycle in there to keep the clock rate up.
So, to clarify: the problem is not that the routing information isn’t known far enough in advance, but that the results for the one-cycle latency ops don’t exist that far in advance? And anything specified as latency 2 or more will exist a full cycle in advance?
- NXTangl (Participant), February 26, 2018 at 12:09 pm
Sorry. Anyway, I meant: instead of an entire belt, the code specifies its sources based on the latency-1/2/3/4 results of each FU, plus the actual spiller registers. Since a specializer can know statically where the spiller puts overwritten results, that’s not a problem in conASM. Is belt renaming that important?
- NXTangl (Participant), February 25, 2018 at 7:42 pm
- NXTangl (Participant), February 25, 2018 at 7:12 pm
- NXTangl (Participant), February 23, 2018 at 11:13 am
Don’t some Mills also have deferred branches? Since multiple branches that retire in the same cycle are defined to resolve to the first address in issue+slot order, this has some potential to lock in a code path for prediction: issue a deferred branch some cycles ahead, and if it’s known taken, we know enough to lock in that address and start the I$ load immediately. Of course, that mechanism works better for open code, since most Mill loops will be single-cycle pipelined loops that can’t schedule the exit branch as anything other than immediate.