Forum Replies Created
- Ivan GodardKeymasterJune 28, 2018 at 12:48 pmPost count: 506
A “one-shot” drop is an interesting idea we have looked at. It would not actually simplify the spiller, which would still need to be able to handle cases in which all drops were multi-shot. The savings would come in power, and to a lesser extent in spiller bandwidth in its paths to memory. The cost is entropy in the encoding and complexity of the implementation.
It is clear that bit per drop encoding has the same functionality as an op with a bitmask. Roughly 65% of ops (dynamic) have drops, while multidrop ops are rare enough to ignore. Consequently bit-per-drop costs ~0.65 bit per op. If you figure IPC of 6 for a mid-range Mill, the overall entropy would be ~4 bits per instruction. In comparison, an op and a bit mask in complete generality (like rescue) would need a belt’s worth of bits (16 for midrange), plus the entropy of the op itself, say 9 to 15 bits depending on slot load. It’s clear that bit-per-drop has the better entropy, even before considerations of slot pressure.
The separate op approach is also somewhat redundant with the rescue op. An unrescued belt entry (i.e. one that occupied a position when the rescue executed but wasn’t mentioned by the rescue) doesn’t get spilled; it’s just marked as invalid and a reference to it will fault. The same applies to branches that reconfigure the belt. A oneShot op says which operands will be used once before they are used, while the rescue op says which drops were one-shots after they were used. We don’t currently use rescue just to winnow dead values from the belt, but we could if power measurements showed it would be worth while.
- Ivan GodardKeymasterJune 21, 2018 at 2:45 pmPost count: 506
The Microsoft press release is pretty content free, but it appears to be a direct descendant of the Texas TRIPS system, see https://en.wikipedia.org/wiki/TRIPS_architecture. TRIPS was in response to the barrier to parallelism represented by the quadratic complexity of reservation and issue stations in out-of-order CPUs. The basic idea was to statically isolate closely-connected (by dataflow) sets of instructions in which intra-set flows were numerous and extra-set flows were few. These sets could then be scheduled as a unit, sharply reducing the amount of hardware dependency analysis required because only the extra-set flows need to be reserved, whereas a conventional OOO must reserve for every flow of every instruction.
The reservation station barrier was widely recognized at the time, and both TRIPS and EPIC (Itanium) architectures were a response, as in a way is the Mill phasing. There have also been a lot of paper/sim designs that try to recognize micro-threads that can be optimally scheduled statically within the thread and dynamically scheduled outside the thread. The architectural problem is two-fold: how to statically recognize dependencies in large enough code granules to be useful, and how to package and issue the recognized granules in a low-overhead way. The recognition problem immediately bumps into C pointer semantics; despite a ton of academic work, compiler micro-task recognition doesn’t work except for BLAS-like loops with Fortran semantics.
Whereas recognition is a compiler problem, package-and-issue is a hardware problem, known as the “dispatch box problem” in dataflow architectures (https://en.wikipedia.org/wiki/Dataflow_architecture). Academic work has shown that large real programs have ILPs in the hundreds, but only at very long range; ILP at short range is widely believed to be around two, except for “embarrassingly parallel” applications. To the best of my knowledge TRIPS never cracked the recognition problem, but their dispatch hardware was one of the more interesting approaches that did not bound the granule size. EPIC (IA64) abandoned trying to bundle dataflows in favor of temporally horizontal execution (VLIW), though I think Bob Rau would have returned to dataflow had he lived long enough. Mill has both horizontal (wide issue/VLIW) and vertical (phasing/dataflow) scheduling, and might be seen as a conceptual descendant of a marriage TRIPS and EPIC. However, we made the dispatch problem tractable by statically limiting the dataflow granule, to three cycles in the current Mills, avoiding the variable-sized approach of TRIPS.
The comments above are mostly about TRIPS and not about Microsoft. TRIPS I rate as a fascinating but ultimately unsuccessful approach. What Microsoft has done with it I don’t know.
- Ivan GodardKeymasterMay 21, 2018 at 7:55 pmPost count: 506
You wouldn’t do swapping in the core using order tagging. The way that hardware works there would be a swap delay on every operation, needed or not, which would cost you at least 25% of your clock rate.
Ad hoc swapping is rare and is best done with explicit operations. The only important case is when there is a file of the wrong endianness and the buffer must be swapped, or in general when an app is written to presume an endianness that is different from the native (little) of the core. The Mill has a status bit that lets it act like a big-endian machine, swapping on all access to memory. This is the same approach as used by PPC, MIPS, and some others; it’s not original with us.
- Ivan GodardKeymasterFebruary 26, 2018 at 12:37 pmPost count: 506
- Ivan GodardKeymasterFebruary 26, 2018 at 12:33 pmPost count: 506
- Ivan GodardKeymasterJuly 16, 2018 at 9:31 pmPost count: 506
I’ll give you the best present most recent final word about GC, but must caution that we have not yet implemented a GC and this may not be as final as we’d like.
1) Pointers reserve three bits for use by the garbage collector; hardware knows this and for example does not examine these bits when testing pointer equality. There are a set of hardware masks that are checked when a pointer is used to access memory; the three bits index into the mask, or, when storing a pointer, the concatenated bits of the pointer being stored and the address being stored to are used as the index. If the indexed bit is set then the operation traps to software. With this, one can write a generation GC that does not “stop the world” because by setting the masks and allocators right one can get a trap whenever there is an up-generation store of a pointer and add the stored pointer to its roots.
2) There are several situations in which software must be able to access program state held in the spiller. GC is one such situation, but also debuggers and various statistics and diagnostics tools as well. System software presents a trusted API for such inspection; there is no direct access because spiller state is not in application address space. The API interacts with the portal call mechanism and permission granting. For example, if the app makes a portal call passing a callback function which the service then calls, and then takes a segv in the callback, the debugger will be able to inspect the state of the callback and of the app, but not the state of the service that is between the two (unless it requests and is granted permission for that address space too, of course).
The API only has bitwise access to the contents of the spilled data stack, but not the spilled control stack. The API provides enough information to construct a backtrace, but the debugger does not have direct access to return addresses, downstack links, and the rest of the call machinery; Mill is very attentive to matters of security.
3) GC and other software such as debuggers can get permissions to mutate saved state such as spilled pointers.
- Ivan GodardKeymasterMay 28, 2018 at 12:40 amPost count: 506
It’s taken a while for me to wade through the RISC-V document; sorry for the delay in response. The current Mill bit manipulation operations are a historical grab bag that we had planned to review and cull/extend when we had enough of a code base to get useful numbers. Your post prompted me to start that process; it’s relatively low priority so I don’t know when changes will happen.
I’m pretty sure that we should have a funnel shifter unit for the present shift and rotate group, although that implies more gates for the widest operands (compared to a normal shifter), and there are issues applying funnel shift to vector arguments.
There has been a long-standing desire for Galois Field operations, i.e. bit matrix multiply. A 8×8 BMM is straightforward, but things get out of hand at greater width – the control data for a quad x quad BMM is 4k bytes. That much data sounds like an I/O device rather than an operation.
Another desire is bit-stream parsing, chopping dynamic numbers of bits off the front of a byte-stream. For this the bit-chopping is not hard, but dealing with whether a chop has to advance the source or not is hard in a static machine. This one will likely wait until we add general streamer support.
Lastly, even simple bit-field extract/inject is problematic when the field is dynamic. An inject needs four arguments – source1, source2, position and size – and four-argument slots would be very expensive. We could construct such an op with ganging, but then you use two slot positions, while building the inject out of shifts and masks only needs six ops. The PPC version of funnel includes a mask step the would reduce the cost of an idiom even further. It’s not clear that dynamic inject is common enough to be worth the trouble. A static inject only needs two belt arguments, but it then needs logN encoding bits for the position/size info, which may increase the entropy of the slot supporting the op.
We’ll chew on these and other issues; thank you for shaking our tree.
- Ivan GodardKeymasterApril 26, 2018 at 1:34 pmPost count: 506
- Ivan GodardKeymasterFebruary 26, 2018 at 12:48 pmPost count: 506
Not quite. Everything is known well in advance, and the time for the actual computation may be the same for one and two res ops. However, after a result is computed it must be routed from the producer to all the consumers it will have in the next cycle. That route time is independent of which op produced the result. The route time is largely dependent on the number of sources, i.e. the number of distinct operands that can be dropped in each cycle, which is the fan-in to the routing crossbar.
We want to keep the routing time low, because every operation pays that cost. So we want few belt sources, and need to move low-value ops out of the way. We do that with the two-stage crossbar, and relegate all multi-result ops and all non-lat-1 ops to the second stage. There may be a two-res op that is fast enough that its result is available as fast as a simple add. But putting those two results on the fastpath would add a cost to every other fastpath op, including critical things like add.
- Ivan GodardKeymasterFebruary 26, 2018 at 12:35 pmPost count: 506
The specializer doesn’t know where the spiller puts things, because it cannot predict the dynamic control flow and hence the belt demand. As soon as there is a trap/interrupt or the main program goes through a function or label pointer, or even a conditional branch, well, that’s all she wrote.