Forum Replies Created
- in reply to: Single register #2164
In this code fragment “endIndex” is a literal and does not need to be spilled; the comparison will test against the literal directly. However, if “endIndex” is a computed value then it must be available for each iteration. In that case, “endIndex” will not be spilled to scratchpad unless the EBB that forms the loop body is so long that it would fall off the end anyway if the EBB were straight-line code rather than a loop. That is, the fact that it’s a loop doesn’t matter to the spilling.
Looping doesn’t matter because belt values carried over a control-flow join (such as the bottom of the loop) are kept on the belt; the inbound branches carry the values as arguments, and the join makes the belt congruent regardless of where control flow came from; this is a hardware version of the “phi” operation used in the Static Single Assignment code representation. Irrelevant belt positions are discarded, and “endIndex” will start the next iteration at or near the front of the belt.
When live values (including loop-carried values, as here) drop off the belt before passing through a join, they are spilled to scratchpad and then filled for the next use. In the hardware, the technology used for the scratchpad is in fact the same as is used for a large register file; the only difference is how you get at the bits, and that save-restore for function calls is automatic. Consequently there is no power or latency improvement in using a register over using scratchpad, and adding a register for loops would needlessly complicate the compiler and the access logic.
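For concreteness, here is a hypothetical C++ fragment (mine, not the code from the original post) showing the two cases:

// Hypothetical illustration only; not the poster's original fragment.
int sumPrefix(const int* data, int count) {
    int endIndex = count / 2;              // computed value: live in every iteration,
                                           // so it is kept on the belt or spilled/filled
    int total = 0;
    for (int i = 0; i != endIndex; ++i)
        total += data[i];
    return total;
}
// If the bound were a literal ("i != 100"), it would be encoded in the compare
// operation itself and nothing extra would be carried around the loop.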
IMO Moore’s Law has been dead for a decade already.
There’s been quite a bit of marketing around Soft Machines, but a dearth of “how it works” technical content. Some patents are beginning to show up, but so far nothing on the critical part: the thread cracker. Everybody and his brother claims to be able to split a conventional thread into independent threads (of some granularity) and run them on independent cores. Nobody has succeeded except in carefully chosen test cases that are already embarrassingly parallel; if your app is loop-heavy, the iterations are independent, and the work in one iteration is complicated enough to justify the cost of migrating the cracked thread’s pieces to another core, then a system like Soft Machines should work just fine. Such loads are common in HPC – and rare elsewhere.
I’d be delighted if Soft can get it to work. But I’d like to see it work, or a paper describing how the cracker works in enough detail to evaluate from first principles, like we have done for the Mill.
I have seen nothing so far.
- in reply to: Open Architecture? #2151
Years of hard work given away for free? Your “backing” is another way to spell “theft” perhaps? 🙂
- in reply to: EDGE Architecture / DATA FLOW #2149
The largest difference is that the Mill is intended to be a commercial product and so must adapt to commercial realities. Thus (for example) we must use commodity memory, so Wavescalar processor-in-memory techniques are off the table.
The classical problem with dataflow, since the MU-5 days, has been the match-box, where retiring operations are matched with those awaiting arguments to mark them as ready for issue. The match-box is functionally the same as the scoreboard or similar device in an OOO. The match-box is in effect a CAM, and multi-hit CAMs (a single retire can enable arbitrary numbers of other operations) scale badly, which is in part why OOOs don’t have more pipelines. One-or-none CAMs, like those in caches, scale fairly well, but multi-hit has been a problem.
These research machines attempted to get around the match-box problem by changing the granularity of the match: from single operations (as in the MU designs) to whole subgraphs, where the subgraph was located by a compiler. You still have multi-hit, but if the subgraphs are big enough to keep the FUs busy for a few cycles your required match rate could drop low enough to handle the multi-hits. That was the idea, anyway.
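For readers who have not met a match-box, here is a toy software model (my sketch, not MU-5 or EDGE hardware) of the matching step; the thing to notice is that one retiring result can wake any number of waiting operations at once, which is the multi-hit behavior that makes the CAM hard to scale:

// Toy model of a dataflow match-box; illustration only, not real hardware.
#include <cstdint>
#include <vector>

struct Waiter {
    uint32_t neededTag[2];   // result tags of the arguments still awaited
    int      pending;        // how many arguments are still missing
};

// A result with tag 'retiredTag' retires. In hardware this comparison is a
// CAM lookup; because arbitrarily many waiters can match the same tag
// (multi-hit), the wakeup fan-out must be handled in a single cycle.
std::vector<int> wakeup(std::vector<Waiter>& waiters, uint32_t retiredTag) {
    std::vector<int> nowReady;
    for (int i = 0; i < (int)waiters.size(); ++i) {
        for (uint32_t tag : waiters[i].neededTag) {
            if (tag == retiredTag && waiters[i].pending > 0 &&
                --waiters[i].pending == 0) {
                nowReady.push_back(i);   // all arguments present: ready to issue
            }
        }
    }
    return nowReady;
}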
However, that calls for a smart compiler to isolate the subgraphs. Research has shown that the average local ILP (i.e. what a dataflow needs) is 2; Mill phasing gets that to 6, essentially by overlapping instructions, but it is still a very small number for the kind of massive machines these projects anticipate. However, research also shows that there is very high remote ILP available; 10k instructions and more, far more than is needed to keep a machine busy. The problem is that languages and programs make it very hard to isolate subgraphs that are spread far enough over the program to pick up the remote ILP.
I personally have long felt that dataflow machines are well worth re-exploring using modern fabrication technology; the MU-5 was from the discrete-component generation. If I worked in academia, that’s where I would put some of my research effort. But the Mill’s goal is to sell chips, not sheepskins, and so we must content ourselves with the available local ILP in unchanged codes written in unmodified languages. So no Mill will give a 1000X performance boost the way a dataflow might if the problems could be fixed.
In some pipelinable loops we can get 20X performance, but over the breadth of general-purpose code we are happy to see 5X average, because there’s just no more there regardless of how many function units are present in the hardware. A good OOO can also see better-than-two ILP, although because they do it at runtime they suffer from the limited number of instructions they can keep in flight and match for issue; scaling again. Still, a high-end OOO can see the same ILP numbers our lower mid-range sees, and in that performance region we must compete on power and area. Only at the higher end do we have performance headroom that an OOO cannot reach.
It is not clear whether such conversion should be a scalar operation (which seems to be what you are suggesting) or a stream node, or perhaps both. Variable-length stream parsing is a bear (and bit-stream parsing is worse), but we are trying for more general mechanisms rather than spot solutions. However, Mill streams are NYF.
#1: they could, and perhaps some members might be implemented that way. However, the store might only partly overlap the load, so the logic to do the grab might have to do a shift-and-merge, which is non-trivial hardware, and there are a lot of retire stations. The L1 already has the shift-and-merge logic (it must, to put stored data into the right place in the cache line), and aliasing interference is rare, so it is cheaper to let the store go to the L1 and reissue the load from the station.
Note that the first try for the load will have caused the loaded line to be hoisted in the cache hierarchy, so the retry will find the data (both the newly stored part and any that had not been overwritten) higher – and faster to get – than it did on the first try.
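To make the shift-and-merge concrete, here is a software sketch (mine, not the actual retire-station or L1 logic) of the byte-level work needed when a store only partly overlaps a buffered load:

// Illustration of a partial-overlap merge; software sketch only.
#include <algorithm>
#include <cstdint>
#include <cstring>

// Merge the bytes of a store [storeAddr, storeAddr+storeSize) into the buffered
// data of a load [loadAddr, loadAddr+loadSize). The overlap can start at any
// byte offset, so hardware doing this would need a shifter plus a per-byte
// merge mux -- in every retire station.
void mergeStoreIntoLoad(uint8_t* loadData, uint64_t loadAddr, size_t loadSize,
                        const uint8_t* storeData, uint64_t storeAddr, size_t storeSize) {
    uint64_t beg = std::max(loadAddr, storeAddr);
    uint64_t end = std::min(loadAddr + loadSize, storeAddr + storeSize);
    if (beg >= end) return;                      // no overlap: nothing to do
    std::memcpy(loadData + (beg - loadAddr),     // destination offset within the load
                storeData + (beg - storeAddr),   // source offset within the store
                end - beg);                      // overlapping byte count
}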
#2: Cache policies are full of knobs, levers and buttons in all architectures, and the Mill is no exception. It is quite likely that both the complement of policy controls and their settings will vary among Mill family members. What you suggest is one possible such policy. The current very-alpha-grade policies in the sim try to preemptively push updates down, except if the line has never been read; this distinction is an attempt to avoid pushing partial lines that are output-only, to avoid the power cost of the repeated partial pushes. This and other policies are certain to change as we get more code through and start to move off sim so we get real power numbers. Fortunately, none of this is visible in the machine model, so apps can ignore it all.
- in reply to: Instruction Encoding #2137
From comp.arch, on how the Mill operation specification machinery works:
On 5/20/2016 8:04 PM, Ivan Godard wrote (in the “instruction decoding perfected” thread, to deprecate manual instruction bit-layout):
> In our spec machinery you supply traits for each encodable attribute.
> One of those traits tells the bit-allocator when the decoder will need
> the value. An attribute can be “pinned”, in which case it gets its own
> bitfield (which need not be contiguous) in every operation in the slot,
> needed or not; or “direct”, in which case it gets its own field, but
> that field is only in the ops that use that attribute and may be in
> different locations in different ops; or “merged”, in which case it is
> cross-producted with other attributes to make a single larger value-set
> that is encoded as if it were an attribute itself; or “uncoded”, in which
> case it doesn’t appear in the binary at all but is carried as metainfo for
> the specification machinery.

People have asked for me to post Mill details occasionally. In case anyone is interested, as a p.s. on how specification-driven layout works, here’s a cut from the actual source code that declares the operation attributes for Mills. The enumerations for those attributes that are enums are declared elsewhere. An example is:
enum directionCode {
    leftward,
    rightward
};
This attribute is used in shifts and a few other operations. As you see below, the value of this attribute is given (in conAsm) by the operation mnemonic rather than by an argument (“byMnemonic”); it is encoded as part of a cross-product with other attributes (“merged”); and it has the same meaning and potential value-set across all Mills (“universal”). To add a new attribute, the architect must declare the enum (if there is one), add a line for the attribute to the declares below, and add a traits specification for the possible values; here are the traits for directionCode:
const
attrTraits
attrTraits::insts[intTraits::count] =
{
    attrTraits(leftward, "l",
        "toward greater significance"),
    attrTraits(rightward, "r",
        "toward lesser significance")
};

This gives the text that will be used in the mnemonic in conAsm to indicate the desired value (“l”/”r”), plus a short description that is used in online help and other documentation. For a simple attribute like this, the whole process takes around 15 minutes, including adding an operation that uses the attribute and rebuilding the world to generate the new binary encodings.
Here are the specs for all instruction attributes used in the Mill as specified 2016/05/21.
declareEnumeratedAttrTraits(accessCode, byMnemonic, merged, universal);
declareEnumeratedAttrTraits(awarenessCode, byMnemonic, merged, universal);
declareNumericAttrTraits(base0Code, byParam, uncoded, byMember);
declareEnumeratedAttrTraits(basedCode, byMnemonic, merged, universal);
declareNumericAttrTraits(bit0Code, byParam, direct, byMember);
declareNumericAttrTraits(bit1Code, byParam, direct, byMember);
declareEnumeratedAttrTraits(blockCode, byMnemonic, uncoded, universal);
declareEnumeratedAttrTraits(ccGenCode, byMnemonic, uncoded, universal);
declareEnumeratedAttrTraits(conBytesCode, byDerivation, pinned, universal);
declareNumericAttrTraits(con0Code, byParam, uncoded, universal);
declareEnumeratedAttrTraits(condSenseCode, byMnemonic, merged, universal);
declareEnumeratedAttrTraits(conSet, byMnemonic, uncoded, universal);
declareNumericAttrTraits(count0Code, byParam, direct, byMember);
declareEnumeratedAttrTraits(directionCode, byMnemonic, merged, universal);
declareEnumeratedAttrTraits(domainCode, byMnemonic, merged, universal);
declareNumericAttrTraits(elem0Code, byParam, direct, byMember);
declareNumericAttrTraits(field0Code, byParam, direct, byMember);
declareEnumeratedAttrTraits(exactitudeCode, byMnemonic, merged, universal);
declareEnumeratedAttrTraits(exclusivityCode, byMnemonic, merged, universal);
declareNumericAttrTraits(extCount, byDerivation, pinned, universal);
declareNumericAttrTraits(fault0Code, byParam, merged, universal);
declareNumericAttrTraits(imm0Code, byParam, direct, bySlot);
declareNumericAttrTraits(lit0Code, byParam, pinned, byMember);
declareNumericAttrTraits(lit1Code, byParam, pinned, byMember);
declareNumericAttrTraits(lit2Code, byParam, pinned, byMember);
declareEnumeratedAttrTraits(memCode, byMnemonic, uncoded, universal);
declareNumericAttrTraits(memAttrCode, byParam, merged, universal);
declareEnumeratedAttrTraits(memorySubOpCode, byMnemonic, merged, universal);
declareNumericAttrTraits(NaRcode, byParam, direct, byMember);
declareEnumeratedAttrTraits(off0Code, byParam, direct, universal);
declareNumericAttrTraits(opand0Code, byParam, pinned, byMember);
declareNumericAttrTraits(opand1Code, byParam, pinned, byMember);
declareNumericAttrTraits(opand2Code, byParam, pinned, byMember);
declareNumericAttrTraits(opand3Code, byParam, pinned, byMember);
declareEnumeratedAttrTraits(opCode, byMnemonic, merged, universal);
declareEnumeratedAttrTraits(overflowCode, byMnemonic, merged, universal);
declareEnumeratedAttrTraits(scaleCode, byParam, direct, byMember);
declareNumericAttrTraits(scratchCode, byParam, direct, byMember);
declareNumericAttrTraits(specr0Code, byParam, direct, byMember);
declareNumericAttrTraits(specw0Code, byParam, direct, byMember);
declareNumericAttrTraits(streamr0Code, byParam, direct, byMember);
declareNumericAttrTraits(streamw0Code, byParam, direct, byMember);
declareNumericAttrTraits(tag0Code, byParam, direct, byMember);
declareNumericAttrTraits(trap0Code, byParam, merged, universal);
declareNumericAttrTraits(widthCode, byParam, direct, byMember);
declareNumericAttrTraits(pop0Code, byParam, direct, bySlot);
declareEnumeratedAttrTraits(resCount, byMnemonic, merged, universal);
declareEnumeratedAttrTraits(resultCode, byMnemonic, uncoded, universal);
declareEnumeratedAttrTraits(roundingCode, byMnemonic, merged, universal);
declareEnumeratedAttrTraits(shrinkCode, byMnemonic, merged, universal);
declareEnumeratedAttrTraits(vectorFill, byMnemonic, merged, universal);

Feel free to ask questions.
Obligatory plug: if you’d like to be part of the Mill effort then see http://millcomputing.com/join-us/
- in reply to: How does multiple core shared memory work? #2135
We explicitly state that the Mill does not support inter-chip coherence. The experience with shared memory at such large scale has been disappointing, and the trend is to use packet technology beyond the chip.
Of course, it will be a while before we have to deal with customers at that scale, and we may someday reevaluate the decision. But for now we expect coherence, and sharing, to stop at the pins.
- in reply to: Single register #2170
Each instruction (really a half-instruction) on each side comprises some number of blocks, each of which carries a variable amount of fixed-size info. On the Mill there are three blocks per side, but the encoding mechanism can have any number (though an odd number is more sensible). On each side, one block decodes in the initial cycle, just as you describe; thereafter two blocks can decode per cycle, because we decode the array of blocks from both ends; this is why an odd number of blocks makes the most sense.
On the exu side (where the usual two-in-N-out ops reside) the info in all three blocks is treated as operations. Consequently, we get one block’s worth of ops at the end of D0, and two more at the end of D1. D2 is used to match the logical belt numbers from the ops to the physical data locations, and the crossbar routing happens during the transition between D2 and X0 (aka D3). One-cycle ops (ALU typically) execute in X0, and their results are available to other ops in the following cycle. Thus the basic pipeline has four stages: three decode and one execute.
The flow side is slightly more complicated. Only the initial (D0) block contains operations; the remaining two blocks contain arguments to those operations. All flow ops contain two two-bit fields that specify how many arguments that op takes from the other two blocks. The flow decoder can pick out those fields at the start of D0, run them through what amounts to a priority encoder, and have control lines at the start of D1 to route the arguments from their block to the intended flow operation.
Of the two arg blocks, one is a simple vector of morsels. Each morsel is big enough to hold a belt operand number, so (depending on member) morsels are 3/4/5/6 bits long, and the two-bit field selects 0/1/2/3 of them. The morsel arguments are used for belt references, small literals and other incidental data. The other flow-side arg block is a byte array. The two-bit field in the flow op selects 0/1/2/4 bytes from this array.
Consequently each flow-side op has a main opcode (from the initial block), up to three argument morsels, and up to 4 bytes of immediate data. The opcode is available at the end of D0, the args at the end of D1, and D2 is used to map logical belt numbers to physical, just like the exu side. In fact the two half-instructions become one during D2.
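As a software sketch of the routing described above (my model of the mechanism, not the actual decoder), a prefix sum over the two 2-bit counts gives every flow op its starting offset into the morsel vector and the byte array:

// Toy model of flow-side argument routing; illustration only.
#include <cstdint>
#include <vector>

struct FlowOp {
    uint8_t morselCount2;   // 2-bit field: takes 0/1/2/3 morsels
    uint8_t byteCount2;     // 2-bit field: selects 0/1/2/4 bytes
};

struct ArgSlice {
    unsigned morselOffset, morselCount;  // slice of the morsel block
    unsigned byteOffset,   byteCount;    // slice of the byte (immediate) block
};

// The real hardware does the equivalent with what amounts to a priority
// encoder during D0, so the routing controls are ready at the start of D1.
std::vector<ArgSlice> routeArgs(const std::vector<FlowOp>& ops) {
    static const unsigned byteMap[4] = {0, 1, 2, 4};  // 2-bit code -> byte count
    std::vector<ArgSlice> out(ops.size());
    unsigned m = 0, b = 0;                            // running prefix sums
    for (size_t i = 0; i < ops.size(); ++i) {
        unsigned nm = ops[i].morselCount2 & 3;        // 0..3 morsels
        unsigned nb = byteMap[ops[i].byteCount2 & 3]; // 0/1/2/4 bytes
        out[i] = { m, nm, b, nb };
        m += nm;
        b += nb;
    }
    return out;
}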
Back to your question: where is the bitmap? In the literal argument, of course. And it only uses as many bytes as are actually needed. And there can be two or more ops using bitfields, because the arg bytes that hold the bitfields are not in the block that has the opcodes and the two-bit fields.
So the only part of Mill encoding that is actually variable length is the block, and each block’s content is fixed-length. We do have a linear parse at the block level, but not at the operation level. It’s as if we needed to parse two consecutive x86 instructions, and had two cycles to do it in, which is easy.
BTW, some flow ops can need more arg data than the 4 bytes + 3 morsels that come with the slot – a call with a long arg list for example, or a con(stant) op with a quad (128-bit) argument. For these we gang additional slots to the main slot, each adding four bytes plus three morsels to the available data arguments. By ganging you lose a potential operation in that instruction (the gang slot has a unique opcode), but you can have all the data you want, up to the maximum that the block shifters handle. The shifters can handle the maximal single operation/argument combination, so all possible operations can be encoded, if necessary in one-op instructions. The specializer knows the instruction capacity, and won’t put more ops (with their arguments) into an instruction than will fit.
So if you have three slots that support the call op, a Mill instruction can contain three calls: each has the code branch offset in its immediate, and has morsels for up to three belt arguments to the call. If one call takes more arguments then that will reduce the number of calls you can have in the instruction as the big call grabs slots for its additional arguments. In practice slot overflow for call is uncommon; most functions have three or fewer arguments. The big consumer of flow-side argument encoding is floating-point literals.
- in reply to: Single register #2166
Belt contents are immutable; the belt is strictly SSA. However, although you cannot move a value on the belt, you can create a new value that is the same as an existing value, and your example of adding zero would do that. OR with zero would likely consume marginally less power than add with zero, especially if the operand you want to save is a vector.
So there’s a better way: the rescue op. This does the same belt normalization that control flow operations do, except without actually transferring to a new address. Rescue comes in two forms. One specifies the belt positions to move to the front using an argument list. This form can arbitrarily reorder the rescued operands, and can also duplicate operands; in this it acts exactly like (and uses the same hardware as) the call and retn operations. However, the position-list argument uses one morsel of encoding space per position.
The other form of rescue specifies the desired operands using a bitmap of the positions, which are moved to the front in numeric order and dups are not possible. The underlying implementation and effect is exactly the same, but the bitmap is a more compact encoding if you rescue more than two operands.
Because rescue is done entirely in the name-mapping logic it has zero latency, just like the equivalent functionality in branches and calls. It also does not require a trip through an ALU, so it uses less power than e.g. adding zero, especially as the name-mapping is updated every cycle anyway. And although the belt effect is the same as that of a branch to the locally following instruction, there is no actual control flow, so the transfer-prediction hardware (with its power overhead and risk of a prediction or icache miss) doesn’t get involved.
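Here is a toy model (mine, not the real name-mapper) of what rescue does to the logical-to-physical belt mapping; the point is that no data moves and no ALU is involved, only the mapping of names to buffers changes:

// Toy belt name-mapping model; illustration only.
#include <cstdint>
#include <vector>

// belt[i] is the physical buffer holding logical belt position i (0 = front).
// Rescue builds a new mapping whose front is the rescued positions; the data
// paths are untouched. What becomes of un-rescued positions is not modeled here.
std::vector<unsigned> rescueList(const std::vector<unsigned>& belt,
                                 const std::vector<unsigned>& positions) {
    std::vector<unsigned> front;
    for (unsigned p : positions)            // list form: any order, dups allowed
        front.push_back(belt[p]);
    return front;
}

std::vector<unsigned> rescueBitmap(const std::vector<unsigned>& belt,
                                   uint32_t mask) {
    std::vector<unsigned> front;             // belts are short (8..32 positions),
    for (unsigned p = 0; p < belt.size(); ++p)   // so a 32-bit mask suffices
        if (mask & (1u << p))                // bitmap form: numeric order, no dups
            front.push_back(belt[p]);
    return front;
}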
Sound questions.
1) Frame exit invalidation is done in the Well Known Region for the stack. It is not necessary to clear the valid bits, because the exited frame is no longer in the user’s address space; return automatically cuts back the stack-permission WKR. However, there is an issue with inter-core references to foreign stacks. Core B has no WKR for core A’s stack, so in normal mode threads cannot reference each other’s stacks, and threads cannot reference their own exited frames. However, there is a legacy mode in which inter-stack references are permitted. In legacy mode the entire stack, both used and unused, has a PLB entry for the turf. So if there are two legacy threads running in the same turf then they can browse each other’s stacks, and can browse the rubble in their own stacks. We have never found a way to restrict inter-stack references to only the live portion; all practical implementations seem to require exposing rubble too.
2) The implementation of the TLB is per-member; it is transparent to the programmer. As currently implemented in the sim, a miss in the TLB requires a search of the PTE tables. However, unallocated regions of the 60-bit address space do not need PTEs and do not occupy physical memory. If the search leads to an address where a PTE might be but is not, then the Implicit Zero logic will return a synthetic line of zeroes. The PTE format is carefully arranged so that a zero PTE means that the region is unallocated in DRAM – and the load in turn returns implicit zero. Thus the page table is in virtual memory, not physical, and there are only entries for mapped space.
Thus the miss will trigger a search, which may miss, which triggers a search… The Tiny Alice algorithm we use bounds the recursion to the number of distinct range sizes (currently 5), but in practice the intermediate entries are in the LLC and the search depth is nearly always <= 2.
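Here is a much-simplified sketch of the idea (mine, not Tiny Alice itself; it uses toy fixed-size levels and ignores that the table reads themselves go through virtual memory): a walk step that finds a zero PTE simply reports “unallocated”, and the original load then returns implicit zero instead of faulting or allocating DRAM.

// Simplified model of implicit-zero table walking; not the real Tiny Alice.
#include <cstdint>
#include <unordered_map>

// Physical backing: only allocated lines exist; everything else reads as zero.
static std::unordered_map<uint64_t, uint64_t> backing;

uint64_t readLine(uint64_t addr) {
    auto it = backing.find(addr);
    return it == backing.end() ? 0 : it->second;   // implicit zero
}

// One translation: walk toy 9-bit levels. A PTE of zero means "unallocated";
// the walker stops and the load that caused the miss returns implicit zero.
bool walk(uint64_t tableBase, uint64_t vaddr, int level, uint64_t& paddrOut) {
    uint64_t index = (vaddr >> (12 + 9 * level)) & 0x1FF;
    uint64_t pte = readLine(tableBase + index * 8);
    if (pte == 0) return false;          // unallocated region: no fault, no DRAM
    if (level == 0) {                    // leaf: pte holds the physical page address
        paddrOut = pte + (vaddr & 0xFFF);
        return true;
    }
    return walk(pte, vaddr, level - 1, paddrOut);   // next-level table
}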
- in reply to: code density #2144
They do have to be held somewhere; “somewhere” is called the scratchpad. Each function appears to get its own scratchpad. There is only one physical scratchpad, which is an in-core SRAM, not memory, and it is reachable only by the “spill” and “fill” ops. The “appearance” of one scratchpad per frame is provided by the spiller. Mill operation phasing lets an instruction contain “fill(), fill(), add(), spill()” to execute scratchpad-to-scratchpad without stalls, if desired. The instruction takes a single cycle, every time, sustained as long as you like. The equivalent in a general-register RISC that has run out of registers would be “load; load; add; store”, but because it uses memory the latency is longer, even if there are no cache misses. Machines like the VAX or x86 that have M2M operations suffer the same latency/miss problems, just encoded differently. The Mill avoids all that.
- in reply to: When should we expect the next talk? #2139
Same answer: when we have some real numbers we are happy enough with to demo. Actually, we might do one on some of the system facilities first. The system software for backless memory and the TLB recently struggled to its feet, and a talk about how Tiny Alice works would be a worthy topic.
But the real problem is that creating a talk is so much work, and we’re busy. 🙂 We really want to put the tool chain and sim up on the web so people can play with it, and that probably should come before more talks.
- in reply to: spiller work optimisation #2130
Yet another possibility is a call variant that implicitly discards all the belt. If there were any belt operands actually live over the call then they would have to be explicitly spilled. While loops generally have loop-carried live operands, it’s fairly common in open code for a call to consume everything live. Worth doing? No idea; measurement needed.