Forum Replies Created
- in reply to: The Compiler #3437
Please accept my apologies, folks. Somehow a recent update of the Forum software removed me from the notify list, so I was unaware of the discussion. I’ll be working through the backlog promptly.
Ivan
- in reply to: Convertible notes? #3436
Please accept my apologies. Somehow a recent update of the Forum software removed me from the notify list, so I was unaware of the discussion. I’ll be working through the backlog promptly.
A new subscription to the mailing list triggers a notification to the list maintainers and eventually to me, which in turn results in the subscriber receiving an investment package describing the Notes, the company, and the SEC-mandated process for investing. Because Mill Computing is not a public company, the SEC rules prevent us from posting the information publicly (such as here), because that would constitute an “advertisement” which is forbidden unless you file a registration statement with the government. However, when someone voluntarily joins the mailing list we are legally permitted to reply. The process may seem convoluted, but in practice has worked well to protect the naive from boiler-room securities salesmen. In most cases, anyway.
Ivan
Efficient representation of arbitrary precision math is a specific example of the more general problem of runtime typing, which is a problem we have long had on our backlog.
Of the various suggestions on this thread:
1) math ops do not trap, they fault: when a fault happens, you’re not coming back. Think C++ “throw”.
2) the “excepting” forms only fault on overflow, and while they could be used to catch a wide add for example, they then would not catch a subtract.
3) NaRs do not fault on speculable operations; they just pass through, and all the math ops are speculable (see the sketch after this list).
4) We could indeed define a new “APinteger” domain. Currently the domains are signed and unsigned integer, signed and unsigned fixed point, logical, boolean, binary and decimal float, and some pseudos for various unique ops. The new domain would add the set of ops to the ISA, which would have entropy consequences in the encoding. However, implementing the behavior via traps is real problematic on a Mill because of the wide issue – a single instruction could have a ton of distinct traps. We currently do not have any result-replacing traps defined; all the traps are for things like page faults which are fundamentally asynchronous to the core and belt. Trap-based programming could be done, but it would thoroughly muck up the execution engine, and we haven’t and probably won’t put such a thing in.
5) AP integer is a special case; the Mill tends to do the general case of things. The general case is an efficient runtime-typing mechanism that is type-system agnostic. We have done some thinking about how to do runtime-typing in the Mill environment, but don’t have anything yet ready for publication, or even for patent filing. I wish I could be more informative; sorry.
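To make item 3 concrete, here is a minimal C++ sketch of NaR pass-through (the Operand struct and op names are illustrative, not the Mill ISA): speculable ops never fault and just propagate the NaR bit, and the deferred fault fires only when the value reaches a non-speculable op such as a store.

    #include <cstdint>
    #include <cstdio>
    #include <cstdlib>

    // Toy model of a belt operand: a value plus a NaR metadata bit.
    struct Operand {
        int64_t value;
        bool nar;  // "Not a Result": set when a speculated op faults
    };

    // Speculable op: never faults, just propagates the NaR bit.
    Operand add(Operand a, Operand b) {
        if (a.nar || b.nar) return {0, true};
        return {a.value + b.value, false};
    }

    // Non-speculable op (e.g. a store): the deferred fault finally fires.
    void store(Operand a) {
        if (a.nar) {
            fprintf(stderr, "fault: NaR reached a non-speculable op\n");
            abort();  // like the faults above: you're not coming back
        }
        printf("stored %lld\n", (long long)a.value);
    }

    int main() {
        Operand ok{2, false}, bad{0, true};   // `bad` models a faulted speculation
        Operand sum = add(add(ok, ok), bad);  // NaR flows through the math silently
        store(sum);                           // ...and faults only here
    }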
Please accept my apologies, folks. Somehow a recent update of the Forum software removed me from the notify list, so I was unaware of the discussion. I’ll be working through the backlog promptly.
Ivan
- in reply to: Pipelining on a Gold member of the Mill family #3343
The Wikipedia article (https://en.wikipedia.org/wiki/Software_pipelining) is a pretty good intro to pipelining; the academic papers that Google turns up are more for the specialist due to often opaque notation, or are CS slide decks that don’t get very far into the subject.
In general, piping lets you fold the body of a loop in such a way that more than a single iteration is in flight concurrently. As well as other constraints on folding, there is a fundamental resource constraint: if the loop has two floating-point operations and the hardware has only one FP unit then you can’t fold the loop to fewer than two instructions. If the hardware has two FP units then you can (well, may be able to) fold the whole loop into one instruction. And if it has four then you may be able to fold two iterations into a single instruction. You do that by unrolling the loop once and then scheduling it as if the result were a single, bigger iteration. There is also a notion of fractional iterations: if the hardware has three FP units you can unroll twice to get a body with six FP ops that will schedule into two instructions with no FP units left over – one and a half iterations per instruction.
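To illustrate the arithmetic, here is a small C++ sketch (function and parameter names are mine, not anything in the toolchain) of the resource bound: the minimum number of instructions per folded iteration is the ceiling of ops-needed over units-available, maximized over functional-unit kinds.

    #include <cstdio>
    #include <utility>
    #include <vector>

    // Resource lower bound on a pipelined loop body:
    // ceil(ops / units), maximized over functional-unit kinds.
    int min_instructions(const std::vector<std::pair<int, int>>& need_have) {
        int bound = 1;
        for (auto [ops, units] : need_have) {
            int c = (ops + units - 1) / units;  // ceiling division
            if (c > bound) bound = c;
        }
        return bound;
    }

    int main() {
        // Two FP ops on one FP unit: can't fold below two instructions.
        printf("%d\n", min_instructions({{2, 1}}));  // prints 2
        // Unrolled to three iterations on three FP units: six ops in
        // two instructions, i.e. 1.5 iterations per instruction.
        printf("%d\n", min_instructions({{6, 3}}));  // prints 2
    }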
As a matter of actual practice, the demands of a particular loop very rarely exactly match the resources available, so you schedule for the tightest constraint and leave some resources left over. In your example loop, on a Gold, there is one FP op but at least three flow ops (load, store, branch) and maybe more depending on how the compiler did the address update recurrence. Consequently on a Gold the tightest constraint will be the flow ops. Currently Gold has five flow slots, not quite enough to fit two iterations, although we have been playing around with a slightly richer configuration that has six and would handle two iterations. So yes, a Gold can issue ~30 ops per cycle, but only if they are the right kinds of ops that match the available set of functional units. Your loop is very unbalanced in its demands: Gold has four FP units, so your expression in the loop could have been:
A[i] = A[i]*2.0 + A[i+1]*3.0 + 4.0
and it would still fit in one instruction.
The FU population selected for each Mill family member is partly a matter of intended market and partly a matter of hardware practicality. Thus a member intended for HPC supercomputer use will have more FP than one intended for use as a micro-controller (which may have no hardware FP at all, and will rely on software emulation). On the hardware side the belt limits the total number of values dropped per cycle, and it’s unclear whether we can build belts larger than 32 positions. It’s easy and relatively unconstrained to add more ALUs and FP units, but four accesses to the data cache per cycle is about all the current hardware design can handle at the desired clock rate.
About the version with the “i-1”. Loops like this have what are called loop-carried variables, where a value computed in one iteration is used in a following iteration. The standard technique for dealing with them on a genreg machine, as originated by Monica Lam, is to unroll and inject reg-to-reg copies. This is completely unnecessary on the Mill, because the drops from prior iterations are still on the belt and can be addressed directly without unrolling or copying. The Cydra5 and later the Itanium also addressed this issue, and Bob Rau invented rotating registers to deal with it. His solution made the registers into a circular queue, whereas the belt is non-circular, but for the purposes of loop-carried data they are the same.
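A toy C++ model of why the belt needs no copies for loop-carried values (the queue here is my illustration; real belt addressing is done by the hardware remapper): each iteration’s drop remains addressable at a fixed relative position, so the next iteration just reads it.

    #include <cstdio>

    int main() {
        double A[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        double belt[16] = {0};  // model: results drop into a queue
        int head = 0;

        auto drop = [&](double v) { belt[head++ % 16] = v; };
        auto pos  = [&](int b) { return belt[(head - 1 - b) % 16]; };  // b0 = newest

        drop(A[0]);                        // "warm-up" drop before the loop
        for (int i = 1; i < 8; ++i) {
            drop(A[i]);                    // this iteration's load
            // the prior iteration's value is simply at belt position b1:
            double sum = pos(0) + pos(1);  // no reg-to-reg copy, no unrolling
            printf("A[%d] + A[%d] = %g\n", i, i - 1, sum);
        }
    }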
- in reply to: Control flow divergence and The Belt #3328
The passing branches are the same regardless of the number of operands passed; with more arguments the list is simply longer: brs(“foo”, b4, b6) with two, brs(“foo”, b4, b6, b2, b2, b3, b8, b0, b9, b3, b11) with ten.
All inbound arcs must carry the same number of operands, so if B passes 10 to C then A must also. Consequently C sees the same belt structure regardless of how or from where it received control.
Perhaps you are thinking that C needs only some of the operands when control comes from B and fewer when it comes from A, and can tell the difference based on the values of the common arguments. Such a situation would be analogous to a VARARGS function call. That notion does not exist for branches in common languages and is not directly supported by the Mill, although in principle the specializer could supply dummy arguments on the A->C arc to fill out the branch.
- in reply to: Mill vs. Spectre: Performance and Security #3349
We don’t do SMT. Our numbers suggest that available chip area and power resources are better put into more cores instead of into SMT sharing of a single core. It may seem wasteful to have a core idling in stall when a load misses in cache, but for this architecture the overhead of trying to use the core for something else is greater than any saving from sharing. And, for free, having no SMT means a large number of security attacks are impossible on a Mill.
- in reply to: Pipelining on a Gold member of the Mill family #3347
Yes, phasing lets us put entire multi-operation dataflows in a single instruction. Piping lets us overlap what would be different instructions if it were open code and not a loop.
- in reply to: Pipelining on a Gold member of the Mill family #3345
Phasing is a separate issue, having nothing to do with piping. It makes a difference only in open code, not loops, where it is redundant to piping. We would still pipeline to the same degree if there were only one phase.
Analysis inside LLVM discovers that the assignment to A[i] is the same value as the A[i-1] of the next iteration. This is a routine analysis/optimization done by all production compilers. As a result of this analysis, the genAsm the Mill specializer receives has only one load op in the loop body, and a “warm-up” load of A[-1] before the loop is entered. The modified loop body pipes normally. Doing two iterations per cycle is essentially unrolling the loop once and then piping the unrolled loop; it adds complications in the prologue and epilogue if it is not demonstrable that the total iteration count is evenly divisible by two, but the piping process itself is normal.
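Here is a hedged source-level sketch of that transformation (the exact loop from the thread is assumed; the caller must make A[-1] readable, matching the warm-up load described above):

    #include <cstdio>

    // Before: two loads per iteration.
    void before(double* A, int n) {
        for (int i = 0; i < n; ++i)
            A[i] = A[i] * 2.0 + A[i - 1] * 3.0;
    }

    // After: the stored value is carried into the next iteration,
    // leaving one load per iteration plus a warm-up load of A[-1].
    void after(double* A, int n) {
        double carried = A[-1];  // warm-up load before the loop
        for (int i = 0; i < n; ++i) {
            double v = A[i] * 2.0 + carried * 3.0;
            A[i] = v;
            carried = v;  // loop-carried; no reload of A[i-1]
        }
    }

    int main() {
        double buf[9] = {1, 1, 2, 3, 4, 5, 6, 7, 8};
        after(buf + 1, 8);  // buf[0] serves as A[-1]
        printf("%g ... %g\n", buf[1], buf[8]);
    }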
The Mill has some special operations to simplify the prologue (retire(), inner()) and epilogue (leave()); these are touched on in the videos.
- in reply to: Mill vs. Spectre: Performance and Security #3341
Protection granularity will bite you in the butt. There is no one size – er, granularity – fits all. The way we organize programs these days seems to lead to three broad granule ranges of interest in different contexts.
There’s atomic granules, in which the item is indivisibly unitary; you deal with it as a whole, and it’s not meaningful to treat it as a composite with substructure. An int or char, say.
Then there’s collection granules, in which the item is composite and you expect to access substructure and move around the inside, but it has a hull that you are not supposed to wander beyond. Think array, struct, class.
Then there’s arena granules: random assemblages of other items, united in that a contained item may refer to another contained item explicitly. Think processes and protection spaces.
These different granularities need different implementations, different controls, and different programming practices and language constructs. Which may need different cost models and different points at which you say “don’t do that; if you do you deserve it” and leave it unprotected.
At the atomic granularity most machines, including Mill, encourage but do not enforce granule protection. There’s nothing to prevent you from taking an int apart into pieces; it’s just annoying to do.
At the arena granule level the Mill has turfs, which have been covered in the talks. They give better granule and access control than the processes offered by other machines.
The middle size, collection granularity, is poorly supported except by segmented architectures and capability machines. The problem is that putting every array or malloc result in a separate process or turf is way overkill and will hose machine throughput because the turnover rate is too much to track. And you can’t use an atomic-granularity mechanism because these things have substructure. The industry has followed the lowest-common-denominator approach, and there’s no collection-granularity protection at all in mass market machines.
Surprise! The Mill offers collection-level granularity. A granule can be anything from a byte to all of memory. It works with existing code and ordinary languages without reprogramming. You can navigate around the substructure safely, and it keeps you from navigating outside. It supports the C/C++ one-past-the-end rule. To support it we lose a couple of bits in the physical address space, and require that the granule be a power-of-two size; if you navigate into padding within the granule, well, you deserve it.
We are very unsure how popular this Mill feature will be. Nearly all existing codes violate the language access boundary rules. Run times and OSs too. Think about memmove(…) – would you be surprised to know that it uses C undefined behavior when it decides whether to copy up or down? So the standard Mill runtime will offer a --UNSAFE-MALLOC option that lets you treat the whole process as a single collection-level granule. And you will deserve it.
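For the curious, a minimal sketch of power-of-two granule checking (this is my illustration, not the Mill’s actual address encoding): a size exponent rides along with the pointer, and the one-past-the-end address is deliberately legal.

    #include <cstdint>
    #include <cstdio>

    struct Granule {
        uint64_t base;    // granule start, aligned to 1 << log2sz
        unsigned log2sz;  // granule size is a power of two
    };

    // May the access [addr, addr + len) touch this granule?
    // One-past-the-end (addr == base + size, len == 0) is allowed.
    bool access_ok(Granule g, uint64_t addr, uint64_t len) {
        uint64_t size = uint64_t{1} << g.log2sz;
        return addr >= g.base && addr + len <= g.base + size;
    }

    int main() {
        Granule g{0x1000, 6};                      // a 64-byte granule
        printf("%d\n", access_ok(g, 0x1000, 64));  // 1: the whole granule
        printf("%d\n", access_ok(g, 0x1040, 0));   // 1: one past the end
        printf("%d\n", access_ok(g, 0x1040, 1));   // 0: outside the hull
    }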
- in reply to: Control flow divergence and The Belt #3339
There are several possible implementations for the remapper. I’m not a hardware guy, but I’m told that there’s no one-size-fits-all; some approaches work better for short belts but don’t scale to very long belts, which use a different approach, and where the implementation cutover(s) are is also process and clock dependent.
One representative approach is to attach a physical number to each operand when created using a simple take-a-number. Each consumer knows what physical # it wants, and broadcasts that to all the sources; the source with an operand with matching physnum responds with the operand. This is a straightforward MxN crossbar, and comprises a large part of the area and power on the larger members. A conventional genreg machine has a similar crossbar running from all registers to all FUs. I’m told that there are well known ways to use hardware cleverness to reduce the cost of such crossbars, but they are not well known to me.
The crossbar works in physnums; it is the job of the remapper to convert from belt position numbers (beltnums) to physnums. The mapper (for an example 16-position belt) is a circular array of 4-bit adders, where each adder holds the physnum corresponding to the beltnum index in the array. Translation is just indexing into the array based off the current origin. Advancing the belt is just bumping the origin with wraparound. Dropping new physnums is taking the operation drop count and the statically defined drop order to get a zero-origin sequence of dropping physnums, which is biased by the origin and used to replace the content of the affected adders. Bulk remap (including branch) just builds a new zero-origin drop list by using the beltnums from the op to index the adders (this is just like indexing for arguments of any op) and then dropping the whole thing normally. There is lots of time in the cycle to do a 4-bit index for a 4-bit result and a couple of 4-bit adds in time for the physnums to be available for broadcast.
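A toy C++ version of that take-a-number scheme (names are mine; the real hardware is an array of 4-bit adders, not integers):

    #include <cstdint>
    #include <cstdio>

    struct Remapper {
        uint8_t phys[16] = {0};     // physnum held at each belt slot
        unsigned origin = 0;        // current belt front, wraps mod 16
        unsigned next_physnum = 0;  // simple take-a-number counter

        // beltnum -> physnum: just index relative to the origin
        uint8_t lookup(unsigned beltnum) { return phys[(origin + beltnum) % 16]; }

        // drop one new operand at the front of the belt
        void drop() {
            origin = (origin + 16 - 1) % 16;     // advance with wraparound
            phys[origin] = next_physnum++ % 16;  // assign the next physnum
        }
    };

    int main() {
        Remapper r;
        r.drop();  // one op drops a result...
        r.drop();  // ...then another
        printf("b0 -> phys %u, b1 -> phys %u\n", r.lookup(0), r.lookup(1));
    }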
An alternative useful for the larger members does not use beltnum to physnum to broadcast but instead maps directly from beltnum to residence number. Each possible location that can hold an operand gets a resnum; there are a lot of locations on the bigger members so resnums are 6 or 7 bits. The mapper works the same way, but the adders respond to the index step with a resnum, not a physnum, and the drops are resnums not physnums too. The resnums directly address the datapaths feeding the crossbar, so there’s no broadcast although the crossbar itself remains. However, when an operand is moved from one residence location to another the mapper adder content must be updated with the new resnum; this is in effect a drop. In the broadcast approach the physnum of an operand never changes so operands can be moved without updating the mapper.
Or so I’m told. It’s all transparent to the compiler level that I work at.
- in reply to: Control flow divergence and The Belt #3336
Close.
For polyadic arguments the arg list uses the bits in both the bytes in the con block and the morsels in the ext block, interleaved by flow slot. The decoder has a compositing buffer holding all these bits for all slots. For a four-bit-morsel machine, each slot provides 44 bits (four bytes + three morsels), so with 7 slots (typical of a mid-range Mill) the compositor has 308 bits. As the maximal arg list for such a member has 16 args of four bits, the maximal list needs 64 bits and the buffer can hold almost five such lists. More to the point, two slots provide 88 bits, whereas the maximal list needs 64, so any arg list will need at most two slots, one with the main op and one with a flowArgs op for the extras.
Actually, composition demand is greatest for con ops (literals). If our member has 128-bit (quad) operands then the 308 compositing bits can hold two quad literals, each literal using three slots with a slot left over for lagniappe.
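The bit budgets are easy to check; a few lines of C++ under the member parameters assumed above (7 flow slots, 4-bit morsels, 4 con bytes + 3 ext morsels per slot):

    #include <cstdio>

    int main() {
        const int morsel   = 4;                   // bits per morsel
        const int per_slot = 4 * 8 + 3 * morsel;  // 44 bits per flow slot
        const int buffer   = 7 * per_slot;        // 308-bit compositing buffer
        const int max_list = 16 * morsel;         // 64 bits: 16 four-bit args
        printf("buffer %d bits holds %d whole maximal lists\n",
               buffer, buffer / max_list);        // 4 ("almost five")
        printf("a maximal list spans %d slots\n",
               (max_list + per_slot - 1) / per_slot);  // 2
    }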
In operation, the decoder selects the pair of two-bit fields for each slot right out of the flow block in the instruction buffer. That can be done without even knowing how many flow slots the instruction uses; the decoder will just select garbage bits beyond the end. The selected pairs go through a priority decode to set up muxes to select which bytes/morsels in the con/ext blocks belong to each slot. That decode/select takes a cycle, during which the con and ext blocks are being isolated by the instruction-level shifters. The following cycle the previously set up muxes route bits from the isolated con/ext blocks to the correct position in the compositing buffer, zero-filling as necessary. At this point we have the full 308 bits as one bit-array; any garbage bits are at the end of that array and are ignored because they belong to slots that the instruction doesn’t use.
The compositing buffer has a bunch of selector muxes to peel off particular fields that would be of interest to one or another op. Starting at each 44-bit slot boundary, there’s a selector that grabs a four-byte word that had come from the con block. Each of the three morsels that came from the ext block has its own selector. There’s a selector that starts at the slot and extends over the bits of adjacent slots to provide big con literals, and another that grabs big arg lists. Cons are right-aligned on slot boundaries in the compositor, so leading zeroes happen; there’s a “complement” bit in the main opcode that causes the selected literal to be ones’-complemented during selection so we can express negative literals (note: the encoding uses ones’ complement literals so we don’t need an adder in the decoder). Arg lists are left-aligned on slot boundaries in the compositor so garbage bits are at the end; we know from other arguments in the operation how long the list is, so the decoder actually gives the FU a single maximal list (64 bits) plus a one-hot mask (16 bits) that says which list position is valid.
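The ones’-complement trick is worth a tiny sketch (my names, not decoder RTL): widening a short literal zero-fills, and the complement bit inverts the selected bits, so negatives come out right with no adder anywhere.

    #include <cstdint>
    #include <cstdio>

    // Select a right-aligned con literal and optionally complement it.
    int64_t select_literal(uint32_t con_bits, bool complement) {
        uint64_t widened = con_bits;  // zero-filled leading bits
        return complement ? (int64_t)~widened : (int64_t)widened;
    }

    int main() {
        printf("%lld\n", (long long)select_literal(5, false));  // 5
        printf("%lld\n", (long long)select_literal(5, true));   // ~5 == -6
    }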
All the various selecting for all the slots takes place in parallel in the second decode cycle. During that same cycle we have decoded the main opcode (from the flow block that was isolated in the first cycle) enough to know which of the selected fields are meaningful for this particular op, so only the meaningful fields go to the FUs as operation arguments. Many of the FU arguments are belt numbers, either singly or as lists. Belt remapping is done by the flow and exu sides together, because both sides are evaluating or dropping operands concurrently. This takes place in the third decode cycle, a cycle ahead of use. That is, the remapper tracks what will be happening on the belt in the following cycle. Bulk rename (br/call/retn/rescue) arg lists are actually easier than simple belt eval/drop (add, LEA, etc.) because they don’t happen until after still another instruction boundary. The tightest timing is for con, which drops a cycle ahead of everything else. To make that work at high clock the belt addresses the literal directly inside the compositing buffer without moving it someplace else. Of course, the compositor will be reused for a different instruction the following cycle, so the literal does get moved out too, but there’s a whole cycle for that.
Whew! Well, you wanted to know.
- in reply to: Control flow divergence and The Belt #3330
Yes, each side has three variable-sized blocks, and within each block the element sizes are statically fixed (though not necessarily the same). In the exu side each block has a different op set, with the sets partitioned by phase, and each operation is complete in its set.
The flow side is different: only the flow block has operation elements like on the exu side, while in the con block the elements are bytes and in the ext block the elements are morsels big enough to hold a belt position number. Those flow ops that take an argument list (besides the branches there are the calls and various special ops including rescue, retire, and con) encode as much as they can in the flow block entry and put the rest of the list in the con and ext blocks. Each flow block entry has a pair of two-bit fields to indicate how much has been put in the other blocks: 0/1/2/4 bytes and 0/1/2/3 morsels.
If the list is long enough that it doesn’t fit in an entry plus 4 bytes and 3 morsels then the op is ganged with a flowArgs pseudo op in the next slot, which provides another 4 bytes and three morsels. Rinse repeat. The maximal sizes of the blocks (for each family member) are set so that it is possible to encode a belt list with entries for every position, or a con with enough bits to express the largest possible operand. The constraints of the ext and con blocks are recognized by the specializer, so that (for example) it may encode two calls with few arguments in a single instruction, but only one call with a long argument list even though there are two slots supporting call.
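A sketch of that packing rule in C++ (it ignores whatever part of the list fits in the flow-block entry itself, and assumes 4-bit morsels):

    #include <cstdio>

    // Each flow slot contributes up to 4 con bytes + 3 ext morsels = 44 bits;
    // longer lists gang flowArgs pseudo-ops in following slots.
    int slots_needed(int list_bits) {
        const int per_slot = 4 * 8 + 3 * 4;
        int slots = 1;  // the main op's slot
        for (int have = per_slot; have < list_bits; have += per_slot)
            ++slots;    // rinse, repeat: gang another flowArgs
        return slots;
    }

    int main() {
        printf("%d\n", slots_needed(2 * 4));   // short list: 1 slot
        printf("%d\n", slots_needed(16 * 4));  // maximal 64-bit list: 2 slots
    }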
- in reply to: Control flow divergence and The Belt #3326
Simplifying to two items.
Say A produces two items, which are at b2 and b5 when A->B happens and at b4/b7 when A->C happens. A->B does not carry, so B inherits A’s belt. Hence at B those items are still at b2/b5. B adds an item, so the items are now at b3/b6 when B->C happens.
A->C carries, mapping b4 and b7. Let us say b4 maps to b0 and b7 to b1 (it could have been the other way around; source order is not significant).
B-C carries, mapping b3 to b0 and b6 to b1.
C receives items in b0 and b1 regardless of which path was taken. In this case the items themselves are the same regardless of path, although B could have passed different items instead of just relaying what it got from A.
A carrying branch acts like half a call, except there’s no new frame and no return.
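Replaying the example in toy C++ (vectors stand in for belts, with index 0 as b0; values are made up): both arcs renumber their argument lists to the front of C’s belt, so C sees one layout.

    #include <cstdio>
    #include <vector>

    using Belt = std::vector<int>;  // index 0 is b0, the newest position

    // A carrying branch: listed positions land at b0, b1, ... at the target.
    Belt carry(const Belt& b, const std::vector<int>& args) {
        Belt out;
        for (int pos : args) out.push_back(b[pos]);
        return out;
    }

    int main() {
        Belt atA = {9, 9, 9, 9, 42, 9, 9, 77};  // at A->C: items at b4, b7
        Belt atB = {8, 9, 9, 42, 9, 9, 77};     // at B->C: items at b3, b6
        Belt viaA = carry(atA, {4, 7});
        Belt viaB = carry(atB, {3, 6});
        printf("via A: b0=%d b1=%d\n", viaA[0], viaA[1]);  // 42 77
        printf("via B: b0=%d b1=%d\n", viaB[0], viaB[1]);  // 42 77
    }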
- in reply to: Control flow divergence and The Belt #3318
The ISA documentation on the site is woefully out of date and we have no volunteer to resume the work on it; please accept my apologies. Here’s a conAsm fragment (from the printf() function as it happens) that shows carrying branches in use:
eql(b4 %46, 108) %435 ^0, eql(b4 %46, 104) %434 ^0, brtrs(b0 %434, "printf$17_75", b5 %51) ^0, brfls(b1 %435, "printf$17_9", b8 %48/460, b10 %50/461) ^0, brs("printf$17_75", b2 %399) ^0, rd(w(128)) %399 ^263;
You see that the two branches to ebb $17_75 are carrying two different arguments, %51 for one and %399 for the other.
Ebbs don’t really “produce results”, except in the sense that they contain store operations that update memory. I think you mean something more like liveness, or in-flight operands: those produced by an operation but not yet consumed by all consumers. Live operands are initially created on the belt, and are consumed from the belt. If at any point there are more simultaneously live operands than fit in the fixed capacity of the belt (16 positions on a Silver, from which the above fragment was drawn) then the code must spill the excess, initially to scratchpad and ultimately, if scratch itself fills up, to memory. The specializer uses heuristics to decide which operands to spill. Spill and fill are a fact of life for any machine because there are always codes that exceed local resource constraints, although the compiler algorithms to do spilling are very different on a belt machine than on a general register machine.
Spilling is also done in a few other situations. For example, if an operand is created in one ebb and not consumed until much later after a long chain of branches then it is better code to spill it at point of production and fill it at point of consumption than to pass it along across many value-carrying branches.
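A last sketch of the capacity rule (the liveness profile here is made up; choosing which operands to spill is the specializer-heuristic part): the number of operands that must be spilled is however far peak simultaneous liveness exceeds the belt length.

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    // Spills forced by belt capacity: peak liveness minus belt length.
    int spills_needed(const std::vector<int>& live_per_cycle, int belt_len) {
        int peak = *std::max_element(live_per_cycle.begin(), live_per_cycle.end());
        return std::max(0, peak - belt_len);
    }

    int main() {
        // toy liveness profile; a Silver has a 16-position belt
        printf("%d spill(s)\n", spills_needed({4, 9, 14, 18, 12, 6}, 16));  // 2
    }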