Forum Replies Created
- in reply to: Pipelining #1517
Ivan,
Thanks for clarifying. “Effectively unbounded ILP in the loops that matter” may be more accurate, but admittedly doesn’t have the punch that “unbounded ILP” has. IMHO, Intel’s increase to eight micro-ops peak per cycle (e.g. in Haswell) shows that they think there’s enough detectable parallelism in x86 code to warrant the increased power and chip area required.
Non-forgeable turf/thread IDs: What can we assume re their size and properties?
In the security talk (circa slide 19, around 21 minutes in), a turf ID is shown as one field in a region descriptor. Turf IDs are also said to be non-forgeable, though how that's achieved isn't stated. If region descriptors must fit into the PLBs, then either the PLBs must support variable-sized descriptors (hard to do while keeping the hardware fast and cheap) or the descriptors must be of fixed size.
If region descriptors are of fixed size and are constrained to be a single cache line each (true, or have I over-assumed?), then turf IDs are de-facto constrained in their length by the size of a cache line (on that Mill model), less the bits needed for the lower bound, upper bound and rights. If so, what is the minimum width (#bits) guaranteed for the turfID?
In general, short keys/hashes are less secure against attack, and a key length sufficient to be non-forgeable today is unlikely to remain so for very long. Are Mill turfIDs/ThreadIDs made non-forgeable via hash functions (as are the payloads in NaRs), or is there a different mechanism used to make turf (and thread) IDs unforgeable? What may we assume about the (minimum guaranteed) length and security properties of turf and thread IDs?
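For my own back-of-the-envelope arithmetic, here is a purely hypothetical bit budget in C; none of the field names or widths below come from Mill Computing, they just show how the bits might be apportioned on a member with 64-byte cache lines and 60-bit addresses:

```c
#include <stdio.h>

/* Purely hypothetical bit budget -- not the real Mill descriptor format. */
enum {
    CACHE_LINE_BITS = 64 * 8,   /* assume one 64-byte line per descriptor  */
    ADDRESS_BITS    = 60,       /* 60-bit virtual addresses, per the talks */
    RIGHTS_BITS     = 8         /* rights field width: a guess             */
};

int main(void) {
    /* lower bound + upper bound + rights; whatever is left over in the
     * line is the most the turf ID could occupy on such a member,
     * ignoring any other fields I don't know about.                     */
    int used = 2 * ADDRESS_BITS + RIGHTS_BITS;
    int max_turf_id_bits = CACHE_LINE_BITS - used;
    printf("at most %d bits left for the turf ID\n", max_turf_id_bits);
    return 0;
}
```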
- in reply to: Fundamentals & References #1481
IMHO, “Great Microprocessors of the Past and Present (V 13.4.0)” by John Bayko, online as of today at: http://www.cpushack.com/CPU/cpu.html
is worth a look and serves pretty well as a prequel to the article linked by the O.P. Bayko's article is focused on single-chip (or at least few-chip) CPU designs. Greatness is a matter of opinion, but I generally find myself agreeing with the author more often than not. Enjoy!
- in reply to: Pipelining #1465
Unbounded ILP in loops vs. average iterations in loops?
In several of the talks, Ivan has said words to the effect that,
“loops have unbounded ILP.”
Unbounded ILP would seemingly be true only for unbounded (e.g. while (1) {}) loops; for all others (whether simple counting loops or some form of while loop), the ILP is roughly proportional to the loop’s number of iterations, and thus bounded. Have I missed something, or am I being too pedantic?
I’ll freely concede that loops, including loops repeated enough to benefit from pipelining, are frequent enough in general-purpose code that an architecture that can pipeline virtually all loops should be a genuine boon. So, unbounded ILP or not, I’m really looking forward to more info on epilogs and LLVM progress for the Mill.
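As a trivially concrete example of the kind of loop I mean (my own illustration): every iteration below is independent of every other, so the parallelism available to a pipelining machine grows with n, large but still bounded by the trip count rather than truly unbounded.

```c
/* Each iteration is independent, so available ILP scales with n. */
void vadd(float *restrict c, const float *restrict a,
          const float *restrict b, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```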
More than two retires of latency-N ops per cycle: how do the crossbars handle this case?
In the belt lecture, in the section on data forwarding (slides 50 and 51, around 42 minutes into the talk), the latency-N crossbar is shown with only two outputs, both feeding into the latency-1 crossbar. In the accompanying description, Ivan says words to the effect that:
everybody else [results from lat-N ops] goes through a separate crossbar… that winnows all those inputs down to two inputs, which are in turn routed to the latency-one crossbar….
If the (only?) outputs of the lat-N crossbar are the two winnowed values that feed from the lat-N crossbar into the Lat-1 crossbar, how can the results of more than two lat-N operations retiring in cycle N be forwarded for use by ops in cycle N+1? (Or even make it to the belt itself?) Since family members such as gold have FU populations that can execute many more than two latency-N operations per instruction, I suspect there must be a way for the rest of those lat-N results to be forwarded, although exactly how isn’t clear to me.
I’ll suppress my urge to speculate on how this is/could be done, and hope that the answer doesn’t require stalls.
Greetings all,
Where did pick() operations (and the slots that implement them) go when Gold's spec changed?
In an earlier description of Mill slots and pipelines:
http://millcomputing.com/topic/introduction-to-the-mill-cpu-programming-model-2/#post-610
(Will, thanks for citing that!) Art mentions,
“4 pipes that can do pick operations”
In Gold in particular (since Gold’s a specific, evolving member for which we have some info revealed): in which slots on the current Gold Mill can pick operations now be encoded and executed?
My guess is that pick ops remain in the Mill ISA, but have been subsumed into other slots (perhaps exu-side?). I note that in Ivan’s latest slot tally for Gold, he doesn’t break the slot functions down to as fine a level. In the earlier description, there were examples of further breakdown by function, for example how many exu-side slots could do multiply() vs. how many others could do simpler ALU ops but not multiply().
If possible after considering NYF and Guru time/effort, I’d like to see a bit more detail about the FU and slot population on at least one member, probably Gold, since some detail on it has been revealed and the design is evolving.
The member’s evolution, and the motivation for the changes, will IMHO give readers some insight into Mill performance and the sorts of trade-offs that have been made (and can be made) using MillComputing’s specification and simulation tools.
One thing I’d like to see in the Wiki is a description of the instructions and any constraints they impose. (Ivan has mentioned some issues with potential FMA hazards that the scheduler has to avoid.) Ideally, such wiki entries would be automatically generated. But if not, then user-supplied explanations, eventually corrected by the MillComputing team, would give users a better understanding of what the Mill can do, especially at the abstract Mill level.
Theoretically, the Mill could have very few operations as its core ISA, with a larger set emulated via sequences of those core operations. While this could be taken to an extreme (how many operations does the classic Turing machine have?), I expect that for performance reasons (and the Mill is a performance-based sale) the Mill will define a sizeable core subset of ops that are implemented very efficiently on all members (in hardware, most likely), instead of as graph substitution of (longer sequences of) core ops. I’ve read hints that full divide (or was it just floating-point divide?) will probably not be an op in its own right, but will instead be emulated via substitution of divide-helper operations, which are expected to be in the non-emulated core. I know the divide-helper instructions are not yet filed. So I’m hoping the Wiki will soon (hope!) give us some insight into the general core vs. emulated operations, if not divide() itself.
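As a hedged illustration of the sort of graph substitution I have in mind (not the actual Mill expansion; recip_estimate() is a stand-in name I invented), a specializer could expand a floating-point divide into a reciprocal-estimate helper plus a couple of Newton-Raphson multiply steps:

```c
/* Stand-in for a hypothetical low-precision reciprocal-estimate helper;
 * it is exact here only so the sketch is self-contained and runnable.  */
static inline float recip_estimate(float b) { return 1.0f / b; }

/* Hypothetical expansion of a/b into helper + multiplies; real hardware
 * would seed with a coarse estimate, and each Newton step would roughly
 * double the number of correct bits.                                    */
float div_emulated(float a, float b) {
    float x = recip_estimate(b);
    x = x * (2.0f - b * x);   /* Newton-Raphson refinement step 1 */
    x = x * (2.0f - b * x);   /* refinement step 2                */
    return a * x;             /* a * (1/b); not correctly rounded */
}
```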
- in reply to: Pipelining #1304
Constraints on Mill software pipelining:
Based on the pipelining talk and the discussion above, I’ve been thinking about how far Mill pipelining can be pushed vs. where it cannot go. Specifically, what properties would make a loop un-pipeline-able (or unprofitably pipelinable) on the Mill, and how can those cases be recognized? So far, I can see two situations where the Mill approach to pipelining will likely have problems. This isn’t meant as a criticism. I suspect that universal loop pipelining is probably infeasible with finite hardware. IMHO, the key is to recognize when pipelining either won’t work, or would give poorer performance than leaving the loop unpipelined.
1. Loops where there are several different pieces of loop-carried data, for example, coming from different arrays. Since the Mill’s rotate() operation acts only on the most recent scratchpad allocation, I don’t see an obvious way to software pipeline a loop that has loop-carried data from multiple arrays, especially if the loop-carried data have differing loop distances. I suspect this is a fairly rare case.
2. Loops with a single chunk of loop-carried data, but where the loop distance (times the size of each loop-carried datum) is greater than the largest-guaranteed scratchpad allocation. Again, I suspect this situation will happen occasionally, though most loops won’t run into this constraint.
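To make those two cases concrete, here are my own C sketches of the loop shapes I have in mind (the arrays and recurrence distances are invented, not taken from the talks):

```c
/* Case 1: loop-carried values from two different arrays with different
 * recurrence distances (3 and 7 here), so two windows of live values
 * would have to rotate independently.                                   */
void case1(float *a, float *b, int n) {
    for (int i = 7; i < n; i++) {
        a[i] = a[i - 3] * 0.5f + b[i];
        b[i] = b[i - 7] + a[i];
    }
}

/* Case 2: a single recurrence, but with a distance so large that the
 * live window (DIST values) might exceed the smallest member's
 * guaranteed scratchpad allocation.                                     */
#define DIST 1024
void case2(float *a, int n) {
    for (int i = DIST; i < n; i++)
        a[i] = a[i - DIST] + 1.0f;
}
```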
The second of these situations suggests that there are some parameters the compiler will need to know about, in order to emit intermediate code that the specializer can cause to run correctly on all Mills, even the little ones. In my second situation, the maximum allocation permitted in a single scratchpad allocation operation (scratchf()?) on the smallest-scratchpad Mill is probably one such parameter.
Ivan, if it won’t run into NYF or related issues, can you give us an idea of how many belt operands the scratchpad can hold, before scratchpad needs help from the spiller? Even a rough Tin — Gold range would be nice to have a rough idea about. (I have a rough guess, but I won’t speculate here, so as not to cause IP/patent problems.)
Similarly, I’m curious about what other parameters are “baked” into the Mill architecture. If I recall correctly, the bit-width of the address space is another such. (64 bit pointers, with 60 bits of true address + four reserved bits for garbage collection and forking.) Since pointers must fit in (single) belt positions, it sounds like this requires a minimum height of 8 bytes for all Mill family members. The shortest belt length mentioned to date is 8 for the Tin. I suspect that a shorter belt length (e.g. 4) would effectively rule out wide issue (since results need to live on the belt for at least the spill/fill latency.)
Similarly, two streams of half-bundles (at least until we get that hypercube memory) and three blocks to decode within each half-bundle. Are there other baked-in Mill-architecture parameters?
- in reply to: Specification and Floating Point Numbers #1298
Greetings all,
FYI, I found a reasonably clear set of slides introducing UNUM and some of its properties at the following URL:
http://sites.ieee.org/scv-cs/files/2013/03/Right-SizingPrecision1.pdf
To me, this seems like adding a metadata field in a known location within the variable itself. Unfortunately, I see only one actual, bit-level example of such a UNUM in the entire presentation, on slide 31. IMHO, it would be far easier to weigh the author's assertions about economy of storage if he had included more examples, especially some showing a size other than an IEEE-754 32-bit floating-point value with (in this one example) an eight-bit "utag" on the end opposite the sign bit.
I’m somewhat skeptical about some of his arguments about economy. For individual variables, there might be some savings by using fewer bits. However, the number of needed bits (for a desired accuracy) becomes data-dependent.
IMHO, where this falls down is that compilers need to track the locations and sizes of variables. In most programs doing serious floating-point computation, one has lots of values, often in arrays, with all elements at known alignment. “Arrays” without uniform element size become a nightmare for indexing. So any savings in bits in such arrays would seem to come at the heavy price of having to go through the entire “array” of variably-sized elements, to find the one you wanted — every time you wanted to index into such an array. (To say nothing about how much work it would be to move things around, every time any element gains or loses a byte.)
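For comparison, here is a minimal sketch of the indexing cost I'm worried about (my own toy illustration, not the UNUM proposal's actual storage format): with fixed-size elements, indexing is constant-time address arithmetic; with variable-sized elements, you have to walk the prefix of the "array" on every access.

```c
#include <stddef.h>
#include <stdint.h>

/* Fixed-size elements: O(1) address arithmetic. */
static inline double *index_fixed(double *base, size_t i) {
    return base + i;
}

/* Variable-sized elements packed back to back, each prefixed with a
 * one-byte length (a toy stand-in for a utag): O(i) walk per access. */
static inline uint8_t *index_variable(uint8_t *base, size_t i) {
    while (i--)
        base += 1 + base[0];   /* skip length byte plus payload */
    return base;
}
```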
Maybe somebody else can see a way around the apparent indexing issues, but to me it seems unlikely to change the computing world and is IMHO off topic for the Mill, for the reasons Ivan stated.
- in reply to: Pipelining #1228
The retire() operation is brilliant!
To me, it shows the hallmark of a great invention: it seems “simple,” but only after I fully absorbed the flash of insight.
rotate(),
inner(),
leave()….
To quasi-quote President Carter, I lust in my heart for this CPU architecture!
Not just because it’ll be fast, but because it’s fast yet clean.
(I suppose it’s irrational, but even when I don’t have to look at the assembly code, I’d much rather work on a clean machine than on something I consider ugly.)
- in reply to: Inlining functions vs. call/return #1203
For simple[1], non-iterative functions, such as getter/setter methods on objects, I suspect there can be substantial performance gains from inlining vs. calling them. By inlining such function invocations, most (if not all) of the function’s load operations can execute earlier (be hoisted within the calling EBB) than they likely would if the function were called. So I suspect that inlining such simple functions will reduce the number of unused/no-op-ed slots executed, thus improving both speed and code density. In my experience, the sequence of:
1. get (some property of an object),
2. make a simple change,
3. set the property to the changed value
happens so often in object-oriented code that I think aggressive inlining of such sequences will make (compiled) OO-language code really fly on a Mill.
—-
[1] By simple, here I mean a function whose real work could be done in fewer operations than the target Mill could do in three full cycles, if the load latency were minimal (result available in the next cycle after the load.) For most getter methods, the real work is just a single load (usually from a compile-time known offset) from a base address. Likewise, a setter method is often just a store (again, frequently with a compile-time-known offset) to an address.
So an inlined version of a simple getter/setter function could take up merely a single load/store-capable slot in a much wider instruction. A fraction of a Mill instruction is faster and more compact than three full instructions (call, at least one instruction for the function body, return.) And IMHO, simple methods are called frequently in much object-oriented code.
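Here's a hedged sketch of the shape I mean, in plain C (the struct and field names are made up):

```c
typedef struct { int count; /* ... other fields ... */ } widget;

/* "Getter" and "setter": one load and one store at compile-time-known
 * offsets from the object's base address.                              */
static inline int  widget_get_count(const widget *w)  { return w->count; }
static inline void widget_set_count(widget *w, int v) { w->count = v; }

/* The common get/modify/set sequence.  Inlined, this collapses to a
 * load, an add, and a store that the compiler can hoist and schedule
 * inside the calling EBB, instead of three call/body/return round trips. */
void bump(widget *w) {
    widget_set_count(w, widget_get_count(w) + 1);
}
```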
Ivan,
Thank you for revealing the above info about the FU population in the current Mill Gold spec. I understand that it’s subject to change.
I take it that an lsFU is a load/store functional unit. An lsbFU sounds like a close relative of load/store, but what does the b in lsbFU stand for?
—-
With the exception of floating point on the exu side, it looks like slot zero on both sides can perform any operation that any higher slot on the same side can do. Is there an underlying architectural reason for that? (Other than sim indicating that doing so helps performance?)
- in reply to: Pipelining #1334
Ivan,
Thanks for explaining the basic intent of mini and maxi. I appreciate how much of the Mill architecture so far revealed has so few “baked in” parameters. The ability to specify the length, width, height, scratchpad size, etc. parameters per member is a fascinating capability and — I suspect — will supply great leverage in product/market agility.
IMHO, changing the number of blocks per half-bundle would be changing the recipe significantly, if an outsider who's barely peeked under the hood may be allowed such an opinion.
- in reply to: Pipelining #1323
Ivan wrote:
size of scratch in different members:
Copper: 128
Decimal16: 256
Decimal8: 256
Gold: 512
Maxi: 2048
Mini: 128
Silver: 256
Tin: 128
Ivan,
Thanks for that info!
In the bargain, we learn the names of four, previously unmentioned (so far as I’m aware) Mills. From two of those names, I’ll assume that there’s been some real interest in decimal floating-point capability. I wonder who/what market sector?
I find it interesting that Maxi has so much more scratchpad (4x) than a “mere” Gold. Is Maxi associated with the request (from Lawrence Livermore Labs, if I recall) about a supercomputing version? (The one that would have needed over a thousand contacts bonded out?)
Canedo wrote:
Has anyone made the connection yet between belt machines and queue machines?
From 2004 to 2008 I was part of Prof. Sowa’s lab where we developed the Parallel Queue Processor. We had a similar idea (FIFOs holding operands) with similar objectives: high parallelism, no register renaming, and short instruction set. http://dl.acm.org/citation.cfm?id=1052359
…That’s very interesting; another architecture using a form of FIFO for operands! Thanks for sharing those links.
One thing I admire about the Mill architecture, as described, is that the behavior of the Belt is prescribed in terms of what it does (programmer’s model) independent of how it’s implemented (in general, or on any given family member.) If I’ve understood from a quick skim, the above-mentioned Parallel Queue Processor’s queues are tied to a particular hardware implementation, namely a ring buffer. A ring buffer would not appear to lend itself to Belt operations like conform() — which permits arbitrary reordering of the belt’s contents.
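Here's a minimal conceptual sketch of the distinction as I understand it (neither half is how any real Mill, or the Parallel Queue Processor, is actually implemented): with a logical-to-physical renaming map, a conform()-style reorder only permutes a few small indices, whereas a ring buffer would have to physically move the operand data.

```c
#include <string.h>

#define BELT_LEN 8

typedef struct {
    unsigned long data[BELT_LEN];  /* physical operand storage            */
    unsigned char map[BELT_LEN];   /* logical belt position -> data index */
} belt_model;

/* conform(): impose an arbitrary new logical order.  With a renaming
 * map, only BELT_LEN small bytes change; the operand values stay put.
 * A ring buffer, by contrast, would have to copy the operands around.  */
void conform(belt_model *b, const unsigned char new_order[BELT_LEN]) {
    unsigned char next[BELT_LEN];
    for (int i = 0; i < BELT_LEN; i++)
        next[i] = b->map[new_order[i]];
    memcpy(b->map, next, sizeof next);
}
```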