Forum Replies Created
Binary translation is as practical on the Mill as on any other architecture; the task is essentially the same for any S-to-T translation, including with the Mill on either end. You un-model the source into an abstract representation similar to a compiler’s IR, optimize that to a fare-thee-well to remove redundant state setting (like condition codes), and then do a normal target code gen from the result.
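As a minimal sketch of that three-stage flow (lift, clean up, regenerate), here it is in C++; every type and function name below (SourceBinary, IR, TargetBinary, liftToIR, and so on) is a hypothetical illustration, not part of any actual Mill tool:

#include <cstdint>
#include <vector>

struct SourceBinary { std::vector<uint8_t> bytes; };   // S-side machine code
struct IR           { /* compiler-style abstract operations */ };
struct TargetBinary { std::vector<uint8_t> bytes; };   // T-side machine code

// Stage 1: un-model the source into an abstract, machine-neutral IR.
IR liftToIR(const SourceBinary&) { return {}; }
// Stage 2: optimize away redundant state setting (e.g. condition-code writes
// that nothing downstream reads).
IR removeDeadStateSetting(IR ir) { return ir; }
// Stage 3: ordinary target code generation from the cleaned-up IR.
TargetBinary generateTargetCode(const IR&) { return {}; }

TargetBinary translate(const SourceBinary& src) {
    return generateTargetCode(removeDeadStateSetting(liftToIR(src)));
}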
The difficulty, for all translations, is when the source code embeds implicit assumptions about the platform. Self-modifying code is obvious, but code often assumes word and pointer sizes, endianness, the presence (and behavior) of I/O devices, and so on.
There are platform dependencies beyond the ISA too. You may have a Windows x86 binary and translate it to execute on a SPARC or a Mill, but without a Windows OS to run on, it’s not going to be able to draw pictures on the screen.
So the issues with binary translation are roughly the same as for any other port even if you have the source to recompile. If a given code can already successfully run on big- and little-endian machines using both 32-bit and 64-bit models then it is probably port-clean enough to translate well. But even with perfect ISA translation, the port may not succeed.
- in reply to: Pipelining #1317
This reply addresses LarryP’s question about “baked-in” (as opposed to member-dependent) aspects of the Mill.
Similarly, I’m curious about what other parameters are “baked” into the Mill architecture. If I recall correctly, the bit-width of the address space is another such. (64 bit pointers, with 60 bits of true address + four reserved bits for garbage collection and forking.) Since pointers must fit in (single) belt positions, it sounds like this requires a minimum height of 8 bytes for all Mill family members. The shortest belt length mentioned to date is 8 for the Tin. I suspect that a shorter belt length (e.g. 4) would effectively rule out wide issue (since results need to live on the belt for at least the spill/fill latency.)
Very little is baked-in. Cache-in-virtual is, so the address space must be big enough to hold the space of all processes without overlap, which implies 64-bit pointers for general-purpose work. Extending shared addresses off-chip might require 128-bit pointers in some future supercomputer, but that’s not GP. Likewise, single-process apps (some embedded for example) might fit in 32 bits or smaller, but that’s not GP either.
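Purely as an illustration of the layout described in the question (60 bits of true address plus 4 reserved bits in a 64-bit pointer), here it is written out as a C++ bitfield; the field names, and exactly where the reserved bits sit, are my assumptions for the sketch, not a Mill specification:

#include <cstdint>

struct MillStylePointer {          // hypothetical layout, per the description quoted above
    uint64_t address  : 60;        // true virtual address bits
    uint64_t reserved : 4;         // e.g. garbage collection / forking
};
static_assert(sizeof(MillStylePointer) == 8,
              "a pointer fits a single 8-byte belt position");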
Belt size is not baked; it is set so that spill of single-use operands is rare. Eight might be too small even for a Tin, but we might leave Tin somewhat hamstrung with a too-small belt for market segmentation reasons anyway. All this tuning waits on real data.
Similarly, two streams of half-bundles (at least until we get that hypercube memory) and three blocks to decode within each half-bundle. Other baked-in Mill-architecture parameters?
This is not really baked in; you could add a clock to either side’s decode and get two more blocks if you had a use for them. Earlier Mills had four blocks on the exu side, but we merged two of them to save the clock. If, on some member, we find that a decode stage is clock-limiting the rest of the machine then we might split and rearrange the existing blocks, costing us another clock in decode but speeding up the overall clock rate. Whatever sim says works best overall.
- in reply to: Pipelining #1314
This reply addresses LarryP’s questions on configuration.
1) size of scratch in different members:
Copper: 128
Decimal16: 256
Decimal8: 256
Gold: 512
Maxi: 2048
Mini: 128
Silver: 256
Tin: 128
Sizes in bytes. All numbers are placeholders awaiting tuning with real code.
Scratch is byte addressable and packed, so less is needed than for a register file that needs a whole register to hold anything. Ignoring vector data, we project an average width of spilled data ~3 bytes, so a Tin scratch can hold ~40 separate operands. We project a peak non-pathological scratch load in open code to be perhaps 10 operands, so there’s plenty of extra space to buffer the spiller. In a piped loop, LCVs may demand many more than 10, but we expect to stay in such a loop for a while so the spiller won’t be active during the loop and won’t need to buffer, so the loop can use the whole of scratch without causing spiller stalls.
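A back-of-the-envelope check of those projections (all placeholder numbers, as noted above):

#include <cstdio>

int main() {
    const int tinScratchBytes  = 128;  // placeholder Tin size from the table above
    const int avgOperandBytes  = 3;    // projected average width of spilled data
    const int peakOpenCodeLoad = 10;   // projected non-pathological peak, in operands
    const int capacity = tinScratchBytes / avgOperandBytes;   // ~42 operands
    printf("~%d operands fit; ~%d of headroom beyond the open-code peak\n",
           capacity, capacity - peakOpenCodeLoad);
    return 0;
}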
- in reply to: Pipelining #1311
For all posters, I recommend that you put one question in each posting, or at least collect questions on a single topic together so that replies can be coupled to that topic.
This reply addresses LarryP’s scratch-related questions.
1) several different pieces of loop-carried data
This makes no difference at all. The rotate operator is invoked once per iteration, and all the scratchpad data local to the iteration rotates, no matter where it came from. The scratch space must cover a loop’s distance worth of data, and individual LCVs may have shorter distances, for example,
A[i] = B[i] + B[i-6] + C[i] + C[i-10];
Here the loop distance is 10, but B has a distance of only six. It is true that the values from B will reside in the rotating scratch for four more iterations after they are dead, before being overwritten with newer values, but that’s harmless.
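Here is a toy software model of that example, just to show the indexing: the rotating buffers stand in for scratch, sized to the loop distance of 10, and none of this reflects the Mill’s actual rotate/spill machinery.

#include <cstdio>

enum { DIST = 10, N = 64 };  // loop distance is set by C[i-10]

int main() {
    double A[N] = {}, B[N], C[N];
    for (int i = 0; i < N; ++i) { B[i] = i; C[i] = 2.0 * i; }

    // One rotating buffer per loop-carried stream, each covering the full
    // loop distance; B[j] sits in its slot for four iterations after its
    // last use at j+6 before being overwritten at j+10, which is harmless.
    double rotB[DIST] = {}, rotC[DIST] = {};
    for (int i = 0; i < N; ++i) {
        if (i >= DIST)
            A[i] = B[i] + rotB[(i - 6) % DIST] + C[i] + rotC[(i - 10) % DIST];
        rotB[i % DIST] = B[i];   // needed again at distance 6
        rotC[i % DIST] = C[i];   // needed again at distance 10
    }
    printf("A[20] = %g\n", A[20]);  // 20 + 14 + 40 + 20 = 94
    return 0;
}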
2) where the loop distance (times the size of each loop-carried datum) is greater than the largest-guaranteed scratchpad allocation.
This is our old friend the running-out-of-scratch problem. The Mill solution is “extended scratchpad” in spiller-owned memory, with scratch-like spill/fill semantics. There wasn’t time for extended in the talk, but it was described elsewhere here.
As for compilation, we believe (we are not far enough along yet to say “we have done”) that it is sufficient for the compiler to identify LCVs and their distances, and let the specializer deal with all spill/fill. Compiler output is a dataflow graph, so if the scheduler finds that a live datum will fall off the belt it must allocate scratch and insert the necessary spill and fill. Because the datum is marked with its distance, the specializer knows that a rotate op is needed and how many copies of the LCV will be live in the scratchpad. The rest is reasonably straightforward.
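A hypothetical sketch of that division of labor, with invented names (none of this is actual Mill tooling): the compiler tags each datum in the dataflow graph with its loop distance, and the specializer’s scheduler turns belt overflow into scratch allocation plus spill/fill, adding a rotate whenever an allocated datum is loop-carried.

#include <vector>

struct Datum {
    int  loopDistance;   // 0 = not loop-carried; >0 = iterations until reuse
    bool fallsOffBelt;   // the scheduler found its lifetime exceeds belt residency
    int  byteWidth;      // width of the operand when spilled
};

struct ScratchPlan {
    int  bytesAllocated = 0;
    bool needsRotate    = false;   // emit one rotate op per iteration
};

ScratchPlan planScratch(const std::vector<Datum>& data) {
    ScratchPlan plan;
    for (const Datum& d : data) {
        if (!d.fallsOffBelt) continue;        // stays on the belt, nothing to do
        if (d.loopDistance > 0) {             // loop-carried: rotating allocation,
            plan.needsRotate = true;          // one live copy per unit of distance
            plan.bytesAllocated += d.loopDistance * d.byteWidth;
        } else {                              // ordinary single spill/fill slot
            plan.bytesAllocated += d.byteWidth;
        }
    }
    return plan;
}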
- in reply to: Specification and Floating Point Numbers #1295
The Mill is a commercial venture, so what we provide is driven by the user community in the form of programming languages and other standards; our job is how we provide it. The UNUM proposal essentially represents the exponent itself in floating point so that (for common values) the significance is improved, to the point of exactness in many computations. This is incompatible with the standard IEEE representation, so adopting it would require changes to languages, standards, and much software as well as the hardware, even if it were only an extension of, and not a replacement for, IEEE 754.
I am not enough of a numerics guy to judge the merits of UNUM on mathematical grounds; my role on the 754 committee was as an implementation gadfly, not an algorithms specialist. The small examples of UNUM usage provided seemed to work well, and the implementation in hardware would be straightforward, but I don’t know enough to judge its merits in general code. My gut feel is that hardware prices and operating costs don’t warrant another format when one could simply do everything in Decimal or quad Binary precision. The time for UNUM to have been introduced was back in the day of the original 754, 40-odd years ago, when there were many incompatible formats and a good idea did not have to displace embedded practice. That chance is gone, perhaps unfortunately.
There exists a standard that tries (in a different way) to preserve precision, the IEEE Decimal standard, which the Mill supports. If UNUM reaches even minimal acceptance then it could also be incorporated in the Mill, and likely would be. Until then, even if the hardware had support there would be no way for you to access that hardware due to absence of support in programming languages.
An initial implementation of UNUM would be more suitable on a register machine that can be usefully programmed in assembler, even without HLL support for the format; the Mill is not a realistic assembler target.
Where did pick() operations (and the slots that implement them) go when Gold’s spec changed?
The pick block was folded into the writer block for encoding purposes, although the pick ops could have been put in any of the exu-side blocks, and might be moved again for entropy balancing.
In which slots on the current Gold Mill can pick operations now be encoded and executed?
The only block whose content is dictated by machine timing is the reader block, which decodes a cycle before the others. Reader-phase ops, which execute a cycle (or two) before the others, must be in the reader block to get time for dispatch. Block assignment of the other ops is essentially arbitrary as far as execution goes; the only considerations are code compactness and simplicity of the decoder matrices. As currently organized, any writer-block slot on any member supports pick.
I’d like to see a bit more detail about the FU and slot population on at least one member, probably Gold, since some detail on it has been revealed — and the design is evolving.
Well, you asked for it. From the file members/src/specGold.cc:
c->exuSlots = newRow(8);
c->exuSlots[0] << aluFU << countFU << mulFU << shiftFU << shuffleFU;
c->exuSlots[1] << aluFU << mulFU << shiftFU;
c->exuSlots[2] << aluFU << mulFU;
c->exuSlots[3] << aluFU << mulFU;
c->exuSlots[4] << aluFU;
c->exuSlots[5] << aluFU;
c->exuSlots[6] << aluFU;
c->exuSlots[7] << aluFU;
c->exuSlots[0] << bfpFU << bfpmFU;
c->exuSlots[1] << bfpFU << bfpmFU;
c->exuSlots[2] << bfpFU << bfpmFU;
c->exuSlots[3] << bfpFU << bfpmFU;
c->flowSlots = newRow(8);
c->flowSlots[0] << cacheFU << conFU << conformFU << controlFU << lsFU << lsbFU << miscFU;
c->flowSlots[1] << conFU << conformFU << controlFU << lsFU << lsbFU << miscFU;
c->flowSlots[2] << conFU << conformFU << controlFU << lsFU << lsbFU;
c->flowSlots[3] << conFU << conformFU << controlFU << lsFU << lsbFU;
c->flowSlots[4] << conFU << lsFU << lsbFU;
c->flowSlots[5] << conFU << lsFU << lsbFU;
c->flowSlots[6] << conFU << lsFU << lsbFU;
c->flowSlots[7] << conFU << lsFU << lsbFU;

Remember: this is an untuned dummy specification.
I know the divide-helper instructions are not yet filed. So I’m hoping the Wiki will soon (hope!) give us some insight into the general core ops vs. emulated operations, if not the details on divide() itself.
The divide helper is rdiv, the reciprocal-approximation op. Before you ask: there is also rroot, the square-root-approximation helper. Emulation sequences for both div and sqrt are being worked on, along with the FP and quad-integer emulation sequences.
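For a flavor of how a reciprocal-approximation helper enables a divide emulation, here is the generic Newton-Raphson scheme; this is only the textbook technique, not the Mill’s actual emulation sequence, and rdivApprox below is just a software stand-in for what an rdiv-style op would return (a real emulation would also need a correction step for correct IEEE rounding).

#include <cstdio>

// Stand-in for the hardware helper: a deliberately low-precision 1/b seed.
double rdivApprox(double b) { return (1.0 / b) * (1.0 + 1e-3); }

double emulatedDiv(double a, double b) {
    double r = rdivApprox(b);          // low-precision reciprocal seed
    for (int i = 0; i < 3; ++i)
        r = r * (2.0 - b * r);         // each Newton step roughly doubles the correct bits
    return a * r;                      // a/b computed as a * (1/b)
}

int main() {
    printf("355/113 ~= %.15g\n", emulatedDiv(355.0, 113.0));
    return 0;
}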
The abstraction is accurate. For a belt with nominal 32 entries, any data sink can take its data from any position. There are actually twice as many physical entries (64) as nominal, to allow for phasing, but each phase can only see a nominal window into the physical belt space. Without that, we would have to renumber the belt at each phase boundary, which is not practical.
There is a crossbar, but it is split into a narrow fast path and a wide slow path. One-cycle-latency results (a maximum of one per slot) go directly to the fast path, while two values (for each consumer) are selected from the slow path and passed to the fast path. This is explained in greater detail in the Belt talk.
However, although there are many possible sources for the crossbar, there are many fewer sinks than you assume. In particular, slots accept at most two input arguments from the belt, and quite a few (all the Reader block) accept none at all. The cost of the crossbar is determined almost entirely by the number of consumers, which is one reason why Gold has only 8 exu slots.
Similarly, while a slot can concurrently drop results from differing-latency ops issued in that slot, the long-term steady-state peak throughput is ~one result per slot, which determines how we size the spiller.
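To make the split concrete, here is a very rough software model of the data movement described above; the structure and names are assumptions for illustration, not the hardware design.

#include <cstdint>
#include <optional>
#include <vector>

struct Result { int beltPos; uint64_t value; };

struct SplitCrossbar {
    std::vector<Result> fastPath;   // one-cycle results, at most one per slot
    std::vector<Result> slowPath;   // everything else, selected into the fast path

    std::optional<uint64_t> lookup(int pos) const {
        for (const Result& r : fastPath) if (r.beltPos == pos) return r.value;
        for (const Result& r : slowPath) if (r.beltPos == pos) return r.value;
        return std::nullopt;
    }

    // A consumer slot names at most two belt positions; it is this small
    // number of sinks that keeps the crossbar affordable.
    void operandsFor(int posA, int posB, std::vector<uint64_t>& out) const {
        if (auto a = lookup(posA)) out.push_back(*a);
        if (auto b = lookup(posB)) out.push_back(*b);
    }
};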
We describe the belt implementation as in effect a tagged distributed CAM, because that is easy to explain, and for many people a concrete realization is more understandable than a more abstract description. However, the implementation of the belt is invisible to the program model, and the hardware guys are free to come up with bright ideas if they can, which will likely differ in different family members.
There is expected to be a talk this Fall giving the present most-recent final word about the gate-level hardware of the first Belt implementation.
You’re safe in the NYF department, so far.
You are right that the Belt is essentially a code-visible forwarding network and faces the same problems that such networks have on other machines. The Mill partitions the network to get both speed (for some critical results) and volume (overall), in a way that we have filed for.
At the end of the belt talk there are a couple of slides on how the cascaded crossbar works in hardware. Essentially there is a fast path, that gets a single one-cycle result from each slot through in time linear in number of slots (MIMD width), while everything else first goes through a slow path that is (roughly) linear in number of FU results that can be produced in a cycle; remember a single FU can have ops of different latencies retire together.
Without the cascaded crossbar, the network would have been clock-critical on the larger machines. With cascading, it appears (based on very preliminary hardware work) to be out of the critical path.
- in reply to: Pipelining #1327
The Mill architecture does support IEEE Decimal, although only some members will have it in hardware; the rest will use specialize-time substitution of emulation code. Decimal8 and Decimal16 are test configurations (topping out at 8- and 16-byte decimal representations respectively) for markets that want it: mainframe, DB, and Big Data mostly. As test vehicles they won’t be products, but derivatives may be, when and if we decide to tackle those markets; don’t hold your breath.
Mini is a straw man to see how small we can get the specification before execution becomes completely impractical. It will never be a product.
Maxi is a straw man at the other end, to see how muscle-bound we can configure a Mill before it hits diminishing returns with the clock rate. Also not a product per se, although there may be products between Gold and Maxi.
Maxi needs so big a scratchpad because it has a huge vector size, taking whole cache lines as single operands (SIMD) and lots of them (MIMD). It won’t have that many more operands in scratch than, say, a Gold, but they will be much bigger on average (no one would use a Maxi for anything but vector-busting numeric work), so the scratch size, in bytes, must be bigger.
There have been other symmetrically-split machines; for example, the TI 8-wide VLIWs are actually dual 4-wide VLIWs with the ability to communicate between the halves. The main advantage of Paysan’s is that it does not need to encode result locations, and the transients are single-assignment and so are hazard-free; the Mill has these same advantages with the Belt. The major difficulties with Paysan’s design in a modern context (which is unfair; Paysan’s work was done 15 years ago when a three-stage pipe and a 100MHz clock were cutting edge) are the communicating crossbar, and that bane of all wide-issue machines: cache miss stalls. The Mill solutions for both have been covered in the talks.
I also wouldn’t want to have to do an instruction scheduler for Paysan’s machine.
- in reply to: Specification and Floating Point Numbers #1302
It is intended to be a selling point. It works for the software. It will clearly cut some of the time and cost from hardware, but there remain parts of the hardware that cannot be automated in the current art. The goal remains to cut our development costs to permit entering lower-volume specialty markets.
Rotate/swap ops are not sufficient to replace spill/fill because the number of long-lived live operands in the queue/belt is unbounded. The length of the belt in individual members is determined by the scratch-pad spill-to-fill latency (three cycles on most members) and the rate at which the core can produce results, which is roughly equal to the number of execution pipelines, running from ~5 to over 30 depending on family member. As a rule of thumb we set the length to hold three cycles of production, and tune from there.