Forum Replies Created
- in reply to: Millberry pi #2786
Dream on – we do 🙂 A dev board won’t come soon. What we hope to have this year is the tool chain and sim available on the cloud – a sandbox for you to play in. The sim is quite fast, and on big-iron cloud servers it may run faster than you’d expect.
- in reply to: Posit math instead of IEEE FP? #2784
Gustafson’s work is interesting but very controversial. I’m not enough of a floating point maven to comment on the mathematics of it. It seems clear that FP apps would need to be re-analyzed and most likely rewritten to take advantage of the format; the economics of that suggest very low and slow adoption rates, even with a hardware implementation available.
However, the Mill is specification-driven. Adding Gustafson numbers (GNs) as a computation domain, together with their operation set, is pretty trivial on the software side – a day or two to get the specification machinery to accept them and the tool chain to handle them and produce code. Then somebody has to figure out what a given operation on GNs actually does at the bit level and code that into the sim – maybe a week or two. Getting from there to gates is more work, but not intolerably so. The biggest hassle would be getting them into clang/LLVM as a recognized (and properly constant-folded and otherwise optimized) data type.
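To make the bit-level step concrete, here is a minimal sketch (not sim or Mill code; the 8-bit, es=0 posit format and all names are purely illustrative) of the kind of semantics somebody would have to pin down for each GN op – decoding a posit to a double:

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical sketch only: decode an 8-bit, es=0 posit to a double. */
    double posit8_to_double(uint8_t p) {
        if (p == 0x00) return 0.0;              /* zero */
        if (p == 0x80) return NAN;              /* NaR: "not a real" */
        int sign = p >> 7;
        uint8_t bits = sign ? (uint8_t)-p : p;  /* 2's complement if negative */
        int first = (bits >> 6) & 1;            /* leading regime bit */
        int i = 6, run = 0;
        while (i >= 0 && ((bits >> i) & 1) == first) { run++; i--; }
        int k = first ? run - 1 : -run;         /* regime: scale factor 2^k */
        i--;                                    /* skip the regime terminator */
        double frac = 1.0, w = 0.5;             /* remaining bits: fraction */
        for (; i >= 0; i--, w *= 0.5)
            if ((bits >> i) & 1) frac += w;
        double v = ldexp(frac, k);              /* (1.f) * 2^k */
        return sign ? -v : v;
    }

    int main(void) {                            /* prints: 1 2 -1 */
        printf("%g %g %g\n", posit8_to_double(0x40),
               posit8_to_double(0x60), posit8_to_double(0xC0));
        return 0;
    }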
Would GNs sell more Mills? Almost certainly not. However, certain very specialized needs might want GNs badly enough to NRE a machine with them. And if such a party cared about costs, we surely can produce a GN-supporting Mill a lot cheaper than they could be added to any other machine. Interested parties please apply here 🙂
- in reply to: LLVM pointers #2782
C lets you cast a pointer to a particular numeric type (of implementation-defined characteristics), but the only thing it guarantees is that you can cast the numeric back to the original pointer type.
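As a minimal illustration of how little that guarantee covers (assuming a hosted C implementation that provides uintptr_t):

    #include <assert.h>
    #include <stdint.h>

    int main(void) {
        int x;
        int *p = &x;
        uintptr_t n = (uintptr_t)p; /* pointer -> integer: the value is
                                       implementation-defined */
        int *q = (int *)n;          /* integer -> pointer: compares equal to p */
        assert(p == q);             /* and that equality is the whole guarantee; */
        return 0;                   /* n+1, n<<4, ... are just integer math */
    }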
Most machines today with a wide address space do not support address spaces as big as would fit in a pointer. Thus the pointer may be 64 bits, but only the lower 60 bits (in our case; fewer elsewhere) are meaningful. What happens when your pointer arithmetic overflows the meaningful region is not defined by the language, is ignored by most hardware, and is uncontrollable in LLVM, which has removed pointerhood in the front end.
Mill cares about overflow bugs. We want to fault them, and our pointer arithmetic (folded into the LEA op) checks and will throw. The problem is that we can’t generate a LEA from LLVM because LLVM gives us a 64-bit add and the IR cannot represent an actual pointer add, so all we can do is give you an unchecked ADDU. All we can do for now, anyway. We’ll fix LLVM when we get around to it, if nobody else fixes it first. One may hope.
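For flavor, here is a hypothetical sketch of the kind of check such an op could make, assuming a 64-bit pointer whose low 60 bits are the address field – an illustration only, not the actual Mill LEA semantics:

    #include <stdint.h>
    #include <stdlib.h>

    #define ADDR_BITS 60
    #define ADDR_MASK ((UINT64_C(1) << ADDR_BITS) - 1)

    /* Hypothetical checked pointer add: carrying out of (or borrowing
       under) the 60-bit address field is a fault. abort() stands in
       for the hardware throw. The field layout is assumed here, not
       the published Mill pointer format. */
    uint64_t checked_lea(uint64_t ptr, int64_t offset) {
        uint64_t sum = (ptr & ADDR_MASK) + (uint64_t)offset;
        if (sum & ~ADDR_MASK)
            abort();
        return (ptr & ~ADDR_MASK) | sum; /* non-address bits pass through */
    }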
- in reply to: Belt saturation in short belts #2771
This is a test case to confirm that the specializer can handle belt overflow. It is deliberately simplified to the minimum necessary to expose the event of interest. Belt overflow also occurs in real codes, but real codes have so much else going on that it’s hard even for us to understand the behavior.
When can you play with it? Waiting is!
- in reply to: 2016 closeout thoughts #2710
Look for an announcement January 9 🙂
We are a chip company so we don’t expect to sell boxes except incidentally as development systems. We expect that the community and vendors of Mill-based products will port a wide variety of commercial software to Mill, including various flavors of Linux. It is not our place to choose one flavor over another because that would be choosing one customer over another.
We do expect to provide reference implementations of several versions; the priority of versions will be determined by the market. We will also need to create a new version to take advantage of aspects of the Mill that existing versions are ignorant of. In particular, we expect to provide a reference micro-kernel-based version to take advantage of the reliability and security features.
- in reply to: Continuous Refinements #2192
Mill is not limited to 2x multicore; you can have as many Mill cores on a chip as will fit if you can power and feed them. Our best guess at the moment is that the constraint in current tech will be pin bandwidth to memory. If on-chip memory and/or direct fiber gets real then we expect the constraint to become cooling, although it might be intra-chip inter-core routing. All are WAGs.
At heart, though, you have identified the fundamental tech problem: CPUs don’t scale. Mills don’t either; it’s just that we have better constants. Details:
* GenAsm is pretty extensible. It makes no assumptions about belt size, FU population, cache size, etc.; all that information is provided by the specializer from the desired target description. The emulation substitution mechanism is quite generic: if the genAsm contains an op invocation that the target doesn’t have, then the specializer searches for a function of a related name and signature and substitutes it for the op (a toy sketch of the idea appears after this list). So each member carries a bundled specializer for that target and a library of emulation functions for every ISA op that exists on any Mill. That library can later be updated by DLC to handle later Mill versions with new ops. This system breaks down only if a member ISA has an op that cannot be represented as a function in other ISAs. For example, code for a member with a supercollider-management op won’t work on a Mill that doesn’t have a supercollider to manage 🙂
Aside from particular ops, genAsm is fairly high level, a direct SSA representation of the program. The bulk of our problems with LLVM have been because LLVM IR is lower-level than genAsm, and we can’t recover information that clang/LLVM have thrown away. There’s a fair amount of the machine that you can’t reach from C, including most of the NYF.
* We doubt that a 6-bit morsel would pay, and are quite sure that a 7-bit one doesn’t.
* There’s enough flexibility within the Mill architecture to permit a lot of tuning without having to depart incompatibly. A change big enough to be worthwhile would make it a new architecture, and no longer a Mill. For example, a capability machine might have a belt, but it would no longer be a Mill.
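On the emulation substitution mentioned in the first bullet, here’s a toy sketch; the ISA table, the naming convention, and every name in it are invented for illustration, and this is not the real specializer:

    #include <stdio.h>
    #include <string.h>

    /* Ops this (invented) member implements in hardware. */
    static const char *member_ops[] = { "add", "mul", "lea", "calltr0", NULL };

    static int target_has_op(const char **isa, const char *op) {
        for (; *isa; isa++)
            if (strcmp(*isa, op) == 0) return 1;
        return 0;
    }

    /* Returns NULL if the target has the op; otherwise the name of the
       emulation function the specializer would substitute a call to. */
    static const char *emulation_for(const char **isa, const char *op,
                                     char *buf, size_t n) {
        if (target_has_op(isa, op)) return NULL;
        snprintf(buf, n, "__emul_%s", op);  /* invented naming convention */
        return buf;
    }

    int main(void) {
        char buf[64];
        const char *sub = emulation_for(member_ops, "divf", buf, sizeof buf);
        printf("divf -> %s\n", sub ? sub : "(hardware op)");
        return 0;
    }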
- in reply to: Control-flow folding #2811
Perhaps the talk is oversimplified and misleading. In greater detail, the ebb (extended basic block) is divided into fragments, one fragment between each possible transfer. Each sequence of fragments that ends with a taken transfer has a prediction, even when that transfer is unconditional, so we are actually predicting exit from a fragment chain rather than from an ebb. The difference is visible when the ebb contains taken calls, as in this example. Thus while all ebbs will have at least one prediction, those containing calls will have an additional one for each (predicted-)taken call.
The behavior, cost, and advantages of the Mill predictor and a legacy predictor are quite different and not readily comparable. The actual prediction mechanism, be it history-based, AI, or whatever, is irrelevant; Mill can use any of them in a family member, and the chosen method will predict as well for Mill as for anyone. What is different is the information retained by the predictor and what we do with it. Conventional predictors carry one bit (taken/not taken) for each conditional transfer in the code, and an address (in the branch target buffer, or BTB) for each dynamically-addressed transfer, whether unconditional or conditional-and-taken. They also have a queue of decoded instructions awaiting issue. In the usual decode pipeline that instruction buffering gives them enough lead time to fetch a predicted target instruction from the I$1 without a stall. However, there’s not usually enough lead to handle an I$1 miss, and definitely not enough to go deeper in the memory hierarchy. In addition, the decode queue is yet more work that has to be thrown away on a mispredict.
The Mill predictor contains a (compressed) address in every prediction, essentially folding the BTB into the branch predictor. And there’s an entry for every taken transfer, conditional or not and dynamic or not (a conventional BTB holds addresses only for dynamic transfers). As a result, Mill doesn’t have to look at the code itself to find the target address of a transfer, nor to see which instructions are conditional branches to look up. Mill predictions form a chain, each to the next, extending into future execution; that chain is broken only by a misprediction. And that chain can be prefetched, even out to DRAM, with all fragments fetched in parallel rather than sequentially, as a conventional machine must even with a scout thread.
The downside of the Mill approach is that the space occupied by the predictor for a given hunk of code is much bigger than a simple taken/not taken bit table for the same code. Conventional predictors were developed when transistors were scarce and the table size mattered.
The difference between Mill and legacy approaches shows up differently in different workloads. On loop-heavy loads with few transfers and fewer mispredicts, both approaches work well. In branch-heavy code with lots of mispredicts but a code working set that fits in the I$1 (a lexer, for example), Mill wins not because it predicts better but because its mispredict recovery is fast, in part because it doesn’t need an instruction queue. But the big difference shows up in workloads with very large code working sets and fairly common mispredicts, such as database and browser workloads. A mispredict in such loads commonly means that the application has entered a sub-function that has its own working set, and that working set is not in the I$1 and typically not on the chip at all.
That sub-function will need all its working set. A legacy CPU will fetch that working set one transfer at a time as the I$1 misses the fetch of the code that has to be decoded to find the address of the next code to fetch, which will miss in the I$1 again… Meanwhile, the Mill predictor will have issued fetches for the whole working set, as fast as the DRAM bandwidth will go, by chaining in the predictor without looking at the code itself. Yes, the first few transfers will have to wait for the fetched code, but then the fetcher catches up and decode finds the code it wants in cache, just waiting to execute.
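A toy model of that chained prefetch, with invented entry layout and lookup (real predictor entries are compressed and the tables more elaborate):

    #include <stdint.h>
    #include <stdio.h>

    /* Each entry names the next fragment, so the fetcher can chase the
       chain and issue fetches without looking at any code. */
    typedef struct {
        uint32_t next;   /* predicted address of the next fragment */
        uint16_t len;    /* bytes to fetch there */
    } Pred;

    static void run_ahead(const Pred table[], unsigned nentries,
                          uint32_t addr, int steps) {
        for (int i = 0; i < steps; i++) {
            const Pred *p = &table[(addr / 16) % nentries]; /* toy lookup */
            printf("prefetch %u bytes at %#x\n",
                   (unsigned)p->len, (unsigned)p->next);
            addr = p->next;  /* chain to the next prediction */
        }
    }

    int main(void) {
        /* A four-fragment chain: 0x1000 -> 0x1010 -> 0x1020 -> 0x1030. */
        Pred table[4] = { {0x1010, 32}, {0x1020, 64},
                          {0x1030, 16}, {0x1000, 48} };
        run_ahead(table, 4, 0x1000, 4); /* all four fetches issue back-to-back */
        return 0;
    }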
There are actually two costs in a mispredict: the immediate stall as state is flushed and reloaded, and the long tail of deferred costs as the instruction caches reload to the new working set. The numbers you see published for mispredict costs show only the first cost, and if the transfer doesn’t change to a new working set then that is the only cost. But (neglecting task swap) any time you see a flurry of instruction cache misses you are seeing a change in working set; the flurry was preceded by a mispredict, and the cost of those cache misses is part of the cost of that prediction miss.
Mill addresses the immediate cost with short pipes and little state; mispredict stall runs a third to a fifth of that of legacy machines. For many applications that’s the only cost that matters, but for others the subsequent fetch-miss cost is very significant, and Mill run-ahead prediction addresses that.
The cost is hardware table size. Because the Mill is configurable, different Mill members can adopt different predictors. The video describes the expected default predictor configuration. A Mill for a market that cares only about price (== chip area) and not about latency can be configured with a conventional predictor, or with no predictor at all, as in some embedded designs.
- in reply to: Belt saturation in short belts #2805
Even in C, when LLVM has the bodies visible it marks functions as pure or read-only as applicable, which lets the specializer reorder calls to them. Of course, separate compilation hoses that, unless the specializer is run after an LTO pass. We have not yet integrated LTO into the tool chain; most reports say it has little value except for inlining, which we do in the specializer anyway.
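A small example of the inference involved, assuming clang with optimization on and the definition visible (names are illustrative):

    /* With the body visible, clang can infer that sq neither reads nor
       writes memory, so the two calls below have no hazard between them
       and a scheduler is free to reorder or overlap them. Compiled
       separately, sq is an opaque call and that freedom is lost
       without LTO. */
    static int sq(int x) { return x * x; }

    int f(int a, int b) {
        return sq(a) + sq(b);
    }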
- in reply to: Belt saturation in short belts #2800
Here’s an example of the kind of code you get for cascaded calls. Given source:
void foo1(); void foo2(); void foo3();
void bar(int i) {
    if (i == 0) foo1();
    else        foo2();
    foo3();
}
on Silver (with three branch units) the code is:
F("bar") %0; eqlb(b0 %0, 0) %1, calltr0(b0 %1, "foo1"); callfl0(b0 %1, "foo2"), call0("foo3"), retn();
- in reply to: Belt saturation in short belts #2788
The call op is in the flow block on the flow side of the decoders. That block parses in D0, so the presence of calls is known at the start of the D1 cycle, and there is all of D1 and D2 to get organized. When there are cascaded calls the hardware connects the return of the first to the entry of the second and so on; there is no cycle in between. You can think of it as hardware tail recursion removal. It’s not hard on a Mill because there cannot be any hazard between the two calls; on a genreg machine you’d have to check whether the first function did something nasty to the caller’s registers or stack, and even without that you still would have inter-call rename, something I don’t want to think about.
Art claims that he can cascade the trailing call with a branch or return too, so long as the instruction does not have any pick ops. I’m not sure I believe him.
- in reply to: Continuous Refinements #2193
The implementation uses quite conventional clock distribution, and we expect normal binning and overclocking capability. Mills are not asynchronous; different members have different timing, but within any member latencies are fixed.
- in reply to: When should we expect the next talk? #2178
Not sure; probably not until after the next funding round. We’ve been flat out with patents and implementation.
- in reply to: Single register #2172
It does, doesn’t it?