Forum Replies Created
- in reply to: Tokyo University STRAIGHT compiler #3958
The STRAIGHT addressing is not a belt because it uses ordinal instruction addressing, while a belt uses ordinal drop. But the new Clockhands addressing is a belt, pure and simple. Thank you for bringing it to our attention; we have contacted the authors. If it’s just a research project then it is our practice to grant royalty-free research licenses to our IP. However, if they plan to commercialize it, or even to seek patents without mentioning us, well, many happy lawyers.
- in reply to: Binary distribution for a Mill #3927
1) App distribution in binary vs. IR:
Certain parts of the software, such as the minimal BIOS and the boot specializer, must be distributed in binary. These will normally reside on a ROM, and the distribution of changes likely involves reflashing the ROM; vendors other than hardware manufacturers will never need to do this. Kernel vendors will distribute in IR form using the minimal canonical IR which is acceptable to the specializer in the ROM. Included in the kernel package will be a more feature-full app-IR specializer in ROM-IR form. That will be translated to native binary, and then run through itself so that the installed app-specializer has all the app-level features for its own work of translating app-IR to binary. Cascaded compiling-the-compiler is routine for cross-compilation, which is really what a Mill install is and does.
Other than updating the boot ROM there is really no reason to distribute pre-specialized code in binary. Even assembler-like code can be represented in Mill IR as intrinsics. Now it is possible to have Mill configs with extra nonstandard instructions, and code that uses those won’t run if the instruction doesn’t exist in the target config at hand. But you’d still use IR for it – and get a specializer error if the requisite intrinsic is not found.
2) Dynamic exit prediction update in read-only systems:
Prediction update is just an optimization: it starts the predictor table off with the state from prior runs instead of empty. If the load module cannot be written then the optimization doesn’t happen. Conceivably the vendor could build a table using a mutable load module and then distribute the module as binary. The gain from the optimization is unlikely to justify the nuisance of maintaining multiple binaries for the different members.
3) Sharing predictor history:
Certainly predictor history could be separated from the binary code that uses/updates it. However, the history is tied to a particular config just like the specialized code is, so you’d have the administrative bookkeeping problem of making sure that both address the same config.
3 redux) Varying config binary ISAs:
It’s common that configs lack some instructions that others have. For example, some of our test configs lack floating-point or quad (128-bit) data forms. The specializer still recognizes these in the IR, and generates calls on emulation routines, which are often inlined. The substitution is automatic – the install provides signature info and the corresponding routine for everything potentially in the IR.
The sig/emu info is tied to an IR level. If the IR changes to a new release then the installation must be upgraded with info to match. If you present an object module that uses IR12 but only IR9 is installed then you’ll get an error in the app install. It doesn’t matter whether the actual hardware has the instructions: the specializer knows what the hardware can do, so it uses hardware if possible and emu otherwise. The IR install may provide emu routines for instructions that the particular hardware actually has; in that case the hardware will be used.
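To make the substitution rule concrete, here is a minimal sketch in C of how a specializer might choose between hardware and emulation. The op names, routine names, and table layout are invented for illustration; this is not our actual specializer code.

```c
/* Minimal sketch (not the real specializer) of the substitution rule:
 * for each IR operation, use the hardware instruction if the target
 * config has it, otherwise emit a call on the installed emulation
 * routine, and fail the app install if neither exists. */
#include <stdio.h>
#include <stdbool.h>

typedef enum { IR_ADD_F64, IR_ADD_F128, IR_MUL_DECIMAL } ir_op;

/* What the target config can do in hardware (would come from the config spec). */
static bool hw_supports(ir_op op) {
    switch (op) {
    case IR_ADD_F64:     return true;   /* native FP on this config */
    case IR_ADD_F128:    return false;  /* no quad data forms here  */
    default:             return false;  /* e.g. no DFP hardware     */
    }
}

/* Emulation routines provided by the IR install, keyed by signature. */
static const char *emu_routine(ir_op op) {
    switch (op) {
    case IR_ADD_F128:    return "__emu_add_f128";
    case IR_ADD_F64:     return "__emu_add_f64";  /* present but never chosen:
                                                     hardware wins when it exists */
    default:             return NULL;   /* signature unknown to this IR level */
    }
}

static int specialize(ir_op op) {
    if (hw_supports(op)) {
        printf("emit hardware instruction for op %d\n", op);
        return 0;
    }
    const char *emu = emu_routine(op);
    if (emu) {
        printf("emit (possibly inlined) call on %s\n", emu);
        return 0;
    }
    fprintf(stderr, "app install error: no hardware or emu for op %d\n", op);
    return 1;   /* the specializer error mentioned above */
}

int main(void) {
    specialize(IR_ADD_F64);             /* hardware      */
    specialize(IR_ADD_F128);            /* emu call      */
    return specialize(IR_MUL_DECIMAL);  /* install error */
}
```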
Prediction offers little value in lengthy loops. Even in architectures with no form of prediction there may be a miss on entry and another on exit, but for all the other iterations the loop body is already in the top-level cache. At most a predictor can let fetch and decode proceed through the back branch without waiting for that branch to execute and the target to resolve. That wait can be significant in very tight loops, but predictor hardware isn’t free, and in cost-conscious embedded usage predictors are an engineering choice.
This also applies to the Mill exit predictor. You are right that ours permits fetch-ahead without decode of the branch because the next ebb’s address is in the prediction, similarly to a BTB entry. The entry also contains a duration, in cycles, which tells the decode when to stop the current ebb and start on the new one. If the predictor was correct, the target ebb (and likely many more) is already in the TLC and the correct sequence of bundles passes through decode to issue with no bubbles.
That’s if there is no miss. When we execute the branch and discover that the prediction was wrong, or if we take an exit that wasn’t predicted, the hardware has already issued two cycles down the wrong path. Miss recovery has to unwind that, fetch the right target, and decode it, which takes five cycles if the target is in cache (legacy miss costs are usually around 20 cycles). A naive predictor will always miss at the bottom of a loop on the last iteration. If the loop has a low trip count the miss can be significant. To avoid it, a more sophisticated predictor uses prior branch history or iteration counts to switch the prediction on the last time through. We’ve tried both approaches, and found that which works best depends on the kind of apps being run, so that’s another configuration choice.
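For illustration, here is a toy model in C of the prediction entry and miss cost described above; the struct fields and constants are my own naming, not the real hardware tables.

```c
/* Illustrative sketch only, not the real Mill predictor.  It models the
 * two fields described above -- the predicted next ebb's address and a
 * duration in cycles telling decode when to switch -- plus the roughly
 * five-cycle recovery quoted in the text for an in-cache target. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t next_ebb;   /* predicted target ebb address (like a BTB entry)   */
    uint32_t duration;   /* cycles to run in the current ebb before switching */
} exit_prediction;

enum { RECOVERY_CYCLES = 5 };  /* unwind the ~2 wrong-path cycles, refetch, decode */

/* Returns the cycle penalty for one executed exit. */
static int resolve_exit(const exit_prediction *p,
                        uint64_t actual_target, uint32_t actual_duration) {
    if (p->next_ebb == actual_target && p->duration == actual_duration)
        return 0;               /* hit: no bubbles */
    return RECOVERY_CYCLES;     /* miss: unwind and restart at the real target */
}

int main(void) {
    exit_prediction p = { .next_ebb = 0x1000, .duration = 3 };
    printf("predicted exit taken: %d cycle penalty\n", resolve_exit(&p, 0x1000, 3));
    printf("mispredicted exit:    %d cycle penalty\n", resolve_exit(&p, 0x2000, 3));
    return 0;
}
```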
- in reply to: Millberry pi #3888
Nope 🙂
This is somewhat more complicated on the Mill, which has a richer notion of shift than the common bit twiddle. In particular, left shift models multiply in that overflow behavior is specified, while right shift models divide in that rounding behavior is specified. Field isolate is a special case of these behaviors, and would be a shift in the FU, but is a separate instruction in the conAsm for notational and reader convenience.
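A small C illustration (mine, not from the Mill material) of why shift is more than a bit twiddle once arithmetic semantics matter; note that the behavior of >> on negative signed values is only the common arithmetic-shift convention, which C leaves implementation-defined.

```c
/* Plain C; the Mill instead specifies the rounding/overflow behavior in
 * the shift instruction itself. */
#include <stdio.h>

int main(void) {
    /* Rounding: C integer division truncates toward zero, while an
     * arithmetic right shift rounds toward negative infinity. */
    int x = -7;
    printf("-7 / 4  = %d\n", x / 4);     /* -1 */
    printf("-7 >> 2 = %d\n", x >> 2);    /* -2 on typical targets */

    /* Overflow: a left shift that models a multiply can lose the true
     * product unless the overflow behavior is pinned down. */
    int big = 0x60000000;                     /* 1,610,612,736               */
    int wrapped = (int)((unsigned)big << 1);  /* -1,073,741,824 on typical   */
                                              /* two's-complement hardware   */
    long long exact = (long long)big * 2;     /* 3,221,225,472               */
    printf("big << 1 (32-bit) = %d\n", wrapped);
    printf("big * 2  (64-bit) = %lld\n", exact);
    return 0;
}
```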
One of the early talks described the smear instruction for SIMD search loop support. Early Mill designs expected more extensive SIMD, but we have decided that most SIMD usage is better done with streamers and width. Sorry, NYF.
Your idea is possible in principle, but the details are problematic. In essence you propose a deferred branch whose transfer is detached from the evaluation of the predicate, similar to the way Mill separates load issue from load retire.
The deferred branch, DB, would contain a predicate, a target address, and a deferral cycle count. At DB issue time the predictor has already made an exit prediction and fetch has already run down the prediction chain loading the code of the predicted path. If the DB predicate evals false then the DB is ignored. If it evals true then the DB target address should be compared with the predicted target and the deferral with the remaining count of the prediction; if equal then the DB is asserting the same transfer as the prediction, and can be ignored. If they differ then the fetch/decode logic needs to be reset, essentially in the same way as with a mispredict, by updating the pending count and target and restarting the fetch chain at the new target.
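Here is the decision logic just described, sketched in C. This is a thought experiment rather than an implemented feature, and the types and names are invented for illustration.

```c
/* Sketch of resolving a proposed deferred branch (DB) against the
 * prediction the fetch logic is already running on. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    bool     predicate;   /* did the condition evaluate true?          */
    uint64_t target;      /* where the DB wants control to go          */
    uint32_t deferral;    /* cycles from DB issue until the transfer   */
} deferred_branch;

typedef struct {
    uint64_t target;      /* target the predictor has already chosen   */
    uint32_t remaining;   /* cycles left before the predicted transfer */
} pending_prediction;

static void resolve_db(const deferred_branch *db, pending_prediction *pred) {
    if (!db->predicate)
        return;   /* predicate false: the DB is simply ignored */
    if (db->target == pred->target && db->deferral == pred->remaining)
        return;   /* DB asserts the same transfer the predictor already set up */
    /* Disagreement: reset fetch/decode as on a mispredict -- update the
     * pending count and target and restart the fetch chain.            */
    pred->target    = db->target;
    pred->remaining = db->deferral;
    printf("reset fetch chain to %#llx in %u cycles\n",
           (unsigned long long)pred->target, pred->remaining);
}

int main(void) {
    pending_prediction pred = { .target = 0x4000, .remaining = 4 };
    deferred_branch db = { .predicate = true, .target = 0x8000, .deferral = 4 };
    resolve_db(&db, &pred);
    return 0;
}
```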
This could be done. However, there are both semantic and implementation issues. One is the time required to reset the fetch. If the target is not in the I0 microcache then reset would take roughly as long as mispredict recovery, i.e. five cycles in our test configs. Even an I0 hit would likely need three cycles to reset. If the deferral was less than the remaining count as predicted then we would have already executed a ways down the predicted (before-the-reset) path, and would need a full unwind miss recovery, and the DB buys nothing. How often can we eval a predicate five cycles before the transfer? Not often, I’d guess, but I have no numbers.
A semantic issue is how the DB interacts with other branches. Suppose a regular branch is executed between DB issue/eval and transfer? Who wins? I’m sure you can think of similar conundrums.
- in reply to: Layout can dramatically affect performance #3936
Remarkably clear exposition. Thank you!
- in reply to: Binary distribution for a Mill #3933
Yes, although we don’t yet support that mode.
- in reply to: Tokyo University STRAIGHT compiler #3930
Itanium uses rotating registers as a cheap form of loop pipelining, but addressing is still spatial. Mill uses temporal addressing, which is quite different.
- in reply to: Tokyo University STRAIGHT compiler #3929
STRAIGHT is interesting; it’s the only other temporally-addressed architecture that I know of, and I thank you for bringing it to my attention.
The addressing (STRAIGHT vs Mill) is different: STRAIGHT refers to the generating instruction in time order, whereas Mill refers to the dropped result. In the hardware, STRAIGHT uses a scoreboard approach to block for uncompleted instructions, while Mill uses full static scheduling. Some benefits are shared: no renaming, no encoded result registers. Some are peculiar to one or the other: Mill needs no reorder buffer, while STRAIGHT need not track instruction retire time in the compiler.
I wish them luck.
DFP is obviously of value for COBOL-type commercial programs. Engineers tend to think of designs as being for the kinds of things that engineers do – CAD, compiles, scientific, games – and forget how much work is done doing payroll and booking reservations. Those won’t be the first Mill markets, but we have defined DFP into the ISA for possible future products. Whether it’s hardware is a config choice, to be made then.
Same answer: it depends on the length of the dataflow gate chain of the combined operations and the gate height of the process being used in the fab. Each pair must be analyzed individually using the CAD tools.
Phasing can be confusing, because it is both a representation of hardware timing and of ordering abstractions within a single hardware time unit. At its simplest, it represents “this comes before that”.
In the hardware each clock admits a hardware-defined number of gate transits between clock edges. Signals pass across a clock edge from one set of gates to a different one; the signals can be thought of as operand results of the former (the producer or source) and arguments of the latter (the consumer or sink). In conventional circuit design, including ours, there are two kinds of source-to-sink connection of interest. In one, a particular source can only pass signals to a particular sink; this is a pipeline. In the other, there is a many-to-many signal connection between a set of sources and a set of sinks; this is a bypass network or crossbar. Pipeline connections typically take only a few gates to stabilize the signals at the clock boundary; crossbars take many more, growing roughly as the square of the number of participants, and often with significant wire delay.
You ask about consecutive shifts, and the answer depends on whether the result of the first shift is pipelined to the second, or is routed to the second as one of many possible consumers through a crossbar. From the view of the hardware, time is discrete, quantized, and integral, and only happens at the clock boundary. (Yes, there are asynchronous hardware designs in which time is a real number and there is no ticking clock, but those are not mainstream.) Mill phases only happen at clock boundaries in hardware terms; there is no between-the-clocks phasing, because there is no notion of time between the clocks. Consequently the notion of doing two shifts (in the sense of actions with source and destination) in one clock is not semantically meaningful: by definition you can only do one action per clock, because that’s what an action is defined to be.
However, there is another sense in which your question is meaningful: is it possible to take the gates that do one shift action (in one clock) together with the gates that do another shift action (in another clock) and make a bigger pile of gates that together comprise an action whose result is the same as the second of the two shift actions when cascaded together, but without the cascade, and all in one clock? That is, is it possible to replace two consecutive shifts (taking two clocks) with one double-shift (taking one clock)?
That question is meaningful, but has an unfortunate answer: it depends. Hardware shifts are usually implemented as a tree of multiplexers, and the depth of the tree depends on the maximal shift amount, which for us depends on the width of the maximal operands: a Mill with quad (128-bit) data needs deeper muxing than one with only double (64-bit) data. More depth means more gates have to fit between the clocks. The limit to the maximal gate chain length is set by the fab process, the circuit design, and the clock rate, and will vary from chip design to chip design. However, it is likely that two mux trees can be fit into a single clock, at least up to double width. That could be used as a bit-field isolate instruction.
However, for bit fields there is a better way that puts less pressure on the gate chain depth: a shift and a mask. The shift part brings the desired field into the low bits, and a mask AND selects only the bits of the field. An AND does not use a mux tree and its gate depth does not depend on the operand width, so it uses fewer gates than a second full tree would need for the second shift. Mill bitfield instructions use this approach.
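As a concrete comparison (plain C, not Mill conAsm), here are the two-shift and shift-and-mask forms of a bit-field isolate. A 64-bit barrel shifter needs about log2(64) = 6 mux levels and a 128-bit one needs 7, while the AND’s depth is independent of operand width.

```c
/* Two equivalent ways to isolate a bit field, as discussed above: the
 * cascaded form needs two full shifter passes, while the shift-and-mask
 * form replaces the second mux tree with a width-independent AND. */
#include <stdint.h>
#include <stdio.h>

/* Extract `len` bits starting at bit `pos` (pos 0 = least significant). */

static uint64_t isolate_two_shifts(uint64_t x, unsigned pos, unsigned len) {
    /* Shift the field to the top, then back down: two shifter passes. */
    return (x << (64 - pos - len)) >> (64 - len);
}

static uint64_t isolate_shift_mask(uint64_t x, unsigned pos, unsigned len) {
    /* One shifter pass, then a mask AND selects only the field's bits. */
    uint64_t mask = (len == 64) ? ~0ULL : ((1ULL << len) - 1);
    return (x >> pos) & mask;
}

int main(void) {
    uint64_t v = 0xDEADBEEFCAFEF00Dull;
    printf("%#llx\n", (unsigned long long)isolate_two_shifts(v, 8, 16)); /* 0xfef0 */
    printf("%#llx\n", (unsigned long long)isolate_shift_mask(v, 8, 16)); /* 0xfef0 */
    return 0;
}
```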