Mill Computing, Inc. › Forums › The Mill › Architecture › Could shift instructions be replaced by bit field instructions?
Tagged: shift instructions bit fields
- AuthorPosts
- #3879 |
I found this article that references a clever trick. They replace shift instructions with bit field instructions.
“A neat trick that I learned from the M88k ISA (designed by Mitch Alsup in the 1980s) is that you can replace the common shift instructions (logic/arithmetic shift left/right, i.e. LSL, LSR, ASR) with more powerful and versatile bit field instructions.”
See the original page for more information, but it sounds like a neat trick. Could this idea benefit the Mill ISA?
This is somewhat more complicated on the Mill, which has a richer notion of shift than the common bit twiddle. In particular, left shifts modl multiplies in that overflow behavior is specified, while right shift models divide in that rounding behavior is specified. Field isolate is a special case of these behaviors, and would be a shift in the FU, but is a separate instruction in the conAsm for notational and reader convenience.
On ARM64 and PPC, all the different bitfield operations are implemented by only a couple instructions that perform a rotation and then a bitwise-select using a constructed mask. The rotation amount and bit-range of the mask define whether it is an extraction or injection.
Without dedicated bitfield instructions, extraction, and injection into zero can be done with two shifts: first to the top and then down.
What’s missing is a quick and easy way to bitwise-select a shifted value (or zero) into an existing bitfield.
The Wiki mentions a “Merge” instruction that performs bitwise select but you would first have to construct the mask.I’m not entirely sure I understood how The Mill’s “phasing” works, so please excuse me if the following does not make sense:
Could, in theory, the processor be organised to have a left shift be followed by a right shift of the result in the next phase with only a single cycle delay?Phasing can be confusing, because it is both a representation of hardware timing and of ordering abstractions within a single hardware time unit. At it’s simplest, it represents “this comes before that”.
In the hardware eacgh clock admits a hardware defined number of gate transits between clock edges. Signals pass across a clock edge from one set of gates to a different one; the signals can be thought of as operand results of the former (the producer or source) and arguments of the latter (the consumer or sink). In conventional circuit design, including ours, there are two kinds of source-to-sink connection of interest. In one, a particular source can only pass signals to a particular sink; this is a pipeline. In the other, there is a many-to-many signal connection between a set of sources and a set of sinks; this is a bypass network or crossbar. Pipeline connections typically take only a few gates to stabilize the signals at the clock boundary; crossbars take many more, roughly growing as to the square of the number of participants, and often with significant wire delay.
You ask about consecutive shifts, and the answer depends on whether the result of the first shift is pipelined to the second, or is routed to the second as one of many possible consumers though a crossbar. From the view of the hardware, time is discrete, quantized and integral, and only happens at the clock boundary. (Yes, there are asynchronous hardware designs in which time is a real number and there is no ticking clock, but those are not mainstream.) Mill phases only happen at clock boundaries in hardware terms; there is no between-the-clocks phasing, because there is no notion of time between the clocks. Consequently the notion of doing two shifts (in the sense of actions with source and destination) in one clock is not semantically meaningful: by definition you can only do one action per clock, because that’s what an action is defined to be.
However, there is another sense in which your question is meaningful: is it possible to take the gates that do one shift action (in one clock) together with the gates that do another shift action (in another clock) and make a bigger pile of gates that together comprise an action whose result is the same as the second of the two shift actions when cascaded together, but without the cascade, and all in one clock? That is, is it possible to replace two consecutive shifts (taking two clocks) with one double-shift (taking one clock)?
That question is meaningful, but has an unfortunate answer: it depends. Hardware shifts are usually implemented as a tree of multiplexers, and the depth of the tree depends on the maximal shift amount, which for us depends on the width of the maximal operands: a Mill with quad (128-bit) data needs deeper muxing that one with only double (64 bit) data. More depth means more gates have to fit between the clocks. The limit to the maximal gate chain length is set by the fab process, the circuit design, and the clock rate, and will vary from chip design to chip design. However, it is likely that two mux trees can be fit into a single clock, at least up to double width. That could be used as a bit-field isolate instruction.
However, for bit fields there is a better way that puts less pressure on the gate chain depth: a shift and a mask. The shift part brings the desired field into the low bits, and a mask AND selects only the bits of the field. An AND does not use a mux tree and its gate depth does not depend on the operand width, so it uses fewer gates than a second full tree would need for the second shift. Mill bitfield instructions use this approach.
I did not find bitfield instructions mentioned in the Wiki so I assumed that you had omitted them.
I’m sorry, my question was not if both operations could finish in one cycle, but if the second operation could finish in the next cycle. In other words, if each op would normally have a latency of four cycles each, so that if executed in sequence their total latency would be eight, could that be reduced to five cycles by bundling them together and have them execute in different phases?
- This reply was modified 1 year, 7 months ago by Findecanor.
- This reply was modified 1 year, 7 months ago by Findecanor.
Same answer: it depends on the length of the dataflow gate chain of the combined operations and the gate height of the process being used in the fab. Each pair must be analyzed individually using the CAD tools.
- AuthorPosts
You must be logged in to reply to this topic.