- FindecanorParticipantMay 21, 2018 at 7:15 pmPost count: 30
How would you do byte-order conversion on The Mill in as few instructions as possible?
Some other architectures have a “byte swap” instructions for 32-bit quantities while 16-bit values are often rotated.
I have seen The Mill’s “shuffle” instruction, but that would require going through memory to reinterpret the data between vector and scalar.
The manual way is otherwise to shift, mask and or back together using multiple instructions.
I have long wished that C’s type system would support a type modifier that would allow programmers to declare a struct member’s byte-order once (“big_endian unsigned int;”) and all byte-order conversion then to be handled transparently.
I think that sort of model would work well on The Mill, with metadata on belt items. With little-endian being default, have one instruction for setting metadata making a pointer into a big-endian pointer. Then all loads and stores with a pointer with big-endian metadata flag would incur an automatic byte-order conversion.
The addp/lea instructions would carry the metadata flag over,
I believe most foreign-endian operations are done in conjunction with loading and storing several fields in individual structs, and in that case only one additional instruction might be required for all fields within the entire struct: its base address, as opposed to having a bswap instruction for each individual element. When emulating a big-endian machine (JIT-compiled or not), the savings could be even greater.
I don’t expect anyone to extend C soon but I think that existing LLVM bytecode patterns containing “bswap” instructions could be translated into this model and then code using this model could be optimised.
- Ivan GodardKeymasterMay 21, 2018 at 7:55 pmPost count: 687
You wouldn’t do swapping in the core using order tagging. The way that hardware works there would be a swap delay on every operation, needed or not, which would cost you at least 25% of your clock rate.
Ad hoc swapping is rare and is best done with explicit operations. The only important case is when there is a file of the wrong endianness and the buffer must be swapped, or in general when an app is written to presume an endianness that is different from the native (little) of the core. The Mill has a status bit that lets it act like a big-endian machine, swapping on all access to memory. This is the same approach as used by PPC, MIPS, and some others; it’s not original with us.
- cParticipantMay 23, 2018 at 8:49 amPost count: 1
On the topic of byte order, have you had a look at the proposed riscv bit manipulation instructions? The grev and gzip instructions in particular are very simple and elegant, and may replace a number of more specialized instructions. The first is a generalized reverse, which is capable of both bit reversal and byte swapping. For every bit in the control word, the corresponding pairs of 2^n bits are swapped and the effects are combined. The second is a generalized zip which is more opaque without a diagram, but can zip/unzip any 2^n bit sizes. Both would be a welcome addition to any instruction set.
For detail, see xbitmanip-draft.pdf at https://github.com/cliffordwolf/xbitmanip
It looks like the Mill has a bitwise reverse instruction, and is using shuffle for byte swapping, which limits the functionality to bits and bytes. The shuffle itself appears more limited then some would like, and Grant also mentioned a shuffle that can use two element sources; something which is actually implemented on PPC with vec_perm, and which is extremely useful on that platform.
With that said, I appreciate that the Mill is different enough that it will have its own considerations. I’m still curious if any thought has been given to vec_perm or something similar, and how it might fit into the Mill.
- Ivan GodardKeymasterMay 28, 2018 at 12:40 amPost count: 687
It’s taken a while for me to wade through the RISC-V document; sorry for the delay in response. The current Mill bit manipulation operations are a historical grab bag that we had planned to review and cull/extend when we had enough of a code base to get useful numbers. Your post prompted me to start that process; it’s relatively low priority so I don’t know when changes will happen.
I’m pretty sure that we should have a funnel shifter unit for the present shift and rotate group, although that implies more gates for the widest operands (compared to a normal shifter), and there are issues applying funnel shift to vector arguments.
There has been a long-standing desire for Galois Field operations, i.e. bit matrix multiply. A 8×8 BMM is straightforward, but things get out of hand at greater width – the control data for a quad x quad BMM is 4k bytes. That much data sounds like an I/O device rather than an operation.
Another desire is bit-stream parsing, chopping dynamic numbers of bits off the front of a byte-stream. For this the bit-chopping is not hard, but dealing with whether a chop has to advance the source or not is hard in a static machine. This one will likely wait until we add general streamer support.
Lastly, even simple bit-field extract/inject is problematic when the field is dynamic. An inject needs four arguments – source1, source2, position and size – and four-argument slots would be very expensive. We could construct such an op with ganging, but then you use two slot positions, while building the inject out of shifts and masks only needs six ops. The PPC version of funnel includes a mask step the would reduce the cost of an idiom even further. It’s not clear that dynamic inject is common enough to be worth the trouble. A static inject only needs two belt arguments, but it then needs logN encoding bits for the position/size info, which may increase the entropy of the slot supporting the op.
We’ll chew on these and other issues; thank you for shaking our tree.
You must be logged in to reply to this topic.