Why not encode data types in metadata?

Validark

The Mill's idea for encoding the size and scalarity in metadata is an amazing trick for reducing the number of opcodes. It has been mentioned many times that metadata does not include type information saying whether something is an integer vs a pointer vs a float, etc. I would like to know if this idea was ever given consideration and why it was determined to not be a good tradeoff. Naturally, it would reduce the number of opcodes by a lot, so you wouldn't need e.g. add vs addp vs addf vs addd, you would just have add. This would apply to mul, sub, div, rem, eql, ne, lt, le, gt, ge, max, min, narrow, widen, root, and more. There is the downside that you'd need instructions akin to what widen and narrow do for size, so you'd need some kind of reinterpret_cast instruction which can write to the metadata bits to change the type. This would make casting take a cycle, as opposed to being completely free on conventional machines, but I doubt it actually comes up a lot that people cast from float to integer and vice versa. This could also affect the convert operations if desired, but those are not planned to be implemented in hardware anyway so far as I know (although f64->i32, i32->f64, and a few others are built-in to Arm, which speeds up JavaScript, 4 cycles for float->int and 6 cycles for int->float on A76, but they might/probably use microcode). This idea would have the same net effect as the size and scalarity metadata tags, which is that more of the bits that would have been in the opcodes would be moved into the metadata, which could be a win. There are a few drawbacks I can think of though. For one, not all instructions can benefit, and some probably shouldn't. E.g. "signed" right shift you'd probably still want to be a separate opcode from "unsigned" right shift, since it is not uncommon to want to do a sign-extending right shift on an "unsigned integer" (scare quotes because it might actually be a bitvector, not a numeric value), and you probably wouldn't want to pay an extra cycle to switch back and forth for ops that. (Some languages have two shift operators for this reason, >> and >>>. One nice way to compute the average of two signed integers is to do (a + b) >> 1, which allows overflowing safely into the sign bit and then does an "unsigned" shift right to divide by two.) You also wouldn't necessarily have to move ALL types into metadata, e.g. it's not clear to me whether unsigned/signed should even be encoded in metadata by such a scheme. Maybe it would make more sense to just leave that encoding in the instructions but encode integer/pointer/float? The other downside is that you'd need more metadata bits although I am not sure how many. If you encoded it enum-style using successive integers you'd probably only need 2 or 3 extra metadata bits, but it would take more if you encoded it bitmap style. I wonder if the floating point metadata bits could be repurposed for non-floats? Another potential downside is that loads and constants would need to be able to encode those extra bits unless you want to cast each one (could ganging help?).

This idea could also lead to more instruction-level polymorphism in certain cases too, potentially reducing code size, although obviously if floating point adds take several cycles they couldn't be scheduled interchangeably with integer adds that take one cycle, so it would only work in certain cases but things like pointers and integers should be interchangeable. It could also lead to more instructions being supported for free. E.g. pointer ops are currently limited to addp and subp, but with this scheme the Mill could support all the arithmetic and bitwise operators on unsigned 60 bit integers (where the upper 4 are ignored), if that would be desireable. It also would lower the cost of adding more types, like e.g. 52 bit integers which are important for languages like JS and Lua(JIT) which use NaN tagging and allow 52-bit integers internally (double floats can also represent all integers up to 2^53 exactly, after that you can only represent every 2nd integer until the next power of 2, then every 4th integer until the next power of 2, then every 8th, etc. I know Ivan Godard knows this, as he was on the committee fighting against NaN's.) I have no idea how cheap or expensive it is in hardware to actually execute though. You could also go full LLVM and just support any bitsize smaller than the maximum supported unit, although that could mean adding a whole extra byte of metadata, so probably not worth it.

I also wonder if this would ever improve pipelining (across multiple instructions?) if there was ever a situation where you wanted the same add operation to work regardless of whether the operand is a pointer or an integer, although I doubt that comes up much in a statically-scheduled belt architecture. It could be useful for eliminating branches in some cases though. E.g. (cond) ? x + 1 : y + 1 could pick between x and y and add 1 regardless, even if x is an integer and y is a pointer. However, considering that the pick phase comes after the op phase, that might not work or would be more expensive than just speculatively computing both? Might it come in handy in open-code though to increase ILP?

By the way, this wouldn't necessarily have to effect Mill assembly as it stands today. One could "just" make the specializer handle it, and infer how load/constant ops should encode the type based on usage and insert the necessary casts in places where a value is used in an op for a different type. As mentioned earlier, this could be "tuned" to an extent by making certain ops more agnostic of type metadata, e.g. like zero-filling right shift and sign-extending right shift could stay separate ops and work regardless of signedness, even if signedness goes into metadata. (And you might just want to keep signedness out of the type metadata in general depending on what the tradeoff looks like.)

Anyway, that's all I could think of. What other tradeoffs would there be on the Mill and what would the cost/benefit roughly look like?

Validark

What happened to this post? Did anyone receive the contents of my post in an email and would be willing to paste it here?

ivan

No idea what happened to the original post; it doesn't show on the Forum logs. However, I can address the question implied by the title: types in metadata.

A dynamic language like SmallTalk must keep a runtime representation of type in order to resolve the meaning of the executable. This gives second-order power and flexibility: with cleverness and SmallTalk you can brush your teeth. But type resolution is not fast.

Most programs, even those in second-order languages, are in fact first-order and bindings can be fully resolved at compile time. The rare cases where you really need second-order you can kludge around by having (first-order) programs write (first-order) programs which are then compiled. Our software is full of such generators - it's how we can retarget for different family members based on a specification. But all steps are compiled, and everything that can be done at compile time is done then, which includes all static typing. So there's no reason to carry type in the metadata at runtime.

The compile step also elides type transformations that are no-ops in the execution. With meta-typng every transform of an operand from int to unsigned would have to be an explicit runtime op to change the metadata even though it did nothing to change the operand value. Think how many place Foo is changed to const Foo. Ouch.

staff

I'm not sure how this happened, but your post got flagged as spam. It's now manually marked as "not spam".

ivan

Wow!

There are conceptual issues to what you suggest, and implementation ones.

The biggest problem is that you assume that that "add" can be disambiguated with a type. It can't, because there are multiple flavors of "add" within each type. Floatint point adds have a rounding mode specification, while integer adds have an overflow behavior (saturate, fault, truncate, widen). To make that specification for any given op requires that the author, and later the generated code, assume a type for which the spec is meaningful. And if you make such a static assumption then you might as well put the type in the static opcode (addf vs addb vs addu vs ...) and omit it in the (expensive) metadata.

Then there's implementation. Real code has an immense number of silent type transformations, which would need to become explicit ops (to change the metadata), which would explode the code size, and add to the latency of a dataflow.

And there's hardware issues. For area reasons op kinds are grouped so that related operations are done by related hardware, known as a Functional Unit or FO: ALU, FPU, LSU, etc. Instruction parsing breaks the binary apart to determine which FU of the several available executes the particular instruction, and then routes the rest of the instruction to that FU where further decode and execution happen.

However, the nature of what the ops are to do breaks apart naturally by type: integer operations go to the ALU and so on. In a static system, once the FU is decided the rest of decode and setup can be done in parallel with data argument fetch, using data paths that go only to the selected FU. In a type-dynamic system, the decoder could not choose the FU until the arguments had been fetched and the metadata examined, and then the lot including the args would be shipped to the FU. I think you can see that this would add at least a cycle to every operation; you probably cannot appreciate what the rat's nest it would make of data routing, but take my word that it;s bad.

Meta-typing looks plausible for single-issue when performance is not an issue. It doesn't work when there are large numbers of instructions being executed concurrently and performance matters.

So much for tutorial :-) I'm going to let this topic drop at this point.

Validark

I don't have a any ideas that would be a silver bullet to disambiguate between different overflow/rounding behaviors. You're right that you'd need at least some info bits either way.

However, with a little creativity and a 100% complete and total ignorance of how to implement hardware, I have a complete shot in the dark to take.

What if you could track the types in the decoder and leave them there? Is there any way you could perform like a type-transformation equivalent to the transformation the instructions do? E.g. if I have a vector and turn that into a bitmap, you could figure out what the resulting type is without needing to compute the actual answer. Then you could decide which functional unit to delegate instructions to in the decoder phase of the pipeline. Then, casts would exist solely in the decoder phase and would not need to take up precious functional units. It could theoretically affect how many instructions you can decode per cycle, but what kind of code have you seen that uses a ridiculous amount of casts between totally incompatible types? I doubt it would be much more common than explicit no-ops on the Mill. I know doubles get abused for NaN-tagging tricks, but I wonder if it's still possible there is a tradeoff that could be worth it if you can reduce the opcode space by another large factor. Maybe that would enable even more ops to be decoded simultaneously and it could pay for itself.