I don’t have any ideas that would be a silver bullet to disambiguate between different overflow/rounding behaviors. You’re right that you’d need at least some info bits either way.
However, with a little creativity and a 100% complete and total ignorance of how to implement hardware, I have a complete shot in the dark to take.
What if you could track the types in the decoder and leave them there? Is there any way you could perform a type transformation equivalent to the transformation the instructions do? E.g. if I have a vector and turn that into a bitmap, you could figure out what the resulting type is without needing to compute the actual answer. Then you could decide which functional unit to delegate each instruction to in the decode phase of the pipeline. Casts would then exist solely in the decode phase and would not need to take up precious functional units. It could theoretically affect how many instructions you can decode per cycle, but what kind of code have you seen that uses a ridiculous number of casts between totally incompatible types? I doubt it would be much more common than explicit no-ops on the Mill. I know doubles get abused for NaN-tagging tricks, but I wonder if there is still a tradeoff that could be worth it if you can reduce the opcode space by another large factor. Maybe that would enable even more ops to be decoded simultaneously and it could pay for itself.
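To make the idea a bit more concrete, here is a toy software sketch in Python of what I mean by tracking types in the decoder. Everything in it is made up for illustration: the type tags, the opcode names (`add`, `to_bitmap`, `cast`), the functional unit names, and the `RULES` table are all hypothetical and say nothing about how the Mill or any real decoder works. The point is only that the result type and the target functional unit can be looked up from the operand types without computing any values, and that a cast just retags a slot and never gets issued.

```python
# Hypothetical type tags the decoder would track per value slot.
INT, FLOAT, VEC_INT, BITMAP = "int", "float", "vec_int", "bitmap"

# Hypothetical type-transfer rules:
# (opcode, operand types) -> (functional unit, result type)
RULES = {
    ("add", (INT, INT)): ("int_alu", INT),
    ("add", (FLOAT, FLOAT)): ("fp_unit", FLOAT),
    ("add", (VEC_INT, VEC_INT)): ("vec_alu", VEC_INT),
    ("to_bitmap", (VEC_INT,)): ("vec_alu", BITMAP),
}


def decode(program, input_types):
    """Pick a functional unit for each op using only decoder-side types.

    `program` is a list of tuples: ("cast", slot, new_type) or
    (opcode, slot, ...). Returns the ops actually issued to functional
    units and the final type of every slot.
    """
    types = list(input_types)  # type of each value slot, known at decode time
    issued = []
    for opcode, *operands in program:
        if opcode == "cast":
            # A cast only retags the slot in the decoder; nothing is issued.
            slot, new_type = operands
            types[slot] = new_type
            continue
        operand_types = tuple(types[s] for s in operands)
        unit, result_type = RULES[(opcode, operand_types)]
        issued.append((unit, opcode, tuple(operands)))
        types.append(result_type)  # each result lands in a fresh slot
    return issued, types


program = [
    ("add", 1, 2),       # int + int, goes to the integer ALU
    ("cast", 3, FLOAT),  # reinterpret the sum as a float, decoder only
    ("add", 3, 3),       # same opcode, now routed to the FP unit
    ("to_bitmap", 0),    # vector -> bitmap, result type known up front
]
issued, types = decode(program, [VEC_INT, INT, INT])
for unit, opcode, slots in issued:
    print(unit, opcode, slots)
```

Four instructions go in, but only three reach a functional unit, and the single `add` opcode covers both the integer and floating-point cases. Whether a hardware decoder could afford that lookup per cycle is exactly the part I can't judge.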