Very interesting! I have been playing with a processor design where ALU status flags are pre-register (i.e. registers are 33 bits, including the carry bit, plus an overflow flag), but I forgot to think how it could be applied to floating point and vectors!
Intel has their KNI instruction set with vector mask registers, but the idea of just integrating them with the registers themselves seems to work very well. Especially as pick itself is actally a 4-way choice. Each selection boolean can be 0, 1, None, or NaR.
I presume you have 6 status bits per 32 bits, which either gives you one NaR bit per byte (with “None” encoded as a type of NaR), or one NaR bit and 5 IEEE flags per 32-bit float.
(Um… or do ypu have 6 bits per half-precision?)
The question someone asked (youtube@35:20) about how you manage vector pick in the bypass system is definitely something I’d like a more detailed explanation of. For scalars, it’s obviously simple renaming. For vectors, do you rename each byte lane separately? Even if you did, the sources would eventally fill up the bypass registers abd break your “the are always enough output latches” assumption) and then what? Do you bypass in 0 cycles and realize in 1 cycle in parallel?
The other thing I wonder about is what happens if you feed incompatible data to an ALU, like a vector of bytes in one side and a vector of 32-bit words in the other? Fault, interpret as per operand 1, or what?
What happens if you widen a maximum-width scalar? Narrow a minimum-width?
Another question is whether scalar and vector widen share the same opcode? The talk stresses (@20:25) “there’s only one narrow op, and only one widen op” but having the number of results dropped on the belt (1 for scalar widen, 2 for vector) depend on the input metadata seems perverse.
Can you do vector/scalar ops directly, or is there a required “broadcast” instruction? Pick seems to support vector/sclar operation; do others?
The 2-output smear seems like an unnecessary extra write port. I realize you don’t have regular register file write ports, but it’s still an extra bypass path.
Did you consider either of these two options:
1. doing away with smearx, and instead having a pickx operation which shifts the control word. The most significant part of the control word is not used by pickx and can be branched on to control the surrounding loop.
2. Having a smearx which rotates rather than shifts the vector, and a pick0 which ignores the lsbit of its control vector input. A branch can look at the lsbit of the control vector, which is ignored by pick0. (This is a little uglier, but might make 0-cycle pick easier to implement.)
I also note that the strcpy() is using unaligned loads. Have you considered having an “aligned partial load” that fills the unwanted leading bytes with None?
(BTW, bragging about how the Mill’s 5-cycle mispredict penalty is better than other chips’ 20-30 is specious. Those chips execute 2-3 instructions per cycle, so the penalty is ~60 instructions. The Mill’s slower clock means that the time delay is not much less, and it may easily be more instructions.)