Note: Easier to reply if you put the questons in separate postings 🙂
Metadata bit count: implementation dependent. While stealing NaR bits (one per byte) for other use in wider data (such as the FP flags) would save bits, it would complicate the hardware logic; generally bits are cheap.
Four-case pick: None or NaR selectors simply pass through, using the data width rather than the selector width. Implementation defined whether the None/NaR payload passes through or a new payload is created – the Belt crossbar is clock-critical, and pick must be very fast.
Vector-wise pick: passing the data through directly to the consumer is straightforward; just muxing. Creating the additional operand on the belt is more complicated, but Not Filed Yet.
Mixed-width arguments: each operation has a set of width signatures specified for what it will accept (see the next talk, Specification). If it gets something it doesn’t want then it used to produce a NaR, but we changed it a year ago so it faults immediately; wrong width indicates a compiler bug, not a data problem like overflow.
Excess widen/narrow: same as width-error above.
“only one opcode”: “only one opcode per signature” is more correct, but that might take us too far afield in a talk. For widening there are two ops, one for all scalar widths and one for all vector widths. For narrowing there is only one op but with two signatures, one with one argument (scalar) and one with two (vector). The same is true for other widening operations, such as add with the “widen” overflow attribute.
“broadcast” operation: there isn’t one. There is a “splat” operation that replicates a scalar into a vector, but that’s not the same. Just do “add” and it’s happy with either vector or scalar data, whichever it gets. There’s only one adder; the width just sets breaks in the carry tree.
“smearx”: this does have two results, which impacts the operation latency. The second port exists anyway; the Mill has a lot of two-result operations (such as vector widen). The talk doesn’t address latency because it couldn’t cover pipelining (May? maybe), but a loop written as shown, without pipelining and phasing, would need a no-op after the smearx. With phasing and pipelining the whole loop is only one cycle anyway, but that’s two more talks worth of explanation 🙁
“pickx”: not advantageous. It would add another mux to the belt crossbar, slowing the clock rate. In addition, while not used in the example, the bool vector is useful in other loops for more than pick.
rotating smear: Both the smear and the pick0 should be easy, and I don’t think that the pick0 would have the clock cost that pickx does. However, the loop control bool is at the wrong end of the bool vector from where smeari puts it, so there would need to be an extract operation to get it out to a scalar before it can be branched on, so that’s the same second cycle and belt position that the present definition of smearx uses, so no gain.
aligned loads: we have looked at quite a few possibilities in this area, with a goal of getting good performance on funnel-shifting a data stream. It’s quite hard to do that without branches, and we are not entirely happy with the present approach (Not Filed Yet)
clock rate: While for business and risk-reduction reasons the initial Mills will have a low clock, there is nothing in the architecture that precludes getting the same clock rate as any other chip.
mispredict penalty: While you are right that a high-end Mill mispredict stall can lead to loss of as many operation issues as a superscalar with a longer recovery would have, on the lower-end Mill family members (with peak issue widths no greater than conventional machines) the Mill advantage is real. We don’t claim that the Mill is always better at everything; we claim that it is always no worse and very frequently much better 🙂
An aside: I’m impressed with how far you have taken the overview given in the talk. The Mill boggles quite a few people 🙂 I hope you will continue to hang around the forum here.