This is really neat. Some comments, based on what I’ve found out about the architecture:
con(v(0xe0, 0xf0, 0xf8)) is a length-3 vector? This is a little confusing; IIUC it needs to be 128 bits.
andlu(%first, %prefmask) won’t work since the Mill doesn’t splat automatically.
I don’t think you can return immediates, as retntr(%onebyte, %first, 1) tries to do.
andlu’s immediate is morsel-sized, so andlu(%cont, 0xc0) won’t fit, and I don’t think the Mill will splat immediates either.
smearx(%picked) will return two results, so you can drop the separate any(%picked).
con(v(0, 0, 0, 0)) can be a rd() of the appropriate constant.
Overall I don’t know if SIMD was the right choice; using pick and interleaving the different paths would probably be faster.
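
To make the last point concrete, here is a rough scalar sketch in Python of what I mean by "pick and interleave". It is not Mill code. I'm assuming from the constants (0xe0, 0xf0, 0xf8, 0xc0) that the routine is classifying UTF-8 lead and continuation bytes. The names pick, utf8_seq_len and is_continuation are made up for illustration, and pick here is just a plain select helper, not a statement about the Mill op's exact semantics.

```python
def pick(cond, if_true, if_false):
    """Hypothetical stand-in for a select: both inputs are already computed."""
    return if_true if cond else if_false


def utf8_seq_len(first):
    """Sequence length implied by lead byte `first`, or 0 if it is not a lead byte."""
    # Evaluate every path's predicate up front (the "interleaved" part).
    is_one = (first & 0x80) == 0x00    # 0xxxxxxx
    is_two = (first & 0xE0) == 0xC0    # 110xxxxx
    is_three = (first & 0xF0) == 0xE0  # 1110xxxx
    is_four = (first & 0xF8) == 0xF0   # 11110xxx
    # Chain selects instead of branching on each case.
    length = pick(is_four, 4, 0)
    length = pick(is_three, 3, length)
    length = pick(is_two, 2, length)
    length = pick(is_one, 1, length)
    return length


def is_continuation(byte):
    """True for bytes of the form 10xxxxxx."""
    return (byte & 0xC0) == 0x80
```

The four predicates don't depend on each other, so on a wide machine they could presumably issue together, with the selects resolving afterwards. The sketch only classifies bytes; it doesn't reject overlong encodings or check that the right number of continuation bytes follows.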