That kind of optimization is a middle-end problem, which in our case means LLVM, and LLVM doesn’t seem to know anything about such things; auto-vectorization in general is weak to absent.
You are right that the first step for a vectorized version on the Mill would be to None out the whitespace; that’s easy. There’s no reduction-compaction operation in the ISA at this point. One possibility would be to turn the None-laden vector into a bitmask using the mask() op, and then use the mask as the control in a switch that would execute an appropriate shuffle() op for each mask to do the compaction.
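To make the mask-then-shuffle idea concrete, here is a rough sketch in portable C rather than Mill code. It assumes an 8-byte vector height, stands in a 256-entry table lookup for the switch on the mask, and uses plain loops where the real thing would use the mask() and shuffle() ops. All the names (is_ws, init_shuffles, strip_ws_masked) and the choice of which bytes count as whitespace are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

enum { LANES = 8 };  /* assumed vector height in bytes */

/* shuffle_for[m][k] = source lane of the k-th kept byte for mask m.
   This table plays the role of "one shuffle per switch case". */
static uint8_t shuffle_for[1 << LANES][LANES];
static uint8_t kept_count[1 << LANES];

/* Hypothetical whitespace test; the thread does not define the set. */
static int is_ws(uint8_t c) {
    return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

/* Build the per-mask shuffle patterns once, before any stripping. */
void init_shuffles(void) {
    for (unsigned m = 0; m < (1u << LANES); m++) {
        uint8_t n = 0;
        for (uint8_t lane = 0; lane < LANES; lane++)
            if (m & (1u << lane))
                shuffle_for[m][n++] = lane;
        kept_count[m] = n;
    }
}

/* Copies src to dst without whitespace and returns the output length.
   dst must have room for len bytes. */
size_t strip_ws_masked(const uint8_t *src, size_t len, uint8_t *dst) {
    size_t out = 0, i = 0;
    for (; i + LANES <= len; i += LANES) {
        /* Stand-in for mask(): bit set when the lane holds a byte to keep. */
        unsigned m = 0;
        for (unsigned lane = 0; lane < LANES; lane++)
            m |= (unsigned)!is_ws(src[i + lane]) << lane;
        /* Stand-in for the shuffle selected by the mask. */
        for (unsigned k = 0; k < kept_count[m]; k++)
            dst[out + k] = src[i + shuffle_for[m][k]];
        out += kept_count[m];
    }
    /* Leftover bytes that do not fill a whole vector. */
    for (; i < len; i++)
        if (!is_ws(src[i]))
            dst[out++] = src[i];
    return out;
}
```

Call init_shuffles() once up front. The table grows as 2^height entries, which is one reason this approach looks unattractive beyond small heights.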
However, for low-height members (vector heights of 8 or 16 bytes) the cost of the armwaving to do vector compaction (absent a new op) probably exceeds the cost of doing it in the naive scalar loop, which trivially gets one byte per cycle (1 c/b) on a Mill.
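For reference, the naive scalar loop I have in mind looks roughly like the sketch below. The function name and the whitespace set are assumptions for illustration; the point is just that each iteration is one load, one compare, one store and one conditional advance of the output index.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical whitespace test; the thread does not define the set. */
static int is_ws(uint8_t c) {
    return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

/* Copies src to dst without whitespace and returns the output length.
   dst must have room for len bytes. */
size_t strip_ws_scalar(const uint8_t *src, size_t len, uint8_t *dst) {
    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        uint8_t c = src[i];
        dst[out] = c;              /* store unconditionally */
        out += (size_t)!is_ws(c);  /* advance only if the byte is kept */
    }
    return out;
}
```

Storing unconditionally and advancing the index only for kept bytes keeps the loop body free of a data-dependent branch; a dropped byte is simply overwritten on the next iteration.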
Another approach might be to use the machine width to do several bytes at a time MIMD. Two-way would involve two loads and two stores per cycle, which larger Mills can do. Two compares, two adds, a pick and a branch (needed for the rest of the loop) would also fit in the same instruction on those members, so you'd get two bytes/cycle in scalar MIMD. As this is simple unrolling, the compiler might be able to find it.
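A two-way unrolled version might look like the sketch below. It is plain C with made-up names, and it is not a literal op-for-op match for the instruction described above; how a compiler would actually schedule it on a given member is not something the source code can show.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical whitespace test; the thread does not define the set. */
static int is_ws(uint8_t c) {
    return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

/* Copies src to dst without whitespace, two input bytes per iteration.
   Returns the output length. dst must have room for len bytes. */
size_t strip_ws_two_way(const uint8_t *src, size_t len, uint8_t *dst) {
    size_t out = 0, i = 0;
    for (; i + 2 <= len; i += 2) {
        uint8_t a = src[i];                /* two loads */
        uint8_t b = src[i + 1];
        size_t keep_a = (size_t)!is_ws(a); /* two compares */
        size_t keep_b = (size_t)!is_ws(b);
        dst[out] = a;                      /* first store */
        dst[out + keep_a] = b;             /* second store, after a if a was kept */
        out += keep_a + keep_b;
    }
    if (i < len) {                         /* odd trailing byte */
        dst[out] = src[i];
        out += (size_t)!is_ws(src[i]);
    }
    return out;
}
```

The second store's address depends on whether the first byte was kept, which is the only coupling between the two lanes of work.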
However, absent a compaction op in the ISA, the right way to do this is streamers, but they are NYF.