I’m just blown away!
The great bulk of what you have done to improve the code is software pipelining, which the Mill is designed for. The pipelining done in our tool chain is still experimental (i.e. it breaks on everything), so I can’t give you a comparison yet, but in-loop ILPs of 9 are actually low on high-end Mills because you can keep overlapping iterations in the pipeline until you run out of FUs. Then there’s vectors…
Some minor notes on your code:
1) the “retire” operation (a.k.a. “Dave’s Device” here in-house) will let you get rid of all the loop setup (prologue) code, so there is no gain from sharing the prologue across the two loops.
2) “leave” discards all in-flight operations; you don’t have to worry about multiplies or loads that you start in the loop dropping after you exit.
3) the “conform” op is no more; it was obviated by letting branches supply a carry list. Similarly, the “rescue” in your code can be folded into the following branch op.
4) Long dependency chains in open (non-loop) code produce “sad” instructions even on a Mill 🙂
5) Your parseval$decimalMode EBB should end with a branch back to the same EBB; sending it to parseval$hexMode is a bug, I think.
6) It is possible to write the hex loop to eat a new character every cycle; the code has to split the OR reduction in two, with each half shifting by two nibbles, out of phase with the other, so that the shifted nibbles interleave. The interleave can either run inside the loop or be a final OR after it, if the loop exports both partials; our pipeliner tries to move all reductions to after the loop, but has problems with that. Have fun 🙂
7) If you get hexMode to one byte per cycle, try the same on the decimalMode loop. Hint: unroll by three, then schedule on a torus so that corresponding operations on different iterations schedule at the same point. It’s fairly easy when there are no loop-carried dependencies, but it gets harder when, as here, the loop is a reduction.
8) And if you got all that, remember that Silver has two mulFUs and Gold has four, so you can in fact eat 2 or 4 bytes per cycle. You were regretting “only” 4.3 ILP; try ILP 19. Mill eats loops for breakfast, lunch and dinner (I just love to brag about the baby).
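For anyone who wants to see the split reductions from notes 6 and 7 in a high-level form, here is a rough sketch in Python. It is not Mill code and not what our pipeliner emits. The function names, the `lanes` parameter, and the choice to do the final combine after the loop are all my assumptions for illustration. The point is only that each partial accumulator depends on itself alone, so the per-character chains are independent and can overlap.

```python
def parse_hex_split(text):
    """Parse hex digits with two independent OR reductions (hypothetical helper)."""
    even = 0  # nibbles from characters at even positions
    odd = 0   # nibbles from characters at odd positions
    for i, ch in enumerate(text):
        nibble = int(ch, 16)
        # Each partial shifts by two nibbles (8 bits), leaving a gap
        # that the other partial's nibbles will fill.
        if i % 2 == 0:
            even = (even << 8) | nibble
        else:
            odd = (odd << 8) | nibble
    # Final interleave after the loop: the partial that received the
    # last character already holds the lowest nibble, so shift the other.
    if len(text) % 2 == 0:
        return (even << 4) | odd
    return (odd << 4) | even


def parse_decimal_split(text, lanes=3):
    """Parse decimal digits with `lanes` independent multiply-add reductions.

    Assumption: one partial per unrolled iteration, each scaled by
    10**lanes per step, combined after the loop.
    """
    partial = [0] * lanes
    scale = 10 ** lanes
    for i, ch in enumerate(text):
        lane = i % lanes
        partial[lane] = partial[lane] * scale + int(ch)
    n = len(text)
    total = 0
    for lane in range(lanes):
        # Distance from this lane's last digit to the end of the string.
        offset = (n - 1 - lane) % lanes
        total += partial[lane] * 10 ** offset
    return total
```

With `lanes=3` each decimal partial sees a new digit only every third character, which is the property that lets the multiplies of neighbouring iterations overlap; raising `lanes` is the same idea stretched to more multiply units.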