This isn’t quite what I was hoping for from you. These codes are already hand-tuned for a particular x86. A Mill can of course emulate an x86, but it won’t be fast. On some Mill members there may even be (an approximation to) AVX2 operations; you have to pick a member with the same vector height as a Haswell. You could then code your critical function in conAsm assembler for that member, and you’d get roughly the performance that a Haswell on the same clock and process would give you, or better depending on how many shuffle units were on the member you chose. I’m sure some people will use Mills in this way because it’s how they are used to coding, but we don’t recommend it.
What I was hoping to get from you was the actual algorithm that your code is a machine-dependent implementation of. Express it in scalar if need be. There won’t be any shuffle operations in the algorithm, of course, nor any register sizes, nor registers for that matter. It’s impractical for me to try to deduce what your code is supposed to do by looking at what it actually does; decompilation is provably impossible in general.
The Mill does have an arbitrary shuffle. It’s not clear that any of your codes would actually use a shuffle, unless that were the only hammer you had.
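To illustrate what I mean by the algorithm expressed in scalar, here is a made-up example in C. It is not taken from your code, and the function name `reverse_bytes` is purely hypothetical. A hand-tuned x86 version of this might well be written with byte shuffles over vector registers, but the algorithm itself mentions neither:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: write the n bytes of src into dst in reverse order.
   No vector width, no registers, no shuffles; just what is to be computed.
   dst and src are assumed not to overlap. */
void reverse_bytes(uint8_t *dst, const uint8_t *src, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        dst[i] = src[n - 1 - i];
    }
}
```

Something at this level is what a compiler for any target can work from; how it gets mapped onto vector operations is then the tool chain’s problem rather than yours.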