1. Sure, you can use retire to get that fetch-ahead load, if you don’t mind stalling while the loop warms up. Alternatively, you can just fix the instruction set so I can do it the right way ;).
2. Neat, I knew there must be something like that. It was too obvious to overlook.
6, 7, 8. I explicitly chose not to unroll or vectorize, to keep the comparison meaningful. I’m sure these things work, but I tried to stick to what an OoO core would be trying to do in hardware: no maths tricks beyond what I could believe stock GCC would give.
- This reply was modified 6 years, 4 months ago by Veedrac.