Effective IPC during divide emulation?
What happens to a Mill’s effective IPC while it’s emulating a divide operation?
An iterated-in-software approach to divide (and its relatives) would, in the general case, appear to require multiple cycles, and a number of cycles that is unknown at compile time. If an emulated divide takes N cycles on average and nothing else can be scheduled alongside it, because the iteration count is unknown at compile time, then the effective ops/clock would seem to drop to something proportional to 1/N, unless there’s a way to use the rest of the Mill’s width (or at least a good portion of it) during the emulated divide. If no other ops can be scheduled during the emulation, that sounds like a substantial performance hit!
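To make the premise concrete, here is a minimal sketch of a generic shift-subtract (restoring) unsigned divide whose iteration count depends on the dividend, so it can't be known at compile time. This is a textbook algorithm used purely for illustration and is not a claim about how the Mill actually emulates divide. The names `emulated_udiv` and `serialized_ops_per_cycle` are my own, and the "one loop iteration per cycle, nothing overlapped" cost model is an assumption.

```python
def emulated_udiv(n: int, d: int) -> tuple[int, int, int]:
    """Generic restoring division on unsigned ints.

    Returns (quotient, remainder, iterations). Illustrative only;
    not the Mill's actual emulation sequence.
    """
    if d == 0:
        raise ZeroDivisionError("division by zero")
    if n < 0 or d < 0:
        raise ValueError("unsigned operands only")
    q = 0
    r = 0
    iterations = 0
    # One iteration per significant bit of the dividend, so the trip
    # count is data dependent and unknown at compile time.
    for bit in range(n.bit_length() - 1, -1, -1):
        r = (r << 1) | ((n >> bit) & 1)  # bring down the next dividend bit
        q <<= 1
        if r >= d:                       # subtract succeeds: quotient bit is 1
            r -= d
            q |= 1
        iterations += 1
    return q, r, iterations


def serialized_ops_per_cycle(divide_cycles: int) -> float:
    """Assumed worst case: one useful result per emulated divide and
    nothing else scheduled during it, giving the 1/N figure above."""
    return 1.0 / divide_cycles


if __name__ == "__main__":
    for n, d in [(100, 7), (5, 7), (1 << 40, 3)]:
        q, r, iters = emulated_udiv(n, d)
        print(n, d, q, r, iters, serialized_ops_per_cycle(iters))
```

The trip count here varies from 0 up to the operand width, which is exactly the variability that would seem to stop a static scheduler from filling the remaining slots.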
I won’t speculate publicly (I hate that constraint, but understand the need!), but I’m very curious about:
a) How can the Mill CPU and toolchain get anything else useful done during an emulated divide?
b) What does the Mill’s approach to divide do to its effective IPC in code that performs a substantial number of division or remainder ops?
Apologies if this is a sensitive issue, but eventually people are going to want to know. (On the off chance that you’re not satisfied with your solution, I’d be happy to share my speculations via email.)
As always, thanks in advance for anything you can share publicly.