There’s not much reason for hardware divide on a machine as wide as the Mill. As your question shows, conventional divide is implemented as a microcode loop that blocks the pipeline, and pipe blockage gets more painful with wider machines.
While divides are defined in the abstract Ml, we do not expect them to be native hardware in any member. Instead,the specializer will substitute an emulation using Newton-Rapheson. The emulation will make use of several “divide-helper” ops that will be native. Thus the emulation, while software visible, is doing essentially the same job as the microcode on a conventional. The result has the latency of native hardware, but the individual operations can be scheduled intermixed with other ops, including those from other divides.
We expect to use the same strategy for sqrt and transcendentals.