Generally, placing an op close to its consumer is best because it reduces belt pressure; if you compute a value early, it may have fallen off the belt before it gets used, forcing spill/fill ops to retain it. The exception is loads, which should be issued as early as possible while still retiring as late as possible; maximizing load deferral gives the memory hierarchy the most time to come up with a value.
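To make the "as late as possible" rule concrete, here is a minimal sketch of the idea in Python. It is a toy, not the actual scheduler: the op table, the `alap_schedule` name, the single `deadline` cycle, and the rule that a load with no in-block inputs issues at cycle 0 are all assumptions made for illustration, and it assumes the dependency graph is acyclic.

```python
def alap_schedule(ops, deadline):
    """Place each op as late as its consumers allow.

    ops maps a name to (latency, [input names], is_load).
    Returns a dict mapping each name to (issue_cycle, retire_cycle).
    """
    # Invert the input lists so each op knows who consumes its result.
    consumers = {name: [] for name in ops}
    for name, (_latency, inputs, _is_load) in ops.items():
        for src in inputs:
            consumers[src].append(name)

    schedule = {}

    def place(name):
        if name in schedule:
            return schedule[name]
        latency, inputs, is_load = ops[name]
        # The result must be ready when the earliest consumer issues;
        # an op with no consumers only has to finish by the deadline.
        retire = min((place(user)[0] for user in consumers[name]),
                     default=deadline)
        issue = retire - latency
        if is_load and not inputs:
            # Assumed rule: the address is available at block entry, so
            # issue the load at cycle 0 and defer only its retirement.
            issue = min(issue, 0)
        schedule[name] = (issue, retire)
        return schedule[name]

    for name in ops:
        place(name)
    return schedule


example_ops = {
    "ld":  (3, [], True),              # load, address known at entry
    "mul": (3, [], False),
    "add": (1, ["ld", "mul"], False),
    "st":  (1, ["add"], False),
}
example_schedule = alap_schedule(example_ops, deadline=10)
```

In this example the multiply issues at cycle 5 so that it retires just as the add needs it, while the load issues at cycle 0 but still retires at cycle 8, which gives memory eight cycles instead of three.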
Because the inputs are themselves computed as late as possible, usually you don’t have the inputs waiting around on the belt either. There are exceptions: when the inputs are call arguments, or when an input is used more than once, with an early consumer and a late consumer. However, all of these cases are examples of a general pattern: inputs must be produced early for outside reasons, and outputs must be consumed late, also for outside reasons, so there is a range between production of the inputs and consumption of the outputs in which the op could be placed. It turns out that where you place the op makes no difference in this pattern: if you place it early then its outputs may have to be spilled, but if you place it late then its inputs may have to be spilled. The impact on belt pressure is roughly the same, assuming that the numbers of inputs and outputs are roughly the same. Because we are placing as late as possible in general, we place this pattern late too, just for simplicity, even though it could have been placed earlier without affecting overall latency.
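The "makes no difference" claim can be checked with a back-of-the-envelope model. The sketch below counts value-cycles of belt occupancy for a single op; the function name and the cost model are assumptions for illustration only, and it ignores the op's own latency and the actual belt length.

```python
def live_value_cycles(n_inputs, n_outputs, produced_at, consumed_at, placed_at):
    """Total cycles that values sit on the belt waiting, for one op.

    Inputs wait from `produced_at` until the op is placed; outputs wait
    from the placement until `consumed_at`.
    """
    if not produced_at <= placed_at <= consumed_at:
        raise ValueError("op must sit between production and consumption")
    input_wait = n_inputs * (placed_at - produced_at)
    output_wait = n_outputs * (consumed_at - placed_at)
    return input_wait + output_wait


# Two inputs and two outputs in a 10-cycle window: every placement costs the same.
balanced_costs = [live_value_cycles(2, 2, 0, 10, p) for p in range(11)]

# Two inputs but one output: now the placement does matter.
early_cost = live_value_cycles(2, 1, 0, 10, placed_at=0)
late_cost = live_value_cycles(2, 1, 0, 10, placed_at=10)
```

With equal input and output counts the cost is 20 value-cycles wherever the op goes. With two inputs and one output, early placement costs 10 and late placement costs 20, which is why the argument depends on the counts being roughly the same.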
Incidentally, operation placement is equivalent to the NP-hard bin-packing problem, so it’s a game of heuristics because optimal placement is impractical.
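For a feel of what "a game of heuristics" means for bin packing, here is first-fit decreasing, a textbook heuristic for the generic problem. It is not the placement heuristic described in this post; the function name and the example sizes are made up for illustration.

```python
def first_fit_decreasing(sizes, capacity):
    """Pack items into bins of the given capacity, largest items first.

    Each item goes into the first bin that still has room; a new bin is
    opened only when none does. Fast, but not guaranteed to be optimal.
    """
    bins = []
    for size in sorted(sizes, reverse=True):
        if size > capacity:
            raise ValueError(f"item of size {size} exceeds capacity {capacity}")
        for contents in bins:
            if sum(contents) + size <= capacity:
                contents.append(size)
                break
        else:
            # No existing bin had room, so open a new one.
            bins.append([size])
    return bins


packed = first_fit_decreasing([4, 8, 1, 4, 2, 1], capacity=10)
```

Here the six items total 20 and fit into two bins of capacity 10, so the heuristic happens to hit the optimum; in general it only promises a reasonable answer quickly.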
Power is a different issue that our hardware team will address.