Short functions with loads tend to have extra noops or stalls for the reason you give. A sufficiently smart compiler could issue prefetches in the caller, just as can be done for any architecture. But the general Mill approach is inlining.
Most ISAs inline to avoid call overhead, but calls are cheap on a Mill. We inline to get more for the machine width to work on. On a wider Mill you can inline surprisingly large functions with zero latency cost in the expanded caller, because the called ops fit in the schedule holes of the caller. Of course there is usually some increase in the code space used, but removing the call and return means less thrashing in the cache lines, so net it’s a win unless profiling shows the call is never taken.
We’re still tuning the inlining heuristics in the tool chain. As a rule of thumb, any function shorter than a cache line is worth inlining on icache grounds alone.