It’s clear that it can be done, and that it will be as efficient as on a general-register machine. However, working out the best way to do it will take some serious head-scratching and experiment.
My personal starting point would be to keep the Forth stack in the belt and scratchpad. That permits everything to be fixed latency, which gives the compiler a chance to generate stall-free code. Some Forth words are not machine primitives, and the expressions can be evaluated directly on the belt. However, most Forth values are either consumed shortly after creation or wind up lower in the Forth stack for reuse. A naive compiler can use rescue to get rubble out of the belt, and when a live value would still fall off, it can spill that value to scratchpad.
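To make the naive strategy concrete, here is a rough toy model in Python. It is only a sketch of the bookkeeping, not real Mill semantics: the class name, the word set (lit, add, mul), the counters, and the default belt length of 8 are all assumptions for illustration. In particular, rescue is modelled as simply discarding dead belt entries, and a fill is modelled as one drop on the belt.

```python
from collections import deque


class ToyBeltForth:
    """Toy model: each Forth stack entry lives on a fixed-length belt or in scratchpad."""

    def __init__(self, belt_len=8):      # belt length is an assumption of this sketch
        self.belt_len = belt_len
        self.belt = deque()              # value ids; index 0 is the newest drop
        self.values = {}                 # value id -> value (never cleaned up here)
        self.stack = []                  # Forth stack of value ids, top at the end
        self.scratch = {}                # spilled value id -> value
        self.rescues = self.spills = self.fills = 0

    def _advance(self, vid):
        """Drop vid at the front of the belt, rescuing or spilling if a live value would fall off."""
        full = len(self.belt) == self.belt_len
        if full and self.belt[-1] in self.stack:
            # Rescue: keep only the live entries, in order, so the rubble goes away.
            live = [v for v in self.belt if v in self.stack]
            if len(live) < len(self.belt):
                self.belt = deque(live)
                self.rescues += 1
        if len(self.belt) == self.belt_len:
            oldest = self.belt.pop()
            if oldest in self.stack:     # still live after the rescue: spill it
                self.scratch[oldest] = self.values[oldest]
                self.spills += 1
        self.belt.appendleft(vid)

    def _push(self, value):
        vid = len(self.values)
        self.values[vid] = value
        self._advance(vid)
        self.stack.append(vid)

    def _pop(self):
        vid = self.stack.pop()
        if vid in self.scratch:          # fill: bring the spilled value back
            value = self.scratch.pop(vid)
            self.fills += 1
            self._advance(vid)
            return value
        return self.values[vid]

    def lit(self, value):
        self._push(value)

    def add(self):
        b, a = self._pop(), self._pop()
        self._push(a + b)

    def mul(self):
        b, a = self._pop(), self._pop()
        self._push(a * b)

    def top(self):
        return self.values[self.stack[-1]]


# Short expression, "2 3 + 4 *": everything is consumed right away, so nothing spills.
short = ToyBeltForth()
short.lit(2); short.lit(3); short.add(); short.lit(4); short.mul()
print(short.top(), short.spills)         # 20 0

# Deep stack on a 4-entry belt: six literals, then five adds.
deep = ToyBeltForth(belt_len=4)
for n in range(1, 7):
    deep.lit(n)
for _ in range(5):
    deep.add()
print(deep.top(), deep.rescues, deep.spills, deep.fills)   # 21 1 2 2
```

In the second run the two oldest literals spill because everything on the belt is still live, while the first add finds dead operands on the belt and a rescue is enough. A real compiler would make these decisions statically at compile time rather than tracking them at run time as this toy does.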