Yes, you can have a wider path to DRAM, although there are board-level hardware issues; the costs and benefits are independent of the ISA. And you can reduce the DRAM latency by stacking and similar approaches; this to is independent of ISA.
Where the ISA can help is twofold: reducing the total DRAM traffic; and changing the timing. The Memory talk showed ways that the Mill cuts the traffic bandwidth; not a lot (perhaps 25%), but every bit helps. And the Mill has ways to alter the timing, but the patents aren’t in on those yet.
As for trading, there not much left to trade. On many conventionals there are serious overheads setting the machine up for vectors, but the Mill doesn’t have that problem (and there’s no fundamental reason others should either). You might as well vectorize a short loop on a Mill; even if it turns out that there’s only one element in the array, the cost of a vector and a scalar code will be very close, and vector wins big once the element count exceeds the vector height. The only real advantage of a scalar loop for very short arrays is in power usage – they are equally fast, but dealing with the Nones in the unused elements does cost power.
Not sure I’ve answered your question; feel free to try again.