This isn’t an official answer, but I think at least some of your questions are answered in the Belt talk.
* How did you decide on a 32 entry belt (for Gold, and 8 entry belt for Tin)?
The scratchpad has a three-cycle spill-to-fill latency: if you spill a value, you won’t be able to get it back for three cycles. Because of this, the length of the belt is set so that nearly everything lives on the belt for three cycles. So the length of the belt needs to be 3 times the number of results that the functional units can produce in one instruction for that family member.
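To make that arithmetic concrete, here is a tiny sketch of the rule as I understand it from the talk. It is not the Mill team's actual sizing method; the function name and the figure of 4 results per instruction are made up for illustration.

```python
SPILL_TO_FILL_LATENCY = 3  # cycles, the figure given in the talk


def min_belt_length(results_per_instruction, latency=SPILL_TO_FILL_LATENCY):
    """Belt length needed so a value survives `latency` instructions
    even if every instruction produces the maximum number of results."""
    return latency * results_per_instruction


# Hypothetical family member whose functional units can produce
# 4 results per instruction (an assumed number, not a real Mill spec).
print(min_belt_length(4))  # 12
```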
* Why a scratchpad instead of a second (slower) belt?
The belt is quite different from the scratchpad. The scratchpad has only two operations: “spill a value from the belt and put it in the scratchpad” (spill) and “take a value from the scratchpad and put it on the belt” (fill).
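A toy model may help show the difference. This is only my own sketch, not how the hardware works: the class and method names are invented, the sizes are arbitrary, and the three-cycle latency is ignored. The point is that the belt moves by itself as results arrive, while the scratchpad only changes when you spill to it.

```python
from collections import deque


class ToyFrame:
    """Toy model of one function's belt and scratchpad (latency ignored)."""

    def __init__(self, belt_length, scratch_slots):
        # Index 0 is the newest value; the oldest falls off the far end.
        self.belt = deque(maxlen=belt_length)
        self.scratch = [None] * scratch_slots

    def drop(self, value):
        """A new result arrives at the front of the belt."""
        self.belt.appendleft(value)

    def spill(self, belt_pos, slot):
        """Copy the value at a belt position into a scratchpad slot."""
        self.scratch[slot] = self.belt[belt_pos]

    def fill(self, slot):
        """Put a scratchpad value back on the front of the belt."""
        self.drop(self.scratch[slot])


frame = ToyFrame(belt_length=4, scratch_slots=2)
frame.drop("a")
frame.spill(0, 0)               # save "a" before it falls off
for v in ["b", "c", "d", "e"]:
    frame.drop(v)               # "a" has now fallen off the 4-entry belt
frame.fill(0)                   # bring "a" back to the front
print(list(frame.belt))         # ['a', 'e', 'd', 'c']
```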
* How was the size of the scratchpad decided on?
The videos mention that the scratchpad is on-chip memory that can be spilled out to the caches and eventually out to DRAM if necessary. The amount of scratchpad available to a given function is allocated by the specializer for that function; the function says up front how much scratchpad it needs. The size of the available on-chip memory is the same cost/speed trade-off you make when buying DRAM: buy as much as you can afford so that typical program use won’t need to swap memory out to disk.
I’m not sure I understand specifically what you mean by a ‘slower belt’. The belt described in the videos is per function call. Every function call gets its own empty belt and scratchpad; its caller’s belt (and scratchpad) is still around, and the caller’s caller’s belt is also still around. You could say that the belts of callers are ‘slower belts’ than the belt of the current function call, as they don’t change or move until the current function call completes and returns its result.
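Here is a small sketch of that per-call picture, again just my own toy model with invented names and an arbitrary belt length. Each call pushes a fresh empty belt, the caller's belt stays frozen while the callee runs, and the returned result lands on the caller's belt.

```python
from collections import deque

BELT_LENGTH = 4   # arbitrary toy size
call_stack = []   # one belt per active function call, innermost last


def call():
    """Enter a function: it gets its own empty belt."""
    belt = deque(maxlen=BELT_LENGTH)
    call_stack.append(belt)
    return belt


def ret(result):
    """Leave a function: discard its belt, drop the result on the caller's."""
    call_stack.pop()
    if call_stack:
        call_stack[-1].appendleft(result)


caller = call()
caller.appendleft(1)
caller.appendleft(2)

callee = call()
callee.appendleft(99)
callee.appendleft(100)
frozen = list(caller)    # the caller's belt has not moved: [2, 1]

ret(callee[0])           # return the newest value on the callee's belt
print(list(caller))      # [100, 2, 1]
```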