Regarding PLB size:
Consider the size of a high-end conventional L1 TLB; it might contain 64 4K page entries, 32 2MB page entries and 4 1GB page entries.
The conventional L1 TLB has to do the address translation before the load from L1 cache itself; the translation and lookup are serial.
This is why the L1 TLB is forced to be small in order to be fast, and why it hasn’t been growing in recent high-end OoO superscalar microarchitectures. They have instead been adding L2 TLBs and so on because of this problem.
A recent article on conventional CPUs actually counts TLB evictions for various real syscalls:
Some of these syscalls cause 40+ TLB evictions! For a chip with a 64-entry d-TLB, that nearly wipes out the TLB. The cache evictions aren’t free, either.
Now consider the situation for the Mill PLB: the entries are arbitrary ranges (rather than whole pages), and it has as many cycles as the actual L1 lookup to do its protection check. It can be large and slow because its work proceeds in parallel with the lookup.
Now this really emphasises the real and practical advantages of a virtual cache and Single Address Space architecture 🙂
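To make the range-versus-page point concrete, here is a rough toy model in Python. It is only a sketch under my own assumptions: `Region`, `plb_allows` and `pages_needed` are hypothetical names made up for illustration, and this says nothing about how the real PLB hardware is organised. It just shows that one range entry can cover a buffer that would take dozens of 4K page entries.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # 4K pages, as in the TLB example above


@dataclass(frozen=True)
class Region:
    """Hypothetical PLB-style entry: an arbitrary byte range plus rights."""
    start: int          # first byte covered (inclusive)
    end: int            # one past the last byte covered (exclusive)
    rights: frozenset   # e.g. frozenset({"r", "w"})


def plb_allows(entries, addr, size, right):
    """Return True if some entry grants `right` over [addr, addr + size)."""
    return any(
        e.start <= addr and addr + size <= e.end and right in e.rights
        for e in entries
    )


def pages_needed(start, end):
    """Number of 4K page entries needed to cover the bytes in [start, end)."""
    first_page = start // PAGE_SIZE
    last_page = (end - 1) // PAGE_SIZE
    return last_page - first_page + 1


# One range entry describing a 200,000-byte read/write buffer.
buffer = Region(start=0x10000, end=0x10000 + 200_000,
                rights=frozenset({"r", "w"}))
entries = [buffer]
```

With this model, `pages_needed(buffer.start, buffer.end)` says how many page-granular entries the same buffer would occupy, while the range-based table holds a single entry for it. An access that runs past `buffer.end`, or asks for a right the entry does not grant, is simply rejected by `plb_allows`.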
On the second question about SIMD: exactly! 🙂
Excess slots in a vector can be filled with 1.0, or whatever value nullifies those elements for the operations to be performed (that is, the identity for the operation, such as 1.0 for a multiply or 0.0 for an add).
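As a small illustration of the padding idea, here is a sketch in plain Python. The lane count of 8 and the names `IDENTITY`, `pad_to_lanes` and `reduce_padded` are my own assumptions for the example, not anything from a real SIMD API; a list stands in for the vector register.

```python
import math
import operator
from functools import reduce

LANES = 8  # assumed vector width, purely for illustration

# The value that leaves the result unchanged for each operation.
IDENTITY = {"add": 0.0, "mul": 1.0, "min": math.inf, "max": -math.inf}
OPS = {"add": operator.add, "mul": operator.mul, "min": min, "max": max}


def pad_to_lanes(values, op, lanes=LANES):
    """Fill the excess slots with the identity value for `op`."""
    pad = (-len(values)) % lanes
    return list(values) + [IDENTITY[op]] * pad


def reduce_padded(values, op):
    """Reduce over the full padded vector; the padding has no effect."""
    padded = pad_to_lanes(values, op)
    return reduce(OPS[op], padded, IDENTITY[op])
```

For example, `reduce_padded([2.0, 3.0, 4.0], "mul")` multiplies across all eight slots, but the five padded slots hold 1.0 and so do not change the product.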