Transcribed from comp.arch
On 1/18/2014 12:19 PM, Stephen Fuld wrote:
> On 1/7/2014 10:37 AM, Stephen Fuld wrote:
>> I watched the talk, and after some time thinking about it, I have a
>> few questions.
> Another one.
> If this would be better after the talks on software pipelining and/or
> “phasing”, just say so.
> I have been thinking about your loop parallelization mechanism. It
> seems from what you have presented so far, the degree of
> parallelization possible is the number of elements in a vector for
> the particular version of the Mill you are running on. But since you
> have so many FUs available, is there a way to “link” multiple vectors
together in order to gain increased parallelism? E.g., if you could
somehow process two vectors' worth of bytes in parallel (with the
associated controls to prevent stores on the second vector), you
would double the speed of the strcpy example you presented. I am
> not sure what I am asking for here but I could see some possible ways
> to do it.
If there are enough FUs then the compiler can unroll the vectorized loop sufficiently to soak them up. In the strcpy example, the ops would simply be issued twice, plus you need a bit extra to compute the None mask for the second store and for the branch condition. Roughly:
load1  load2
eql1   eql2
smear1 smear2
pick1  pick2a(done1, smear2_mask, Nones)  pick2b(pick2a, load2, Nones)
store1 store2
or(smear1_done, smear2_done)
branch
The extras are straightforward. The extra “or” combines the “are we done” results of the two smears, so that the loop exits if either one found the terminating null.
Pick2a and pick2b together build a mask for the second store. Pick2a uses the “are we done” bool from the first smear to produce either the mask from smear2 or all None. That result is then used as the control to pick None into the loaded value: if there was a null in the first load, pick2b has an all-None control and yields all None, whereas if there was no null, pick2b has the usual smear mask for control and works as in the talk.
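To make the dataflow concrete, here is a minimal Python sketch of the unrolled iteration. The helper names (`eql_zero`, `smearx`, `pick`, `store`) and their semantics are my paraphrase of the ops described above, not actual Mill code: Python's `None` stands in for the Mill's None metadata, a store simply skips None lanes, and the source is assumed padded so every load is full width.

```python
VEC = 4  # hypothetical vector width in bytes

def eql_zero(v):
    # lane-wise compare against the null terminator
    return [b == 0 for b in v]

def smearx(bools):
    # Exclusive smear: lane i is True if any earlier lane was True.
    # Also returns the "are we done" bool (was any lane True at all).
    out, seen = [], False
    for b in bools:
        out.append(seen)
        seen = seen or b
    return out, seen

def pick(ctl, if_true, if_false):
    # Lane-wise pick; a None control lane yields None in that lane.
    return [None if c is None else (t if c else f)
            for c, t, f in zip(ctl, if_true, if_false)]

def store(dst, lanes):
    # A store leaves None lanes unwritten.
    for b in lanes:
        if b is not None:
            dst.append(b)

def strcpy_unrolled(src):
    nones = [None] * VEC
    dst, i = [], 0
    while True:
        v1, v2 = src[i:i + VEC], src[i + VEC:i + 2 * VEC]
        m1, done1 = smearx(eql_zero(v1))
        m2, done2 = smearx(eql_zero(v2))
        s1 = pick(m1, nones, v1)            # pick1
        ctl2 = nones if done1 else m2       # pick2a
        s2 = pick(ctl2, nones, v2)          # pick2b
        store(dst, s1)                      # store1
        store(dst, s2)                      # store2
        if done1 or done2:                  # the extra "or", then branch
            return bytes(dst)
        i += 2 * VEC
```

For example, `strcpy_unrolled(list(b"hello\x00") + [1, 1])` copies the null in the second vector's second lane, masks the two trailing lanes to None, and returns `b"hello\x00"`.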
The result is that the unrolled loop is twice as long plus two more ops. It could be fully pipelined for a throughput of 2x the vector size per cycle. The pipeline startup time to the first store would be the same as for the non-unrolled loop: load->eql->smear->pick->store. Unrolling needs five flow slots (2x(load, store), plus branch); three exu slots (2xeql, plus or); and three writer slots (3xpick). Four load/store function units means a pretty high-end Mill.
Some Fortran compilers can do this kind of unrolled vectorization, so I don’t see any reason why a Mill compiler could not generate this code. It’s not high-priority work, though, because of its limited utility on medium and smaller family members.
- This reply was modified 9 years, 10 months ago by Ivan Godard.