Transcribed from comp.arch
===================================================================
On 1/18/2014 12:19 PM, Stephen Fuld wrote:
> On 1/7/2014 10:37 AM, Stephen Fuld wrote:
>> I watched the talk, and after some time thinking about it, I have a
>> few questions.
>
>
> Another one.
>
>
> If this would be better after the talks on software pipelining and/or
> “phasing”, just say so.
>
> I have been thinking about your loop parallelization mechanism. It
> seems from what you have presented so far, the degree of
> parallelization possible is the number of elements in a vector for
> the particular version of the Mill you are running on. But since you
> have so many FUs available, is there a way to “link” multiple vectors
> together in order to gain increased parallelism? e.g. if you could
> somehow process two vectors worth of bytes in parallel, (with the
> associated controls to prevent stores on the second vector), you
> would double the speed of the strcpy example you presented. I am
> not sure what I am asking for here but I could see some possible ways
> to do it.
>
>
>
If there are enough FUs then the compiler can unroll the vectorized loop sufficiently to soak them up. In the strcpy example, the ops would simply be issued twice, plus you need a bit extra to compute the None mask for the second store and for the branch condition. Roughly:
    load1  load2
    eql1   eql2
    smear1 smear2
    pick1  pick2a(done1, smear2_mask, Nones)  pick2b(pick2a, load2, Nones)
    store1 store2
    or(smear1_done, smear2_done)
    branch
The extras are straightforward. The extra “or” ORs together the “are we done” results of the two smears so that the loop exits if either thought it was finished.
The pick2a and pick2b build a mask for the second store. Pick2a uses the “are we done” bool from the first smear to produce either the mask from smear2 or all None. That is then used as the control to mask None into the loaded value; if there was a null in the first load then the pick2b will have an all-None control and will yield all None, whereas if there was no null then pick2b has the usual smear mask for control and works like in the talk.
The result is that the unrolled loop is twice as long plus two more ops. It could be fully pipelined for a throughput of 2x vector size per cycle. The pipeline startup time to first store would be the same as for the not-unrolled loop; load->eql->smear->pick->store. Unrolling needs five flow slots (2x(load, store), branch); three exu slots (2xeql, or); and three writer slots (3xpick). Four load/store function units means a pretty high-end Mill.
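If it helps to see the dataflow in conventional code, here is a rough scalar C model of one unrolled iteration. It treats None as a per-lane store-enable flag (C has no None metadata) and the vector width is just a placeholder; this is only a sketch of the masking logic, not the real Mill semantics.

    /* Rough scalar model of the 2x-unrolled strcpy body; None is modeled
       as a per-lane store-enable flag, VEC is a placeholder width. */
    #define VEC 8                                  /* member-dependent */

    static int copy_vec(const char *src, char *dst, int enable_all)
    {
        int done = 0;                              /* "are we done" from smear */
        for (int i = 0; i < VEC; i++) {
            int enable = enable_all && !done;      /* pick: None after the NUL */
            if (enable) dst[i] = src[i];           /* store skips None lanes   */
            if (src[i] == '\0') done = 1;          /* eql feeding smear        */
        }
        return done;
    }

    /* One iteration, two vectors' worth:
         int done1 = copy_vec(src,       dst,       1);         // load1..store1
         int done2 = copy_vec(src + VEC, dst + VEC, !done1);    // pick2a/pick2b
         if (done1 || done2) break;                             // the extra or
    */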
Some Fortran compilers can do this kind of unrolled vectorization, so I don’t see any reason why a Mill compiler could not find this code. Not high priority work though because of the limited utility in medium and smaller family members.
There’s no restriction to two results, although we have no ops with more than two except the function return op itself, which can completely fill the belt if it wants.
FUs that have ops that can return more than one result have an output result register for each result. These bump up the latency chain normally if there are the same number of registers at the next higher latency. If there are not (which is common) it’s up to the hardware guys whether to add dummy result registers at the higher latencies to bump into (simpler, but more space and more belt sources) or to bump the excess directly into the spiller buffer pool (more spiller sources). The software can’t tell the difference.
- in reply to: Prediction #546
Unfortunately, to fit in time and mind-boggle limits the talks have to gloss over quite a few important details. 🙁
It’s been mentioned that Mill EBBs can contain embedded calls (not permitted in the usual definition of EBB). That means that an EBB must have multiple predictions if it contains calls: one from entry to the first taken call (calls can be conditional), one from the return of that call to the next taken call, and so on until one from the last taken call to the taken return (returns can also be conditional). Each of these predictions needs a key for lookup in the prediction table. The first key is obviously the address of the EBB entry point; what should the rest be?
It’s tempting to use the “from” address of the taken call (where the call will return to) as the key for the next prediction. However, that doesn’t work. To begin with, there are two “from” addresses due to split-stream encoding, and the exu address may be ambiguous due to no-op elision, but both of these could be worked around. What is unworkable is that the return address cannot be deduced from the information in the prediction: the pred contains a line count and an instruction count, but not a byte count. Without a byte count we could not know from one prediction what the key of the next would be (if the key is the byte address), which means we cannot do prediction chaining while the code is not yet in memory, and that is the goal of the whole exercise.
Could we use a byte count in the preds instead of line/instruction counts? No, because the decoder needs to stage lines into the instruction shifter on a cycle-by-cycle basis. We don’t know how bytes translate into cycles without looking at the instruction headers, and by the time we have figured out from the remaining-byte count that we have finished the last instruction, it’s too late to get the next line into the shifter. And adding a byte count as well as line/inst counts would blow up the pred entry too much.
The solution is to use the entry address for the initial key, and for each later key to decrement the entry address by two. The initial key is known from the prediction that enters the EBB, and each subsequent key is known just by arithmetic on the entry key; consequently chaining is possible even with embedded calls (including multiple calls within a single instruction, which is possible on a Mill – more on that in the Execution talk on Wednesday).
I mentioned above that there are conditional calls, which need keys. If (say) we enter the EBB at 0x12345 and the first call is conditional and predicted taken then the next key (for use after the return) will be 0x12343, two less than the entry. If however the first call is predicted not taken and we expect to execute until the next call, then the return from that (and the next key to use in the chain) is 0x12341, i.e. the entry decremented by two, twice. When chaining, how do we know which next key to use? The answer comes from realizing that we need to decrement the key not just for taken calls but for untaken ones as well. To do that, each pred also contains a small field for untakenCallCount, the number of untaken calls stepped over by the prediction. With that, we can decrement the right number of times and get the correct next key for chaining.
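In conventional code the chaining arithmetic is just a decrement; a minimal sketch, with illustrative field names rather than the real pred layout:

    /* Sketch of next-key chaining; field names and widths are illustrative,
       not the hardware layout. */
    #include <stdint.h>

    typedef struct {
        uint16_t lineCount;          /* lines to stage                    */
        uint16_t instCount;          /* instructions to issue             */
        uint8_t  untakenCallCount;   /* untaken calls stepped over        */
        /* ... taken-transfer target, etc. ...                            */
    } Pred;

    /* Key for the prediction that follows the taken call ending this one:
       two bytes per call stepped over, taken or not.
       Example: enter at 0x12345, first call predicted taken -> 0x12343;
       first call predicted untaken, next one taken -> 0x12341.           */
    static uint64_t next_key(uint64_t key, const Pred *p)
    {
        return key - 2u * (p->untakenCallCount + 1u);
    }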
There’s a potential ambiguity in using decremented entry addresses as subsequent keys: what if enough decrements move the key pseudo-address into the adjacent EBB’s code, colliding with the entry address that EBB uses for its own prediction? Can’t happen: a decrement moves in the flow-code direction from the entry point (an increment would move in the exu direction). The call operation is a flow-side operation, and every possible call encoding occupies at least two bytes, so the call that requires another key occupies enough bytes to keep the two-byte-decremented pseudo-key within the flow side of the EBB’s memory range. Increment would not work, because EBBs with no exu side are possible and enough increments could move the key into ambiguity.
Lastly: why decrement by two; why not by one? Answer: each prediction (the “main” pred) may have an alternate pred that has the same key minus one. If we mispredict, we probe for the alternate key. There may not be an alternate; by definition alternates are used less than main predictions, and the alternate may never have been taken (and hence added to the table) or may have aged out, but if the transfer is at all loosey-goosey then there is a decent chance of an alternate. We chain normally from the alternate, and get prefetch etc started on that chain.
Alternates deal with the case in which a program usually stays on the main line, but sometimes goes off into something complicated. Once we know it’s been rerouted, we want to be able to get prediction running ahead of it as quickly as possible, without additional misses. We already had a miss; without the alternate prediction, there would be another miss as soon as we got to the end of wherever the first miss took us. With the alternate, we avoid that second miss, and in addition have gotten prefetch and fetch working early.
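In pseudo-C the miss-time use of the alternate looks roughly like this; table_lookup and chain_from are stand-ins for the predictor's real machinery, and Pred is the sketch above:

    /* Miss-time probe for the alternate prediction. Helper functions are
       stand-ins, not real interfaces. */
    const Pred *table_lookup(uint64_t key);   /* stand-in: probe the pred table */
    void        chain_from(const Pred *p);    /* stand-in: resume chaining      */

    static void on_mispredict(uint64_t key)
    {
        const Pred *alt = table_lookup(key - 1);  /* alternate lives at key - 1 */
        if (alt != NULL)
            chain_from(alt);   /* get prediction, prefetch and fetch ahead again */
        /* else: take a second miss when we reach the end of wherever the
           first miss led */
    }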
Obviously this alternate mechanism applies to loops, and in fact it was first developed as support for exactly the problem you identify. Since then we have added more loop support (NYF), but have kept the alternate machinery because it is valuable even outside the loop context.
Now you see why I didn’t try to include the details in the prediction talk 🙂
Speaking of which, anybody want to volunteer to write a white paper on the full prediction mechanism? You would get some glory (name on paper, will get published on our site and in Computer Architecture News) and some sweat-equity in the Mill/OOTBC.
- in reply to: Application Walkthrough #531
I don’t know this kind of application and your description is a bit cryptic, so this is a guess.
The two loads, subtract, and square are routine vector ops. By “sum” I guess you mean an add-reduction; if so, NYF. A scalar sqrt uses a hardware sqrt-approximation and a software in-line refinement, typically Newton-Raphson. The whole thing can be pipelined; NYF.
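For the scalar sqrt, a minimal sketch of the refinement step, where hw_rsqrt_estimate is only a placeholder for whatever approximation op the hardware provides:

    /* Newton-Raphson refinement of a reciprocal-sqrt estimate. */
    #include <math.h>

    static float hw_rsqrt_estimate(float x)
    {
        return 1.0f / sqrtf(x);                /* placeholder for the hw op    */
    }

    static float nr_sqrt(float x)
    {
        float y = hw_rsqrt_estimate(x);        /* rough 1/sqrt(x)              */
        y = y * (1.5f - 0.5f * x * y * y);     /* one Newton-Raphson step      */
        y = y * (1.5f - 0.5f * x * y * y);     /* and one more for precision   */
        return x * y;                          /* sqrt(x) = x * (1/sqrt(x))    */
    }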
Return takes the same belt-argument list that call (and a few other operations like conform and rescue) do. You can’t build it up piecemeal. The arguments don’t have to be adjacent on the belt, and the same belt position can be used for more than one argument. There’s no “nil” argument to return (or call etc.); you have to put a nil (whatever that is in the app) on the belt and then pass it as many times as you want.
Thread-local is possible. It is not in general possible for a callee to address the caller’s stack without action on the part of the caller, which would be more trouble than TLS.
Sounds like a null function would be in and out before a d$1 latency, arguing again for putting the loads after the call rather than in-flight over the call.
Varargs calls in the belt are a problem; by the time you have figured out how many arguments you have they may well have fallen off the belt. In C only the fixed arguments pass in the belt and the variable part is passed in the stack, but C VARARGS is rare enough that it doesn’t matter. Sounds like every call in Lisp is facing the issue, which would matter.
I think the in-memory solution, while it might be better than other CPUs, is fundamentally inapt for the Mill function model. Needs thought.
Returning a vector would certainly work, and the Nones are very Mill-ish. But smaller members have only 8-byte vectors. It would also cost some code to pack the vector and unpack it again, which might be worse than going via memory. The optimal might be to return one scalar and the vector; in the common case the vector would be all Nones, available as a popCon from rd() cheaply and ignored by the caller. In the uncommon case you’d pay for the pack/unpack. If exactly two results were common then return two scalars for that case and the caller can test whether it has a scalar or a vector second result.
However, because the code depends on the size of a vector, your Lisp becomes member-dependent, which we really want to avoid. 🙁
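For concreteness, a minimal C model of that scalar-plus-vector convention anyway, with None replaced by a per-lane validity mask since C has no None metadata; the names and the four-lane width are illustrative only.

    /* Model of the "one scalar plus a vector" return; None becomes a
       validity bit because C cannot express None. */
    typedef struct {
        long     primary;       /* the common single return value            */
        long     extra[4];      /* extra values; width is member-dependent   */
        unsigned extra_valid;   /* bit i set => extra[i] is real, else None  */
    } LispValues;

    static LispValues one_value(long v)
    {
        LispValues r = { .primary = v, .extra_valid = 0 };   /* all None */
        return r;
    }

    /* Common case: the caller reads r.primary and ignores r.extra, which on
       the Mill would be an all-None vector obtained cheaply via rd(). */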
p.s. after some more thought
The variable returns won’t work on the belt, and there are no shared registers or scratchpad. In fact, the only shared thing is memory. So what happens if the results (after the first) are communicated in memory?
If the shared-register approach works on a conventional machine, then sharing pseudo-registers at static addresses (w/r/t the program load image) should work too, albeit not as cleanly as something like a Forth double-ended stack. The cache line containing the pseudo-registers (PRs hereafter) will always be hot and in the top-level cache. If ~14 (32-bit?) registers are enough, then the 16 32-bit values in a line should also be enough, although a 64-bit Lisp might want two lines. If the reserved addresses were placed at the bottom of the dp space then the address offsets would always encode in one byte for the accessing load/store operations. As the l/s opcode and other info come to around 16 bits, an access would be 24 or so bits of entropy in the encoding, favorable compared to 32-bit RISC codes and probably comparable to x86 (I am not an x86 expert – how many bytes is a load/store if the offset is one byte?). However, the l/s ops occupy flow slots, retire stations, and d$1 bandwidth. This is not free, compared to returning results on the belt.
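A minimal sketch of the pseudo-register idea, modeling the reserved line as a static array; the symbol, layout, and calling convention here are illustrative, not a defined ABI:

    /* Pseudo-registers: one always-hot line of extra-result slots. */
    #include <stdint.h>

    static int32_t mill_pr[16];          /* stand-in for the reserved dp line */

    static int32_t callee(void)
    {
        mill_pr[0] = 2;                  /* number of extra results           */
        mill_pr[1] = 42;                 /* fire-and-forget stores            */
        mill_pr[2] = 7;
        return 1;                        /* primary result returns on the belt */
    }

    static void caller(void)
    {
        int32_t first = callee();
        int32_t n     = mill_pr[0];      /* pays a d$1 load latency here      */
        (void)first; (void)n;
        /* ... use first, plus mill_pr[1..n] when n > 0 ... */
    }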
The stores, inside the called function, are fire-and-forget. However, there is load latency to pick the values up from the PRs. The code doesn’t want to wait out a d$1 cycle before looking at the call result, although I suppose it could overlap that wait with testing for the presence of the extra results, and probably with some use of the prime result.
However, there’s another way: issue the load of the result before making the call, with the retire delay set so the load will retire in the instruction after the call. That load can be hoisted arbitrarily high, but if Lisp overhead guarantees that any call will last longer than d$1 then hoisting is not necessary.
Unfortunately, this doesn’t buy anything when there actually is a new result to load. The load will allocate a retire station and go to cache for the value and then wait out the call. At the end of the call the store will be detected by the retire station, which will then re-issue the load request. But as the load is supposed to retire at once, and the second request to the d$1 is not instantaneous, the retire station must stall the machine until it gets the just-stored data, which arrives at the same time it would have had the load been issued after the call rather than before it 🙁
All in all, with the present Mill definition, it looks like extra results would pass in PRs in memory, and not be available until d$1 latency after the call. This works, but is unattractive.
I’ll keep the issue in mind and will report if illumination strikes.
Thanks for the explanation. As currently defined, the Mill doesn’t handle this use-case well. Needs thought.
Your suggestion of a few user-accessible registers does not work well, because then there is the problem of what happens if more than one operation writes them, and write->read latencies, and so on; all the hazards and everything else about general registers that the belt avoids. The last thing we want to do is to re-introduce renaming into a Mill just to support Lisp 🙂
The existing specRegs don’t have this problem, because they are all read-only w/r/t the application. Yes, they change, but only as a side effect of operations that the hardware knows about, like call and return.
Values swapped to the spiller effectively cost nothing unless you swamp the spiller’s bandwidth; that’s easy to do in a test case (a function with a very large number of arguments that recurses as soon as called) but is unseen in practice. Scratchpad is not free – the spill and fill operations cost slots and entropy, and the size of scratchpad has a hardware maximum.
We have considered having a set of globals, or something like a global scratchpad. The problem is the save/restore cost in turf switch, which can be very frequent on a Mill. We decided to defer the idea until we had hard simulation numbers showing what the cost of not having them was.
You are well on the way to inventing capabilities 🙂 https://en.wikipedia.org/wiki/Capability-based_addressing
I truly wish that the Mill were a cap(ability) architecture. One of the Mill team is Norm Hardy, who did the KeyKos cap operating system, of which Eros and others are descendants. The first compiler I ever did was for the Burroughs B6500, which was a cap machine before caps were invented. One of the things that Andy Glew and I are in passionate agreement about is that computing needs caps.
That said: I know how to build a cap machine, but I don’t know how to sell one. Caps are fundamentally incompatible with C, because caps are secure and C is not. Cap machines like the AS/400 let you use C by giving you a sandbox that is one whole cap, in which you can be as insecure as you want within your play PDP-11. Oh well.
And fair warning: language design is an insidious vice – you have a slippery slope ahead. Hear the words of someone whose name is in the Algol68 Revised Report 🙂