Forum Replies Created
- in reply to: Pipelining #1953
Intel seems to be up to something similar:
AVX512 based Software Pipelining method
- in reply to: Pipelining #1952
In case you haven’t found this yet:
Intel proposes a Software Pipelining mechanism very similar to the one described here for its AVX512 extension (AVX512 SPL).
- in reply to: Pipelining #1544
I have tinkered with the Software Pipelining approach for the last few months and I think it’s a good idea to do it the Mill way instead of using the usual modulo technique.
I have a few open questions though and I’ll start with the most high-level one:
Which part of the SWP transformation happens in the compiler vs. in the specializer?
I’m currently thinking of these transformation steps:
* determine if SWP is applicable to this loop
* determine steady state
* prime values with NaRs
* add SPILL/FILL for long term dependencies
* add computation of condition vector
* add PICKs
* … whatever I forgot
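To make the “prime values with NaRs” and “add PICKs” steps concrete for myself, here’s the toy model I have in my head, in plain Python; nothing here is Mill-specific and all names are invented. The steady-state body runs from the very first cycle, the stage registers start out holding NaRs, and a pick-like guard drops results that no real iteration has produced yet.

```python
# Toy model of NaR priming and picking in a software-pipelined loop.
# All of this is illustrative only, not Mill semantics.
NAR = object()  # stand-in for a Not-a-Result

def pipelined_double(data, stages=3):
    regs = [NAR] * stages          # pipeline stage registers, primed with NaRs
    out = []
    for cycle in range(len(data) + stages):
        # "pick": retire the last stage only if it carries a real value
        if regs[-1] is not NAR:
            out.append(regs[-1])
        # advance the pipeline by one cycle
        regs[1:] = regs[:-1]
        # feed the next element, or a NaR once the input is drained
        regs[0] = data[cycle] * 2 if cycle < len(data) else NAR
    return out

print(pipelined_double([1, 2, 3, 4]))   # [2, 4, 6, 8]
```

The point of the sketch is that there is no separate prologue or epilogue: the same steady-state code runs every cycle, and the NaRs plus the pick take care of the ramp-up and drain.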
Since you’re mentioning 100MHz:
I was recently pondering the forwarding network together with a colleague.
It seemed to us that you face exactly the same issues as anybody else implementing a forwarding network with a large number of FUs.
Did you do a sniff check on the wiring issues you’ll get with the larger members, especially GOLD?
On a related note, I’m curious whether I’m right that the size of the belt scales linearly with the number of FUs. IOW, is the number of belt entries per FU roughly constant across members?
I hope none of these answers falls into the NYF department.
Ah, I remember some related work dating back to the year 2000.
Bernd Paysan developed a 4stack processor back then and discussed it on comp.arch.
- in reply to: MIPS/sqrt(W*$) as a better metric #1159
In my opinion a very good solution is to express Power in terms of Cost and then optimize MIPS/total_cost.
Cost would then include the power of the chip, the cooling for the chip… even the floorspace cost if you think about it.
You can only put so many Watts on a square meter of floor space.
There is even application-specific cost, since the administrative effort of using more cores or chips can differ per application.
It might also make sense to add maintenance cost, since the MTBF differs too if you put 80 fast chips in a rack vs. 800 slow ones.
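Roughly what I have in mind, with every figure invented just to show the shape of the comparison (a hypothetical total_cost that folds purchase price, power, cooling and a crude rack/floorspace charge into dollars):

```python
# Toy comparison of MIPS per total cost; every number here is made up.
def total_cost(chips, price, watts, years=3, usd_per_kwh=0.15,
               cooling_overhead=0.4, rack_cost_per_chip=50.0):
    energy_kwh = chips * watts * 24 * 365 * years / 1000
    power_cost = energy_kwh * usd_per_kwh * (1 + cooling_overhead)  # chip + cooling
    # purchase + power/cooling + a crude stand-in for rack/floorspace cost
    return chips * price + power_cost + chips * rack_cost_per_chip

def mips_per_dollar(chips, mips_each, **cost_args):
    return chips * mips_each / total_cost(chips, **cost_args)

# 80 fast, hot chips vs. 800 slow, cool ones (invented figures)
fast = mips_per_dollar(80,  mips_each=50_000, price=2_000, watts=150)
slow = mips_per_dollar(800, mips_each=6_000,  price=150,   watts=10)
print(f"fast: {fast:.1f} MIPS/$, slow: {slow:.1f} MIPS/$")
```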
- in reply to: Pipelining #1550
Thank you for your very detailed answer.
You flatter me with your offer to contribute on the LLVM side, but I’m really not that deep into compilers; I’m a hardware designer with just a basic course in compilers back at university.
During the last few weeks I was peeking into the LLVM optimization stages to find out what is there already and what’s missing from my POV. I have a completely different microarchitecture in mind, but it shares some of the problems of producing optimal code with Mill (in fact only the loop portion; I handle open code differently).
As to your comments:
I understand that most SWP code generation steps must happen in the specializer. Here’s a bunch of comments on your answer:
Why do you need the heuristic? If you compare the latency of code scheduled following the dataflow with the latency the SWP schedule gives you, you can select the better option every time. Is it to save compile time?
If I were to generate the packed schedule, I’d formulate it as a MIP (Mixed Integer Program) and let the solver do the hard part for me. IMO, using a greedy algorithm at the prototype stage is fine, but it has to be replaced by an optimized algorithm in most cases anyway.
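Just to illustrate the shape of such a formulation, here is a minimal sketch assuming PuLP as the solver front end. The toy dependence graph, latencies and issue-slot count are invented, and it deliberately ignores the initiation interval and loop-carried dependences; it only shows how dependence and FU constraints get handed off to the solver.

```python
# Minimal sketch of formulating a packed schedule as a MIP with PuLP.
# The dependence graph, latencies and slot count below are placeholders.
from pulp import LpProblem, LpMinimize, LpVariable, LpBinary, LpInteger, lpSum

# op -> (latency, predecessors) for one loop iteration's dataflow
ops = {
    "load_a": (3, []),
    "load_b": (3, []),
    "mul":    (2, ["load_a", "load_b"]),
    "add":    (1, ["mul"]),
    "store":  (1, ["add"]),
}
MAX_CYCLES = 16   # scheduling horizon
SLOTS = 2         # ops that can issue per cycle (stand-in for FU limits)

prob = LpProblem("packed_schedule", LpMinimize)

# x[op][t] == 1  <=>  op issues in cycle t
x = {op: [LpVariable(f"x_{op}_{t}", cat=LpBinary) for t in range(MAX_CYCLES)]
     for op in ops}
makespan = LpVariable("makespan", lowBound=0, cat=LpInteger)
prob += makespan                                    # objective: shortest schedule

for op, (lat, preds) in ops.items():
    prob += lpSum(x[op]) == 1                       # each op issues exactly once
    start = lpSum(t * x[op][t] for t in range(MAX_CYCLES))
    prob += makespan >= start + lat                 # makespan covers completion
    for p in preds:                                 # respect dataflow latencies
        p_start = lpSum(t * x[p][t] for t in range(MAX_CYCLES))
        prob += start >= p_start + ops[p][0]

for t in range(MAX_CYCLES):                         # per-cycle issue-width limit
    prob += lpSum(x[op][t] for op in ops) <= SLOTS

prob.solve()
for op in ops:
    cycle = next(t for t in range(MAX_CYCLES) if x[op][t].value() > 0.5)
    print(op, "-> cycle", cycle)
```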
In your algorithm to determine spills and fills you only guarantee a feasible schedule in the degenerate case. If there are too many inter-loop dependencies, it might get spiller-bandwidth limited. In these cases it might be worthwhile to additionally use the caches on top of the scratchpad. Of course this introduces the well-known problems with caches.
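A back-of-the-envelope version of that bandwidth argument, with invented numbers: each long-lived loop-carried value costs one spill and one fill per iteration, so the achievable initiation interval can’t drop below what the scratchpad ports allow.

```python
# Back-of-the-envelope check of when the spiller becomes the bottleneck;
# the numbers are invented just to show the calculation.
import math

carried_values = 6   # loop-carried values that have to go via the scratchpad
scratch_ports  = 4   # spill/fill operations possible per cycle
compute_ii     = 2   # initiation interval dictated by the FUs alone

spill_ii = math.ceil(2 * carried_values / scratch_ports)  # one spill + one fill each
print("II limited by compute:", compute_ii)
print("II limited by spiller:", spill_ii)  # 3 here, so the spiller dominates
```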
The None-insertion matches what I figured very closely. I was thinking about using a proof-engine to check reachability before inserting the recurs statements.
- in reply to: Wafer feature efficiency #1160
<fun>
>Decentralization will bring the cloud down to earth
Fog Computing: As nebulous as Cloud Computing but closer to the customer
</fun>