Forum Replies Created
- AuthorPosts
> the compiler is responsible for coming up with bundles of operations which can be executed concurrently.
Correct.
> How do you know the compiler can always come up with bundles which have 30 parallel operations?
"Always" is a strong word; we obviously can’t. The conventional wisdom was that there’s only an ILP of 2 or so in open code. This is not true. Our Execution talk describes phasing, which is one of the ways we improve on this.
- in reply to: Potential vector width weakness? #724
You answer your own question very well :) I think your reasoning closely approximates that of the ootb team.
> Or maybe you all have thought of some clever means of partitioning your belt network by depth that isn’t obvious to me (and so is probably NYF), or I’m wrong to think that all inputs and outputs have to have the same size?
Well, holistically the Mill is an ABI at the load module level. But it’s very much grounded in a hardware architecture, of course. Yet how the belt is implemented is always described in the talks as an implementation detail…
- in reply to: mill + parallella #715
Do you have any insights into how to do this or how you imagine it working?
- in reply to: Site-related issues (problems, suggestions) #583
Can mods (like myself) host images on ootbcomp? WP normally has add-media dialogs, but I cannot find them for ootbcomp.com.
- in reply to: Introduction to the Mill CPU Programming Model #835
The multiply op belongs in the execute phase, so it issues in the second cycle of the instruction.
The number of cycles it takes is member dependent, and operand width dependent, and dependent on the type of multiply (integer, fixed point, floating point, etc). Multiplying bytes is quicker than multiplying longs, and so on. But the specializer knows the latencies and schedules appropriately.
Let’s imagine it takes 3 cycles, which includes the issue cycle. The instruction issues on cycle N, but the multiply operation issues on cycle N+1 and retires (puts the results on the belt) before cycle N+4.
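The arithmetic in that example can be checked with a small sketch. This is only a model of the timing described above, not a Mill tool; the function name `multiply_timing` and its parameters are made up for illustration, and the 3-cycle latency is the imagined figure from the example.

```python
def multiply_timing(instruction_cycle, latency=3):
    """Return (op_issue_cycle, first_cycle_with_result_on_belt).

    Assumes the timing described in the post: an execute-phase op
    issues one cycle after its instruction, and the latency figure
    includes the op's own issue cycle.
    """
    # Execute-phase ops issue in the second cycle of the instruction.
    op_issue = instruction_cycle + 1
    # The op occupies `latency` cycles starting at op_issue, so the
    # result is on the belt at the start of the following cycle.
    result_on_belt = op_issue + latency
    return op_issue, result_on_belt


# Instruction issues on cycle N = 10: the multiply issues on N+1 = 11
# and its result is on the belt by N+4 = 14.
print(multiply_timing(10))  # (11, 14)
```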
The CPU likely has many pipelines that can do multiplication, as it’s a common enough thing to want to do. The Gold, for example, has eight pipelines that can do integer multiplication and four that can do floating point (four of the integer pipelines are the same pipelines as the four that can do floating point).
So on the Gold, you can have eight multiply ops in the same instruction, and they all execute in parallel. Furthermore, even if a pipeline is still executing an op issued on a previous cycle, it can be issued an op on this cycle. And each multiply can be SIMD, meaning that taken altogether the Mill is massively MIMD and you can be multiplying together a staggeringly large number of values at any one time, if that’s what your problem needs.
- in reply to: Site-related issues (problems, suggestions) #815
Do the 4th and 5th level replies have the same indent in your browser?
We seem to quickly get more than 5 deep in a thread and then it becomes really hard to follow. I just answered a couple of questions only to later work out that Ivan had answered them in his usual depth and detail.
Yes, you can use the stack to pass arguments between frames.
You may also use the heap. And how you do it depends a lot on whether you make a distinction between objects and primitives, and how you do garbage collection.
Normal calls within a compilation unit are very conventional and all the normal approaches work; it’s only if you want to take advantage of the Mill protection mechanism, or if it’s convenient for dynamic linking, that you use portals.
Answering particular parts of your question where I can:
> How much of the selection of the current phasing setup was a result of the way the hardware works, and how much profiling actual code to see what would be useful in practice?
Well, definitely informed by code analysis. There’s an upcoming talk on Configuration, and it’s been described in the talks and in the Hackaday interview.
A central tenet of the Mill is that people building products can prototype custom processor specifications in the simulator very quickly and choose a configuration with the best power performance tradeoff informed by benchmarking representative code.
> With regard to calls, how many cycles do they normally take in overhead?
One. Calls and branches (that are correctly predicted) transfer already in the next cycle. In the case of mis-predictions where the destination is in the instruction cache, again the Mill is unusually fast; it has a penalty of just five or so cycles.
Additionally, there are none of the conventional pre- and post-ambles.
- in reply to: mill + parallella #718
Yes, I understand you.
We do want to produce an early “make” model for enthusiasts, as there are lots of enthusiasts who have asked for Mill dev boards. The funding mechanics, priority and timescales of this are still being discussed.
> Does the mill support (arbitrary) vector element swizzling?
Yes. There is a shuffle op for arbitrary rearrangements of a vector.
> I’m just wondering if the same functionality that enables free pick might also allow free swizzles.
I believe it’s in the op phase.
> I could see how it might be machine dependent due to different vector sizes.
Well, you can always use the remainder op to create a mask that you then pick against with 0s or Nones to create a partially filled vector? This was covered in the strcpy example in the Metadata talk and the Introduction to the Mill CPU Programming Model post.
- in reply to: Introduction to the Mill CPU Programming Model #615
I was careful to say wider belt, meaning more elements in a vector, rather than longer belt, because I imagine a longer belt has diminishing returns and stresses the instruction cache and so on.
The key thing is that it is straightforward to simulate variations and evaluate them on representative target code. I’m sure that the current configurations haven’t been plucked from thin air, but rather represent what is considered the most advantageous mix for the first cut.
I do want a Platinum Mill on my desktop and to hell with cooling! When we have a monster for gaming rigs, compiler rigs and for the fun of it, then we can dream of an Unobtainium Mill.
- in reply to: Introduction to the Mill CPU Programming Model #611
Sorry, I could have been over-enthusiastic. I can imagine a Mill with more L2 and L3, wider belt, more FP, higher clock rate and so on :)
But yes, the Mill is high end compared to today’s CPU cores, perhaps? :)
I will fix the post when I am next on a laptop.
- This reply was modified 10 years, 10 months ago by Art. Reason: correct formatting to intended
- in reply to: Instruction Encoding #578
Well, it would be staggeringly unlikely to be a meaningful program because the two streams diverge in memory from the entry point; entering at another address, e.g. EBB+n, would mean that the bit stream fragment between EBB…EBB+n, which belongs to one side, would also have to be valid instructions for the other side when decoded backwards…
Of course, trying to generate two legal, overlapping EBBs with this property may be a fun exercise for determined readers :)
Add-reductions keep coming up in my mind when doing 3D (e.g. as game and graphics engines will be doing buckets of). In 3D graphics there are lots of vectors which are 3 or 4 long.
I imagine that, whilst belt vectors are powers-of-2 in length, you can load a non-power-of-2 vector, and the load automatically pads it with Nones? So if you load(addr,float32,3) you actually get x, y, z, None.
And you’d want an add reduction to treat Nones as 0 rather than propagate them.
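To make the behaviour I am imagining concrete, here is a rough sketch in Python. It is only a model of the idea, not how the Mill actually loads or reduces vectors: `load_padded` and `add_reduce` are hypothetical helper names, a Python list stands in for memory, and Python's `None` stands in for the Mill None metadata.

```python
def load_padded(memory, addr, count):
    """Load `count` elements and pad with None to the next power of two.

    Hypothetical model of the imagined load(addr, float32, 3).
    """
    width = 1
    while width < count:
        width *= 2
    values = list(memory[addr:addr + count])
    return values + [None] * (width - count)


def add_reduce(vector):
    """Sum the lanes, treating None lanes as 0 instead of propagating them."""
    return sum(0 if lane is None else lane for lane in vector)


memory = [1.0, 2.0, 3.0, 99.0]   # 99.0 must not leak into the vector
vec = load_padded(memory, 0, 3)
print(vec)              # [1.0, 2.0, 3.0, None]
print(add_reduce(vec))  # 6.0
```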
The shuffle sounds useful for computing cross-product.
Generally in games/graphics you want sqrt, inverse sqrt and dot product. You also likely want to sum a vector again when you do matrix multiplication.
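As a sketch of why a shuffle helps here: a cross product can be written as two element-wise multiplies of rotated copies of the inputs followed by a subtract, and a dot product is an element-wise multiply followed by an add reduction. The Python below only illustrates that arithmetic; the `shuffle` function is a stand-in with an assumed index-list interface, not the Mill op's real signature.

```python
def shuffle(vector, indices):
    """Rearrange lanes: result[i] = vector[indices[i]] (assumed interface)."""
    return [vector[i] for i in indices]


def cross(a, b):
    """3D cross product as a.yzx * b.zxy - a.zxy * b.yzx."""
    yzx = [1, 2, 0]
    zxy = [2, 0, 1]
    left = [p * q for p, q in zip(shuffle(a, yzx), shuffle(b, zxy))]
    right = [p * q for p, q in zip(shuffle(a, zxy), shuffle(b, yzx))]
    return [l - r for l, r in zip(left, right)]


def dot(a, b):
    """Dot product: element-wise multiply, then an add reduction."""
    return sum(p * q for p, q in zip(a, b))


print(cross([1, 0, 0], [0, 1, 0]))  # [0, 0, 1]
print(dot([1, 2, 3], [4, 5, 6]))    # 32
```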
My thinking would be that in the Mill IR sqrt, inverse sqrt, sum reduction, fork/exec/vfork, memcpy and memmove etc are built-in functions, and the specialiser on each target turns that into single or multiple operations as the target supports. So that’s like microcode (or standard function inlining), but in the specialising compiler rather than in the outer compiler or on-CPU. It would be a hassle for a specialiser to have to unravel some old IR that is coding its own sqrt loop using a lower-level operation if there is ever hardware with better built-in sqrt, for example?
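A toy sketch of what I mean, in Python. Everything here is invented for illustration: the op names, the `expansions` table and the `specialise` function are hypothetical and say nothing about how the real specialiser or Mill IR works. The point is only that the IR names the built-in, and each target either keeps it as one op or expands it.

```python
def specialise(ir_ops, native_ops):
    """Lower built-in IR ops to what a (hypothetical) target supports."""
    # Made-up software expansions for targets lacking a native op.
    expansions = {
        "rsqrt": ["sqrt", "recip"],
        "sum_reduce": ["shuffle", "add", "shuffle", "add"],
    }
    lowered = []
    for op in ir_ops:
        if op in native_ops or op not in expansions:
            # Target has it in hardware (or nothing to expand): keep as is.
            lowered.append(op)
        else:
            # Expand, then lower the expansion for the same target.
            lowered.extend(specialise(expansions[op], native_ops))
    return lowered


ir = ["mul", "rsqrt"]
print(specialise(ir, {"mul", "sqrt", "recip"}))  # ['mul', 'sqrt', 'recip']
print(specialise(ir, {"mul", "rsqrt"}))          # ['mul', 'rsqrt']
```

The same IR runs on both targets, and the one with a hardware inverse sqrt gets to use it without anyone having to recognise a hand-coded loop.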
And as for hazards, we all want to avoid them, but pragmatically if it’s the specialiser that has to know about them and it has to know about one of them, it might as well open the floodgates and have a few more :)