Forum Replies Created

Viewing 3 posts - 1 through 3 (of 3 total)
  • peceed
    Participant
    Post count: 3

    nihil novi
    I had exactly the same idea; my obvious case was quicksort, which switches to a more regular quadratic algorithm for short arrays.
    That way I rediscovered Loop Termination Prediction, the Count Register, the Counted Loop Branch, etc.
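    The hybrid the post refers to can be sketched as follows; this is a minimal illustration, and the cutoff of 16 elements is an arbitrary assumption, not a tuned value:

    ```c
    #include <stdio.h>

    /* Hybrid quicksort: partitions below CUTOFF elements are handed to
     * insertion sort, whose short counted inner loops are exactly the kind
     * the loop-prediction discussion is about. CUTOFF = 16 is illustrative. */
    #define CUTOFF 16

    static void insertion_sort(int *a, int lo, int hi) {
        for (int i = lo + 1; i <= hi; i++) {
            int key = a[i], j = i - 1;
            while (j >= lo && a[j] > key) {
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = key;
        }
    }

    static void quicksort(int *a, int lo, int hi) {
        if (hi - lo + 1 <= CUTOFF) {          /* short array: go quadratic */
            insertion_sort(a, lo, hi);
            return;
        }
        int pivot = a[lo + (hi - lo) / 2];
        int i = lo, j = hi;
        while (i <= j) {                      /* Hoare partition */
            while (a[i] < pivot) i++;
            while (a[j] > pivot) j--;
            if (i <= j) {
                int t = a[i]; a[i] = a[j]; a[j] = t;
                i++; j--;
            }
        }
        quicksort(a, lo, j);
        quicksort(a, i, hi);
    }

    int main(void) {
        int a[40];
        for (int i = 0; i < 40; i++)
            a[i] = (i * 37) % 41;             /* pseudo-shuffled input */
        quicksort(a, 0, 39);
        for (int i = 1; i < 40; i++)
            if (a[i - 1] > a[i]) { puts("NOT SORTED"); return 1; }
        puts("sorted");
        return 0;
    }
    ```

    Once the recursion bottoms out in insertion sort, most of the dynamic branch traffic comes from those short, fixed-trip-count inner loops.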

    Prediction offers little value in lengthy loops

    Yes, but there is a surprisingly high number of relatively short loops.
    And don’t forget that vectorization can make loops shorter by up to a factor of 32!
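    A back-of-the-envelope check of that factor-of-32 figure, assuming a 32-byte vector register over 1-byte elements (the best case the claim implies):

    ```c
    #include <stdio.h>

    int main(void) {
        /* Assumption for illustration: 32-byte vectors, 1-byte elements. */
        const int n = 1000;           /* elements to process              */
        const int vec_bytes = 32;     /* vector width in bytes            */
        const int elem_bytes = 1;     /* byte elements: the best case     */
        int lanes = vec_bytes / elem_bytes;

        int scalar_trips = n;                        /* 1 element per trip   */
        int vector_trips = (n + lanes - 1) / lanes;  /* 'lanes' per trip     */

        printf("scalar: %d iterations\n", scalar_trips);
        printf("vector: %d iterations\n", vector_trips);
        return 0;
    }
    ```

    A 1000-trip scalar loop becomes a 32-trip vector loop, i.e. exactly the kind of "relatively short loop" where exit prediction starts to matter.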

    When we execute the branch and discover that the prediction was wrong, or if we take an exit that wasn’t predicted, the hardware has already issued two cycles down the wrong path. Miss recovery has to unwind that, fetch the right target, and decode it, which takes 5 cycles if the target is in cache.

    Five cycles on the Mill are like 15 cycles on a conventional processor – quite a big loss!

    In essence you propose a deferred branch whose transfer is detached from the evaluation of the predicate, similar to the way Mill separates load issue from load retire.

    IIRC the Elbrus E2k uses this technique: it has Prepare Branch and Execute Branch instructions.

    One is the time required to reset the fetch. If the target is not in the I0 microcache then reset would take roughly as long as mispredict recovery, i.e. five cycles in our test configs.

    It looks like you should split speculative fetch from speculative code prefetch.

    How often can we eval a predicate five cycles before the transfer? Not often I’d guess, but I have no numbers.

    But you can start resetting the fetch earlier! Prepare Branch and the instructions that follow it before Execute Branch are still valid…

    A semantic issue is how the DB interacts with other branches. Suppose a regular branch is executed between DB issue/eval and transfer? Who wins?

    It doesn’t matter as long as it is consistent! We could treat it the same way as delay slots, so the most “powerful” answer is: both of them. Unfortunately, I suppose that won’t work with the Belt.

    • This reply was modified 1 year, 3 months ago by  peceed.
    • peceed
      Participant
      Post count: 3

      One is the time required to reset the fetch. If the target is not in the I0 microcache then reset would take roughly as long as mispredict recovery, i.e. five cycles in our test configs. Even an I0 hit would likely need three cycles to reset.

      It is hard to believe that a reset at such an early stage of instruction execution (no register/belt/memory operations yet) is anything other than a stall of the subsequent “correct” stages. So there should be no lower limit on its duration when the information is available earlier: just issue the correct instruction and wait for its “execution front”.

  • peceed
    Participant
    Post count: 3
    in reply to: news? #3687

    There are tuning issues in the smaller members too, Tin and Copper. There the issue is belt size. Even an 8-position belt is enough for the tests’ transient data, but the codes also have long-lived data, and the working sets of that data don’t fit on the smaller belts, and some not even on the 16-position belt of a Silver. As a result the working sets get spilled to scratch and the core essentially runs out of scratch, with tons of fill ops; much like codes on the original 8086 or DG Nova. This is especially noticeable on FP codes like the numerics library, for which the working set is full of big FP constants. Working out of scratch doesn’t impact small-member performance much, but it has scratch bandwidth consequences, and power consequences too that we can only guess at. We may need to config the smaller members with bigger belts, but that too has consequences. In other words, the usual tuning tradeoffs.

    I think you could use a small register set that is a logical extension of the belt, addressed by an additional bit in the argument address.
    The encoding cost is acceptable, and it solves the problem of “frequently used arguments”.
    It can be entropy-optimized by restricting register arguments to one per operation, or by limiting the number of functional units that can take register arguments. Smaller members can use more bits for the register specifier.
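    The proposed encoding can be sketched like this; all field widths here are assumptions for illustration (a 5-bit operand specifier, a 16-position belt, and a 16-entry register file), not actual Mill encoding:

    ```c
    #include <assert.h>
    #include <stdio.h>

    /* Sketch of the proposed operand encoding: the top bit of a 5-bit
     * operand specifier selects between a belt position and a register
     * in a small file that logically extends the belt. */

    #define REG_BIT  0x10u  /* bit 4: 0 = belt position, 1 = register */
    #define IDX_MASK 0x0Fu  /* bits 0-3: position/register number     */

    static unsigned encode(int is_reg, unsigned idx) {
        assert(idx <= IDX_MASK);
        return (is_reg ? REG_BIT : 0u) | idx;
    }

    static void decode(unsigned spec, int *is_reg, unsigned *idx) {
        *is_reg = (spec & REG_BIT) != 0;
        *idx    = spec & IDX_MASK;
    }

    int main(void) {
        int is_reg; unsigned idx;

        decode(encode(0, 7), &is_reg, &idx);   /* belt position 7 */
        printf("%s %u\n", is_reg ? "reg" : "belt", idx);

        decode(encode(1, 3), &is_reg, &idx);   /* register 3 */
        printf("%s %u\n", is_reg ? "reg" : "belt", idx);
        return 0;
    }
    ```

    With the extra bit the specifier grows from 4 to 5 bits, which is the "acceptable encoding cost" referred to above; restricting how many operands per operation may set REG_BIT is one way to keep the entropy down.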

    The C++ library is coming up because we are doing the OS kernel in C++

    I am under the strong impression that you are trying to innovate too much at once.
    Your initial goal should be a “software stack accelerator”: a processor that needs minimal OS modifications and is fully compatible with existing applications (Linux/Java/Android).
    Forget the single address space: it doesn’t save a lot of power (the TLB uses ~15% IIRC), but it is the biggest blocker to quick adoption. You could easily make it optional.
    You can win the market by offering “only” double the performance-to-power and performance-to-cost ratios, as long as you are software-compatible/sane. “Datacenters and smartphones” are sensitive enough to a 2-3x power advantage, but they are not able to rewrite their software!
    Time is running out – the volume of computation is moving into the visual/AI domain.
