Forum Replies Created
- in reply to: Transistor counts? #3697
Algorithmic information is the length of the shortest program that outputs a given set of data. Searching for that algorithmic information requires something equivalent to a universal Turing machine on which to run candidate programs; there is no escaping the universality requirement for algorithmic information search. Moreover, scientific induction can’t do better than algorithmic information, which is why it’s the proper basis for ML.
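For readers who want the formal statement: this is just the standard Kolmogorov complexity of the data relative to a universal Turing machine U, and the invariance theorem is what makes the universality requirement inescapable (textbook definitions, included here only for concreteness).

```latex
% Algorithmic information (Kolmogorov complexity) of data x,
% relative to a universal Turing machine U:
K_U(x) = \min \{\, |p| \;:\; U(p) = x \,\}

% Invariance theorem: for any other universal machine V there is a
% constant c_{U,V}, independent of x, such that
K_U(x) \le K_V(x) + c_{U,V}
% so the choice of universal machine changes K only by a constant.
```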
Energy constrains the economics of proof-of-work.
- in reply to: Transistor counts? #3695
Choice of mining hardware depends on the choice of proof-of-work algorithm. As the guy who came up with the idea for The Hutter Prize for Lossless Compression of Human Knowledge clear back in 2005, I’m rather distressed at the direction taken by the machine learning industry which elides the algorithmic information theoretic foundations. I convinced the mining chip executive that the ML industry’s foundation in matrix multiplication needs to be replaced with algorithmic information search. Automacoin is somewhat related to the direction I’d like to see the mining industry take.
- in reply to: Transistor counts? #3693
My motivation for asking is that I just returned from Bitcoin 2021 in Miami, where I met up with an executive at a mining chip manufacturer. Elon Musk’s May 12 tweet about exorbitant cryptocurrency energy usage caused a quarter-trillion-dollar loss of market capitalization to Bitcoin alone. I wanted to explore alternative proof-of-work algorithms based on the Mill architecture in an IRAM chip featuring a large number of cores, but had no idea how to estimate how much real estate it would take to achieve a given level of performance.
Using real estate for more cores in preference to threading, a consequence of the Mill’s other architectural features, brings to mind a question about on-chip memory architecture that, while of no immediate consequence to the Mill chip, might affect future trade-offs in real estate use.
With 14nm and higher-density technologies coming online, there is a point where it makes sense to prefer on-chip shared memory to some other uses of real estate. This raises the problem of increasing latency to on-chip memory, not only with the size of the memory but with the number of cores sharing it. In particular, it seems that with an increasing number of cores, a critical problem is reducing arbitration time among cores for shared, interleaved, on-chip memory banks. In this case, interleaving isn’t to allow a given core to issue a large number of memory accesses in rapid succession despite high latency; it is to service a large number of cores, all with low latency.
Toward that end I came up with a circuit that does the arbitration in analog. If it works when scaled down to 14nm and GHz frequencies, it might result in an architectural preference for an on-chip cross-bar switch between interleaved low-latency memory banks and a relatively large number of cores.
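The analog arbiter itself isn’t something I can sketch in a forum post, but here is a toy digital model, in C, of the contention it has to resolve: addresses are low-order interleaved across banks and we count how often two or more cores collide on the same bank in the same cycle. Every parameter (NUM_CORES, NUM_BANKS, the uniform-random access pattern) is an assumption chosen purely for illustration.

```c
/* Toy model of bank conflicts among cores sharing interleaved on-chip
 * memory banks.  Purely illustrative: the parameters and the
 * uniform-random access pattern are assumptions, not measurements. */
#include <stdio.h>
#include <stdlib.h>

#define NUM_CORES 64
#define NUM_BANKS 256   /* low-order interleaved banks */
#define CYCLES    100000

int main(void) {
    long conflicts = 0, accesses = 0;
    int claimed[NUM_BANKS];

    srand(1);
    for (long cyc = 0; cyc < CYCLES; ++cyc) {
        for (int b = 0; b < NUM_BANKS; ++b)
            claimed[b] = 0;
        for (int core = 0; core < NUM_CORES; ++core) {
            unsigned addr = (unsigned)rand();  /* one access per core per cycle */
            int bank = addr % NUM_BANKS;       /* bank = address mod NUM_BANKS  */
            if (claimed[bank])
                ++conflicts;                   /* loses arbitration: must stall */
            else
                claimed[bank] = 1;             /* first requester wins the bank */
            ++accesses;
        }
    }
    printf("conflict rate: %.1f%% of accesses\n",
           100.0 * (double)conflicts / (double)accesses);
    return 0;
}
```

Even with banks outnumbering cores 4:1, a uniform random pattern still collides on something like a tenth of accesses in this toy model, which is exactly why the speed of the arbitration itself matters.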
This problem spans disciplines, a difficulty well known to those involved with the Mill architecture, which itself spans software and computer architecture; here the span is between computer architecture and analog electronics instead.
I’d appreciate any feedback from folks who understand the difficulty of cross-disciplinary technology and have a better grasp of the issues than I do.
Ivan Godard writes:
care to tell what you’d like to have?
See section 4.1 FPGA Implementation of “Two Sparsities Are Better Than One: Unlocking the Performance Benefits of Sparse-Sparse Networks” by Kevin Hunter et al of Numenta, Redwood City.
From the abstract:
Using Complementary Sparsity, we show up to 100X improvement in throughput and energy efficiency performing inference on FPGAs.
There are a couple of things to keep in mind here:
1) Numenta has been approaching computational neuroscience from the top down — starting with neuroscience and attempting to figure out how the neocortex’s fundamental block (“column”) operates in computational terms. So they’ve done almost the opposite of the rest of the ML community which started from a “What can we compute with the hardware at hand?” perspective. While the rest of the ML community is stuck with the path dependence of graphics hardware (which is, unsurprisingly, fine for a lot of low level image processing tasks), Numenta has been progressively refining its top-down computational neuroscience approach to the point that they’re getting state of the art results with FPGAs that model what they see as going on in the neocortex.
2) The word “inference” says nothing about how one learns — only how one takes what one has learned to make inferences. However, even if limited in this way, there are a lot of models that can be distilled down to very sparse connections without loss of performance and realize “up to 100X improvement in throughput and energy efficiency”.
I have no conflict of interest in this. While my background extends back to the 1980s and my association with Charles Sinclair Smith of the Systems Development Foundation, which financed the PDP books that revived machine learning, my most significant role has been asking Marcus Hutter (PhD advisor to the founders of DeepMind) to establish The Hutter Prize for Lossless Compression of Human Knowledge, which takes a top-down mathematical approach to machine intelligence.
PS: Not to distract from the above, but since I cut my teeth on a CDC 6600, there is an idea about keeping RAM access on-die somewhat inspired by Cray’s shared memory architecture on that series. It is wildly speculative, involving mixed-signal design that is probably beyond the state of the art in IC CAD systems if it is physically realistic at all, so take it with a grain of salt.
abufrejoval writes:
That’s where I am looking for reassurance, because I just love the Mill. But loving it doesn’t mean being convinced about the value it can deliver.
Agreed, and I’m not going to tell you I offer that reassurance. The market gap I see is very top-down and technical; in the final analysis it is mainly about keeping shared RAM access on die, given that increases in density have been outstripping increases in clock rate for a decade or so.
But the machine learning world is not only an emerging market for silicon; it is breaking out of its drunken, path-dependent stupor about dense models born of cheap GPUs and realizing the value of sparse models, not only for better models (see Algorithmic Information Theory and Solomonoff Induction’s formalization of Occam’s Razor) but also for a factor of 100 energy savings per inference. The large models everyone is so excited about are not just ridiculously wasteful of silicon; their energy costs dominate.
NVIDIA’s newest ML foray (Hopper) at 80e9 transistors claims it supports “sparsity”. This is (to be _very_ kind) marketing puffery. Their “sparsity” is only about a factor of 2. In other words, each neuron is assumed to be connected to half of all the other neurons. All their use of that term tells us is that the market demands sparsity and that NVIDIA can’t deliver it but knows they need to. Actual graph clustering coefficients in neocortical neurons and actual weight distillation metrics indicate you’re probably going to hit the broad side of the market’s barn by simply turning those 80e9 transistors into a cross-bar RAM with a large number of banks of phased-access RAM on one axis and a large number of simple GPs on the other axis.
Can the Mill serve as the “simple GPs”? How many transistors does one Mill GP take if its architecture is biased toward sparse array matrix multiplies and/or sparse boolean array (with bit sum) operations?
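To make “sparse boolean array (with bit sum)” concrete, here is the kind of inner loop I have in mind: binary weights and activations packed 64 to a word, ANDed, and reduced with a population count. It is only a sketch of the operation class, written with the GCC/Clang popcount builtin, not a proposal for how the Mill should encode it.

```c
/* Sketch of a "sparse boolean array with bit sum" inner loop: binary
 * weights and activations packed 64 per word, ANDed together, and
 * reduced with a population count.  Illustrative only;
 * __builtin_popcountll is the GCC/Clang builtin, nothing Mill-specific. */
#include <stddef.h>
#include <stdint.h>

size_t binary_overlap(const uint64_t *weights,
                      const uint64_t *activations,
                      size_t nwords)
{
    size_t sum = 0;
    for (size_t i = 0; i < nwords; ++i)
        sum += (size_t)__builtin_popcountll(weights[i] & activations[i]);
    return sum;   /* positions where weight and activation are both 1 */
}
```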
As far as the “switch of ISA” is concerned, what do you think CUDA is? What I mean by that is that there is a lot of work out there adapting software to special-purpose hardware, motivated by machine learning. I don’t see why a pretty substantial amount of that couldn’t be peeled off to make the compilers more intelligent at matching the Mill ISA to the hardware market.
abufrejoval writes:
With wafer scale machine learning and quantum computing we are so down the road to special purpose architectures, that GP is really treated like orchestration code.
The big market gap in machine learning is extreme sparsity, and extreme sparsity requires indirect memory access due to the data structures required for sparse arrays. So the ML challenge is keeping shared, parallel, random memory access on-die. This gap has existed ever since GPUs were repurposed for ML, because graphics (up until ray tracing) had little use for sparse matrix multiplies. That has, in turn, biased the entire field toward dense models: drunks looking for their keys under the lamp post. The hundreds of billions of parameters in the large models are dense, but it has been demonstrated that weight distillation can bring that down by at least a factor of 10 without loss of perplexity score; one case achieved a 97% reduction. This is consistent with what we know about neocortical neuron connectivity. Moreover, Algorithmic Information Theory tells us that the gold standard for data-driven induction of models is parameter minimization approximating the Kolmogorov Complexity of the data.
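To spell out what “indirect memory access due to the data structures required for sparse arrays” means in practice, here is a minimal compressed-sparse-row (CSR) matrix-vector multiply; the gathers through col_idx[] into x[] are the shared, parallel, effectively random on-die accesses in question. A generic sketch, not tied to any particular hardware.

```c
/* Minimal compressed-sparse-row (CSR) matrix-vector multiply, y = A*x.
 * The loads through col_idx[] and then x[col_idx[k]] are the indirect,
 * effectively random memory accesses that extreme sparsity forces. */
#include <stddef.h>

void csr_spmv(size_t nrows,
              const size_t *row_ptr,  /* nrows+1 entries: row i spans
                                         [row_ptr[i], row_ptr[i+1])    */
              const size_t *col_idx,  /* one column index per nonzero  */
              const float  *val,      /* one value per nonzero         */
              const float  *x,        /* dense input vector            */
              float        *y)        /* dense output vector           */
{
    for (size_t i = 0; i < nrows; ++i) {
        float acc = 0.0f;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            acc += val[k] * x[col_idx[k]];   /* indirect gather from x */
        y[i] = acc;
    }
}
```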
Quantum computing is a pig in a poke.
abufrejoval writes:
At the high end in cloud servers, transistor budgets for cores and the Watts to operate them seem much more compelling reasons to pay for architecture switches, but I don’t know if you could scale the Mill meaningfully to dozens of cores in a die.
It’s important to think in terms of 50e9 transistors per die as a way of keeping shared RAM access on-die. That’s where we are for both indirect memory access (sparsity in ML) and GP.
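As a back-of-envelope check on that number (my arithmetic, assuming the usual 6-transistor SRAM cell and ignoring decoders, sense amplifiers, the cross-bar, and the GPs themselves):

```latex
% Rough upper bound on on-die SRAM from a 50e9-transistor budget,
% assuming ~6 transistors per bit and no peripheral overhead:
\frac{50 \times 10^{9}\ \text{transistors}}{6\ \text{transistors/bit}}
  \approx 8.3 \times 10^{9}\ \text{bits} \approx 1\ \text{GB}
```

Spend part of that budget on the cross-bar and the simple GPs and the usable shared RAM shrinks accordingly; that is the trade-off at issue.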