  • mmeyerlein
    Participant
    Post count: 13
    #3431 |

    hello mill team, I’m pretty curious about what’s happening right now.
    will there be another talk soon?
    or other news or updates?
    looking forward to new infos 😉

    • This topic was modified 3 years, 3 months ago by  mmeyerlein.
  • Thomas D
    Participant
    Post count: 21

    HELLO??? Did Ivan/Roger/Mark leave us and now this startup is dead, too?

    • Ivan Godard
      Keymaster
      Post count: 627

      Nope, still here – a screwup on my RSS feed meant that I didn’t know these posts were happening until veedrac brought it to my attention on a different board. Sorry.

  • Ivan Godard
    Keymaster
    Post count: 627

    Mostly we’ve been working with tests in C; currently 95+% of the suite compiles. The C++ library is coming up because we are doing the OS kernel in C++, but it’s really only starting. Of course compile-time only code like the STL ports easily, but the rest will be a lot of work. The only part that is not just a port-and-debug is throw/catch, for which the standard implementation approach doesn’t work on a bare machine. We know what we will do, but that will be new code, which is slow to produce while we are still elfing things out.

    For the C tests we have been throwing whatever we can find for free at it. We have the entire Rosetta suite in, for example. Probably the biggest single test in the suite is the micropython interpreter. There are a bunch more tests available that we can’t yet use because they need a file system – the bare-machine problem again; an API to use the host file system using outcalls is on the spike, but low priority.

    As for comparisons, we have a commercial harness running (can’t remember the name) that does host vs. target compares across various stats and different test runs. With that we’ve found and fixed several sim-level problems: we didn’t have a stride prefetch predictor configured, exit predictor bugs, poor cache replacement, that kind of thing. By eyeball after that work Silver seems to be running similarly to our server x86s. Of course we can only measure execution, and will have to armwave power and area.

    Actually, the biggest issue to my mind coming out of the comparisons is how apples-to-oranges they are, IPC especially. Sometimes the A2O (apples-to-oranges) mismatch gives us more instructions than x86: all our static speculation counts as instructions, but OOO run-ahead doesn’t; likewise a load-op-store sequence counts as three ops for us while an x86 mem/op/mem counts as one. Sometimes the A2O gives the x86 more: call is one op for us, potentially dozens for x86. About the only thing that is more apples-to-apples is whole-program wallclock-to-wallclock and there Silver seems pretty close on individual tests and sometimes ahead, and it’s still early days for our code quality. That’s without pipelining up yet; I’m real encouraged because I’m seeing loop sizes compress by a factor of three or more on Silver when I enable piping.
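
A rough illustration of the counting mismatch described above. The numbers are invented for the sketch, not measurements from the Mill sim or from any x86; the `Run` class and its fields are hypothetical names. It only shows how the same work in the same number of cycles can report very different IPC when one side counts load, op and store separately and the other counts a single memory-operand instruction.

```python
from dataclasses import dataclass


@dataclass
class Run:
    name: str
    counted_ops: int  # what the measuring tool reports as "instructions"
    cycles: int       # cycles spent on the same work

    @property
    def ipc(self) -> float:
        return self.counted_ops / self.cycles


ITERATIONS = 1000  # hypothetical loop doing mem[i] = mem[i] + k

# Load, add and store are counted as three separate ops.
mill_like = Run("mill-like", counted_ops=3 * ITERATIONS, cycles=1000)
# One mem/op/mem instruction is counted once for the same work.
x86_like = Run("x86-like", counted_ops=1 * ITERATIONS, cycles=1000)

for run in (mill_like, x86_like):
    print(f"{run.name}: IPC = {run.ipc:.1f}, cycles = {run.cycles}")
```

Both runs take the same number of cycles, so the wallclock comparison is a tie even though the reported IPC differs by a factor of three.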

    For now Gold isn’t really giving better numbers than Silver – the extra width can’t be used by the test codes and won’t be until we not only have piping working but also can unroll up to saturate the width. All the configs also need tuning – Silver to run one daxpy per cycle piped and Gold to run two, piped and unrolled, both with excess resources cut out to save power and area.

    There are tuning issues in the smaller members too, Tin and Copper. There the issue is belt size. Even an 8-position belt is enough for the tests’ transient data, but the codes also have long-lived data, and the working sets of that data don’t fit on the smaller belts, and some not on the 16-belt of a Silver. As a result the working sets get spilled to scratch and the core essentially runs out of scratch with tons of fill ops; much like codes on original 8086 or DG Nova. This is especially noticeable on FP codes like the numerics library for which the working set is full of big FP constants. Working out of scratch doesn’t impact small-member performance much, but has scratch bandwidth consequences, and power too that we can only guess at. We may need to config the smaller members with bigger belts, but that too has consequences. In other words, the usual tuning tradeoffs.
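
To make the working-set point concrete, here is a minimal sketch that counts how many values are live at once in a toy kernel and compares that with a few belt sizes. It is not the specializer's algorithm: the op format, `build_kernel` and `peak_live_values` are hypothetical, and the model ignores belt position entirely. It only shows why a block of long-lived FP constants overflows an 8-position belt even when the transient data would fit.

```python
def build_kernel(num_constants, passes):
    """Toy kernel: acc += x * c[i], cycling over long-lived constants.

    Each op is (result_name, operand_names). Hypothetical format.
    """
    ops = [(f"c{i}", ()) for i in range(num_constants)]  # long-lived values
    ops.append(("acc0", ()))
    acc, step = "acc0", 0
    for _ in range(passes):
        for i in range(num_constants):
            step += 1
            x, prod, new_acc = f"x{step}", f"p{step}", f"acc{step}"
            ops.append((x, ()))                 # transient input
            ops.append((prod, (x, f"c{i}")))    # transient product
            ops.append((new_acc, (acc, prod)))  # running sum
            acc = new_acc
    return ops


def peak_live_values(ops):
    """Maximum number of values that are defined and still have a later use."""
    last_use = {}
    for index, (_, operands) in enumerate(ops):
        for name in operands:
            last_use[name] = index
    live, peak = set(), 0
    for index, (result, operands) in enumerate(ops):
        # Operands whose final use is this op die here.
        live -= {name for name in operands if last_use[name] == index}
        if result in last_use:  # only values with a later use stay live
            live.add(result)
        peak = max(peak, len(live))
    return peak


peak = peak_live_values(build_kernel(num_constants=12, passes=2))
for belt_size in (8, 16, 32):
    excess = max(0, peak - belt_size)
    print(f"belt {belt_size:2d}: peak live = {peak}, values that cannot fit = {excess}")
```

With 12 constants the peak is 14 live values: the constants, the accumulator and one transient. That overflows an 8-position belt by six values and fits a 16-position one; with 20 constants the 16-position belt overflows as well.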

    As soon as pipelining is up and we have comparison numbers we expect to pull the trigger on the next funding round, perhaps ~$12M. There are a bunch of things we want to file more patents on, but those, and more talks, will wait for the round.

    • kwinz
      Participant
      Post count: 2

      You are programming an OS kernel now in C++?
      Sounds like an awful business decision for a hardware startup.

      In the interest of “crawl, walk, run”:
      Start with an OS-less platform first. Something that doesn’t need an OS, like Arduino platform board.
      Yes, it won’t highlight all the novel address space innovations and target “specializations” that you bring to the table but it’s a start.
      Then iterate. Similar to how SiFive did it with their RISC-V offerings.

      I worry about you guys. All the best!

      • Ivan Godard
        Keymaster
        Post count: 627

        Much less burdensome on a Mill, but still will be a long time and a lot of work, yes.

        However, we need to start the OS work so as to have a test and verification code suite for the system aspects of the chip. Even an Arduino needs interrupt handling, atomics, … Standard benchmarks don’t exercise those parts of the design.

        Various announcements coming soon (for some value of soon).

        Ivan

      • peceed
        Participant
        Post count: 1

        There are tuning issues in the smaller members too, Tin and Copper. There the issue is belt size. Even an 8-position belt is enough for the tests’ transient data, but the codes also have long-lived data too, and the working sets of that data don’t fit on the smaller belts, and some not on the 16-belt of a Silver. As a result the working sets get spilled to scratch and the core essentially runs out of scratch with tons of fill ops; much like codes on original 8086 or DG Nova. This is especially noticeable on FP codes like the numerics library for which the working set is full of big FP constants. Working out of scratch doesn’t impact small-member performance much, but has scratch bandwidth consequences, and power too that we can only guess at. We may need to config the smaller members with bigger belts, but that too has consequences. In other words, the usual tuning tradeoffs.

        I think you can use a small register set that is a logical extension of the belt, using an additional bit in the argument address.
        The encoding cost is acceptable and it solves the problem of “frequently used arguments”.
        It can be entropy optimized by restricting the number of register arguments to one per operation or by limiting the number of functional units that can use register arguments. Small models can use more bits for the register specifier.

        The C++ library is coming up because we are doing the OS kernel in C++

        I am under the strong impression that you are trying to innovate too much at once.
        Your initial goal should be a “software stack accelerator”: a processor that needs minimal OS modifications and is fully compatible with existing applications (Linux/Java/Android).
        Forget single address space: it doesn’t save a lot of power (TLB uses ~15% IIRC) but is the biggest blocker to quick adoption. You can easily make it optional.
        You can win the market by offering “only” double performance to power and performance to cost ratios, as long as you are software compatible/sane. “Datacenters and smartphones” are sensitive enough to 2-3x power advantage, but they are not able to rewrite their software!
        Time is running out – volume of computations is moving into visual/AI domain.

    • goldbug
      Participant
      Post count: 52

      Maybe open source the kernel? There are a lot of file systems out there; I am sure someone could port one, maybe from Genode. Even if it is not perfect, it might serve as an MVP.

      • Ivan Godard
        Keymaster
        Post count: 627

        We are doing a micro-kernel, which is necessarily from scratch although we may borrow approaches from L4.

        • indolering
          Participant
          Post count: 2

          We are doing a micro-kernel, which is necessarily from scratch although we may borrow approaches from L4.

          That’s odd, the L4 kernels got code reuse up to 50%. Is it that the architecture is just so very different from traditional architectures? Or is it a lack of well supported open-source L4 kernels? I know Intel is using Minix for some of their embedded stuff….

          • Ivan Godard
            Keymaster
            Post count: 627

            I suppose it depends on what you include in the kernel. Our OS framework is a set of cooperating services, but the great majority of that is app code with no particular privileges – things like a math library, but including a lot that in legacy CPUs has to run in the kernel.

            We define the real kernel as that code which has to be trusted because it can unilaterally change the state and condition of other code that is not trusted. This trust is different from code that is relied upon: if your app uses sqrt in some calculation, you rely on the math library to in fact give you a square root. But sqrt cannot change the state of its caller (courtesy the Mill protection model) so the app does not have to trust it in this sense.

            So what has to be trusted? Not very much on a Mill: some initialization code for boot; the top-level interrupt handler; the dispatcher; a few allocators; and most importantly, the code that updates the protection state. We project ~3k LOC total. And nearly all of that code exists to deal with the way the Mill works, so it can’t be shared with any other platform, in either direction.

            Of course, surrounding that microkernel there will be a ton of untrusted (but relied upon) libraries that we expect to lift from L4 and anywhere else. That will include a lot of what the original source thought was part of the kernel, but we don’t.

            We anticipate considerable terminological confusion.

    • goldbug
      Participant
      Post count: 52

      I could not find anything on google. What is A20? is it a profiler? emulator?

    • cpt_charisma
      Participant
      Post count: 1

      On the Tin and Copper (and all members, really), it seems like you can get a 25%-400% (depending on what you’re doing) effective belt length increase if you can get the specializer to add something like conform ops every couple of instructions to clear dead values or move useful ones to the front. If 80% of values are only used once on average, it makes sense to get rid of them asap, leaving more space for useful values. The specializer should already know live/dead info, since it has to figure out whether to spill things. The question is whether you have enough holes in the schedule to take advantage. It might be worth adding conform to other slots or adding a few bits to the encoding specifically for this purpose. It’s extra hardware, but maybe less than a longer belt?

      You could also do something goofy like automatically killing belt values that get spilled or written to memory. You would have to add additional ops for cases where you don’t want that behavior. I have no idea whether this would be worthwhile, though.

      Of course, this assumes you usually have at least a few dead values on the belt.

      • Ivan Godard
        Keymaster
        Post count: 627

        An update: Conforming is now built into branch ops, so the only use of a separate op is to recover live values from the mix of live and dead, as you suggest. Consequently the former “conform” op is now spelled “rescue”.

        Rescue is cheap in execution time but expensive in code space when a value is only used far from where it is defined. A significant part of the specializer heuristics is devoted to finding a balance between use of rescue and of spill-fill.

        We looked into auto-kill, but so far it seems to not fit the hardware timing; figuring out how things get reordered in the specializer is faster than trying to do it at runtime.

        Rescue is only useful when the currently live values fit on the belt but some are out of reach. If the live-value count exceeds the belt size then there’s nothing for it but spill and refill when belt pressure is lower.
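
A toy model of the rescue idea, for readers who have not seen the belt talks. Everything here is an assumption made for illustration: the `Belt` class, its method names, and the ordering of rescued values are hypothetical and say nothing about the real encoding or hardware. The belt is modeled as a fixed-length queue where new results drop at the front and the oldest value falls off the end.

```python
from collections import deque


class Belt:
    """Fixed-length queue: position 0 is the newest value."""

    def __init__(self, size):
        self.slots = deque(maxlen=size)

    def drop(self, value):
        # A new result enters at the front; the oldest value falls off the end.
        self.slots.appendleft(value)

    def position(self, value):
        # Raises ValueError if the value has already fallen off the belt.
        return self.slots.index(value)

    def rescue(self, positions):
        # Re-drop the values at the given positions so they are in reach again.
        # Assumed ordering: positions[0] ends up at the front.
        values = [self.slots[p] for p in positions]
        for value in reversed(values):
            self.slots.appendleft(value)


belt = Belt(size=8)
belt.drop("k")                # a live value we still need later
for i in range(7):
    belt.drop(f"t{i}")        # dead temporaries pile up in front of it

before = belt.position("k")   # 7: one more drop and "k" is gone
belt.rescue([before])
after = belt.position("k")    # 0: back at the front
print(before, after)
```

In this model the rescue only helps because "k" is still somewhere on the belt; once more live values exist than belt positions, something has to be spilled, as the post above says.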

        • Thomas D
          Participant
          Post count: 21

          Conforming is now built into branch ops, so the only use of a separate op is to recover live values from the mix of live and dead, as you suggest. Consequently the former “conform” op is now spelled “rescue”.

          When are you going to regenerate the wiki? I imagine that it is quite old at this point….

          • Ivan Godard
            Keymaster
            Post count: 627

            Positively antique! We’ve hoped for someone to join that wants to do it, but so far… You interested?

          • Thomas D
            Participant
            Post count: 21

            I don’t believe that I could deliver results commensurate with how exciting I find your work. As a consequence of turning a hobby into a bill-paying job, the job frequently saps my desire to perform my hobby (and given the paperwork that I signed for said job, I am frequently unsure of how to actually contribute to things of pith and moment). I’ll have to pass, for now.

            Puke: Nevermind my tights, I thought we were talking about my pith.
            Snot: That’s why you have to wash them out!

  • LarryP
    Participant
    Post count: 78

    What’s a daxpy?

    one daxpy per cycle

    And when you write “cycle,” do you mean a clock cycle or a Mill instruction?

    Thanks,

    • Ivan Godard
      Keymaster
      Post count: 627

      What’s a daxpy?
      Daxpy is a BLAS function, the most commonly used: “double precision A times X + Y”; see https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms#Level_1. It’s why hardware has FMA (a.k.a. MAC) ops.
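
For anyone who wants to see it spelled out, a plain-Python sketch of what daxpy computes. This is not the BLAS implementation, just the definition written as a loop; like the BLAS routine it updates `y` in place. Each element costs one multiply and one add, which is the pair an FMA op fuses.

```python
def daxpy(a, x, y):
    """y[i] = a * x[i] + y[i] for every element, updating y in place."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]  # one multiply-add per element
    return y


y = [10.0, 20.0, 30.0]
daxpy(2.0, [1.0, 2.0, 3.0], y)
print(y)  # [12.0, 24.0, 36.0]
```

"One daxpy per cycle" then reads as one such multiply-add element retired per cycle in steady state.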

      And when you write “cycle,” do you mean a clock cycle or a Mill instruction?
      Absent stalls, Mill issues all instructions of one bundle (formerly described as “all operations of one instruction”, except people got confused by the terminology) in one clock cycle. So cycle and bundle (instruction) are equivalent.

  • mmeyerlein
    Participant
    Post count: 13

    hello again, somehow it has become quite calm in the last months around this very unusual project.
    i am still very curious about the current status. can you give a big update again?
    or how about another talk? they were always very insightful.

    • Ivan Godard
      Keymaster
      Post count: 627

      Well, life is what happens when you were planning something else. We were starting to search for our next funding round, as announced, when the virus hit and that whole industry put on its hat and went home.

      In some ways the plague is much less a problem for the Mill project than for other businesses. We have always been a distributed virtual company, so we already had work-from-home worked out. And as a sweat-equity organization with a burn rate of zero we have an infinite runway, while so many others are shut down and going bust.

      So no news from us is good news, sorta. Thanks for your encouragement.

  • Nick
    Participant
    Post count: 2

    On the contrary, Marc Andreessen seems to think now is a great time to invest in the future. Wonder if he’s actually willing to make those investments or if he’s all talk :P. Mill is one of those companies that you don’t see very often: redesigning platforms instead of incrementally layering and remixing. I hope things work out well for you guys; you deserve success.

    Is the work that’s left just a software challenge? Is the funding just for hiring some devs and hitting “print” at the fabs at this point? I feel there should be some markets that just care about raw “on the metal” compute, not established OS support, and maybe an LLVM backend is all they need? My naive guess anyway… I presume you have initial markets in mind.

    • Ivan Godard
      Keymaster
      Post count: 627

      I wonder too 🙂 Realistically, the virus has to settle down a little before doing meetings.

      As for the project, the tool chain is usable against our four test configurations; it is no longer on the critical path to product. Software is working on the micro-kernel, and hardware is working on getting the C++ expansions to make the right Verilog for all. There’s a lot to do, but we are out of research and into development, and we can use both money and talent now.

      • Nick
        Participant
        Post count: 2

        True! I guess you shouldn’t be shaking hands with too many folks just yet. I keep forgetting how bad things are in the US — Australia has it pretty good right now!

        Sounds like you’re all doing well. Best of luck. I want a Mill in my laptop in 10 years!

  • mmeyerlein
    Participant
    Post count: 13

    me again 🙂
    i check in here every few months to see how the mill is developing. i find the ideas and approaches to challenge every aspect of cpu design long overdue, and have not felt present since transputer, itanium or transmeta. i think the world again needs this courage to not always follow the same or very similar path as seen in thousands of scientific papers on cpu design. that’s why i accompanied all talks with admiration, and i really enjoy being on the wiki because it’s always very inspiring 4 me.
    however, in the last 2 1/2 years it has become very quiet and i could imagine that the fanbase starts to erode if i generalize my motivation.
    ivan, i find it very informative when you answer triggered, which makes my flame of hope blaze 😉
    we ourselves just went through a pandemic with a series b in a similar size, and know how hard it is to get vc money, but it was possible.
    so i wanted to ask you if it wouldn’t be a good idea to make a virtual talk about the challenges and progress of the last 2 years. not so much a special technical talk, but more a general talk about the company itself?

    • Ivan Godard
      Keymaster
      Post count: 627

      Would it be a good idea? Yes. Right now? Well, maybe not.

      We were all set to make our move this past spring; shutting down the Convertible Notes was the last preliminary. A “state of the company” talk would have been part of the active solicitation. You know what happened then. At some point we’ll have to say enough is enough and just go do it, but when? There’s an argument for not waiting, and there’s an argument for waiting, not until everything is better, but at least until everything is stably bad.

      Either way, we won’t do anything during the summer, which is when the whole finance industry goes on vacation.

      And yes, we’re frustrated too.

      • mmeyerlein
        Participant
        Post count: 13

        i will continue to visit here every few months in the hope that the mill has taken another big step forward.
        i see the mill as a coherent overall concept that takes a big step forward instead of the “yet another risc” that the world is celebrating.

        ok, i trust the whole mill team not to lose their enthusiasm and hope that such information, which is given again and again on request, will be established in a format of its own, something like a “message from the bridge” newsletter that comes every three months… 😉

  • phorgan1
    Participant
    Post count: 1

    Just checking in and saying hi

    Patrick

    • mmeyerlein
      Participant
      Post count: 13

      hi 😉

  • mmeyerlein
    Participant
    Post count: 13

    and the next three months are over 😉
    are there some exciting news again?
    what about the fpga implementation ?
    or some interesting benchmarks?
    yes i know, corona, but with such a cool topic you can’t sit still, i bet your fingers are tingling a lot, right?

  • gaby_64
    Participant
    Post count: 10

    i’m waiting for more news as well, the mill cpu is on my “technologies to watch” spreadsheet.

    • Ivan Godard
      Keymaster
      Post count: 627

      We expected to go out for a funding round (and convert from bootstrap to salary-paying) last spring but, well, 2020.

      Right now we’re trying to decide when to make a second try at it. The financial market for the likes of us isn’t back yet, and the virus rates are turning up again, which argues for waiting. But the economy and the future market is real iffy, which argues for doing it now.

      Comments?

      • mmeyerlein
        Participant
        Post count: 13

        from my personal experience i can say that the vc market has been back for a long time, since they took all the stressful companies out of their portfolios in the middle of last year. the really big vc’s are even more aggressive in the market than before.
        therefore: go go go!

        • Ivan Godard
          Keymaster
          Post count: 627

          I’m interested in details. Can you describe that personal experience here, or drop me an email to ivan at millcomputing.com?

          • mmeyerlein
            Participant
            Post count: 13

            maybe i should visit here more often 😉
            i work in berlin as an interim cto and i have accompanied companies in their series a or b in the last two years. i have dealt with vc’s of the size of lightspeed, and of course talked to them a lot about pandemic. all vc’s had a lot to do in the beginning of 2020 to quickly clean up their portfolio. from mid 2020 on, however, business continued as normal, and tech companies were never out of focus. at least that is what i had heard several times from the vc’s, and also experienced myself.
            so i understand that the processes of one’s own company have to be adjusted first in the pandemic, which didn’t simplify the cooperation, but with a good team, good founders, a good idea and a good pitch deck you can find a vc. i’m not saying it’s easy to contact 100+ vc’s, but def. possible 🙂

            I find the idea and almost all approaches terrific!
            however, i would have rather chosen the risc-v approach and won a large community for this great idea and then monetized the implementation.
            but i respect and admire your persistence!

          • Ivan Godard
            Keymaster
            Post count: 627

            Thank you for your support and enthusiasm. The company has done fine over the past year; being already a global virtual company almost nothing changed for us. But it’s been a rough year personally for many of us, especially one of the compiler team who spent a month in hospital in Poland and is still not fully back.

            About your recent fund-raising experiences: In your VC contacts, were you face-to-face or video? Had you been vaccinated?

            Personally I’m in a high-risk group, and more than a little paranoid about the situation here in the US.

          • mmeyerlein
            Participant
            Post count: 13

            i can reassure you. we had the meetings and the signatures all done remotely, because it was in the height of the pandemic or at a time where no one knew exactly what was going on. in the meantime i’ve been vaccinated, but even now the board meetings are all remote. but that will certainly adjust a little again soon 🙂

            as i said, i think the idea is really great, however i’m starting to get a little worried that the competition in the fast lane, with much worse implementations, will race past the mill and it will lose its usp. processors like the prodigy and the huge community of risc-v are making very big strides forward with their traditional implementations on both important kpi’s performance/transistor and performance/power, which will make funding more difficult.
            i would find it extremely unfortunate if this really good new approach were to disappear into a drawer as a result….

            so get to the computer and present the pitch deck at the vc’s! 😉

          • mmeyerlein
            Participant
            Post count: 13

            winter is just around the corner.
            and since I’m still very curious, I wanted to ask what the answers from the vc’s are 🙂

          • Ivan Godard
            Keymaster
            Post count: 627

            Come on, you know we can’t say 🙂

          • abufrejoval
            Participant
            Post count: 3

            Nice to hear you’re still talking to them!

            The lectures were truly inspiring and made tons of sense, while I was listening to them.
            Still over the years I’ve forgotten so much, I couldn’t explain how the Belt works if I were asked today 🙁

            What I *do* remember is an order of magnitude better general performance from the same transistor budget.

            But when I look at an Apple M1 vs. a Jetson Nano, or a AMD Ryzen 5800U vs. an AMD Bobcat, that’s also an order of magnitude in a decade, redoing architectures on a very conventional ISA.

            It reminds me of i860 vs. i386 days when a novel “Cray-on-a-chip” ISA could deliver an order of magnitude of performance per clock, but never survived more than half an architecture refresh, while x86 still lives.

            So I wonder how meaningful do “tin” to “gold” performance targets remain a decade after starting, when even x86 and ARM need to prove that they can continue to scale performance at static energy cost?

            In theory a Belt ISA implementation should always remain ahead, but only if it could mobilise similar budgets to keep scaling the implementation.

            I am growing a little worried, that perhaps the Belt will wind up better than a comparable RISC-V at the same transistor budget, but that it won’t matter because it’s travelled downward to the embedded “sleep mostly” range where the cost of an extra ISA is much higher than the price for the extra die area.

            You’d need to hit laptop or smartphone targets with significantly better performance and/or energy efficiency ratios to get enough sales traction to create an eco-system, so where would “tin” to “gold” fit today, when you imagined them a decade ago?

            How would you grow to 256 Platinum cores for a server variant and can an ISA survive without planning for that league?

          • mmeyerlein
            Participant
            Post count: 13

            hello again – i’m starting to feel like howard carpendale 🙂

            as i am still highly interested in how things are going with you, my quarterly reminder has directed my focus back to you.

            due to the rather restrained communication, the feeling is slowly spreading that, according to experience, the probability of financing is strongly decreasing… which i would find extremely unfortunate…

            is there a plan b?
            something like an open source project?
            or at least to make the patents freely available?
            because selling them to be locked up, would be, as already said several times, an extreme pity.

            can you feed your fans a little information, please?

          • Ivan Godard
            Keymaster
            Post count: 627

            Had to look up Carpendale 🙂

            We continue to feel little time pressure; our advantages (and drawbacks, such as they are) seem to persist. The industry and market have essentially given up on ISA advances; there have been process advances, but they apply to us as much as anyone. The work we can do in house results (eventually) in money to us instead of money to funders.

            We know we can’t dawdle forever – patents time out, as do people. But for now, steady as she goes.

          • abufrejoval
            Participant
            Post count: 3

            Has the industry given up on ISA improvements?

            My impression is that a 10x efficiency improvement in general purpose code isn’t enough to make it change horses any more, because general purpose is becoming less important.

            With wafer scale machine learning and quantum computing we are so down the road to special purpose architectures, that GP is really treated like orchestration code.

            And RISC-V nicely fills that space where GP code and special purpose extensions make things happen in the embedded world, even if the European Processor Initiative is playing with HPC extensions, too. I can’t see the Mill compete there, because reduced entropy in its instruction space is at the heart of its design.

            It’s extremely frustrating to know that with Mills on a current process mobiles, laptops and chromebooks could run just as fast but with much less CPU power, but with displays, RAM and storage already taking the wattage lion share (NPUs, DPUs, IPUs and GPUs the SoC real-estate) and few people having to survive days without charge it wouldn’t really matter that much any more.

            At the high-end in cloud servers transistor budgets for cores and Watts to operate them seem much more compelling to pay for architecture switches, but I don’t know if you could scale the Mill meaningfully to dozens of cores in a die.

            I fear that the Mill has missed its window of opportunity and I find that extremely sad, because it’s a truly great and inspirational design.

          • jabowery
            Participant
            Post count: 9

            abufrejoval writes:

            With wafer scale machine learning and quantum computing we are so down the road to special purpose architectures, that GP is really treated like orchestration code.

            The big market gap in machine learning is extreme sparsity and extreme sparsity requires indirect memory access due to the data structures required for sparse arrays. So the ML challenge is keeping shared, parallel, random memory access on-die. This is a gap that has existed ever since GPUs were repurposed for ML since graphics (up until ray trace) had little use for sparse matrix multiplies. This has, in turn, biased the entire field toward dense models — drunks looking for their keys under the lamp post. The hundreds of billions of parameters in the large models are dense but it’s been demonstrated that weight distillation can bring that down by at least a factor of 10 without loss of perplexity score. One case achieved 97% reduction. This is consistent with what we know about neocortical neuron connectivity. Moreover, Algorithmic Information Theory tells us that the gold standard for data-driven induction of models is parameter minimization approximating Kolmogorov Complexity of the data.
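
To show where the indirect memory access comes from, here is a minimal sketch of a sparse matrix-vector multiply in the common compressed sparse row (CSR) layout. The function and variable names are chosen for this example only. The point is the inner load `x[col_index[k]]`: the address depends on data that was itself just loaded, which is the access pattern dense hardware handles poorly.

```python
def csr_matvec(values, col_index, row_start, x):
    """Multiply a CSR-format sparse matrix by a dense vector x."""
    y = []
    for row in range(len(row_start) - 1):
        total = 0.0
        # Only the stored (nonzero) entries of this row are visited.
        for k in range(row_start[row], row_start[row + 1]):
            total += values[k] * x[col_index[k]]  # indirect load via col_index
        y.append(total)
    return y


# The 3x3 matrix [[5, 0, 0], [0, 0, 2], [0, 3, 0]] stored as CSR.
values = [5.0, 2.0, 3.0]
col_index = [0, 2, 1]
row_start = [0, 1, 2, 3]

print(csr_matvec(values, col_index, row_start, [1.0, 10.0, 100.0]))
# [5.0, 200.0, 30.0]
```

Only three multiplies are done instead of nine, but each one needs a data-dependent lookup into `x`.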

            Quantum computing is a pig in a poke.

            At the high-end in cloud servers transistor budgets for cores and Watts to operate them seem much more compelling to pay for architecture switches, but I don’t know if you could scale the Mill meaningfully to dozens of cores in a die.

            It’s important to think in terms of 50e9 transistors per die as a way of keeping shared RAM access on-die. That’s where we are for both indirect memory access (sparsity in ML) and GP.

          • abufrejoval
            Participant
            Post count: 3

            What you say rings true to my naive ears, but what does it mean for the Mill?

            To my understanding RISC-V is a totally unremarkable architecture which on its own would come decades too late and offer none of Mill’s merits, but the easily extended instruction set punches exactly where it counts: special purpose acceleration seamlessly baked into general purpose outer loops. And here we are talking 3 or more orders of magnitude better than general purpose instructions, where the Mill might deliver one order of magnitude for a similar transistor budget.

            But exactly because one way it achieves this advantage is by using a reduced encoding space for code, it loses ISA extensibility AFAIK.

            Of course, accelerators might just be memory mapped and orchestrators just need to ready the bits in the RAM that neural code might then decode as sparse. The Mill might still deliver 10:1 benefits on orchestration, but is that enough to motivate a switch of ISA?

            That’s where I am looking for reassurance, because I just love the Mill. But loving it, doesn’t mean being convinced about the value it can deliver.

          • jabowery
            Participant
            Post count: 9

            abufrejoval writes:

            That’s where I am looking for reassurance, because I just love the Mill. But loving it, doesn’t mean being convinced about the value it can deliver.

            Agreed and I’m not going to tell you I offer that reassurance in my very top-down technical market gap which, in the final analysis, is mainly about keeping shared RAM access on die given the fact that increased density has been outstripping increased clock rates for a decade or so.

            But the machine learning world is not only an emerging market for silicon — it is breaking out of its drunken path-dependent stupor about dense models born of cheap GPUs to realize the value of sparse models — not only for better models (see Algorithmic Information Theory and Solomonoff Induction’s formalization of Occam’s Razor), but for a factor of 100 energy savings per inference. The large models everyone is so excited about are not just ridiculously wasteful of silicon, their energy costs dominate.

            NVIDIA’s newest ML foray (Grace) at 80e9 transistors claims it supports “sparsity”. This is (to be _very_ kind) marketing puffery. Their “sparsity” is only about a factor of 2. In other words, each neuron is assumed to be connected to half of all the other neurons. All their use of that term tells us is that the market demands sparsity and that NVIDIA can’t deliver it but knows they need to. Actual graph clustering coefficients in neocortical neurons and actual weight distillation metrics indicate you’re probably going to hit the broad side of the market’s barn by simply turning those 80e9 transistors into a cross-bar RAM where a large number of banks of phased access RAM are on one axis and a large number of simple GPs are on the other axis.
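
            A small sketch of what a "factor of 2" looks like in practice, assuming the claim refers to a 2-out-of-4 structured pattern (keep the two largest-magnitude weights in every group of four). That reading is an assumption on my part, and the function names and weights below are hypothetical.

```python
# Illustrative sketch only: assumes a 2-out-of-4 structured sparsity pattern.

def prune_2_of_4(weights):
    """Zero the two smallest-magnitude weights in each consecutive group of four."""
    assert len(weights) % 4 == 0, "length must be a multiple of 4"
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest magnitudes within this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


def density(weights):
    """Fraction of weights that are nonzero."""
    return sum(1 for w in weights if w != 0) / len(weights)


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
pruned = prune_2_of_4(row)
print(pruned, density(pruned))
```

            Under that pattern the density can never drop below 0.5, whereas a 97% reduction corresponds to a density of about 0.03.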

            Can the Mill serve as the “simple GPs”? How many transistors does one Mill GP take if its architecture is biased toward sparse array matrix multiplies and/or sparse boolean array (with bit sum) operations?

            As far as the “switch of ISA” is concerned, what do you think CUDA is? What I mean by that is there is a lot of work out there to adapt software to special purpose hardware motivated by machine learning. I don’t see why a pretty substantial amount of that couldn’t be peeled off to make the compilers more intelligent for matching the Mill ISA to the hardware market.

          • Ivan Godard
            Keymaster
            Post count: 627

            Can the Mill serve as the “simple GPs”? How many transistors does one Mill GP take if its architecture is biased toward sparse array matrix multiplies and/or sparse boolean array (with bit sum) operations?

            We clearly have a home in the control-processor role, but the actual ML bit-banging sure seems like it needs a dedicated architecture, not a general purpose one. Mill can of course put in the same operations and accelerators as any other CPU. It can do it with a faster manufacturing turn too, because of the specification-based design. I’m pig-ignorant about ML, but my impression is that the problem is not the computation, it’s all about getting data from here to there. Bio is self-modifying and basically analog, which Mill is not, nor is anything else built with the tools and fabs used for CPUs today. We do have some NYF stuff in the pipeline that addresses on-chip distributed memory, but frankly that’s for conventional programs, not ML.

            You clearly are deep into the subject – care to tell what you’d like to have?

          • jabowery
            Participant
            Post count: 9

            Ivan Godard writes:

            care to tell what you’d like to have?

            See section 4.1 FPGA Implementation of “Two Sparsities Are Better Than One: Unlocking the Performance Benefits of Sparse-Sparse Networks” by Kevin Hunter et al of Numenta, Redwood City.

            From the abstract:

            Using Complementary Sparsity, we show up to 100X improvement in throughput and energy efficiency performing inference on FPGAs.

            There are a couple of things to keep in mind here:

            1) Numenta has been approaching computational neuroscience from the top down — starting with neuroscience and attempting to figure out how the neocortex’s fundamental block (“column”) operates in computational terms. So they’ve done almost the opposite of the rest of the ML community which started from a “What can we compute with the hardware at hand?” perspective. While the rest of the ML community is stuck with the path dependence of graphics hardware (which is, unsurprisingly, fine for a lot of low level image processing tasks), Numenta has been progressively refining its top-down computational neuroscience approach to the point that they’re getting state of the art results with FPGAs that model what they see as going on in the neocortex.

            2) The word “inference” says nothing about how one learns — only how one takes what one has learned to make inferences. However, even if limited in this way, there are a lot of models that can be distilled down to very sparse connections without loss of performance and realize “up to 100X improvement in throughput and energy efficiency”.

            I have no conflict of interest in this. My background extends to the 1980s and my association with Charles Sinclair Smith of Systems Development Foundation, who financed the PDP books that revived machine learning, but my most significant role has been asking Marcus Hutter (PhD advisor to the founders of DeepMind) to establish The Hutter Prize for Lossless Compression of Human Knowledge, which takes a top-down mathematical approach to machine intelligence.

            PS: Not to distract from the above, but since I cut my teeth on a CDC 6600, there is an idea about keeping RAM access on-die somewhat inspired by Cray’s shared memory architecture on that series, but it is wildly speculative — involving mixed signal design that’s probably beyond the state of the art IC CAD systems if it is at all physically realistic — so take it with a grain of salt.

            • This reply was modified 1 month, 3 weeks ago by  jabowery.
          • Ivan Godard
            Keymaster
            Post count: 627

            Interesting bedtime reading; thank you.

            I had one concern: the compaction phase is static preprocessing, which is fine for a fixed corpus but doesn’t really work when the a priori weights are unknown. Compaction looks to be the bin packing problem, and you shouldn’t (I surmise) stick an NP step in the processing. I wonder whether true isolation of the kernels is really necessary though – if the kernels are sparse enough, shouldn’t it be possible to just slam them together randomly and let any collisions be “learned around” in the style of a Bloom Filter?
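
            A rough sketch of the two options, under simplifying assumptions: each sparse kernel is modelled as just the set of positions where it has a nonzero weight, compaction means grouping kernels whose sets don't overlap, and the greedy first-fit heuristic stands in for whatever the paper actually does. All names here are hypothetical.

```python
# Illustrative sketch only: kernels are modelled as sets of nonzero positions.
import random


def first_fit_pack(kernels):
    """Greedily group kernels so that supports within a group are disjoint.

    Returns a list of groups, each a list of kernel indices. First-fit is a
    heuristic; it does not promise the minimum number of groups.
    """
    bins = []  # each entry: (occupied positions, member kernel indices)
    for idx, support in enumerate(kernels):
        for occupied, members in bins:
            if occupied.isdisjoint(support):
                occupied |= support  # in-place update of the bin's set
                members.append(idx)
                break
        else:
            bins.append((set(support), [idx]))
    return [members for _, members in bins]


def overlay_collisions(kernels):
    """Count positions claimed by more than one kernel if all are overlaid blindly."""
    seen, collided = set(), set()
    for support in kernels:
        collided |= seen & support
        seen |= support
    return len(collided)


def random_kernels(count, size, nonzeros, seed=0):
    """Generate `count` random supports with `nonzeros` positions out of `size`."""
    rng = random.Random(seed)
    return [set(rng.sample(range(size), nonzeros)) for _ in range(count)]


kernels = [{0, 5}, {1, 5}, {2, 3}, {0, 4}]
packed = first_fit_pack(kernels)
print(packed, overlay_collisions(kernels))
```

            `first_fit_pack` is the static preprocessing route; `overlay_collisions` measures how much damage the "slam them together" route would have to learn around. Feeding `random_kernels` with different sparsity levels gives a feel for how quickly the collision count falls as the kernels get sparser.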

          • Ivan Godard
            Keymaster
            Post count: 627

            Mill cores build on the same processes as any other core. If the limit of core count is the mm^2 occupied by cache, pin pads, and other sparable regular structures then the cores-per-die count will be the same as other ISAs using the same amount of cache etc. If the limit is yield per wafer then we expect 2X more Mill cores, because the un-sparable space occupied by a Mill core is half the size or smaller than a conventional OOO architecture of similar performance, and it’s the part of the chip that you can’t use sparing on that dominates yield.

  • BobC
    Moderator
    Post count: 10

    My brother-in-law works for a small physics startup that in January started looking for another Angel to get through COVID. They wound up getting 3 VCs in March. (The 3 VCs were to ensure share dilution, not to get 3x money.) Not sure how that came to happen (who talked to who), but the money is out there.

    • This reply was modified 1 year, 1 month ago by  BobC.
