Forum Replies Created
- AuthorPosts
- in reply to: The Mill's Competition: Can it still win? #3627
According to Wikipedia, WebAssembly may get about 65% of native code speed (mileage may vary).
WebAssembly runs on a VM, and is really a 32-bit virtual machine running on modern hardware, much as Steve Wozniak’s Sweet-16 ran on the Apple 2 series (and could easily have run on other machines).
The biggest “value” of running WebAssembly on another processor rather than native code is a portable executable that doesn’t need to be recompiled natively (or, not as much), but you lose so much more in the process, including execution speed and energy efficiency, and you need to have a lot of other things added to make it useful.
It’d be interesting to see how efficiently a Mill processor can execute the VM in practice because it’s clear it’s something a number of online things will use in web browsers. Outside of web browsers it makes far more sense to compile a higher-level language down to local Mill ISA than to use WebAssembly.
The ramifications of the talk have sunk in, and they’re funny in a brilliant way: whereas x86 has rings 0-3 (usually only using 2 of those unless virtualization is used in some form) for levels of memory protection and supervisor/user privileges, the Mill architecture has, by virtue of removing the concept of supervisor/user mode, created a fractal tree of up to 2^22 protection levels that are hardware-accelerated and stupidly easy and cheap to manage. All that, and the virtualization facilities haven’t been revealed as of yet! Sure, in theory, you could lock down an x86 or comparable architecture so that no given task has access anywhere else, but it would have massive overhead in both software and hardware to do so.
As mentioned by another poster regarding embedded software, these ramifications are rather interesting: to my knowledge, I’ve not seen any machine architecture where protection levels are so fine-grained and easy to work with. I am curious about the details of MMU functionality for each of the regions, and whether it has present/not-present bits to make it comparable in that respect: I suspect it does. In a finite physical memory system, I’d expect code to need jump tables or fully relative addressing so it could be swapped out, due to physical addresses being the same as virtual addresses. For data, it means that either all allocated data needs to live in separate physical regions, or there needs to be a method provided for fixing up pointers when regions are swapped in and out.
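To make the relative-addressing idea concrete, here is a toy sketch of my own (nothing here describes how the Mill actually does it): a linked list stored in a flat byte array, where each link is an offset from the start of its region rather than an absolute address. The names `MEMORY`, `NODE`, `write_list`, `read_list` and `move_region` are all made up for illustration. Because only the region base changes when the block is moved, a plain byte copy relocates it and no pointer needs fixing up.

```python
import struct

# Toy flat address space; size and layout are assumptions for illustration.
MEMORY = bytearray(256)
# Each node: (value, offset of the next node relative to the region base).
# An offset of -1 marks the end of the list.
NODE = struct.Struct("<ii")


def write_list(base, values):
    """Store values as a linked list whose links are offsets from base."""
    for i, value in enumerate(values):
        next_off = (i + 1) * NODE.size if i + 1 < len(values) else -1
        NODE.pack_into(MEMORY, base + i * NODE.size, value, next_off)


def read_list(base):
    """Walk the list; only the region base is needed to resolve each link."""
    out, off = [], 0
    while off != -1:
        value, off = NODE.unpack_from(MEMORY, base + off)
        out.append(value)
    return out


def move_region(old_base, new_base, size):
    """Relocate the region with a plain byte copy; no link is rewritten.

    Assumes the old and new ranges do not overlap.
    """
    MEMORY[new_base:new_base + size] = MEMORY[old_base:old_base + size]
    MEMORY[old_base:old_base + size] = bytes(size)  # scrub the old copy


if __name__ == "__main__":
    write_list(16, [10, 20, 30])
    move_region(16, 128, 3 * NODE.size)
    print(read_list(128))
```

Had the links been absolute addresses instead, every one of them would have needed rewriting after the move, which is the fix-up cost I was wondering about.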
But one of the funniest and best ramifications of the region/turf setup is the ability to isolate all data accesses and code accesses so precisely that it’d make tracking down stray pointers in the most complex of code bases a dream. Since you could make each and every subroutine a service with explicitly isolated memory accesses for both code and data, no buggy code, even in the “kernel” (Mill greatly confuses what that means in practice, as one of the ramifications!), can stomp on anything but its own dynamic WKR, thus making it easy to isolate such faults to either a very small part of the code base, or… hardware defects (pretending those won’t happen is insane, as all know). Thus, if a service is known-good code and something messes up, it’s traceable, to a greater degree than on probably any previously existing architecture, to a hardware error, even without ECC or equivalent: if the only code that can access a small subset of RAM is known-good, then it can be demonstrated that the hardware did something wonky (perhaps DMA/bus mastering, or just plain failure).
This would make the Mill architecture an absolutely stunning processor for proving (as much as software can be proven) software code correct, especially kernels and their drivers for any operating systems, and then recompiling it for other architectures, if you felt a strange need to work with legacy hardware 😉
And that’s the rub: it (the Mill architecture) needs to be adopted over other things for the long-term success it needs, but there’s a huge amount of inertia with regard to not only rewriting code (it’s not all portable, and often makes many assumptions about system/CPU architecture that may not be true on Mill) but also the chipsets. I would be so very unhappy if the Mill architecture is stopped not by an architecture that is clearly superior, but merely because it didn’t have a large enough quantum leap to supplant the existing base of higher-end processors along with chipsets. There are too many cases where the “good enough” is the enemy of the much better system, because the “much better system” had to overcome a rather sizable inertia to change by users, commercial and private.
Past attempts at emulating previous instruction sets (Crusoe with their recompiling on the fly, or pure emulation) have been less than ideal: the most practical thing is that code needs to be completely rebuilt for a native instruction set, and while that can be and has been done, that’s a Super-man’s leap of effort for many to accomplish. Recompiling portable source code is so much easier in many respects to get done right.
Perhaps the security aspects of the Mill may be, in combination with so many of the other things, the straw that healed the camel’s back and brings it into widespread adoption in non-tiny spaces: that, and the fact that x86/ARM architecture with registers and complex instruction decoding seems to be hitting a wall for speed/power, regardless of how many gates you throw at it. At least, that’s what I’m hoping for: code exploits are such a problem, costing everyone money and leaving people unsure whether their systems and data are secure, and software is getting too complex and too quickly developed to catch them all, so the machine needs to be proactive in its architecture to make code-related exploits impossible, even with sub-par code.
Having now watched the slideshow after watching the original live broadcast, I’m wondering if my theory is correct: with a 32-long belt, it appears for predicated branches (first true wins) that it’s possible to have a 16-way speculative branch, as there’s 1 load and 1 branch specifier for each case.
Is this correct?
- in reply to: Introduction to the Mill CPU Programming Model #609
This is one heck of a post. However, there appears to be a minor issue in one paragraph that needs to be fixed, though I’m not 100% sure of the details of how it should be fixed:
Under “Instructions and Pipelines” this appears wrong: “Different pipelines have different mixes of FUs available, and the number of pipelines and their FU mix is Mill-model specific. There may be 4 or 8 integer pipelines and 2 or 4 floating point pipelines, for example. There are pipelines that handle flow control and pipelines that handle load and saves. In a mid-range Mill “Gold” CPU there are 33 pipelines of which 8 are integer and 2 are floating-point; the Mill Gold CPU can issue 33 operations per cycle, sustained:”
Unless I’m mistaken, the Gold isn’t mid-range: this is the most detailed information that’s been specified about the pipelines, so I can’t state how correct that is.
If there are other mistakes in all of this, I’m not able to spot them with the information I feel I have/understand. All things considered, that’s a minor issue 🙂
- in reply to: The Mill's Competition: Can it still win? #3634
Yup!
No problem.
Well, I’d like to think that between the two of us we have sufficiently explained how WebAssembly is irrelevant to the Mill processor family and its development and ecosystem.
- in reply to: The Mill's Competition: Can it still win? #3631
So I didn’t mention the JIT, big deal: that doesn’t materially change the fact that adding WebAssembly into the mix and targeting that instead of what’s already been documented for the specializer is a major waste, and won’t give any advantage for running anything not (usually) targeted to run within the confines of a web browser, with 32-bit limitations, on a 64-bit processor line. Absolutely, it’s worth having a web browser for the Mill processors that handles WebAssembly, and for those that desire to run it outside of the browser as well, why not? But it doesn’t have any real benefits outside that context.
I have a huge multiplier of trust in Ivan and his team regarding their design and implementation decisions over someone second-guessing them, as this nonsense does. It makes far more sense to have an intermediate representation that has knowledge of the nature and low-level things a processor line can do, and that was generated from a higher-level language that is processor-agnostic, than it does to badly translate some low-level representation of a specific type of virtual hardware and its artificial limitations, one that was already translated/compiled from the same original higher-level processor-agnostic language and now runs at a fraction of the speed of the native code that would be generated by the specializer. This is the code version of playing the telephone game through more than one dissimilar human language, say, Chinese-English-Thai, which loses meaning and efficiency along the way because there is no 1-to-1 mapping of thoughts.
At least Sweet-16 provided a practical advantage for the Apple 2: a way to execute more compact code on a virtual processor with more registers than the 6502, albeit slower than native. WebAssembly for the Mills only has value as another way to get very limited applications running at a speed and efficiency disadvantage compared to compiling the same code sent through the (say, C++) compiler to Mill without going through a more conceptually-limited VM representation in between. WebAssembly has a very limited use-case scenario in the real world right now, and only makes sense to think about implementing as an entirely separate thing after there is a full OS with web browser for actual Mill CPU hardware already running.
Hi rpjohnst,
For low-level hardware things I’ll leave that to Ivan or someone else formally of Mill Computing.
However, for all hardware Linux runs on, memory management, pids/threads and the like are also an abstraction. Where there’s more required than hardware provides, you’d never see a difference in code, as that’s also abstracted: that sort of thing is handled the same way virtual memory is. That a pid is 32 bits is purely a housekeeping detail that is convenient for the software, and has no connection with the hardware it runs on or its limitations. All those tables can be swapped out as needed by the OS to handle as many as desired.
And, seriously: I’d be shocked if there’s a single CPU with any number of cores that Linux runs on that you’ll find 2^22 threads/processes in-use at any given time 😉
Ok, I’m not surprised: I was very hopeful, that’s true 😉
A great example where such multiway branches can be very useful can be found in the recursive-descent lexer/parser code generated by ANTLR, as it generates (depending on generation options) lots of switch/case situations, recursively.
Now, I’m wondering: is there a way in a single instruction to encode a rather complex if/then/else where multiple values must meet some conditions (say, if (a == b && c == d || d == a))? I’d expect it’d take 3 CPU cycles, due to phasing and the practicality of data to load, etc., but can it be encoded and set off in a single instruction?
Using the ganged operations in instructions, I’m suspecting that could be synthesized, but I’m not sure: that could really make code fly.
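To show what I mean about the shape of such a condition, here is a small sketch of my own. It assumes the condition reads as `a == b && c == d || d == a`, and it counts levels of dependent operations only; it says nothing about actual Mill phasing, cycle counts or encoding. The names `CONDITION`, `depth` and `evaluate` are made up for illustration. The three comparisons have no dependency on each other, so only the combining steps form a chain.

```python
# The condition as a nested tuple: (operator, left, right).
# Plain strings are input values looked up in an environment.
CONDITION = ("or",
             ("and", ("eq", "a", "b"), ("eq", "c", "d")),
             ("eq", "d", "a"))


def depth(expr):
    """Length of the longest chain of dependent operations in expr."""
    if isinstance(expr, str):  # an input value costs nothing
        return 0
    _op, left, right = expr
    return 1 + max(depth(left), depth(right))


def evaluate(expr, env):
    """Evaluate expr, computing both sides of every operator eagerly.

    Eager evaluation (no short-circuit) mimics computing every
    comparison speculatively and combining the results afterwards.
    """
    if isinstance(expr, str):
        return env[expr]
    op, left, right = expr
    lhs, rhs = evaluate(left, env), evaluate(right, env)
    if op == "eq":
        return lhs == rhs
    if op == "and":
        return lhs and rhs
    if op == "or":
        return lhs or rhs
    raise ValueError(f"unknown operator: {op}")


if __name__ == "__main__":
    print(depth(CONDITION))
    print(evaluate(CONDITION, {"a": 1, "b": 1, "c": 2, "d": 2}))
```

Under that reading the chain is compare, then AND, then OR, which is where my guess of three steps comes from; whether the hardware can overlap any of that is exactly the question.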
- in reply to: Introduction to the Mill CPU Programming Model #613
You’re human (darn it, right?).
I’m wondering what the practical differences would be with the Platinum Mill: the one with 64 belt positions.
However, I realize that the cycle time limit for the double crossbar for the belt would definitely lengthen the instruction cycle, OR would require more clock cycles to account for that difference, and the circuitry would likely be not merely twice as large (it wouldn’t be anyway, due to the double crossbar) but scale a little closer to squared. Meanwhile, in most cases, I’m not sure how much more parallelism you could extract, and what the cost would be in power for typical workloads.
That’s one major thing I see with Intel and the x86 (32- and 64-bit) architectures: while the process size goes down, it’s easy for them to add more cores in the same area, but most practical (non-server) workloads rarely use more than 2 threads. I believe this is a huge reason Apple has gone the direction they have, with 2 higher-performance cores in their ARM devices, with one core more likely used for all GUI stuff, and the other core for the background I/O and computations. Besides the parallelism of 2 threads being an easy reach for typical applications (general-purpose GUI apps with I/O and background computations), there’s also the memory bandwidth to consider: if a modern superscalar OOO core is waiting 1/3 of its instruction cycles for main memory, adding more cores just means that more cores are idling, waiting for memory to have anything to compute.
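A back-of-the-envelope sketch of that last point, with every number and name (`useful_throughput`, `idle_fraction`, the parameters) made up by me for illustration: it is a crude bandwidth-only model that ignores caches, latency and prefetching. Once the shared memory bandwidth is saturated, extra cores add no useful work and just raise the idle fraction.

```python
def useful_throughput(cores, per_core_peak, bytes_per_op, memory_bandwidth):
    """Operations per unit time under a simple bandwidth ceiling.

    per_core_peak    -- ops per unit time one core could do if never stalled
    bytes_per_op     -- main-memory traffic each op needs (assumed constant)
    memory_bandwidth -- bytes per unit time shared by all cores
    """
    compute_bound = cores * per_core_peak
    memory_bound = memory_bandwidth / bytes_per_op
    return min(compute_bound, memory_bound)


def idle_fraction(cores, per_core_peak, bytes_per_op, memory_bandwidth):
    """Share of total core capacity left waiting on memory."""
    achieved = useful_throughput(cores, per_core_peak, bytes_per_op,
                                 memory_bandwidth)
    return 1.0 - achieved / (cores * per_core_peak)


if __name__ == "__main__":
    # Hypothetical machine: bandwidth is enough to feed two cores.
    for n in (1, 2, 4, 8):
        print(n, useful_throughput(n, 1.0, 4.0, 8.0),
              idle_fraction(n, 1.0, 4.0, 8.0))
```

With these made-up figures, throughput stops growing at two cores, and at four cores half of the total capacity is sitting idle.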
Unless I’m mistaken about the nature of the beast, memory I/O also has additional power requirements for storing/retrieving data above regular refresh power, and, when actively used, RAM and its associated circuitry are among the major power consumers: thus, reducing memory I/O for a given amount of useful computation is also a power usage reduction.
- This reply was modified 10 years, 11 months ago by JonathanThompson.
Hi Will,
Unless I’ve misunderstood something meaningful because the videos weren’t too clear, how is it really any different from how a typical load/store CPU works? That is: when you do a branch in a load/store CPU, you know (as the compiler writer) what’s stored where for registers as of the time of the branch: it’s no different with the Mill Belt, and unless I’m mistaken, the Belt doesn’t move merely because of the branch.
That’s not to say that there aren’t instructions in-flight that need to be accounted for: but that’s not really any different from no branch, because the compiler knows exactly how much is in-flight as well, as it’s 100% deterministic, even if there are pipeline stalls from cache misses.