Forum Replies Created

Viewing 15 posts - 1 through 15 (of 42 total)
  • goldbug
    Participant
    Post count: 53

    Could that be generalized to any arbitrary branch?

    For example, consider this code:

    
    int foo(int a) {
       // just whatever random code
       int b = a * a;
       int c = b * b;
       int d = a + b + c;
    
       // a branch that could be precalculated at the beginning of the function
       if (a) {
          return bar(d);
       } else {
          return baz(d);
       }
    }
    

At the beginning of the function, we have enough information to know whether that branch will be taken. A compiler could potentially introduce an instruction at the beginning to precalculate whether the branch will be taken and notify the predictor, or there could be a “delayed branch” instruction, something like “branch in 4 cycles if a is non-zero”.

    This kind of code shows up all the time in loops and in precondition checks in functions.
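For what it's worth, the closest source-level analog today is a static prediction hint. Here is a sketch, assuming GCC/Clang, whose `__builtin_expect` is a compile-time hint rather than the runtime "notify the predictor early" instruction described above:

```c
#include <assert.h>

int bar(int d) { return d + 1; }
int baz(int d) { return d - 1; }

/* Hypothetical sketch: the branch predicate depends only on `a`, so it
 * can be evaluated up front.  On today's compilers the closest analog
 * is a static hint such as __builtin_expect; a real "tell the predictor
 * now, branch later" instruction does not exist on mainstream ISAs. */
int foo(int a) {
    int taken = (a != 0);   /* predicate available before the body runs */
    int b = a * a;
    int c = b * b;
    int d = a + b + c;
    if (__builtin_expect(taken, 1)) {
        return bar(d);
    } else {
        return baz(d);
    }
}
```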

    • This reply was modified 12 months ago by  goldbug.
  • goldbug
    Participant
    Post count: 53

    So I have been taking a class on computer architecture (I am a software guy). The more I learn the more in awe I am with the beauty of the Mill instruction encoding and other features.

    CISC sucks. It needs millions of transistors just to decode 6 instructions.
    RISC is a clear improvement, but the superscalar OoO design is ridiculously complicated. As I learn about the Tomasulo algorithm, wide issue/decode, and speculative execution, I can’t help but think “this is insane, there has to be a better way”. It feels like the wrong path.
    VLIW seems like a more reasonable approach, but I know binary compatibility problems and stalls have been a challenge for VLIW architectures.

    The Mill is just beautiful: it has the sane encoding and simplicity of a VLIW, but phasing and the dual instruction streams really take it to the next level.
    The separate load issue and retire is, in hindsight, the obvious way to solve the stalls due to memory latency that are so common in VLIW.
    The branch predictor is so cool too: you can predict several EBBs in advance, even before you start executing them. Mainstream predictors have to wait until they reach the branch instruction.
    The specializer is a neat solution to binary compatibility.

    I really hope to see this CPU make it to silicon.

    • This reply was modified 3 years, 6 months ago by  goldbug.
  • goldbug
    Participant
    Post count: 53

    Technical details are very sparse, but from their presentations, they say they are not VLIW.

    They sometimes compare their stuff with Itanium (VLIW), but they claim they don’t stall as much as Itanium. Supposedly it has dynamic issue, but it is not out of order. From what I gather, their compiler generates instruction bundles that encode dependencies between instructions. That sounds an awful lot like an EDGE architecture such as TRIPS or Microsoft’s E2.

    • This reply was modified 3 years, 7 months ago by  goldbug.
  • goldbug
    Participant
    Post count: 53

    They recently discovered SplitSpectre, which is a Spectre variant with a much simpler gadget.

    With regular Spectre, this was the gadget needed in the victim’s space:

    
    if (x < array1_size)
      y = array2[array1[x] * 4096];
    

    Which is not that common.

    With SplitSpectre, this is the gadget needed:

    
    if (x < array1_size)
      y = array1[x];
    

    Which happens practically everywhere.

    Access to array2 can be in the villain’s space if y is returned.

    From your talk, I reckon the Mill is still not affected.
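For context, a common software mitigation for this gadget family is branchless index masking, the idea behind the Linux kernel's array_index_nospec. A minimal sketch; the masking arithmetic below is illustrative, not any kernel's exact code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Branchless clamp: returns idx if idx < size, else 0, without a
 * bounds-check branch the predictor could speculate past.  Illustrative
 * version of the index-masking idea; assumes size stays well below
 * SIZE_MAX/2 and that right-shifting a negative value is arithmetic
 * (true on all common compilers). */
static size_t mask_index(size_t idx, size_t size) {
    /* (idx - size) wraps to a huge value iff idx < size; the sign-bit
     * shift turns that into an all-ones or all-zeros mask. */
    size_t mask = (size_t)((intptr_t)(idx - size) >> (sizeof(size_t) * 8 - 1));
    return idx & mask;
}

uint8_t array1[16] = {1, 2, 3, 4};
size_t array1_size = 4;

uint8_t safe_read(size_t x) {
    if (x < array1_size)
        return array1[mask_index(x, array1_size)]; /* clamped even under misspeculation */
    return 0;
}
```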

    • This reply was modified 5 years, 7 months ago by  goldbug.
  • goldbug
    Participant
    Post count: 53

    No, I am long out of school.
    I just take courses online (Udemy) of things that I find interesting.

  • goldbug
    Participant
    Post count: 53

    Fair enough. Your point is that the security benefits the Mill provides can be achieved in software, and WASM does so, albeit with a ~30% performance hit.

    The Mill is supposed to provide a 10x perf/watt improvement over OoO superscalar CPUs, according to Ivan’s guesstimate in his videos. We are all waiting for the simulator and compiler to become available to see some real numbers.

    If you can get 10x perf/watt and all you have to do is recompile your C code, I think that would make the Mill very attractive.

    Another interesting aspect is that microkernels are slow on modern CPUs. A simple call to a driver takes 70-300 cycles, which makes microkernels nonstarters. The Mill does have innovation here with its portal calls, which allow one process to call another at the cost of a simple function call. If successful, the Mill could make microkernels competitive, which would improve security significantly.

  • goldbug
    Participant
    Post count: 53

    I think I replied to the wrong person; I agree with everything you are saying.

    I meant to reply to the OP about WASM being competition for the Mill.

    • This reply was modified 3 years, 8 months ago by  goldbug.
  • goldbug
    Participant
    Post count: 53

    WebAssembly is not a hardware ISA. It is similar to Java bytecode or .NET MSIL. The instructions in WebAssembly are meant for a virtual machine.

    There is no hardware that can run WebAssembly directly. Instead, there are programs that take WebAssembly code and generate the equivalent x86 or ARM machine code; these are called Just-In-Time (JIT) compilers, and they are core parts of virtual machines. There could be another JIT compiler for Mill machine code.

    So basically WASM does not compete with the Mill any more than it competes with x86 or ARM CPUs.

    If anything, WASM can help the Mill. Code distributed in WASM format can potentially run on any platform, including the Mill. This can lower the barrier to entry for Mill adoption.

    Of course, someone will have to sit down and write a WebAssembly to Mill JIT compiler.
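To make the “virtual machine” point concrete, here is a toy stack-machine interpreter in C. This is not the WASM instruction set, just the same execution model: portable bytecode that each host interprets, or JIT-compiles to its own ISA:

```c
#include <assert.h>
#include <stdint.h>

/* Toy stack-machine opcodes, illustrating the execution model WASM
 * uses.  These are made-up opcodes for illustration, NOT real WASM:
 * the point is that the bytecode is ISA-neutral, and a per-platform
 * interpreter or JIT gives it meaning. */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

int32_t run(const int32_t *code) {
    int32_t stack[64];
    int sp = 0;                          /* operand stack pointer */
    for (int pc = 0;;) {
        switch (code[pc++]) {
        case OP_PUSH: stack[sp++] = code[pc++]; break;
        case OP_ADD:  sp--; stack[sp - 1] += stack[sp]; break;
        case OP_MUL:  sp--; stack[sp - 1] *= stack[sp]; break;
        case OP_HALT: return stack[sp - 1];
        }
    }
}
```

A JIT compiler does the same job, but emits native machine code for the host (x86, ARM, or in principle Mill) instead of dispatching opcode by opcode.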

  • goldbug
    Participant
    Post count: 53
    in reply to: Benchmarks #3505

    What about code size Ivan?

    Your instruction format is so alien that it would be interesting to see whether it takes more or less space for comparable code.

    I realize inlining and loop unrolling could have a big impact on code size.

    • This reply was modified 4 years, 11 months ago by  goldbug.
  • goldbug
    Participant
    Post count: 53
    in reply to: Benchmarks #3500

    These are pretty awesome and encouraging numbers, Ivan.

    “I suspect inlining and pipelining would make little difference to the counts when enabled because they improve cycle time and overall program latency”

    Wouldn’t inlining help a lot with the instruction count? You eliminate the call operation, and if the inlined function is small, you might even be able to squeeze its operations into existing instructions, making the inlined function essentially free.
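A source-level sketch of the point: after inlining, the call ops disappear and the callee's operations become ordinary ops the scheduler can pack alongside the caller's (the function names here are made up for illustration):

```c
#include <assert.h>

static int square(int x) { return x * x; }

/* Out-of-line version: two call ops plus the callee's ops. */
int sum_sq_call(int a, int b) {
    return square(a) + square(b);
}

/* What the compiler effectively produces after inlining: the call ops
 * are gone, and on a wide machine the two independent multiplies could
 * issue in the same instruction as other work. */
int sum_sq_inlined(int a, int b) {
    return (a * a) + (b * b);
}
```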

    • This reply was modified 4 years, 12 months ago by  goldbug.
  • goldbug
    Participant
    Post count: 53
    in reply to: news? #3491

    I could not find anything on Google. What is A20? Is it a profiler? An emulator?

  • goldbug
    Participant
    Post count: 53
    in reply to: news? #3489

    Maybe open-sourcing the kernel? There are a lot of file systems out there; I am sure someone could port one, maybe from Genode. Even if it is not perfect, it might serve as an MVP.

  • goldbug
    Participant
    Post count: 53

    I asked the same thing a while ago.

    They published a paper with the answer
    https://millcomputing.com/blog/wp-content/uploads/2018/01/Spectre.03.pdf

    see the section called “Software and compiler speculation”

    Their answer seems to be a loadtr operation that will only perform the load if the predicate is met, avoiding branching.

    When I asked, they said they were still trying to decide if there was a better way.
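A rough C model of how a predicated load sidesteps the speculation problem. The name `load_if` and the behavior on a false predicate are my assumptions for illustration, not the semantics of the actual loadtr operation:

```c
#include <assert.h>
#include <stddef.h>

/* Software sketch of a predicated load: the load only happens when the
 * predicate is true, so there is no speculative memory access on the
 * false path.  Hardware would squash the load rather than branch; the
 * fallback-on-false value here is an assumption, not loadtr's spec. */
static int load_if(int pred, const int *addr, int fallback) {
    return pred ? *addr : fallback;
}

int array1[4] = {10, 20, 30, 40};

int guarded_read(size_t x, size_t size) {
    /* The bounds check becomes a predicate feeding the load, instead of
     * a branch the predictor can speculate past.  The index is clamped
     * so the address computation stays in bounds either way. */
    return load_if(x < size, &array1[x < size ? x : 0], -1);
}
```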

    • This reply was modified 5 years, 7 months ago by  goldbug.
  • goldbug
    Participant
    Post count: 53
    in reply to: MILL and OSS #3258

    If the Mill is half as good as it looks on the presentations, and if/when they manage to get it out the door, I can’t imagine it failing.

    It would certainly be an interesting case study in the classroom. It shows that there is still plenty of room for innovation in CPU architecture.

  • goldbug
    Participant
    Post count: 53

    Thank you for taking the time to write that paper. It is very enlightening.

    I saw that you did have a Spectre-like bug and you fixed it by using loadtr prefixed by the guard.

    Is that a new operation? I don’t recall seeing loadtr in the wiki before.

    By the way, awesome job, such a simple and elegant solution.
