Forum Replies Created

Viewing 15 posts - 1 through 15 (of 22 total)
  • Symmetry
    Participant
    Post count: 28
    in reply to: The Compiler #1890

    Watching the Compiler talk, it occurred to me that it might be a good idea to include “microcoded” instructions in genAsm. Linus has a tendency to wax poetic about x86, and one of his favorite things about it is the availability of instructions like rep movs. It seems like having Mill genAsm instructions that mirror the semantics of C primitives like strcpy, memcpy, memset, etc. might be useful. Of course no Mill member would implement them directly, but they could be specialized differently on different architectures in ways that might have important performance implications.
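
    To make that concrete, here is a rough sketch in plain C; genAsm syntax isn’t public, so mill_memcpy is a made-up name standing in for the idea. The compiler would emit the generic primitive, and each member’s specializer would substitute its own expansion; the byte loop below is just the portable fallback a simple member might get.

        #include <stddef.h>

        /* Hypothetical generic primitive the compiler would emit.
           A wide member's specializer might expand it into vectorized
           loads and stores instead of this byte-at-a-time fallback. */
        void *mill_memcpy(void *dst, const void *src, size_t n)
        {
            unsigned char *d = dst;
            const unsigned char *s = src;
            while (n--)
                *d++ = *s++;
            return dst;
        }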

  • Symmetry
    Participant
    Post count: 28
    in reply to: Pipelining #1411

    Have you looked at Rust? It’s something Mozilla has been working on in order to build a new multi-threaded rendering engine, and it has the same sort of numeric typeclasses that Haskell has, though they call them traits.

  • Symmetry
    Participant
    Post count: 28

    Come to think of it, I don’t remember if they mentioned anything about the different instruction formats in the Hot Chips slides I was able to see. They do talk about it in the whitepaper they put out, however.

  • Symmetry
    Participant
    Post count: 28

    NVidia does have all of Transmeta’s IP, and from the outside this looks like an Efficeon with some capacity to execute ARM instructions in hardware, letting them optimize only the hot spots, plus a few other changes. I don’t see why translating ARM instructions into Mill instructions would be any more difficult than what Transmeta originally did with VLIW, but I suppose that hardware fallback might be the reason NVidia seems to be doing better than Transmeta did.

    Good luck licensing your load semantics to NVidia then? :)

  • Symmetry
    Participant
    Post count: 28

    Well, it’s also converting the code from ARM64, with 32 registers, to a 7-wide VLIW format with 64 registers, which could presumably let them do more in terms of optimization than if they had just done ARM64-to-ARM64 dynamic optimization.

  • Symmetry
    Participant
    Post count: 28
    in reply to: Prediction #820

    I seem to recall you mentioned in the talk that new predictions are loaded on a function call, and that seemed rather frequent to me. Now that the security talk is out: are they actually loaded on portal calls?

  • Symmetry
    Participant
    Post count: 28

    No, I’m talking about Spectre. The example given in the paper was

        if (x < array1_size)
            y = array2[array1[x] * 256];

    If array1[x] evaluates to NaR, then you won’t get different parts of array2 in cache depending on what values happened to be in memory where you ran off the end of array1. But of course there are lots of Spectre-like attacks that might be possible, and who knows whether the Mill is resistant to all of them.
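
    For contrast, the usual software fix on conventional hardware is to clamp the index with a data dependence rather than rely on the branch, similar in spirit to Linux’s array_index_nospec. A minimal sketch, where read_clamped is a made-up name; in practice compilers can fold the redundant compare away, so real implementations add barriers:

        #include <stddef.h>
        #include <stdint.h>

        uint8_t y;

        void read_clamped(size_t x, const uint8_t *array1,
                          size_t array1_size, const uint8_t *array2)
        {
            if (x < array1_size) {
                /* All-ones mask when x is in bounds, zero otherwise.
                   The mask is computed as data, not control flow, so a
                   mispredicted branch still reads array1[0] rather than
                   an attacker-chosen out-of-bounds byte. */
                size_t mask = (size_t)0 - (size_t)(x < array1_size);
                y = array2[array1[x & mask] * 256];
            }
        }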

  • Symmetry
    Participant
    Post count: 28

    I think there would be NaRs involved on the Mill, at a guess. When a conventional processor speculates a load to an invalid address, it has to suppress the fault until it knows whether the speculation was correct; otherwise it has triggered a fault that it should not have, through its own failure rather than the code’s. The Mill doesn’t have to suppress returning a NaR in this case, because the NaR is invisible if the execution is later rolled back as false speculation. And since the speculating thread never has its hands on data from outside what it should be able to access, there doesn’t seem to be any danger of data exfiltration. Well, I can imagine attacks along these lines that would let someone figure out which hidden addresses are in caches or main memory, but not access the contents of the data at those addresses.

    This is speculation, though, so maybe I should just wait for the authorities to answer. 🙂

  • Symmetry
    Participant
    Post count: 28
    in reply to: The Compiler #1900

    One of the nice things about the Mill is that they’ve got a really good story for adding and removing instructions from hardware in response to changing demand, without breaking software. So it’s mostly the programmers who have to worry about feature creep, but don’t worry, we’re used to it.

  • Symmetry
    Participant
    Post count: 28
    in reply to: The Compiler #1895

    Exactly. But instead of nice atomic ops that happen to be missing, such as floating point, these would be operations that no family member would ever implement. Unless you change your mind about actual microcode, but I wouldn’t advise that. 🙂

    EDIT: Of course, since this is purely software, there’s nothing preventing you from creating a “MILL ISA 1.1” at some later point with these extra pseudo-ops.

  • Symmetry
    Participant
    Post count: 28
    in reply to: The Compiler #1893

    I think I wasn’t clear: I don’t mean literal microcode. I meant “microcode” that the Specializer would expand into a stream of regular instructions. This would let the compiler emit something like memcpy, which would translate into an architecture-optimized routine when specialized for that specific architecture.

  • Symmetry
    Participant
    Post count: 28

    A None plus a NaR makes a None. Think of it as combining a “discard this” with a “this causes problems if it isn’t discarded”; the discard wins.

    If you’re using a None to calculate a load address, the load is skipped and you get another None on the belt. If you’re storing a None, that’s a no-op.

    But come to think of it, I’m not sure what happens when you use a NaR to do a load. Does it cause a fault immediately, or does it just drop a NaR on the belt?
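
    Just to pin down the combining rules above, here is a toy C model; it’s only an illustration, since a real Mill carries None and NaR as metadata bits on belt operands rather than as values like these.

        #include <stdio.h>

        typedef enum { VAL, NAR, NONE } Kind;
        typedef struct { Kind kind; int v; } Operand;

        /* Speculable arithmetic: None ("discard this") dominates
           NaR ("fault if this isn't discarded"). */
        Operand add(Operand a, Operand b)
        {
            if (a.kind == NONE || b.kind == NONE)
                return (Operand){ NONE, 0 };
            if (a.kind == NAR || b.kind == NAR)
                return (Operand){ NAR, 0 };
            return (Operand){ VAL, a.v + b.v };
        }

        int main(void)
        {
            Operand none = { NONE, 0 }, nar = { NAR, 0 };
            printf("%s\n", add(none, nar).kind == NONE ? "None" : "NaR");
            return 0;
        }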

  • Symmetry
    Participant
    Post count: 28
    in reply to: Pipelining #1413

    Go has an interesting way of handling this, where constants and literals don’t take on a concrete type until they’re assigned to a variable.

    Link

  • Symmetry
    Participant
    Post count: 28

    The Pentium 4 did, but that didn’t end up working very well for them. The problem is that if you have a variable-length encoding, then the mapping between the address you’re jumping to and the location of the instruction in the cache becomes complicated. They ended up using a trace cache rather than a traditional cache to solve this.

    http://en.wikipedia.org/wiki/Trace_Cache#Trace_cache

    Modern Intel processors actually have something like this too, but only as an optional L0 cache for loops.

    In contrast, if you have a fixed-length encoding, then decoding is usually simple enough that the cache latency or hit rate you would lose by expanding instructions from their 32-bit pre-decode form to a 60-or-whatever-bit post-decode form isn’t worth it at that stage.

  • Symmetry
    Participant
    Post count: 28
    in reply to: Execution #732

    > A central tenet of the Mill is that people building products can prototype custom processor specifications in the simulator very quickly and choose a configuration with the best power performance tradeoff informed by benchmarking representative code.

    Yes, I’m really looking forward to seeing what tech stack you use to do that. I guess I’ll have to wait for the specialization talk to come out to find out how.
