Mill Computing, Inc. Forums › The Mill › Tools › Compilers › The Compiler

  • staff
    Keymaster
    Post count: 45

    Talk by Ivan Godard – June 10, 2015, at the
    SFBay Association of C/C++ Users


    Slides: mill-cpu-compiler.04 (.pptx)

    This is the tenth topic publicly presented related to the Mill general-purpose CPU architecture. It covers only the tool chain used to generate executable binaries targeted for any member of the Mill CPU architecture family. The talk assumes a familiarity with aspects of CPU architecture in general and C++ programming in particular.

    LLVM meets the truly alien:

    The Mill CPU architecture in a multi-target tool chain

    The Mill is a new general-purpose CPU architecture family that forms a uniquely challenging target for compilation – and also a uniquely easy target. This talk describes the Mill tool chain from language front end to binary executable.

    The Mill tool chain is unusual in that it translates LLVM intermediate representation (IR) not into object code but into a different IR (genAsm), tailored for the Mill architecture family. Then a separate tool, the specializer, converts genAsm input into executable binary code (conAsm) for a particular Mill architecture family member. genAsm is a dataflow language, essentially a programmable representation of a single-assignment compiler IR.

    The Mill has no general registers. Instead, intermediate results are placed on the Belt, a fixed-length queue, and these operands are accessed by temporal addressing. A Mill operation in effect says “add the third most recent value to drop on the belt to the fifth most recent, and drop the result at the front of the belt, and discard the oldest value from the other end of the belt”. The Mill is also a (very) wide issue machine, and many of these actions are taking place concurrently in each cycle. The tool chain, or rather the specializer component, must track the location of operands as they move along the belt, because their belt address changes as other operations are executed and drop results. In addition, the Mill is statically scheduled with an exposed pipeline, so an operation may produce its results several cycles after the operation was issued, possibly with intervening control flow.

    This belt structure leads to unique needs for operation scheduling and operand spilling. These needs are the rough equivalent of instruction selection, register coloring, and spill on a conventional machine. The talk concludes by explaining the algorithms used.

  • vapats
    Participant
    Post count: 11

    Nice to see another talk, thanks! I particularly liked when you touched on the spiller/filler operation — which suggests a topic for the next talk: some details re Mill context-switching; interrupt handling, thread management, OS kernel calls, etc.

    Speaking of which, how do you envision handling real-world I/O? Task (and tax) a Gold Mill as a monoprocessor, or have (say) a few satellite Tins and DMA engines keeping the Gold fed and happy? Given the Mill’s modest transistor budget, it seems that a message-passing multicore approach is not unreasonable…

    • Ivan Godard
      Keymaster
      Post count: 460

      There are two different I/O questions: pin-twiddling I/O, and DMA. We expect all the twiddling to be memory-mapped, so (as with the rest of the Mill) there’s no need for privileged ops and all access control is via the existing turf-based protection mechanism.

      As for DMA, the question is coherence and real-time constraints. We expect that a DMA device will appear to be just another core to the program model. That said, how inter-process communication (other than shared memory) works is still NYF.

      You see, nothing up this sleeve… 🙂

  • Symmetry
    Participant
    Post count: 26

    Watching the Compiler talk, it occurred to me that it might be a good idea to include “microcoded” instructions in genAsm. Linus has a tendency to wax poetic about x86, and one of his favorite things about it is the availability of instructions like rep mov. It seems like having Mill genAsm instructions that mirror the semantics of C primitives like strcpy, memcpy, memset, etc. might be useful. Of course no Mill member would implement them, but they could be specialized differently on different architectures in ways that might have important performance implications.

    • Ivan Godard
      Keymaster
      Post count: 460

      Microcode gives me hives 🙂

      Microcode is a bad fit with any wide-issue machine. A micro-copy routine won’t saturate the width (it’s just a loop, and internal dependencies will keep you from using all the FUs). When the same loop is macrocode, other app code can be scheduled to overlap with at least some of the ops of the copy, giving better utilization.

      Most people (Linus included I expect) who prefer micro over a function call or in-line macrocode do so because they expect that a call is expensive and the in-line code will consume architectural resources such as registers, while the micro is cheap to get into and runs in the micro state so it doesn’t collide with app state.

      Guess what: Mill calls are cheap and give you a whole private state that doesn’t collide either.

      We’re not opposed to putting complex behavior in the hardware, for which the implementation will look a whole lot like a small programmable device: the Spiller is a glaring example, and the memory controller and the exit prediction machinery are too, and there are more NYF. But these are free-standing engines that are truly asynchronous to the main core execution and do not contend for resources with the application. But classical microcode? No, our microcode is macrocode.

      • Symmetry
        Participant
        Post count: 26

        I think I wasn’t clear. I don’t mean literal microcode. I meant “microcode” that would be expanded into a stream of regular instructions by the specializer. This would let the compiler emit something like memcpy, which would translate into an architecture-optimized routine when specialized for that particular architecture.

        • vapats
          Participant
          Post count: 11

          I get it, Sym; you mean having pseudo-ops, that are treated by the specializer the same way as (say) a flop would be replaced by emulation code on a Mill that has no FPU, right?

          • Symmetry
            Participant
            Post count: 26

            Exactly. But instead of nice atomic ops that happen to be missing, such as floating point, these would be operations that no family member would ever implement. Unless you change your mind about actual microcode, that is, but I wouldn’t advise that. 🙂

            EDIT: Of course, since this is purely software there’s nothing preventing you from creating a “MILL ISA 1.1” at some later point with these extra pseudo-ops.

            • This reply was modified 2 years, 5 months ago by  Symmetry.
          • vapats
            Participant
            Post count: 11

            LMAO, yeah right, saddle the poor beast with feature-creep, before it even finishes getting out of the gate, LOL… 🙂

            oops, that may have sounded snotty; I was intending to share the laugh with everyone!

            • This reply was modified 2 years, 5 months ago by  vapats.
          • Symmetry
            Participant
            Post count: 26

            One of the nice things about the Mill is that they’ve got a really nice story for being able to add and remove instructions from hardware in response to changing demand without breaking software. So it’s mostly the programmers who have to worry about feature creep but don’t worry, we’re used to it.

        • Ivan Godard
          Keymaster
          Post count: 460

          Yes, function-scale builtins could be emitted by the compiler and then get member-specific substitution in the specializer. The difficulty is in the compiler and host language: how would a program ask for one of these builtins? In all compilers (including those from third parties) and all languages?

          Generally we consider the host language to be off limits. If something is needed by a language’s user community then let the standards process add it to the language, and we will support it. Some such additions already exist: std::memcpy and others, and for those we can do what you suggest. There are other possible candidates in libc and POSIX. However, it’s not clear that there would be much gain over simply having the compiler emit a function call and letting the specializer inline that.

          Mind, it might be a good idea; the problem is that we can’t know until we measure. There is a lot of measurement and tuning in each member. It won’t get integer factors the way the basic architecture does, but a few percent here and there adds up too.

          • eddyb
            Participant
            Post count: 1

            Keep in mind that when dealing with LLVM IR, you already have to consider lowering its intrinsics [1] (though I believe you get some defaults, like “llvm.memcpy” calling “memcpy”, if your target doesn’t do smarter lowering).
            Some intrinsics (“llvm.memcpy”, “llvm.memmove”, “llvm.memset” and possibly more) are automatically detected from naive loops and are also generated by compilers, e.g. for shallow copies.

            [1] http://llvm.org/docs/LangRef.html#standard-c-library-intrinsics

            • This reply was modified 2 years, 5 months ago by  eddyb.
          • Ivan Godard
            Keymaster
            Post count: 460

            Exactly. The lowering will happen; the question is whether to do it in the compiler or the specializer, and that gets a case-by-case decision.

  • vapats
    Participant
    Post count: 11

    A problem, indeed. I work with embedded MCUs, and can say that the C compilers are rather hideous abattoirs of non-standard __intrinsics__, #pragmas, and asm() statements — not pretty.

    When I was a younger man, I made the mistake of trying to anticipate all use-cases and implement The Kitchen Sink right off the bat… Now I’m a bit wiser, and only implement functionality that is actually needed by real-world programs.

    At some point, Mill silicon (whether FPGA or not) will be self-hosting its own compiler, so I’d expect that strcmp() and its ilk will be obvious examples of use-cases that demand some attention…

  • bmartin
    Participant
    Post count: 1

    If you have to specialize code for each chip, does that mean that different models of Mill will not be binary compatible with each other? How will software developers release binaries that target Mill?

    • Ivan Godard
      Keymaster
      Post count: 460

      The Mill members are load-module (.exe) compatible, but not bitwise encoding compatible. The load module contains member-independent genAsm and a cache of previous specializations for particular members. If you run that program on a member for which there is no binary in the cache then the loader specializes it to the target, as well as doing the dynamic linking, relocation, and the rest of what loaders do, and also writes the new specialized code back to the load module cache so it doesn’t have to be specialized again.

      We expect that specialization will be a routine part of the install process for most software, but you can omit the install step and just get the specialization the first time you run it. Unless you are debugging the specializer you won’t need to look at the actual binary; this is essentially the same way that Java and other systems that distribute machine-independent code work.

  • krdln
    Participant
    Post count: 1

    Where does the specialization of vectorized loops occur (e.g., the strcpy presented in one of the talks)? It seems that genAsm should already have concrete instructions (the compiler has to know the iteration offset, right?), but how is that possible when each member can have a different vector width?

    • Ivan Godard
      Keymaster
      Post count: 460

      It is done in the specializer, based on extra info (not yet) produced by the compiler. The specializer wants to know the initiation interval and which variables will be vectorized and which are loop-scalar. It has proven complicated to get this info out of the compiler, and it doesn’t work yet.

  • huhtanen
    Participant
    Post count: 2

    In the (a-(b+c)) & ((b+c)*d) example ‘sub’ was issued on cycle 2 while ‘mul’ was issued on cycle 4. Would it be possible to issue both ‘sub’ and ‘mul’ on cycle 4, since both inputs to ‘sub’ should already be available?

    If so, is there an advantage to issuing instructions late as opposed to issuing them as soon as their arguments are available? In this particular case one could, at least naïvely, think that a larger part of the logic could have been clock-gated on cycles 2 and 3, possibly leading to power savings if ‘sub’ and ‘mul’ had been issued together.

    Furthermore, wouldn’t the ‘load’ example work automatically without need for special cases if all instructions (including loads) would be issued early (i.e., as soon as their inputs are available)?

    • Ivan Godard
      Keymaster
      Post count: 460

      Generally, placement close to the consumer is best because it reduces belt pressure; if you compute a value early, it may have fallen off the belt before it gets used, forcing spill/fill ops to retain it. The exception is loads, which should be issued as early as possible while still retiring as late as possible; maximizing load deferral gives the memory hierarchy the most time to come up with a value.

      Because the inputs are themselves computed as late as possible, usually you don’t have the inputs waiting around on the belt either. There are exceptions: when the inputs are call arguments, or when an input is used more than once, with an early consumer and a late consumer. However, all of these cases are examples of a general pattern: inputs must be produced early for outside reasons, and outputs must be consumed late, also for outside reasons, so there is a range between production of input and consumption of output in which the op could be placed. It turns out that where you place the op makes no difference in this pattern: if you place it early then its outputs may have to be spilled, but if you place it late then its inputs may have to be spilled. The impact on belt pressure is roughly the same, assuming that the number of inputs and outputs is roughly the same. Because we are placing as late as possible in general, we place this pattern late too, just for simplicity, even though it could have been placed earlier without affecting overall latency.

      Incidentally, operation placement is equivalent to bin packing, which is NP-hard, so it’s a game of heuristics; optimal placement is impractical.

      Power is a different issue that our hardware team will address.

    • Art
      Participant
      Post count: 8

      If so, is there an advantage to issuing instructions late as opposed to issuing them as soon as their arguments are available? In this particular case one could, at least naïvely, think that a larger part of the logic could have been clock-gated on cycles 2 and 3, possibly leading to power savings if ‘sub’ and ‘mul’ had been issued together.

      The TL;DR answer: there is no power difference as long as the number of belt value spills/fills remains the same.

      As far as power savings due to clock gating is concerned, there are 2 major factors:

      1. The power consumed in performing the operation itself (the add, sub, mul, etc.)
      2. The power consumed in maintaining operands on the belt

      The power consumed by the operation itself is independent of when the operation is performed. It is what it is, the same every time for a particular set of input values. When an operation is not being performed by a particular functional unit its clock is gated off. When the clock is gated off, the functional unit power consumption is only that due to static leakage current.

      The per-cycle power consumed in maintaining operands on the belt is nearly constant; it depends mostly on the number of new result values arriving each clock times the number of potential destinations for each result.

      The conclusion is that the biggest factor in reducing power is the number of belt spills/fills that must be performed. The lower the number of spills/fills, the lower the power.

  • huhtanen
    Participant
    Post count: 2

    EDIT: woops, double posted. Any way to remove posts?

    • This reply was modified 1 year, 11 months ago by  huhtanen. Reason: Double post of the same message
