Mill Computing, Inc. Forums The Mill Architecture Specification

Tagged: 

  • Author
    Posts
  • staff
    Keymaster
    Post count: 46
    #1068 |

    Talk by Ivan Godard – 2014-05-14 at the
    SFBay Association of C/C++ Users


    Slides: specification.04 (.pptx)

    Processor configuration on the Mill CPU:

    Specifying cores for a range of power and performance points

    The Mill CPU architecture defines a generic Mill processor, from which a family of specific processors can be configured. A particular configuration for a Mill CPU family member is defined by a specification, which is processed by Mill configuration software to build a member-specific assembler, simulator, compiler back-ends, Verilog for the hardware implementation, documentation, and other tools and components. A Mill CPU family member specification is in two parts: one defines the instruction set, and the other defines the components that comprise the functional organization and microarchitecture of the configured family member. The talk describes the specifications and the software components that perform Mill configuration.

    In addition, the talk includes a live demo of configuration. The audience specified a new instruction and a new Mill family member. The speaker built a new configuration from the specification, wrote a short program in Mill assembly language that exercises the new instruction, and executed and debugged the program in a software functional simulation of the newly configured family member.

  • jimrandomh
    Participant
    Post count: 4

    You show a block diagram for a Crimson chip, which has a Copper core and a Silver core. Based on what we’ve heard about the specializer working at install time, I’d naively expect chips to always have to have all their cores be identical, or close to identical, in order to support process migration between cores. How do you handle mixed cores within a single chip? How different can they be from each other, while still supporting process migration?

    • Ivan Godard
      Keymaster
      Post count: 473

      Mixed-core-architecture chips are common in embedded and other special applications – think the Sony Cell. Usually the cores have dedicated functions – for example, one drives your friendly local nuke plant, and one drive the UI, so process migration is not an issue. The cores need not even be from the same vendor nor be of related architectures. Crimson is a testbed for such use in our specification system.

      Process migration across dissimilar cores has been the subject of some academic work. It appears to be practical in a JIT environment, although utility, except for reliability reasons, is questionable.

  • jimrandomh
    Participant
    Post count: 4

    C++ is an interesting choice for an assembly macro language. Will C++ textual source code be a typical target for compilers, or will they skip that stage somehow? (I ask mainly because C++ compilers aren’t exactly known for their speed, and I’d hate to JIT with one.)

    • Ivan Godard
      Keymaster
      Post count: 473

      Our assembler is primarily intended for hardware verification; it is not intended for application use. Manual asm on a Mill is extremely inhumane due to the difficulty of keeping track of belt contents manually; it’s not too hard to read but impractical to write.

      Compiler output, which is also specializer input, is a non-textual graph structure. The specializer turns that into target binary, and you can inspect it or get listings in asm text, and even assemble it if you want, but the normal build cycle does not use an asm stage.

      • Alan Dartnell
        Participant
        Post count: 2

        I use the same “trick” of having a high level language as the macro language for an assembler. In my case I use Ada rather than C++. One of the useful features is a register allocator so I can simply ask for a currently unused register which gets allocated from a free register pool. Registers are then identified not by simple Ids but by Ada variables having meaningful names (i.e. names that are meaningful to the program being written). Registers allocated by this mechanism are then released to the free pool when they are no longer needed.

        You wouldn’t want quite that with the belt, but I can envisage a similar mechanism where belt references are via C++ objects which automatically track their belt number.

        • Ivan Godard
          Keymaster
          Post count: 473

          You are describing what the sim does πŸ™‚

          Of course, the sim is dynamic, while an asm has to be static. And unlike the specializer and sim, asm doesn’t have width information, and so it also does not have latency info, which is needed to track the belt. And then there’s the question of what to do at control-flow joins, especially joins that include the back edges of loops. The problem is roughly equivalent to type inference, i.e. mostly solvable in most code but not in general, and far more work than you’d want to do in an assembler. Good for a PhD if you wanted one though πŸ™‚

        • Russell Johnston
          Participant
          Post count: 9

          If I understand correctly, a slightly higher-level, unscheduled version of Mill assembly is basically SSA, so if you don’t care about the details of op scheduling you could use e.g. LLVM IR.

  • Ivan Godard
    Keymaster
    Post count: 473

    The specializer format is similar to a serialized IR, but unextended LLVM IR is not able to carry all the Mill semantics, and unfortunately has also discarded some information that the LLVM compiler had and the specializer needs. We are in the process of porting LLVM, but there’s a lot to do and limited resources.

  • bhurt
    Participant
    Post count: 5

    I think this is the wrong venue in which to bring this up, but it’s the most correct venue I can think of. In other venues, you’ve asked what instructions could you add to low-end Mills that would make writing a software floating point library easy(-ier). I figured I’d chime in with my thoughts on the matter, for what they’re worth.

    The biggest problem, and the most error prone aspect, of writing a software floating point library, is unpacking and repacking the floating point numbers themselves, and dealing with the edge cases (denormalized numbers, infinities, NaNs). Unpacking a floating point number into it’s components is a longish series of shifts, ands, and adds. The idea is to just have two instructions. One instruction takes a floating point number and unpacks it, producing the mantissa, exponent, and type (is it a NaN, Infinity, etc.?). As a side note, the belt is really nice for designing instructions. πŸ™‚ The other instruction takes a mantissa, exponent, and type and reconstitutes the original floating point number.

    Note that these instructions are useful in other circumstances, not just in software fp libraries, but in functions lke frexp and ldexp and isNaN. These would actually be useful instructions to have even on a Mill Platinum architecture, aimed at super computers and given boatloads of floating point units. Their utility in writing a software FP library should be obvious.

    Two variants might warrant consideration. One is that instead of a single type value, drop multiple flags on the belt- isNaN, isInfinity, etc. The advantage of this is that in a single instruction you could unpack a floating point number and multiway branch on the edge cases to special handling code (“Oh, it’s a NaN- in that case, jump here.”). Another would be instead of the mantissa as a single value, split it up into two values. If you don’t have a 64 bit x 64 bit -> 128 bit multiply (likely on a low-end Mill), this makes it easier to do the obvious four 32×32->64 bit multiplies. The problem with both of these variants is that they drop a lot more values on the belt- if you only have 8 belt positions, having a single operation drop 6 or more values on the belt can be a problem.

    Hope this helps.

    • Ivan Godard
      Keymaster
      Post count: 473

      In this and all other topics, please do not speculate or suggest ideas in the public forum. We have put and are putting a great deal of work into the Mill, and would like to see some tangible reward for our work. We may already have worked out something in great detail and are just about to file the patents on it, only to have an innocent well-intentioned suggestion give enough of the idea that a patent is no longer available, or could be attacked even if granted.

      We do ask for help, but that doesn’t mean that we are looking for postings of ideas, it means that we are looking for real human beings, with expertise – or just thought-out ideas – in the subject, who will join our team, sign the necessary NDA and IP agreements to protect our and their work, and put significant time into the effort on a sweat-equity basis. Contact me directly (ivan@millcomputing.com) if that’s you.

      So please, keep the discussion here only on material that has been part of our public disclosures, or material that has already been published by third parties. The latter is especially useful – a cite to an obscure decades-old paper may save us the cost of a pointless patent filing. Or ask questions about how XYZ from a talk works (but don’t suggest alternatives). Or talk about marketing or business matters – Mill Computing has a real technology, not just a business plan and an MBA as so many startups do, so we have nothing we need to protect about the business.

      But please, nothing about anything that might be a patent someday; speculations and ideas can hurt us and don’t help you.

      Ivan

  • David
    Participant
    Post count: 32

    So just for extreme clarity: The simulator examples given here, and the assembly language shown, is only ever for member-specific code, and never for abstract family code? JITs would generate a LLVM-ish intermediate serialized representation?

    (Related to JITs, but slightly off-topic, but I don’t recall which talk covered it: With the way EBBs and function semantics work, is it possible for a subroutine to have multiple entry points? Just considering optimization possibilities for always-vararg languages.)

    When bin-filling the binary instruction encoding, does your software take into account expected popularity of instructions to prefer common instructions for tighter encoding and uncommon instructions as an otherwise needlessly larger encoding to make room for the common ones?

    Do you anticipate any temporal quantization or aliasing problems with tracking time as (presumably) integer picoseconds in simulation for systems with multiple clock rates? It seems like there could be edge cases where sim would consider activities simultaneous which would be ordered in hardware, depending on the usefulness of that distinction at the sub-picosecond scale.

    Also, as a Lisp programmer and heavy DSL creator, I must say that when you were showing C++ metaprogramming and especially the enum issue, I could only think of Greenspun’s Tenth Rule. πŸ˜‰

    • Ivan Godard
      Keymaster
      Post count: 473

      So just for extreme clarity: The simulator examples given here, and the assembly language shown, is only ever for member-specific code, and never for abstract family code? JITs would generate a LLVM-ish intermediate serialized representation?

      The notation used in assembler is common across all members. What you can say in that notation is member specific. Thus addu(b3, b6), addu(b4, 3); is syntactically valid on all members, but only will produce binary on members with two or more adders. JITS (we expect) will produce specializer input notation, or rather a binary form of that, and the same library that the specializer itself uses will turn the JIT code into target binary.

      • David
        Participant
        Post count: 32

        Will the abstract form expose an infinite-length SSA belt with serial instruction execution in its virtual semantics?

        • Ivan Godard
          Keymaster
          Post count: 473

          In effect yes, though not really. The abstract form has no belt; it is a dataflow graph overlain with a control-flow graph. That is, it’s not a P-code or Java code or any of the other serializations; there’s no reason for the specializer to have to figure out the data dependencies that the compiler already knew but threw away in the intermediate representation.

          Of course, it is trivial to go from the graphs to a serialized instruction stream with infinite belt, and we might even offer a command option to produce that, for those who find such a representation to be an easier way to understand what is going on. But it would only be documentation.

          • David
            Participant
            Post count: 32

            We’re going to be rewriting the AOT and JIT compilers for the next version of our declarative programming server, so user code feeding the specializer is of particular interest to me, even if just for forward compatibility thinking.

            At the location of a compiler backend, yes it is relatively equivalent to convert between graphs and LLVM-style virtual instructions. For compiler writers, it sounds like graph generation will be the expectation on the Mill? That could likely be a bit easier than instructions, thinking about it…

          • Ivan Godard
            Keymaster
            Post count: 473

            I hope so πŸ™‚

            The LLVM middle-to-back representation is more a linearization than a serialization. We plan to back up into the middle end just a little and serialize the internal (graph) representation after all the regular passes and a few Mill-specific passes are done. The compiler-to-specializer representation knows it’s a Mill but doesn’t know which one.

    • Ivan Godard
      Keymaster
      Post count: 473

      (Related to JITs, but slightly off-topic, but I don’t recall which talk covered it: With the way EBBs and function semantics work, is it possible for a subroutine to have multiple entry points? Just considering optimization possibilities for always-vararg languages.)

      It is possible for a function to have multiple entry points, but each entry point must be the entry point of an EBB. All transfers must be to the head (i.e. middle) of an EBB.

    • Ivan Godard
      Keymaster
      Post count: 473

      When bin-filling the binary instruction encoding, does your software take into account expected popularity of instructions to prefer common instructions for tighter encoding and uncommon instructions as an otherwise needlessly larger encoding to make room for the common ones?

      Not really. What you describe is a simplified Gray coding, and we don’t use frequency in determining the encoding of the ops proper. We just determine the natural minimal encoding of each op, and pack for minimal total length. If there’s one op with so many arguments or information to represent that it takes several bits more than any other op then the algorithm will likely use half of the overall entropy just on that one op, and it will encode as the natural length plus one bit to distinguish between that one and all the rest. And so on, for minimal total length.

      That said, there are cases in the encoding in which we are in effect encoding by frequency. For example, excess entropy in the exu block (the computational 2-in-on-out ops like ADD) is soaked up by immediate forms, such as add immediate and equal immediate. The range of immediate values supported in any given slot is determined by how much entropy is left over; we add immediate values until we almost but don’t actually go over a bit boundary. The added supported values are small integers, in order, so one slot may support an add-immediate with values 1-17 and the next slot with values 1-243, depending on the rest of the encoding.

      This algorithm is in a sense a Gray code, because immediate value usage in actual programs drops in frequency with increasing value – add(3) is more common than add(27). So the algorithm is choosing more frequent operations to encode specially, even though the encoded bit length in any given slot is fixed.

      There are other places where the encoding is frequency-sensitive, but in all of them the situation is like with the immediates: we don’t pack single operations by frequency, but we do pack multi-operation sequences by frequency. If there is not an immediate form with the right immediate value, then an expression such as X+243 must be scheduled and encoded as two ops, an con to get the constant on the belt, and then the add. Doing it in one op has better entropy, and also improves schedule latency and slot and belt pressure, so there really is frequency consideration in the algorithm, just not in the encoding of single ops.

    • Ivan Godard
      Keymaster
      Post count: 473

      Do you anticipate any temporal quantization or aliasing problems with tracking time as (presumably) integer picoseconds in simulation for systems with multiple clock rates? It seems like there could be edge cases where sim would consider activities simultaneous which would be ordered in hardware, depending on the usefulness of that distinction at the sub-picosecond scale.

      The sim models multiple clock domains, multiple clock sources (the xtal component), and multiple clock converters (the PLL component). It allows for the cost of the handshaking required to cross domain boundaries.

      The sim does not model skewed clock strategies such as appeared in the P4 and some other barn-burner chips. It could, but the hardware guys are not expecting to use such techniques any time soon, and so we wouldn’t know what to specify.

    • Ivan Godard
      Keymaster
      Post count: 473

      Also, as a Lisp programmer and heavy DSL creator, I must say that when you were showing C++ metaprogramming and especially the enum issue, I could only think of Greenspun’s Tenth Rule. πŸ˜‰

      Had to go look that one up.

      The counter-argument embeds Greenspun’s Seventh Rule.

    • Ivan Godard
      Keymaster
      Post count: 473

      A final reply: as you can see, it’s much easier to reply if you split your posting on the Forum into multiple postings with one topic each πŸ™‚

      • David
        Participant
        Post count: 32

        Thank you as always for your thorough replies. I’ll split them up next time. I took notes while watching, then compiled my questions not answered by the end.

        (Greenspun only had 1 “rule”, about complex programs containing an ad-hoc subportion of Common Lisp (later amended to point this finger at Common Lisp itself πŸ˜‰ ), called it his “tenth rule”, but had no other rules. Which one were you referring to with the seventh?)

        • Ivan Godard
          Keymaster
          Post count: 473

          Exactly πŸ™‚

  • jobol
    Participant
    Post count: 5

    Is there a plan for SIGNED divisions? The ainsi C specs for division of SIGNED are good but it exist an other kind of SIGNED division. When computing A/B and A%B, C say something like:

    [1+abs(A/B)] * abs(B) > abs(A) >= abs(A/B) * abs(B)

    But the euclidian division specify that

    abs(B) > A%B >= 0

    Then for C: -10/3=-3 and -10%3=-1
    But for euclidian: -10/3=-4 and -10%3=2

    If specification is “easy”, do you intend to have 2 signed divisions?

    note ** https://en.wikipedia.org/wiki/Euclidean_division#Examples

    • Ivan Godard
      Keymaster
      Post count: 473

      The Mill already defines signed division, but not in the sense you mean. DIVU accepts unsigned arguments and so the ambiguity for signed does not exist. DIVS accepts signed arguments and produces the C-language result, as is standard in the programming languages that the hardware must accept. We expect that division will be an emulated op (rather than hardware native) on nearly all Mill members. We define a reciprocal-approximation helper op to get the emulation started, and the helper will be native. In practice, the helper and software Newton/Rapheson of the emulation is nearly as fast as a hardware version would be.

      It is easy to specify another operation, but it still must be implemented, tested, verified, and documented, and that is not without cost. Consequently we only add ops when their use can improve performance of widely used languages, or where they are needed by critical applications that will access them through intrinsics.

      So far we know of no language or app that would justify another division. It is quite rare for an app to have division as a performance-critical step, but if you are aware of any then please bring them to our attention.

You must be logged in to reply to this topic.