Mill Computing, Inc. Forums The Mill Architecture Saving entropy in the Mill encoding

  • Author
    Posts
  • Ivan Godard
    Keymaster
    Post count: 689
    #3291 |

    This note was prompted by a question from one of our team who was newly exploring the Mill binary encoding; I repost it here as possibly of general interest.
    ===========================================================================================
    > Getting a bit more familiar with conAsm, it seems like the operations
    > in a instruction are shown in reversed slot order. To me, this seems > to wrong way around. What’s the reason behind that?

    Most Mill operations are slot idempotent: you can put an add in any slot with an ALU and it makes no difference other than result belt drop order. However, a few ops have intra-instruction slot order semantics where relative slot order makes a difference. In our ISA, these slot-sensitive ops are all control flow operations. They fall into two groups: the calls and the branch/return family. These ops have different semantics because they are inherently in-order and cannot be executed in parallel with other ops of the same sort: you can issue three adds from an instruction concurrently, but you can’t call three functions concurrently in real hardware.

    Calls in an instruction are ordered so that inter-call dataflow is possible. A core with multiple slots that support call must define which slot’s call executes first, which next, and so on. This slot assignment is essentially arbitrary, but the textual ordering in a printed-out line of assembler code is not: we read text left-to-right, and execution order matching text order is more “natural” for the human reader. Consequently, the textually leftmost call of a multi-call instruction is executed first. If we developers were speakers of a Semitic language we would use the opposite order and the textually rightmost call would be first, but we are not. However, whether the slot occupied by the first call op is numbered “slot zero” or “slot N” is arbitrary and no one cares but the hardware guys, for calls at least.

    The branch (br/inner/leave) family, and the retn op which works the same way, are slightly more complicated. Like calls, the branch ops are evaluated in order. Conditional forms with a false predicate are simply ignored, but conditional forms with a true predicate, and unconditional forms, cause evaluation to stop and transfer is taken to the op’s target address. Any subsequent branch operation in the instruction is ignored. This evaluation ordering semantic is known as First Winner Rule.

    For the same reason that calls use left-to-right reading order, branches in printed-out assembler text appear in left-to-right evaluation order: the leftmost winning branch is the one taken, to make it easier for the human reader. It might appear that the physical slot number order for branches is as arbitrary as it is for calls, but it’s not. The difference arises because an unconditional branch is always a winner under FWR: no branch evaluated after an unconditional branch can ever be taken.

    This impacts encoding entropy. The bitwise layout of instruction encoding permits us to elide trailing NOPs in an instruction. The slots have indices 0-N (in the flow block, where the ops encode), and a NOP in slot N occupies no space in the binary. In general, trailing contiguous NOPs in higher-numbered slots are free. However, a NOP that is not part of a trailing contiguous group must be explicitly encoded, wasting some entropy. If every slot supported every different op then the encoding would never have any explicit NOPs, but hardware functional units are not free either. One part of tuning a Mill configuration involves assigning the available function units to slots so as to minimize the number of explicit NOPs in some body of test code.

    And this is where slot evaluation order comes in. In an instruction with several branch ops, there will be at most one unconditional transfer op due to FWR. If branches are evaluated in increasing numeric slot order then the unconditional one could wind up in any slot, depending on what other ops are in the same instruction. Consequently, with evaluation in increasing order the encoding must be able to express unconditional branches in any, and every, slot.

    In contrast, if evaluation is done in decreasing slot number order, the last op to be evaluated will always be the one in slot zero. As an unconditional is always the last branch of an instruction due to FWR, with decreasing evaluation order it is always safe to put the single unconditional in slot zero, and, more importantly, slot zero is the only slot that must support the encoding of unconditional branches. The unconditional ops can be omitted from slots 1-N, saving the entropy that their encoding would require.

    It has sometimes be argued that evaluating in increasing numeric order is more “natural” than the decreasing order we use. What is more “natural” is in the eye of the beholder, witness the endianness controversy. In this case, no outside user will every care about the bit level encoding we use. Those of our own people who design the encoding in software and decoding in hardware (and thus do care about bits) presumably know the machine well enough to be aware of the rules. Consequently, we don’t have to decide what is “natural”.

You must be logged in to reply to this topic.