• Author
    Posts
  • Ivan Godard
    Keymaster
    Post count: 406
    #2818 |

    An exchange from comp.arch repeated here:
    ++++++++++++++++++++++++++++++++++++++++++++++++++++++
    @Stephen Fuld:
    The C switch statement, and equivalents in other languages, are interesting. They allow substantial economy of source code, and quite compact generated code, but they do have some issues. The indexed load of the target address is frequently a cache miss, and it is followed immediately by an indexed jump, that is almost always a branch mispredict. Which leads me to my question. Does the Mill have any tricks up it sleeve that make switch statements less costly?

  • Ivan Godard
    Keymaster
    Post count: 406

    The compiler currently turns switches into a binary search branch tree. The case space is partitioned by lss-than compares down to the level at which equal-compares are better, and then equal-compares thereafter. The logic deals well with runs of consecutive values in a case, such as all alphabetic characters; less-than compares are cheap for that. We do not currently generate a jump table. Although early versions of the ISA had jump tables in the code adjacent the the instruction that used them, we took the operation out; code side jump tables seriously muck up the code data paths.

    This strategy is tuned for the wide issue and multi-branch capability in larger Mill family members. For a member that is Silver or wider all the equal compares and their branches fit in one instruction, and a big hunk of the less-than tree will be in prior instructions of the same ebb. Consequently execution can get well down the tree before one of the branches is actually taken and search enters a sub-tree. As the whole ebb has only one prediction (where it will exit to) there are many fewer misses than would occur on a legacy architecture that does branch prediction in a search tree. A bit of arithmetic will show how many taken branches (on average) are necessary to get to the target case (assuming cases are equiprobable and sparse, which is the worst case) for any member configuration of ALU exu slots (for the compares) and branch flow lots (for the branches).

    This works quite well for switches up to the size of say a lexer, but will obviously fail for something like a YACC grammar with cases in the thousands. For those we need to add a cost function and data-side dispatch tables (Mill branches already support label variables) to the specializer; we just haven’t gotten around to that yet.

    • Ivan Godard
      Keymaster
      Post count: 406

      @EricP:
      In what way do code side jump tables muck up the code data paths?
      For example, if there was an indexed branch instruction immediately
      followed in the code by a table of displacements, how would that
      be worse than any other kind of jump?

      Eric

      • Ivan Godard
        Keymaster
        Post count: 406

        That was pretty much what our initial design did, except that it was a real switch op, not just an indexed branch, and bounds-checked the argument into the default case.

        The mucking arose because the Mill code-side is both heavily streamed and heavily pipelined. It’s a fire-hose and the code side has to do all it can to keep everything out of the way. The whle code side is done on a whole-cache-line basis. Recall that each instruction is variable length and may be up to a whole line long and may cross a line boundary. The instruction header says how long it is, so the front end of decode is an instruction shifter that is two lines wide and cracks the line stream into an instruction stream.

        The cracker line fetch sequence is controlled by the predictor, so a predicted taken transfer causes the lines containing the target to replace the lines containing the transfer in the cracker. After the cracker the line relationship is lost. A miss discards the inflight instruction stream and starts the cracker with a new line-pair just as if it had acted on a predicted transfer.

        In those streams there’s no place for an arbitrarily large data structure like a jump table; it won’t be fetched as part of the line stream, and whatever part of it was in the remainder of the line containing the switch op would have been discarded as the cracker was refilled from the predicted switch target. Moreover, we don’t even know that we have a switch until two cycles after the line-cracker, and the cracker is quite likely to have followed one or two predictions before we know – the table is probably no longer in the decode at all any more, and we have to be prepared to get it, or at least the desired address, from cache.

        Consequently, other than the potential value of maybe part of the table having been brought to the icache along with the instruction containing the switch, there’s no value in having the table adjacent to the switch op except that it gives a cheap way to find the address of the table.

        Now that we know the address of the table (having decoded the switch op enough to discover we need an address), we have to do an indexed load and pull up eight bytes that we can compare against the prediction we followed a couple of cycles ago to see if we went where we should have. That’s a round-trip to the cache. However, the icache is set up to do whole lines – and nothing but whole lines. So to get the case target address we have to pull up a line (needs an adder somewhere to add the table address and the case index), and isolate eight bytes out of that line, which means there’s a shifter somewhere too. Can’t use the ALUs or L/D address adders for the add; they are busy working down the predicted path. Can’t use the ALU shifters for the isolate; they aren’t line wide and are also busy. So we need an adder in the branch unit and an isolate shifter in the icache, equivalent to what is on the front of a dcache, and both adder and isolater are used for nothing but the rare case of a switch. And you have several more data paths to mux, and another command sugnal or packet for the icache to deal with

        The latency is bad too. We don’t know we have a switch op until D2, plus a cycle for the add, plus maybe two-three cycles to get the line from the i$1 (the table won’t be in the i$0 microcache – I don’t even want to think about that), plus a cycle for the isolater, plus a cycle back and a cycle for the success comparison, just to confirm that you did indeed predict correctly. Then what happens if the switch target (as predicted) immediately does a switch too? Is switch handling fully pipelined? Probably not – you will stall, but then you will be stalling anyway while you figure out the address.

        Muck.

        Nowadays the jump table (when we get around to it) is on the data side, which already has the address-adder for an indexed load and the isolaters in the data path from the D$1, so no logic need be added to support switches. The ISA has dynamic branches that take the target address from the belt, for which the prediction confirmation takes zero latency, just like any other branch. The indexed load is separate from the branch and so can be hosted into the future as far as the switch argument value is known. With any luck the pointer will already be on the belt by the time we need it in the dynamic branch, and if there’s a data dependence that prevents hoisting the load then we are still no worse off than a D$1 latency.

        No added logic, no added data paths, no added muxes, no high fixed latency, no stalls (if hoistable).

        No muck.

  • Ivan Godard
    Keymaster
    Post count: 406

    p.s. For greater detail, here’s the comment from the specializer function that does this.

    /*
    Each of the ranges can be isolated by a “less than”
    comparison, so the maximal number of branches is the size
    of the ranges vector. A less-than compare requires one
    branch per range, while an equality compare requires one
    branch per non-default value. Consequently equality takes
    one branch for two ranges when a non-default range of extent
    one is adjacent to a default range wheres less-than would
    take two branches; a non-default range of extent greater
    than one takes one branch using less-than and a number of
    branches equal to the extent using equality; and a
    non-default branch of extent one adjacent to a non-default
    of any size takes one branch in both less-than and
    equality.

    The worst case is a run of extent-one ranges. If there are
    N ranges, then sequential equality needs N branches and
    will execute N/2 of them. With less-than you need one
    branch to split N to 2*(N/2), 2 to split 2*(N/2) into
    4*(N/4) and so on. That is 2N-1 total to the bottom level,
    and log2(N) levels. So less-than will always need more
    branches than equality, because the bottom level of a
    less-than is the same as an equality and there are all the
    upper levels on top of that.

    As for latency, the dividing point is when is log2(N)
    smaller than N/2; the break-even is 8. So we less-than
    down until the number of cases is eight or less, and do
    sequential equality after that.

    The job of this function is to partition
    the ranges by adding tests on pred that branch to interior nodes
    of the tree until the test is trivial and we can go to the leaf
    ebbs given by the switch statement.

    “home” is the original ebb that ended with the switch dispatch.

    “node” is the root ebb that ends with the switch, or recursively
    an inner node in the tree of ebbs that results from splitting
    the case space with lss operations.

    “pred” is the switch expression that is being tested; it will
    be one argument to all the comparisons in the tree.

    “ranges” is all the value ranges that must be resolved leafward
    of home in the splitting tree; each has a leaf ebb and a
    contiguous range of cases that go to that ebb.
    */
    void switchPartition(ebbId home, ebbId node, dropId pred,
    row ranges) {

You must be logged in to reply to this topic.