• Author
    Posts
  • staff
    Keymaster
    Post count: 49
    #2898 |

    Talk by Ivan Godard – July 12, 2017, at the
    SFBay Association of C/C++ Users

    Slides: mill-cpu-switches.04 (.pptx)

    This is the eleventh topic publicly presented related to the Mill general-purpose CPU architecture.

    Multi-way branches, known as switches or case clauses in various languages, are a notorious pain for compiler writers and CPU architects. On the critical path in important applications from lexers to byte-code interpreters, switches often suffer from poor branch prediction performance. Short switches with few cases can use a chain of ifs but are hampered by the low rate at which branches issue on conventional hardware and may miss several times while working down the chain. A common alternative is to indirect-branch through a jump table, but that approach is subject to misses in the data cache as well a prediction miss. Some CPU architectures have taken the approach of putting the table in instruction space, but that adds expensive shifters and datapaths to the instruction fetch.

    This talk shows how the ultra-wide-issue Mill architecture responds to the switch challenge. The Mill is a family of wide-issue, statically scheduled, exposed pipeline CPUs. Some family members can issue up to 30 instructions every cycle, including up to five branches — and can retire instructions at the same rate. The talk works through the compiled code for an example with eight cases spanning 100 values. Spoiler: a mid-range Mill does the whole switch, including the action bodies for each case, in three instructions and no table. Which isn’t too bad, considering that hand-crafted Mill code can do it in two.

    NOTE: the slides may require genuine Microsoft Windows PowerPoint to view; many clones, and Mac native PowerPoint, are unable to show the animations correctly. If you do not have access to Windows PowerPoint then watch the video, which shows the slides as intended.

    • This topic was modified 6 years, 9 months ago by  staff.
    • This topic was modified 6 years, 9 months ago by  Will_Edwards.
    • This topic was modified 6 years, 9 months ago by  Will_Edwards.
    • This topic was modified 6 years, 9 months ago by  Will_Edwards.
    • This topic was modified 6 years, 9 months ago by  staff.
    • This topic was modified 6 years, 9 months ago by  staff.
    • This topic was modified 6 years, 9 months ago by  Will_Edwards.
    • This topic was modified 6 years, 9 months ago by  Will_Edwards.
    • This topic was modified 6 years, 9 months ago by  staff.
    • This topic was modified 6 years, 9 months ago by  staff.
  • Thomas D
    Participant
    Post count: 24

    Does the Mill provide any features for handling indirect (virtual) function calls? In, say C++, you can wrap switch cases in classes and turn the switch statement into a virtual function call. Does the Mill have any improvements to handle this?

    Say I have this simple state machine (I hope these come out well):

    #include <iostream>
    
    int main (void)
     {
       long i [10];
    
       i[0] = 1;
       *reinterpret_cast<double*>(reinterpret_cast<void*>(&i[1])) = 3.0; // Assume LP64
       i[2] = 1;
       *reinterpret_cast<double*>(reinterpret_cast<void*>(&i[3])) = 5.0; // And IEEE-754 double
       i[4] = 2;
       i[5] = 3;
       i[6] = 0;
    
       const int BELT_SIZE = 10;
       double belt [BELT_SIZE];
       int front = BELT_SIZE - 1;
       int pc = 0;
    
       while (0 != i[pc])
        {
          switch(i[pc++])
           {
             case 1:
                front = (front + 1) % BELT_SIZE;
                belt[front] = *reinterpret_cast<double*>(reinterpret_cast<void*>(&i[pc++]));
                break;
             case 2:
              {
                int lhs = front;
                int rhs = (front + BELT_SIZE - 1) % BELT_SIZE;
                front = (front + 1) % BELT_SIZE;
                belt[front] = belt[lhs] + belt[rhs];
              }
                break;
             case 3:
                std::cout << belt[front] << std::endl;
                break;
           }
        }
    
       return 0;
     }

    I can make it Object-Oriented, trying to follow the Interpreter Pattern:

    #include <iostream>
    
    const int BELT_SIZE = 10;
    
    class operation
     {
       public:
          operation() : next(nullptr) { }
          virtual operation *execute(double *belt, int &front) const = 0;
          virtual ~operation()
           {
             delete next;
             next = nullptr;
           }
          operation *next;
     };
    
    class value : public operation
     {
       public:
          value(double val) : val(val) { }
          double val;
          operation * execute(double *belt, int &front) const
           {
             front = (front + 1) % BELT_SIZE;
             belt[front] = val;
             return next;
           }
     };
    
    class add : public operation
     {
       public:
          operation * execute(double *belt, int &front) const
           {
             int lhs = front;
             int rhs = (front + BELT_SIZE - 1) % BELT_SIZE;
             front = (front + 1) % BELT_SIZE;
             belt[front] = belt[lhs] + belt[rhs];
             return next;
           }
     };
    
    class print : public operation
     {
       public:
          operation * execute(double *belt, int &front) const
           {
             std::cout << belt[front] << std::endl;
             return next;
           }
     };
    
    int main (void)
     {
       operation *first = nullptr, *cur = nullptr;
    
       first = new value(3.0);
       cur = first;
       cur->next = new value(5.0);
       cur = cur->next;
       cur->next = new add();
       cur = cur->next;
       cur->next = new print();
    
       double belt [BELT_SIZE];
       int front = BELT_SIZE - 1;
       cur = first;
    
       while (nullptr != cur)
        {
          cur = cur->execute(belt, front);
        }
    
       delete first;
       first = nullptr;
       cur = nullptr;
    
       return 0;
     }

    Will the Mill mispredict the while loop as bad as any modern superscalar? Or is there yet another Not-Yet-Filed patent up your sleeve?

    • Ivan Godard
      Keymaster
      Post count: 689

      This is an interpreter, a class of application that is notorious for bad prediction behavior.

      Direct interpreters have two branches per iteration: the switch and the loop back. The loop back will predict, the switch won’t.

      In answer to your question: Mill does support indirect branches and indirect calls. Because there is only one branch site in the loop the Mill use of exit prediction is the same as a conventional branch predictor BTB. The mispredict rate is governed by the quality of the predictor and the “lumpiness” of the branching pattern, and should be similar between Mill and conventional. There may be some advantage to the Mill because the Mill prediction table is expected to be substantially larger than a conventional BTB, and so may be better at picking up history patterns; we have no data yet on that.

      There’s nothing up the sleeve to let us predict the unpredictable. However, we do have one big advantage: Mill mispredict restart cost is five cycles, while that of a conventional is treble that or more.

You must be logged in to reply to this topic.