Forum Replies Created

  • Thomas D
    Participant
    Post count: 24

    The scratchpad has a three-cycle spill-to-fill latency: if you spill a value, you won’t be able to get it back for three cycles. Because of this, the length of the belt is set so that nearly everything lives for three cycles on the belt. So the length of the belt needs to be three times the number of results that the functional units can produce in one instruction for that family member.

    That makes sense, but I can’t imagine that a Tin can only retire three values a cycle (an eight-entry belt over a three-cycle lifetime works out to about three results per instruction). Then again, maybe I just suck at understanding real hardware.

    The belt is quite different from the scratchpad.

    I’m not sure I understand specifically what you mean by a ‘slower belt’.

    If you think of the belt abstraction: you’ve got this conveyor belt that values go onto; you pull some off, operate on them, and put the result on the belt. The newest results go on the front of the belt and the oldest results fall off the back. Now, imagine two of these belts. A spill operation moves a value onto the slower belt, and it is the only reason the slower belt moves. The fill operation takes a value off the slow belt and puts it back onto the fast belt. The ALU (etc.) operates off the fast belt. Values cycle on that belt quickly: it is fast. The slow belt only changes when we need to rename something as slow.
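
    To make sure I’m picturing it right, here’s a toy model of that two-belt idea (the names and sizes are my own, purely illustrative, not anything from the Mill docs):

    struct TwoBelt
     {
       static const int FAST_SIZE = 8;    // arbitrary sizes, just for the sketch
       static const int SLOW_SIZE = 32;

       double fast [FAST_SIZE];
       double slow [SLOW_SIZE];
       int fastFront = FAST_SIZE - 1;
       int slowFront = SLOW_SIZE - 1;

       // A normal result drop: only the fast belt advances.
       void drop (double v)
        {
          fastFront = (fastFront + 1) % FAST_SIZE;
          fast[fastFront] = v;
        }

       // Spill: copy a fast-belt value (addressed relative to the front)
       // onto the slow belt. This is the only thing that advances the
       // slow belt.
       void spill (int fastPos)
        {
          slowFront = (slowFront + 1) % SLOW_SIZE;
          slow[slowFront] = fast[(fastFront + FAST_SIZE - fastPos) % FAST_SIZE];
        }

       // Fill: a slow-belt value re-enters the fast belt as a fresh drop.
       void fill (int slowPos)
        {
          drop(slow[(slowFront + SLOW_SIZE - slowPos) % SLOW_SIZE]);
        }
     };

    The ALU would only ever read from fast; slow just preserves long-lived values in the order they were spilled.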

    The only problem I see with this is that people will find pathological algorithms which require an insane amount of working set to run.

    The size of the available on-chip memory is the same cost/speed trade-off you make when buying DRAM.

    Tin has only 128 bytes of scratchpad, and Gold has 512. Why so small? I realize that the scratchpad isn’t expected to be used frequently. Then again, maybe the Tin should have more scratchpad to make up for its short belt.

  • Thomas D
    Participant
    Post count: 24

    I’ve been plotting to write a belt virtual machine and thinking about its consequences. I think that for virtual machines, the stack machine will always rule due to ease of programming: it is an easy target because the compiler doesn’t care how deep a computation goes; it just keeps dropping things on the stack and the machine takes care of it.
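
    As a toy illustration of what I mean (my own example, not any real VM), emitting stack code from an expression tree needs no bookkeeping at all:

    #include <iostream>

    // Toy expression tree: a leaf holds a literal, a node is an addition.
    struct expr
     {
       double val;        // used only by leaves
       expr *lhs, *rhs;   // both null for a leaf
     };

    // Each subtree just leaves its result on top of the stack, however
    // deep the computation goes; the emitter tracks nothing.
    void emit (const expr *e)
     {
       if (nullptr == e->lhs)
        {
          std::cout << "push " << e->val << std::endl;
          return;
        }
       emit(e->lhs);
       emit(e->rhs);
       std::cout << "add" << std::endl;
     }

    int main (void)
     {
       expr three = { 3.0, nullptr, nullptr };
       expr five = { 5.0, nullptr, nullptr };
       expr sum = { 0.0, &three, &five };
       emit(&sum);   // push 3, push 5, add
       return 0;
     }

    A belt target would instead have to track how many drops ago each operand landed, which is real scheduling work for the code generator.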

    The questions I have are probably proprietary, but here goes:
    How did you decide on a 32-entry belt for Gold (and an 8-entry belt for Tin)?
    Why a scratchpad instead of a second (slower) belt?
    How was the size of the scratchpad decided on?

    I’ve been tossing around two ideas for alternatives to a scratchpad. One is to have four belts, with the opcode deciding which belt the result drops onto (and all four belts addressable for inputs); that sounds horribly complicated in hardware, but easy for a VM. The second is adding a second (and maybe third) belt that results are moved onto, forming a hierarchy of gradually slower belts. As you’ve probably thought of these ideas already, what gotchas am I not seeing?
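
    To make the first idea concrete, here’s roughly what I have in mind on the VM side (again, all names and sizes invented):

    const int NUM_BELTS = 4;
    const int BELT_LEN = 8;   // arbitrary, per belt

    // Four belts: an input names (belt, position); the opcode carries
    // the belt its result drops onto.
    struct MultiBelt
     {
       double belts [NUM_BELTS][BELT_LEN];
       int fronts [NUM_BELTS] = { BELT_LEN - 1, BELT_LEN - 1,
                                  BELT_LEN - 1, BELT_LEN - 1 };

       double read (int belt, int pos) const
        {
          return belts[belt][(fronts[belt] + BELT_LEN - pos) % BELT_LEN];
        }

       void drop (int belt, double v)
        {
          fronts[belt] = (fronts[belt] + 1) % BELT_LEN;
          belts[belt][fronts[belt]] = v;
        }
     };

    // An add reads from any belts and drops onto the belt the opcode names.
    void add (MultiBelt &m, int beltA, int posA, int beltB, int posB, int dstBelt)
     {
       m.drop(dstBelt, m.read(beltA, posA) + m.read(beltB, posB));
     }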

  • Thomas D
    Participant
    Post count: 24
    in reply to: MILL and OSS #3225

    What you could do is release a specializer for that community that takes out some of the magic to placate the VCs, but is OSS. It has been said that the user will be able to write their own specializer ( https://millcomputing.com/topic/meltdown-and-spectre/#post-3172 ). It shouldn’t matter if there are binary blobs on the MOBO if they aren’t being called. It’s not like the specializer will be Intel’s Management Engine (a “feature” of Intel MOBOs: an SoC that covertly runs MINIX 3).

    I can see, though, how such a thing (a bare-bones skeleton specializer) is really only useful to two groups: superusers who want to replace it, and competitors.

  • Thomas D
    Participant
    Post count: 24
    in reply to: Switches #3200

    Does the Mill provide any features for handling indirect (virtual) function calls? In, say, C++, you can wrap switch cases in classes and turn the switch statement into a virtual function call. Does the Mill have any improvements to handle this?

    Say I have this simple state machine (I hope these come out well):

    #include <iostream>
    
    int main (void)
     {
       long i [10];
    
       i[0] = 1;                                                          // opcode 1: push the literal that follows
       *reinterpret_cast<double*>(reinterpret_cast<void*>(&i[1])) = 3.0; // Assume LP64
       i[2] = 1;                                                          // push another literal
       *reinterpret_cast<double*>(reinterpret_cast<void*>(&i[3])) = 5.0; // And IEEE-754 double
       i[4] = 2;                                                          // opcode 2: add the two newest belt entries
       i[5] = 3;                                                          // opcode 3: print the front of the belt
       i[6] = 0;                                                          // opcode 0: halt
    
       const int BELT_SIZE = 10;
       double belt [BELT_SIZE];
       int front = BELT_SIZE - 1;
       int pc = 0;
    
       while (0 != i[pc])
        {
          switch(i[pc++])
           {
             case 1:
                front = (front + 1) % BELT_SIZE;
                belt[front] = *reinterpret_cast<double*>(reinterpret_cast<void*>(&i[pc++]));
                break;
             case 2:
              {
                int lhs = front;
                int rhs = (front + BELT_SIZE - 1) % BELT_SIZE;
                front = (front + 1) % BELT_SIZE;
                belt[front] = belt[lhs] + belt[rhs];
              }
                break;
             case 3:
                std::cout << belt[front] << std::endl;
                break;
           }
        }
    
       return 0;
     }

    I can make it Object-Oriented, trying to follow the Interpreter Pattern:

    #include <iostream>
    
    const int BELT_SIZE = 10;
    
    class operation
     {
       public:
          operation() : next(nullptr) { }
          virtual operation *execute(double *belt, int &front) const = 0;
          virtual ~operation()
           {
             delete next;
             next = nullptr;
           }
          operation *next;
     };
    
    class value : public operation
     {
       public:
          value(double val) : val(val) { }
          double val;
          operation * execute(double *belt, int &front) const
           {
             front = (front + 1) % BELT_SIZE;
             belt[front] = val;
             return next;
           }
     };
    
    class add : public operation
     {
       public:
          operation * execute(double *belt, int &front) const
           {
             int lhs = front;
             int rhs = (front + BELT_SIZE - 1) % BELT_SIZE;
             front = (front + 1) % BELT_SIZE;
             belt[front] = belt[lhs] + belt[rhs];
             return next;
           }
     };
    
    class print : public operation
     {
       public:
          operation * execute(double *belt, int &front) const
           {
             std::cout << belt[front] << std::endl;
             return next;
           }
     };
    
    int main (void)
     {
       operation *first = nullptr, *cur = nullptr;
    
       // Build the program: push 3.0, push 5.0, add, print.
       first = new value(3.0);
       cur = first;
       cur->next = new value(5.0);
       cur = cur->next;
       cur->next = new add();
       cur = cur->next;
       cur->next = new print();
    
       double belt [BELT_SIZE];
       int front = BELT_SIZE - 1;
       cur = first;
    
       while (nullptr != cur)
        {
          cur = cur->execute(belt, front);
        }
    
       delete first;
       first = nullptr;
       cur = nullptr;
    
       return 0;
     }

    Will the Mill mispredict the while loop as badly as any modern superscalar? Or is there yet another Not-Yet-Filed patent up your sleeve?

  • Thomas D
    Participant
    Post count: 24

    My first two questions were originally about NaR + None; then I found a post that linked to this: https://millcomputing.com/topic/metadata/#post-558 . To me, it makes the most sense given the use that None is supposed to have, but I could be wrong.

    Thinking about it more deeply, it really doesn’t matter whose metadata wins when two NaRs combine. While I would want the location of the first generated NaR, it actually isn’t all that helpful. A pseudocode example:

    A = 5 / 0
    B = 6 / 0
    C = A + B

    If this is the code order, there is nothing stopping a modern compiler from reordering the divides as it sees fit, so either divide could end up producing the “first” NaR. Ideally, an unoptimized binary wouldn’t reorder them. Ideally.

    As for the fast/slow multiplier question: my mistake was in what form genForm takes. I was imagining it closer to the form the hardware uses, which would make some specialization more difficult.
