Application Walkthrough

Author
Posts
mermerico
Moderator
January 4, 2014 at 5:39 pm
Post count: 10
#341 |
Now that we know a little bit about how many parts of the Mill work, I think it might be useful to create a walkthrough of a sample application. What I am imagining is a simple C program and a sample compilation in x86-like assembly and Mill-like assembly (just like in the talks, this can be psuedocode assembly). The walkthrough would then show how the two machines would execute the program and approximately how much time it would take. I think this would be useful in integrating all that we have learned so far from the talks into a simple demonstration of why the Mill is a better architecture.
- This topic was modified 10 years, 9 months ago by mermerico.
Ivan Godard
Keymaster
January 4, 2014 at 8:25 pm
Post count: 689
#343
Have you a suggested app or fragment?
Joe Taber
Participant
January 5, 2014 at 6:24 pm
Post count: 25
#344
GCD
Factorial
Fibonacci
Anagrams
FizzBuzz
Find the median of 3 (4,5?) numbers
Matrix or Vector products.
Exponentiation
CRC-32
Run-length encoding
Any of the sorting algorithms, or maybe just the merge operation (I feel like it might be interesting but short in mill asm).
CSV manipluation
Towers of Hanoi (recursion)
Zig-zag matrix
Basically almost everything on Rosetta Code will be interesting, and have sample code in C and assembly. Of course some of these may be too complicated or require os-specific console input/output.
- Ivan Godard
  Keymaster
  January 5, 2014 at 7:40 pm
  Post count: 689
  #346
  A good list. We do have to pick a Mill member, but for now assume one big enough (unlimited slots and belt) for anything; the actual slot and belt requirement is an in interesting result in its own right.
  I’ll take the first: GCD.
  /* code based on Rosetta C++ example: int gcd(int u, int v) { return (v != 0)?gcd(v, u%v):u; } */ F("gcd"); // u in b0, v in b1 neqs(b1, 0), rems(b0, b1); retnfl(b0, b2); nop(4); // wait for rem call1("gcd", b3, b0); retn(b0);
  This needs a 3-long belt, one flow slot and two exu slots (suitably populated); 8 cycles, excluding the nested call body.
  /* code based on Rosetta C++ example: int gcd_iter(int u, int v) { int t; while (v) { t = u; u = v; v = t % v; } return u < 0 ? -u : u; /* abs(u) */ } */ F("gcd_iter"); // u in b0, v in b1 L("loop"); neqs(b1, 0), rems(b0, b1); brfl(b0, "xit"); nop(4); // wait for rem conform(b3, b0); br("loop"); L("xit"); lsss(b0, 0), negs(b0); pick(b0, b1, b2); retn(b0);
  This needs a 3-long belt, two exu slots, onr flow slot and a pick slot; the loop body is 8 cycles, plus 3 cycles for the wrap-up.
  In both I have used speculation to launch the rems operation before it is known to be needed; without speculation the cycle counts would be 8 not 7.
  The code does not use phasing (NYF). With phasing the count drops to 7 cycles for the first, while the second gets a 7 cycle loop and a one cycle wrap-up.
  Your turn 🙂
  This reply was modified 10 years, 9 months ago by staff. Reason: formatting fix
- Ivan Godard
  Keymaster
  January 5, 2014 at 7:46 pm
  Post count: 689
  #347
  Bummer: I see the forum has squeezed out all the extra blanks in my carefully-formatted code. I’ll take it up with forum admin.
  - Joe Taber
    Participant
    January 5, 2014 at 9:00 pm
    Post count: 25
    #351
    Too bad there is no preview option before posting. The code button in the edit bar surrounds selected text with backticks (same key as ~), but I don’t know if it works on multiple lines. Lets see how bad wordpress butchers this:
    (this is a test) (on three) (separate lines)
    I find it interesting you mentioned that we have to pick a mill member. Is there no way to write directly in the intermediate language that gets translated by the Specializer? Or is that what you just did and the IL code is just written for a hypothetical mill member with unlimited slots and belt?
    Is there a reference for all the instructions?
    It seems like it would be easier to tag an instruction and refer to its outputs by tag (and index if multiple outputs) instead of by belt number. The tags should be trivially converted to belt numbers during translation, but be much easier to program in, especially for hand-coded assembly like this.
    - Ivan Godard
      Keymaster
      January 5, 2014 at 11:02 pm
      Post count: 689
      #357
      Yep, I hadn’t noticed the “code” button 🙁
      The code was hand written; as you see, Mill asm is distinctly *not* humane.
      The ability to write relAsm (which is what we have called specializer input) was initially supposed to exist, but then dropped when we realized that there was no need and great cost, if it could be done at all – writing graph structures is no friendlier than absAsm (what the assembler accepts).
      Dave has long advocated a form of absAsm that at least calculated the belt for you, and even has a personal tool that sort-of does that that he used for some benchmarks. While there are issues that so far there are no solutions for, it looks like it could be done and would make asm easier. It remains a someday project, awaiting a volunteer and progress on higher priority projects.
      Unfortunately the entire operation set contains a few NYFs, so we can’t publish it yet. However, it should be possible to write at least a few of the samples using only the known operations, or pseudo-ops standing in for ops that you don’t know the mnemonic for yet. The format I used on the slides omits belt numbers in favor of a comment indicating what was being referenced. Thus the first gcd would be:
      /* code based on Rosetta C++ example: int gcd(int u, int v) { return (v != 0)?gcd(v, u%v):u; } */ F("gcd"); neqs({v}, 0), rems({u}, {v}); retnfl({neqs}, {u}); nop(4); // wait for rem call1("gcd", {v}, {rems}); retn({call});
      Would that help?
Will_Edwards
Moderator
January 23, 2014 at 12:40 am
Post count: 98
#530
One interesting micro-problem that would help me understand the Mill is computing the Euclidean distance between two 3D points.
The source code might look like this:
d = sqrt(sqr(a.x - b.x) + sqr(a.y - b.y) + sqr(a.z - b.z))
Or it might be:
d = sqrt(sqr(a[0] - b[0]) + sqr(a[1] - b[1]) + sqr(a[2] - b[2]);
Which are naturally equivalent if the fields in the point struct used in the first form are adjacent.
The parallelism of the subtractions and squaring is obvious, and easy as vector or as separate parallel operations.
- If vectorised, can you load a non-power-of-two length vector (perhaps it puts a power of two length vector on the belt, with None in the last slot?)
- If vectorised, how do you then sum the values in the vector together?
- And if done as separate operations, do you need two sequential add operations to add them together?
- This reply was modified 10 years, 9 months ago by Will_Edwards. Reason: clarifications
- Ivan Godard
  Keymaster
  January 23, 2014 at 12:54 am
  Post count: 689
  #531
  I don’t know this kind of application and your description is a bit cryptic, so this is a guess.
  The two loads, subtract, and square are routine vector ops. By “sum” I guess you mean an add-reduction; if so, NYF. A scalar sqrt uses a hardware sqrt-approximation and a software in-line algorithm, typically Newton-Rapheson. The whole can be pipelines, NYF.
Author
Posts

You must be logged in to reply to this topic.