Forum Replies Created
Mostly we’ve been working with tests in C; currently 95+% of the suite compiles. The C++ library is coming up because we are doing the OS kernel in C++, but it’s really only starting. Of course, compile-time-only code like the STL ports easily, but the rest will be a lot of work. The only part that is not just port-and-debug is throw/catch, for which the standard implementation approach doesn’t work on a bare machine. We know what we will do, but that will be new code, which is slow to produce while we are still fleshing things out.
For the C tests we have been throwing whatever we can find for free at it. We have the entire Rosetta suite in, for example. Probably the biggest single test in the suite is the micropython interpreter. There are a bunch more tests available that we can’t yet use because they need a file system – the bare-machine problem again; an API to use the host file system via outcalls is on the spike, but low priority.
As for comparisons, we have a commercial harness running (can’t remember the name) that does host vs. target compares across various stats and different test runs. With that we’ve found and fixed several sim-level problems: we didn’t have a stride prefetch predictor configured, exit predictor bugs, poor cache replacement, that kind of thing. By eyeball, after that work Silver seems to be running similarly to our server x86s. Of course we can only measure execution, and will have to armwave power and area.
Actually, the biggest issue to my mind coming out of the comparisons is how apples-to-oranges they are, IPC especially. Sometimes the A2O gives us more instructions than x86: all our static speculation counts as instructions, but OOO run-ahead doesn’t; likewise a load-op-store sequence counts as three ops for us while an x86 mem/op/mem counts as one. Sometimes the A2O gives the x86 more: call is one op for us, potentially dozens for x86. About the only thing that is more apples-to-apples is whole-program wallclock-to-wallclock, and there Silver seems pretty close on individual tests and sometimes ahead, and it’s still early days for our code quality. That’s without pipelining up yet; I’m really encouraged because I’m seeing loop sizes compress by a factor of three or more on Silver when I enable piping.
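To make the counting asymmetry concrete, here is a sketch in C (the instruction selections in the comments are illustrative, not dumps from either toolchain):

    /* Illustrative only: how the same C statement inflates or deflates
     * "instruction" counts on different architectures. */
    void bump(int *a, int i, int k) {
        a[i] += k;
        /* x86 can encode this as one mem-op instruction:
         *     add dword ptr [rdi + rsi*4], edx
         * while a load/store machine issues three ops:
         *     load  t, a[i]
         *     add   t, t, k
         *     store a[i], t
         * Same work, but 1 vs. 3 counted instructions; IPC comparisons
         * inherit this bias before either core executes anything. */
    }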
For now Gold isn’t really giving better numbers than Silver – the extra width can’t be used by the test codes and won’t be until we not only have piping working but also can unroll enough to saturate the width. All the configs also need tuning – Silver to run one daxpy per cycle piped, and Gold to run two, piped and unrolled, both with excess resources cut out to save power and area.
There are tuning issues in the smaller members too, Tin and Copper. There the issue is belt size. Even an 8-position belt is enough for the tests’ transient data, but the codes have long-lived data too, and the working sets of that data don’t fit on the smaller belts; some don’t fit even on the 16-position belt of a Silver. As a result the working sets get spilled to scratch and the core essentially runs out of the scratchpad, with tons of fill ops; much like codes on the original 8086 or a DG Nova. This is especially noticeable on FP codes like the numerics library, for which the working set is full of big FP constants. Working out of scratch doesn’t impact small-member performance much, but it has scratch-bandwidth consequences, and power consequences too, that we can only guess at. We may need to config the smaller members with bigger belts, but that too has consequences. In other words, the usual tuning tradeoffs.
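As an illustration of the pressure (a sketch; the constants are ordinary Taylor-series coefficients, and nothing here is taken from the actual numerics library):

    /* Illustrative kernel: a polynomial approximation whose coefficients
     * are all live across the whole evaluation. The six constants plus
     * x, z, and the partial results already crowd an 8-position belt,
     * so most coefficients end up in scratch, filled at each use. */
    double sin_poly(double x) {
        const double c1 = -1.6666666666666666e-01;
        const double c2 =  8.3333333333333332e-03;
        const double c3 = -1.9841269841269841e-04;
        const double c4 =  2.7557319223985893e-06;
        const double c5 = -2.5052108385441720e-08;
        const double c6 =  1.6059043836821613e-10;
        double z = x * x;
        return x + x * z * (c1 + z * (c2 + z * (c3 + z * (c4 + z * (c5 + z * c6)))));
    }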
As soon as pipelining is up and we have comparison numbers we expect to pull the trigger on the next funding round, perhaps ~$12M. There are a bunch of things we want to file more patents on, but those, and more talks, will wait for the round.
- in reply to: Open Source Mill-like Virtual Machine #3446
We have always said that the Mill is merely the first of the belt-machine architectural category; you and Witold are demonstrating that. I’m impressed.
If the legalities matter to you, we’d be happy to give you a free non-commercial license to our patent set.
- in reply to: Loop compilation #3440
Here’s what the compiler gives you today for your example (Silver, no pipelining or vectorization, -O2):
F("f") %0 %1; load(dp, gl("n"), w, 3) %4/27 ^14, scratchf(s[1]) ^11; // V%0 V%1 spill(s[0], b0 %0/26) ^10, spill(s[4], b1 %1/29) ^10; // ^%0 ^%1 spill(s[1], b0 %4/27) ^14; // L7C24F0=/home/ivan/mill/specializer/src/test7.c // V%4 ^%4 nop(); nop(); gtrsb(b1 %4/27, 0) %5 ^15, // L7C22F0 brfls(b0 %5, "f$0_3", b1 %2, b1 %2) ^16, // L7C4F0 rd(w(0)) %2 ^12; // V%2 ^%4 V%5 ^%5 ^%2 ^%2 load(dp, gl("A1"), d, 3) %9/28 ^21; nop(); spill(s[1], b0 %9/28) ^21; // V%9 ^%9 nop(); nop(); inners("f$1_2", b2 %2, b2 %2, b2 %2) ^22; // L7C4F0 // ^%2 ^%2 ^%2 L("f$0_3") %20 %21; // for.cond.cleanup mul(b1 %20, b0 %29/1) %22 ^38, // L12C20F0 fill(s[4], w) %29/1 ^10; // V%21 V%20 V%29 ^%20 ^%29 nop(); nop(); // V%22 mul(b0 %22, b3 %21) %23 ^39; // L12C27F0 // ^%22 ^%21 nop(); retn(b0 %23) ^40; // L12C8F0 // V%23 ^%23 L("f$1_2") %10 %11 %12; // for.body // loop=$2 header mul(b3 %11, b1 %26/0) %14 ^28, // L8C26F0 loadus(b0 %28/9, 0, b3 %11, w, 3) %13 ^27, fill(s[0], w) %26/0 ^10, fill(s[1], d) %28/9 ^21; // V%11 V%10 V%12 V%26 V%28 ^%11 ^%28 ^%26 ^%11 nop(); nop(); // V%13 V%14 mul(b0 %14, b1 %13) %15 ^29; // L9C23F0 // ^%14 ^%13 nop(); nop(); // V%15 fma(b0 %15, b6 %11, b7 %12) %17 ^31; // L10C15F0 // ^%12 ^%11 ^%15 add(b6 %11, 1) %18 ^32; // L7C28F0 // ^%11 V%18 lsssb(b1 %18, b0 %27/4) %19 ^33, // L7C22F0 add(b2 %15, b7 %10) %16 ^30, // L9C14F0 backtr(b1 %19, "f$1_2", b0 %16, b4 %18, b2 %17) ^34, // L7C4F0 leaves("f$0_3", b2 %17, b0 %16) ^34, // L7C4F0 fill(s[1], w) %27/4 ^14; // L7C24F0=/home/ivan/mill/specializer/src/test7.c // V%27 ^%18 ^%27 ^%10 ^%15 V%19 V%17 V%16 ^%19 ^%16 ^%18 ^%17 ^%17 ^%16
Your three ebbs are there; f$1_2 is the loop body. The back branch is backtr(), the exit is leaves(). These take belt arguments (%N…) which are passed to the formals at the start of the target ebb (the formals are listed after the L(…) label). The initial entry is the inners() op, which supplies the starting values for the formals.
Branch argument passing works like call argument passing: there is no actual movement of data, just a renumbering of the belt to put the actuals into the formals order. Absent a mispredict, a branch (including the back/inner/leave forms) takes zero time, including passing the arguments.
The imported loop-constant values “n” and “x” are first spilled to the scratchpad in the prologue, and then filled for each iteration. It would also be possible to pass the values around the loop as an additional formal and back() argument. The specializer did not do that in this case; the latency would be the same, entropy slightly less, and power usage hardware dependent.
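For intuition, the semantics can be sketched in C (an analogy only; the function names are invented): each ebb becomes a function whose parameters play the role of its formals, and every branch form becomes a tail call passing actuals to formals. Carrying n as a parameter, as below, is the “additional formal” alternative; the specializer chose spill/fill here instead.

    /* C analogy only, not Mill code: branch argument passing works like
     * tail calls between per-ebb functions. */
    static long epilog(long acc, long n) {         /* like f$0_3's formals */
        return acc * n;
    }

    static long body(long acc, long i, long n) {   /* like f$1_2's formals */
        acc += i * i;
        i += 1;
        if (i < n)
            return body(acc, i, n);  /* backtr: actuals renumbered to formals */
        return epilog(acc, n);       /* leaves: same, to the epilog ebb */
    }

    long f(long n) {
        if (n <= 0)
            return epilog(0, n);     /* brfls: early exit to the epilog */
        return body(0, 0, n);        /* inners: supplies starting formal values */
    }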
Incidentally, the loop body here is nine cycles (count the semicolons); with pipelining enabled it drops to three.
- in reply to: Loop compilation #3439
Please accept my apologies, folks. Somehow a recent update of the Forum software removed me from the notify list, so I was unaware of the discussion. I’ll be working through the backlog promptly.
Ivan
- in reply to: Scratchpad design decision #3457
256 bytes stores 32 64-bit values (ignoring metadata), but do you need 256 positions? I would (naively) expect the slow belt to have the same number of entries as the main belt.
You need as many positions as the working set you design for has distinct operands, regardless of size. Working set is per algorithm, not per member, although market segmentation will let us downgrade a little in the smaller members. We haven’t paid much attention to tuning this, but seat-of-the-pants estimation suggests that 64 would be good, maybe 127 better.
However, because the slow belt is populated by long-lived values that are going to be referenced many times during their lifetime, a slow belt would have a larger fill/spill (read/write) ratio than the fast belt
It does? I’m afraid I don’t follow.
You need one spill per operand, and one or more fills per operand. You need a new fill for each reference that no longer has the previous fill reachable. Generally, longer-lived values are referenced more times than shorter-lived ones, and are more likely to need a new fill rather than reusing a previous fill. As the spill count is constant (one per operand), and the slow belt can be expected to have more references than the fast belt (and hence more fills), the ratio is different, which drives the fill bandwidth provision among other things.
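A worked example (illustrative C, not Mill output): one spill for the long-lived k, but a fresh fill wherever an earlier fill is no longer reachable.

    #include <stdio.h>

    /* Illustrative only: the long-lived constant k is spilled once after
     * it is produced, but each distant reference needs its own fill, so
     * the fill/spill ratio grows with lifetime. */
    static double scale_sum(double k, const double *v, int n) {
        double sum = 0.0;              /* k produced; one spill */
        for (int i = 0; i < n; i++)
            sum += k * v[i];           /* a fill per iteration: the previous
                                          fill has fallen off the belt */
        return sum + k;                /* one more fill after the loop */
    }

    int main(void) {
        double v[4] = {1.0, 2.0, 3.0, 4.0};
        printf("%f\n", scale_sum(2.0, v, 4));  /* prints 22.000000 */
        return 0;
    }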
- in reply to: Scratchpad design decision #3455
More of an Intel vs. ARM split; we don’t expect ever to do our own fabbing. Of Intel vs. ARM, we plan to model Intel and sell chips, although we’d be happy to do custom chips if the price and NRE were right. Of course, no business plan survives contact with the enemy (apologies to Moltke the Elder).
- in reply to: Scratchpad design decision #3453
Tin has only 128 bytes of scratchpad, and Gold has 512. Why so small? I realize that the scratchpad isn’t expected to be used frequently. Then again, maybe the Tin should have more Scratchpad to make up for its lack of Belt.
These sizes are more a matter of marketing than technology. The scratch size is a performance limit, not an architectural limit; on any member a program whose working set runs out of scratch can use the (practically) unbounded DRAM address space via the extended spill/fill ops, they just cost more.
Any given program on any given member will find some architectural limit to increased performance: FU count, cache sizes, retire bandwidth, etc. – and scratchpad size. So for any given configuration there will be customers who will tell us “if you would only increase the size of this feature then my program would run so much faster; pretty please with sugar!”. But as we scale any member’s configuration pretty soon we get to the performance of the next bigger member. So we’ll say “just buy the bigger member”, to which they will say “but that costs more money!”.
Everybody wants Gold performance at a Tin price. We have to set the configurations to provide reasonable coverage of the price/performance spectrum. The currently defined sizes of scratch on the various members are just placeholders, and will certainly be tuned as the members go to market. But the tuning will be driven by market segmentation considerations, not technical ones. Welcome to the high-tech biz.
- in reply to: Scratchpad design decision #3452
A “slow belt” is an interesting idea. The problem is congruence over control flow. When execution diverges at a branch, the various subsequent paths may drop different numbers of operands, so when the paths rejoin later the older belt items may have been pushed further along on one path than another. The two (or more) possible belt layouts at the join point must be brought into congruence so that in subsequent references each operand has a unique belt offset for its temporal address. The “rescue” and branch operations make things congruent. On the fast belt, the lifetimes are brief enough that there are relatively few operands that need to be shuffled to establish congruence, so the instruction bit space for arguments in rescue and branches is relatively small.
However, the scratchpad, or a slow belt, is for long-lived values by definition, and the lifetime of its working set would extend over much control flow. A slow belt would have to be kept congruent, just as the fast belt must be. However, the working set at the scratchpad lifetime level is much larger than the fast belt working set, so slow rescues or other slow reshuffling ops would need to name operands from a much larger name space, and would need many more bits to encode. A maximal fast rescue on a Silver (fast belt 16) is 16*4 = 64 bits, which is easily encodable. Meanwhile a slow rescue on a 256-position space would need 256*8 = 2048 bits, which we can’t encode. In addition the logical-to-physical mapping hardware for the slow namespace would need to be much bigger than that of the fast belt, likely even bigger than the logical-to-rename-register mapping hardware of a legacy OOO machine.
By the way, a maximal rescue on a 32-belt member needs 160 bits, which needs four flow slots to encode; on a 64-belt a maximal needs 384 bits which is past anything we have configured or have thought to configure; and more than that is right out in the present encoding. It would be nice to be able to encode such ultra-wide constants, not only for wide rescue but also for things like bit-matrix multiply, but we don’t support that at present.
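For a quick check of that arithmetic, here is a throwaway sketch (rescue_bits is just an invented helper; all the totals come from the figures above):

    #include <stdio.h>

    /* Sketch: a maximal rescue names every position once, and each name
     * needs ceil(log2(positions)) bits, giving the totals quoted above. */
    static unsigned rescue_bits(unsigned positions) {
        unsigned name_bits = 0;
        while ((1u << name_bits) < positions)
            name_bits++;
        return positions * name_bits;
    }

    int main(void) {
        printf("16-position fast belt:  %u bits\n", rescue_bits(16));  /* 64   */
        printf("32-position fast belt:  %u bits\n", rescue_bits(32));  /* 160  */
        printf("64-position fast belt:  %u bits\n", rescue_bits(64));  /* 384  */
        printf("256-position slow belt: %u bits\n", rescue_bits(256)); /* 2048 */
        return 0;
    }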
In exchange for the expensive congruence, a slow belt would have a more compact “spill” operation than scratchpad spills do, because slow-belt spills would not need to express the destination address. However, because the slow belt is populated by long-lived values that are going to be referenced many times during their lifetime, a slow belt would have a larger fill/spill (read/write) ratio than the fast belt, which reduces the value of the compact slow spill.
Our scratchpad alternative uses spatial addressing, which moves the logical-to-physical mapping to the compiler. As a result the scratchpad is always congruent and control flow can be ignored regardless of the scratchpad size. The spill op needs to carry an address, but scratch addresses are small (likely 8 to 16 bits in different configurations), and spill is as relatively rare in scratch as it would be in a slow belt.
All in all, conceptually a “slow belt” is possible, but I doubt it would be practical. Definitely worth thinking about though; thank you.
- in reply to: Loop compilation #3445
rd() and wr() are no longer used to reach scratchpad; we now use another overload of spill/fill instead. It’s just a nomenclature change.
Streams remain undisclosed (and incompletely implemented). There will be a talk someday.
- in reply to: Loop compilation #3444
Yes, sadly the Wiki has not kept up with the machine. We are very resource constrained until the next funding round.
- in reply to: Loop compilation #3443
“I am thinking to write a simulator / dynamic recompiler for Mill-like ISA to be simulated (for research mostly), with input machine code mostly written by hand at this time.”
If so, why not just join the Mill team? You’d have access to the real tools, and would get equity in the company.
Be aware that hand writing Mill assembler is not recommended 🙂 It’s hard enough to read, even when you know the machine.
- in reply to: Loop compilation #3442
Witold – Mill branches can only be to the head of an ebb, so you can’t both fall into the epilog and also branch to it. That’s why there’s a “leaves” op in the actual asm. Having the prologue branch around the loop body is not something the specializer does; it comes from LLVM, and you should see it when compiling for other architectures too.
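Illustratively (a sketch of typical LLVM loop rotation, not output from any particular build):

    /* After LLVM's loop rotation the source loop becomes a guarded
     * do/while: the prologue tests the trip count and branches around
     * the body, and the epilog is reached from both paths. */
    long f(long n) {
        long acc = 0;
        if (n > 0) {            /* prologue guard branches around the body */
            long i = 0;
            do {
                acc += i;       /* loop body: its own ebb on the Mill */
            } while (++i < n);  /* bottom-tested back branch */
        }
        return acc;             /* epilog: reached only via branches */
    }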
- in reply to: Loop compilation #3441
Veedrac, you make an excellent code generator 🙂 The compiler got rid of the add by using an fma() op instead, and you’re using brtr rather than the loop branching forms, but otherwise it’s right on, by my eyeball.
- in reply to: The Compiler #3438
Scheduler:
As the specializer is now implemented, the prioritization and the scheduling are not split into two passes; instead the prioritization is done on the fly during scheduling. Scheduling itself is fairly straightforward: dependencies and latency determine a target cycle, and the op is placed there if a supporting slot is free. If not, then we search for one, up or down the tableau depending on the op. Prioritization is a more complicated set of heuristics that use both a priority among different categories of ops, without consideration of dataflow, and a priority among ops in the same category based on dataflow and belt-size considerations. For more detail you’d need to work through the code, and the heuristics are likely to change as we gain experience.
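A minimal sketch of the placement step only (the tableau layout, slot count, and search-direction flag are all invented; this is not the specializer’s code):

    #include <stdbool.h>

    /* Sketch of "compute a target cycle, then probe for a free slot,
     * searching up or down the tableau"; the data layout is invented. */
    enum { MAX_CYCLES = 256, SLOTS = 8 };

    static bool tableau[MAX_CYCLES][SLOTS];  /* true = slot occupied */

    /* target = ready cycle derived from dependences plus op latency.
     * Returns the cycle the op was placed in, or -1 if none found. */
    static int place_op(int target, bool search_down) {
        for (int c = target; c >= 0 && c < MAX_CYCLES;
             c += search_down ? 1 : -1) {
            for (int s = 0; s < SLOTS; s++) {
                if (!tableau[c][s]) {        /* supporting slot free? */
                    tableau[c][s] = true;
                    return c;
                }
            }
        }
        return -1;
    }

    int main(void) {
        place_op(3, true);   /* lands in cycle 3, slot 0 */
        place_op(3, true);   /* cycle 3 again, next free slot */
        return 0;
    }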
Retire stations:
As with other aspects of family member configuration, the number of stations will be set based on extensive tuning and measurement for workloads characteristic of the intended market. As a rough starting point, a station count equal to the belt length seems to work well.

Scratchpad:
That’s an excellent description of scratchpad behavior – would you be interested in writing a white paper? The only change needed: the OS doesn’t spill the scratchpad on task switch. We just note where in the circular buffer the task switch occurred, so the buffer can be unloading into one task’s save space while it is simultaneously being loaded or used by a different task from its own save space. The only thing that gets changed at task switch is the address that the hardware uses for load/unload.
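A conceptual sketch of that bookkeeping (the structure and names are invented; the real mechanism is in hardware, not C):

    /* Conceptual only: on a task switch nothing in the circular buffer
     * is flushed. The hardware records where the switch occurred and
     * redirects the background load/unload engine to the incoming
     * task's save space; draining of the outgoing task's entries
     * continues concurrently. */
    typedef struct {
        void *save_space;      /* this task's backing store in memory */
    } Task;

    typedef struct {
        unsigned switch_pos;   /* buffer index where the last switch happened */
        void    *unload_to;    /* save space still draining (old task) */
        void    *load_from;    /* save space now filling (new task)   */
    } ScratchEngine;

    static void scratch_task_switch(ScratchEngine *e, unsigned current_pos,
                                    Task *outgoing, Task *incoming) {
        e->switch_pos = current_pos;           /* note the boundary */
        e->unload_to  = outgoing->save_space;  /* keep draining old entries */
        e->load_from  = incoming->save_space;  /* new entries come from here */
    }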
Questions and responses:

Yes, you are right: it is very hard to set aside one’s own familiarity and answer (or prepare a talk) at a level that addresses the knowledge and experience of the audience. It is especially so when the audience is quite varied and it’s necessary neither to bore those familiar with the subject nor to lose those who are not. We do our best.