
Ivan Godard
Keymaster

Mostly we’ve been working with tests in C; currently 95+% of the suite compiles. The C++ library is coming up because we are doing the OS kernel in C++, but it’s really only starting. Of course compile-time-only code like the STL ports easily, but the rest will be a lot of work. The only part that is not just a port-and-debug is throw/catch, for which the standard implementation approach doesn’t work on a bare machine. We know what we will do, but that will be new code, which is slow to produce while we are still elfing things out.

For the C tests we have been throwing whatever we can find for free at it. We have the entire Rosetta suite in, for example. Probably the biggest single test in the suite is the micropython interpreter. There are a bunch more tests available that we can’t yet use because they need a file system – the bare-machine problem again; an API that reaches the host file system via outcalls is on the spike, but low priority.
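By way of illustration only, such a shim would just forward libc-level file calls to the host; the host_* names below are invented for the sketch, not the actual outcall interface:

#include <stddef.h>

/* Hypothetical semihosting-style shim: the simulated program's file
   calls are forwarded to the host file system through outcalls.
   The host_* declarations are placeholders, not a real Mill API. */
extern int  host_open(const char *path, int flags);
extern long host_read(int fd, void *buf, size_t len);
extern int  host_close(int fd);

int  sim_open(const char *path, int flags) { return host_open(path, flags); }
long sim_read(int fd, void *buf, size_t n) { return host_read(fd, buf, n); }
int  sim_close(int fd)                     { return host_close(fd); }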

As for comparisons, we have a commercial harness running (can’t remember the name) that does host vs. target compares across various stats and different test runs. With that we’ve found and fixed several sim-level problems: we didn’t have a stride prefetch predictor configured, exit predictor bugs, poor cache replacement, that kind of thing. By eyeball after that work, Silver seems to be running similarly to our server x86s. Of course we can only measure execution, and will have to armwave power and area.

Actually, the biggest issue to my mind coming out of the comparisons is how apples-to-oranges they are, IPC especially. Sometimes the A2O gives us more instructions than x86: all our static speculation counts as instructions, but OOO run-ahead doesn’t; likewise a load-op-store sequence counts as three ops for us while an x86 mem/op/mem counts as one. Sometimes A2O gives the x86 more: call is one op for us, potentially dozens for x86. About the only thing that is more apples-to-apples is whole-program wallclock-to-wallclock, and there Silver seems pretty close on individual tests and sometimes ahead, and it’s still early days for our code quality. That’s before pipelining is up; I’m really encouraged because I’m seeing loop sizes compress by a factor of three or more on Silver when I enable piping.
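To make the counting asymmetry concrete, here is roughly how a trivial example tallies up; this is a sketch of the instruction-set-level view, not any particular compiler’s output:

/* Sketch of the op-counting asymmetry described above. */
void bump(long *p, long n) {
    *p += n;          /* Mill: load, add, store = 3 ops;
                         x86: a single add-to-memory instruction. */
}

long helper(long x) { return x + 1; }

long wrap(long x) {
    return helper(x); /* Mill: call is 1 op;
                         x86: argument setup, call, callee prologue
                         and epilogue, ret - potentially dozens.   */
}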

For now Gold isn’t really giving better numbers than Silver – the extra width can’t be used by the test codes and won’t be until we not only have piping working but also can unroll enough to saturate the width. All the configs also need tuning – Silver to run one daxpy per cycle piped and Gold to run two, piped and unrolled, both with excess resources cut out to save power and area.
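For reference, the kernel in question is the usual BLAS daxpy; “one daxpy per cycle” means the piped loop sustains one iteration (two loads, a multiply-add, a store) per cycle in steady state, and two per cycle on Gold needs the body unrolled so a pair of iterations issue together:

#include <stddef.h>

/* The standard daxpy kernel: y = a*x + y. */
void daxpy(size_t n, double a, const double *x, double *y) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}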

There are tuning issues in the smaller members too, Tin and Copper. There the issue is belt size. Even an 8-position belt is enough for the tests’ transient data, but the codes have long-lived data too, and the working sets of that data don’t fit on the smaller belts – some not even on Silver’s 16-position belt. As a result the long-lived working sets get spilled to scratch and the core essentially runs out of scratch, with tons of fill ops – much like code on the original 8086 or a DG Nova. This is especially noticeable on FP codes like the numerics library, for which the working set is full of big FP constants. Working out of scratch doesn’t impact small-member performance much, but it has scratch bandwidth consequences, and power consequences too that we can only guess at. We may need to config the smaller members with bigger belts, but that too has consequences. In other words, the usual tuning tradeoffs.
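Schematically, the shape of the problem looks like the made-up numerics-style kernel below (not our library code): the coefficients are live across every iteration, so on an 8-position belt they fall off and have to round-trip through scratch, while the per-iteration temporaries fit easily.

#include <stddef.h>

/* Hypothetical kernel with a long-lived FP working set: the ten
   coefficients are needed on every iteration, so a short belt must
   spill them to scratch and fill them back each time; the temporaries
   xi and r are transient and fit fine. */
void poly_eval(size_t n, const double *x, double *y) {
    static const double c[10] = {
        1.0, 0.5, 1.0/6.0, 1.0/24.0, 1.0/120.0,
        1.0/720.0, 1.0/5040.0, 1.0/40320.0, 1.0/362880.0, 1.0/3628800.0
    };
    for (size_t i = 0; i < n; i++) {
        double xi = x[i];
        double r = c[9];
        for (int k = 8; k >= 0; k--)
            r = r * xi + c[k];        /* Horner evaluation */
        y[i] = r;
    }
}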

As soon as pipelining is up and we have comparison numbers we expect to pull the trigger on the next funding round, perhaps $12M. There are a bunch of things we want to file more patents on, but those, and more talks, will wait for the round.