- Thomas D | Participant | March 26, 2019 at 7:39 pm | Post count: 16
HELLO??? Did Ivan/Roger/Mark leave us and now this startup is dead, too?
- Ivan Godard | Keymaster | April 2, 2019 at 7:16 pm | Post count: 558
Mostly we’ve been working with tests in C; currently 95+% of the suite compiles. The C++ library is coming up because we are doing the OS kernel in C++, but it’s really only starting. Of course compile-time only code like the STL ports easily, but the rest will be a lot of work. The only part that is not just a port-and-debug is throw/catch, for which the standard implementation approach doesn’t work on a bare machine. We know what we will do, but that will be new code, which is slow to produce while we are still elfing things out.
For the C tests we have been throwing whatever we can find for free at it. We have the entire Rosetta suite in, for example. Probably the biggest single test in the suite is the MicroPython interpreter. There are a bunch more tests available that we can’t yet use because they need a file system – the bare-machine problem again; an API to use the host file system using outcalls is on the spike, but low priority.
As for comparisons, we have a commercial harness running (can’t remember the name) that does host vs. target compares across various stats and different test runs. With that we’ve found and fixed several sim-level problems: we didn’t have a stride prefetch predictor configured, exit predictor bugs, poor cache replacement, that kind of thing. By eyeball after that work Silver seems to be running similarly to our server x86s. Of course we can only measure execution, and will have to armwave power and area.
Actually, the biggest issue to my mind coming out of the comparisons is how apples-to-oranges they are, IPC especially. Sometimes the A2O gives us more instructions than x86: all our static speculation counts as instructions, but OOO run-ahead doesn’t; likewise a load-op-store sequence counts as three ops for us while an x86 mem/op/mem counts as one. Sometimes the A2O gives the x86 more: call is one op for us, potentially dozens for x86. About the only thing that is more apples-to-apples is whole-program wallclock-to-wallclock and there Silver seems pretty close on individual tests and sometimes ahead, and it’s still early days for our code quality. That’s without pipelining up yet; I’m real encouraged because I’m seeing loop sizes compress by a factor of three or more on Silver when I enable piping.
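To make the counting mismatch concrete, here is a small sketch with invented numbers. Nothing in it is a measurement from the Mill simulator or from any x86; the iteration and cycle counts are assumptions chosen only to show that two machines taking the same time can report very different IPC when one counts load, op, and store separately and the other counts a single mem/op/mem instruction.

```python
# Hypothetical counts for the same source-level work. These numbers are
# made up for illustration and are not simulator or hardware results.

def ipc(ops_counted, cycles):
    """Ops (or instructions) counted per cycle."""
    return ops_counted / cycles

ITERATIONS = 1000  # assumed number of read-modify-write updates
CYCLES = 1000      # assume both machines take the same number of cycles

# Counted as three separate ops: load, op, store.
split_count = 3 * ITERATIONS
# Counted as one mem/op/mem instruction.
fused_count = 1 * ITERATIONS

split_ipc = ipc(split_count, CYCLES)
fused_ipc = ipc(fused_count, CYCLES)

# Same work, same wallclock, but the IPC figures differ by the counting rule.
print(f"split counting: IPC = {split_ipc}")
print(f"fused counting: IPC = {fused_ipc}")
```

Under these assumed numbers the wallclock is identical while the reported IPC differs by a factor of three, which is why whole-program time is the fairer comparison.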
For now Gold isn’t really giving better numbers than Silver – the extra width can’t be used by the test codes and won’t be until we not only have piping working but also can unroll up to saturate the width. All the configs also need tuning – Silver to run one daxpy per cycle piped and Gold to run two, piped and unrolled, both with excess resources cut out to save power and area.
There are tuning issues in the smaller members too, Tin and Copper. There the issue is belt size. Even an 8-position belt is enough for the tests’ transient data, but the codes also have long-lived data, and the working sets of that data don’t fit on the smaller belts, and some not on the 16-belt of a Silver. As a result the working sets get spilled to scratch and the core essentially runs out of scratch with tons of fill ops, much like codes on the original 8086 or DG Nova. This is especially noticeable on FP codes like the numerics library, for which the working set is full of big FP constants. Working out of scratch doesn’t impact small-member performance much, but has scratch bandwidth consequences, and power too that we can only guess at. We may need to config the smaller members with bigger belts, but that too has consequences. In other words, the usual tuning tradeoffs.
As soon as pipelining is up and we have comparison numbers we expect to pull the trigger on the next funding round, perhaps ~$12M. There are a bunch of things we want to file more patents on, but those, and more talks, will wait for the round.
- kwinz | Participant | June 30, 2019 at 12:28 pm | Post count: 1
You are programming an OS kernel now in C++?
Sounds like an awful business decision for a hardware startup.
In the interest of “crawl, walk, run”:
Start with an OS-less platform first. Something that doesn’t need an OS, like an Arduino platform board.
Yes, it won’t highlight all the novel address space innovations and target “specializations” that you bring to the table but it’s a start.
Then iterate. Similar to how SiFive did it with their RISC-V offerings.
I worry about you guys. All the best!
- Ivan Godard | Keymaster | June 30, 2019 at 2:57 pm | Post count: 558
Much less burdensome on a Mill, but still will be a long time and a lot of work, yes.
However, we need to start the OS work so as to have a test and verification code suite for the system aspects of the chip. Even an Arduino needs interrupt handling, atomics, … Standard benchmarks don’t exercise those parts of the design.
Various announcements coming soon (for some value of soon).
- goldbug | Participant | July 17, 2019 at 6:05 am | Post count: 45
Maybe open source the kernel? There are a lot of file systems out there; I am sure someone could port one. Maybe from Genode? Even if it is not perfect, it might serve as an MVP.
- goldbug | Participant | July 20, 2019 at 8:03 am | Post count: 45
I could not find anything on Google. What is A20? Is it a profiler? An emulator?
- Ivan Godard | Keymaster | July 20, 2019 at 9:40 am | Post count: 558
Don’t know anything about “A20” either, nor why you ask. I did mention “L4”, see https://en.wikipedia.org/wiki/L4_microkernel_family
- cpt_charisma | Participant | September 16, 2019 at 5:30 pm | Post count: 1
On the Tin and Copper (and all members, really), it seems like you can get a 25%-400% (depending on what you’re doing) effective belt length increase if you can get the specializer to add something like conform ops every couple of instructions to clear dead values or move useful ones to the front. If 80% of values are only used once on average, it makes sense to get rid of them asap, leaving more space for useful values. The specializer should already know live/dead info, since it has to figure out whether to spill things. The question is whether you have enough holes in the schedule to take advantage. It might be worth adding conform to other slots or adding a few bits to the encoding specifically for this purpose. It’s extra hardware, but maybe less than a longer belt?
You could also do something goofy like automatically killing belt values that get spilled or written to memory. You would have to add additional ops for cases where you don’t want that behavior. I have no idea whether this would be worthwhile, though.
Of course, this assumes you usually have at least a few dead values on the belt.
- Ivan Godard | Keymaster | September 16, 2019 at 7:16 pm | Post count: 558
An update: Conforming is now built into branch ops, so the only use of a separate op is to recover live values from the mix of live and dead, as you suggest. Consequently the former “conform” op is now spelled “rescue”.
Rescue is cheap in execution time but expensive in code space when a value is only used far from where it is defined. A significant part of the specializer heuristics is devoted to finding a balance between use of rescue and of spill-fill.
We looked into auto-kill, but so far it seems to not fit the hardware timing; figuring out how things get reordered in the specializer is faster than trying to do it at runtime.
Rescue is only useful when the currently live values fit on the belt but some are out of reach. If the live-value count exceeds the belt size then there’s nothing for it but spill and refill when belt pressure is lower.
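The rescue-versus-spill distinction can be sketched with a toy belt. This is only a model under stated assumptions, not the Mill's actual semantics: `ToyBelt`, `drop`, `rescue`, and `run` are hypothetical names, the belt is treated as a plain fixed-length queue, and "out of reach" is read here as "about to fall off the end", which is my interpretation.

```python
from collections import deque

class ToyBelt:
    """Toy model: new values drop at the front, the oldest fall off the end."""

    def __init__(self, size):
        self.slots = deque(maxlen=size)

    def drop(self, value):
        # appendleft on a full bounded deque discards the item at the far end.
        self.slots.appendleft(value)

    def rescue(self, keep):
        # Move the chosen (live) values to the front, keeping their relative
        # order; everything else stays behind them and ages off first.
        kept = [v for v in self.slots if v in keep]
        rest = [v for v in self.slots if v not in keep]
        self.slots = deque(kept + rest, maxlen=self.slots.maxlen)

def run(rescue_first):
    belt = ToyBelt(4)
    for v in "abcd":          # 'a' is the oldest value and is still live
        belt.drop(v)
    if rescue_first:
        belt.rescue({"a", "d"})   # live set fits on the belt, so rescue works
    belt.drop("e")            # one more result arrives
    return list(belt.slots)

print(run(rescue_first=False))  # 'a' has fallen off the end
print(run(rescue_first=True))   # 'a' survives near the front
```

In this model rescue only helps because the two live values fit on a four-position belt. With five live values there is no arrangement that keeps them all, which corresponds to the spill-and-refill case described above.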
- Thomas D | Participant | September 20, 2019 at 10:50 pm | Post count: 16
Conforming is now built into branch ops, so the only use of a separate op is to recover live values from the mix of live and dead, as you suggest. Consequently the former “conform” op is now spelled “rescue”.
When are you going to regenerate the wiki? I imagine that it is quite old at this point….
- Thomas D | Participant | September 29, 2019 at 8:22 pm | Post count: 16
I don’t believe that I could deliver results commensurate with how exciting I find your work. As a consequence of turning a hobby into a bill-paying job, the job frequently saps my desire to perform my hobby (and given the paperwork that I signed for said job, I am frequently unsure of how to actually contribute to things of pith and moment). I’ll have to pass, for now.
Puke: Nevermind my tights, I thought we were talking about my pith.
Snot: That’s why you have to wash them out!
- LarryP | Participant | May 6, 2019 at 3:48 am | Post count: 78
What’s a daxpy?
one daxpy per cycle
And when you write “cycle,” do you mean a clock cycle or a Mill instruction?
- Ivan Godard | Keymaster | May 6, 2019 at 5:28 am | Post count: 558
What’s a daxpy?
Daxpy is a BLAS function, the most commonly used: “double precision A times X + Y”; see https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms#Level_1. It’s why hardware has FMA (a.k.a. MAC) ops.
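For readers who have not met it, a minimal sketch of what daxpy computes, written in plain Python for clarity. The real BLAS routine updates y in place and takes length and stride arguments; this version returns a new list and is an illustration only, not the BLAS interface.

```python
def daxpy(a, x, y):
    """Return a*x + y elementwise for equal-length sequences of floats.

    Each element is one multiply followed by one add, the pattern that a
    fused multiply-add (FMA) op performs in hardware.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return [a * xi + yi for xi, yi in zip(x, y)]

result = daxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
print(result)
```

Each output element is independent of the others, which is why the loop is a natural target for pipelining and unrolling.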
And when you write “cycle,” do you mean a clock cycle or a Mill instruction?
Absent stalls, Mill issues all instructions of one bundle (formerly described as “all operations of one instruction”, except people got confused by the terminology) in one clock cycle. So cycle and bundle (instruction) are equivalent.