Mill Computing, Inc. › Forums › The Mill › Tools › Simulators › Simulation
Tagged: community software, Forth, functional language, GenAsm
- AuthorPosts
Not yet for public release. The architecture has been running in sim for 5+ years now, but only for conAsm except for our partial compiler, now abandoned. When genAsm and the specializer are ready (Real Soon Now) then we expect to provide either online access via our servers or a downloadable version to run on your own, or both. The full LLVM-based tool chain will take longer.
My idea was to learn Forth by bootstrapping it from scratch, I thought it would make sense to try it with the Mill.
The interesting part of doing Forth on a Mill is not the ops, which are straightforward, but in figuring out how to map the Forth dual-stack program model onto the belt. If you decide to pursue this, get acquainted with the Mill ISA (on the Wiki) and standard Forth, rough out the implementation strategy, and get back to us. If we are ready at that point, you can be a beta tester of the public sim; NDA will be required.
I’ve been thinking about the proper way to get Forth onto the Mill architecture. Like you said, the operations/words are the easy part, once the double stack system is set up.
My idea was to steal from the cheap
call
in the Mill to serve the function of the return stack (though we lose the ability to abuse the return stack for other needs). The data stack makes me think of the belt. colorForth uses an 8 item circular buffer, but is for a different use case. Does the compiler keep track of the virtual stack like this:virtual stack: 3 4
belt: 3 4forth word: +
virtual stack: 7
belt: 7 3 4The belt seems to be much more along the lines of a purely functional language where GC is cheap; whenever something falls off the belt. Here I am torn between:
1) Move towards having a Forth where the stack effects are known and verified at compile time. This allows the compiler to keep track of the “virtual” stack. Compiler manages and saves items to a software defined stack otherwise.
1b) use the scratchpad for the remainder of the stack.
2) keep is a “small” Forth: only use the belt, words CANNOT use more than the belt or temp usage of the scratchpad. Restricted, but simple.
3) go very simple; words use the belt+scratchpad, but words themselves are responsible for putting everything onto a software defined stack. This is simpler, but slower as everything has to go through indirection.
4) Deviate from Forth stack-based concept and use a belt-based concept.
I would enjoy trying these and building upon them. And any suggestions would be appreciated.
It’s clear that it can be done, and will be as efficient as a general-register machine. However, just what will be the best way will require some serious head-scratching and experiment.
My personal starting point would be to keep the Forth stack in belt and scratchpad. That permits everything to be fixed latency, which gives the compiler a chance to generate stall-less code. Some Forth words are not machine primitives, and the expressions can be evaluated directly on the belt. However, most Forth values are either consumed shortly after creation, or wind up lower in the Forth stack for reuse. A naive compiler can use rescue to get rubble out of the belt, but when a live value still would call off then it can spill to scratchpad.
Sounds like (1b). I’ll call that approach “juggling the stack”. The compiler statically keeps track of the stack state and occasionally interrupts the normal flow with stack management. With small enough words, the juggling can all be accomplished during call/return’s.
Tomberek, Ivan and others interested in Forth or similar:
Ah, Forth; the name brings back good memories!
One could certainly port Forth to the Mill — or one could design/implement another interpreted language that makes better use of the Mill’s architecture (and thus almost certainly faster, simpler and thus likely usable sooner.) Which way to go depends mainly on whether there’s real demand to run existing Forth code. People have done some amazing things in Forth (e.g. Chuck Moore, Forth’s creator, who wrote his own chip-design system in Forth, if I recall correctly), but I’m not sure how many prospective Mill users/programmers/system-designers want to run existing Forth code.
On the other hand, having a working interpreted language for the Mill (either via a simulated Mill or a TBD implementation on an FPGA) would IMHO provide a more programmer-friendly way to try out the Mill than the current assembly/simulation tools — and sooner — even if it’s different than Forth. To me, that seems like a worthwhile option to explore, especially since the LLVM toolchain sounds like it won’t be ready as soon as we might wish.
A small, efficient, interpreted language — one designed to make use of the belt model — might well be easier to implement, especially since (as I understand it) the core interpreter will have to be written in Mill ConAsm (at least until the Mill’s specializer is ready to generate simulate-able/runnable Mill executables GenAsm
Ivan, any projection on when you’ll be able to simulate code written in GenAsm internally? (Or is that working already?)
I find myself leaning somewhat toward a not-textbook-Forth interpreted language for the Mill, because many of the most common Forth “words” (such as dup, swap and over) are for juggling values on the stack, so the word you really want to invoke has all its operands on the stack in the correct order. It seems a shame to enforce strict stack semantics on a machine that has something arguable better (the belt) built right into it. However, it there is real demand (or even strong interest) for Forth itself on the Mill, I’d still be interested in getting that working.
Tomberek,
especially if you’re learning Forth and haven’t done a port, I suggest-
Threaded Interpretive Languages: Their Design and Implementation
by R. G. Loeliger. Probably out of print, but used copies are apparently available from a certain large online book (and everything else) seller.
I’d be interested in further discussion and possible collaboration. Feel free to email me at: ursine at (a large, free email service, hosted by g00gle.)
Ivan, any projection on when you’ll be able to simulate code written in GenAsm internally? (Or is that working already?)
It is working already for a subset of the machine: all scalar and most vector operations, and simple control flow. Not working yet: memory, reduction vector operations, automatic vectorization, pipelining, self-tuning prediction, some random corners. Progress is being made, but I cannot promise a schedule.
I’d be interested in further discussion and possible collaboration. Feel free to email me at: ursine at (a large, free email service, hosted by g00gle.)
Feel free to start a topic section here. You also can (and should) use the new public Wiki for design and documentation.
I do not recommend any project written in conAsm. Any program longer than the three-line snippets used in the hardware verification (the intended use) is hopeless.
Have you considered direct interpretation of genAsm?
Ivan
“Have you considered direct interpretation of genAsm?”
I’m considering it. I’ve updated the wiki page on genAsm to fix some EBNF syntactic issues, and have the spec itself parsing correctly now.
Are there any public snippets of genAsm to test against? I know it’s in flux, but that’s why I’d execute from the specification directly, to absorb changes.
David,
You wrote:
I’ve updated the wiki page on genAsm to fix some EBNF syntactic issues, and have the spec itself parsing correctly now.
Please note that most of the wiki pages under architecture or infrastructure (except for the talk/discussion tab) are periodically overwritten from non-wiki source via the “generator.” Therefore your changes are likely to get overwritten. I strongly suggest you advise Ivan of your suggested changes and let him update the source upstream of the wiki.
Offhand, I don’t see much benefit of interpreting genAsm, since to get anywhere, we’ll already have access to compiled genAsm. (Though feel free to convince me otherwise.) I’d rather write a more nearly user-friendly interpreter, like Forth.
Ok, since the EBNF wasn’t completely syntactically correct, I assumed it was just a straight hand-edited wiki page.
To quote Ivan, “Have you considered direct interpretation of genAsm?” 🙂
It sounds like the official system would specialize to conAsm, and the emulator would run a conAsm with useful cycle tracking and such for a particular Mill family member spec. I think there’d be some value in an externally sourced, straight genAsm interpreter (actually, given that I’m working in Lisp, I’d convert it to Lisp code to get native machine speed). Since it would ignore many CPU specifics, it might run faster and be able to iterate features quicker.
Basically, it would be for testing your genAsm conceptually, while genAsm->conAsm->CPU-emulator would test how a Mill runs your code.
Plus, wouldn’t a MillForth emulator be written in genAsm? Need to run that somewhere, and the official chain isn’t out yet. One without all the exacting details nailed down would be a good substrate for getting the basic design decisions implemented (assuming the genAsm VM is correct in what it runs).
Ok, since the EBNF wasn’t completely syntactically correct, I assumed it was just a straight hand-edited wiki page.
It was and is and probably will be. Thank you for the clean-up, and I’d appreciate it if you would keep an eye on the page for things I screw up in the future.
No code samples yet. The only big chunks of genAsm we have are floating-point emulation functions, and they have NYF content. I’ve done a little work on the EEMBC benchmark, but it all needs things that genAsm doesn’t do yet, like loops and memory and structs and …
Ivan,
Can genAsm’s go construct do backwards branches? (that is to a label earlier in the same function??
If so, then that’s sufficient for writing the endless loop of an interpreter (though I’ll welcome the syntactic sugar of better looping constructs.)
From what I can tell, memory operations and statically allocated memory are the key other pieces needed to write a (very) basic Forth interpreter, e.g. one that starts out handling only scalar integers and pointers.
Yes. The label declaration must precede both the label placement and any GO or LEA referencing the label, but there is no ordering requirement between placement and GO/LEA.
Memory is coming. You actually can write memory ops (load/spore/lea) now, what you can’t do is declare memory and reserve space.
LarryP,
I agree. The more I learn about the Mill the more I feel that trying to shoehorn Forth’s stack-based approach into a Mill environment would be frustrating and not very productive. It would only be useful if one wanted to use existing Forth libraries; not my intent.
So what would a higher-level language look like coming from the Mill-style of computing? The immutable values remind me of Functional languages. Items falling off the belt remind me of automatic GC. I did think about writing an interpreter for the Mill Instruction set in Haskell, paramaterized of course by the family-member specifics.
This recent talk by Rob Pike describes a simple APL variant that would be fun to think about on the Mill; an interpreted language that can use the Mill’s vectorization nicely.
A belt-Forth is a very interesting language target. I’d be happy to help, both on behalf of Mill Computing and also personally with my long professional and personal background in language design.
A couple more thoughts/opinions about Forth (or another small, interpreted language) for the Mill:
1. Several common Forth words (for example swap and drop) are in conflict with the Belt’s single assignment nature. We could use the conform and rescue operations to force-fit Forth’s semantics onto the belt, but that seems ugly and problematic — unless there’s some real desire to have Forth per se, on the Mill. If that ugliness were hidden inside the interpreter, I could grit my teeth and live with it. The advantage of porting Forth is that it’d be “just” a port of an existing, well documented interpreter design, and then existing Forth code could be (relatively) quickly up and running on the Mill and be demonstrated to interested parties.
2. Even a minimalist interpreter would need (the functional equivalents of) working memory, std-input/keyboard, std-output/console and persistent storage. Having std-error would also be nice. Compared to modern languages and CPUs, a Forth-like interpreter would have rather modest needs for working memory and persistent storage. If we want to get something running — and not tear out hair trying to code it in ConAsm, then we’d need genAsm to support basic memory operations (load, store and hopefully stack alloc./cutback) and something (simulator outcalls?) to serve as std-input and std-output. If the interpreter is running on a simulated Mill (seems likely short term), then a we could probably make a specified address range of simulated memory simulate non-volatile memory, thus finessing persistent storage (or perhaps simulate all memory as persistent.)
3. Although APL brings back other fond memories (e.g. of writing Dungeons and Dragons support code on an IBM 5100), I think an interpreted language that does scalar ops well is a better near-term target. I suggest leaving vectorization wizardry for version 2.
4. While this kind of preliminary arm waving is fun, I think we need some clarity on how such an interpreted language could be used to help bootstrap the Mill into quicker and broader interest and use. I don’t want to go all UML on folks, but what should be our key use case(s) for Forth or similar on the Mill?
Thoughts, opinions, corrections? Please chime in!
Hi LarryP, tomberek, other lurkers 🙂
What kind of projects has everyone been involved with previously?
I’m just curious 🙂
I can also be reached by email: will at millcomputing.com
Ivan wrote:
You also can (and should) use the new public Wiki for design and documentation.
Is the Wiki now public? I see a login page its the upper right corner, but no obvious way to create an account (and unsurprisingly, forum credentials don’t work.) If the Wiki is open for comments and additions, please let us know how!
Thanks
The wiki is now finally public.
Thanks for reporting the.login problem; we’ll get it fixed!
Here’s a use case that makes sense to me:
If Mill hardware (e.g. an FPGA prototype of Tin or Copper) is ready before the LLVM modifications are sufficient to compile Mill executables, then a simple interpreter could serve the roles of monitor, debugger and image loader. I think that could help a lot with board bring-up. Even if hardware only might be ready first, I think having such an interpreter would be a worthwhile risk-reduction strategy.
Under the above use case, I’d reverse my previous opinion:
In that use case, I’d suggest first porting a minimal version of vanilla Forth, without trying to make optimal use of the Mill’s unique characteristics. But once that Forth interpreter is working and stable, then we could use what we’ve learned from bringing that up on the Mill (or Mill simulation) to design a new interpreted language, one better suited to the Mill’s unique characteristics.Thoughts?
While language design is fun, chips to run existing code are what sells. We are a commercial enterprise, so we can only invest in what gets product sooner. We’re happy to encourage and help a Forth (it *is* fun after all), but we wouldn’t bring up a usable FPGA until after massive sim using compiler output.
The core of the architecture – belt, encoding, memory model is done, and we have enough to do those parts of an FPGA. But larger scale parts need larger scale code to verify, and in practice that means a compiler.
genAsm language question re labels and branches
How are program labels and control transfers (e.g. conditional branches) expressed in the genAsm language? Could you possibly make an example or two available here or on the wiki?Thanks much!
Warning: genAsm is very much a work in progress, and information here and on the Wiki is subject to change without notice.
For ordinary conditionals use the syntactic sugar of the ifClause. There will be similar sugar for looping constructs, TBD.
For the very lowest level you explicitly declare labels using labelDef. You transfer to them using the go pseudo-function. Labels are placed at a join, syntactically the label identifier followed by a colon.
Transfers to and fall-throughs into a join carry data. The widths of the carried data is declared in the labelDef, and the count of carried operands must fit on the belt. Like all ops and functions, go may be used in suffix, infix, or functional form; the target label is the final argument and any other arguments are carried data. The arguments to go (and the current data when falling into a label) must match the declaration of the label in number and width.
Labels are scoped based on the point of declaration. The corresponding join, and any go invocations targeting the label, must be in the same scope. And of course a label be declared before other use, and must appear in exactly one join.
Example:
label w lab;
if (b eql 0) then 17 go lab fi;
….
F(x,y,z) lab: + a -: c;Here if b is zero then control transfers to lab carrying the data value of 17. Otherwise execution continues until the function call on F, whose return value become the fall-through value at the join at lab. However control reaches lab, the data is added to a and that is bound to c.
Transfers to and fall-throughs into a join carry data. The widths of the carried data is declared in the labelDef, and the count of carried operands must fit on the belt.
Ivan,
Would you kindly explain more about the data carried into joins? There are two things I’d like to better understand:
1. when listing/describing belt operands associated with a label, does one describe the entire belt contents, or just what all the branches to that label have added to the belt, with respect to some previous (TBD?) belt state? (Or something else entirely?)
2. If we’re writing genAsm code, then we don’t yet know the belt length — and ideally, we want this code to run correctly on any Mill, no matter its belt length. If the specializer is supposed to model an abstract Mill with an infinite belt in other (non-branch/label) contexts — e.g. by inserting fills and spills as needed — why is this seemingly-member-specific belt-length constraint exposed by genAsm branch labels? Have I misunderstood something?
1) You list only the arguments to the label; there’s no way to list the belt contents anyway, because only the specializer knows that. At a join point anything else on the belt is dead anyway. Note that falling into the label is equivalent to an (elided) go to the label from the falling code.
2) Sorry my explanation wasn’t clear. Labels, like functions, can have an arbitrary argument list in the source, including the genAsm source, and functions (but not labels) can be VARARGS. The specializer maps these arguments in a member-dependent way to a set that are passed on the belt, and a remainder that are passed in memory (functions) or scratchpad (labels). To a first approximation you may assume that leading arguments that are less than or equal to the size of the largest member-supported scalar will be on the belt in argument order, while larger arguments (in particular quad data on a non-quad member) may be split into a tuple and passed piecemeal.
I grant that “must fit on the belt” was misleading; I was thinking in terms of the specializer implementation. Internally the specializer turns label declarations into fewer-argument labels (for which all arguments can go on the belt) plus a set of scratchpad locations that are reserved in the scratchpad map so they won’t be used for spill/fill. The go operations are also transformed into a set of explicit spill ops plus a conform for what gets passed on the belt, plus the transfer op itself (br and friends).
At the join, the current argument set comprises the incoming stuff on the belt and the set of scratchpad locations. About the only thing you could do at a multi-argument label in genAsm is to rebind the incoming arguments, and the rebindings name whatever belt position or scratchpad location they name, and can be used normally thereafter. Loops are not yet done, but I expect that the back-transfer will use the same mechanism to get all the loop-carried data to the next iteration.
VARARGS handling is not yet finalized. Currently we expect to pass the entire “…” list in memory (arguments before the “…” pass normally) but we may come up with better ideas at some point.
Greetings all,
Those of you interested in an interpreter/interpreted language for the Mill may enjoy reading the following paper, available (as of today) free online:
“The Structure and Performance of Efficient Interpreters” by M. Anton Ertl and David Gregg. Here’s the link: http://www.jilp.org/vol5/v5paper12.pdfOne thing this paper focuses on is that on many CPUs, branch prediction for indirect branches (e.g. via function pointers) is often much poorer than branch prediction for direct branches (like simple if/then/else constructs.) I suppose I should re-watch the presentation on the Mill’s prediction mechanism, to see what’s revealed about how the Mill will handle indirect branches. This issue is much less important for compiled code than for interpreted code, but apparently still gets involved in run-time method dispatch,e.g. in polymorphic objects.
I note that Yale Patt’s work is cited several times in this paper and I recall reading other interesting papers by Ertl.
Enjoy!
Summarizing Mill prediction:
Most machines predict branches (taken/not taken) and have a separate table (Branch Target Table) with actual addresses for dynamic branches. These table are consulted only when execution reached the decoder, and so must be very fast and hence small, especially the BTT, which causes pain in codes with lots of indirect branches such as virtual function dispatch.
The Mill predicts exits, not branches. Mill code is organized into Extended Basic Blocks, one-entry-many-exit code sequences. Unusually, the Mill form of EBB may contain embedded calls. If you enter an EBB, no matter how many exit branches the EBB contains only one will be taken, and the Mill predicts that exit, including the target address. In effect, the Mill has only a BTT and no branch table at all, and no entries for exits not predicted to be taken.
The Mill organization permits run-ahead prediction – we don’t need to look at the code to follow a prediction chain. Thus the tables can be well away from the decoders, and hence can be big, slow, and cheap.
Now that Mill advantage assumes that things are actually predictable; it doesn’t buy anything (no predictor does) when control flow is random and you miss all the time. Cliff Click calls such code megamorphic, and besides virtual function dispatch the poster-child for megamorphic is the dispatch loop/switch of interpreters.
Consequently a performance interpreter, for Mill or anything, has to avoid switch statements in the dispatcher. It is far better to take a little more time in the encoding or translating to change op indices into actual jump/call instructions in execution order. Even if the result of a JIT is just a sequence of call ops to exec functions that are not inlined, the JITed code will run *much* better than if the original byte codes were indices into a switch.
Of course, if everything gets inlined then you have one machine EBB for each EBB that is present in the (Java, say) bytecode source – and you run like a house afire.
There are two ways to do the inlining. The slow way, which is member independent and will give the best performance, is to turn each bytecode into a function call on the exec function, all represented either in genAsm or directly in genForm. Mark all the exec functions (well, the simple ones anyway) as “inline”, and use the genAsm and specializer libraries to get you target binary, and then call the OS to mark the result as executable. Not fast. But of course, you can do it once for all of each package at package-load time, and thus amortize the cost.
The fast way to do inlining, which is member-dependent and will not give as good performance as the other route, is to look up in a pre-built table (one per member) the actual raw binary bit pattern of a Mill instruction. Said instruction contains a call on the exec function and whatever belt diddling is needed for the interpretation. Table entries will have two pieces, one for flow side and one for exu side, and your baby-JIT will just glue the pieces on the correct end of the EBB being built. When you hit the end of an EBB in the bytecode then give your new binary EBB to the OS to bless as code and start on the next. If you can’t tell where the EBB boundaries are in the bytecode then just treat each basic block (which you should be able to delimit) as an EBB.
Easy, huh? 🙂
Very nice to see it online; I hadn’t read it.
Has it been behind a paywall previously?
Anton is a name I recognise; he was slightly active in my Corewar-playing youth; small world 🙂
Another great intro to interpreter loop performance was given by Google at I/O 2008:
We often used the GCC computed gotos extensions in Corewars interpreters.
We also used a lot of explicit data prefetches.
Will,
I don’t know when that Ertl paper was/wasn’t paywalled. It turned up in a Google search I did today; I recognized the author and took a look. I hope it stays available.
Gcc computed gotos? Not the infamous “computed come from” statement? 😉
Thanks to Will, we now have a place on the Wiki for community ideas. If you’re logged into the forums, you can add your comments and ideas there. Feel free to chime in!
Under this category on the top-level wiki page, I’ve started the http://millcomputing.com/wiki/Software page. (Puzzlingly, the “Software” link under the category “Community” is still appearing in red, despite the fact that it’s not empty. I don’t know why.)
Under that, I’ve started a page about Forth or similar interpreters for the Mill; see
http://millcomputing.com/wiki/Forth_for_the_Mill_and/or_Forth-like_variants,_better_suited_to_the_MillFrom what I can tell, if you’re logged into the forums, you are also logged into the wiki, and can edit. I’m confining my edits to the discussion/talk tab for anything except those pages under the Community column on the TL wiki page.
- This reply was modified 9 years, 11 months ago by LarryP. Reason: clarity
Will: “What kind of projects has everyone been involved with previously?”
I’m still involved in 8-bit home/retro computing, where Forth is still a pretty well regarded language, and all the fiddly little implementation decisions have a large impact on performance.
Well before hearing about the Mill, I did create a VM interpreter on the 6502 which has a machine-generated instruction set, AcheronVM[1]. (The fact that anything outside of hand-allocated opcode values is astonishing to people is astonishing to me.) I was trying to come up with some balance between stack-based, accumulator-based (both of which can have great code density), and frame-based (which can bring a lot of expressive power to more complex data activity, and less shuffling than the other two). I settled on a fine-grained, non-hiding sliding register window with both direct register access and a variable accumulator (the “prior” used register can act like an implied accumulator). Within the bounds of 6502 code interpreting it, where no parallel effects can be had as in hardware, it’s a pretty good condensing of high level code representation. Of course, in hardware like the Mill, you can really go out of the box and redo everything. In software, the expense of dispatch is serial and significant.
From my work in Forth and various experiments like that (which I am still using for Commodore 64 hobby game development, and have matured farther offline), I agree that a straightforward Forth-like ABI/interpreter planted on the Mill belt model might not be the best way to go. A Forth which uses traditional memory-based stacks would be simple to write, but wouldn’t have the performance gain of belt-based data. A Forth compiler which reworks the stack operations into dataflow expressions would be much more involved to write (especially if keeping all the ANS Forth standard operations), and there would likely be an impedance mismatch between that which is natural and efficient in traditional Forth, vs what comes out natural & efficient in the compiled Mill code.
Beyond that, I’m not sure how much Forth source code reuse is practical. Older Forth code bases would have a strong representation of ANS standard Forth, but there are many other dialects. I’m not certain what more recent users of Forth go for. Beyond the most basic operations, it is a very plastic language. Even Chuck Moore laments the lack of new Forth language ideas post-standardization. The dialects he uses diverge from the ANS standard.
So if something with a good matchup for the Mill can be designed, I think Forth is a great language for allowing easy, simple bootstrapping of a system, but not necessarily to pull in large existing Forth-language code bases. Given its age, Forth’s memory and filesystem abstractions are next to non-existent, and stuff like threading and networking are often outside its realm completely. Forth applications tend to be system-owning, hitting the hardware & memory space straight.
I’m actually quite interested in being involved in building non-cycle-exact Mill emulators to help bootstrap the user software ecosystem.
David,
Good to hear of another person interested in Forth on the Mill.
I started three wiki pages about “Mill Forth.” See:
http://millcomputing.com/wiki/Preliminary_Design_for_Mill-Forth
http://millcomputing.com/wiki/How_best_to_map_the_core_components_of_Forth_onto_the_Mill%3F
Please take a look and feel free to add your ideas, perspectives and useful links. Same for anybody else reading. If you’re logged into the forums, you should also be logged into the wiki.
The idea of using the belt and scratchpad to serve as Forth’s data stack is attractive for speed reasons. However, I think that’s not the right approach for an initial version, since the belt doesn’t model a FIFO stack. While I’d like to let the Mill’s call/return machinery be Forth’s return stack, I don’t think that’ll work well, because the inner interpreter needs read/write access the stack for more than just call/return matters, including for the common DO … LOOP constructs. Breaking the interpreter’s ability to manipulate the return stack would IMHO make the result too far from Forth.
Overall, I agree that the only people likely to be porting much existing Forth code to the Mill are those of us interested in Mill Forth. Still, my instinct is to design something that’s recognizably Forth, for the first iteration. A belt-oriented interpreted language (or a type-aware Forth) are interesting notions, but I’d like to separate those from getting a Forth interpreter running on the Mill (initially under sim.)
So, my instinct is to go for a fairly vanilla port of Forth for the first iteration, with both stacks in memory. Doing so does not preclude using the Mill’s stack pointer, though the automatic stack cut-back upon true function return might make that problematic. Though if the inner interpreter never returns, we might be able to use it that way.
Happy New Year,
Larry
- This reply was modified 9 years, 11 months ago by LarryP. Reason: clarity
I’ve been thinking a bit about how to do interpreters in general on the Mill, if you are not doing a JIT. The ideas are applicable to Forth, although they might be easier in a one-stack model.
The actual execution part of a code will tend to be only a few and frequently only one instruction on a wide Mill. Consequently the bulk of the cost will be in the dispatch overhead. Of course, some codes may take much longer, but the fast-exec codes should dominate.
The dispatch overhead in turn will be dominated by the cost of the dynamic jump in the direct-threaded code (or the switch, if that is used), and icache churn if the engine doesn’t fit. However, at least for Forth I think 2X32k should hold the entire interpreter. At 3 bytes/op (generous, even on big Mills) you have 20k ops available for the interpreter; the Forth code itself is on the data side and doesn’t collide.
The problem with that dynamic jump is that it is megamorphic and will miss all the time. The next exec is not likely to be in the i$0 microcache, so besides the 4 cycle miss we have a i$1 fetch to do, say 7-8 cycles total depending on banking and cache implementation details. When the exec is only 1-2 cycles you can see where the time goes.
So the question is how to make the indirect jump go away. Let’s assume that we know a priori that all ops take two exec cycles and the Forth code has no branches. If we do:
dispatch0
exec0
dispatch1
exec1
…then we stall in every dispatch. But the Mill supports deferred branches: compute the address and predicate now, but do the transfer then. See First Winner Rule in the docs. This would permit us to dispatch early:
dispatch0
noop
dispatch1
noop
dispatch2
<- transfer of dispatc0 happens here exec0 dispatch3 <- transfer of dispatch1 happens here exec1 dispatch4 <- transfer of dispatch2 happens here exec2 dispatch 5Because we are computing the actual target, the true target will override a (bogus) prediction in the decoder and we won’t miss on the transfers; essentially we have pipelined the execution.
There are issues. Mill deferred branches take a static deferral value; we can’t calculate how long we should be waiting to get all the in-between execs done. So the amount of the deferral has to be selected to fit an optimal number of dispatches and typical execs in the number of cycles it will take for the prediction/decode hardware to have the target instruction ready to go; that is a very member dependent number, but readily tunable.
Then there are the execs that don’t fit in the allotted standard cycle count. Those can be done by a call out to the full exec routine; branch deferral is frame-local on the Mill, so it doesn’t matter how long the out-of-line exec routine takes, the deferred transfer will be waiting to take effect after the return. The call itself should always predict, because it is static.
The biggest problem is what to do when the Forth etc. codes themselves do a branch, because we will have already dispatched (and have transfers waiting for) the codes down the fall-through path. In essence the deferred dispatch is a form of branch prediction of the Forth code, and the prediction can miss at the Forth level even though there are no misses at the interpreter level. When we do miss we need to get rid of those queued-up transfers.
One way is to use the Mill inner op and then exit the putative loop when we execute what turns out to have been a Forth branch code that didn’t fall through. The exit discards all in-flights, both data in FUs and FWR transfers pending. We will then need to do a new inner and a few dispatches without execs to get the pipeline restarted at the correct Forth target, but that shouldn’t be hard. However, inner and its exit are spiller-active and have a power cost of roughly a call; it’s not clear whether this matters in the event though.
A second approach would be to put an explicit “branch coming up” code in the Forth code, which would cause the dispatch logic to stop putting out deferred branches and let the pipeline drain to empty when the exec reached the branch. The branch exec would then restart the pipe down whichever path was correct.
Which leads to a final approach: add deferred branching to the Forth codeset. With that, the dispatch pipeline always knows where the next real code will be and can do a Mill deferred branch to have it ready for exec at the right time. The drawback is that frequently there is nothing useful to do between the time the Forth predicate is known and when we have to defer to to avoid a Mill miss; welcome to no-ops (or take the Mill miss and stall).
Whether any of this pays depends on the dynamic frequency of branches among Forth codes. Unless you add a Forth-level predictor you will be taking a 7-8 cycle hit at every branch to restart the pipeline, and a predictor is asynchronous enough to the execution cycle that I don’t think it can be coded into the static code of the Mill along with the dispatch and exec. The rule of thumb for common languages is one branch every 5-8 instructions, so pipelining the dispatch should be a win: a 6-instruction basic block with a two-cycle dispatch/exec becomes 5×2+8 or 18 cycles, whereas without the dispatch pipeline it is 6×8 or 48 cycles. Factors of 2.5 are not to be sneezed at.
Ivan,
Are there any instructions that ask for/demand a pre-fetch of code from a *different* ebb?
Since the Forth data stack will almost certainly be in memory, the interpreter will likely often spend time loading operands from the data stack onto its own belt, before calling functions that implement Forth primitives. So if there’s a way to get code (e.g. for an upcoming primitive) fetched ahead of time, that would help speed up the interpreter loop considerably.
For a first-ever Mill interpreter, I’m more concerned with getting something to work correctly (with understandable code — a challenge in assembler!) than I am with optimal efficiency. That said, fast execution is certainly a plus.
Thanks,
Larry
The only prefetch facilities under program control are intended to deal with DRAM misses. We believe that manual control of prefetch at higher levels will just make things run slower. Note that manual prefetch is largely unnecessary on a Mill because of run-ahead prediction, which automatically prefetches down the predicted path. Conventional prediction techniques cannot do that.
Just to stick my oar in, but if the tricks you need to pull to make a decent interpreter exceed those needed to write a naive compiler, why not just write a naive compiler?
As a languages/compilers enthusiast, Forth has always struck me as an amusing exercise in implementation, but a total pain in the posterior to read or write. Maybe that’s just me.
I would *really* love it if the first language on the Mill was inspired by the lambda calculus…
Would you like to tackle a Scheme port?
I shudder to think of what call/cc would require on the Mill, if the built-in stack frame architecture is used. Would it require some sort of privilege escalation to introspect and shuffle the stack around? The scratchpad and potentially the spiller would need to be involved, too.
There will be specific tail-call support; NYF.
As for playing games with the stack frames, no, you can’t do that manually: if you can mangle the stack frame, so can a Bad Guy attacker. However, we have made threading so cheap that you can use system threads instead.
Regular continuations are essentially the same as exceptions or longjmp as far as the hardware is concerned, and are supported; NYF. Resumeable continuations we are still thinking about.
I would love to tackle something Lispish, but it pains me to say I couldn’t guarantee the time to do it. Tempt me with a simulator and I might have a go!
Regarding Scheme’s tail-calls and weird call/cc stuff: for a decade I worked on the Mercury compiler which had several back-ends, typically targetting GCC (there was at least one register-machine like “VM” and at least one native C back-end). For tail calls in the “native” C back-end, we’d resort to using GCC’s “goto label-variable” in various places. Tail recursion was converted into while loops, but non-inlined mutual recursion just left stuff on the stack. I cannot recall an instance where this got us into trouble, and we were running very large Mecury programs (a few commercial enterprises use Mercury as well).
I got something lispish in the back of my head, and already wrote a little writeup for it that sits somewhere on my hardrive.
It goes even much further than traditional lisps when it comes to DSLs and parsing to the point where the main language is little more than a rule set to link any kind of DSL together into a common framework, i.e. it is a DSL to combine DSLs and as such provides full control over the whole compiler/runtime stack from within the language itself. The type system is its own DSL, and the core language or compile target is a form of mill general assembly, also its own DSL, there is a DSL to define new DSLs etc.But there are other things with higher priority to do. Mill is not in the language design business. My writeup may find its way onto the community wiki at some point though and when the mill architecture itself is a little more stable I may attempt an implementation.
# Flom – a Functional Language on the Mill
Ralph Becket at outlook dot com, 2-Feb-2015
I considered calling this “Mill Functional”, but that had an unfortunate contraction of its name.
## Introduction
The [Mill](http://millcomputing.com) is a new CPU architecture. The designers have invited curious parties to propose a simple language to be compiled or interpreted under simulation. Real Programmers use functional languages and this is my contribution.
Flom is a statically typed, strict functional programming language in the vein of [Core Haskell](https://ghc.haskell.org/trac/ghc/wiki/Commentary/Compiler/CoreSynType) as used in GHC (although GHC’s Core is, of course, lazy). Some of the implementation ideas I have taken from my experience developing the [Mercury](http://mercurylang.org/) compiler.
The goal is to produce something useful, which can be implemented reasonably efficiently with modest effort, and which can be later extended largely via syntactic sugar.
## Some examples
If you’re familiar with Haskell or ML or F#, these examples are standard. (The syntax is somewhat conjectural and the reader is invited to use their imagination to fill in any gaps.)
### Fibonacci
The slow way:
let fib n = if n < 2 then 1 else (fib (n - 1)) + (fib (n - 2))
The quick way:
let fib n = let fib' n i j = if n <= 1 then i else fib (n - 1) j (i + j) in fib' n 1 1
### Lists
// All introduced types are either type synonyms or // discriminated unions (aka algebraic data types). // Types can take type parameters, as here. // type list T = nil | cons T (list T) let len xs = case xs of nil = 0 | cons _ xs' = 1 + (len xs') let map f xs = case xs of nil = nil | cons x xs' = cons (f x) (map f xs') let foldr f xs z = case xs of nil = z | cons x xs' = f x (foldr f xs' z) // Note that type constructors are also functions: // let append xs ys = foldlr cons xs ys let filter p xs = let f x ys = if p x then cons x ys else ys in foldlr f xs nil
### Quicksort
let qsort xs = case xs of nil = nil | cons x xs' = let as = qsort (filter (> x) xs') let zs = qsort (filter (< x) xs') in append as (cons x zs)
## Core Flom and desugaring
Expr ::= k // Literal constant; e.g., an integer. | ID // A named quantity (variable, function, etc.) | Expr Expr ... // Function application | LetsExpr | CaseExpr | ( Expr ) LetsExpr ::= LetExpr* in Expr LetExpr ::= let ID ID* = Expr // Parameters indicate a function. CaseExpr ::= case Expr of Pattern | Pattern ... Pattern ::= ID (ID | _) * = Expr // First ID is the constructor tag; | _ = Expr // _ is the "don't care" pattern. Type ::= type ID ID* = ID TypeExpr* | ID TypeExpr* ... TypeExpr ::= ID ID* // Type constructor or type synonym.
### Desugaring
if
expressionsif x then y else z
desugars to
case x of True = y | _ = z
### Desugaring nested expressions
f (g x)
desugars to
let tmp = g x in f tmp
(and likewise for all the other cases until we’re left with nothing else to simplify).
### Desugaring currying
If, say,
f
is a function of arity three occuring in an application of arity one, thenf x
desugars to
let f' y z = f x y z in f'
### Desugaring free variables and local functions
A free variable is defined here as a name whose value cannot be statically determined. The free variables of a function are desugared away by explicitly constructing and passing an environment argument and lifting the local function definition to the top level. For example,
let f x y = let g a b = a * x + b * y in g 11 22
desugars to
let f x y = let env = tuple2 x y in g' env 11 22 let g' env a b = case env of tuple2 x y = a * x + b * y
where
tuple2
stands for some built-in arity-2 tuple data type.### Other desugaring opportunities
The usual opportunities exist for adding nicer syntax for, say, lambdas, lists, etc.
## Typechecking and type inference
The type system is plain old Hindley-Milner and [Algorithm W](http://en.wikipedia.org/wiki/Hindley%E2%80%93Milner_type_system) will do the trick nicely.
## Implementation detail
[This section is essentially a set of suggestions for writing a reasonably efficient Flom compiler without great effort. The first instance would be a cross-compiler, which ought to be sufficient to craft a native compiler in Flom itself.]
All atomic values (built-in types such as ints, chars, floats, etc.) occupy space equivalent to a pointer.
All values constructed via data constructor live on the heap and are referenced by pointer. Each such construction is a vector of pointer-sized words, comprising the data constructor tag followed by any argument values.
The usual discriminated-union optimisation tricks can all be applied at some point.
Equality and comparison on atomic values is by identity. Equality and comparison for algebraic data types is lexicographic.
There is no destructive assignment! Any such facility, including IO, must be provided via functions written in something other than Flom and somehow imported.
Garbage collection ought to be straightforward, but it isn’t clear to me exactly what is involved with the Mill.
### Translation to genAsm
This is just a sketch… and my grasp of genAsm is wobbly at best.
[[e]]v
denotes the translation of Flom expressione
, bound to genAsm symbolv
.k ---> k -: v_k where k is a constant x ---> v_x where x is a function parameter or local let x = e local or top-level variable ---> [[e]]v_x case e of patExpr1 | patExpr2 | ... ---> [[e]]v_e if [[patExpr1]]v_c elif [[patExpr2]]v_c ... else <disaster!> where [[tag x1 x2 ... = e]]v ---> load(v_e, 0) <tag> eql then load(v_e, 1) -: v_x1 load(v_e, 2) -: v_x2 ... [[e]]v f x1 x2 ... function application ---> [[x1]]v_x1 [[x2]]v_x2 ... call f v_x1 v_x2 ... -: v_f N.B.: the simple direct tail recursion case should be compiled into an ordinary loop. ctor x1 x2 ... data constructor function ---> func ctor(x1, x2, ...): <alloc space>v_c store(v_c, 0, <tag>) store(v_c, 1, x1) store(v_c, 2, x2) ... retn v_c let f x1 x2 ... = e top-level function definition ---> func f(x1, x2, ...): [[e]]v retn v
- This reply was modified 9 years, 10 months ago by ralphbecket.
- This reply was modified 9 years, 10 months ago by ralphbecket.
- This reply was modified 9 years, 10 months ago by ralphbecket.
- This reply was modified 9 years, 10 months ago by ralphbecket. Reason: Trying to preserve code formatting
Should this be continued here, or on the Wiki?
Apologies if this was the wrong spot — I just wanted to keep the discussion going.
Sorry I wasn’t clear. I was not complaining at all, just looking ahead, assuming wild success of your project, and wondering how best we could support such Mill-related but non-product projects. I agree with your choice that here in the forum seems to be a good starting place, but the sequential-post style of the forum doesn’t seem best for a shared development effort for such projects if they take off. And we wouldn’t move it into our company product development environment, because we won’t be inventing new languages as products. So, assuming success, where should those working on a volunteer and unofficial but Mill related project do their work and communicate?
Now that I’ve thought about it for a while, the Wiki doesn’t seem quite right either – suitable for canon documentation perhaps, if the project reaches public release, but not really a good interchange medium for those working on such projects.
The source code (assuming it got that far) would be a natural fit for github. I’ll bring the question of the rest up at tonight’s general meeting.
Of course, all this is moot until such projects actually happen. But from your posting here, and various questions and comments from academia I’ve heard, it looks like they will happen sooner or later. Better to have some support plans at least thought about in advance.
Here’s the agenda item for tonight; I’ll post a summary of discussion:
++++++++++++++++++++++++++++++++++++
There has been discussion on the forum about how to implement functional languages on a Mill. Recently a poster put in a long writeup for a new language with features especially tailored for a Mill (http://millcomputing.com/topic/simulation/#post-1688). This is an example of a general category: independent non-product volunteer efforts that are Mill-related in some way. There will with time be a lot of these as an ecosystem develops around us.The question: how should we support such efforts?
+++++++++++++++++++++++++++++++++++++
@ralphbecket: I am trying to figure out additional ways in which Flom can take advantage of Mill specifics/peculiarities.
On the Forth topic: When can we expect to be able to have access to some tooling. I think talking about it may help pave the way, but nothing beats actually putting ideas into practice.
We talked a bit at our Tuesday meeting about how best to support such independent efforts and projects. We concluded that the Forum here was the right venue for discussions; the Wiki is more for canon, official documentation of the Mill and our products, but would be appropriate for such an effort when it moved from being a discussion subject to an actual facility with users of its own. For the source code, when an effort got that far, github seemed right; putting it on our SCS ran risks of mutual IP pollution.
As for tooling and its ETA – I wish I knew, but our resource base is still so small that real scheduling is impossible. We did decide that we would release, and support, alpha-level tooling on a “use at your own risk” basis. A minority felt that we should wait for commercial-grade for fear that alpha-grade would forever paint us with “broken and unusable”, but the majority felt that we should not try to hide our rough-shod nature.
Please be patient.
Ivan wrote:
We did decide that we would release, and support, alpha-level tooling on a “use at your own risk” basis. A minority felt that we should wait for commercial-grade for fear that alpha-grade would forever paint us with “broken and unusable”, but the majority felt that we should not try to hide our rough-shod nature.
I’m glad that some Mill software will be released in some fashion before reaching commercial grade. I’m happy to support any NDA or other terms/guidelines to make such an alpha-grade release help — rather than hinder — bringing the Mill architecture to market.
It seems that the genAsm assembler, specializer and a corresponding Mill simulator would be the first pieces of the toolchain available. So my focus is more focused on what we can do with those tools. Forth (or an interpreted language in that vein) seems doable with that partial Mill toolchain. I don’t know enough about functional languages, though this discussion has made me learn a little about them.
@tomberek – I’m not sure that Flom is intended as a demonstration of the Mill’s cleverness. Rather, my idea was to come up with something that
- is easy to implement on the Mill (Forth’s dual stacks seems problematic),
- is comfortable enough for something like real programming,
- and is amenable to optimisation.
The simple, regular, statically checked type system of Flom (it’s a stripped-down ML) isn’t just expressive, it makes code generation much simpler and the generated code can be much more efficient.
Ralph: Just looking at those constraints, I think that modifying Forth might be an easier language to bootstrap. Here are some of the basic modifications:
*** Edit: How do you format text on this board?
Operations would read existing belt data, and add new values
Would operations be able to address the belt, or would there be a belt “PICK” operation to put previous items onto the front of the belt?
Former would be more Mill-ish, latter would be more Forth-ish
Even in the Mill-ish form, addressing the stack would be optional, defaulting to the front of the belt
Addressing the stack should be able to be done via label, not just depthlit 1 \ push a literal number lit 3 x: add \ Declares 'x' to be the location on the belt holding the (first?) return value of this instruction ... lit 4 add x \ References wherever on the belt 'x' was. The other parameter defaults to the front of the belt. sub y,0 \ Numeric references are a belt depth. Does this look too much like subtracting the literal number zero?
If ‘lit’ is properly hidden by the parser, then it could look more Forth-like instead of asm-like, but need to delineate the parameter references:
1 3 x: add ... 4 add (x) sub (y 0) \ Parens are a little un-forthy 4 add ^x sub ^y ^0 \ Decorate each reference instead of the beginning/end tokens?
No sharing of stack data between functions, except parameters and return values
Function declarations would have to specify the number of params & returns
Can’t be variable, unless there’s vararg lists in memory, which I wouldn’t bother with
This limits a number of clever operations and could bloat up user codeDynamic types
The compiler would likely need simple type inference to figure out which typed instruction to generate for ‘add’ etc
Could either foist that on the user, or does the specializer do some of this?
Forth foists it on the user, so might as well do that hereI think building on this basic thinking, of applying Mill-isms to Forth instead of vice-versa, would yield a simple usable bootstrapping language. I think it would end up looking a fair amount like GenAsm, though.
- This reply was modified 9 years, 10 months ago by David.
In theory normal HTML tags should work for editing.
Any stack-model language would not have much semantic kinship with genAsm, which is dataflow.
There should be an editing toolbar above where you type in your post. The left button labeled ‘b’ starts bold, so put your cursor where you want the bold part to start, then click the button. When I did it, I got the tag <strong> rather than <b> for bold.
I looked this up on the bbPress support forum, and bbPress has a restricted set of allowed html tags for anybody other than admins for security reasons. Surprisingly, <b> was not on that list.
Still interested in a minimal bootstrap capability. This is less for the OS support that seems to be the current focus and more for the single-board/coprocessor use case, but something that would be fun to work on. From other posts it seems this is now in the realm of possible.
The constraints are legal (patents), financial, and administrative. We are all impatient.
- AuthorPosts
You must be logged in to reply to this topic.