Forum Replies Created

  • Ivan Godard
    in reply to: The Belt #1578

    What are the security implications of this?

    Exactly. Nothing “enforced by the compiler” can be called “security”; it is at most a sanity check and aid to debugging. Compilers have bugs and Bad Guys have assemblers.

  • Ivan Godard
    in reply to: Simulation #1627

    Ok, since the EBNF wasn’t completely syntactically correct, I assumed it was just a straight hand-edited wiki page.

    It was and is and probably will be. Thank you for the clean-up, and I’d appreciate it if you would keep an eye on the page for things I screw up in the future.

  • Ivan Godard
    in reply to: Simulation #1616

    Summarizing Mill prediction:

    Most machines predict branches (taken/not taken) and have a separate table (the Branch Target Table) with actual addresses for dynamic branches. These tables are consulted only when execution reaches the decoder, and so must be very fast and hence small, especially the BTT, which causes pain in codes with lots of indirect branches such as virtual function dispatch.

    The Mill predicts exits, not branches. Mill code is organized into Extended Basic Blocks (EBBs), one-entry-many-exit code sequences. Unusually, the Mill form of EBB may contain embedded calls. If you enter an EBB, then no matter how many exit branches the EBB contains, only one will be taken, and the Mill predicts that exit, including the target address. In effect, the Mill has only a BTT and no branch table at all, and no entries for exits not predicted to be taken.

    The Mill organization permits run-ahead prediction – we don’t need to look at the code to follow a prediction chain. Thus the tables can be well away from the decoders, and hence can be big, slow, and cheap.
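
    To make the run-ahead idea concrete, here is a toy model in C: a table maps an EBB’s entry address to its one predicted exit target, so the predictor can chase a whole chain of future EBBs without ever touching the code itself. The table contents and the addresses are invented for illustration; real tables are keyed and sized very differently.

    #include <stdio.h>

    /* Toy exit-prediction table: EBB entry address -> predicted exit target.
     * Only the chain-following idea is modeled here. */
    #define TABLE 16
    static unsigned predicted_exit[TABLE];

    int main(void) {
        predicted_exit[0] = 4;   /* EBB entered at 0 predicted to exit to 4 */
        predicted_exit[4] = 9;
        predicted_exit[9] = 2;
        unsigned pc = 0;
        for (int i = 0; i < 3; i++) {    /* run ahead along the chain */
            pc = predicted_exit[pc % TABLE];
            printf("prefetch EBB at %u\n", pc);
        }
        return 0;
    }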

    Now, that Mill advantage assumes that things are actually predictable; it doesn’t buy anything (no predictor does) when control flow is random and you miss all the time. Cliff Click calls such code megamorphic, and besides virtual function dispatch, the poster child for megamorphic code is the dispatch loop/switch of interpreters.

    Consequently a performance interpreter, for Mill or anything, has to avoid switch statements in the dispatcher. It is far better to take a little more time in the encoding or translating to change op indices into actual jump/call instructions in execution order. Even if the result of a JIT is just a sequence of call ops to exec functions that are not inlined, the JITed code will run *much* better than if the original byte codes were indices into a switch.
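
    As a concrete (if non-Mill) illustration, here is a minimal direct-threaded dispatcher in C using the GCC/Clang computed-goto extension. Each opcode handler ends with its own indirect jump, so the predictor sees many separate sites instead of one shared switch branch; the suggestion above goes further still and turns the bytecode into a straight-line sequence of actual call instructions. The tiny ISA is invented for illustration.

    #include <stdio.h>

    enum { OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };

    static void run_threaded(const unsigned char *code) {
        /* One dispatch-table entry per opcode (GCC labels-as-values). */
        static void *dispatch[] = { &&push1, &&add, &&print, &&halt };
        long stack[64], *sp = stack;
    #define NEXT goto *dispatch[*code++]
        NEXT;
    push1: *sp++ = 1;               NEXT;
    add:   sp[-2] += sp[-1]; --sp;  NEXT;
    print: printf("%ld\n", sp[-1]); NEXT;
    halt:  return;
    #undef NEXT
    }

    int main(void) {
        const unsigned char prog[] = { OP_PUSH1, OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };
        run_threaded(prog);   /* prints 2 */
        return 0;
    }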

    Of course, if everything gets inlined then you have one machine EBB for each EBB that is present in the (Java, say) bytecode source – and you run like a house afire.

    There are two ways to do the inlining. The slow way, which is member-independent and will give the best performance, is to turn each bytecode into a function call on the exec function, all represented either in genAsm or directly in genForm. Mark all the exec functions (well, the simple ones anyway) as “inline”, use the genAsm and specializer libraries to get your target binary, and then call the OS to mark the result as executable. Not fast. But of course, you can do it once for all of each package at package-load time, and thus amortize the cost.

    The fast way to do inlining, which is member-dependent and will not give as good performance as the other route, is to look up in a pre-built table (one per member) the actual raw binary bit pattern of a Mill instruction. Said instruction contains a call on the exec function and whatever belt diddling is needed for the interpretation. Table entries will have two pieces, one for flow side and one for exu side, and your baby-JIT will just glue the pieces on the correct end of the EBB being built. When you hit the end of an EBB in the bytecode then give your new binary EBB to the OS to bless as code and start on the next. If you can’t tell where the EBB boundaries are in the bytecode then just treat each basic block (which you should be able to delimit) as an EBB.
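
    For concreteness, here is a hedged host-side sketch of the fast way, transposed to x86-64 Linux since Mill encodings are not public: a baby-JIT glues pre-built binary templates (one per bytecode op) into a buffer and then asks the OS to bless it as code. On a Mill the table entries would have flow-side and exu-side pieces glued onto opposite ends of the EBB being built; here each template is a single blob, and the bytecode ISA is invented.

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    enum { OP_ZERO, OP_INC, OP_END };

    /* Raw per-op templates, the analogue of the pre-built per-member table. */
    static const unsigned char TPL_ZERO[] = { 0x31, 0xC0 };       /* xor eax,eax */
    static const unsigned char TPL_INC[]  = { 0x83, 0xC0, 0x01 }; /* add eax,1   */
    static const unsigned char TPL_END[]  = { 0xC3 };             /* ret         */

    int main(void) {
        const unsigned char bytecode[] = { OP_ZERO, OP_INC, OP_INC, OP_INC, OP_END };
        unsigned char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) return 1;
        size_t n = 0;
        for (size_t i = 0; i < sizeof bytecode; i++) {  /* glue the templates */
            switch (bytecode[i]) {
            case OP_ZERO: memcpy(buf + n, TPL_ZERO, sizeof TPL_ZERO); n += sizeof TPL_ZERO; break;
            case OP_INC:  memcpy(buf + n, TPL_INC,  sizeof TPL_INC);  n += sizeof TPL_INC;  break;
            case OP_END:  memcpy(buf + n, TPL_END,  sizeof TPL_END);  n += sizeof TPL_END;  break;
            }
        }
        mprotect(buf, 4096, PROT_READ | PROT_EXEC);     /* bless it as code */
        int (*fn)(void) = (int (*)(void))buf;           /* usual JIT idiom, not strict ISO C */
        printf("%d\n", fn());                           /* prints 3 */
        return 0;
    }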

    Easy, huh? 🙂

  • Ivan Godard

    You ask good questions 🙂 (A variation on the Chinese curse: “May you receive good questions”). I had to go look at the code in the sim to answer.

    We detect various problems in the load/store logic in different ways and at different places, so please bear with a rather complex explanation.

    The generic Mill address is base + (scaled index) + offset. The offset is a literal in the instruction; the index, if present, comes from the belt; and the base may be either from the belt or one of a specific set of specRegs.

    SpecRegs cannot be None or NaR. However, the encoding used for the specReg (in the instruction forms that use it) may have bit combinations that are not valid. Use of such a combination, as with other illegal encodings, reflects a compiler bug and gets a fault and the usual trip through handlers. There is special handling for tpReg (“this” pointer). If tpReg is used as a base when the register has not been set then a NaR is passed to the rest of the address calculation.

    There is a different set of checks when the base is a pointer from the belt. There is a width check that the argument is a scalar of pointer width; failure faults, because this is a compiler bug. None/NaR is passed on to the index check.

    If the index is present there is a width check, and we fault a vector index; a scalar index can be of any width. The index is then resized to pointer size and scaled. If the resize or scaling overflows then we fault. This should probably pass on a NaR instead, but does not in the present sim. If the index was a None/NaR then it is passed on. For a missing index we pass on a zero.

    At this point we have a base, index, and offset input to the 3-input address adder, and either base or index could be None/NaR. The address adder applies Mill pointer arithmetic rules, and in particular checks for overflow into the non-address bits in the pointer. An overflow is noted.

    Now comes the None/NaR handling: if either the base or the index was None, then the final address is one of the input Nones. Otherwise, if either of the inputs is NaR or the address-adder result overflowed, then the final address is one of the NaRs from the input, or a new NaR if only overflow happened. Otherwise the final address is normal.

    Now, if the operation is a store and either the final address or the value to be stored is None, then the store is suppressed. If either the final address or the value to be stored is NaR then we fault. Otherwise the store proceeds, but may still fault on the protection check for the final address.

    If the operation is a LEA and the final address is None then the result is the None as a pointer-sized scalar. Otherwise if the final address was NaR then the result is the NaR as a pointer-sized scalar. Otherwise the result is the final address as a pointer-sized normal scalar.

    If the operation was a load and the final address was None, then the result is an operand of the loaded scalarity and width, all of whose elements are the None scaled to the element width. Otherwise, if the final address was NaR, then the result is an operand of the loaded scalarity and width, all of whose elements are the NaR scaled to the element width. Otherwise the final address is normal and the load proceeds.
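
    Here is a hedged sketch of those rules in plain C. The Meta tags, the field names, and the 60-bit address field are invented for illustration; real Mill metadata lives in hardware, not in structs.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef enum { NORMAL, NONE, NAR } Meta;
    typedef struct { Meta meta; uint64_t bits; } Operand;

    /* 3-input address adder with the None/NaR precedence described above. */
    static Operand form_address(Operand base, Operand index, uint64_t offset) {
        if (base.meta == NONE)  return base;    /* None wins over NaR */
        if (index.meta == NONE) return index;
        if (base.meta == NAR)   return base;    /* then any input NaR */
        if (index.meta == NAR)  return index;
        uint64_t sum = base.bits + index.bits + offset;
        if (sum >> 60 != base.bits >> 60)       /* overflow into the non-address */
            return (Operand){ NAR, 0 };         /* bits makes a new NaR          */
        return (Operand){ NORMAL, sum };
    }

    /* Store: None suppresses, NaR faults, otherwise the store proceeds. */
    static void do_store(Operand addr, Operand value) {
        if (addr.meta == NONE || value.meta == NONE) return;   /* suppressed */
        if (addr.meta == NAR  || value.meta == NAR)  abort();  /* fault      */
        printf("store of %llu to %llu\n",
               (unsigned long long)value.bits, (unsigned long long)addr.bits);
    }

    int main(void) {
        Operand base  = { NORMAL, 0x1000 };
        Operand index = { NONE, 0 };             /* a None index...          */
        do_store(form_address(base, index, 8),   /* ...suppresses the store  */
                 (Operand){ NORMAL, 42 });
        index = (Operand){ NORMAL, 4 };
        do_store(form_address(base, index, 8), (Operand){ NORMAL, 42 });
        return 0;
    }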

    You’re welcome.

  • Ivan Godard
    in reply to: Simulation #1607

    1) You list only the arguments to the label; there’s no way to list the belt contents, because only the specializer knows them. At a join point anything else on the belt is dead anyway. Note that falling into the label is equivalent to an (elided) go to the label from the falling code.

    2) Sorry my explanation wasn’t clear. Labels, like functions, can have an arbitrary argument list in the source, including the genAsm source, and functions (but not labels) can be VARARGS. The specializer maps these arguments in a member-dependent way to a set that is passed on the belt, and a remainder that is passed in memory (functions) or scratchpad (labels). To a first approximation you may assume that leading arguments no larger than the largest member-supported scalar will be on the belt in argument order, while larger arguments (in particular quad data on a non-quad member) may be split into a tuple and passed piecemeal.

    I grant that “must fit on the belt” was misleading; I was thinking in terms of the specializer implementation. Internally the specializer turns label declarations into fewer-argument labels (for which all arguments can go on the belt) plus a set of scratchpad locations that are reserved in the scratchpad map so they won’t be used for spill/fill. The go operations are also transformed into a set of explicit spill ops plus a conform for what gets passed on the belt, plus the transfer op itself (br and friends).

    At the join, the current argument set comprises the incoming stuff on the belt plus the set of scratchpad locations. About the only thing you could do at a multi-argument label in genAsm is to rebind the incoming arguments; each rebinding refers to whatever belt position or scratchpad location it names, and can be used normally thereafter. Loops are not yet done, but I expect that the back-transfer will use the same mechanism to get all the loop-carried data to the next iteration.

    VARARGS handling is not yet finalized. Currently we expect to pass the entire “…” list in memory (arguments before the “…” pass normally) but we may come up with better ideas at some point.

  • Ivan Godard
    in reply to: Simulation #1600

    Warning: genAsm is very much a work in progress, and information here and on the Wiki is subject to change without notice.

    For ordinary conditionals use the syntactic sugar of the ifClause. There will be similar sugar for looping constructs, TBD.

    For the very lowest level you explicitly declare labels using labelDef. You transfer to them using the go pseudo-function. Labels are placed at a join, syntactically the label identifier followed by a colon.

    Transfers to and fall-throughs into a join carry data. The widths of the carried data are declared in the labelDef, and the count of carried operands must fit on the belt. Like all ops and functions, go may be used in suffix, infix, or functional form; the target label is the final argument and any other arguments are carried data. The arguments to go (and the current data when falling into a label) must match the declaration of the label in number and width.

    Labels are scoped based on the point of declaration. The corresponding join, and any go invocations targeting the label, must be in the same scope. And of course a label must be declared before any other use, and must appear in exactly one join.

    Example:

    label w lab;
    if (b eql 0) then 17 go lab fi;
    ….
    F(x,y,z) lab: + a -: c;

    Here, if b is zero then control transfers to lab carrying the data value 17. Otherwise execution continues until the function call on F, whose return value becomes the fall-through value at the join at lab. However control reaches lab, the carried data is added to a and the sum is bound to c.

  • Ivan Godard
    in reply to: Simulation #1598

    While language design is fun, chips to run existing code are what sells. We are a commercial enterprise, so we can only invest in what gets product sooner. We’re happy to encourage and help a Forth (it *is* fun after all), but we wouldn’t bring up a usable FPGA until after massive sim using compiler output.

    The core of the architecture – belt, encoding, memory model – is done, and we have enough to do those parts of an FPGA. But larger-scale parts need larger-scale code to verify, and in practice that means a compiler.

  • Ivan Godard
    in reply to: The Belt #1586

    Sorry for the confusion. In fact those are *both* belt nesting numbers and stack frame numbers!

    The problem is that internally (where we who work on the Mill live) each thread actually has two stacks that are disjoint in the address space, whereas a conventional machine has only one. One of the two Mill stacks is the data stack, rooted in the initial stacklet and containing content that is addressable by the application. Logically each call adds a new data stack frame, but a frame may be empty and have no content.

    The second stack is the spiller stack, rooted in the initial spillet. Its content is not application-addressable (debuggers etc. go through a special library to get at it). Each call adds a new spiller stack frame, which is never empty.

    So those numbers are both belt nesting numbers and spiller stack frame numbers, which are in fact the same thing. And they are also data stack frame numbers, but some data stack frames are invisibly thin.

  • Ivan Godard
    in reply to: Simulation #1584

    A belt-Forth is a very interesting language target. I’d be happy to help, both on behalf of Mill Computing and also personally with my long professional and personal background in language design.

  • Ivan Godard
    in reply to: Simulation #1583

    Ivan, any projection on when you’ll be able to simulate code written in GenAsm internally? (Or is that working already?)

    It is working already for a subset of the machine: all scalar and most vector operations, and simple control flow. Not working yet: memory, reduction vector operations, automatic vectorization, pipelining, self-tuning prediction, some random corners. Progress is being made, but I cannot promise a schedule.

    I’d be interested in further discussion and possible collaboration. Feel free to email me at: ursine at (a large, free email service, hosted by g00gle.)

    Feel free to start a topic section here. You also can (and should) use the new public Wiki for design and documentation.

    I do not recommend any project written in conAsm. Any program longer than the three-line snippets used in the hardware verification (the intended use) is hopeless.

    Have you considered direct interpretation of genAsm?

    Ivan

  • Ivan Godard
    in reply to: The Belt #1576

    Say frame 17 calls frame 18, which does a MULF but returns to 17 before the MULF finishes. When the result pops out of the multiplier it is labeled as belonging to 18, known to be exited, and is discarded.

    If 18 does a tail call instead of a return, the called function just replaces frame 18; it doesn’t add a new frame 19 the way a regular call would. That way it can use the return linkages of 18 when it finally returns back to 17. Now the MULF comes out looking like it belongs to the current 18 (the tail-called function), but it doesn’t. Oops.

    The structures to solve this are obvious, but add cost to everything, not just tail calls.
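
    A toy model of the aliasing in C (the frame tags and the retire check are invented; the real retire hardware is nothing like this):

    #include <stdio.h>

    /* An in-flight result is tagged with the nesting number of the frame
     * that issued it; at retire, the tag is checked against the current
     * frame. */
    static void retire(int result_tag, int current_frame) {
        if (result_tag == current_frame)
            printf("dropped onto frame %d's belt\n", current_frame);
        else
            printf("discarded: frame %d has exited\n", result_tag);
    }

    int main(void) {
        int mulf_tag = 18;   /* MULF issued by the original frame 18 */

        /* Case A: 18 returns to 17 before the MULF retires. The tag no
         * longer matches, so the result is correctly discarded. */
        retire(mulf_tag, 17);

        /* Case B: 18 tail-calls, and the callee replaces frame 18 rather
         * than becoming 19, so the stale result looks current. Oops. */
        retire(mulf_tag, 18);
        return 0;
    }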

  • Ivan Godard
    in reply to: The Belt #1574

    2) Well, I waded through the paper you cite. Once the obfuscation is removed, their method appears to be to accumulate the active set of permissions at the top frame, adding to that set with each tail call as necessary. This is different from keeping each added permission in the stack at the level of its grant and then searching down the stack for a relevant permission when needed. That having a mutable current set at the top removes permission-stack overflow is obvious, but it is necessary to wrap the observation in a few lambdas to achieve publication 🙂

    The approach is not really relevant to the Mill, because it assumes that permissions can be granted piecemeal locally, as in an access control list (ACL) threaded through the stack. Grants are piecemeal on the Mill, but are global and heap-like. No stack search is required; instead the hardware maintains a persistent table accessed through the PLB. This table does not itself change as a result of call, and so does not inhibit TCO.

    However, on the Mill, packages of permissions are grouped into named turfs as an implementation optimization, and a program has the rights of a single turf at any one time, in addition to any global grants it may have acquired. The current turf is local to a frame, and the hardware admits changing the current turf as a consequence of a call (and restoration of the prior turf on return). Because turf is a local right, it potentially interferes with TCO, and in practice the hardware (the spiller) maintains information in each spiller frame sufficient to restore the turf context of the caller. It is this data structure that would inhibit TCO across portal calls.

    To admit TCO across portals, it is necessary to overwrite the current top spiller frame with new information reflecting the new execution context, without overwriting the restoration information. This is trivial to do in the abstract, but hardware is “interesting” in the concrete. For example, what happens if the spiller has the structure half-overwritten when it gets an ECC error on a memory access? These issues are non-trivial, and in software would be addressed by Read/Copy/Update methods, but hardware doesn’t do RCU very well.

    So at this point the most I can say is that I think I understand the issue; we can TCO anywhere that Java and .net can; and we might be able to TCO across portals but I don’t guarantee it.

    3) Belt semantics is more than belt frames. For example, nothing stops a program from starting a MULF and then doing a tail call while the operation is in flight in the multiplier. Yes, it will be thrown away, but there’s no way to reach inside the multiplier and stop the result from coming out. We don’t want that in-flight value dropping on the called function’s belt, and the frame it should be dropping to is gone, courtesy TCO.

    The problem with the belt and TCO is that we don’t want to complicate the engine in a way that everyone pays for, just to provide something of theoretical interest only (TCO over portals) or lacking a significant customer base (TCO in general). I think the spiller frame-ID problem will fall out of the spiller implementation if we are careful, which will give us regular TCO. I’m less confident about portal TCO.

    5a) VARARGS etc. will have to be passed on the heap or copied in such a way that they don’t overwrite themselves. Either way it’s the province of the compiler; there is nothing the hardware can do.

  • Ivan Godard
    in reply to: The Belt #1572

    As pointed out by lpsmith in #1552 above, the Mill does simplify tail-call considerably. At the same time it introduces other issues that other machines don’t have. Here’s a rough-out of what comes to mind; we won’t actually do any of this until we have a complete tool chain to practice on.

    1) Mill makes no distinction between calls within and calls across compilation-unit boundaries.

    2) Mill does distinguish between calls within turf and portal calls across turfs. The latter must be secure in both directions, and argument passing of memory-resident arguments is complex. It will be impossible to support portal tail calls because the spiller needs a spill frame to hold the info to return to the calling context. It might be possible to skip an intermediate spill frame if there were immediately nested portal calls, but this case is too specialized to be worth doing anything about. However, the calling code will not have to know whether it is making a portal call or not; the only consequence will be that the calling frame will not be cut back until the portal call returns.

    3) Operands on the local belt are identified by a nesting number, similar to a frame number except that Mill functions do not necessarily have a frame. A call creates a new nesting number, which is how the belt is emptied in one cycle but is still there after the return. A tail call cannot reuse the nesting number of its caller, because that would give it the caller’s belt rather than an empty belt, so a tail call must also get a new nesting number. However, if tail calls are statically identified by an op distinct from normal calls, the hardware can save the caller’s nest number and return to that, rather than to one less than the callee’s number (see the sketch after this list). The nest number is small and wraps around, so there would have to be some arm-waving in the spiller to do the cut-back, but there seems nothing insurmountable.

    4) The scratchpad and stack frame (if present) can be cut back as part of the special op. The called function would have to do a normal scratch/frame allocation if it needed to, but it would always be doing that anyway.

    5a) I don’t see any way to support in-memory arguments (VARARGS or large objects passed by value). In a true tail call these would be on top of the caller’s frame, and hence likely on top of the actual argument that is trying to be passed. A compiler could in principle move everything out of the way, cut the frame, and copy back the actual arguments, but I don’t see any actual compiler ever going to the trouble. Nor am I aware of any tail-call implementation that handles this case; please reply with a citation if you know of one.

    5b) All this addresses true calls. Compilers can turn tail calls into loops in the generated code as well on the Mill as on any machine.

    6) All the implementation issues appear to be in the spiller in the Mill; it has to know that there is a missing spill frame. In turn, that means that it has to be told, and on a Mill that means a distinct operation. Because the internals of the spiller are per-member and not visible to the application, the implementation of tail calls is likely also to be per-member. In particular the spiller in a multicore Mill is quite different from that of a monocore, to avoid race conditions and coherence overhead.
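
    A toy model of the nest-number bookkeeping in point 3 (all names and the Frame struct are invented; the real state lives in the spiller):

    #include <stdio.h>

    /* Each call takes a fresh nesting number. A plain call returns to the
     * caller's number; a tail call records the *caller's* return target,
     * so the eventual return skips the tail-caller's frame entirely. */
    typedef struct { int nest; int return_to; } Frame;

    int main(void) {
        int next_nest = 18;
        Frame caller = { 17, 16 };
        Frame callee = { next_nest++, caller.nest };      /* normal call */
        Frame tailee = { next_nest++, callee.return_to }; /* tail call   */
        printf("tail-called frame %d returns to %d, not %d\n",
               tailee.nest, tailee.return_to, tailee.nest - 1);
        return 0;
    }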

  • Ivan Godard
    in reply to: Simulation #1569

    It’s clear that it can be done, and will be as efficient as a general-register machine. However, just what will be the best way will require some serious head-scratching and experiment.

    My personal starting point would be to keep the Forth stack in belt and scratchpad. That permits everything to be fixed latency, which gives the compiler a chance to generate stall-less code. Some Forth words are not machine primitives, and their expressions can be evaluated directly on the belt. However, most Forth values are either consumed shortly after creation or wind up lower in the Forth stack for reuse. A naive compiler can use rescue to get rubble out of the belt, but when a live value would still fall off then it can spill to scratchpad.
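
    A hedged host-side model of that mapping, where a small window stands in for belt positions and a backing array for scratchpad; the sizes and names are invented, and a real compiler would do this statically rather than at run time:

    #include <stdio.h>
    #include <string.h>

    #define WINDOW 8      /* hot entries, standing in for belt positions */
    #define SPILL  1024   /* cold entries, standing in for scratchpad    */

    typedef struct {
        long hot[WINDOW]; int nhot;
        long cold[SPILL]; int ncold;
    } ForthStack;

    static void push(ForthStack *s, long v) {
        if (s->nhot == WINDOW) {                   /* window full: spill the  */
            s->cold[s->ncold++] = s->hot[0];       /* oldest entry, like a    */
            memmove(s->hot, s->hot + 1,            /* live value falling off  */
                    (WINDOW - 1) * sizeof(long));  /* the belt to scratchpad  */
            s->nhot--;
        }
        s->hot[s->nhot++] = v;
    }

    static long pop(ForthStack *s) {
        if (s->nhot == 0)                          /* refill from scratchpad  */
            s->hot[s->nhot++] = s->cold[--s->ncold];
        return s->hot[--s->nhot];
    }

    int main(void) {
        ForthStack s = {0};
        for (long i = 1; i <= 12; i++) push(&s, i); /* 4 oldest spill cold */
        long a = pop(&s), b = pop(&s);
        printf("%ld %ld (ncold=%d)\n", a, b, s.ncold);  /* 12 11 (ncold=4) */
        return 0;
    }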

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: Pipelining #1562

    Sorry, NYF. There will be a talk on compilation as soon as we can get to it, expected to be in Seattle.
