Forum Replies Created

Viewing 15 posts - 226 through 240 (of 674 total)
  • Author
    Posts
  • Ivan Godard
    Keymaster
    Post count: 689

    I had to ask the hardware guys about this (IANAHG). They say:

    For add/subtract, the power increase with element size N bits is roughly proportional to N*log3(N).

    For shift/rotate, N*log2(N).

    For multiply, N*N (or N*N/4 with a Booth first stage).

    Power increases linearly with number of elements.

    These are only very rough power terms, of order-of-magnitude accuracy, meant to give an intuitive feel.
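
    To get a feel for the relative sizes, the terms can simply be tabulated. A minimal sketch (my own framing; arbitrary units, not watts):

        #include <cmath>
        #include <cstdio>

        // The rough scaling terms from above, in arbitrary units.
        double addPower(double n)   { return n * std::log(n) / std::log(3.0); } // N*log3(N)
        double shiftPower(double n) { return n * std::log2(n); }                // N*log2(N)
        double mulPower(double n)   { return n * n; }                           // N*N

        int main() {
            for (double n : {8.0, 16.0, 32.0, 64.0})
                std::printf("N=%2.0f  add=%7.1f  shift=%7.1f  mul=%7.1f\n",
                            n, addPower(n), shiftPower(n), mulPower(n));
        }

    Multiply by the element count for a full vector, since power is linear in the number of elements.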

    AVX-512 contains not only ALUs, but shifters and multipliers as well. When the multipliers kick in is when the lights dim.

    The big power win on the Mill is in getting rid of all the OOO machinery and the huge number of registers.

    The configuration machinery separately specifies the number of ALUs, multipliers, and divide/sqrt units so configurations can be tuned to workload and the power/performance trade-off.

  • Ivan Godard
    Keymaster
    Post count: 689

    These are commonly called “dope vectors” and are used in Fortran and Ada among others. Any single dimension is a normal indexing access (effectively a LEA for all dimensions but the last). The multidimensional case is hard to put into the hardware because it needs a way to encode several indices, and most hardware has hard limits on the number of arguments per op (not the Mill though). Another issue is the size of the dope vector itself, needing a stride and offset at each level, which gets out of hand after the second dimension.

    However, the biggest reason for not directly supporting dope vectors is the embedded stride multiply: you can overlap all the rest of the indexing with the multiplies, so making it one op rather than several doesn’t buy anything.
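
    For readers who haven’t met the term, a sketch of a dope vector and the indexing it implies (an illustrative layout, not from any Mill or language spec) shows where the embedded multiplies come from:

        #include <cstddef>

        // One descriptor per dimension: a stride plus the index bounds.
        struct Dim { std::ptrdiff_t stride, lower, upper; };

        struct DopeVector {
            char* base;
            int   rank;      // rank <= 3 assumed here; real dope vectors grow with rank
            Dim   dims[3];
        };

        // a[i][j][k]: each level contributes an embedded stride multiply, and
        // those multiplies dominate the latency of the whole access.
        char* element(const DopeVector& dv, const std::ptrdiff_t idx[]) {
            std::ptrdiff_t off = 0;
            for (int d = 0; d < dv.rank; ++d)
                off += (idx[d] - dv.dims[d].lower) * dv.dims[d].stride;
            return dv.base + off;
        }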

  • Ivan Godard
    Keymaster
    Post count: 689

    We do it for speed.

    “Checking all the bits” is not free; you need an OR tree across the full operand width. That can’t be done in the gap between cycles, where the branch is resolved, or at least can’t be done without unattractive clock-rate impact. However, the NEQ can be in the same instruction as the BRTR but happen in opPhase, so its result (one bit) is available for the branch. The NEQ is just making explicit the timing dependencies that your proposed NEQ/BRTR merge does implicitly.

    Predicate gangs also use the same one-bit paths. It costs nothing for an ADD (say) to also yield one-bit values showing the comparison of the result with zero; many conventional machines do this and set their condition codes with them. A predicate gang selects one of the generated signals and drops it as a boolean with a known-to-be-one-bit value that the BRTR can use.

    Our approach does use an encoding slot and a belt position that a merged test-and-branch would not. However, it’s more general and costs no added time.
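
    A source-level picture of the split may help; this is my own illustration (the phase behavior is as described above, but this is C++, not Mill assembly):

        int branch_if_nonzero(unsigned long x) {
            // NEQ and BRTR can share one instruction: the NEQ runs in opPhase
            // and reduces the full-width operand through the OR tree to one bit,
            // which is all the BRTR needs when the branch resolves.
            bool t = (x != 0);   // NEQ: the full-width reduction happens here
            if (t)               // BRTR: consumes the known-one-bit boolean
                return 1;
            return 0;
        }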

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: NULL pointer #3094

    It’s a global pointer. It has to be, because a local zero is an alias for the start of the turf’s private region.

    The language does not in fact require a NULL to be binary zero – the “0” in “p = 0” is a token, not a value. Thus “p = (void*)(x-x)” is not guaranteed to leave p NULL. However, nearly all modern machines do in fact use binary zero for NULL, and we do too.
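
    A minimal illustration of the token-versus-value point (this is standard C/C++ semantics, nothing Mill-specific):

        #include <cstdint>

        void demo(std::intptr_t x) {
            int* p = 0;               // "0" here is a token: a null pointer constant
            int* q = (int*)(x - x);   // a computed zero: not guaranteed to be NULL,
                                      // though on the Mill (and nearly everywhere) it is
            (void)p; (void)q;
        }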

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: IPC questions #3069

    1) What happens if a callee of a portal call would need more than a page frame of stack?

    We get stack with the alloca op, and if it overflows the current remaining free space then the system grabs more. The stack is segmented, so frames do not have to be contiguous with each other. This precludes certain stack games (often exploits), but most of those are impossible anyway because the stack links are not in user space. (There is a sketch of the idea at the end of this post.)

    2) Can a temporary grant of an address region to a callee in a portal call be temporarily forwarded to a third turf or to another thread in the callee’s turf?

    Temporary grants can be re-temporary-granted. Whether they are visible to other threads using the same turf depends on the details of the passing mechanism used by the language RTS, but the hardware does not make it easy.

    3) Are portal calls compatible with C++’s exceptions?

    Portal calls are compatible with exceptions; you can use a portal anywhere you could use a non-portal call.

    4) What if a caller’s turf/process has a thread that crashes and the OS decides to kill all other threads attached to that turf … but one of those other threads is in a portal call?

    I read that Sun Spring’s Door semantics would in that case make the callee finish and “return to the kernel”. I find that to be the most reasonable course of action. But how would the kernel be able to set up a different return address, trap or syscall to end the calling thread upon return?

    OS policy is up to the OS. The hardware spec has little to do with it, other than to make sure that all plausible OS policies are possible. Any thread (process, portal) running in a turf that gives the right access can reach the memory where the thread info and stack links are kept, and so can rearrange things arbitrarily. We expect that there will be a service that has the necessary access and exports an API to do useful things (like kill a thread and recover its resources) for use by everything from kill -9 to a debugger. We’re embarked on a reference implementation, but the policies are still unsettled.
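
    On question 1, a toy model of a segmented stack may help. This is my own illustration of the general idea (names and layout invented), not the Mill’s actual mechanism:

        #include <algorithm>
        #include <cstddef>

        // Frames live in segments that need not be contiguous. On the Mill the
        // prev links would live in protected memory, not in the user-visible frame.
        struct Segment {
            char*       base;
            std::size_t size;
            std::size_t used;
            Segment*    prev;
        };

        void* frame_alloc(Segment*& top, std::size_t n) {
            if (top->used + n > top->size) {   // overflow: the system grabs a new segment
                std::size_t sz = std::max<std::size_t>(n, 4096);
                top = new Segment{new char[sz], sz, 0, top};
            }
            void* p = top->base + top->used;   // bump-allocate within the segment
            top->used += n;
            return p;
        }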

  • Ivan Godard
    Keymaster
    Post count: 689

    Yes and no.
    We have to support C, and C has no arrays in expressions – any array name is promoted to a pointer. For explicit indexing where the type is known, the compiler can insert explicit bounds checks, but often the underlying type is no longer known. For other languages, or under compiler option, checking using ordinary operations and the machine width does about as well as custom operations for bounds checking. The predicated fault op (for stores) and the one-legged NaR pick (for loads and LEA) are useful.
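
    A sketch of the kind of check the compiler can insert while the array type is still known (illustrative; std::abort stands in for the hardware mechanisms named above):

        #include <cstddef>
        #include <cstdlib>

        template <std::size_t N>
        int checked_at(const int (&a)[N], std::size_t i) {
            if (i >= N) std::abort();   // on a Mill: predicated fault op (stores)
            return a[i];                // or one-legged NaR pick (loads and LEA)
        }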

    Valgrind and similar tools check for access outside of an allocated object, not just array indexing. To do that properly requires capabilities or fat pointers, which break data structures that assume skinny pointers. We have some NYF (not yet filed) support for this usage.

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: Peripherals #3056

    There’s no single answer here because there are tons of peripherals with idiosyncratic access mechanisms. The simplest peripherals are accessed by MMIO, and normal Mill memory protection controls the access. More complex “smart” peripherals may be partly MMIO and partly handshake protocols using interrupts. Interrupts are fielded by the hardware and dispatched through a fairly conventional vector; to the handler they appear to have been a normal function call, with belt arguments unique to the particular interrupt supplied by the hardware. Normally the vector entry will be a portal rather than a function pointer, so the handler runs in its own protection domain rather than that of the interrupted program.

    Once a service gets its handler installed in the vector and gets permission to access the relevant MMIO, that service “owns” the peripheral. The Device Manager service that provides access to the vector and the MMIO must be trusted code, but the particular device service need not be. That is, drivers (device services) run at application level and cannot stomp on the system.
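
    To make the shape concrete, here is a hypothetical bring-up sequence. Every name below is invented for illustration; none of this is a real Mill or Device Manager API:

        using TurfId   = unsigned;
        using PortalId = unsigned;

        // Hypothetical Device Manager API (declarations only, invented names).
        PortalId devmgr_make_portal(void (*handler)(unsigned, unsigned), TurfId turf);
        void     devmgr_install_vector(unsigned irq, PortalId entry);
        void     devmgr_grant_mmio(TurfId turf, unsigned long base, unsigned long len);

        // A handler is an ordinary function; its arguments arrive as belt
        // values supplied by the hardware for this particular interrupt.
        void nic_rx_handler(unsigned irq, unsigned status);

        void bring_up_driver(TurfId driver_turf) {
            // The vector entry is a portal, not a raw function pointer, so the
            // handler runs in the driver's own turf, at application level.
            devmgr_install_vector(9, devmgr_make_portal(&nic_rx_handler, driver_turf));
            devmgr_grant_mmio(driver_turf, 0xFEE00000ul, 0x1000ul);
        }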

    More complex still are shared devices such as the network stack, in which there can be several levels of dispatch, but the basic organization remains the same.

  • Ivan Godard
    Keymaster
    Post count: 689

    The widening op forms always widen, and when applied to vectors they drop two result vectors, each with half the number of elements (at twice the element width).
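
    A scalar model of the vector case, assuming 16-bit elements widening to 32-bit (my own illustration of the drop pattern):

        #include <array>
        #include <cstdint>
        #include <utility>

        // Widening an 8-element vector of int16 drops two 4-element vectors of
        // int32: one for the low half of the input, one for the high half.
        std::pair<std::array<int32_t, 4>, std::array<int32_t, 4>>
        widen8x16(const std::array<int16_t, 8>& v) {
            std::array<int32_t, 4> lo{}, hi{};
            for (int i = 0; i < 4; ++i) {
                lo[i] = v[i];        // always widens, so the result cannot overflow
                hi[i] = v[i + 4];
            }
            return {lo, hi};
        }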

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: NULL pointer #3098

    The possibility is for legacy compatibility, for machines like the PDP-10 for which address zero had a hardware-defined meaning. Some future standard may deprecate it.

    The Mill reserves more than the zero address, to catch code like:

    struct S { int a; int b; };  struct S* p = static_cast<struct S*>(NULL);
    ... // other code
    int i = p->b;   // dereference lands at offset 4, not 0, but is still caught

    Of course app code won’t have permissions for ultra-low addresses, but system code may, and hardware poisoning of the range helps catch bugs in privileged code. In addition, it lets the fault distinguish between a (likely) null pointer and a wild address.

    We try to be friendly.

  • Ivan Godard
    Keymaster
    Post count: 689

    We welcome questions!

    Overshift is implementation defined in C because different ISEs (that are too widely used to ignore handle) it differently. For us, the easy case is the excepting form “shiftlx”, which will get you an overshift NaR. The saturating form “shiftls” is also simple, as it saturates rather than overflows. Nearly as simple is the widening form “shiftlw” which will give you a double width result if the count is within that double width. However, if it is also not within the double width then you will get a non-NaR zero. And plain “shiftl” also overshifts to a non-NaR zero too.

    There is no single Mill op that uses the approach in which the count is reduced modulo the operand width, making shift-by-32 and shift-by-zero synonymous (for word-width data). If you want those semantics you must explicitly mask the count with an AND.
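
    The mask is one AND, as in this sketch (ordinary C/C++, giving the x86-style modulo semantics explicitly):

        #include <cstdint>

        // Reduce the count modulo the operand width by hand: with this mask,
        // a shift by 32 acts like a shift by zero for 32-bit data.
        uint32_t shl_mod32(uint32_t x, unsigned count) {
            return x << (count & 31u);
        }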

  • Ivan Godard
    Keymaster
    Post count: 689

    Nudge says starting after Thanksgiving, but still low priority.

  • Ivan Godard
    Keymaster
    Post count: 689

    Weekly. We all agreed it was a great idea that we should proceed on – and then it drifted down the priority list. I have nudged; we’ll see.

  • Ivan Godard
    Keymaster
    Post count: 689

    The running lag on each side is maintained in an internal register on that side; the two registers are independently counted down each issue clock and are reset up as added lag is decoded. The difference in the contents of the lag registers tells which side restarts first, or rather that is the effect: each side is independent and restarts when its private lag runs out. Of course, if the registers have the same value they restart together. In your example, because both are non-zero they both stall, and the left counts to zero while the right counts to one. The next cycle the left (at zero) issues and the right counts to zero. The cycle after that they both issue.
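
    A toy model of the two counters (my own sketch of the effect described, not the hardware):

        // Each decode side keeps a private lag register, counted down every
        // issue clock; the side stalls until its lag reaches zero.
        struct DecodeSide {
            unsigned lag;        // reset up as added lag is decoded
            bool tick() {        // one issue clock; true means the side issues
                if (lag == 0) return true;
                --lag;
                return false;
            }
        };

        // The example above: left lag = 1, right lag = 2.
        //   cycle 1: both stall                 (left -> 0, right -> 1)
        //   cycle 2: left issues, right stalls  (right -> 0)
        //   cycle 3: both issue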

    This of course leaves the question of how to know the running lags at some random point of the code. The call operation saves the lags and return restores them, which takes care of calls. However, the debugger is more of a problem. The compiler statically knows when each half-instruction issues relative to the half-instruction on the other side and can put the relative lag in the debug info. However, if both sides are lagging, there’s no way (at present) to tell the debugger/sim to start up (or examine) one, two, … cycles before lag-out and issue resumption; instead, if the point you select is in the middle of bilateral lagging, you see the program as of the first actual issue after that point. We could add a command to the UI to let the user set the lag point, but it doesn’t seem to be actually useful and we haven’t.

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: Prediction #3049

    This goes about half-way to what the Mill does. You are correct that the usual legacy branch predictor doesn’t know it’s got a branch until too late, after it has decoded the branch. There are known improvements, such as marking branches in the I$1 cache lines, which lets you know you have a branch coming when you fetch the line. However, this is also usually too late in branch-rich code. Still more complex approaches use the regular predictor to build branch-free code sequences, often already partly decoded; this can save a lot in loops but doesn’t help in open code.

    The need to look at “here” before you can realize that you will need to go “there” introduces potential memory delays in prefetch. Adding an explicit prefetch operation doesn’t really help, because you have to decode and execute the prefetch op, and generally that comes too late as well. A prefetch…load sequence would hide the I$1 latency, but not a 300-cycle DRAM latency – the compiler has no idea where it will be 300 cycles in the future. However, the runtime often can have a pretty good idea, and we use that.

    The Mill moves the whole predict and fetch machinery away from the code into a side structure with its own engine unrelated to the decoders. That engine can run arbitrarily far ahead of decode; in particular, it can run more than a memory-fetch-time ahead.

    Your approach of a second buffer might help on a legacy single-issue architecture, but not on a Mill where an instruction can contain several branches, only one (or none) taken; a side buffer for each would blow out the hardware, especially as the following instruction can have several too – see the code example in the “Switches” talk. There is an I$0 microcache in the Mill, which is vaguely similar to multiple side buffers, but this is to cut the power/bandwidth demand in loops more than to cover mispredictions.

    But even if you keep a side buffer per instruction, the side buffers greatly increase the bandwidth demand into the decoders; instruction bandwidth is a limiting factor for high-end Mills. The Mill only moves the predicted path to the decoders; the alternate paths are prefetched to the caches but not fetched to the decoder. That lets prefetch be slow and relatively rare (and so not in the way), with the run-ahead prediction giving the necessary advance notice. We still have to get a missed target from the caches, but we avoid the great bulk of the cost of getting it from memory.

    To answer your final question: no, branch prediction as such is not necessary, and the Mill uses exit prediction instead. But so long as memory latency is much longer than a compiler can see, and path fanout remains larger than hardware can handle in parallel, prediction of some kind will be necessary for performance.

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: IPC questions #3044

    Patience, waiting is.
