Forum Replies Created
- in reply to: Application Walkthrough #343
Have you a suggested app or fragment?
- in reply to: Site-related issues (problems, suggestions) #333
My posting in the memory topic on January 3, 2014 at 6:27 pm contained a link right after “see”, but no link appears on the page when I view it. I used the “link” button to make it, not manual tagging that I would certainly screw up.
- in reply to: Site-related issues (problems, suggestions) #319
Why aren’t forum topics showing up as threaded? When I reply to several posts in a topic they all wind up together at the bottom with no indication of what they were responding to. I haven’t found an option to turn threading on – doesn’t WordPress do that?
Ivan
As described in the Memory talk, page aliasing and COW work straightforwardly, although with one catch on the Mill: an unwritten COW page will have only one copy in memory because the physical address is aliased, but any lines in cache will be duplicated because the virtual addresses differ even though the data will be the same 🙁
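For readers who want the OS-level term made concrete, here is a minimal conventional POSIX sketch (not Mill code) of COW aliasing itself: after fork() both processes alias one physical frame until the first write, at which point the kernel copies the page. The Mill-specific catch above concerns the cache lines for those aliases, which an example like this cannot show.

    // Minimal POSIX sketch (not Mill code): after fork(), parent and child
    // alias the same physical frame copy-on-write; the child's first write
    // triggers a private copy, so the parent still sees the original value.
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        // One anonymous page; MAP_PRIVATE makes writes copy-on-write.
        char *page = static_cast<char *>(mmap(nullptr, 4096,
                                              PROT_READ | PROT_WRITE,
                                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
        page[0] = 'A';
        if (fork() == 0) {   // child: shares the physical page until it writes
            page[0] = 'B';   // write fault; kernel gives the child its own copy
            _exit(0);
        }
        wait(nullptr);
        std::printf("parent still sees: %c\n", page[0]);   // prints 'A'
        return 0;
    }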
- in reply to: Application Walkthrough #347
Bummer: I see the forum has squeezed out all the extra blanks in my carefully-formatted code. I’ll take it up with forum admin.
- in reply to: Application Walkthrough #346
A good list. We do have to pick a Mill member, but for now assume one big enough (unlimited slots and belt) for anything; the actual slot and belt requirement is an interesting result in its own right.
I’ll take the first: GCD.
/* code based on Rosetta C++ example:
   int gcd(int u, int v) { return (v != 0) ? gcd(v, u%v) : u; }
*/
F("gcd");                       // u in b0, v in b1
    neqs(b1, 0), rems(b0, b1);
    retnfl(b0, b2);
    nop(4);                     // wait for rem
    call1("gcd", b3, b0);
    retn(b0);
This needs a 3-long belt, one flow slot and two exu slots (suitably populated); 8 cycles, excluding the nested call body.
/* code based on Rosetta C++ example:
   int gcd_iter(int u, int v) {
       int t;
       while (v) { t = u; u = v; v = t % v; }
       return u < 0 ? -u : u;   // abs(u)
   }
*/
F("gcd_iter");                  // u in b0, v in b1
L("loop");
    neqs(b1, 0), rems(b0, b1);
    brfl(b0, "xit");
    nop(4);                     // wait for rem
    conform(b3, b0);
    br("loop");
L("xit");
    lsss(b0, 0), negs(b0);
    pick(b0, b1, b2);
    retn(b0);
This needs a 3-long belt, two exu slots, one flow slot and a pick slot; the loop body is 8 cycles, plus 3 cycles for the wrap-up.
In both I have used speculation to launch the rems operation before it is known to be needed; without speculation the cycle counts would be 8 not 7.
The code does not use phasing (NYF). With phasing the count drops to 7 cycles for the first, while the second gets a 7 cycle loop and a one cycle wrap-up.
Your turn 🙂
This is all filed, but was at most alluded to in the talks so far, so here’s a more complete explanation.
Code:
All transfers use one of two modes: indirect through a pointer on the belt (no offset), or a signed manifest offset (up to 32 bits) from the entry address of the current function. There is also a LEA form that returns the offset-from-function-entry result as a pointer dropped to the belt.

Data:
All address modes (load, store, and LEA) comprise a base, an up-to-32-bit manifest signed offset, and optionally an index from the belt with a manifest scale factor that left-shifts the index by 0-4 bits. The base may be either a pointer from the belt or one of a small set of specRegs that contain addresses. Currently these are:
cpReg: base of code region for load module
cppReg: base of constant pool in load module
dpReg: base of static data region
fpReg: base of stack frame, or NaR if no frame
inpReg: base of inbound memory-passed function arguments, or NaR if none
tpReg: base of thread-local storage, or NaR if none

This list is expected to change as we discover issues when porting the OSs. In particular, cpReg and cppReg are likely to go away; there may be an outpReg added; and there may be a few levels of display added (for support of languages with closures). With these, all addresses in the code itself are static and no load-time relocation is necessary.
cpReg, cppReg, and dpReg are initialized by the loader when the process is created. fpReg and inpReg are managed directly by the hardware via the call, retn, and stackf operations. The values of all these registers are currently readable by the rd operation, but that too may go away. They are reachable by MMIO, as used by the init ROM in the power-up sequence to set the initial execution environment.
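To make the data-side address arithmetic above concrete, here is a hypothetical sketch (my notation, not anything from the Mill toolchain) of the effective-address computation: base plus sign-extended manifest offset plus the belt index shifted by the 0-4 bit scale.

    // Hypothetical sketch of the data-side effective-address arithmetic
    // described above; in hardware it is one specialized three-input add.
    #include <cassert>
    #include <cstdint>

    uint64_t effective_address(uint64_t base,       // pointer from the belt or a specReg
                               int32_t  offset,     // up-to-32-bit manifest signed offset
                               uint64_t index = 0,  // optional index from the belt
                               unsigned scale = 0)  // manifest scale factor, 0..4
    {
        assert(scale <= 4);                         // left shift of 0..4 bits
        return base + static_cast<int64_t>(offset) + (index << scale);
    }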
Well, some answers are not NYF 🙂 Though there wasn’t a clear deductive line that led to SAS; it evolved by fits and starts like everything else.
I’m somewhat mystified when you say “Even when those instructions themselves are very efficient, they require separate loads and the use of precious belt slots.” True label- and function-pointers do exist on a Mill, and they must be loaded and then branched or called through, just like for any other machine, but the code would be no different with private address spaces. The great majority of transfers are direct and don’t use pointers, and those need neither loads nor belt positions. You can branch to a label or call a function in a single operation with no belt change. Color me confused.
And yes, the predictor machinery does do prefetching. The phasing mechanism used to embed calls in an EBB is NYF, but will be covered in the Execution talk 2/5 at Stanford.
We assumed from the beginning that any architecture needed 64-bit program spaces; a 32-bit wall is just too constraining. We never really considered supporting both 32- and 64-bit program spaces; apps are written for both on x86 solely because of the massive install base, which we don’t have. We were afraid that either 32- or 64-bit mode would become dominant by accident and the other would die due to network effects, and our double work would be wasted. So pointers had to be 64 bits, less a few we could steal if we wanted to (and it turned out we did).

Given a 64-bit address space, static linking is right out: 64-bit offsets would completely hose the encoding and icache. So what gets statically linked in such systems, and what could replace it? Two answers: code, and bss (static data). It turns out that nobody has 4GB worth of code, and all big data is held in dynamic memory (malloc, with mmap behind it), not in static, so global static is under 4GB too. Sure, you can have a program that statically declares a 100GB array – but look at the code the compiler gives you for that – you’ll see something else going on behind the scenes, if the compiler doesn’t err out right away.
So both code and static data only need 32-bit offsets off some address base. That takes care of the encoding issues, but it also obviates static linking – there’s no advantage to fixing the base, because the instructions carry offsets rather than addresses and so are position-independent. Sure, you need an address-adder, but you needed that anyway to support indexing, unless you are a RISC Puritan and are willing to do individual shift and add operations, and pay for your purity in icache and issue slots. The Mill has quite conventional address modes: base, index, and offset, requiring a specialized three-input adder. No big deal.
So now we have position-independent code (PIC), and 32-bit code and static data spaces within the 64-bit overall space. Are all those 64 bits going to be used by any application? Within my children’s lifetime? Only on massive multi-processor supercomputers with shared memory. However, it’s increasingly obvious that shared memory at the building scale is a bad idea, and message passing is the way to go when off-chip. At a guess, 48 bits of space is enough for any app that we want to support in the market. And how many apps will there be? Or rather, how many distinct protection environments (called turfs in Mill terminology) will there be? Or rather, how much address space in total will be used by all turfs concurrently (after all, many turfs are small) on one chip? Surely much less than 64 bits.
So a SAS is possible without running out of bits or needing more hardware. What are the advantages? The obvious one is getting the TLB out of 90+% of memory accesses. TLBs are horribly expensive on a conventional machine. They have their own cache hierarchy to hide some of the miss costs (which still run to 20% or more of cycles), and to be fast they are a huge part of the power budget. All that goes away with SAS. Yes, SAS still has protection in front of the cache, but that is vastly cheaper than a full-blown TLB (NYF; there will be a Protection talk). Virtual address translation is very far from free, and with SAS it simply does not need to happen 🙂
Then there are software advantages: with SAS, processes can cheaply share data at any granularity and in any location; they are not restricted to page-sized shared mmap regions that require expensive OS calls to set up. Getting the OS out of the act permits a micro-thread programming model that is essential for painless parallel programming. The OS is simpler too – it doesn’t have to track whose address space a pointer refers to.
Now all this is an intuitive argument; we certainly don’t have the experience with large programs and real OSs to measure the real costs and benefits. But we were persuaded. 🙂
- in reply to: Forum RSS? #332
I’m not getting emails either, although I do get a cumulative posting list when I click on the forum RSS-feed button.
Yes, no false sharing; the valid bits on each byte in cache are used by the coherence protocol so that several cores can own disjoint parts of the same line without having to swap the line back and forth. Invalidation is fire-and-forget; the writing core doesn’t have to get the line to write to it, and the sequential consistency memory model removes write-ordering issues so long as invalidations cannot cross in the mail, which is pretty easy to ensure in hardware.
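For contrast, here is a hedged C++ sketch of the false-sharing problem that per-byte valid bits avoid: on a conventional line-granular coherence protocol, two threads updating adjacent counters in the same line bounce it between cores, and the padding shown is the usual workaround. The names here are illustrative, not from any Mill software.

    // Illustrative only: false sharing on a conventional, line-granular
    // coherence protocol. The two counters share a cache line unless padded.
    #include <atomic>
    #include <thread>

    struct Counters {
        std::atomic<long> a{0};   // written only by t1
        char pad[64];             // remove this padding to reintroduce false sharing
        std::atomic<long> b{0};   // written only by t2
    };

    int main() {
        Counters c;
        std::thread t1([&] { for (int i = 0; i < 10000000; ++i)
                                 c.a.fetch_add(1, std::memory_order_relaxed); });
        std::thread t2([&] { for (int i = 0; i < 10000000; ++i)
                                 c.b.fetch_add(1, std::memory_order_relaxed); });
        t1.join();
        t2.join();
        return 0;
    }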
Answers:
1) 0-latency deferred load when there’s no work to overlap. Turns out that it’s expensive to stall issue; modern cores don’t stop on a dime. Hence it’s cheaper to assume a D$1 cache hit and insert the requisite no-ops. However, in nearly all cases the no-ops occupy zero bits in the code stream, so they are free; see Encoding.
2) There is an abandon mechanism for pickup loads and other in-flight state. NYF, but will be covered in the Pipelining talk.
3) Locks: the Retire Stations do snoop on the coherency protocol in multicore configurations. Multicore will be a talk, but not soon.
Ryan:
FP rounding modes: modes are in the operation, indicated by the mnemonic (one of the choices does pick up from the PCSW). This is useful when you have FP ops with different rounding in the same instruction, for example in interval arithmetic.
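As a point of comparison, here is a small C++ sketch of the interval-arithmetic case on a conventional machine, where the rounding mode sits in a global control register and must be toggled with fesetround(); on the Mill the directed roundings would simply be differently-encoded FP operations, potentially in the same instruction.

    // Conventional-machine sketch: directed rounding for an interval add
    // requires switching the global rounding mode around each operation.
    #include <cfenv>
    #include <cstdio>
    #pragma STDC FENV_ACCESS ON

    void interval_add(double alo, double ahi, double blo, double bhi,
                      double &lo, double &hi) {
        const int saved = fegetround();
        fesetround(FE_DOWNWARD);   // lower bound rounds toward -infinity
        lo = alo + blo;
        fesetround(FE_UPWARD);     // upper bound rounds toward +infinity
        hi = ahi + bhi;
        fesetround(saved);
    }

    int main() {
        double lo, hi;
        interval_add(1.0, 1.0, 0.1, 0.1, lo, hi);
        std::printf("[%.17g, %.17g]\n", lo, hi);
        return 0;
    }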
vector casts: there is no such operation as your example; all widen and narrow ops are cardinality-preserving (N->N). We could narrow your four words to four shorts, but what value would you expect in the other four shorts of the result vector? However, there is a vector narrow that narrows two vectors to one with half-size elements. Thus 2X4Xword->8Xshort (i.e. 8->8). You can widen or narrow Nones and NaRs like any other data. A narrow that overflows gives you the same truncate/except/saturate choice as any other overflow (the fourth choice, double width result, doesn’t make much sense when narrowing and doesn’t exist).
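For comparison, the same 2X4Xword->8Xshort shape exists on x86: a hedged SSE2 sketch, where _mm_packs_epi32 narrows two vectors of 32-bit ints into one vector of 16-bit ints and saturates on overflow (one of the overflow choices mentioned above).

    // SSE2 analogue of the cardinality-preserving narrow described above:
    // two 4x32-bit vectors in, one 8x16-bit vector out, saturating.
    #include <emmintrin.h>
    #include <cstdio>

    int main() {
        __m128i lo = _mm_setr_epi32(1, 2, 3, 100000);     // 100000 won't fit a short
        __m128i hi = _mm_setr_epi32(-1, -2, -3, -100000);
        __m128i packed = _mm_packs_epi32(lo, hi);         // 8 shorts, saturated

        short out[8];
        _mm_storeu_si128(reinterpret_cast<__m128i *>(out), packed);
        for (short s : out)
            std::printf("%d ", s);                        // ... 32767 ... -32768
        std::printf("\n");
        return 0;
    }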
Belt timings: all picks, including vector pick, are exu-side encoded. After decoding there’s really no such thing as a “side” any more; execution itself is a collection of FU pipes with no particular “side”.
Implicit splat: We used to have implicit splat; it makes the compiler easier. But when the hardware crew started on the implementation it turned out to be a big hit on the clock rate, paid by all operations whether splat was used or not. So it was taken back out.
Endianness: Internally little-endian, your choice for memory.
Bool vector from rotating smear: The bool vector is not a bit vector; it’s a vector of ordinary operands which happen to be either zero or one, and the width of each bool element is whatever the width metadata says it is, just like for ints. As a special case, a vector argument to a conditional branch tests the element that corresponds to the highest memory address if the vector is loaded from memory (or computed from something that was loaded). This saves an extract operation in the common case of memory-upwards loops and smeari. For memory-downwards loops (less common) an explicit extract would be necessary to get a testable bool. Instead, smearil (left inclusive smear) produces an already-extracted exit condition bool, the way that smearxr (smear right exclusive) does, as shown in the video.
Smearx requires either a second, scalar, result (as shown), or a rotating smear followed by an extract (at no gain over the existing), or branch variants that test each end of the bool vector (doable, but branches are already pretty cluttered). Sort of by definition, smear is only useful in loops, and in a pipelined loop latency is irrelevant, so the extra cycle doesn’t bother us.
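A scalar model may help readers follow the smear variants; this is illustrative C++ only (the real operations are single vector ops), with smearx also producing the scalar exit-condition result discussed above.

    // Illustrative scalar model of inclusive and exclusive smear; not Mill code.
    #include <array>
    #include <cstddef>
    #include <cstdio>

    template <std::size_t N>
    std::array<bool, N> smeari(std::array<bool, N> v) {       // inclusive smear
        bool seen = false;
        for (std::size_t i = 0; i < N; ++i) { seen = seen || v[i]; v[i] = seen; }
        return v;
    }

    template <std::size_t N>
    std::array<bool, N> smearx(std::array<bool, N> v,         // exclusive smear,
                               bool &exit_cond) {             // plus scalar result
        bool seen = false;
        for (std::size_t i = 0; i < N; ++i) {
            bool prev = seen;
            seen = seen || v[i];
            v[i] = prev;
        }
        exit_cond = seen;   // the second, scalar, result mentioned above
        return v;
    }

    int main() {
        std::array<bool, 4> in{{false, false, true, false}};
        bool done = false;
        auto s = smearx(in, done);
        std::printf("smearx: %d %d %d %d  exit=%d\n", s[0], s[1], s[2], s[3], done);
        return 0;
    }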