- Ivan Godard (Keymaster), May 7, 2016 at 11:39 pm, Post count: 687
< Reposted from comp.arch, by permission >
On 4/25/2016 5:11 AM, firstname.lastname@example.org wrote:
> On Monday, April 25, 2016 at 12:02:32 AM UTC+3, Ivan Godard wrote:
>> On 4/24/2016 1:02 PM, Nick Maclaren wrote:
>>> In article
>>> While issue width is theoretically limited by instruction fetch,
>>> in practice the limit is in data access and various
>> One might think so, but that doesn’t seem to be true.
>> Data access is a limit, but common multicore chips have much higher
>> data bandwidth requirements across several cores than the biggest
>> Mill has in its MIMD. We expect that any data access constraint
>> will show up in how many cores we can put on a chip, rather than in
>> how wide we can make any core. A wide Mill is basically a
>> throughput design, so if your need is lots of threads rather than
>> high single-thread performance then you would buy a chip with lots
>> of narrower Mills on it; a low end Mill has no greater bandwidth
>> needs than a conventional. Chips with fewer, wider, Mills are for
>> those that need single-thread throughput. You know who you are.
> You are thinking about LLC/external data access. What about L2 data
> access that appears to have a practical limit of 1 line per cycle?
Nothing says that there is only one L2 on a chip. More importantly, the Mill is an exposed-pipeline design and stalls if data is not ready when scheduled. The deferred-load capability does let us saturate the L2 if there is a string of L1 misses in exactly issue order, but that happens only in test cases. In practice, some of the misses will be short-deferral and will stall, which shuts off the flood and lets the caches catch up to the request backlog.
We also addressed L2 issues from code demand. A branch miss followed by a microcache miss followed by an i1 miss will stall the machine while it waits for the L2. Meanwhile data requests and code prefetch are making use of the bandwidth. The L2 problem is a major reason why we went with exit prediction instead of branch prediction – we need to run code fetch well ahead of execution to smooth out bottlenecks at the L2.
Like any caching, it’s a game of statistics. Our design explorations say that a high-end Mill can profitably use four load-store units, plus a peak code demand of four cache lines per cycle in a one-instruction loop, comprising two lines on each side when the instruction crosses a line boundary. That loop executes out of the microcache and shift buffers, so the i1 and below never see the code demand.
Open code, which isn’t in the microcache, saturates the i1’s at two lines a cycle (one for each i1 – recall that there are two), which means that if open code takes a transfer more frequently than once every (line size/average half-instruction size)*2 cycles then we will be stalling in the i1. Open code can’t use software pipelining, so the half-instructions tend to be short, and this limit too appears to be a concern only for contrived test cases.
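Plugging illustrative numbers into that threshold makes the claim concrete. The line size and average half-instruction size below are pure assumptions for the sake of arithmetic – actual values vary by Mill member and aren’t given here:

```python
# Hypothetical figures -- real Mill line and encoding sizes vary by member.
line_size = 64       # bytes per cache line (assumed)
avg_half_insn = 4    # average bytes per half-instruction in open code (assumed)

# Per the post: open code stalls in the i1 only if it takes a transfer
# (a taken branch) more often than once every
# (line_size / avg_half_insn) * 2 cycles.
min_cycles_between_transfers = (line_size // avg_half_insn) * 2
print(min_cycles_between_transfers)  # -> 32
```

With these assumed numbers, a transfer every 32 cycles or more keeps the i1’s fed; short half-instructions push that window out further.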
Lastly, Mill code has fewer loads and stores than the usual ISA. We don’t use load/store for saving temps, nor for registers and linkage info over a call. In typical general-purpose code that cuts the load considerably.
> What about L1D access that appears to have a practical limit of 2
> loads + 1-2 stores per cycle best case (no bank conflicts) and of 1
> access per cycle worst case? Yes, people tried true dual-ported L1Ds,
> but, first, practice showed that it is not very good trade-off and,
> second, for dual-ported a worst case is still only 2 accesses per
This turns out to be two issues: general-purpose open code, and the data-heavy streaming access typical of scientific codes. In open code our demand is limited by the control flow of the application. Thus if you walk a list you can’t fetch the next node until you have fetched the previous one. Consequently, if a conventional can satisfy the demand using standard cache technology then so can we – we just have a bunch of L/S units sitting idle.
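The list-walk dependence is worth making explicit: each address is itself the result of the previous load, so no width of load/store units can overlap the accesses. A minimal sketch:

```python
# A linked-list walk: each load's address depends on the previous load's
# result, so the accesses serialize no matter how many L/S units exist.
class Node:
    def __init__(self, value, nxt=None):
        self.value = value
        self.nxt = nxt

def walk(head):
    total = 0
    node = head
    while node is not None:
        total += node.value  # this access needs the pointer loaded below
        node = node.nxt      # the next address is itself a loaded value
    return total

lst = Node(1, Node(2, Node(3)))
print(walk(lst))  # -> 6
```

The dependence chain, not the number of functional units, sets the pace here, which is why extra L/S units simply idle on such code.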
Data streams are different: they will be loops, software-pipelined for us, and we can increase the IPC (well, OPC) until we run out of some category of FU the loop needs. We do handle the stream case, but unfortunately how we do that is still NYF.
There are still a few rabbits in our hat; I hope you can be patient.
> Most importantly, what about register file reads and forwarding
> network? Practice appears to demonstrate that for any particular type
> of resources (of which GPRs are most critical by far) it is not
> practical to have more than 3-4 bypasses and more than 6-8 register
> read ports. And that being generous. Power-aware design will likely
> go for less. You are calling registers and bypasses “a belt”, but a
> new name can’t fix physical limitations. In fact, if you still insist
> on CAM-based implementation of belt, it only adds new physical
The Mill has no register file, neither conceptually nor in the hardware. The whole machine is a bypass network. We use the CAM description because it’s easy to understand. There’s no real CAM either in the hardware, although the belt does work as if there were one.
In the hardware there is a routing crossbar that connects every operand data sink with every operand source. The sources are the result latches of the FUs, both those that you can encode ops for and a few that deal with internal administration; as a rough approximation you can consider each issue slot a source. The sinks are arguments to FUs and pseudo-FUs; again as an approximation you can say that sink count is twice the slot count. If we pick an extreme high-end member then that’s a 30×60 routing network, before connectivity optimization.
Consequently that network is three muxes deep in a naive implementation. At the low end, with only four issue slots, it is one mux deep. The extra two layers do have a clock-rate cost, and whether the width is worth the clock is an engineering tradeoff that the market can decide. The network is much smaller than the network for the rename registers on a big OOO, so we have no worries about implementing it.
The network is not actually a 30×60, but is cascaded so that multi-cycle operation results get an extra clock to get to their sink. This reduces the effective number of sources from 30 to around 14, and cuts the mux layers by one. The architectural consequence is that it is no longer possible to specify an operation that is both one-cycle latency and has multiple results. Yes, the convenience of the belt for multi-result operations means that we’d like to have such ops in several cases, but can’t.
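The depth figures above are consistent with layers of 4:1 muxes – an assumption on my part, since the post doesn’t state the mux radix. A quick check:

```python
import math

def mux_depth(sources, radix=4):
    """Layers of radix-way muxes needed to select one of `sources` inputs.
    Assumes a naive tree of 4:1 muxes; the actual Mill fabric is NYF."""
    return max(1, math.ceil(math.log(sources, radix)))

print(mux_depth(30))  # naive 30-source high-end crossbar -> 3 layers
print(mux_depth(14))  # cascaded, ~14 effective sources  -> 2 layers
print(mux_depth(4))   # low-end member, 4 issue slots    -> 1 layer
```

Under that assumption, cascading from 30 effective sources down to ~14 is exactly what drops the depth from three mux layers to two, matching the text.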
Besides the mux network itself, we also have to derive the steering signals for that network. Because the Mill is static-exposed-pipe, we know in decode what the belt consequences of the ops we are decoding are, and all the ops in the third phase of our three-phased issue are pure consumers. As a result we know a full cycle ahead of belt transit what the routing will be, and for the pure sources of the first phase we know two cycles in advance. That gives us time to get the routing signals to the muxes. Your limit of 3-4 bypasses is for dynamic bypasses that are CAM-like; ours are static, and are really just a fairly conventional crossbar, so the only limit is mux depth and, ultimately, power. We are well short of those limits.
It may not work; we don’t have hardware yet, and there may be gotchas. But we have addressed all the obvious objections. Most aren’t really objecting to the Mill; they are objecting to a Mill built the way a conventional is, with register files and such. We gave up on doing it that way a decade ago – for exactly the reasons you object to them now.
- benke_g (Participant), August 18, 2016 at 8:35 am, Post count: 2
Thanks for the explanation of how the belt is actually a description of how to flip the switches in a big crossbar. Now I just wonder what happens to retiring values that are not immediately used, but instead needed in later cycles. Won’t these have to be held temporarily somewhere, thus increasing the number of inputs to the crossbar by at least the belt length (in the worst case)?
- Ivan Godard (Keymaster), August 18, 2016 at 10:52 am, Post count: 687
They do have to be held somewhere; “somewhere” is called the scratchpad. Each function appears to get its own scratchpad. There is only one physical scratchpad, which is an in-core SRAM, not memory, and is reachable only by the “spill” and “fill” ops. The “appearance” of one per frame is provided by the spiller. Mill operation phasing lets an instruction contain “fill(), fill(), add(), spill()” to execute scratchpad-to-scratchpad without stalls, if desired. The instruction takes a single cycle, every time, sustained as long as you like. The equivalent in a general register RISC that has run out of registers would be “load; load; add; store” but because it uses memory the latency is longer, even if there are no cache misses. Machines like the VAX or x86 that have M2M operations suffer the same latency/miss problems, just encoded differently. Mill avoids all that.
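A back-of-envelope comparison makes the latency point concrete. The cycle counts below are illustrative assumptions, not published Mill or RISC figures:

```python
# Toy cycle counts for one scratchpad-to-scratchpad add.
# All latencies here are assumed for illustration only.
l1_load_latency = 3  # assumed L1D hit latency on the RISC side

# Mill: phasing lets "fill(), fill(), add(), spill()" execute within one
# instruction, one cycle, sustained -- the scratchpad is in-core SRAM.
mill_cycles = 1

# RISC out of registers: load; load; add; store. The add cannot issue
# until the loads return; charge load latency plus one cycle for the add.
risc_cycles = l1_load_latency + 1

print(mill_cycles, risc_cycles)  # -> 1 4
```

Even with an L1 hit, the memory round-trip costs the RISC several cycles per spill pair, and a miss makes it far worse; the scratchpad sidesteps both.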