Forum Replies Created
- in reply to: IPC questions #3038
Lots of questions!
The local and global address spaces are address spaces, not turfs. Yes, when a fork happens the parent’s local space is logically replicated (yes, COW) in the child, and the child turf gets the same rights into it that the parent turf did to its copy, but many different turfs may have rights into the respective spaces. Spaces are an aliasing notion, not a permission notion.
XOR still exists; it cheaply permits conversion in both directions between local and global addressing.
It’s OS/RTS policy whether code is (COW) duplicated or is shared. Normal loaded code will likely share, but if there’s a JIT in there then you might want to duplicate it. The hardware doesn’t care.
COW involves a protection trap (not fault), fielded by the Protection Service. Whether that is part of the OS or is a detached service is up to the system design. The fork() may also elect to pre-copy things that are certain to be touched right away, like the data stack. All policy.
There are ops to global-/local-ize pointers. We expect that portals taking pointers will be called through a shim that globalizes, and the callee will verify globalization as part of argument checking. There is hardware help for that, which will be covered in a future talk.
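As a rough illustration of the shim idea, here is a minimal C sketch. The names `mill_globalize` and `mill_is_globalized` are purely hypothetical stand-ins for the pointer ops mentioned above, written here as no-op placeholders so only the structure is visible:

```c
#include <stddef.h>

/* Hypothetical stand-ins for the globalize/verify ops; on real hardware
 * each would be a single operation. These placeholders only show shape. */
static void *mill_globalize(void *p)            { return p; }
static int   mill_is_globalized(const void *p)  { (void)p; return 1; }

/* Callee side (the service behind the portal): verify globalization as
 * part of ordinary argument checking. */
static int service_write(const void *buf, size_t len)
{
    if (!mill_is_globalized(buf))
        return -1;                  /* reject a non-globalized pointer */
    (void)len;                      /* ... use buf ... */
    return 0;
}

/* Caller-side shim: globalize pointer arguments before the portal call. */
int write_through_portal(const void *buf, size_t len)
{
    return service_write(mill_globalize((void *)buf), len);
}
```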
- in reply to: Vector ops #3036
Like most wide-issue machines the Mill has a notion of “slot”, or scalar lane; one op can issue in each slot in each cycle, and can execute in any of the functional units attached to that slot using independent data. The data may be scalars from one byte up to the maximum supported scalar (64 or 128 bits depending on family member), or vectors of any scalar element size. That is, the architecture is fundamentally MIMD. The data paths feeding the FUs of the slots define a maximum operand width for data in the machine. This width is at least as big as the maximum scalar, but may be bigger so as to support larger vectors.
The con() op drops literal values to the belt. It is completely general and can drop any operand up to the maximum operand size, both scalar and vector. Scalars use the b/h/w/d/q width tags, and vectors use v, as in v16b. If every element of the literal has the same value, you may get better code by con()ing a scalar and extending it to vector with splat().
The rd() op is not general, but drops one of a member-dependent set of popCons (popular constants). If a particular literal is an available popCon then the specializer uses the rd() op, because it is more compact and does not plug up the flow-side slots that con() uses. Popcons may be scalar or vector, and each has a specific width that it drops; the same value at different widths is a different popCon. Some popCons are always present on every member (0 and 1 of all scalar widths, for example), and some are always present if the configuration includes hardware for which they would be useful (pi and e of the relevant floating-point widths for members with FP, for example). In addition, there will always be a few bit patterns left over after the configuration software has determined the bit patterns of all the other operations configured in the readerBlock of the encoding (where the rd() op encodes); these are used to add additional popCons until all the bits are used up.
There are no reduction operations defined other than any() or all(). The alternate() op is a special form of vector swizzling that lets reductions be constructed in logN steps.
- in reply to: Getting an int from a possible pointer #3029
I like your handle.
Pointers are data like any other, and can be in the stack frame. What is not in the Mill stack frame is the state of the calling protocol, in particular the return address and down-stack link. These may be machine addresses (and usually are) but are not program-declared pointers.
Mill conforms to C rules, which permit conversion between pointer types and the intptr_t integral type. However, the only thing you can legally do with the result of such a conversion is to convert it back again. The facility exists to support data structures that need a hash or total ordering of pointers.
You can legally modify the low bits either by converting to integer and doing integer arithmetic, or by casting through void* to a char array and doing pointer arithmetic.
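For concreteness, a minimal sketch of both routes in plain C (nothing Mill-specific; the low-bit tag and the byte-order assumption are mine):

```c
#include <stdint.h>

/* Route 1: convert to an integer, do integer arithmetic, convert back. */
void *set_low_bits_int(void *p, uintptr_t tag)
{
    uintptr_t bits = (uintptr_t)p;   /* pointer -> integer */
    bits |= tag;                     /* arithmetic on the low bits */
    return (void *)bits;             /* convert back before any use */
}

/* Route 2: view the pointer object itself as chars via void*. */
void *set_low_bits_char(void *p, unsigned char tag)
{
    unsigned char *bytes = (unsigned char *)(void *)&p;
    bytes[0] |= tag;                 /* assumes the low byte comes first */
    return p;
}
```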
- in reply to: History, Roadmap & State of the Mill: a lack of. #3027
I have kicked this up to our next meeting, for internal discussion, and will post what was decided.
- in reply to: A suggested market for the Mill… #3026
I agree that automotive is a suitable home for the Mill. Unfortunately, “suitable” and “saleable” are not synonyms, and we would need to partner with a vendor already in that space. It’s one path we are exploring.
- in reply to: Questions related to your IPC talk #3021
We wish we could do capabilities, but caps break the C model and the C pointer representation, so selling a caps machine seems unlikely. There are subtle differences between the Mill grant-based model and caps, most evident when the argument to the RPC is some kind of linked structure such as a graph. In Mill it’s easy to pass a single node and annoying to pass the whole graph; in caps it’s vice versa.
1) Re task switch: It depends on what you mean by “task”; Mill hardware is below that level and does not dictate the task model. If you mean something heavyweight with accounting quanta and all then yes, the OS must be involved, because the hardware doesn’t do accounting. If you mean something lightweight such as a thread of control then no, the OS doesn’t need to be involved. Our next talk will probably be on threading and will cover this.
2) Re availability: Not yet, though we hope to put the sim on the cloud at some point.
3) Gate count: I have no clue, I’m a software guy. I wouldn’t trust the hardware guys on this either.
4) Turf across cores/chips: Turf works fine across cores in a multicore, although there are the usual atomicity issues in updating the protection info. By design Mill does not extend its environment across chips; there’s no interchip shared memory, so there’s no interchip memory protection. Use message passing protocols instead.
5) Core counts: See “Gate count” above.
Wow! Deep questions. And few answers – but I’ll do my best in no particular order.
The current business model is Intel (or TI if you prefer): mostly a chip vendor, with substantial side businesses in IP, custom, and boards. An ARM-like model is a fall-back option.
We expect to expose our specification and sim tools to the customer base and research community, and quite likely to the public. With those tools and a chunk of NRE, we’ll give you your chip, or build hard macros for use in a chip of your own.
We have no minimax search tool such as the one you describe in view; given customer demand, sure.
Specializer-driven emulation of one Mill on another is possible, but won’t give you the modelling you are looking for: members differ too much in ways not driven by the code. Nothing we can do in the code can make one Mill act like another with half the icache. For accurate modelling you’d need to run our sim on your big Mill; the sim models the non-code differences.
Currently our sim greatly abstracts at the system level. In particular, we do not try to model anything across the pins. For example, we simply spec a fixed number of picos for DRAM access and ignore the variation induced by charging, bank switching, and the like. Similarly, we do not model i/o devices, so the sim would be no help in trading memory for i/o.
And all these will no doubt change as the company evolves. In particular, the funders of our next round will have a big say in the answers here.
- in reply to: Multithreading? #3000
There will be a talk on the subject, probably the next one after the one on 10/4.
An interesting paper, but of limited relevance to the Mill. Because (as you note) both aliasing and protection checks are not an issue for us, we can in general hoist a load to any point after the address is available, and can often hoist (much of) the address computation too.
Of the issues listed in the paper:
1) Finding enough independent instructions: always an issue on the higher end members, even when being liberal with speculation. Hoisting helps somewhat, but there’s no point in hoisting just to introduce no-ops.
2) Chains of long-latency instructions: This is a restatement of #1, because if there are plenty of independent instructions then long chains are harmless.
3) Increased register (retire station) pressure: this is a lesser issue on the Mill because we don’t unroll.
So while the paper addresses some of the same things that the Mill architecture does, the methods used have costs that are obviated on the Mill. I doubt that combining the two would give any net benefit.
- in reply to: IPC questions #3040
Turfs don’t have addresses; they have ids. The turf is a collection of arbitrary address ranges with their permissions.
Yes, when allocating in local you have to not collide in global; there’s lots of address space. What is XORed into the high bits of the address is the turf id. This relies on each Unix process (the only code that has fork()) having its own new “home” turf.
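A minimal sketch of the XOR mapping as I read the description above; the position and width of the turf-id field here are my own assumptions, not the real layout:

```c
#include <stdint.h>

#define TURF_ID_SHIFT 44   /* hypothetical position of the turf-id field */

/* XOR the turf id into the high bits to go from local to global form. */
uint64_t to_global(uint64_t local_addr, uint64_t turf_id)
{
    return local_addr ^ (turf_id << TURF_ID_SHIFT);
}

/* XOR is its own inverse, so the same operation converts back again. */
uint64_t to_local(uint64_t global_addr, uint64_t turf_id)
{
    return global_addr ^ (turf_id << TURF_ID_SHIFT);
}
```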
- in reply to: Vector ops #3037
The general answer is to recognize that this is a pick reduction, and yields to the standard reduction strategy of a logN chain of alternate() ops applying the reduction operator at each stage. The operator here would pick the non-zero (or non-None) value at each stage, leaving the chosen value as element index zero at the end where the extract() op would yield it as a scalar.
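Very roughly, that logN reduction has the shape of the following scalar C sketch, with a plain array standing in for the vector and the halving loop standing in for the alternate() chain; the names and types are mine:

```c
#include <stddef.h>
#include <stdint.h>

/* The reduction operator for a pick reduction: keep the non-zero element. */
static uint32_t pick(uint32_t a, uint32_t b) { return a ? a : b; }

/* Reduce a power-of-two-length array (n >= 1) in logN rounds; each round
 * pairs the two halves element-wise, the role alternate() plays on vectors. */
uint32_t pick_reduce(uint32_t v[], size_t n)
{
    for (size_t half = n / 2; half >= 1; half /= 2)
        for (size_t i = 0; i < half; i++)
            v[i] = pick(v[i], v[i + half]);
    return v[0];   /* element index zero, as extract() would yield it */
}
```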
Be aware that this is likely not the last word on this question and on reductions in general. We have made sure that vector semantics is correct, but have not paid much attention to vector performance, and won’t until auto-vectorization is working. An add reduction inherently requires logN adds, but there may be better ways to express that than the present alternate tree. Your pick reduction also fits nicely into the shuffle hardware – but it’s not clear how to fit it into the ISA yet.
The Mill is by definition a SAS system, so the constraint on the low end is address space. A chip with no MMU (all addresses are physical) is effectively SAS already, so if you can fit in that then you can fit in a Mill of the same size. On a 64-bit Mill the spillet matrix occupies 2^50 bytes of address, so the number of distinct thread and turf ids has to drop sharply as the space goes down, but for embedded the number of threads/turfs is probably statically known.
There’s also no architectural need for caches if you are going straight to on-chip memory. You’d probably want to keep the I$0 microcache anyway. The FUs and all the operand paths can have arbitrary width, but should probably not be less than pointer size to avoid code explosion. The predictor could be dropped if frequent misses are tolerable.
There are architectural overheads that are largely independent of address space and operand size: the specRegs, the decoders, others. As the Mill size shrinks these fixed costs start to be more important. It’s unclear when they would tilt a choice against the Mill.
- in reply to: code examples? #3012
Pipelining is easier IMO, at least with SSA. You have a feasible linear schedule and simply lay it around the torus, so everything is done in scalar order. While an instruction may contain operations from different iterations, the arguments of those operations are unambiguously the ones that belong to that iteration so aliasing is not an issue, no more than it is with simple unrolling. The annoying part is the loop prologue and epilogue, and there we have hardware help.
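As a loose scalar analogy in plain C (not Mill code, and with my own example loop), the steady state of a pipelined schedule mixes operations from adjacent iterations while each operation still uses the arguments of its own iteration:

```c
/* Pipeline a[i] = a[i] * 2 + 1 by one stage: the load for iteration i+1
 * overlaps the compute/store of iteration i. Prologue and epilogue are
 * written out by hand here; on a Mill there is hardware help for them. */
void scale(int *a, int n)            /* assumes n >= 1 */
{
    int cur = a[0];                  /* prologue: first load */
    for (int i = 0; i < n - 1; i++) {
        int next = a[i + 1];         /* iteration i+1: load */
        a[i] = cur * 2 + 1;          /* iteration i: compute and store */
        cur = next;
    }
    a[n - 1] = cur * 2 + 1;          /* epilogue: last compute and store */
}
```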
Short functions with loads tend to have extra noops or stalls for the reason you give. A sufficiently smart compiler could issue prefetches in the caller, just as can be done for any architecture. But the general Mill approach is inlining.
Most ISAs inline to avoid call overhead, but calls are cheap on a Mill. We inline to give the machine width more to work on. On a wider Mill you can inline surprisingly large functions with zero latency cost in the expanded caller, because the called ops fit in the schedule holes of the caller. Of course there is usually some increase in the code space used, but removing the call and return means less thrashing in the cache lines, so net it’s a win unless profiling shows the call is never taken.
We’re still tuning the inlining heuristics in the tool chain. As a rule of thumb, any function shorter than a cache line is worth inlining on icache grounds alone.
You wander a bit, but I’ll try to address some of your points.
The belt is not equivalent to a multi-port register file because there’s no update; everything is SSA, so if anything it acts like a RF with only read ports and no write ports. There is an all-source-to-all-sink distribution problem, conceptually akin to a crossbar matrix, but one of the patents explains how we can do 30×40 and have it time like 8×20. However, the distribution is still the scaling limit for the belt.
Phasing is valuable in open code because of the shape of the typical dataflow graph: a few long chains, with short local bushes at each node of the long chain. We’re not often executing two long chains simultaneously, but we are often doing several bush pieces at once and feeding them to the long chain. The bush fragments tend to be short (1-3 ops) and to fall naturally into the three cycles given by phasing, so we can overlap the later bush phases of this long-chain node with the earlier phases of the bush of the next node. It sounds a bit like breadth-first search, but the presence of the long chains means that the placement algorithm is different.
We get long open code because we do a lot of speculative execution. It’s a little like trace scheduling.
Some years ago we had ultra-wide vector operands (UWOs). UWOs make sense on a machine with a dedicated UW RF, but not on the belt where scalars are mixed in, so we replaced UWOs with our skyline vectors.
About seeing silicon: go to https://millcomputing.com/investor-list/
Predication: we have predicated forms of (nearly) all the operations that cannot be speculated, notably all control flow and stores. There’s no point in wasting entropy and compiler monkey-puzzle on the rest.
Managed systems: we have some gc-centric facilities. We also expect that a Mill JIT would take advantage of the regularity and simplicity of the ISA to generate to the machine rather than to a memory-to-memory legacy abstraction. The big Mill wins in a managed environment will not come from the execution core, however; they will come from the reduced memory traffic, cheap coherence, and ultra-fast task switch.