Forum Replies Created
- in reply to: strcpy and alignment #891
The Mill does support unaligned memory accesses, but as you might expect, using them doubles the memory bandwidth requirement, which can lead to poorer power/performance.
There are a pair of Mill-unique operations that can be used to get a vector loop started with proper alignment. As you also might expect, they produce an initial vector with leading Nones. There wasn’t enough time to cover them in the Metadata talk, but how they work should be fairly obvious.
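The operations themselves weren't covered, but the underlying trick is the classic one. Here is a minimal C sketch of the idea, with an invented vector width; on a Mill the excluded lead bytes would be None elements carried in metadata rather than an explicit count:

```c
#include <stdint.h>
#include <string.h>

#define VEC 16  /* hypothetical vector width in bytes */

/* Round the pointer down to a vector boundary and do one aligned load;
 * the lead bytes before the true start are excluded from the result.
 * On a Mill they would simply be None elements in the initial vector. */
void start_aligned(const char *p, char out[VEC], size_t *lead)
{
    uintptr_t a = (uintptr_t)p & ~(uintptr_t)(VEC - 1);
    *lead = (uintptr_t)p - a;            /* count of leading "None" bytes */
    memcpy(out, (const char *)a, VEC);   /* single aligned vector load    */
    /* out[0 .. *lead-1] are dead; the real data starts at out[*lead] */
}
```

Because the load is aligned it never straddles a line, so the loop pays the single-load bandwidth cost rather than the doubled unaligned one.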
- in reply to: coroutines & greenlets in Mill #847
We know about co-routines 🙂 And light-weight processes for that matter, which are greenlets by another name.
The good news is that the Mill supports co-routines and microthreads in hardware. The bad news is that it’s all NYF. Expect a talk on the subject at some point.
Even without the NYF stuff, the hardware threads described in the Security talk can be used as greenlets. It's a policy decision whether a given thread spawn produces an OS-known thread (and hence one subject to preemptive multitasking), or instead produces a greenlet that does voluntary multitasking. An OS swap-out saves the relevant specRegs in the current TCB. If the TCB is bound to a particular thread id then you have heavy-weight tasking. If the TCB saves (and eventually restores) the current thread id, whatever it is, then you have a greenlet group. The OS task-switch code needs to be aware of the possibility of greenlets, but it makes no difference to the hardware.
At that level at least. There’s more, deeper, but NYF. Sorry.
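To make the TCB policy above concrete, here is a minimal C sketch with invented names; the real specReg set and TCB layout are of course the hardware's and the OS's business:

```c
#include <stdint.h>
#include <string.h>

#define N_SPECREGS 16           /* arbitrary; illustrative only */

static uint64_t spec_regs[N_SPECREGS];  /* stand-ins for hardware specRegs   */
static uint32_t current_thread_id;      /* stand-in for the thread-id specReg */

typedef struct {
    uint64_t saved_regs[N_SPECREGS];
    uint32_t saved_thread_id;
    int      is_greenlet;       /* policy: does this TCB carry the id? */
} TCB;

/* OS swap-out: always save the specRegs; save the thread id only for a
 * greenlet-group TCB.  A heavyweight TCB is bound to one fixed id. */
void swap_out(TCB *t) {
    memcpy(t->saved_regs, spec_regs, sizeof spec_regs);
    if (t->is_greenlet)
        t->saved_thread_id = current_thread_id;
}

void swap_in(const TCB *t) {
    memcpy(spec_regs, t->saved_regs, sizeof spec_regs);
    if (t->is_greenlet)
        current_thread_id = t->saved_thread_id;
}
```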
Pre-emptive multitasking is fully supported. It's a policy decision, above the Mill's pay grade, whether all threads are first class or some are part of thread groups that are scheduled as a group in the usual way. Task switch on a Mill is very cheap because all the save/restore is done for you by the spiller, but there is still the trip through the ready queue, the priority scheduling, and so on. The point of portals and stacklets is to get the protection benefits of separate processes without the IPC costs when the IPC is in fact being done co-routine style and so can be replaced by calls, as it can be in the great majority of cases.
As suggested by the comments above, the thread id is unchanged through a portal call – it’s the same thread, just running in another protection environment. The current thread id is a readable specReg, so the service can know who it is working for. From that it can find more info in data structures it or other services maintain.
However, it can also keep client-specific info in Thread Local Storage. Each client/service combination has its own TLS, just as it has its own stacklet, addressed by a hardware base register like the hardware frame pointer.
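A sketch of a service identifying its client, assuming a hypothetical specReg read; the list lookup stands in for whatever data structure the service actually maintains:

```c
#include <stdint.h>
#include <stddef.h>

/* Stand-in for reading the thread-id specReg; on real hardware this is
 * a specReg read, not a call. */
static uint32_t read_thread_id(void) { return 42; /* stub */ }

typedef struct ClientState {
    uint32_t            thread_id;   /* which client this belongs to */
    struct ClientState *next;
    /* ... per-client bookkeeping ... */
} ClientState;

static ClientState *clients;         /* list maintained by the service */

/* The thread id is unchanged across the portal call, so inside the
 * service it identifies the client being worked for. */
static ClientState *state_for_caller(void)
{
    uint32_t id = read_thread_id();
    for (ClientState *c = clients; c != NULL; c = c->next)
        if (c->thread_id == id)
            return c;
    return NULL;  /* first call from this client: allocate state here */
}
```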
- in reply to: Introduction to the Mill CPU Programming Model #889
The load format knows that it is a Mill and has done operation selection under the assumption that the full abstract Mill operation set is available. The specializer replaces missing ops with emulation subgraphs, often calls but sometimes in-line, that do the same job. The canonical example is floating point, which some Mill family members lack completely.
The load format also knows the width of each operation. Even though width is not in the final encoding (it's represented in the metadata), the compiler that generates load format knows the widths of the arguments, so load format distinguishes add-byte from add-half, and so on. The specializer also substitutes emulation for missing widths, notably on members that lack hardware quad precision.
Load format itself is a serialized forest of data- and control-flow graphs. Emulation substitution is O(N) in program size, and is done on the fly as the graphs are de-serialized.
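A toy sketch of that substitution, with invented names; the real node format and emulation table are of course richer:

```c
#include <stdbool.h>
#include <stddef.h>

/* As each graph node is de-serialized it is checked against the target
 * member's operation set; a missing op is rewritten into an emulation
 * subgraph, here simplified to a single call node. */
typedef enum { OP_ADD, OP_FADD, OP_CALL } Opcode;

typedef struct Node {
    Opcode       op;
    int          width;          /* load format carries widths explicitly */
    struct Node *args[2];
    void        *emul_fn;        /* target of the emulation call, if any */
} Node;

static bool member_has(Opcode op, int width)
{
    (void)width;
    return op != OP_FADD;        /* e.g. a member with no hardware FP */
}

static void *emulation_routine(Opcode op, int width)
{
    (void)op; (void)width;
    return NULL;                 /* real version indexes an emulation table */
}

/* Called once per node as it streams in: O(N) over the whole program. */
static void fixup(Node *n)
{
    if (!member_has(n->op, n->width)) {
        n->emul_fn = emulation_routine(n->op, n->width);
        n->op      = OP_CALL;    /* in-line substitution is also possible */
    }
}
```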
The major task of the specializer, and the time critical one, is operation and instruction scheduling. We use the classic VLIW time-reversed deterministic scheduler, which is O(N) for feasible schedules. Not all schedules are feasible because we can run out of belt and must spill to scratchpad. In such a case, the specializer modifies the graph by inserting spill-fill ops and reschedules, repeating if necessary. A feasible schedule is guaranteed eventually, because the algorithm reduces to sequential code which is feasible by hardware definition. The time depends on the number of iterations before feasibility, which is at most N giving an O(N^2) worst case. However, studies show O(N^1.1) is typical, i.e. few EBBs need spill, and more than one iteration is essentially never seen.
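In skeleton form, with invented names, the feasibility loop looks like this:

```c
#include <stdbool.h>

/* schedule() is the classic time-reversed list scheduler and reports
 * infeasibility when belt lifetimes overflow; insert_spill_fill()
 * rewrites the graph with scratchpad spill/fill ops, forcing progress. */
typedef struct Graph Graph;

extern bool schedule(Graph *g);           /* O(N) per attempt */
extern void insert_spill_fill(Graph *g);  /* makes the graph more serial */

void specialize_ebb(Graph *g)
{
    while (!schedule(g))       /* at most N iterations, so O(N^2) worst   */
        insert_spill_fill(g);  /* case; measured ~O(N^1.1), and more than */
                               /* one extra pass is essentially never seen */
    /* Terminates: fully sequential code is feasible by hardware definition. */
}
```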
That’s a big-O analysis, and the constants matter too in the actual speed of the specializer. However, the constants depend on the horsepower of the Mill doing the specialization, and we don’t have sims yet for anything that complex. We don’t expect speed to be an issue though; disk latency is more likely to dominate.
There are a few cases in which the best member-dependent code is not just a matter of schedule. An example is the decision of whether a given short-count loop is worth vectorizing, which depends on the vector size of the member. The compiler normally does the vectorization, and the only part the specializer plays is to supply a few constants, such as the amount by which to advance addresses each iteration. However, in some cases it is better not to vectorize at all, and the vector and scalar loops are quite different.
In such cases the load format gives alternate code choices and a predicate that the specializer evaluates. The chosen code is then integrated with the rest of the program in the same way that emulation substitution is done. We want the compiler to be sparing in its use of this facility though, to keep the load-format file from blowing up too much.
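In sketch form (invented names again), the selection might look like:

```c
/* Load format carries both variants plus a predicate over member
 * parameters; the specializer evaluates the predicate at install time. */
typedef struct {
    int vector_bytes;         /* the member's vector size */
    /* ... other member parameters ... */
} Member;

typedef struct Graph Graph;

typedef struct {
    Graph *vectorized;        /* compiler-vectorized loop */
    Graph *scalar;            /* fallback loop            */
    int    min_vector_bytes;  /* simple predicate: vectorize only if >= */
} Alternate;

Graph *choose_variant(const Alternate *a, const Member *m)
{
    return (m->vector_bytes >= a->min_vector_bytes) ? a->vectorized
                                                    : a->scalar;
}
```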
I expect that there will be a Specializer talk, but not until the new LLVM-based compiler is at least up enough for public view.
We have not put hardware encryption into the base Mill, because we expect encryption methods to change over the life of the architecture and don’t want to be saddled with outdated legacy requirements.
That said, the Mill has enough horsepower to do reasonable software crypto. For applications that want more, the Mill defines an extension interface to which things like crypto and other specialized engines can be hooked. The interface makes the engine look like an i/o device to the program using it.
We have also considered supporting a physical seed unit. Such units give only a few bits of entropy and so cannot themselves be used for crypto, but they provide an uncrackable seed for regular algorithms. The decision on that has been deferred until after the FPGA version.
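The usage pattern would be the familiar one; a hypothetical sketch, where read_seed_unit() stands in for whatever the hardware access turns out to be:

```c
#include <stdint.h>

/* The seed unit yields only a few entropy bits per read, so the bits
 * are accumulated into a seed for a software generator rather than
 * used directly as key material. */
static uint32_t read_seed_unit(void) { return 5; /* stub for hardware read */ }

uint64_t gather_seed(void)
{
    uint64_t seed = 0;
    for (int i = 0; i < 16; i++)               /* 16 reads x ~4 bits each */
        seed = (seed << 4) | (read_seed_unit() & 0xF);
    return seed;                               /* feeds a software DRBG */
}
```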
Very good! Yes, you can see the Novel bit as implementing a writeback cache, and deferred table update (described as “lazy” in the talk) works as you suggest.
As for the missed question (with my hearing I miss quite a few), the “next call” is for explicit calls, not for interrupts, traps, or faults. The pending grant(s) are state. We once had a way to push them in the PLB and fix them later, but there were issues if they got evicted before the call, so it’s just spiller-food now.
A region descriptor cannot have both execute and portal permission, but you could create two overlapping descriptors. Which you got would be search happenstance. If you wound up looking at a portal block as code then you would not transit and would be due for an invalidInstruction fault Real Soon Now. If you wound up looking at code as a portal, and by accident you happened to pass the security check by satisfying the ids that the bits in the id fields implied, then you would transit to the turf implied by the bits in the turf field, and then try to jump to the address implied by the bits in the target field. That address would have to have execute permission, and be in fact the address of an EBB entry (or you are up for invalidInstruction again) and probably must be the address of the entry of a function with no arguments or you are up for invalidOperand because the belt contents wouldn’t match what the code expects.
So, if the OS portal-bless service screws up and does overlap two descriptors, and the bitsies are just exactly right, then you can call a function in a random service. That’s why portal-bless is in the kernel.
As for distinguishing portal from non-portal calls, the basic reason is uniformity. We wanted a single pointer representation, one you could pass on to code that did not know whether it was a portal or not. Consider a numeric integration package, which takes a data vector and a pointer to the function to integrate. The integrator should work the same whether the function pointer is to an application function or is a portal pointer to something in a math service.
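That uniformity is exactly what lets ordinary code like this stay ignorant of portals; a plain C example:

```c
#include <stdio.h>

/* The integrator cannot tell (and need not care) whether f is an
 * ordinary function pointer or a portal pointer into a math service:
 * both have the same representation and are called the same way. */
double integrate(double (*f)(double), double lo, double hi, int steps)
{
    double h = (hi - lo) / steps, sum = 0.0;
    for (int i = 0; i < steps; i++)
        sum += f(lo + (i + 0.5) * h) * h;   /* midpoint rule */
    return sum;
}

static double square(double x) { return x * x; }

int main(void)
{
    /* square could equally be a portal pointer to a service function */
    printf("%f\n", integrate(square, 0.0, 1.0, 1000));  /* ~0.333333 */
    return 0;
}
```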
- in reply to: coroutines & greenlets in Mill #852
Sorry – you’ll see NYF here a lot. It stands for Not Yet Filed, as in patents, and implies that we can’t answer the question without an NDA due to USPTO rules.
JIT code generation is conventional. To save the JIT from needing to do scheduling and bit packing, the JIT will create Mill load module abstract code and then call the same library that the specializer uses to get executable bits for the host machine. Thus the same JIT runs on all Mill members, even though the binary encoding varies by member.
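So the flow, in sketch form with invented names, is just:

```c
/* The JIT emits member-independent abstract code, then hands it to the
 * same specializer library that schedules and bit-packs for the host. */
typedef struct AbstractCode AbstractCode;
typedef struct Binary       Binary;

extern AbstractCode *jit_front_end(const char *source);   /* JIT proper     */
extern Binary       *specialize(const AbstractCode *ac);  /* shared library */

Binary *jit_compile(const char *source)
{
    return specialize(jit_front_end(source));  /* same path on every member */
}
```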
The Mill has special support for GC, in the form of the event bits in the pointer format. See the Memory talk IIRC.
VARARGS is supported.
As the language RTS will need to keep state, including a possibly large stack, for each of those thousands of greenlet threads, the limiting factor for any CPU, Mill included, is probably thrashing in the caches. Even if each greenlet thread uses only the 4KB initial stacklet, a thousand of them is 4MB of stack, which would completely saturate the L2 of a modern CPU, leaving no room for code, OS, or any other process. So they would get evicted to DRAM, and switching to a new greenlet would require reloading from DRAM. The result would be hopelessly slow on any CPU architecture. You would run out of cache long before you ran out of thread ids.
If the greenlets are not really threads but are closures (small closures) that transiently use a stack when they get invoked, then the state requirements become much smaller and the cache issues go away. However, there is then no reason to treat the greenlet as a thread requiring an id in the Mill sense; they are just a collection of cross-calling closures that can be identified by the address of the state object, and there is only one real stack and hence only one real thread and only one real thread id.
Even if the greenlet/closures admit callbacks (and hence cannot use a single stack without GC) you can multiplex them across a small set of true threads (Mill sense), where the thread pool size is determined by the number of concurrent cross-activations, which I suspect is orders of magnitude smaller than the number of greenlets in the use-case.
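A minimal sketch of that closure style, assuming a simple ready queue; the names are mine:

```c
#include <stddef.h>

/* Each greenlet is just a state object plus an entry function, identified
 * by the state address; a small pool of real (Mill-sense) threads runs
 * them, so only pool-size thread ids are consumed however many greenlets
 * exist. */
typedef struct Greenlet {
    void (*run)(struct Greenlet *self);  /* entry; returns when it yields */
    void  *state;                        /* identifies this greenlet      */
    struct Greenlet *next;               /* ready-queue link              */
} Greenlet;

static Greenlet *ready;                  /* queue of runnable greenlets */

/* Body of each pool thread: repeatedly pop and run greenlets. */
void pool_thread_main(void)
{
    while (ready) {
        Greenlet *g = ready;
        ready = g->next;
        g->run(g);     /* uses this thread's one real stack transiently */
    }
}
```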
In all of these the Mill hardware should make the implementation of greenlets easier than on a conventional. But the details depend on the use-case and what the designers of the software have in mind.
- in reply to: Prediction #849
Yes. The load of the info block (in case of a callback) overlaps with the target code fetch, so there are only two fetches on the latency path: the portal itself, and the code miss. If you work through the detailed timing, this is the same as is needed for a trampoline (thunk) calling a dynamically linked (and local in the same protection environment as the caller) function that did not use a portal.
Consequently we expect that portals with wildcard ids (which use the caller’s ids) will be used for local dynamic linking as well as (with explicit ids) for actual protection transit; there’s no need for two mechanisms.
It is unfortunate that the predictor cannot predict far transits, but permitting it to do so would nearly double the size of the prediction tables for the same capacity. Measurement may show that far transits are common enough to justify the extra bits, in which case we will change the prediction formats and make portal calls be predictable, but our best guess for now says that far transits are rare enough compared to near transits that it is better to save the bits.
Re MMU use by garbage collectors:
One certainly could use Mill protection for this purpose, but there’s a better way.
Mill protection has byte granularity, so the GC would need only one region descriptor for the whole of any one kind of space. In a generational GC for example, you might use one descriptor per generation (typical GCs use three generations). This would be an easy port, just replacing the small amount of code that manages the page tables with similar code that manages the region descriptors.
However, there’s a better way, one that uses the GC support “event bits” in the pointer format. With these, the GC can work at the granularity of single objects rather than pages or regions, and would be expected to have sharply reduced overheads. Porting a GC to use these would probably be a bit more work, because the model changes and that requires a bit of thought. The actual code changes should be near trivial though, mostly involving taking stuff out.
Please distinguish alias mapping from paging. 32-bit systems don't have enough address-space bits, so they have to reuse addresses; this is mapping. In addition, physical pages may be present or not; this is paging. For efficiency, conventional systems combine the mapping task and the paging task in one MMU/TLB machine.
With a 60-bit space the Mill has no need for mapping; there’s all the virtual space that everybody wants there for the taking. However, virtual space may still exceed physical DRAM, so there remains a need for paging, and the Mill does paging.
The paging is done in the TLB and its associated OS tables. Those are only looked at when we miss in cache and have to go to memory, and they tell where in physical memory a given virtual page is located, just as on a conventional machine. Or not located, if the page is swapped out to disk. So a Mill can still get page traps.
Mill code is position-independent (PIC), but jump tables and the like have full-width pointers with virtual addresses where the code has been mapped. The underlying DRAM may be swapped out and paged back in at a different physical address, but the virtual address remains the same and the pointers still work.
Actually there is hardware virtual memory on the Mill, and paging and all the rest, and the virtual-to-physical translation, page traps and so on take place in the TLB in a quite ordinary way. It’s just that the TLB is in an unusual place in the architecture, after rather than before the caches as seen from the core, and protection is divorced from paging and moved to a different unit.
Well, that’s not quite true: there are some extras in the Mill MMU: virtual zero (described in the talk), and some NYF dealing with Unix fork(). But it’s mostly true.
As for embedded work, the problem with a smaller address-space footprint is aliasing. If an embedded application had sufficiently small address-space needs that everything would fit in 32 bits with no memory mapping/aliasing, then the Mill single-address-space model would work; it would be a full normal Mill with an implied (and not represented) zero in the high-order word of pointers. Note that you would still want a PLB. Whether the market for a 32-bit-no-MMU Mill is big enough to justify the work is unknown.
Which brings me to your market questions. The Mill innovations tend to interlock rather tightly, so it is difficult to pull just one out and apply it to some other architecture. For example, you could pull Mill split-stream encoding out and apply it to a VLIW to be able to decode 30 ops per cycle like the Mill does. But without things like the Belt in the execution engine you wouldn't be able to execute all the ops you decoded. And so on. We're not opposed to licensing, and would take NRE contracts to port some of the ideas to other machines, but we see the opportunities as being rather limited. We feel it more likely that we will sell hard macros of full Mills into SOCs.
In contrast, we are actively interested in partners for the Mill itself. We know that large buyers will demand second-source availability, which means a license/partner. In addition there are specialized markets – rad-hardened, for example – where the existing vendors have expertise we will never have and a license seems the way to go. It’s the usual business story though – nobody wants to be the first to stick their neck out about a new architecture, but as soon as one bites everybody will be at our door.
To which we will say: we are not an ARM with a license-based business model, so it’s going to be first-come-first-served.
Thread and turf ids last as long as the identified entity and are not swapped out and reused in the anticipated usage. A (typical) two million or so active ids should be sufficient for what a single chip can do. If you need more, then you’ll have to implement pseudo-ids in software and write some kind of mapping that overlays them on the available hardware id-set and specRegs. Not trivial, but could be done based on the Mill primitive model. However, note that direct access to those specRegs is very much a kernel function, so the mapper would be part of the OS whatever it is.
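For flavor, a hypothetical sketch of such a pseudo-id overlay; everything here is invented, and as noted it would live in the kernel because it touches the specRegs:

```c
#include <stdint.h>

/* The OS keeps a table mapping unbounded software pseudo-ids onto the
 * bounded hardware id set, recycling hardware ids much as a TLB
 * recycles entries. */
#define HW_IDS 2000000u              /* roughly the active hardware id count */

typedef struct {
    uint64_t pseudo_id;              /* unbounded software id              */
    uint32_t hw_id;                  /* currently bound hardware id, or 0  */
} IdBinding;

extern IdBinding *lookup(uint64_t pseudo_id);           /* OS table search  */
extern uint32_t   evict_and_rebind(uint64_t pseudo_id); /* steal an hw id   */

uint32_t hw_id_for(uint64_t pseudo_id)
{
    IdBinding *b = lookup(pseudo_id);
    if (b && b->hw_id)
        return b->hw_id;                  /* already bound */
    return evict_and_rebind(pseudo_id);   /* kernel-only: touches specRegs */
}
```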