Forum Replies Created
- in reply to: Memory Allocation #3706
Allocation is above the architectural level, except for alloca() and allocf(), which have dedicated instructions. Still, one can guess the likely implementations that a native OS might use.
Mill is designed to support a client-server software architecture where the participants form an arbitrary service graph (neither flat nor nested by level) and are mutually distrusting. This lends itself to recursively defined allocation of any resource, not just memory, in which an allocator hands out from an internal resource pool and, if necessary, refills or extends its pool from another service. As with any recursive structure, there must be a bottom turtle that obtains its pool from somewhere else instead of from a recursive call to another allocator.
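For illustration, here is a minimal C sketch of that recursive pattern; the names and structure are invented for this post, not Mill or OS APIs. Each allocator hands out from an internal pool and refills from its parent service when the pool runs dry; the bottom turtle has no parent.

```c
/* A minimal sketch, assuming a simple bump-pointer pool per allocator.
 * Everything here is illustrative; none of it is Mill API. */
#include <stddef.h>

#define REFILL_SIZE (64 * 1024)     /* arbitrary refill granule */

typedef struct allocator {
    char  *pool;                    /* current internal pool */
    size_t remaining;               /* bytes left in the pool */
    struct allocator *parent;      /* refill source; NULL at the bottom */
} allocator;

void *alloc(allocator *a, size_t n)
{
    if (n > a->remaining) {         /* pool exhausted: refill recursively */
        size_t want = n > REFILL_SIZE ? n : REFILL_SIZE;
        if (!a->parent)
            return NULL;            /* the bottom turtle cannot refill */
        a->pool = alloc(a->parent, want);
        if (!a->pool)
            return NULL;
        a->remaining = want;        /* (the old remnant is abandoned) */
    }
    void *p = a->pool;
    a->pool      += n;
    a->remaining -= n;
    return p;
}
```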
In the case of memory, the “somewhere else” is hardware, or rather is specialized software that talks to the hardware. In our prototype code, the size of the address space is hardwired into the data structure at power-up in raw hex constants. In a sense, the bottom turtle pool is the mind of the programmer writing the boot code.
Besides the totality of the address space, the design starts with a number of subspaces that are also hard-wired. Examples include the threadId and turfId spaces, and the threadlets. Getting these started at power-up involves being in the All permission space and using intrinsics to diddle MMIO registers. It's a perverse kind of fun, for those who get into that kind of thing 🙂
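To make the bottom turtle concrete, here is a purely illustrative fragment: its pool extent is written down by the boot programmer as raw hex constants, as in our prototype. Every name and value below is invented for the sketch.

```c
/* Illustrative only: the bottom turtle's pool is not obtained from another
 * allocator; it is hardwired by whoever writes the boot code. */
#include <stddef.h>

#define BOOT_POOL_BASE 0x0000000000100000ull   /* raw hex, fixed at power-up */
#define BOOT_POOL_SIZE 0x0000000010000000ull   /* 256 MiB, say */

static struct { char *base; size_t remaining; } bottom_turtle = {
    (char *)BOOT_POOL_BASE,
    BOOT_POOL_SIZE,
};
```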
- in reply to: Simulation #3704
The constraints are legal (patents), financial, and administrative. We are all impatient.
Mill is not stalled, but it is nowhere near where we hoped and intended it to be.
By the end of 2018 we had reached the natural point in development for a transition from a small-team invent-and-design effort to a large-team implementation effort. But we only had a small team; the transition would require a massive new funding round and a pretty thorough reorganization from a sweat-equity basis to a conventional salaried structure with actual management.
We targeted that transition to March 2020, and took some preliminary steps (like closing subscriptions for our Convertible Notes) in late 2019. Our timing was exquisitely awful; you know what happened in spring 2020.
We expect to try the transition this year.
- in reply to: What is Mill's multithreading memory model? #3701
Threading has been addressed a little here and there in other topics and presentations, but there hasn’t been a talk explicitly focused on it. There is some NYF material that we do not plan to disclose before product so as to preserve the eventual patent period. Here’s a brief summary of what we can say now, subject to change based on implementation and market experience:
* We will not support simultaneous multi-threading (SMT) initially, and maybe ever. In today’s hardware a whole second core is cheaper than what it costs to share a core, and Mill has a *very* fast process switch if the software doesn’t want to wait.
* Mill uses a shared virtual space model. We will support shared address space on chip but not past the pins. Off chip will use message passing through standard libraries and protocols.
* Atomicity, including multi-core atomicity, is supported using the top-level cache as a pending-participant buffer, which appears to the program as a limited-capacity hardware transaction. This isn’t our idea, just an implementation of an approach that IBM has used for years and that is well known in the literature and in IBM documentation. Intel also tried the approach but couldn’t get it to work within the quirks of the x86. Given such multi-location atomicity, software support for standard synchronization primitives like semaphores is a straightforward library (see the sketch after this list). We expect to provide high-performance unbounded software transactions through a library.
* There are no barrier instructions; membars are unnecessary for coherency on a Mill, which is sequentially consistent. Programs that contain true data races must resolve them using the atomicity support.
* Streamers are NYF.
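As a sketch of how a lock library might sit on top of such limited-capacity transactions: the Mill transaction interface itself is NYF, so the mill_txn_* intrinsics below are invented for illustration, and the pattern is the standard hardware-transaction-with-lock-fallback idiom used on IBM-style HTM, not a statement of the Mill design.

```c
/* Hypothetical sketch only. Assumed semantics: mill_txn_begin() returns true
 * if a transaction started; on abort, control resumes at the begin with a
 * false return (as in x86 RTM). These intrinsics do not exist; they stand in
 * for whatever the real NYF interface turns out to be. */
#include <stdatomic.h>
#include <stdbool.h>

extern bool mill_txn_begin(void);
extern void mill_txn_commit(void);
extern void mill_txn_abort(void);

static atomic_int fallback_lock;

void increment_shared(long *counter)
{
    if (mill_txn_begin()) {
        /* Subscribe to the fallback lock: if someone holds it, abort so the
         * transaction cannot race with the locked path. */
        if (atomic_load(&fallback_lock))
            mill_txn_abort();
        *counter += 1;              /* fits easily in the top-level cache */
        mill_txn_commit();
        return;
    }
    /* Fallback: a plain spinlock, guaranteed to make progress even when the
     * transaction exceeds the hardware's capacity. No membars are needed;
     * the Mill is sequentially consistent. */
    while (atomic_exchange(&fallback_lock, 1))
        ;                           /* spin */
    *counter += 1;
    atomic_store(&fallback_lock, 0);
}
```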
- in reply to: Transistor counts? #3691
No counts are available. The hardware folks are targeting an FPGA implementation first, as a proof of principle. Even when they shift to real chips, the gate count is unlikely to be very meaningful – so much depends on process, design rules, and choices like cache sizes. We do expect to publish area and power numbers, and I suppose gate counts for anyone who’s interested, but all that will be for shipping chips, when we have them.
- in reply to: Is the Mill ISA classic virtualizable? #3668
Yes and no, depending on configuration.
As you say, Mill has no privileged instructions and controls things through address permissions. However, the ISA exposes the full 64-bit virtual address space to the supervisor (OS), so to virtualize the supervisor (and thus be able to run two of them) it must be possible to provide a distinct address space for each guest. As all the space is architecturally visible, a Mill supporting VMMs must use address space identifiers (ASIDs) internally to distinguish guest address references. This costs hardware, so some Mill configs – those for embedded markets, for example – won’t support virtualization.
The size of the ASID will limit how many guests can run concurrently; an 8-bit ASID, for example, would cap the count at 256.
- in reply to: alloca (dynamic stack allocation) on Mill #3663
There is an alloca operation in the Mill ISA, used for both alloca() and VLAs. It changes the internal specRegs to dynamically allocate space in the current stacklet, space that is implicitly deallocated on function return. There are some issues that must be addressed by the implementation (a short C illustration follows the list):
* What happens when there is more than one alloca in an instruction bundle?
The order of effect is not defined, nor is the order between bundles in the same EBB. The only way the order could make a difference is if the program compared the addresses returned by the two alloca invocations, and that comparison is illegal per the C and C++ standards.
* What happens if the requested allocation does not fit in the current partially occupied stacklet?
This is handled the same way that stacklet overflow for ordinary locals is handled: the hardware (directly or via a trap, depending on the member implementation) dynamically allocates a new stacklet in the address space, allocates the desired range, makes it visible in the PLB to the current turf, and arranges for the stack cutback at return to reverse the allocation. There is some magic that prevents excessive cost when an iterative allocate/deallocate sequence repeatedly crosses a stacklet boundary; NYF.
* What happens if the allocation is too large to fit in an empty standard stacklet?
There has been some sentiment in favor of simply banning (and faulting) this usage; after all, other systems do have upper bounds on the size of allocas. However, the ISA currently defines that this condition will be supported by the hardware, typically via traps to system software for both allocation and deallocation. The upper bound is set by the application’s memory quantum, as with other allocations of raw address space such as mmap().
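For reference, here are the two source-level features the op serves, in ordinary portable C (alloca() is POSIX/glibc rather than standard C) – nothing Mill-specific. On a Mill, each request becomes an alloca op that grows the current stacklet and is implicitly undone at return.

```c
/* Plain C demonstration of alloca() and a VLA; both map to the Mill
 * alloca operation and both vanish automatically at function return. */
#include <alloca.h>
#include <string.h>

void demo(size_t n, const char *src)
{
    char *buf = alloca(n + 1);      /* explicit dynamic stack allocation */
    char  vla[n + 1];               /* C99 variable-length array */

    memcpy(buf, src, n); buf[n] = '\0';
    memcpy(vla, src, n); vla[n] = '\0';
}   /* both allocations vanish here, with the frame */
```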
- in reply to: Transistor counts? #3696
You are way beyond me w/r/t coins and mining tech, but the underlying design guidance holds true: use a general-purpose CPU like a Mill when you don’t know what code it is to run, other than lots of it, all different. Use a special-purpose ASIC when it’s the same algorithm over and over .. and over.
Mill does have a place in such applications: as the control-plane processor, feeding and controlling all those ASICs.
- in reply to: Transistor counts? #3694
@Madis Kalme:
Crysis is a fair comparison 🙂 Benchmarks are no substitute for real use. This is a real problem for the Mill: we can measure micro-benchmarks by counting cycles in the sim, but much of the gain from the Mill comes at the system level, especially under heavy load – things like implicit zero and very cheap system calls and task switches. The biggest problem, though, is RAS – the Mill architecture is immune to many exploits and cheaply defends against others – but how do you benchmark security?
@jabowery:
Mill can of course do mining, but it is a poor choice for that. It’s a general purpose CPU and mining is a heavy enough load that it needs a special-purpose architecture (matrix multiply, especially) that’s dedicated to the application to be economically competitive. A GPU or similar architecture will always out-perform a general purpose CPU for such things, Mill included. We expected to go out for a funding round (and convert from bootstrap to salary-paying) last spring but, well, 2020.
Right now we’re trying to decide when to make a second try at it. The financial market for the likes of us isn’t back yet, and the virus rates are turning up again, which argues for waiting. But the economy and the future market is real iffy, which argues for doing it now.
Comments?
- in reply to: Instruction Encoding #3679
So do we 🙂
Would your instructor welcome a class presentation by you about the Mill?
- in reply to: Is the Mill ISA classic virtualizable? #3672
The Mill ISA is designed for a nano-kernel. The overwhelming bulk of what we call an OS lies outside the NK and can run happily as you describe. Whether such a system is “virtualized” is a matter of definition.
- in reply to: Inter-process Communication #3671
The Mill call instructions take a static argument with the expected result count. For encoding-efficiency reasons the common cases of no result or one result are special-cased, giving three opcodes: call0, call1, and calln. The hardware matches the number of results in the retn instruction against the expected count and faults on a mismatch; the fault follows the standard fault sequence.
Only the count is checked, and only for belt results. You can still shoot yourself by returning a value of an unexpected width – say a short when a double is expected. However, it is also possible to write width-agnostic code in some cases, because each datum carries its width in metadata. The “sig(..)” instruction can be used to check the width without disturbing the belt.
Arguments and results that are too big or too many to fit on the belt are passed in memory. Memory carries no metadata so it is not possible to verify the count and widths of memory operands. Memory bounds checking provides a limited form of sanity checking, but in general it is up to software to verify that the content of the memory arg/result area makes sense.
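To illustrate the count check, here is a small, deliberately broken C example; the function-pointer cast is undefined behavior in C that conventional machines typically execute silently, while the calln/retn check described above would fault it at the return.

```c
/* The cast makes the caller believe f returns nothing, so it issues call0;
 * f's retn carries one result, so the counts mismatch and the hardware
 * faults. (Widths are not checked: a short returned where a double is
 * expected would still slip through, which is what sig(..) is for.) */
long f(void) { return 42; }

int main(void)
{
    void (*p)(void) = (void (*)(void))f;   /* lie about the result count */
    p();    /* call0 expects no results; retn delivers one: fault */
    return 0;
}
```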
- in reply to: Graphics card on a Mill Machine #3664
We expect to use 3rd party IP for all the uncore that we can.