
[{"content":"","date":"15 January 2018","externalUrl":null,"permalink":"/pages/","section":"Pages","summary":"","title":"Pages","type":"pages"},{"content":" Mill CPU Split Stream Encoding Spectre, Meltdown and the Mill CPU Spectre Talk PowerPoint Slides ","date":"15 January 2018","externalUrl":null,"permalink":"/white-papers/","section":"Pages","summary":" Mill CPU Split Stream Encoding Spectre, Meltdown and the Mill CPU Spectre Talk PowerPoint Slides ","title":"White Papers","type":"pages"},{"content":"Forum Topic: Switches\nTalk by Ivan Godard - July 12, 2017, at the SFBay Association of C/C++ Users\nSlides: mill-cpu-switches.04 (.pptx)\nThis is the eleventh topic publicly presented related to the Mill general-purpose CPU architecture.\nMulti-way branches, known as switches or case clauses in various languages, are a notorious pain for compiler writers and CPU architects. On the critical path in important applications from lexers to byte-code interpreters, switches often suffer from poor branch prediction performance. Short switches with few cases can use a chain of ifs but are hampered by the low rate at which branches issue on conventional hardware and may miss several times while working down the chain. A common alternative is to indirect-branch through a jump table, but that approach is subject to misses in the data cache as well as a prediction miss. Some CPU architectures have taken the approach of putting the table in instruction space, but that adds expensive shifters and datapaths to the instruction fetch.\nThis talk shows how the ultra-wide-issue Mill architecture responds to the switch challenge. The Mill is a family of wide-issue, statically scheduled, exposed pipeline CPUs. Some family members can issue up to 30 instructions every cycle, including up to five branches, and can retire instructions at the same rate. The talk works through the compiled code for an example with eight cases spanning 100 values. 
Spoiler: a mid-range Mill does the whole switch, including the action bodies for each case, in three instructions and no table. Which isn’t too bad, considering that hand-crafted Mill code can do it in two.\nIvan Godard is CTO and a founder of Mill Computing, Inc., developer of the Mill family of general-purpose CPUs. He has written or led the development team for a dozen compilers, an OS, an OODBMS, and much other software. He has no degrees and has never taken a computing course; such things didn’t exist when he started. NOTE: the slides may require genuine Microsoft Windows PowerPoint to view; many clones, and Mac native PowerPoint, are unable to show the animations correctly. If you do not have access to Windows PowerPoint then watch the video, which shows the slides as intended.\n","date":"6 July 2017","externalUrl":null,"permalink":"/docs/switches/","section":"Pages","summary":"Forum Topic: Switches\nTalk by Ivan Godard - July 12, 2017, at the SFBay Association of C/C++ Users\nSlides: mill-cpu-switches.04 (.pptx)\nThis is the eleventh topic publicly presented related to the Mill general-purpose CPU architecture.\nMulti-way branches, known as switches or case clauses in various languages, are a notorious pain for compiler writers and CPU architects. On the critical path in important applications from lexers to byte-code interpreters, switches often suffer from poor branch prediction performance. Short switches with few cases can use a chain of ifs but are hampered by the low rate at which branches issue on conventional hardware and may miss several times while working down the chain. A common alternative is to indirect-branch through a jump table, but that approach is subject to misses in the data cache as well as a prediction miss. 
Some CPU architectures have taken the approach of putting the table in instruction space, but that adds expensive shifters and datapaths to the instruction fetch.\n","title":"Switches","type":"pages"},{"content":" Developers of the Mill, a clean-sheet rethink of general-purpose CPU architectures\nFaster, Cooler, Safer Computing. For the existing, portable code in the world, a re-compiled program will run faster, cooler and safer.\nFaster Most existing code is single thread, so the Mill is designed to speed up the execution of each thread by being capable of many more operations per clock.\nFor multi-thread code, the Mill is designed with hardware support for the rapid spawning, synchronization and termination of threads.\nThe Mill also has hardware support for rapidly switching context for interrupts, protection domains, thread states, function call and return.\nCooler The Mill uses low power statically scheduled hardware to achieve performance similar to high power dynamically scheduled hardware.\nThe Mill routes most results directly from operation to operation, reducing high speed register file power.\nThe Mill uses informed memory management hardware to keep most temporary data on chip, reducing external main memory power.\nSafer The Mill uses a separate machine state stack to stop buffer overflows from installing hostile code and Return-Oriented Programming attacks.\nThe Mill function call and return mechanism hides intermediate data that are not explicitly passed, preventing data leakage across a call or its return.\nThe Mill memory allocation hardware prevents data leakage between processes, with zero power and latency overhead.\nNews announcements – See and discuss them in our Forum. events – Find out about any future or recent ones. in the press – Press articles mentioning Mill Computing. links – Places on the web where the Mill Architecture or Mill Computing have been mentioned. 
news email list – Subscribe to our Mill Computing email news list. Who are we? In February 2003, the original founders saw that key problems with general purpose processor architectures simply were not being addressed adequately by any contemporary CPU, and so rethinking CPU architecture was something that “just needed doing.” The Mill CPU project was started knowing full well that delivering processors to customers also means delivering a lot of support software. Several of us were serial entrepreneurs, and we decided to form a company, one that would avoid some of the traps that cause young companies to fail. We sought advice from several well-known people in the microprocessor industry.\nIn January 2004 Out-of-the-Box Computing came into existence as a “company in formation.”\nIn March 2014 after a decade of work Out-of-the-Box Computing was incorporated as Mill Computing, Inc. in order to raise funding to begin filing patent applications and move on to the next phase. Our fundraising from angel investors has gone well, and we now have over twenty patents filed on all aspects of the Mill.\nAs of January 18, 2022, 17 of our patents have been granted.\nJoin Us What do you get by joining us? In the beginning we were a sweat equity organization; no one received a salary; instead, contributors received units that converted to stock when we incorporated. At incorporation 45 people had worked on the Mill and became shareholders. After incorporation we are still a sweat equity organization; we now use a stock option system for sweat equity, and we still pay no salaries. Reward for work today is comparable to what it was before incorporation.\nWhere do we work? We are a distributed company with individual contributors working from their homes. While the founders are in Silicon Valley, some of us live in Europe, and for a while one even lived in Borneo!\nWhy are we looking for talent now? 
Now that we are well into implementation we can benefit by having a larger team. Earlier on, a larger team would have just slowed us down.\nWhat work needs to be done? Writing compiler toolchain code, including working on our own code generator back end and modifying LLVM to support some unusual capabilities of the Mill such as quad precision, overflow detection and decimal floating point Writing test sequence generators in C++ for individual operations Modifying and/or writing logic generators in C++ for hardware functional units Porting libc and libc++ Porting an open-source BIOS Writing a micro-kernel similar to L4 Porting Linux Writing white papers, technical web sites, and user guides What skills do contributors need? All code contributors should be comfortable writing and enhancing sophisticated C++ code and navigating large C++ code bases. Software toolchain contributors will need familiarity with LLVM and/or ELF. BIOS contributors will need familiarity with programming for hardware such as USB, PCI and DRAM controllers. Hardware contributors will need to know Verilog in addition to C++ and must be familiar with standard cell design techniques. Technical writing contributors need to be familiar with computer architecture and systems software. If you are interested in joining us, please email\nInvestor Email List Signup This list is for those wishing to invest in Mill Computing, Inc., when and if opportunities to do so become available. Future opportunities, if any, will be announced only on this list.\nBy joining this list you are certifying that you are an Accredited Investor under Regulation D of the Securities Act of 1933 or are otherwise exempt from the registration requirements. Typically an Accredited Investor will have a net worth of $1 million or more excluding the primary residence, or an income of $200,000 or more in the current and prior two years. The full list of requirements is at http://www.sec.gov/answers/accred.htm. 
See also http://www.sec.gov/answers/rule504.htm.\nI qualify as a Regulation D Accredited investor. Sign up for investor list.\nI do not qualify. Return to front page.\nSign up for Mill Computing News * indicates required Email Address * First Name * Last Name * Company How did you hear about us? Email Format html text ","date":"2 December 2016","externalUrl":null,"permalink":"/","section":"","summary":" Developers of the Mill, a clean-sheet rethink of general-purpose CPU architectures\nFaster, Cooler, Safer Computing. For the existing, portable code in the world, a re-compiled program will run faster, cooler and safer.\nFaster Most existing code is single thread, so the Mill is designed to speed up the execution of each thread by being capable of many more operations per clock.\n","title":"","type":"page"},{"content":" U.S. Patent 9,513,920 - Computer Processor Employing Split-Stream Encoding This patent is for the split-stream method of encoding that permits Mill CPUs to decode extremely wide instructions (those with many independent operations in each instruction) using a compact and flexible variable-length encoding that is also fast to decode. The stream of instructions that is the program being executed is split into two streams of half-instructions that are stored separately in memory but are processed in lock-step by the CPU decoder. Because wide fixed length instruction encodings use an impractical amount of cache and memory, and variable-length encodings take time polynomial in the width, instruction decode on legacy CPUs is limited to eight operations per cycle or fewer; Mill split-stream encoding supports instruction widths of over thirty operations decoded per cycle. In addition, split-stream permits doubling the amount of instruction cache with no clock or pipeline penalty.\nSplit-stream encoding is described here in a way that is more accessible than the patent text.\nU.S. 
Patent 9,513,921 - Computer Processor Employing Temporal Addressing For Storage of Transient Operands This foundational patent in the Mill CPU family replaces traditional general-purpose registers for short-lived values with a hardware-managed “belt” that uses temporal addressing.\nOverview The processor organizes transient operands in a fixed-length logical queue called the belt. Instead of spatial register numbers, operands are referenced by their production order in time (temporal addresses). New results drop onto the front of the belt; older ones shift backward each cycle and eventually fall off the end. This eliminates write-after-write, read-after-write, and write-after-read hazards, removes the need for register renaming, and simplifies instruction encoding because results are placed implicitly while sources are referenced explicitly by belt position.\nKey Features Temporal addressing: An operand produced in the current cycle is at belt address 0; the previous one at address 1, and so on. Instructions specify only source positions (e.g., Add(3,4) adds the 4th- and 5th-most-recent results). Multi-result and multi-cycle support: A single operation or multiple operations can produce several operands in one cycle; they are ordered and dropped according to fixed rules. Variable-latency units (e.g., multiply, load) coordinate via slot-specific or daisy-chained output registers. Belt interconnect (optional split crossbar): Efficiently routes results from functional units to the belt and from the belt to consumers. Spiller unit: Automatically saves and restores belt contents during CALL/RETURN or interrupts, creating private per-frame belts. Scratchpad integration: Longer-lived operands are spilled to or filled from the byte-addressable scratchpad when they would otherwise fall off the belt. Core Mechanisms Operands enter the front of the belt in production order. Each cycle the entire belt logically shifts; the oldest entry retires. 
Decode hardware translates temporal addresses to physical locations (belt slots or functional-unit output registers). Multiple operands can be read or written in a single cycle. For subroutine calls, the spiller asynchronously copies the caller’s belt to a private frame (and later restores it), passing arguments and results by selective copying. The design supports in-order retirement while permitting out-of-order execution within the pipeline.\nProblems Addressed and Benefits Conventional register files require large numbers of physical registers, complex renaming logic, massive bypass networks, and many address bits in instructions. The belt eliminates all of these: no rename stage, dramatically smaller instruction encoding, no WAW/RAW/WAR hazards, and far simpler bypass logic. Code is smaller, power and area are reduced, and high instruction-level parallelism becomes easier to exploit. The mechanism integrates cleanly with the Mill’s operand-metadata system, scratchpad, and high-ILP pipeline, forming the foundation for the rest of the Mill architecture while remaining invisible to the programmer.\nThe Belt is described here in a way that is more accessible than the patent text.\nU.S. Patent 9,513,904 - Computer Processor Employing Cache Memory With Per-Byte Valid Bits This is a core patent in the Mill CPU family describing a hierarchical cache system that tracks validity at the byte level rather than the full-line level.\nOverview The invention adds a set of per-byte valid bits to every cache line in the processor’s memory hierarchy (L1 data cache, L2, victim buffers, etc.). These bits explicitly mark which individual bytes within a line contain semantically defined data and which do not. Memory requests (loads and stores) carry a byte mask that tells the cache exactly which bytes matter for that operation. 
This design eliminates the classic “read-before-write” problem for partial stores and enables precise, low-overhead handling of sparse or partially initialized data.\nKey Features Per-byte valid bits: One bit per byte in every cache line (e.g., 64 bits for a 64-byte line) stored alongside the data, tag, and a single dirty bit. Byte mask in every request: Loads and stores include a mask indicating which bytes of the addressed cache line are being accessed. Victim buffers: Fully associative buffers that absorb all new stores immediately; no traditional write buffers or allocate-on-write-miss required. Byte-wise merging and hoisting/lowering: When data moves between cache levels or to main memory, only valid/dirty bytes are transferred or merged, with higher-level data taking priority. Per-byte hit/miss logic: Hardware generates separate hit signals for each requested byte using tag matches + valid bits, feeding a simple output multiplexer. Core Mechanisms Loads: The L1 cache (and victim buffers) checks only the bytes specified by the mask. Matching valid bytes are returned immediately; missing ones propagate down the hierarchy with a narrowed mask. When a lower level supplies data, valid bytes are hoisted upward and merged byte-by-byte into any partial resident line. Stores: Data goes straight to a victim buffer (setting the corresponding valid and dirty bits). If the line already exists in the cache array, the new bytes are merged in with priority. On eviction, only dirty valid bytes are written down the hierarchy—no full-line write-backs or consolidating buffers needed. Coherence in multiprocessor systems: Private caches can hold the same line in “modified” state simultaneously if their valid-byte sets are disjoint. Special byte-masked “Request for Invalidate” messages prevent false sharing. Integration with Mill architecture: Works with virtual caches, the Protection Lookaside Buffer (PLB), deferred loads, and the belt/scratchpad operand model. 
Stores take effect immediately, eliminating the need for memory barriers in Mill programs. Problems Addressed and Benefits Traditional caches operate at full-line granularity, forcing read-modify-write cycles, write buffers, and consolidating logic for any sub-line store. This creates power, area, and complexity overhead and forces unnecessary memory traffic. The per-byte valid-bit design solves these by:\nEliminating write buffers and read-before-write entirely—stores always “hit.” Drastically reducing memory traffic and power (only valid/dirty bytes move between levels or to DRAM). Simplifying hardware (no consolidating buffers, smaller victim structures). Preventing false sharing in multi-core chips. Providing deterministic behavior for uninitialized or partially written data. The result is a cleaner, faster, lower-power memory subsystem that fits seamlessly with the Mill’s high-ILP, belt-based execution model while remaining fully compatible with conventional memory semantics.\nThe per-byte cache validation is described here in a way that is more accessible than the patent text.\nU.S. Patent 9,524,163 - Computer processor employing hardware-based pointer processing This patent is part of the Mill CPU family and adds dedicated hardware support for safe, efficient pointer handling directly in the execution pipeline.\nOverview The patent equips the processor with two complementary hardware mechanisms that operate on pointers containing small amounts of embedded metadata. The first mechanism uses event bits in pointers to trigger low-overhead traps on memory operations (loads, stores, or pointer stores) for uses such as garbage collection or security monitoring. 
The second uses a granularity field to perform automatic bounds checking on pointer arithmetic and array accesses, preventing wild-address bugs without software checks or extra memory tags.\nKey Features Event Bits: A few metadata bits carried with every pointer (distinct from the address field) that encode provenance or memory-region information. Interest Mask Registers: Three writable hardware registers (read, write, and update masks) that the runtime or OS can set to specify which pointer operations should raise an “event-of-interest” signal. Granularity Field: A compact field (e.g., 6 bits in a 64-bit pointer) that defines the size of aligned “chunks” (power-of-two byte blocks) the pointer is allowed to address. Address Derivation Unit: Dedicated logic that computes new pointer values while simultaneously checking for offset overflow. Optional Valid/Invalid Bit: Marks pointers as invalid (e.g., one-past-the-end) and restricts unsafe operations on them. Core Mechanisms Pointer Event Detection (for garbage collection and security):\nOn every load-through-pointer, store-through-pointer, or pointer-store operation, the event bits from the pointer(s) index the appropriate mask register. If the selected interest bit is set, hardware immediately raises an event-of-interest trap to the runtime system. Example: In a copying garbage collector, event bits distinguish “old” vs. “new” regions; a pointer store from old to new can be trapped so the collector can fix it—without polling or scanning every pointer. Concatenation of source and target event bits enables precise detection of cross-region updates. Hardware Bounds Checking via Granularity:\nPointer arithmetic (add, subtract, indexing) is routed through the address derivation unit. The granularity field splits the address into a fixed “chunk” part and a variable “offset” part; the unit verifies that the offset stays within the chunk (no carry into the chunk field). 
Valid result → pass signal and new pointer; overflow → fail signal (and optional fault). Supports “refine” operations to tighten granularity and safe one-past-the-end handling via the valid bit. All checks occur in parallel with normal address generation and add negligible latency to the pipeline.\nProblems Addressed and Benefits Traditional pointer handling relies on slow software checks (range comparisons, tag bits, or GC barriers) that hurt performance and increase memory use. Wild pointers and buffer overruns remain major sources of crashes and security vulnerabilities. This design solves both issues in hardware:\nEliminates software overhead for common pointer operations and garbage-collection barriers. Prevents wild-address bugs by enforcing object bounds at the hardware level with zero extra storage. Accelerates garbage collection through automatic provenance tracking and selective trapping. Improves security without sacrificing speed—especially valuable for C/C++-style code or languages with manual memory management. Minimal cost: Only a handful of bits per pointer and a few small hardware units; fully compatible with the Mill’s belt, scratchpad, and high-ILP pipeline. Overall, the invention delivers safe, high-performance pointer processing as a native CPU feature, simplifying compilers and runtimes while closing a long-standing gap between high-level language safety and low-level hardware efficiency.\nU.S. Patent 9,652,230 - Computer processor employing dedicated hardware mechanism controlling the initialization and invalidation of cache lines This patent is part of the Mill CPU family and introduces a lightweight hardware structure called the Implicit Zero Map (IZM) that automatically manages cache-line initialization and invalidation for stack-frame-local data.\nOverview The invention adds a small, dedicated hardware map (the IZM) to the processor’s execution/retire logic. 
Each bit in the map corresponds to one cache line that holds frame-local operand data for the currently active stack frame. When the bit is set, the cache line is treated as implicitly zero (no valid data exists in the memory hierarchy). This mechanism works transparently with the processor’s hierarchical memory system (L1 cache + lower levels) and integrates with the Mill’s data-stack and scratchpad model.\nKey Features of the Implicit Zero Map (IZM) Bit-per-cache-line array: A compact, hardware-resident bit map that tracks only the cache lines belonging to the top stack frame (size typically covers a few KB of local data). Implicit-zero state: A set bit means “this line has never been written and should return zeros on any read.” Frame-local binding: The map is dynamically re-pointed to the current frame using special registers (stack pointer SP and frame-size/offset registers) on every function activation. No software zeroing required: Eliminates explicit store loops that traditionally clear local variables. Core Mechanisms On function entry (frame allocation): Hardware sets all IZM bits for the new frame’s cache lines to the implicit-zero state. If the frame exceeds the map’s capacity, excess lines are explicitly zero-written to the cache once.\nLoad handling: Before any memory request reaches the cache, the address is converted to an IZM index. If the bit is set, the pipeline immediately returns zero-filled data and marks the request “hit,” discarding it so the cache and lower memory are never accessed.\nStore handling: If the IZM bit is set, the store data is merged with zeros for the rest of the line, the data is written directly to the cache, and the bit is cleared (marking the line “valid”). 
Again, no lower-level memory traffic occurs.\nOn function exit (frame deallocation): Hardware walks the IZM bits:\n– Clear bits (valid lines) → invalidate the cache line and cancel any pending write-back.\n– Set bits (still zero) → simply reset the bit; nothing else is required.\nExcess lines beyond the map are also invalidated without write-back.\nAll operations are performed in the execution/retire pipeline stages and use spare memory bandwidth when possible, so they add virtually zero cycle overhead.\nProblems Addressed and Benefits Conventional processors require explicit zero-initialization stores (or rely on OS page-zeroing) for every new stack frame, consuming memory bandwidth, power, and code space. They also leave stale data from prior frames, creating security/privacy leaks and subtle bugs when code assumes uninitialized locals are zero. The IZM solves these by:\nEliminating zero-initialization traffic — loads from fresh locals are satisfied instantly with zeros; stores pay only for actual data written. Guaranteeing deterministic starting state — every new frame begins with clean zeros, removing an entire class of initialization bugs. Hiding prior-frame data — automatic invalidation on exit prevents information leakage between function activations. Saving power and bandwidth — cache lines that remain unused never touch lower memory levels. Minimal hardware cost — only a few dozen bits of storage plus simple address-to-index logic. The design fits seamlessly with the Mill architecture’s scratchpad, belt, and high-ILP pipeline, delivering cleaner, faster, and more secure function-call semantics without changing the programmer-visible instruction set.\nFor a less patentese explanation of this Mill technology see our memory talk video here, or look at the corresponding powerpoint starting at slide 59.\nU.S. 
Patent 9,690,581 - Computer processor with deferred operations This patent describes the deferred-load hardware of the Mill CPU architecture.\nThis patent describes a processor architecture that supports deferred load operations, where the request of data from an address (issue) is separated from its retirement into operand storage (retire). The retire timing is controlled by statically assigned parameter data encoded in the operation itself, enabling program-managed scheduling of variable-latency operations like memory loads in a static-scheduled, in-order execution environment. By allowing a program to avoid stalling when data that is being loaded is in cache, the invention allows a statically scheduled CPU to achieve load performance comparable to a dynamically scheduled Out-Of-Order CPU.\nKey Features Deferred Operations: Operations (e.g., deferred load or DLOAD) that decouple issue from retire, with retire controlled by a schedule latency parameter (cycle count or identifier). Retire Station: A hardware buffer that manages result data, handling early arrivals (buffering) or late arrivals (stalling) relative to the programmed schedule latency. Schedule Latency Control: Static encoding specifies retire cycle via countdown timer or matching identifier from a “pickup” operation. Memory Handling: Supports hierarchical memory with snooping for store collisions on in-flight loads, ensuring correct memory state at retire time. Call/Interrupt Integration: Options for inclusive/exclusive latency counting during calls; mechanisms to discard, save/restore, or reissue deferred ops on interrupts. Speculation Support: Allows hoisting operations over branches, with fault handling for errors encountered before retire. Core Mechanisms Execution Flow: A deferred operation issues to a functional unit (e.g., load unit sends DLOAD to L1 cache). A retire station allocates, configures with schedule latency, and monitors execution. 
Latency Comparison: Retire station compares actual execution latency (time to result) against schedule latency. Buffers results if early; stalls pipeline if late until result arrives. Collision Detection: Snoops intervening stores for address overlaps with in-flight DLOADs; discards and reissues if collision detected, or buffers valid results. Retire Process: At schedule expiration, retires data to operand storage (e.g., belt or registers) if valid; faults on errors or mismatches. Pickup Variant: Alternative to timers; retire triggered by a separate “pickup” op with matching ID, allowing dynamic scheduling. Interrupt/Call Handling: On calls, latency timers can pause (exclusive) or continue (inclusive); interrupts may discard ops, spill state to stack, or reissue post-resume. Problems Addressed and Benefits Traditional static-scheduled processors struggle with variable execution latencies (e.g., cache misses stalling pipelines), while out-of-order designs add complexity, power, and area costs. This invention addresses these by enabling static scheduling of variable-latency ops without stalls at issue time, allowing early issuance (hoisting) for overlap with other work. Benefits include simplified hardware (no dynamic reordering), reduced power/area, improved performance via masked latencies, correct handling of memory dependencies at retire, and robust support for speculation/interrupts, enhancing efficiency in Mill’s high-ILP architecture while complementing related patents on operand storage and bypass networks.\nU.S. Patent 9,747,216 - Computer processor employing byte-addressable dedicated memory for operand storage This patent introduces a processor architecture that uses a “belt” for transient operand storage with temporal addressing and a separate byte-addressable “scratchpad” memory for longer-lived operands, complemented by a spiller unit for context management. 
It optimizes operand handling in high-ILP environments by reducing register pressure and enabling efficient CALL/RETURN operations.\nKey Features Belt Storage: A fixed-length queue (conveyor belt) for short-lived operands, referenced via temporal addressing based on production order. Scratchpad Memory: Byte-addressable dedicated storage for operands copied from the belt, with static, aligned addresses and variable-sized windows per function frame. Spill/Fill Operations: Mechanisms to copy operands between belt and scratchpad for preservation and restoration. Spiller Unit: Handles asynchronous saving/restoring of belt and scratchpad contexts during subroutine calls, returns, and interrupts. Window-Based Mapping: Circular buffer organization with Base/Fence registers defining per-frame address windows. Split Crossbar Network: Optimizes operand routing between functional units, prioritizing low-latency paths. Core Mechanisms Belt Operation: Operands from functional units are injected at the belt’s front and shift backward each cycle, falling off the end when displaced. Instructions use logical temporal addresses (e.g., Add(3,4)) to reference operands by recency, eliminating explicit register names. Scratchpad Integration: SPILL copies belt operands to scratchpad before expiration; FILL restores them. Addresses are byte-aligned and static, supporting variable operand sizes. Context Switching: On CALL, the spiller saves caller state (belt/scratchpad) asynchronously and sets up callee’s private frame via SCRATCHF (allocates window). RETURN restores state and deallocates. Base/Fence define active windows; SP/RP track save/restore progress to minimize stalls. Asynchronous Save-Restore: Spiller operates in background, stalling only if bandwidth limits are exceeded or conflicts arise. Pipeline Support: Integrates with fetch/decode/issue/execute stages, maintaining in-order retirement for belt consistency while allowing out-of-order execution. 
Problems Addressed and Benefits Traditional architectures face high register pressure, complex bypass networks, slow context switches, and latency from memory spills. This design mitigates these by using temporal addressing to simplify instructions, dedicated scratchpad for fast access, and asynchronous spilling for low-overhead switches. Benefits include reduced chip area/power from fewer registers, faster subroutine handling, scalable operand storage for nested calls, and improved ILP without traditional hazards. Overall, it enhances Mill\u0026rsquo;s ecosystem for efficient, high-performance processing in modern workloads.\nU.S. Patent 9,747,218 - CPU security mechanisms employing thread-specific protection domains This patent introduces a processor architecture with hardware-enforced security using \u0026ldquo;turfs\u0026rdquo;—thread-specific protection domains that enable fine-grained memory isolation without the overhead of traditional process-based models. Turfs are collections of region descriptors defining memory areas and permissions, allowing threads to dynamically switch domains via portal calls while maintaining a single address space. Protection is hardware-managed through Protection Lookaside Buffers (PLBs) and specialized registers, supporting secure inter-domain communication and execution.\nKey Features Turfs and Region Descriptors: Turfs group descriptors specifying memory regions (bounds, permissions like read/write/execute/portal), associated with thread-turf ID pairs; supports wildcards for flexible access. Protection Lookaside Buffers (PLBs): Separate iPLB (instruction) and dPLB (data) for fast permission checks; backed by a Region Table in memory; includes a \u0026ldquo;novel\u0026rdquo; bit for efficient eviction and revocation tracking. 
Hardware Registers: Turf-specific (e.g., code/constants/data bases), thread-turf-specific (stacklets for isolated stacks), and thread-specific (local storage) registers for bypassing PLB queries on common accesses. Stacklets: Per-thread-turf stack segments with info blocks (top-of-stack pointer, base, limit); chained for nested calls, auto-allocated in reserved virtual space. Portal Entries: Memory structures (turf ID, entry address, state) enabling secure domain switches via portal-type CALLs. Operations: GRANT/REVOKE for persistent permission delegation; PASS for temporary grants; ARGS for argument reservation; portal/normal CALL/RETURN for control flow with domain handling. Core Mechanisms Access Enforcement: For LOAD/STORE, compute virtual address and check hardware registers first (e.g., stacklet descriptors); if unmatched, query dPLB for permissions—fault on violation. Instruction fetches use iPLB for execute/portal checks. Domain Switching: Portal-type CALL evaluates portal pointer, saves caller state externally (via Spiller), loads callee turf/state, initializes stacklet, and executes in new domain. RETURN unwinds, restores state from storage/memory, and revokes temporary grants. Permission Management: GRANT adds descriptors to PLBs; REVOKE removes them (lazy via novel bit); PASS modifies descriptors to wildcard turf for call duration, auto-revoked on RETURN. Argument Passing: ARGS reserves stack space (OutP register); portal CALL copies to callee\u0026rsquo;s InP for secure access without full sharing. Pipeline Integration: Operates in execute/retire stages with Branch and Load/Store Units; supports speculation but makes portal RETURN non-speculative for security; prefetching minimizes stalls. Problems Addressed and Benefits Traditional systems rely on heavy process switches for isolation, incurring high overhead (e.g., cache/TLB invalidation, hundreds of cycles) and error-prone shared buffers. 
This design addresses these by enabling lightweight turf switches without thread changes, secure coroutine-like calls, and hardware-bypassed checks for efficiency. Benefits include reduced power/latency via parallel PLB queries, scalable support for thousands of domains, prevention of exploits (e.g., stack smashing), and flexible rights delegation—all while integrating with Mill\u0026rsquo;s high-ILP features like operand metadata and memory hierarchies.\nU.S. Patent 9,747,238 - Computer processor employing split crossbar circuit for operand routing and slot-based organization of functional units This patent details a processor architecture that optimizes operand handling and routing using a \u0026ldquo;belt\u0026rdquo; for transient storage with temporal addressing, a split crossbar interconnect for efficient bypass routing, and slot-based grouping of functional units to manage mixed-latency operations. It also includes mechanisms for subroutine calls/returns with private per-frame storage.\nKey Features Belt Storage: A fixed-length, conveyor-like queue for transient operands, using temporal addressing based on production order rather than physical registers. Split Crossbar Circuit: Divided into lower (for single-cycle results) and upper (for multi-cycle results) sections to reduce routing latency and complexity. Slot-Based Functional Units: Groups of functional units that produce results of varying latencies in parallel, with separate output registers per latency type. Scratchpad Memory: Byte-addressable storage for longer-lived operands, accessed via SPILL/FILL operations. Spiller Unit: Handles context saving/restoring for belt and scratchpad during CALL/RETURN, ensuring isolated frames. Core Mechanisms Temporal Addressing and Belt Operation: Operands are injected at the belt\u0026rsquo;s front in production order, shift backward each cycle, and retire from the end. 
Instructions reference operands by logical positions (e.g., Add(3,4) adds the 4th and 5th most recent). Supports multiple drops per cycle with ordering rules. Operand Routing via Split Crossbar: Single-cycle results route directly through the lower crossbar to any consumer. Multi-cycle results (e.g., 2-4 cycles) route via the upper crossbar to the lower for distribution, minimizing wire delays and power. Slot Organization: Each slot shares input/output paths and executes mixed operations (e.g., add in 1 cycle, multiply in 3). Results store in latency-specific registers (lat-1 to lat-4), shifting in a daisy-chain to align latencies; overflows spill to buffers. CALL/RETURN Handling: CALL saves caller\u0026rsquo;s state asynchronously via spiller, creates callee\u0026rsquo;s private belt/scratchpad, and copies arguments. RETURN restores and copies results back. Scratchpad uses window mapping (Base/Fence registers) for dynamic allocation. Integration with Pipeline: Decode maps temporal addresses to physical; issue dispatches to slots; execute uses bypass for direct producer-consumer routing. Problems Addressed and Benefits Traditional processors rely on spatial registers, leading to high rename overhead, complex bypass networks, and inefficient multi-latency handling. This design addresses these by eliminating registers for transients, reducing bypass costs via split crossbars, and enabling parallel mixed-latency execution in slots. Benefits include lower power and area through simplified routing, higher ILP via temporal addressing and private frames, efficient context switches without stalls, and better scalability for wide-issue pipelines. It complements Mill\u0026rsquo;s other innovations like operand metadata and loop pipelining for high-performance computing.\nU.S. 
Patent 9,785,441 - Computer processor employing instructions with elided nop operations This patent describes a processor architecture that processes two parallel instruction streams with a predefined timed semantic relationship, using variable-length instructions that incorporate an \u0026ldquo;alignment hole\u0026rdquo; to implicitly encode NOP (no-operation) counts. This elides explicit NOP instructions, optimizing for parallel decoding and execution in a split-stream design.\nKey Features Dual Instruction Streams: Two distinct streams (Stream I and Stream II) within instruction blocks, flowing in opposite directions in memory (Stream I forward-increasing, Stream II reverse-decreasing) from a shared entry address. Variable-Length Instruction Format: Each instruction includes a fixed-length header (with overall length and block slot counts) and a variable-length bit bundle divided into forward (head-to-tail) and reverse (tail-to-head) operation blocks. Alignment Hole: A variable-position gap (0-7 bits) between forward and reverse block groups, encoding a binary count of implicit NOP operations without using explicit NOP encodings. Parallel Processing Components: Two multi-stage pipelines (each with program counter, fetch unit, instruction buffer, decode stage, and execution logic) that handle the streams independently but synchronously via NOP-induced stalls. Stream Specialization: Streams can be assigned different instruction classes (e.g., Stream I for flow-control and memory operations, Stream II for computational operations). Core Mechanisms Instruction Organization: Instructions are grouped into blocks (extended basic blocks) with a single entry and multiple exits. Streams share cache lines at entry points but diverge directionally to reduce thrashing in separate L1 caches. Decoding Process: Header parsing in the first sub-stage extracts length and block fields, enabling speculative decoding of initial forward blocks. 
Forward blocks are decoded sequentially from head to tail; reverse blocks from tail to head, allowing parallel processing (e.g., Block 2F and Block 2R simultaneously). Shifter logic isolates blocks using header-derived tap values; the alignment hole is processed after all blocks to update a running NOP counter. NOP counts accumulate across instructions and trigger stalls in the opposing stream\u0026rsquo;s decode or issuance stages, either immediately or in subsequent cycles. Execution and Synchronization: Decoded operations are issued to functional units grouped by operand count (e.g., dyadic, triadic). Timed semantics are enforced by stalling the lagging stream via implicit NOPs, without data dependencies across streams except through control signals. Alternate Formats: Supports extensions for larger holes or lag operations if the standard hole size is insufficient. Problems Addressed and Benefits Traditional processors with timed semantics require explicit NOPs to synchronize parallel operations, wasting memory, fetch bandwidth, and power—especially in variable-length or VLIW architectures where decoding is serialized and stalls are frequent. Variable-length instructions complicate parallel decoding, and shared caches lead to thrashing in multi-stream designs. This invention mitigates these by:\nEliding explicit NOPs through alignment holes, reducing code size, memory usage, and processing overhead. Enabling fast parallel decoding with double-ended processing, minimizing latency in high-ILP pipelines. Improving cache efficiency via separate L1 caches per stream, supporting larger working sets without thrashing. Maintaining precise timed relationships between streams with low-cost stalls, enhancing performance in split-stream architectures. Providing economical encoding for variable-length instructions, making it scalable for modern workloads. 
Overall, this design aligns with Mill Computing\u0026rsquo;s focus on efficient, high-parallelism processors, complementing related patents on double-ended decoding, bypass networks, and operand metadata by optimizing synchronization in dual-stream execution.\nU.S. Patent 9,817,669 - Computer processor employing explicit operations that support execution of software pipelined loops and a compiler that utilizes such operations for scheduling software pipelined loops This patent details a processor architecture and associated compiler techniques that facilitate efficient software pipelining of loops by using explicit operations to manage operand retirement and loop frames, eliminating the need for prologues and epilogues while optimizing for high instruction-level parallelism (ILP).\nKey Features of the Operand and Execution Model Logical Belt: A conveyor-like structure for storing transient operands, with temporal addressing based on production order and operation latency (e.g., add drops immediately, mul after 3 cycles). Operands are dropped to the front and retired from the rear. Operand Metadata: Includes scalarity (scalar/vector), element width, floating-point flags, Not-a-Result (NAR) for errors, and None for missing values. None Operand: Represents absent data; propagates through speculable operations (e.g., arithmetic) and causes non-speculable operations (e.g., stores) to skip without side effects. Scratchpad Memory: Byte-addressable storage for long-lived operands, with logical-to-physical mapping via rotators (supporting wrap-around for loop-carried values) and operations like SCRATCHF (allocate), SCRATCHD (deallocate), and ROTATE (shift cursor). Extended Scratchpad: Hierarchically extends scratchpad into main memory for larger needs. Spiller Unit: Handles context saving/restoring for belt and scratchpad across calls/loops. Core Mechanisms RETIRE Operation: Explicitly specifies a static count of operands to retire in a machine cycle. 
If fewer operands are actually produced, it inserts None values to make up the count; if more, it faults; if the counts match, it is a no-op. Variant RETIRE WITH WIDTH LIST adds scalarity/width tags for validation. INNER and LEAVE Operations: INNER creates a new loop frame whose belt holds only the passed arguments, without altering function context. LEAVE restores the prior belt, discards in-flight operations, and drops exit arguments, avoiding epilogues. Software Pipelining: Compiler schedules loops to enter steady-state directly using RETIRE at the loop head to simulate missing operands with None during warmup phases. Overlaps iterations by leveraging belt temporal ordering and latency-based drops. Loop Frame Management: Supports nested loops via multiple rotators and private belts/scratchpads per frame, with ROTATE advancing physical addresses per iteration while keeping logical addresses static. Speculation and Error Handling: None/NAR propagation ensures safe speculative execution; non-speculable ops skip on None, accumulating FP flags only on commits. Problems Addressed and Benefits Traditional software pipelining requires prologues to fill the pipeline and epilogues for cleanup, wasting code space, fetch bandwidth, and execution time, especially for low-iteration or nested loops. Fixed rotating registers (e.g., in Itanium) consume space inefficiently for variable-sized operands, increase register pressure, and necessitate spills. This design addresses these by:\nEliminating prologues/epilogues through RETIRE and INNER/LEAVE, reducing code size and latency for short loops. Enabling direct steady-state entry, boosting ILP and performance in parallel pipelines. Providing flexible storage via belt (transients) and scratchpad (long-lived), minimizing waste and spills with byte-packing and dynamic allocation. Supporting nested/search-style loops efficiently with rotators and frame isolation, without fixed-size constraints. 
Enhancing compiler flexibility for optimizations like JIT, with machine-dependent scheduling for operand lifetimes and latencies. Overall, this invention advances Mill Computing\u0026rsquo;s high-ILP architecture by integrating with features like operand metadata, bypass networks, and memory optimizations, delivering prologue-free pipelining for robust, efficient loop execution in modern workloads.\nU.S. Patent 9,875,106 - Computer processor employing instruction block exit prediction This patent introduces a processor architecture that predicts exit points of instruction blocks (Extended Basic Blocks or EBBs) using a table of predictors and an associative cache to form a chain guiding prefetch, fetch, decode, and execution. This reduces pipeline stalls from branch mispredictions by enabling run-ahead processing of expected control-flow paths.\nKey Features Instruction Blocks (EBBs): Sequences with a single entry point and multiple exit points via control transfers (branches, calls, returns); divided into fragments starting at entry/return points and ending at calls or exits. Exit Table: Direct-mapped hash table storing predictor entries for fragments, indexed by unique keys (arithmetically derived for collision avoidance). Predictor Entries: Include target address (offset or absolute), cache-line count, instruction count, transfer kind (branch, call, return), untaken-call count, quality counter, loop iteration count, and alternate key for misprediction recovery. Exit Cache: Small associative cache for recently used predictors, reducing table access latency and supporting one-per-cycle chaining. Prediction Chain: Linked sequence of predictors queried from the table/cache, used to drive hardware components. Return Stack and Loop Stack: Handle return resolutions and loop iterations, with extensions for in-flight operations. 
Bulk Loading: Toolchain-generated predictor sets embedded in program images, loaded into the table/cache on cold starts to minimize initial mispredictions. Core Mechanisms Prediction Process: On entering a block/fragment, query the Exit Table/Cache with the key to retrieve a predictor; chain subsequent predictors based on expected exits, forming a FIFO queue. Prefetch and Fetch: Prefetcher uses predictor address and cache-line count for exact line prefetching (no wasteful speculation); fetcher loads lines into an instruction buffer, resolving returns via the Return Stack. Decode and Shifting: Decode control logic consumes the queue head to manage an instruction shifter, isolating instructions based on counts; supports overflow via saturated values and pessimistic fetching. Misprediction Recovery: Detect mismatches during execution; discard decode state, insert dummy predictors for continued shifting, rebuild chain from actual exit using alternate keys, and update quality counters (saturating for success/failure tracking). Loop Handling: Predictors store toolchain-supplied iteration counts; hardware loop stack decrements on back-branches, switching to exit paths when exhausted. Deferred Branches: Annotate branches with deferral cycles; if unresolved, inject new chains into the Exit Cache for speculative prefetch. Split-Stream Encoding: Supports dual (forward/backward) address streams with half-predictors for independent decoding. Update Queue: Buffers post-execution adjustments to predictors, ensuring non-blocking updates. Problems Addressed and Benefits Traditional branch prediction uses per-branch tables, leading to high misprediction penalties (e.g., pipeline flushes costing 30+ cycles), table size bloat, power inefficiency, and wasteful prefetching. This block-level approach mitigates these by:\nEnabling run-ahead chaining for prefetch/fetch/decode, minimizing stalls and improving throughput in high-ILP pipelines. 
Reducing hardware overhead with compact predictors capturing full control-flow context, including loops and calls/returns. Enhancing cold-start performance via bulk loading, avoiding repeated mispredictions in new code paths. Lowering power and latency through exact prefetching, quality-based maintenance, and lightweight recovery (no full flushes). Supporting complex structures like nested loops and variable-length instructions with metadata-rich predictors. Overall, the invention optimizes control-flow prediction for modern processors, aligning with Mill Computing\u0026rsquo;s architecture by integrating with features like variable-length decoding and bypass networks for efficient, stall-resistant execution.\nU.S. Patent 9,959,119 - Computer processor employing double-ended instruction decoding This patent describes a processor architecture that uses double-ended decoding to efficiently handle variable-length instructions. Each instruction consists of a fixed-length header and a variable-length bit bundle organized into slots and blocks, divided into forward (head-to-tail) and reverse (tail-to-head) groups. The decode stage processes these in parallel, starting from both ends, to enable fast, simultaneous decoding of operations while minimizing hardware complexity.\nKey Features Instruction Format: Fixed-length header with fields for overall length and slot counts per block; variable-length bit bundle partitioned into fixed-length slots grouped into blocks (e.g., up to four, like Block 1F, 2F, 3F for forward; Block 3R, 2R for reverse). Double-Ended Partitioning: Forward group extends from the head end toward the tail; reverse group (including tail-end block) from the tail toward the head, allowing simultaneous processing from both ends. Alternate Encodings: \u0026ldquo;Svelte\u0026rdquo; for compact headers with fixed slots; \u0026ldquo;Skinny\u0026rdquo; for table-indexed common operations to bypass full decoding. 
Shifter and Decoder Logic: Hardware shifters isolate blocks using header-derived tap values; parallel parser/decoder arrays handle fixed-length parsing and opcode decoding. Core Mechanisms Decoding Process: Instructions are stored in an instruction buffer and aligned in a double shifter. The header is processed first to generate control signals and shifter tap values for block isolation. Parallel Decoding: Forward blocks are decoded sequentially in forward order across pipelined sub-stages; reverse blocks in reverse order, starting from the tail-end block. This enables parallel handling (e.g., Block 2F with Block 3R) without full sequential scanning. Speculative and Gated Execution: Head-end blocks (e.g., Block 1F) are decoded speculatively assuming maximum slots, with invalid results gated based on actual header counts. Shifter Circuitry: Uses N-way mux trees for logarithmic-speed alignment; left-aligns forward blocks and right-aligns reverse ones, supporting variable bundle sizes up to 64. Integration: Complements the processor\u0026rsquo;s execution logic, allowing decoded operations to issue in parallel while maintaining semantic order. Problems Addressed and Benefits Traditional variable-length instruction decoding requires serial parsing to locate boundaries, leading to bottlenecks, high hardware costs, and delays in parallel execution (e.g., in CISC or VLIW architectures). This double-ended approach mitigates these by:\nEnabling parallel decoding from both ends, reducing latency and supporting high instruction-level parallelism without complex sequential logic. Improving efficiency and compactness through header-guided isolation and alternate encodings, saving memory bits and power. Lowering hardware overhead with fixed-length sub-parsing and logarithmic shifters, making it scalable for modern processors. Enhancing performance in out-of-order or multi-threaded environments by minimizing decode stalls and facilitating compact, variable-length formats. 
Overall, the invention aligns with Mill Computing\u0026rsquo;s high-ILP architecture, complementing related patents on operand metadata, bypass networks, and memory optimizations for more efficient, power-aware processing.\nU.S. Patent 9,965,274 - Computer processor employing bypass network using result tags for routing result operands This patent describes a processor architecture featuring a bypass network that uses dynamically generated result tags to efficiently route operand data directly from producing functional units to dependent instructions, enhancing data forwarding in pipelined, multi-cycle execution environments.\nKey Features Result Tags: Numeric identifiers dynamically assigned to result operands, ensuring unique tracking and collision-free routing. Bypass Network: Provides dedicated paths for broadcasting result data along with tags, enabling direct forwarding without register file or memory access. Tag Match Mux Control Circuits: Hardware selectors with comparators that match tags to select and route the correct operands to functional units. Dynamic Tag Generation: Based on operation slots, latencies, and valid bits, allowing predictable and reusable tags akin to register allocation. Core Mechanisms The bypass network operates by associating each result operand—produced by functional units over multiple machine cycles—with a result tag generated dynamically. Tag generation circuitry increments tags starting from an initial value, incorporating the number of valid results from prior slots or latencies to prevent overlaps. These tags, paired with valid bits, are broadcast over bypass paths alongside the result data. For operand selection, tag match mux circuits compare broadcast tags against source tags specified in the instruction\u0026rsquo;s operation field. Upon a valid match, the corresponding data is forwarded directly to the consuming functional unit, supporting single or multiple matches and including fault detection for mismatches. 
This integrates seamlessly with the processor pipeline, reducing stalls by enabling fast forwarding in out-of-order or speculative execution scenarios.\nProblems Addressed and Benefits Conventional bypass networks rely on complex wiring, address-based lookups, or multi-stage comparators, resulting in high hardware overhead, routing congestion, propagation delays, and scalability issues in high-performance processors. This invention mitigates these by employing lightweight, predictable result tags for routing, simplifying logic and reducing interconnect complexity. Benefits include enhanced instruction throughput via faster data forwarding, lower power consumption and area usage through binary selectors and fewer wires, improved performance in multi-unit, variable-latency environments, and better overall efficiency in pipelined architectures. It aligns with Mill Computing\u0026rsquo;s focus on high-ILP designs, complementing related patents on operand metadata and memory optimizations.\nU.S. Patent 10,678,700 - CPU Security Mechanisms employing Thread-Specific Protection Domains This patent details a hardware-based security system for processors, introducing \u0026ldquo;turfs\u0026rdquo;—lightweight, thread-specific protection domains that enable fine-grained memory isolation and access control. Turfs allow secure execution of code from varying trust levels within the same thread and address space, minimizing overhead compared to traditional processes or privilege rings.\nKey Features of Protection Domains (Turfs) Turf Structure: Each turf is identified by a unique turfID and defines memory regions with specific permissions (e.g., read, write, execute, portal/grant). It includes:\nWell-Known Regions (WKRs): Hardware-optimized predefined areas for common accesses like code (cWKR), data/constants (dWKR), stack, thread-local storage (TLS), and null (nWKR) for fast NULL checks. 
Permission Tables: Backed by instruction and data Protection Lookaside Buffers (iPLB and dPLB) for efficient lookups. Thread Integration: Security state is tied to thread-turf pairs (threadID + turfID), supporting thousands of domains per process with low resource use.\nPermission Types: Includes transient (temporary, call-scoped) and persistent (longer-term delegation) grants to safely share access without permanent exposure.\nCore Mechanisms Access Enforcement: Memory operations (loads, stores, fetches) first check against turf-specific WKRs for fast-path approval. Misses query PLBs and permission tables; violations trigger faults. Hardware ensures isolation, with guard bits on pointers for provenance tracking and session IDs to prevent cross-turf leaks. Domain Switching via Portals: Portals are memory objects with a \u0026lsquo;p\u0026rsquo; permission bit, containing target turfID, entry point, and setup info. A portal CALL switches turfs securely: spills caller state to a spillet, passes arguments, activates new permissions, and revokes transients on return. Operations like GRANT, RELAY, and PERSIST handle permission delegation; supports nested calls and returns. Stack and State Handling: Per-turf stacklets isolate frames to prevent overflows or smashing. Hardware spiller manages register/belt state across switches; ARGS reserves argument frames. Separate call and data stacks enhance security. Integration with Processor Features: Complements Mill\u0026rsquo;s belt architecture, operand metadata, and high-ILP design by embedding security checks in hardware pipelines. Problems Addressed and Benefits Conventional security models (e.g., rings, processes, capabilities) suffer from high context-switch costs, coarse granularity, or complexity, making them inefficient for fine-grained isolation in performance-sensitive code. Turfs address this by:\nOffering lightweight, hardware-enforced domains with minimal overhead for switches (no TLB flushes or full saves). 
Enabling secure intra-thread interactions (e.g., user code calling libraries or services) without privilege escalation or shared vulnerabilities. Reducing performance penalties through WKRs and fast checks, while preventing attacks like confused deputy or buffer overflows. Enhancing flexibility and scalability for sandboxes, plugins, or microservices in a single address space. Overall, this invention provides robust, efficient security tailored to modern high-parallelism processors, integrating seamlessly with other Mill innovations for secure, high-performance computing.\nU.S. Patent 10,802,987 - Computer processor employing Cache Memory storing Backless Cache Lines This patent introduces a processor architecture that optimizes virtual memory management through \u0026ldquo;backless cache lines\u0026rdquo;—cache entries associated with virtual addresses that lack immediate backing by valid physical memory. This enables lazy physical memory allocation, reducing unnecessary overhead for transient or unused data while maintaining compatibility with standard paging systems.\nKey Features of Backless Cache Lines Backless cache lines are defined by virtual addresses that either:\nLack a corresponding page table entry (in virtual caches), or Have a page table entry pointing to an invalid or \u0026ldquo;pending\u0026rdquo; physical address (in physical caches). These lines reside solely in cache without physical backing until triggered to transform into a \u0026ldquo;backed\u0026rdquo; state. Key attributes include:\nZero-Filled Initialization: Loads from backless lines return zero-value bytes, simulating uninitialized memory without physical access. Temporary Single-Line Pages: During transformation, data may be stored in cache-line-sized physical pages before migrating to larger OS-specified pages (e.g., 4KB). Page Table Markings: Use indicators like \u0026ldquo;pending\u0026rdquo; or dummy addresses to denote backless status, facilitating quick detection and updates. 
Support for Write-Back and Write-Through Caches: Backing is deferred until eviction (write-back) or performed immediately on store (write-through), depending on cache policy. The system integrates with Translation Lookaside Buffers (TLBs), Protection Lookaside Buffers (PLBs), and Memory Management Units (MMUs) for efficient translation and protection.\nCore Mechanisms Load and Store Handling: Hits: Process normally using cached data. Misses: For backless addresses, allocate a zero-filled cache line and apply the operation without physical memory involvement. Eviction and Backing Transformation: In write-back caches, eviction forces allocation of physical space (e.g., single-line page), data write, and page table update to a valid mapping. In write-through caches, stores trigger immediate backing and physical write. OS involvement may occur for page resizing or pool replenishment via interrupts. Virtual Caches: Rely on missing page table entries for backless identification; use PLBs for protection. Physical Caches: Employ \u0026ldquo;pending\u0026rdquo; markings or dummy address spaces; MMUs handle translations, with secondary tables for state transitions. Optimization Features: Pre-allocated zeroed page pools reduce latency; flows ensure no data loss during evictions. Detailed flowcharts (e.g., FIGS. 5–7 for virtual caches, FIGS. 9–12 for physical caches) outline these processes, emphasizing minimal OS intervention for cache-resident operations.\nProblems Addressed and Benefits Traditional virtual memory systems eagerly allocate and initialize physical pages on first access, leading to high memory traffic, power consumption, and overhead for sparse or transient data (e.g., large mmap regions or cache-fitting programs). Backless cache lines mitigate this by:\nEnabling lazy allocation: physical memory is committed only when data must persist beyond the cache (e.g., on eviction). Reducing memory traffic and latency through in-cache handling of backless data with zero-filled responses. 
Lowering power and costs by avoiding unnecessary physical interactions and initializations. Improving performance in single-address-space models without aliasing or extension issues. Enhancing scalability for dynamic workloads, with seamless integration into existing hierarchies and OS functions. Overall, this invention aligns with Mill Computing\u0026rsquo;s focus on efficient, high-performance architectures, complementing related patents on operand metadata and memory optimizations by minimizing virtual memory bottlenecks.\nU.S. Patent 11,226,821 - Computer processor employing Operand Data with Associated Meta-data This patent describes a processor architecture where operands are handled as unitary data elements, each combining the actual data value (payload) with embedded metadata. This metadata provides contextual information that enhances execution efficiency, particularly in speculative operations, error handling, and vectorized (SIMD) processing.\nKey Features of the Operand Model Each unitary operand includes:\nPayload Data: The core value, which can be scalar (single element) or vector (multiple elements).\nMetadata: Descriptive tags that travel with the data, including:\nType indicator (scalar or vector). Elemental width (power-of-two byte sizes, e.g., 1, 2, 4, or 8 bytes). Floating-point error flags (per IEEE-754, such as invalid operation, divide-by-zero, overflow, underflow, or inexact). Special values: Not-a-Result (NAR): Signals errors like invalid memory access or arithmetic issues; includes debugging details (error kind and location) for better diagnostics. None: Represents missing or invalid data, useful for masking unused elements in vectors or loops. 
Functional units in the processor treat these unitary elements holistically, using metadata to guide operations without relying on instruction opcodes for type or size specifics.\nCore Mechanisms Width Polymorphism: Operations (e.g., addition) adapt automatically to different data widths based on metadata, reducing the need for multiple instruction variants and simplifying the instruction set architecture (ISA). Memory Interaction: Metadata is processor-internal; loads from memory add appropriate metadata, while stores strip it and write only the payload, maintaining compatibility with standard memory systems. Speculative Execution: Supports aggressive speculation by propagating NAR or None through operations without immediate side effects. Non-speculative operations (e.g., stores or branches) trigger faults or skip actions when encountering these special values. Floating-Point Handling: Error flags accumulate (via logical OR) during speculative chains and update global state only on non-speculative commits, ensuring precise exception handling. Error Management: Includes operations like splitNAR/joinNAR for manipulating NAR states, and a no-speculation mode for immediate fault reporting during debugging. Specialized Vector Operations The architecture introduces operations to facilitate efficient vectorization of loops that are challenging in traditional processors:\nPick: A conditional select (ternary-like) that chooses between operands based on a Boolean control (scalar or vector), propagating None to avoid unwanted updates. Vector Smear (inclusive/exclusive): Converts a Boolean vector into a mask by propagating the first \u0026ldquo;true\u0026rdquo; value, ideal for while-loop termination without scalar checks or cleanup code. Remaining: For counting loops, generates a mask for partial vectors, using None to ignore excess elements and prevent side effects. 
Satisfied: Counts leading unsatisfied iterations in search-style loops, aiding precise termination detection in vectorized while loops. These operations minimize branching, eliminate scalar fallback code, and enable safe SIMD processing of irregular or conditional loops.\nProblems Addressed and Benefits Conventional processors separate data from metadata, leading to complex ISAs, costly speculation recovery, and difficulties in vectorizing non-uniform loops. By integrating metadata directly with operands, this design:\nSimplifies the ISA and compiler efforts through polymorphism. Enhances speculation safety and efficiency with low-overhead error propagation. Improves vectorization for dynamic loops, boosting performance in data-parallel tasks. Provides better debugging via embedded error info and robust error detection (e.g., mismatched widths). Overall, the invention promotes higher instruction-level parallelism, resource efficiency, and reliability in modern computing workloads while remaining backward-compatible with existing memory hierarchies.\n","date":"27 October 2016","externalUrl":null,"permalink":"/patents/","section":"Pages","summary":" U.S. Patent 9,513,920 - Computer Processor Employing Split-Stream Encoding This patent is for the split-stream method of encoding that permits Mill CPUs to decode extremely wide instructions (those with many independent operations in each instruction) using a compact and flexible variable-length encoding that is also fast to decode. The stream of instructions that is the program being executed is split into two streams of half-instructions that are stored separately in memory but are processed in lock-step by the CPU decoder. 
Because wide fixed length instruction encodings use an impractical amount of cache and memory, and variable-length encodings take time polynomial in the width, instruction decode on legacy CPUs is limited to eight operations per cycle or fewer; Mill split-stream encoding supports instruction widths of over thirty operations decoded per cycle. In addition, split-stream permits doubling the amount of instruction cache with no clock or pipeline penalty.\n","title":"Patents","type":"pages"},{"content":" Mentions of the Mill, Mill Computing and Out-of-the-Box Computing in external websites. Their presence here is not an endorsement of their contents by Mill Computing, Inc., nor any statement regarding the accuracy of the information given by them. To reinvent the processor\nMar 31, 2019 - medium.com\nThe Mill CPU Architecture: Switches\nJuly 17, 2017 - Hacker News\nHow Many X86-64 Instructions Are There Anyway?\nMarch 16, 2017 - Hacker News\nMill Computing in 2017\nFebruary 23, 2017 - reddit.com\nMill Computing in 2017\nJanuary 12, 2017 - Hacker News\nWindows zero-day exploit used in targeted attacks by FruityArmor APT\nOctober 22, 2016 - reddit.com\nApple reportedly dropping Samsung for not only A10 in iPhone 7 but also A11 in iPhone 8\nJuly 18, 2016 - 9to5mac\nWhat is the performance impact of virtual memory relative to direct mapped memory?\nMay 2, 2016 - Stack Overflow\nEfficient Integer Overflow Checking in LLVM\nApril 7, 2016 - Embedded in Academia\nSingle address spaces: design flaw or feature?\nFebruary 28, 2016 - Hacker News\nMill CPU Architecture\nFebruary 19, 2016 - EEVblog Electronics Community Forum\nIf a number is too big does it spill over to the next memory location?\nJanuary 20, 2016 - Software Engineering Stack Exchange\nSystem Security: A Model from Medieval History\nJanuary 18, 2016 - System Design Journal\nLLVM Meets the Truly Alien: Mill CPU Architecture\nAugust 24, 2015 - Hacker News\nLLVM Meets the Truly Alien: Mill CPU Architecture\nAugust 24, 
2015 - reddit.com\nMill CPU Architecture\nAugust 9, 2015 - everything2.com\nThe Mill CPU Architecture – The Compiler\nJuly 8, 2015 - reddit.com\nThe Mill\nJuly ??, 2015 - liberomnia.org\nHow the Mill CPU does fork()\nin a single address space April 16, 2015 - reddit.com\nThe Mill CPU\nJuly 31, 2014 - Kevin\u0026rsquo;s\u0026rsquo; Blog\nMill CPU Architecture\nJuly 23, 2014 - wikiwand.com\nMill CPU Architecture\nJuly 23, 2014 - wikipedia.org\nMill CPU Architecture\nApril 17, 2014 - prezi.com\nMill CPU Architecture Overview\nMarch 31, 2014 - Henri Tuhola\n“The Mill” – It just might Work!\nFebruary 26, 2014 - Observations from Uppsala\nProgrammer\u0026rsquo;s intro to the new Mill CPU architecture\nFebruary 7, 2014 - reddit.com\nComments on the Mill CPU from OotB Computing\nAugust 1, 2013 - Raphael\u0026rsquo;s academic home page\n","date":"15 July 2015","externalUrl":null,"permalink":"/links/","section":"Pages","summary":"Mentions of the Mill, Mill Computing and Out-of-the-Box Computing in external websites. Their presence here is not an endorsement of their contents by Mill Computing, Inc., nor any statement regarding the accuracy of the information given by them. To reinvent the processor\nMar 31, 2019 - medium.com\nThe Mill CPU Architecture: Switches\nJuly 17, 2017 - Hacker News\nHow Many X86-64 Instructions Are There Anyway?\nMarch 16, 2017 - Hacker News\n","title":"Links","type":"pages"},{"content":"Forum Topic: Threading\nTalk by Ivan Godard - December 5, 2017, at Facebook Slides: threading.02(.pptx)\nThis is the thirteenth topic publicly presented on the Mill general-purpose CPU architecture. It covers the methods used to manage threads on the Mill Architecture. The talk assumes some general familiarity with software threads.\nThreading on the Mill CPU The Mill is a new general-purpose CPU architectural family, with novel resource allocation and control facilities that are orders of magnitude more efficient than the equivalents on other CPUs. 
The Mill’s direct hardware support for threading is an important example.\nThread and process creation, preemption, and dispatch are heavyweight operations on conventional operating systems on conventional hardware. As a result, software systems and languages such as Go have been devised with lightweight cooperative threading. Unfortunately, all lightweight systems to date require mutual trust among the participants, and can suffer uncontrolled stalling when encountering kernel-level events such as page traps.\nSpawning a thread, dispatching one for execution, idling it, killing it, and even such apparently unrelated facilities as setjmp/longjmp are all user-mode hardware operations on a Mill CPU, with performance comparable to a normal function call. The talk will describe in detail how this works, with examples from micro-kernel operating systems and concurrent languages like Go.\n","date":"5 May 2015","externalUrl":null,"permalink":"/docs/threading/","section":"Pages","summary":"Forum Topic: Threading\nTalk by Ivan Godard - December 5, 2017, at Facebook Slides: threading.02(.pptx)\nThis is the thirteenth topic publicly presented on the Mill general-purpose CPU architecture. It covers the methods used to manage threads on the Mill Architecture. The talk assumes some general familiarity with software threads.\n","title":"Threading","type":"pages"},{"content":"Forum Topic: Inter-process Communication\nTalk by Ivan Godard - October 4, 2017, at the Silicon Valley Linux Users Group\nSlides: 2017-10-04-IPC.4 (.pptx)\nThis was the twelfth topic publicly presented related to the Mill general-purpose CPU architecture. It covers Inter-Process Communication for the Mill CPU architecture family. The talk assumes a familiarity with aspects of CPU architecture in general and C++ programming in particular.\nThe Mill is a new general-purpose architectural family, with an emphasis on secure and inexpensive communication across protection boundaries. 
The large (page) granularity of protection on conventional architectures makes such communication difficult compared to communication within a protection boundary, such as a function call. As a result, the large granularity has forced communication protocols on conventional architectures into two models: pass-by-sharing (using shared pages), and pass-by-copy (using the OS kernel for files/message passing). Both have drawbacks: sharing requires difficult-to-get-right synchronization, while copy involves kernel transitions as well as the costs of the copy itself.\nThe Mill supports both these protocols, for use by legacy code. However, the Mill hardware also supports inter-process communication using the same program protocols as for intra-process communication and function call: pass-by-value, pass-by-copy, pass-by-reference, and pass-by-name, but all without kernel involvement or overhead. The protocols are secure: neither party can see anything of the other except the explicit arguments to the communication. Neither caller nor callee codes need source changes to replace intra-process communication with Mill inter-process argument passing. However, the pass-by-reference protocol may require use of shims to delimit the extent of sharing in some languages. And granularity is no longer an issue: arguments can be of any size down to the byte.\nThe talk describes the machinery behind the Mill IPC protocols, together with suggestions as to how the hardware facilities may be integrated with representative language runtime systems such as those found in Linux.\nIvan Godard is CTO and a founder of Mill Computing, Inc., developer of the Mill family of general-purpose CPUs. He has written or led the development team for a dozen compilers, an OS, an OODBMS, and much other software. 
Ivan has been active in the field of computers since the 1960s.\n","date":"5 May 2015","externalUrl":null,"permalink":"/docs/inter-process-communication/","section":"Pages","summary":"Forum Topic: Inter-process Communication\nTalk by Ivan Godard - October 4, 2017, at the Silicon Valley Linux Users Group\nSlides: 2017-10-04-IPC.4 (.pptx)\nThis was the twelfth topic publicly presented related to the Mill general-purpose CPU architecture. It covers Inter-Process Communication for the Mill CPU architecture family. The talk assumes a familiarity with aspects of CPU architecture in general and C++ programming in particular.\n","title":"Inter-process Communication","type":"pages"},{"content":"The Mill is a new general-purpose CPU architectural family. The talk will present machine-level details of the Mill support for bigger-than-scalar data.\nMost modern architectures have SIMD operations that work on vectors of data, in addition to the scalars used by all CPUs. Typically there is a limited assortment of vector operations, working on data of a limited set of element sizes, and often a limited set of sizes for the whole vector. To make matters worse for the programmer, the operands and element sizes available frequently vary with the whole-vector size, and succeeding models of the \u0026ldquo;same\u0026rdquo; architecture rarely offer the same vector facilities. As a result, vector codes must usually be written in assembler, and special-cased for each version of the target CPU.\nThe Mill has vector forms of all scalar operations, and all vector operations work for all the element sizes that are supported for scalar; the ISA is completely regular. In addition, vectors may have any number of elements as seen by the program, while hardware parallelism is limited only by the number of functional units supporting the desired operation on the particular Mill family member. 
Unlike the power- and area-hungry vector registers used to hold intermediate vector results on a conventional CPU, the Mill has no vector registers (nor any general registers at all), but uses the Belt (a single-assignment forwarding network) for vectors as well as scalars.\nWhile other CPUs support bigger data in the form of vectors, none have support for irregular data forms such as structs, records or objects. Instead, individual fields of such objects are treated as single scalars. The Mill contains several facilities that treat composites as unitary objects rather than as a collection of fields, significantly improving code and performance.\n","date":"5 May 2015","externalUrl":null,"permalink":"/docs/wide-data/","section":"Pages","summary":"The Mill is a new general-purpose CPU architectural family. The talk will present machine-level details of the Mill support for bigger-than-scalar data.\nMost modern architectures have SIMD operations that work on vectors of data, in addition to the scalars used by all CPUs. Typically there is a limited assortment of vector operations, working on data of a limited set of element sizes, and often a limited set of sizes for the whole vector. To make matters worse for the programmer, the operands and element sizes available frequently vary with the whole-vector size, and succeeding models of the “same” architecture rarely offer the same vector facilities. As a result, vector codes must usually be written in assembler, and special-cased for each version of the target CPU.\n","title":"Wide Data","type":"pages"},{"content":"Forum Topic: The Compiler\nTalk by Ivan Godard - June 10, 2015, at the SFBay Association of C/C++ Users NOTE: the slides require genuine Microsoft PowerPoint to view; open source PowerPoint clones are unable to show the animations, which are essential to the slide content. 
If you do not have access to PowerPoint then watch the video, which shows the slides as intended.\nSlides: mill-cpu-compiler.04 (.pptx)\nThis is the tenth topic publicly presented related to the Mill general-purpose CPU architecture. It covers only the tool chain used to generate executable binaries targeted for any member of the Mill CPU architecture family. The talk assumes a familiarity with aspects of CPU architecture in general and C++ programming in particular.\nLLVM meets the truly alien: The Mill CPU architecture in a multi-target tool chain The Mill is a new general-purpose CPU architecture family that forms a uniquely challenging target for compilation – and also a uniquely easy target. This talk describes the Mill tool chain from language front end to binary executable.\nThe Mill tool chain is unusual in that it translates LLVM intermediate representation (IR) not into object code but into a different IR (genAsm), tailored for the Mill architecture family. Then a separate tool, the specializer, converts genAsm input into executable binary code (conAsm) for a particular Mill architecture family member. genAsm is a dataflow language, essentially a programmable representation of a single-assignment compiler IR.\nThe Mill has no general registers. Instead, intermediate results are placed on the Belt, a fixed-length queue, and these operands are accessed by temporal addressing. A Mill operation in effect says “add the third most recent value to drop on the belt to the fifth most recent, and drop the result at the front of the belt, and discard the oldest value from the other end of the belt”. The Mill is also a (very) wide issue machine, and many of these actions are taking place concurrently in each cycle. The tool chain, or rather the specializer component, must track the location of operands as they move along the belt, because their belt address changes as other operations are executed and drop results. 
In addition, the Mill is statically scheduled with an exposed pipeline, so an operation may produce its results several cycles after the operation was issued, possibly with intervening control flow.\nThis belt structure leads to unique needs for operation scheduling and operand spilling. These needs are the rough equivalent of instruction selection, register coloring, and spill on a conventional machine. The talk concludes by explaining the algorithms used.\nSpeaker bio Ivan Godard has designed, implemented or led the teams for 11 compilers for a variety of languages and targets, an operating system, an object-oriented database, and four instruction set architectures. He participated in the revision of Algol68 and is mentioned in its Report, was on the Green team that won the Ada language competition, designed the Mary family of system implementation languages, and was founding editor of the Machine Oriented Languages Bulletin. He is a Member Emeritus of IFIPS Working Group 2.4 (Implementation languages) and was a member of the committee that produced the IEEE and ISO floating-point standard 754-2011.\nIvan is currently CTO at Mill Computing, a startup now emerging from stealth mode. Mill Computing has developed the Mill, a clean-sheet rethink of general-purpose CPU architectures. The Mill is the subject of this talk.\n","date":"5 May 2015","externalUrl":null,"permalink":"/docs/compiler/","section":"Pages","summary":"Forum Topic: The Compiler\nTalk by Ivan Godard - June 10, 2015, at the SFBay Association of C/C++ Users NOTE: the slides require genuine Microsoft PowerPoint to view; open source PowerPoint clones are unable to show the animations, which are essential to the slide content. 
If you do not have access to PowerPoint then watch the video, which shows the slides as intended.\n","title":"The Compiler","type":"pages"},{"content":"","date":"28 March 2014","externalUrl":null,"permalink":"/categories/","section":"Categories","summary":"","title":"Categories","type":"categories"},{"content":"","date":"28 March 2014","externalUrl":null,"permalink":"/categories/faq-general/","section":"Categories","summary":"","title":"Faq-General","type":"categories"},{"content":"Q: What does NYF stand for? Stands for Not Yet Filed - as used here it refers to a patent or patents that are Not Yet Filed. Questions about patentable mechanisms that are NYF cannot be answered without an NDA due to USPTO rules.\nQ: Why no PDF slides? Publishing PDF slides (instead of .pptx) would make them available to more people. A: Unfortunately PDF format doesn\u0026rsquo;t support the animations, and without those the slides are pretty useless. We looked into creating a completely different presentation without the animations, but decided we don\u0026rsquo;t have the bandwidth, and the need for a static format was better served by white papers, of which we have a few in process. We had been using OpenOffice (and tried LibreOffice) but found both useless because of bugs in the animation on some platforms and the generally limited support for animation creation, and so settled on MSOffice as the best of the choices.\nThere is a free PowerPoint Viewer available from Microsoft, although we have had reports of issues with it on the Mac. http://www.microsoft.com/en-us/download/details.aspx?id=6 or Google \u0026ldquo;PowerPoint viewer\u0026rdquo;.\n","date":"28 March 2014","externalUrl":null,"permalink":"/general-questions/","section":"Pages","summary":"Q: What does NYF stand for? Stands for Not Yet Filed - as used here it refers to a patent or patents that are Not Yet Filed. 
Questions about patentable mechanisms that are NYF cannot be answered without an NDA due to USPTO rules.\nQ: Why no PDF slides? Publishing PDF slides (instead of .pptx) would make them available to more people. A: Unfortunately PDF format doesn’t support the animations, and without those the slides are pretty useless. We looked into creating a completely different presentation without the animations, but decided we don’t have the bandwidth, and the need for a static format was better served by white papers, of which we have a few in process. We had been using OpenOffice (and tried LibreOffice) but found both useless because of bugs in the animation on some platforms and the generally limited support for animation creation, and so settled on MSOffice as the best of the choices.\n","title":"General Questions","type":"pages"},{"content":"Forum Topic: Pipelining\nTalk by Ivan Godard – 2014-07-14 at Facebook NOTE: the slides require genuine Microsoft PowerPoint to view; open source PowerPoint clones are unable to show the animations, which are essential to the slide content. If you do not have access to PowerPoint then watch the video, which shows the slides as intended.\nSlides: pipelining.06 (.pptx)\nSoftware pipelining on the Mill CPU: Instant pipeline: add loop, no stirring needed The Mill CPU architecture is very wide, able to issue and execute 30+ independent MIMD operations per cycle. Non-looping open code often cannot use this raw compute capacity, but fortunately \u0026gt;80% of cycles are in loops. Loops potentially have unbounded instruction-level parallelism and can absorb all the capacity available – if the loop can be pipelined. This talk addresses how loops are pipelined on the Mill architecture. 
On a conventional machine, pipelining requires lengthy prelude and postlude instruction sequences to get the pipeline started and wound down, frequently destroying the benefit of pipelining the main body; conventional pipelining can be of negative benefit on short loops, especially “while” type loops whose length is unknown and data dependent. Not so on a Mill: Mill pipelines have neither prelude nor postlude, and early conditional exit has no added cost. Pipelines on conventional machines also have problems with loop-carried data, values produced by one iteration but consumed by another. Conventional code must resort to bucket-brigade register copies, or fail to pipeline altogether. Even architectures like the Itanium, which have special hardware for support, provide it only for the innermost loop. In contrast, the Mill needs no copies and can pipeline outer as well as inner loops. Familiarity with prior talks in this series, especially the Belt and Metadata talks will be helpful but not essential.\nSpeaker bio Ivan Godard has designed, implemented or led the teams for 11 compilers for a variety of languages and targets, an operating system, an object-oriented database, and four instruction set architectures. He participated in the revision of Algol68 and is mentioned in its Report, was on the Green team that won the Ada language competition, designed the Mary family of system implementation languages, and was founding editor of the Machine Oriented Languages Bulletin. He is a Member Emeritus of IFIPS Working Group 2.4 (Implementation languages) and was a member of the committee that produced the IEEE and ISO floating-point standard 754-2011.\nIvan is currently CTO at Mill Computing, a startup now emerging from stealth mode. Mill Computing has developed the Mill, a clean-sheet rethink of general-purpose CPU architectures. 
The Mill is the subject of this talk.\n","date":"24 March 2014","externalUrl":null,"permalink":"/docs/pipelining/","section":"Pages","summary":"Forum Topic: Pipelining\nTalk by Ivan Godard – 2014-07-14 at Facebook NOTE: the slides require genuine Microsoft PowerPoint to view; open source PowerPoint clones are unable to show the animations, which are essential to the slide content. If you do not have access to PowerPoint then watch the video, which shows the slides as intended.\n","title":"Pipelining","type":"pages"},{"content":"Forum Topic: Security\nTalk given by Ivan Godard - 2014-03-21 at Google. NOTE: the slides require genuine Microsoft PowerPoint to view; open source PowerPoint clones are unable to show the animations, which are essential to the slide content. If you do not have access to PowerPoint then watch the video, which shows the slides as intended.\nSlides: Powerpoint (.pptx)\nSecurity and reliability on the Mill CPU: Naughty, naughty; bad program, mustn’t do that! Software bugs have always been a problem, but in recent years bugs have become an even more serious concern as they are exploited to breach system security for privacy violation, theft, and even terrorism or acts of war.\nThe Mill CPU architecture addresses software robustness in three basic ways: it makes impossible many errors and exploits; it detects and reports many errors and exploits that cannot be prevented; and it survives and recovers from many detected errors and exploits. None of these ways involve loss of performance.\nThe talk describes some of the Mill CPU features that defend against well-known error and exploit patterns. 
Examples include:\na call stack structure that cannot be overwritten to redirect execution on return an instruction format that makes “return-oriented programming” exploits very difficult an inter-process protection mechanism that lets applications, server code, and operating systems follow “least privilege” principles These features will be discussed in the context of the overall Mill CPU security model, which defends not only against known errors and exploits, but also against unanticipated future failures.\nSpeaker bio Ivan Godard has designed, implemented or led the teams for 11 compilers for a variety of languages and targets, an operating system, an object-oriented database, and four instruction set architectures. He participated in the revision of Algol68 and is mentioned in its Report, was on the Green team that won the Ada language competition, designed the Mary family of system implementation languages, and was founding editor of the Machine Oriented Languages Bulletin. He is a Member Emeritus of IFIPS Working Group 2.4 (Implementation languages) and was a member of the committee that produced the IEEE and ISO floating-point standard 754-2011.\nIvan is currently CTO at Mill Computing, a startup now emerging from stealth mode. Mill Computing has developed the Mill, a clean-sheet rethink of general-purpose CPU architectures. The Mill is the subject of this talk.\n","date":"15 February 2014","externalUrl":null,"permalink":"/docs/security/","section":"Pages","summary":"Forum Topic: Security\nTalk given by Ivan Godard - 2014-03-21 at Google. NOTE: the slides require genuine Microsoft PowerPoint to view; open source PowerPoint clones are unable to show the animations, which are essential to the slide content. 
If you do not have access to PowerPoint then watch the video, which shows the slides as intended.\n","title":"Security","type":"pages"},{"content":"The Mill is a new CPU architecture designed for very high single-thread performance within a very small power envelope.\nThe Mill has a 10x single-thread power/performance gain over conventional out-of-order (OoO) superscalar architectures, yet runs the same programs without rewrite.\nThe Mill is an extremely-wide-issue, statically scheduled design with exposed pipeline. High-end Mills can decode, issue, and execute over thirty MIMD operations per cycle, each cycle.\nThe Mill architecture is able to pipeline and vectorise almost all loops, including while loops and loops containing calls and flow control.\nThe pipeline is very short, with a mispredict penalty of just four cycles.\nProcessor Models The Mill is a family of processor models. Models differ in various parameters that specify their performance and power constraints, such as the width of their vector units, the functional units, number of pipelines and belt length.\nAll Mills can run the same code, however. Code is compiled to an intermediate representation for distribution, and specialisation for a particular Mill processor model is done by a specialising compiler at software installation time or on-demand as needed.\nThe Belt The Mill is a belt machine. There are no general-purpose registers.\nFunctional units (FUs) can read operands from any belt position. All results are placed on the front of the belt.\nThe belt is a fixed size, and the exact size is dependent on the precise Mill model. When values are inserted at the front of the belt, a corresponding number of old values are pushed off the back end of the belt.\nThe items on the belt are referenced temporally, by their position relative to the front of the belt, which changes as the belt advances. 
The belt advances some number of items after each cycle, depending upon how many values were returned by ops finishing that cycle.\nFor example, add is issued with the operands 2 and 5. This scalar add takes one cycle, so before the next cycle the sum of these two values is inserted at the front of the belt, pushing the oldest item off the back end of the belt.\n[See talk: The Belt]\nExposed Pipeline The number of cycles each operation takes is fixed, and known to the compiler. There are no variable-latency instructions other than load.\nA pipeline can be told to do one operation each cycle. These operations take some number of cycles, e.g. an add may take one cycle but a multiplication takes 3 cycles.\nEven if the pipeline is part-way through an operation, you can still instruct it to start further operations on subsequent cycles, and it will perform these operations in parallel.\nFor example, a 3-cycle multiplication is started. In the next cycle, a 1-cycle addition is started. The addition inserts its result at the front of the belt at the end of the cycle, while the multiplication is still ongoing. On the next cycle, the result of the addition is on the belt and can be used as an input to other operations. The multiplication is nearing completion but the pipeline is ready to start another operation.\nIf two operations in the pipeline finish on the same cycle, they are inserted onto the belt in FIFO order; the operation that was started first is inserted first.\nThere are very few examples of hazards between operations, and they are CPU-model-specific. It may be that two particular operations cannot execute in parallel inside the pipeline, and if this happens the CPU will detect it and fault. But the compiler knows about these exceptions, and does not try to schedule them.\nAn operation can return more than one result. For example, an integer divide operation may return both the quotient and the remainder, as two separate values. 
Furthermore, calls (detailed below) can return more than one value too.\nItems that fall off the end of the belt are lost forever. Best belt usage results from producers producing as close to their consumers as possible.\n[See talk: The Belt]\nScalars and Vectors All Mill arithmetic and logic operations support vectors. Mill operations are SIMD, and scalar values are vectors of length 1.\nEach belt item is tagged with its element width and its scalarity (number of elements). Elements may be 1, 2, 4, 8 or 16 bytes. The vector length is Mill-model-specific.\nIf a particular Mill model does not support particular widths, these are emulated in software transparently by the compiler.\nOperations support all meaningful combinations of operand scalarity. You can add a single integer to every element in a larger vector, for example, or perform a comparison between two vectors.\nThe Mill has narrow and widen operations, which decrease and increase the width of each element in the operand respectively. Narrow produces a vector of half-width elements and widen produces two vectors of double-width elements.\nThe Mill also has widen-versions of add and multiply operations, which return vectors of elements double the width of their widest operands, and therefore cannot overflow.\nThere is an extract operation for extracting elements from vectors to make scalar belt items, and a shuffle operation for rearranging (or duplicating) the elements within a belt item.\nThe meta-data does not encode the data types. Code can interpret an element as a pointer, float, fraction, signed integer or unsigned integer. There are separate operations for pointers, floating point, fractions, signed integers and unsigned integers. The compiler knows the type of the operands, and encodes this in the operations it issues.\nOperations may have different latencies for different element widths. The compiler knows the size and width of all items on the belt at all times, so it knows the latencies. 
The operations themselves do not encode the operand sizes, and the latencies for most sizes are consistent, so the compiler can often share code for templated functions that operate on different-sized primitives, reusing a single shared function.\n[See talk: Metadata]\nInstructions and Pipelines The functional units in a pipeline can, collectively, issue one operation per cycle. But the Mill has lots of pipelines, and each pipeline can issue an operation each cycle, sustained.\nThe Mill CPU is a VLIW (Very Long Instruction Word) machine. The operations that issue each cycle are collectively termed the instruction.\nDifferent pipelines have different mixes of FUs available, and the number of pipelines and their FU mix is Mill-model specific. There may be 4 or 8 integer pipelines and 2 or 4 floating point pipelines, for example. There are pipelines that handle flow control and pipelines that handle loads and stores. In a mid-range Mill “Gold” CPU there are 33 pipelines of which 8 are integer and 2 are floating-point; the Mill Gold CPU can issue 33 operations per cycle, sustained. Phasing Like most microprocessors, the Mill has a multi-stage pipeline.\nThe Mill has fewer stages than conventional architectures such as classic RISC or modern OoO processors which typically have 10-14 stages (the Intel Pentium 4 peaked with 31 stages).\nOn a conventional microprocessor, each instruction passes through several stages. The individual operations in the instruction pass through the same stages at the same time.\nHowever, the Mill issues the individual operations in an instruction in specific phases; Mill phases issue their operations across several cycles.\nThere are five phases in the Mill instruction:\nDecode happens during the first cycle and the reader operations are issued immediately in that first cycle. 
The operations for the other phases are decoded in parallel to the reader phase in the first cycle, and issued in the second.\nIn the second cycle, the main operations such as arithmetic and comparisons are issued.\nAt the end of the second cycle, two special phases run: function calls are dispatched sequentially and return, and then any pick operations run.\nThe final writer phase happens in the third cycle.\nOperations that depend upon the output of other operations must wait for that operation to be completed. On conventional architectures this means that these operations must be in subsequent instructions and execute in subsequent cycles.\nOn the Mill, the results of a phase are available immediately to the operations in the next phase in the same cycle, and so dependent operations can be in the same instruction if those operations are in sequential phases.\nThe Mill can chain up to 6 dependent operations together and execute all of them in a single instruction.\nThis has advantages over a classic approach in tight loops, for example. The Mill is able to perform a strcpy loop as 27 operations in a single instruction, chained together through phasing.\n[See talk: Execution]\nSpeculative Execution The Mill can perform operations speculatively because operand elements have special meta-data and can be specially marked as invalid (Not a Result; NaR) or missing (None). Individual elements in vectors can be NaR or None.\nThe Mill can speculate through errors, as errors are propagated forward and only fault when realised by an operation with side effects e.g. a store or branch.\nA load from inaccessible memory does not fault; it returns a NaR. If you load a vector and some of the elements are inaccessible, only those are marked as NaR.\nNaRs and Nones flow through speculable operations where they are operands. 
If an operand element is NaR or None, the result is always NaR or None.\nIf you try and store a NaR, or store to a NaR address, or jump to a NaR address, then the CPU faults. NaRs contain a payload to enable a debugger to determine where the NaR was generated.\nFor example: a = b? *p: *q; The Mill is able to load both the values pointed at by p and q speculatively because any bad pointers will return NaR and will only fault if the comparison causes them to be stored to a.\nNones are available as a special constant, used by the compiler.\nIf you try and store a None, or store to a None address, or jump to None, the CPU does nothing.\nFor example: if(b) a = *p; The Mill can turn this into a = b? *p: None; which allows *p to be loaded speculatively and executes the store only if the condition is met.\nFloating point exceptions are also stored in meta-data flags with belt elements. The exceptions (invalid, divide-by-zero, overflow, underflow and inexact) are ORed in operations, and the flags are applied to the global state flags only when values are realised.\nThere are operations to explicitly test for None, NaR and floating point exceptions.\nTechnically, None is a kind of NaR; there are several kinds of NaR and the kind is encoded in the value bits. A debugger can differentiate between memory protection errors and divide by zeros, for example, by looking at the kind bits. 
The remaining bits in the operand are filled with the low-order bits of a hash identifying the operation which generated the NaR, so the debugger can usually determine which instruction generated the NaR, even if the NaR has propagated a long way.\nThe None NaR has a higher precedence over all other kinds of NaR so if you perform arithmetic with NaR and None values the result is always None; None is used to discard and mask-out speculative execution.\n[See talk: Metadata]\nVectorising while-loops [See talk: Metadata]\nThe Mill can vectorise almost all loops, even if they are variable and conditional iteration and contain flow of control or calls. It can do this because it can mask-out elements in the vector that fall outside the iteration count. The Mill performs boolean masking of vector elements using pick and smear operations:\nPick The pick operation is a hardware implementation of the ternary if: x = cond? a: b;\nPick is a special operation which takes 0 cycles because it executes at the cycle boundary.\nPick takes a condition operand which it interprets as boolean (all values with the low-order bit set are true) and two data operands to pick between, based on the boolean.\nThe operands can be vector or scalar. If the data operands are vector and the condition operand is vector, the pick performs a per-element pick.\nSmear The smear operation copies a vector of boolean, smearing the first true value it finds to all subsequent elements. For example, 0,1,0,1 smears to 0,1,1,1. (smear operates on vectors of boolean, rather than bits in a scalar.)\nThe smearx operation copies a vector of boolean, offset one element. The first element is always set to false, and the second result is an exit flag indicating if any true elements were encountered. For example, 0,1,0,1 becomes 0,0,1,1: 1 and 0,0,0,1 becomes 0,0,0,0: 1.\nStrcpy example Please do not use unsafe functions like C’s strcpy! However, it is an elegant illustration of loop vectorisation. 
while(*b++ = *a++) {} is straightforward to vectorise on the Mill:\na can be loaded as a vector and b can be stored as a vector. Any elements in a that the process does not have permission to access will be NaR, but this will only fault if we try and store them.\nThe a vector can be compared to 0; this results in a vector of boolean, which is then smearx’ed. This can then be picked with a vector of None into b. The smearx offsetting ensures that the trailing zero is copied from a to b. The second return from smearx, recording if any 0 was found in a, is used for a conditional branch to determine if another iteration of the vectorised loop is required.\nThe phasing of the strcpy operations allows all 27 operations to be executed in just one cycle, which moves a full maximum vector of bytes each cycle.\nRemaining The remaining operation is designed for count loops.\nGiven a scalar, the remaining operation produces a vector of boolean and an exit condition that can be used for picks.\nGiven an array, the remaining operation produces the offset of the first true element.\nFor example, strncpy is a straightforward extension of strcpy. It would take the n count argument and use the remaining operation to produce a boolean vector to act as a mask, which it then ORs with the strcpy’s result vector from comparing to 0.\nMemory [See talk: Memory]\nThe Mill has a 64-bit Single Address Space (SAS) architecture with position-independent code.\nOn-CPU caches all use logical addressing, and virtualisation to main memory pages is performed by a Translation Lookaside Buffer (TLB) between the lowest cache level and main memory. The TLB can divide up DRAM pages on a cache-line granularity. The allocation of virtual memory is done in hardware by the TLB using a free list, and only needs Operating System (OS) intervention if the free list is exhausted.\nMemory accesses on the L1 are checked by a high-speed Protection Lookaside Buffer (PLB) which operates on logical address ranges. 
This is executed in parallel with loads, and masks out inaccessible bytes. The PLB can protect ranges down to byte granularity, although large ranges are more common and efficient.\nPointers Data pointers can be a full 64-bits but can also be 32-bits using special segment registers, and can be relative. There are special segment registers for the stack, generic code, continuations and for thread-local storage (which also enables efficient green threading).\nCode pointers are 32-bit using separate special segment registers, and are usually encoded relative to the EBB entry point, which can be efficiently packed in the instruction encoding.\n64-bit pointers have 3 reserved high-order bits which can be used for marking by the application e.g. by garbage collectors. There are distinct operations for pointer manipulation e.g. the addp operation for adding offsets. These operations preserve the high bits and produce NaRs if these bits are overflowed into.\nLoads The Mill has an exposed pipeline architecture. The latency of all operations is fixed and known to the compiler; this includes the latency of loads, because the compiler schedules when it needs the value.\nThe load instruction specifies the number of future cycles when it will retire, and the hardware has until then to pull the value up through the cache hierarchy. The load latency of top-level data cache is just 3 cycles.\nLoads are, in truth, variable latency but the scheduling of loads in the future hides this latency and stalls are uncommon.\nAn EBB will typically issue its load instructions as early as possible, set to retire close to their consumers, and perform meaningful computations in the intervening cycles.\nA pickup load operation specifies a handle rather than a cycle to retire on. Code can use the pickup operation to retire the load at a later point. 
Pickup loads can be cancelled.\nThe Mill CPU has a fixed number of retire stations, and called code may evict a caller’s in-flight loads if there are no free retire stations. The caller’s loads are serialised to spill and re-issued when the call returns. Loads in-flight across expensive calls can therefore act as a pre-fetch, bringing values from main memory into cache while the expensive calls are executing, even if those calls evict the caller’s retire stations.\nAliasing When a store issues, it broadcasts its store range to all retire stations. If a store range intersects an in-flight load range, the station re-issues its load.\nThis makes the Mill immune to false aliasing, and as the store ranges are propagated to all cores on a chip, immune to false sharing too.\nScratch There is a small amount of per-call-frame very fast memory called scratch that can be used to store long-lived values from the belt. This is where code can store values that would otherwise fall off the end of the belt.\nCode can explicitly spill and fill to and from scratch, which takes just 3 cycles.\nScratch preserves floating point exception flags, NaR and None markers and other meta-data.\nEBBs must declare up-front how much scratch space they need for temporary storage, and this space is private to the call; EBBs cannot see the scratch used by other EBBs, and their scratch is lost when they return.\nStack Stack is special address space that is private to the function that allocates it. It is allocated in fixed cache-line lengths using the stackf operation, and is implicitly initialized to zero. When the function returns, the stack is automatically discarded.\nImplicit Zero and Backless Memory Reads from uninitialised memory are automatically zeroed. 
This includes main memory, and the belt, stack and scratch.\nCache lines and stack lines contain a validity mask and uninitialized values are zeroed by a load.\nReads from memory that is not committed in the TLB return zeros, without committing the RAM. RAM is only committed on-demand when cache-lines are evicted from the bottom of the cache hierarchy and written to main memory.\nCode Blocks and Calls Code is organised into Extended Basic Blocks (EBBs). Each EBB has exactly one entry point, but may have many exit points. Execution cannot fall off the end of an EBB - there is always a terminating branch or return operation.\nExecution can only jump - by branches within an EBB, or calls to an EBB - to the entry point.\nThe Mill has built-in - and therefore accelerated - calling support. There is a call operation, which takes just one cycle. Parameters to the call can be passed in copies of caller belt items (specifying items and their order on the callee’s new belt).\nEBBs that are called are functions, and have their own private belt, stack and scratch. EBBs that are branched to inside a function share this belt and scratch.\nA function can return multiple values back to the belt of the caller. The caller specifies how many belt items it is expecting to be returned; if this does not match the actual number returned, the thread faults.\nCall and branch dispatch takes just one cycle. There is a predictor that is pre-fetching likely-to-be-executed blocks.\nFrom the perspective of the caller, calls behave just like 1 cycle operations. 
If a caller has some other operations in flight during a call, these retire after the call has returned.\nEBBs are a common organisation in compilers, and the Mill target is a near 1:1 translation of the compiler’s EBB and SSA internal representations.\n[See talk: The Belt]\n","date":"7 February 2014","externalUrl":null,"permalink":"/introduction-to-the-mill-cpu-programming-model/","section":"Pages","summary":"The Mill is a new CPU architecture designed for very high single-thread performance within a very small power envelope.\nThe Mill has a 10x single-thread power/performance gain over conventional out-of-order (OoO) superscalar architectures, yet runs the same programs without rewrite.\nThe Mill is an extremely-wide-issue, statically scheduled design with exposed pipeline. High-end Mills can decode, issue, and execute over thirty MIMD operations per cycle, each cycle.\nThe Mill architecture is able to pipeline and vectorise almost all loops, including while loops and loops containing calls and flow control.\n","title":"Introduction to the Mill CPU Programming Model","type":"pages"},{"content":"","date":"7 February 2014","externalUrl":null,"permalink":"/categories/uncategorized/","section":"Categories","summary":"","title":"Uncategorized","type":"categories"},{"content":"Forum Topic: Execution\nTalk by Ivan Godard - 2014-02-05 at Stanford University’s Computer Systems Colloquium (EE380). NOTE: the slides require genuine Microsoft PowerPoint to view; open source PowerPoint clones are unable to show the animations, which are essential to the slide content. If you do not have access to PowerPoint then watch the video, which shows the slides as intended.\nSlides: execution.02 (.pptx)\nInstruction execution on the Mill CPU: Working at Mach 3 A perennial objection to wide-issue CPU architectures such as VLIWs and the Mill is that there is insufficient instruction level parallelism (ILP) in programs to make effective use of the available functional width. 
While software pipelining can reveal large quantities of ILP in loops, in open (non-loop) code studies have calculated maximum ILP in the order of two instructions per cycle (IPC), well below the capacity of even conventional VLIWs never mind super-wide architectures such as high-end Mills. The problem is that the program instructions tend to form chains connected by data dependencies, precluding executing them in parallel.\nThis talk addresses the ILP issue, describing how the Mill is able to achieve much higher IPC even when the nominal ILP is relatively low. The Mill is able to execute as many as six chained dependent operations in a single cycle; open code IPC numbers typically exceed nominal ILP by a factor of three. The talk will show in detail how this is achieved, and why we chose “Mach 3” as the name of the mechanism. In the course of the explanation, the talk will also introduce other operations for which the semantics of Mill execution differs from that of conventional CPUs.\nSpeaker bio Ivan Godard has designed, implemented or led the teams for 11 compilers for a variety of languages and targets, an operating system, an object-oriented database, and four instruction set architectures. He participated in the revision of Algol68 and is mentioned in its Report, was on the Green team that won the Ada language competition, designed the Mary family of system implementation languages, and was founding editor of the Machine Oriented Languages Bulletin. He is a Member Emeritus of IFIPS Working Group 2.4 (Implementation languages) and was a member of the committee that produced the IEEE and ISO floating-point standard 754-2011.\nIvan is currently CTO at Mill Computing, a startup now emerging from stealth mode. Mill Computing has developed the Mill, a clean-sheet rethink of general-purpose CPU architectures. 
The Mill is the subject of this talk.\n","date":"29 December 2013","externalUrl":null,"permalink":"/docs/execution/","section":"Pages","summary":"Forum Topic: Execution\nTalk by Ivan Godard - 2014-02-05 at Stanford University’s Computer Systems Colloquium (EE380). NOTE: the slides require genuine Microsoft PowerPoint to view; open source PowerPoint clones are unable to show the animations, which are essential to the slide content. If you do not have access to PowerPoint then watch the video, which shows the slides as intended.\n","title":"Execution","type":"pages"},{"content":"Forum Topic: Specification\nTalk by Ivan Godard – 2014-05-14 at the SFBay Association of C/C++ Users NOTE: the slides require genuine Microsoft PowerPoint to view; open source PowerPoint clones are unable to show the animations, which are essential to the slide content. If you do not have access to PowerPoint then watch the video, which shows the slides as intended.\nSlides: specification.04 (.pptx)\nProcessor configuration on the Mill CPU: Specifying cores for a range of power and performance points The Mill CPU architecture defines a generic Mill processor, from which a family of specific processors can be configured. A particular configuration for a Mill CPU family member is defined by a specification, which is processed by Mill configuration software to build a member-specific assembler, simulator, compiler back-ends, Verilog for the hardware implementation, documentation, and other tools and components. A Mill CPU family member specification is in two parts: one defines the instruction set, and the other defines the components that comprise the functional organization and microarchitecture of the configured family member. The talk describes the specifications and the software components that perform Mill configuration.\nIn addition, the talk includes a live demo of configuration. The audience specified a new instruction and a new Mill family member. 
The speaker built a new configuration from the specification, wrote a short program in Mill assembly language that exercises the new instruction, and executed and debugged the program in a software functional simulation of the newly configured family member.\nSpeaker bio Ivan Godard has designed, implemented or led the teams for 11 compilers for a variety of languages and targets, an operating system, an object-oriented database, and four instruction set architectures. He participated in the revision of Algol68 and is mentioned in its Report, was on the Green team that won the Ada language competition, designed the Mary family of system implementation languages, and was founding editor of the Machine Oriented Languages Bulletin. He is a Member Emeritus of IFIPS Working Group 2.4 (Implementation languages) and was a member of the committee that produced the IEEE and ISO floating-point standard 754-2011.\nIvan is currently CTO at Mill Computing, a startup now emerging from stealth mode. Mill Computing has developed the Mill, a clean-sheet rethink of general-purpose CPU architectures. The Mill specification machinery is the subject of this talk.\n","date":"18 December 2013","externalUrl":null,"permalink":"/docs/specification/","section":"Pages","summary":"Forum Topic: Specification\nTalk by Ivan Godard – 2014-05-14 at the SFBay Association of C/C++ Users NOTE: the slides require genuine Microsoft PowerPoint to view; open source PowerPoint clones are unable to show the animations, which are essential to the slide content. If you do not have access to PowerPoint then watch the video, which shows the slides as intended.\n","title":"Specification","type":"pages"},{"content":"Forum Topic: Metadata\nTalk by Ivan Godard – 2013-12-11 at the SFBay Association of C/C++ Users NOTE: the slides require genuine Microsoft PowerPoint to view; open source PowerPoint clones are unable to show the animations, which are essential to the slide content. 
If you do not have access to PowerPoint then watch the video, which shows the slides as intended.\nSlides: metadata.02 (.pptx)\nMetadata in the Mill CPU Smarter data for performance and power The Mill is a new CPU architecture designed for very high single-thread performance within a very small power envelope. It achieves DSP-like power/performance on general purpose codes, without reprogramming. The Mill is a wide-issue, statically scheduled design with exposed pipeline. High-end Mills can decode, issue, and execute over thirty MIMD operations per cycle, sustained. The pipeline is very short, with a mispredict penalty of only four cycles.\nTo support such sustained performance, the Mill conveys some of the semantics of execution in the form of operand metadata. For example, size metadata bits attached to each operand eliminate the need for redundant opcodes that serve only to encode size metadata. Another example is the NaR bit, for “not a result”, which among other uses allows improved smart exception handling and allows the Mill to encode the novel “None” singleton operand, which enables smaller and faster generated code. Metadata propagates through execution, following rules specified by the architecture.\nUse of metadata provides a number of advantages to the architecture:\nMetadata reduces the number of distinct opcodes by a factor of seven. Metadata enables speculative execution without fix-up code. Metadata eliminates flag-control overheads in floating point. Metadata permits vectorizing of while-loops. The talk describes these and other technical aspects of metadata and speculation in the Mill design.\nSpeaker bio Ivan Godard has designed, implemented or led the teams for 11 compilers for a variety of languages and targets, an operating system, an object-oriented database, and four instruction set architectures. 
He participated in the revision of Algol68 and is mentioned in its Report, was on the Green team that won the Ada language competition, designed the Mary family of system implementation languages, and was founding editor of the Machine Oriented Languages Bulletin. He is a Member Emeritus of IFIPS Working Group 2.4 (Implementation languages) and was a member of the committee that produced the IEEE and ISO floating-point standard 754-2011.\nIvan is currently CTO at Mill Computing, a startup now emerging from stealth mode. Mill Computing has developed the Mill, a clean-sheet rethink of general-purpose CPU architectures. The Mill is the subject of this talk.\n","date":"18 December 2013","externalUrl":null,"permalink":"/docs/metadata/","section":"Pages","summary":"Forum Topic: Metadata\nTalk by Ivan Godard – 2013-12-11 at the SFBay Association of C/C++ Users NOTE: the slides require genuine Microsoft PowerPoint to view; open source PowerPoint clones are unable to show the animations, which are essential to the slide content. If you do not have access to PowerPoint then watch the video, which shows the slides as intended.\n","title":"Metadata","type":"pages"},{"content":"Forum Topic: Prediction\nTalk by Ivan Godard – 2013-11-12 at IEEE CS Santa Clara Valley NOTE: the slides require genuine Microsoft PowerPoint to view; open source PowerPoint clones are unable to show the animations, which are essential to the slide content. If you do not have access to PowerPoint then watch the video, which shows the slides as intended.\nSlides: PowerPoint (.pptx)\nRun-ahead transfer prediction in the Mill CPU architecture Programs frequently execute only a handful of operations between transfers of control: branches, calls, and returns. Yet modern wide-issue VLIW and superscalar CPUs can issue similar handfuls of operations every cycle, so the hardware must be able to change to a new point of execution each cycle if performance is not to suffer from stalls. 
Changing the point of execution requires determining the new execution address, fetching instructions at that address from the memory hierarchy, decoding the instructions, and issuing them—steps that can take tens of cycles on modern out-of-order machines. Without hardware help, a machine could take 20 cycles to transfer, for just one cycle of actual work.\nThe branch predictor hardware in conventional out-of-order processors does help a lot. It attempts to predict the taken vs. untaken state of conditional branches based on historical behavior of the same branch in earlier executions. Modern predictors achieve 95% accuracy, and large instruction-decode windows can hide top-level cache latency. Together these effects are sufficient for programs like benchmarks that are regular and small. However, on real-world problems today’s CPUs can spend a third or more of their cycles stalled for instructions.\nThe Mill uses a novel prediction mechanism to avoid these problems; it predicts transfers rather than branches. It can do so for all code, including code that has not yet ever been executed, running well ahead of execution so as to mask all cache latency and most memory latency. It needs no area- and power-hungry instruction window, using instead a very short decode pipeline and direct in-order issue and execution. It can use all present and future prediction algorithms, with the same accuracy as any other processor. On those occasions in which prediction is in error, the mispredict penalty is four cycles, a quarter that of superscalar designs. As a result, code stall is a rarity on a Mill, even on large programs with irregular control flow.\nThe talk describes the prediction mechanism of the Mill and compares it with the conventional approach.\nSpeaker bio Ivan Godard has designed, implemented or led the teams for 11 compilers for a variety of languages and targets, an operating system, an object-oriented database, and four instruction set architectures. 
He participated in the revision of Algol68 and is mentioned in its Report, was on the Green team that won the Ada language competition, designed the Mary family of system implementation languages, and was founding editor of the Machine Oriented Languages Bulletin. He is a Member Emeritus of IFIPS Working Group 2.4 (Implementation languages) and was a member of the committee that produced the IEEE and ISO floating-point standard 754-2011.\nIvan is currently CTO at Mill Computing, a startup now emerging from stealth mode. Mill Computing has developed the Mill, a clean-sheet rethink of general-purpose CPU architectures. The Mill is the subject of this talk.\n","date":"18 December 2013","externalUrl":null,"permalink":"/docs/prediction/","section":"Pages","summary":"Forum Topic: Prediction\nTalk by Ivan Godard – 2013-11-12 at IEEE CS Santa Clara Valley NOTE: the slides require genuine Microsoft PowerPoint to view; open source PowerPoint clones are unable to show the animations, which are essential to the slide content. If you do not have access to PowerPoint then watch the video, which shows the slides as intended.\n","title":"Prediction","type":"pages"},{"content":"Forum Topic: Memory\nTalk by Ivan Godard – 2013-10-16 at Stanford NOTE: the slides require genuine Microsoft PowerPoint to view; open source PowerPoint clones are unable to show the animations, which are essential to the slide content. If you do not have access to PowerPoint then watch the video, which shows the slides as intended.\nSlides: PowerPoint (.pptx) This talk at Stanford EE380 Computer Systems Colloquium\nMostly missless memory in the Mill CPU Avoiding the pain of cache misses in a statically-scheduled architecture The Mill is a new CPU architecture designed for very high single-thread performance within a very small power envelope. It achieves DSP-like power/performance on general purpose codes, without reprogramming. The Mill is a wide-issue, statically scheduled design with exposed pipeline. 
High-end Mills can decode, issue, and execute over thirty MIMD operations per cycle, sustained. The pipeline is very short, with a mispredict penalty of only four cycles.\nIt is well known that exposed-pipe static scheduling yields near-perfect code with minimal power – except when there is a miss in the cache. In a conventional VLIW, a miss stalls the whole machine, whereas an out-of-order architecture can sometimes find other useful operations to execute while waiting on the memory hierarchy. The Mill uses a novel load instruction that tolerates load misses as well as hardware out-of-order approaches can do, while avoiding the need for expensive load buffers and completely avoiding false aliasing. In addition, store misses are impossible on a Mill, and a large fraction of the memory traffic of a conventional processor can be omitted entirely.\nThe talk covers these and other technical aspects of the memory hierarchy in the Mill design.\nSpeaker bio Ivan Godard has designed, implemented or led the teams for 11 compilers for a variety of languages and targets, an operating system, an object-oriented database, and four instruction set architectures. He participated in the revision of Algol68 and is mentioned in its Report, was on the Green team that won the Ada language competition, designed the Mary family of system implementation languages, and was founding editor of the Machine Oriented Languages Bulletin. He is a Member Emeritus of IFIPS Working Group 2.4 (Implementation languages) and was a member of the committee that produced the IEEE and ISO floating-point standard 754-2011.\nIvan is currently CTO at Mill Computing, a startup now emerging from stealth mode. Mill Computing has developed the Mill, a clean-sheet rethink of general-purpose CPU architectures. 
The Mill is the subject of this talk.\n","date":"18 December 2013","externalUrl":null,"permalink":"/docs/memory/","section":"Pages","summary":"Forum Topic: Memory\nTalk by Ivan Godard – 2013-10-16 at Stanford NOTE: the slides require genuine Microsoft PowerPoint to view; open source PowerPoint clones are unable to show the animations, which are essential to the slide content. If you do not have access to PowerPoint then watch the video, which shows the slides as intended.\n","title":"Memory","type":"pages"},{"content":"Forum Topic: The Belt\nTalk by Ivan Godard – 2013-07-11 at Google NOTE: the slides require genuine Microsoft PowerPoint to view; open source PowerPoint clones are unable to show the animations, which are essential to the slide content. If you do not have access to PowerPoint then watch the video, which shows the slides as intended.\nSlides: 2013-07-11_mill_cpu_belt (.pptx)\nBelt Machines Data interchange without general registers A large fraction of the power budget of modern superscalar CPUs is devoted to renaming registers: the CPU must track the dataflow of the executing program, assign physical registers and map them to the logical registers of the program, schedule operations when arguments are available, restore visible state in the event of an exception—all while avoiding register update hazards.\nNot all CPU architectures are subject to hazards that require register renaming. Unfortunately, earlier hazard-free designs either require one-at-a-time instruction execution (stack and accumulator machines) or push hazard avoidance off onto the compiler or programmer (VLIWs). The Mill is a new machine architecture that eliminates these problems by adopting a new machine model, the “belt”.\nThe belt machine model is inherently free of update hazards because all operation results go onto the belt by Single Assignment; in other words, once created they never change their value. 
Belt machines have no general registers and thus no rename registers that physically embody them. Result addressing is implicit, which produces compact code and easily accommodates operations like integer divide that logically produce multiple results. The machine model integrates naturally with function call, eliminating caller/callee save conventions and complex call preamble and postamble code.\nA belt machine has short pipelines because it lacks the extra pipe stages associated with rename; typical misprediction penalty is five cycles (if decode is also fast). Area and power consumption in a belt core is a third that of an equivalent superscalar in large part because a belt lacks the large number of physical rename registers and the interconnect needed to supply register values to the functional units.\nThe talk explains the belt model as seen by the programmer and the physical internals of a typical implementation.\nSpeaker bio Ivan Godard has designed, implemented or led the teams for 11 compilers for a variety of languages and targets, an operating system, an object-oriented database, and four instruction set architectures. He participated in the revision of Algol68 and is mentioned in its Report, was on the Green team that won the Ada language competition, designed the Mary family of system implementation languages, and was founding editor of the Machine Oriented Languages Bulletin. He is a Member Emeritus of IFIP Working Group 2.4 (Implementation languages) and was a member of the committee that produced the IEEE and ISO floating-point standard 754-2011.\nIvan is currently CTO at Mill Computing, a startup now emerging from stealth mode. Mill Computing has developed the Mill, a clean-sheet rethink of general-purpose CPU architectures. 
The Mill is the subject of this talk.\n","date":"18 December 2013","externalUrl":null,"permalink":"/docs/belt/","section":"Pages","summary":"Forum Topic: The Belt\nTalk by Ivan Godard – 2013-07-11 at Google NOTE: the slides require genuine Microsoft PowerPoint to view; open source PowerPoint clones are unable to show the animations, which are essential to the slide content. If you do not have access to PowerPoint then watch the video, which shows the slides as intended.\n","title":"The Belt","type":"pages"},{"content":"Forum Topic: Instruction Encoding\nWhite paper – 2013-08-23 White paper: mill_cpu_split-stream_encoding (.PDF)\nThe Mill: Split-stream encoding Real-world programs often thrash in the instruction cache, especially when SMT methods are used. The Mill™ split-stream encoding doubles the effective capacity of the instruction cache at no increase in per-instruction power usage or cache access latency, while also sharply increasing the potential maximal decode rate for instruction sets that use variable-length encoding.\nTalk by Ivan Godard – 2013-05-29 at Stanford NOTE: the slides require genuine Microsoft PowerPoint to view; open source PowerPoint clones are unable to show the animations, which are essential to the slide content. If you do not have access to PowerPoint then watch the video, which shows the slides as intended.\nSlides: PowerPoint (.pptx) This talk at Stanford EE380 Computer Systems Colloquium\nInstruction Encoding Instructions can be wide, fast to decode and compact The military maxim, “Amateurs study tactics, professionals study logistics” applies to CPU architecture as well as to armies. Less than 10% of the area and power budget of modern high-end cores is devoted to real work by the functional units such as adders; the other 90% marshals instructions and data for those units and figures out what to do next.\nA large fraction of this logistic overhead comes from instruction fetch and decode. 
Instruction encoding has subtle and far-reaching effects on performance and efficiency throughout a core; for example, the intractable encoding used by x86 instructions is why the x86 will never provide the performance/power of other architectures having friendlier encoding.\nSome 80% of executed operations are in loops. A software-pipelined loop has instruction-level parallelism (ILP) bounded only by the number of functional units available and the ability to feed them. The limiting factor is often decode; few modern cores can decode more than four instructions per cycle, and none more than 10. The Mill is a new general-purpose CPU architecture that breaks this barrier; high-end Mill family members can fetch, decode, issue and execute over 30 instructions per cycle.\nThis talk explains the fetch and decode parts of the Mill architecture.\nSpeaker bio Ivan Godard has designed, implemented or led the teams for 11 compilers for a variety of languages and targets, an operating system, an object-oriented database, and four instruction set architectures. He participated in the revision of Algol68 and is mentioned in its Report, was on the Green team that won the Ada language competition, designed the Mary family of system implementation languages, and was founding editor of the Machine Oriented Languages Bulletin. He is a Member Emeritus of IFIP Working Group 2.4 (Implementation languages) and was a member of the committee that produced the IEEE and ISO floating-point standard 754-2011.\nIvan is currently CTO at Mill Computing, a startup now emerging from stealth mode. Mill Computing has developed the Mill, a clean-sheet rethink of general-purpose CPU architectures. 
The Mill is the subject of this talk.\n","date":"16 December 2013","externalUrl":null,"permalink":"/docs/encoding/","section":"Pages","summary":"Forum Topic: Instruction Encoding\nWhite paper – 2013-08-23 White paper: mill_cpu_split-stream_encoding (.PDF)\nThe Mill: Split-stream encoding Real-world programs often thrash in the instruction cache, especially when SMT methods are used. The Mill™ split-stream encoding doubles the effective capacity of the instruction cache at no increase in per-instruction power usage or cache access latency, while also sharply increasing the potential maximal decode rate for instruction sets that use variable-length encoding.\n","title":"Instruction Encoding","type":"pages"},{"content":"The Mill general purpose CPU architecture takes new approaches in most major areas of processor architecture. We have public presentation video recordings for most of the topics listed below, with more to come.\nNOTE: the slides require genuine Microsoft PowerPoint to view; open source PowerPoint clones are unable to show the animations, which are essential to the slide content. If you do not have access to PowerPoint then watch the video, which shows the slides as intended.\nTopics ⬇ Instruction Encoding\n⬇ The Belt\n⬇ Memory\n⬇ Prediction\n⬇ Metadata\n⬇ Execution\n⬇ Security\n⬇ Specification\n⬇ Pipelining\n⬇ The Compiler\n⬇ Switches\n⬇ Inter-process Communication\n⬇ Threading\n⬇ Wide Data\nInstruction Encoding A major portion of the area and power budget of modern high-end CPU cores is devoted to fetching and decoding instructions, to feed the functional units and to figure out what to do next. 
The instruction encoding techniques of the Mill CPU architecture allow high-end Mill family members to fetch, decode and issue up to 30 opcodes per cycle, sustained, within a three-cycle decode pipeline.\nwhite paper, talk more…\nThe Belt The Belt is the data interchange mechanism for the Mill general purpose CPU architecture, replacing the general registers of other architectures. The Mill\u0026rsquo;s belt is unique both in its programming model and its implementation at the micro-architecture level. Destination addressing is implicit, yielding more compact instruction encoding. The Belt is integrated with the function call mechanism; it eliminates caller/callee save conventions and callee pre-/postlude instructions, and it supports multi-result calls naturally. The Belt is Single-assignment, so rename registers and pipeline phases are unnecessary.\ntalk more…\nMemory The Mill uses a novel load instruction that tolerates load misses as well as hardware out-of-order approaches can do, while avoiding the need for expensive load buffers and completely avoiding false aliasing. In addition, store misses are impossible on a Mill, and a large fraction of the memory traffic of a conventional processor can be omitted entirely.\ntalk more…\nPrediction The Mill uses a novel prediction mechanism; it predicts transfers rather than branches. It can do so for all code, including code that has not yet ever been executed, running well ahead of execution so as to mask all cache latency and most memory latency. It needs no area- and power-hungry instruction window, using instead a very short decode pipeline and direct in-order issue and execution.\ntalk more…\nMetadata The Mill conveys some of the semantics of execution in the form of metadata attached to the arguments of operations, in addition to that expressed by the operation encodings in the executed code stream. 
Metadata propagates through execution, following rules specified by the architecture, although it may be altered explicitly by code when needed.\ntalk more…\nExecution A perennial objection to wide-issue CPU architectures such as VLIWs and the Mill is that there is insufficient instruction level parallelism (ILP) in programs to make effective use of the available functional width. This talk addresses the ILP issue, describing how the Mill is able to achieve much higher IPC even when the nominal ILP is relatively low.\ntalk more…\nSecurity Software bugs have always been a problem, but in recent years bugs have become an even more serious concern as they are exploited to breach system security for privacy violation, theft, and even terrorism or acts of war. The Mill CPU architecture addresses software robustness in three basic ways. This talk describes some of the Mill CPU features that defend against well-known error and exploit patterns.\ntalk more…\nSpecification The Mill CPU architecture defines a generic Mill processor, from which a family of specific processors can be configured. A particular configuration for a Mill CPU family member is defined by a specification, which is processed by Mill configuration software to build a member-specific assembler, simulator, compiler back-ends, Verilog for the hardware implementation, documentation, and other tools and components.\ntalk more…\nPipelining On a conventional machine, pipelining requires lengthy prelude and postlude instruction sequences to get the pipeline started and wound down, frequently destroying the benefit of pipelining the main body. Mill pipelines have neither prelude nor postlude, and early conditional exit has no added cost.\ntalk more…\nThe Compiler The Mill is a new general-purpose CPU architecture family that forms a uniquely challenging target for compilation - and also a uniquely easy target. This talk describes the Mill tool chain from language front end to binary executable.\ntalk more…
\nSwitches Multi-way branches, known as switches or case clauses in various languages, are a notorious pain for compiler writers and CPU architects. On the critical path in important applications from lexers to byte-code interpreters, switches often predict poorly. This talk shows how an ultra-wide-issue architecture responds to the switch challenge.\ntalk more…\nInter-process Communication The Mill is a new general-purpose architectural family, with an emphasis on secure and inexpensive communication across protection boundaries. The large (page) granularity of protection on conventional architectures makes such communication difficult compared to communication within a protection boundary, such as a function call. As a result, the large granularity has forced communication protocols on conventional architectures into two models: pass-by-sharing (using shared pages), and pass-by-copy (using the OS kernel for files/message passing). Both have drawbacks: sharing requires difficult-to-get-right synchronization, while copy involves kernel transitions as well as the costs of the copy itself.\ntalk more…\nThreading The Mill is a new general-purpose CPU architectural family, with novel resource allocation and control facilities that are orders of magnitude less expensive than the equivalents on other CPUs. Critical to this gain is the direct Mill hardware support for threading.\ntalk more…\nWide Data The Mill is a new general-purpose CPU architectural family. The talk will present machine-level details of the Mill support for bigger-than-scalar data.\ntalk more…\n","date":"16 December 2013","externalUrl":null,"permalink":"/docs/","section":"Pages","summary":"The Mill general purpose CPU architecture takes new approaches in most major areas of processor architecture. 
We have public presentation video recordings for most of the topics listed below, with more to come.\nNOTE: the slides require genuine Microsoft PowerPoint to view; open source PowerPoint clones are unable to show the animations, which are essential to the slide content. If you do not have access to PowerPoint then watch the video, which shows the slides as intended.\n","title":"docs / videos / slides","type":"pages"},{"content":" Mentions of the Mill, Mill Computing and Out-of-the-Box Computing in the press. Mill Computing heads to FPGA demo, seeks funds\nJanuary 12, 2017 - eeNews Europe\nEE Times Silicon 60: Hot Startups to Watch\nSeptember 8, 2015 - EE Times\nEE Times Silicon 60: Hot Startups to Watch\nJuly 15, 2014 - EE Times\nStartup Seeks Funds to Realize \u0026lsquo;Belt\u0026rsquo; Processor\nDecember 16, 2013 - Electronics360\nMill CPU: Stack Machines Instead of Turing\nDecember 9, 2013 - EE Times\nThe Mill: Ivan Godard Explains a Revolutionary New CPU\nNovember 20, 2013 - EE Times\nInterview: Mill CPU for Humans Parts 3 and 4\nNovember 19, 2013 - Hackaday\nInterview: New Mill CPU Architecture Explanation for Humans\nNovember 18, 2013 - Hackaday\nGetting Way Out of the Box\nAugust 5, 2013 - Processor Watch\nThe Mill CPU architecture\nAugust 2, 2013 - Hackaday\n","date":"16 December 2013","externalUrl":null,"permalink":"/in-the-press/","section":"Pages","summary":"Mentions of the Mill, Mill Computing and Out-of-the-Box Computing in the press. 
Mill Computing heads to FPGA demo, seeks funds\nJanuary 12, 2017 - eeNews Europe\nEE Times Silicon 60: Hot Startups to Watch\nSeptember 8, 2015 - EE Times\nEE Times Silicon 60: Hot Startups to Watch\nJuly 15, 2014 - EE Times\nStartup Seeks Funds to Realize ‘Belt’ Processor\nDecember 16, 2013 - Electronics360\nMill CPU: Stack Machines Instead of Turing\nDecember 9, 2013 - EE Times\n","title":"in the press","type":"pages"},{"content":"","externalUrl":null,"permalink":"/tags/","section":"Tags","summary":"","title":"Tags","type":"tags"}]