U.S. Patent 9,513,920 – Computer Processor Employing Split-Stream Encoding
This patent covers the split-stream method of encoding, which permits Mill CPUs to decode extremely wide instructions (those with many independent operations per instruction) using a compact, flexible variable-length encoding that is also fast to decode. The stream of instructions that makes up the executing program is split into two streams of half-instructions, stored separately in memory but processed in lock-step by the CPU decoder. Because wide fixed-length instruction encodings use an impractical amount of cache and memory, while variable-length encodings take decode time polynomial in the width, instruction decode on legacy CPUs is limited to eight or fewer operations per cycle; Mill split-stream encoding supports instruction widths of over thirty operations decoded per cycle. In addition, split-stream encoding permits doubling the amount of instruction cache with no clock or pipeline penalty.
Split-stream encoding is described here in a way that is more accessible than the patent text.
U.S. Patent 9,513,921 – Computer Processor Employing Temporal Addressing For Storage of Transient Operands
This foundational patent in the Mill CPU family replaces traditional general-purpose registers for short-lived values with a hardware-managed “belt” that uses temporal addressing.
Overview
The processor organizes transient operands in a fixed-length logical queue called the belt. Instead of spatial register numbers, operands are referenced by their production order in time (temporal addresses). New results drop onto the front of the belt; older ones shift backward each cycle and eventually fall off the end. This eliminates write-after-write, read-after-write, and write-after-read hazards, removes the need for register renaming, and simplifies instruction encoding because results are placed implicitly while sources are referenced explicitly by belt position.
Key Features
- Temporal addressing: An operand produced in the current cycle is at belt address 0; the previous one at address 1, and so on. Instructions specify only source positions (e.g., Add(3,4) adds the 4th- and 5th-most-recent results).
- Multi-result and multi-cycle support: A single operation or multiple operations can produce several operands in one cycle; they are ordered and dropped according to fixed rules. Variable-latency units (e.g., multiply, load) coordinate via slot-specific or daisy-chained output registers.
- Belt interconnect (optional split crossbar): Efficiently routes results from functional units to the belt and from the belt to consumers.
- Spiller unit: Automatically saves and restores belt contents during CALL/RETURN or interrupts, creating private per-frame belts.
- Scratchpad integration: Longer-lived operands are spilled to or filled from the byte-addressable scratchpad when they would otherwise fall off the belt.
Core Mechanisms
Operands enter the front of the belt in production order. Each cycle the entire belt logically shifts; the oldest entry retires. Decode hardware translates temporal addresses to physical locations (belt slots or functional-unit output registers). Multiple operands can be read or written in a single cycle. For subroutine calls, the spiller asynchronously copies the caller’s belt to a private frame (and later restores it), passing arguments and results by selective copying. The design supports in-order retirement while permitting out-of-order execution within the pipeline.
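The mechanisms above can be modelled as a short simulation. This is a minimal sketch under simplifying assumptions: the belt length, the single-result `drop`, and the `add` helper are illustrative, and multi-drop ordering rules and the spiller are omitted.

```python
from collections import deque

class Belt:
    """Minimal model of temporal addressing: results drop at the front,
    and sources are named by how many results ago they were produced."""
    def __init__(self, length=8):
        self.slots = deque(maxlen=length)  # oldest entries fall off the end
    def drop(self, value):
        self.slots.appendleft(value)       # belt position 0 = newest result
    def read(self, pos):
        return self.slots[pos]             # temporal address, not a register
    def add(self, p1, p2):
        self.drop(self.read(p1) + self.read(p2))

belt = Belt()
for v in (10, 20, 30, 40, 50):   # five results; 50 is now at position 0
    belt.drop(v)
belt.add(3, 4)                   # Add(3,4): 4th- and 5th-most-recent = 20 + 10
assert belt.read(0) == 30        # the sum implicitly drops at the front
```

Note that the instruction never names a destination: the result's position is implied by when it was produced, which is what removes renaming and write hazards.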
Problems Addressed and Benefits
Conventional register files require large numbers of physical registers, complex renaming logic, massive bypass networks, and many address bits in instructions. The belt eliminates all of these: no rename stage, dramatically smaller instruction encoding, no WAW/RAW/WAR hazards, and far simpler bypass logic. Code is smaller, power and area are reduced, and high instruction-level parallelism becomes easier to exploit. The mechanism integrates cleanly with the Mill’s operand-metadata system, scratchpad, and high-ILP pipeline, forming the foundation for the rest of the Mill architecture while remaining invisible to the programmer.
The Belt is described here in a way that is more accessible than the patent text.
U.S. Patent 9,513,904 – Computer Processor Employing Cache Memory With Per-Byte Valid Bits
This is a core patent in the Mill CPU family describing a hierarchical cache system that tracks validity at the byte level rather than the full-line level.
Overview
The invention adds a set of per-byte valid bits to every cache line in the processor’s memory hierarchy (L1 data cache, L2, victim buffers, etc.). These bits explicitly mark which individual bytes within a line contain semantically defined data and which do not. Memory requests (loads and stores) carry a byte mask that tells the cache exactly which bytes matter for that operation. This design eliminates the classic “read-before-write” problem for partial stores and enables precise, low-overhead handling of sparse or partially initialized data.
Key Features
- Per-byte valid bits: One bit per byte in every cache line (e.g., 64 bits for a 64-byte line) stored alongside the data, tag, and a single dirty bit.
- Byte mask in every request: Loads and stores include a mask indicating which bytes of the addressed cache line are being accessed.
- Victim buffers: Fully associative buffers that absorb all new stores immediately; no traditional write buffers or allocate-on-write-miss required.
- Byte-wise merging and hoisting/lowering: When data moves between cache levels or to main memory, only valid/dirty bytes are transferred or merged, with higher-level data taking priority.
- Per-byte hit/miss logic: Hardware generates separate hit signals for each requested byte using tag matches + valid bits, feeding a simple output multiplexer.
Core Mechanisms
- Loads: The L1 cache (and victim buffers) checks only the bytes specified by the mask. Matching valid bytes are returned immediately; missing ones propagate down the hierarchy with a narrowed mask. When a lower level supplies data, valid bytes are hoisted upward and merged byte-by-byte into any partial resident line.
- Stores: Data goes straight to a victim buffer (setting the corresponding valid and dirty bits). If the line already exists in the cache array, the new bytes are merged in with priority. On eviction, only dirty valid bytes are written down the hierarchy—no full-line write-backs or consolidating buffers needed.
- Coherence in multiprocessor systems: Private caches can hold the same line in “modified” state simultaneously if their valid-byte sets are disjoint. Special byte-masked “Request for Invalidate” messages prevent false sharing.
- Integration with Mill architecture: Works with virtual caches, the Protection Lookaside Buffer (PLB), deferred loads, and the belt/scratchpad operand model. Stores take effect immediately, eliminating the need for memory barriers in Mill programs.
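The load/store mechanisms above can be sketched with a single cache line. This is a toy model under stated assumptions: only the per-byte valid bits and merge-on-load behaviour are shown; tags, the dirty bit, victim buffers, and the lower hierarchy are omitted.

```python
class CacheLine:
    """Sketch of a cache line with per-byte valid bits (64-byte line)."""
    SIZE = 64
    def __init__(self):
        self.data = bytearray(self.SIZE)
        self.valid = [False] * self.SIZE

    def store(self, offset, payload):
        """A partial store just sets its own bytes valid -- no
        read-before-write of the rest of the line is needed."""
        for i, b in enumerate(payload):
            self.data[offset + i] = b
            self.valid[offset + i] = True

    def load(self, offset, length):
        """Return (bytes, miss_mask): the requested bytes plus a mask of
        which of them must still be fetched from the next level down."""
        miss = [not self.valid[offset + i] for i in range(length)]
        return bytes(self.data[offset:offset + length]), miss

line = CacheLine()
line.store(4, b"\x11\x22")           # partial store: only 2 bytes valid
data, miss = line.load(4, 4)         # ask for 4 bytes starting at offset 4
assert data[:2] == b"\x11\x22" and miss == [False, False, True, True]
```

The narrowed miss mask is exactly what propagates down the hierarchy: only the two invalid bytes are requested from below, and on arrival they are merged byte-by-byte into the partial resident line.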
Problems Addressed and Benefits
Traditional caches operate at full-line granularity, forcing read-modify-write cycles, write buffers, and consolidating logic for any sub-line store. This creates power, area, and complexity overhead and forces unnecessary memory traffic. The per-byte valid-bit design solves these by:
- Eliminating write buffers and read-before-write entirely—stores always “hit.”
- Drastically reducing memory traffic and power (only valid/dirty bytes move between levels or to DRAM).
- Simplifying hardware (no consolidating buffers, smaller victim structures).
- Preventing false sharing in multi-core chips.
- Providing deterministic behavior for uninitialized or partially written data.
The result is a cleaner, faster, lower-power memory subsystem that fits seamlessly with the Mill’s high-ILP, belt-based execution model while remaining fully compatible with conventional memory semantics.
The per-byte cache validation is described here in a way that is more accessible than the patent text.
U.S. Patent 9,524,163 – Computer processor employing hardware-based pointer processing
This patent is part of the Mill CPU family and adds dedicated hardware support for safe, efficient pointer handling directly in the execution pipeline.
Overview
The patent equips the processor with two complementary hardware mechanisms that operate on pointers containing small amounts of embedded metadata. The first mechanism uses event bits in pointers to trigger low-overhead traps on memory operations (loads, stores, or pointer stores) for uses such as garbage collection or security monitoring. The second uses a granularity field to perform automatic bounds checking on pointer arithmetic and array accesses, preventing wild-address bugs without software checks or extra memory tags.
Key Features
- Event Bits: A few metadata bits carried with every pointer (distinct from the address field) that encode provenance or memory-region information.
- Interest Mask Registers: Three writable hardware registers (read, write, and update masks) that the runtime or OS can set to specify which pointer operations should raise an “event-of-interest” signal.
- Granularity Field: A compact field (e.g., 6 bits in a 64-bit pointer) that defines the size of aligned “chunks” (power-of-two byte blocks) the pointer is allowed to address.
- Address Derivation Unit: Dedicated logic that computes new pointer values while simultaneously checking for offset overflow.
- Optional Valid/Invalid Bit: Marks pointers as invalid (e.g., one-past-the-end) and restricts unsafe operations on them.
Core Mechanisms
- Pointer Event Detection (for garbage collection and security):
- On every load-through-pointer, store-through-pointer, or pointer-store operation, the event bits from the pointer(s) index the appropriate mask register.
- If the selected interest bit is set, hardware immediately raises an event-of-interest trap to the runtime system.
- Example: In a copying garbage collector, event bits distinguish “old” vs. “new” regions; a pointer store from old to new can be trapped so the collector can fix it—without polling or scanning every pointer.
- Concatenation of source and target event bits enables precise detection of cross-region updates.
- Hardware Bounds Checking via Granularity:
- Pointer arithmetic (add, subtract, indexing) is routed through the address derivation unit.
- The granularity field splits the address into a fixed “chunk” part and a variable “offset” part; the unit verifies that the offset stays within the chunk (no carry into the chunk field).
- Valid result → pass signal and new pointer; overflow → fail signal (and optional fault).
- Supports “refine” operations to tighten granularity and safe one-past-the-end handling via the valid bit.
All checks occur in parallel with normal address generation and add negligible latency to the pipeline.
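The granularity check can be sketched as arithmetic on the address bits. This is an illustrative model, not the patent's exact field layout: the granularity field simply gives the number of low offset bits, and the check is that pointer arithmetic never carries out of them into the chunk bits above.

```python
def derive_pointer(ptr, offset, gran_bits):
    """Sketch of the address-derivation check: the low `gran_bits` bits
    of the pointer are the in-chunk offset; adding `offset` must not
    carry into the chunk-number bits above them."""
    chunk_mask = ~((1 << gran_bits) - 1)
    new_ptr = ptr + offset
    ok = (new_ptr & chunk_mask) == (ptr & chunk_mask)  # no carry out of chunk
    return new_ptr, ok

base = 0x1000            # start of a 64-byte chunk (gran_bits = 6)
p, ok = derive_pointer(base, 63, 6)
assert ok                # last byte of the chunk: pass signal
p, ok = derive_pointer(base, 64, 6)
assert not ok            # would carry into the chunk field: fail signal
```

Because the check is a mask-and-compare alongside the add, it fits in the same cycle as normal address generation.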
Problems Addressed and Benefits
Traditional pointer handling relies on slow software checks (range comparisons, tag bits, or GC barriers) that hurt performance and increase memory use. Wild pointers and buffer overruns remain major sources of crashes and security vulnerabilities. This design solves both issues in hardware:
- Eliminates software overhead for common pointer operations and garbage-collection barriers.
- Prevents wild-address bugs by enforcing object bounds at the hardware level with zero extra storage.
- Accelerates garbage collection through automatic provenance tracking and selective trapping.
- Improves security without sacrificing speed—especially valuable for C/C++-style code or languages with manual memory management.
- Minimal cost: Only a handful of bits per pointer and a few small hardware units; fully compatible with the Mill’s belt, scratchpad, and high-ILP pipeline.
Overall, the invention delivers safe, high-performance pointer processing as a native CPU feature, simplifying compilers and runtimes while closing a long-standing gap between high-level language safety and low-level hardware efficiency.
U.S. Patent 9,652,230 – Computer processor employing dedicated hardware mechanism controlling the initialization and invalidation of cache lines
This patent is part of the Mill CPU family and introduces a lightweight hardware structure called the Implicit Zero Map (IZM) that automatically manages cache-line initialization and invalidation for stack-frame-local data.
Overview
The invention adds a small, dedicated hardware map (the IZM) to the processor’s execution/retire logic. Each bit in the map corresponds to one cache line that holds frame-local operand data for the currently active stack frame. When the bit is set, the cache line is treated as implicitly zero (no valid data exists in the memory hierarchy). This mechanism works transparently with the processor’s hierarchical memory system (L1 cache + lower levels) and integrates with the Mill’s data-stack and scratchpad model.
Key Features of the Implicit Zero Map (IZM)
- Bit-per-cache-line array: A compact, hardware-resident bit map that tracks only the cache lines belonging to the top stack frame (size typically covers a few KB of local data).
- Implicit-zero state: A set bit means “this line has never been written and should return zeros on any read.”
- Frame-local binding: The map is dynamically re-pointed to the current frame using special registers (stack pointer SP and frame-size/offset registers) on every function activation.
- No software zeroing required: Eliminates explicit store loops that traditionally clear local variables.
Core Mechanisms
- On function entry (frame allocation):
Hardware sets all IZM bits for the new frame’s cache lines to the implicit-zero state. If the frame exceeds the map’s capacity, excess lines are explicitly zero-written to the cache once.
- Load handling:
Before any memory request reaches the cache, the address is converted to an IZM index. If the bit is set, the pipeline immediately returns zero-filled data and marks the request “hit,” discarding it so the cache and lower memory are never accessed.
- Store handling:
If the IZM bit is set, the store data is merged with zeros for the rest of the line, the data is written directly to the cache, and the bit is cleared (marking the line “valid”). Again, no lower-level memory traffic occurs.
- On function exit (frame deallocation):
Hardware walks the IZM bits:
- Clear bits (valid lines) → invalidate the cache line and cancel any pending write-back.
- Set bits (still zero) → simply reset the bit; nothing else is required.
Excess lines beyond the map are also invalidated without write-back.
All operations are performed in the execution/retire pipeline stages and use spare memory bandwidth when possible, so they add virtually zero cycle overhead.
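The entry/load/store/exit behaviour above can be captured in a small simulation. This is a sketch under simplifying assumptions: the map size, line size, and dictionary-backed cache are illustrative, and the overflow path for frames larger than the map is omitted.

```python
class IZM:
    """Toy Implicit Zero Map: one bit per frame-local cache line.
    A set bit means 'never written: reads return zeros'; the first store
    merges with zeros, writes the line, and clears the bit."""
    LINE = 64
    def __init__(self, n_lines=32):
        self.zero = [True] * n_lines     # frame entry: all lines implicit-zero
        self.cache = {}                  # line index -> bytearray

    def load(self, line, offset, length):
        if self.zero[line]:
            return bytes(length)         # satisfied without touching memory
        return bytes(self.cache[line][offset:offset + length])

    def store(self, line, offset, payload):
        if self.zero[line]:
            self.cache[line] = bytearray(self.LINE)  # merge with zeros
            self.zero[line] = False                  # line is now valid
        self.cache[line][offset:offset + len(payload)] = payload

    def frame_exit(self):
        self.cache.clear()               # valid lines: invalidate, no write-back
        self.zero = [True] * len(self.zero)

izm = IZM()
assert izm.load(3, 0, 8) == bytes(8)     # fresh local reads as zeros
izm.store(3, 4, b"\xaa")
assert izm.load(3, 0, 8) == b"\x00\x00\x00\x00\xaa\x00\x00\x00"
```

Note that neither the zero-returning load nor the frame-exit invalidation ever generates traffic to lower memory levels, which is where the bandwidth and security benefits come from.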
Problems Addressed and Benefits
Conventional processors require explicit zero-initialization stores (or rely on OS page-zeroing) for every new stack frame, consuming memory bandwidth, power, and code space. They also leave stale data from prior frames, creating security/privacy leaks and subtle bugs when code assumes uninitialized locals are zero. The IZM solves these by:
- Eliminating zero-initialization traffic — loads from fresh locals are satisfied instantly with zeros; stores pay only for actual data written.
- Guaranteeing deterministic starting state — every new frame begins with clean zeros, removing an entire class of initialization bugs.
- Hiding prior-frame data — automatic invalidation on exit prevents information leakage between function activations.
- Saving power and bandwidth — cache lines that remain unused never touch lower memory levels.
- Minimal hardware cost — only a few dozen bits of storage plus simple address-to-index logic.
The design fits seamlessly with the Mill architecture’s scratchpad, belt, and high-ILP pipeline, delivering cleaner, faster, and more secure function-call semantics without changing the programmer-visible instruction set.
For a less patentese explanation of this Mill technology, see our memory talk video here, or look at the corresponding PowerPoint starting at slide 59.
U.S. Patent 9,690,581 – Computer processor with deferred operations
This patent describes the deferred-load hardware of the Mill CPU architecture: a processor that supports deferred load operations, in which the request for data at an address (issue) is separated from its arrival in operand storage (retire). The retire timing is controlled by statically assigned parameter data encoded in the operation itself, enabling program-managed scheduling of variable-latency operations such as memory loads in a statically scheduled, in-order execution environment. By letting a program issue loads early enough that in-cache data arrives without a stall, the invention allows a statically scheduled CPU to achieve load performance comparable to a dynamically scheduled out-of-order CPU.
Key Features
- Deferred Operations: Operations (e.g., deferred load or DLOAD) that decouple issue from retire, with retire controlled by a schedule latency parameter (cycle count or identifier).
- Retire Station: A hardware buffer that manages result data, handling early arrivals (buffering) or late arrivals (stalling) relative to the programmed schedule latency.
- Schedule Latency Control: Static encoding specifies retire cycle via countdown timer or matching identifier from a “pickup” operation.
- Memory Handling: Supports hierarchical memory with snooping for store collisions on in-flight loads, ensuring correct memory state at retire time.
- Call/Interrupt Integration: Options for inclusive/exclusive latency counting during calls; mechanisms to discard, save/restore, or reissue deferred ops on interrupts.
- Speculation Support: Allows hoisting operations over branches, with fault handling for errors encountered before retire.
Core Mechanisms
- Execution Flow: A deferred operation issues to a functional unit (e.g., load unit sends DLOAD to L1 cache). A retire station allocates, configures with schedule latency, and monitors execution.
- Latency Comparison: Retire station compares actual execution latency (time to result) against schedule latency. Buffers results if early; stalls pipeline if late until result arrives.
- Collision Detection: Snoops intervening stores for address overlaps with in-flight DLOADs; discards and reissues if collision detected, or buffers valid results.
- Retire Process: At schedule expiration, retires data to operand storage (e.g., belt or registers) if valid; faults on errors or mismatches.
- Pickup Variant: Alternative to timers; retire triggered by a separate “pickup” op with matching ID, allowing dynamic scheduling.
- Interrupt/Call Handling: On calls, latency timers can pause (exclusive) or continue (inclusive); interrupts may discard ops, spill state to stack, or reissue post-resume.
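The early/on-time/late cases of the latency comparison can be sketched as a tiny state machine. This is an illustrative model only: the cycle counts are arbitrary, and collision snooping, the pickup variant, and call/interrupt handling are omitted.

```python
class RetireStation:
    """Sketch of a retire station for a deferred load. The schedule
    latency is a static cycle count from the instruction; the actual
    latency comes from the memory hierarchy."""
    def __init__(self, schedule_latency):
        self.deadline = schedule_latency
        self.result = None
        self.cycle = 0

    def tick(self, arriving=None):
        """Advance one cycle; returns ('retire', value), 'stall', or None."""
        self.cycle += 1
        if arriving is not None:
            self.result = arriving           # early arrival: buffer it
        if self.cycle < self.deadline:
            return None                      # not yet due to retire
        if self.result is None:
            return "stall"                   # late arrival: stall the pipeline
        return ("retire", self.result)

rs = RetireStation(schedule_latency=3)
assert rs.tick(arriving=42) is None          # cycle 1: cache hit, buffered
assert rs.tick() is None                     # cycle 2: waiting for schedule
assert rs.tick() == ("retire", 42)           # cycle 3: retires on schedule

late = RetireStation(schedule_latency=2)
assert late.tick() is None                   # cycle 1
assert late.tick() == "stall"                # cycle 2: miss still outstanding
assert late.tick(arriving=7) == ("retire", 7)
```

The point of the comparison is that a cache hit arriving early costs nothing: the result simply waits in the station until its statically scheduled retire cycle, so the compiler can hoist the load far ahead of its consumer.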
Problems Addressed and Benefits
Traditional static-scheduled processors struggle with variable execution latencies (e.g., cache misses stalling pipelines), while out-of-order designs add complexity, power, and area costs. This invention addresses these by enabling static scheduling of variable-latency ops without stalls at issue time, allowing early issuance (hoisting) for overlap with other work. Benefits include simplified hardware (no dynamic reordering), reduced power/area, improved performance via masked latencies, correct handling of memory dependencies at retire, and robust support for speculation/interrupts—enhancing efficiency in Mill’s high-ILP architecture while complementing related patents on operand storage and bypass networks.
U.S. Patent 9,747,216 – Computer processor employing byte-addressable dedicated memory for operand storage
This patent introduces a processor architecture that uses a “belt” for transient operand storage with temporal addressing and a separate byte-addressable “scratchpad” memory for longer-lived operands, complemented by a spiller unit for context management. It optimizes operand handling in high-ILP environments by reducing register pressure and enabling efficient CALL/RETURN operations.
Key Features
- Belt Storage: A fixed-length queue (conveyor belt) for short-lived operands, referenced via temporal addressing based on production order.
- Scratchpad Memory: Byte-addressable dedicated storage for operands copied from the belt, with static, aligned addresses and variable-sized windows per function frame.
- Spill/Fill Operations: Mechanisms to copy operands between belt and scratchpad for preservation and restoration.
- Spiller Unit: Handles asynchronous saving/restoring of belt and scratchpad contexts during subroutine calls, returns, and interrupts.
- Window-Based Mapping: Circular buffer organization with Base/Fence registers defining per-frame address windows.
- Split Crossbar Network: Optimizes operand routing between functional units, prioritizing low-latency paths.
Core Mechanisms
- Belt Operation: Operands from functional units are injected at the belt’s front and shift backward each cycle, falling off the end when displaced. Instructions use logical temporal addresses (e.g., Add(3,4)) to reference operands by recency, eliminating explicit register names.
- Scratchpad Integration: SPILL copies belt operands to scratchpad before expiration; FILL restores them. Addresses are byte-aligned and static, supporting variable operand sizes.
- Context Switching: On CALL, the spiller saves caller state (belt/scratchpad) asynchronously and sets up callee’s private frame via SCRATCHF (allocates window). RETURN restores state and deallocates. Base/Fence define active windows; SP/RP track save/restore progress to minimize stalls.
- Asynchronous Save-Restore: Spiller operates in background, stalling only if bandwidth limits are exceeded or conflicts arise.
- Pipeline Support: Integrates with fetch/decode/issue/execute stages, maintaining in-order retirement for belt consistency while allowing out-of-order execution.
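The scratchpad's window mapping and SPILL/FILL operations can be sketched briefly. This is a toy model under stated assumptions: the scratchpad size is arbitrary, the window allocation is a simple bump allocator rather than the patent's circular buffer, and the spiller's asynchronous save/restore is omitted.

```python
class Scratchpad:
    """Sketch of a per-frame scratchpad window: static byte addresses
    are relative to a Base register, and Fence bounds the frame's
    allocation."""
    def __init__(self, size=256):
        self.mem = bytearray(size)
        self.base, self.fence = 0, 0

    def scratchf(self, frame_bytes):         # SCRATCHF: allocate on CALL
        self.base, self.fence = self.fence, self.fence + frame_bytes

    def spill(self, addr, payload):          # SPILL: copy a belt operand in
        assert self.base + addr + len(payload) <= self.fence, "out of window"
        self.mem[self.base + addr:self.base + addr + len(payload)] = payload

    def fill(self, addr, length):            # FILL: copy it back to the belt
        return bytes(self.mem[self.base + addr:self.base + addr + length])

sp = Scratchpad()
sp.scratchf(16)                              # callee's private 16-byte window
sp.spill(0, b"\x07\x00\x00\x00")             # preserve a long-lived operand
assert sp.fill(0, 4) == b"\x07\x00\x00\x00"  # restore it before it's needed
```

Because scratchpad addresses are static and frame-relative, the compiler can schedule spills and fills at fixed offsets without any dynamic allocation in the hot path.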
Problems Addressed and Benefits
Traditional architectures face high register pressure, complex bypass networks, slow context switches, and latency from memory spills. This design mitigates these by using temporal addressing to simplify instructions, dedicated scratchpad for fast access, and asynchronous spilling for low-overhead switches. Benefits include reduced chip area/power from fewer registers, faster subroutine handling, scalable operand storage for nested calls, and improved ILP without traditional hazards. Overall, it enhances Mill’s ecosystem for efficient, high-performance processing in modern workloads.
U.S. Patent 9,747,218 – CPU security mechanisms employing thread-specific protection domains
This patent introduces a processor architecture with hardware-enforced security using “turfs”—thread-specific protection domains that enable fine-grained memory isolation without the overhead of traditional process-based models. Turfs are collections of region descriptors defining memory areas and permissions, allowing threads to dynamically switch domains via portal calls while maintaining a single address space. Protection is hardware-managed through Protection Lookaside Buffers (PLBs) and specialized registers, supporting secure inter-domain communication and execution.
Key Features
- Turfs and Region Descriptors: Turfs group descriptors specifying memory regions (bounds, permissions like read/write/execute/portal), associated with thread-turf ID pairs; supports wildcards for flexible access.
- Protection Lookaside Buffers (PLBs): Separate iPLB (instruction) and dPLB (data) for fast permission checks; backed by a Region Table in memory; includes a “novel” bit for efficient eviction and revocation tracking.
- Hardware Registers: Turf-specific (e.g., code/constants/data bases), thread-turf-specific (stacklets for isolated stacks), and thread-specific (local storage) registers for bypassing PLB queries on common accesses.
- Stacklets: Per-thread-turf stack segments with info blocks (top-of-stack pointer, base, limit); chained for nested calls, auto-allocated in reserved virtual space.
- Portal Entries: Memory structures (turf ID, entry address, state) enabling secure domain switches via portal-type CALLs.
- Operations: GRANT/REVOKE for persistent permission delegation; PASS for temporary grants; ARGS for argument reservation; portal/normal CALL/RETURN for control flow with domain handling.
Core Mechanisms
- Access Enforcement: For LOAD/STORE, compute virtual address and check hardware registers first (e.g., stacklet descriptors); if unmatched, query dPLB for permissions—fault on violation. Instruction fetches use iPLB for execute/portal checks.
- Domain Switching: Portal-type CALL evaluates portal pointer, saves caller state externally (via Spiller), loads callee turf/state, initializes stacklet, and executes in new domain. RETURN unwinds, restores state from storage/memory, and revokes temporary grants.
- Permission Management: GRANT adds descriptors to PLBs; REVOKE removes them (lazy via novel bit); PASS modifies descriptors to wildcard turf for call duration, auto-revoked on RETURN.
- Argument Passing: ARGS reserves stack space (OutP register); portal CALL copies to callee’s InP for secure access without full sharing.
- Pipeline Integration: Operates in execute/retire stages with Branch and Load/Store Units; supports speculation but makes portal RETURN non-speculative for security; prefetching minimizes stalls.
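The core permission check can be sketched as a lookup over region descriptors. This is a deliberately reduced model: wildcards, the novel bit, the backing Region Table, the hardware-register fast path, and the separate iPLB/dPLB are all omitted, and the descriptor fields are illustrative.

```python
from collections import namedtuple

Region = namedtuple("Region", "lo hi perms")   # perms: subset of "rwx"

class PLB:
    """Toy protection check: a turf is a list of region descriptors,
    and every access is checked against the accessing thread's turf."""
    def __init__(self):
        self.turfs = {}                        # turf id -> [Region, ...]

    def grant(self, turf, region):
        self.turfs.setdefault(turf, []).append(region)

    def check(self, turf, addr, perm):
        """True if any descriptor in the turf allows `perm` at `addr`;
        a miss would raise a protection fault in hardware."""
        return any(r.lo <= addr < r.hi and perm in r.perms
                   for r in self.turfs.get(turf, []))

plb = PLB()
plb.grant("app", Region(0x1000, 0x2000, "rw"))
assert plb.check("app", 0x1800, "w")              # inside granted region
assert not plb.check("app", 0x1800, "x")          # no execute permission
assert not plb.check("other_turf", 0x1800, "r")   # wrong turf: would fault
```

A portal call, in this picture, is just a hardware-mediated change of which turf's descriptor list subsequent checks consult, without switching threads or address spaces.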
Problems Addressed and Benefits
Traditional systems rely on heavy process switches for isolation, incurring high overhead (e.g., cache/TLB invalidation, hundreds of cycles) and error-prone shared buffers. This design addresses these by enabling lightweight turf switches without thread changes, secure coroutine-like calls, and hardware-bypassed checks for efficiency. Benefits include reduced power/latency via parallel PLB queries, scalable support for thousands of domains, prevention of exploits (e.g., stack smashing), and flexible rights delegation—all while integrating with Mill’s high-ILP features like operand metadata and memory hierarchies.
U.S. Patent 9,747,238 – Computer processor employing split crossbar circuit for operand routing and slot-based organization of functional units
This patent details a processor architecture that optimizes operand handling and routing using a “belt” for transient storage with temporal addressing, a split crossbar interconnect for efficient bypass routing, and slot-based grouping of functional units to manage mixed-latency operations. It also includes mechanisms for subroutine calls/returns with private per-frame storage.
Key Features
- Belt Storage: A fixed-length, conveyor-like queue for transient operands, using temporal addressing based on production order rather than physical registers.
- Split Crossbar Circuit: Divided into lower (for single-cycle results) and upper (for multi-cycle results) sections to reduce routing latency and complexity.
- Slot-Based Functional Units: Groups of functional units that produce results of varying latencies in parallel, with separate output registers per latency type.
- Scratchpad Memory: Byte-addressable storage for longer-lived operands, accessed via SPILL/FILL operations.
- Spiller Unit: Handles context saving/restoring for belt and scratchpad during CALL/RETURN, ensuring isolated frames.
Core Mechanisms
- Temporal Addressing and Belt Operation: Operands are injected at the belt’s front in production order, shift backward each cycle, and retire from the end. Instructions reference operands by logical positions (e.g., Add(3,4) adds the 4th and 5th most recent). Supports multiple drops per cycle with ordering rules.
- Operand Routing via Split Crossbar: Single-cycle results route directly through the lower crossbar to any consumer. Multi-cycle results (e.g., 2-4 cycles) route via the upper crossbar to the lower for distribution, minimizing wire delays and power.
- Slot Organization: Each slot shares input/output paths and executes mixed operations (e.g., add in 1 cycle, multiply in 3). Results store in latency-specific registers (lat-1 to lat-4), shifting in a daisy-chain to align latencies; overflows spill to buffers.
- CALL/RETURN Handling: CALL saves caller’s state asynchronously via spiller, creates callee’s private belt/scratchpad, and copies arguments. RETURN restores and copies results back. Scratchpad uses window mapping (Base/Fence registers) for dynamic allocation.
- Integration with Pipeline: Decode maps temporal addresses to physical; issue dispatches to slots; execute uses bypass for direct producer-consumer routing.
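The latency-specific output registers and their daisy-chain can be sketched as a shift chain. This is an illustrative model under stated assumptions: a maximum latency of 4, one result per latency class per cycle, and no overflow buffers.

```python
class Slot:
    """Toy model of a slot's latency-specific output registers: a result
    finishing in N cycles enters register lat-N and daisy-chains toward
    lat-1, so every result emerges at the front exactly N cycles after
    issue."""
    def __init__(self):
        self.regs = [None] * 4            # regs[0] = lat-1 ... regs[3] = lat-4

    def issue(self, value, latency):
        assert self.regs[latency - 1] is None, "structural hazard"
        self.regs[latency - 1] = value

    def tick(self):
        """Shift the chain one stage; return whatever reaches the front."""
        out, self.regs = self.regs[0], self.regs[1:] + [None]
        return out

slot = Slot()
slot.issue("mul_result", 3)               # a 3-cycle multiply
slot.issue("add_result", 1)               # a 1-cycle add, same issue cycle
assert slot.tick() == "add_result"        # add drops after one cycle
assert slot.tick() is None
assert slot.tick() == "mul_result"        # multiply drops 3 cycles after issue
```

Aligning latencies this way means the lower crossbar only ever sees results at a single, uniform hand-off point, which is what keeps the bypass routing simple.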
Problems Addressed and Benefits
Traditional processors rely on spatial registers, leading to high rename overhead, complex bypass networks, and inefficient multi-latency handling. This design addresses these by eliminating registers for transients, reducing bypass costs via split crossbars, and enabling parallel mixed-latency execution in slots. Benefits include lower power and area through simplified routing, higher ILP via temporal addressing and private frames, efficient context switches without stalls, and better scalability for wide-issue pipelines. It complements Mill’s other innovations like operand metadata and loop pipelining for high-performance computing.
U.S. Patent 9,785,441 – Computer processor employing instructions with elided nop operations
This patent describes a processor architecture that processes two parallel instruction streams with a predefined timed semantic relationship, using variable-length instructions that incorporate an “alignment hole” to implicitly encode NOP (no-operation) counts. This elides explicit NOP instructions, optimizing for parallel decoding and execution in a split-stream design.
Key Features
- Dual Instruction Streams: Two distinct streams (Stream I and Stream II) within instruction blocks, flowing in opposite directions in memory (Stream I forward-increasing, Stream II reverse-decreasing) from a shared entry address.
- Variable-Length Instruction Format: Each instruction includes a fixed-length header (with overall length and block slot counts) and a variable-length bit bundle divided into forward (head-to-tail) and reverse (tail-to-head) operation blocks.
- Alignment Hole: A variable-position gap (0-7 bits) between forward and reverse block groups, encoding a binary count of implicit NOP operations without using explicit NOP encodings.
- Parallel Processing Components: Two multi-stage pipelines (each with program counter, fetch unit, instruction buffer, decode stage, and execution logic) that handle the streams independently but synchronously via NOP-induced stalls.
- Stream Specialization: Streams can be assigned different instruction classes (e.g., Stream I for flow-control and memory operations, Stream II for computational operations).
Core Mechanisms
- Instruction Organization: Instructions are grouped into blocks (extended basic blocks) with a single entry and multiple exits. Streams share cache lines at entry points but diverge directionally to reduce thrashing in separate L1 caches.
- Decoding Process:
- Header parsing in the first sub-stage extracts length and block fields, enabling speculative decoding of initial forward blocks.
- Forward blocks are decoded sequentially from head to tail; reverse blocks from tail to head, allowing parallel processing (e.g., Block 2F and Block 2R simultaneously).
- Shifter logic isolates blocks using header-derived tap values; the alignment hole is processed after all blocks to update a running NOP counter.
- NOP counts accumulate across instructions and trigger stalls in the opposing stream’s decode or issuance stages, either immediately or in subsequent cycles.
- Execution and Synchronization: Decoded operations are issued to functional units grouped by operand count (e.g., dyadic, triadic). Timed semantics are enforced by stalling the lagging stream via implicit NOPs, without data dependencies across streams except through control signals.
- Alternate Formats: Supports extensions for larger holes or lag operations if the standard hole size is insufficient.
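The alignment-hole idea can be sketched as bit-string packing. This layout is illustrative, not the patent's exact format: here the padding bits needed to byte-align the bundle sit between the forward and reverse block groups and carry a binary NOP count instead of being wasted.

```python
def encode_bundle(fwd_bits, rev_bits, nop_count):
    """Pack forward and reverse operation blocks into one byte-aligned
    bundle; the 0-7 leftover padding bits form the alignment hole and
    encode the implicit NOP count."""
    used = len(fwd_bits) + len(rev_bits)
    hole = (-used) % 8                        # spare bits up to the next byte
    assert nop_count < (1 << hole), "NOP count too large for this hole"
    hole_bits = format(nop_count, f"0{hole}b") if hole else ""
    return fwd_bits + hole_bits + rev_bits, hole

def decode_nops(bundle, fwd_len, hole):
    """The decoder isolates the hole after the forward blocks and
    recovers the implicit NOP count for stalling the opposing stream."""
    return int(bundle[fwd_len:fwd_len + hole], 2) if hole else 0

bundle, hole = encode_bundle("10110", "110100110", nop_count=3)
assert hole == 2                             # 14 bits used, 2 bits of hole
assert decode_nops(bundle, 5, hole) == 3     # three implicit NOPs, zero bytes
```

The NOP count thus travels for free in bits the variable-length format would have spent on padding anyway, which is why no explicit NOP encodings are needed.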
Problems Addressed and Benefits
Traditional processors with timed semantics require explicit NOPs to synchronize parallel operations, wasting memory, fetch bandwidth, and power—especially in variable-length or VLIW architectures where decoding is serialized and stalls are frequent. Variable-length instructions complicate parallel decoding, and shared caches lead to thrashing in multi-stream designs. This invention mitigates these by:
- Eliding explicit NOPs through alignment holes, reducing code size, memory usage, and processing overhead.
- Enabling fast parallel decoding with double-ended processing, minimizing latency in high-ILP pipelines.
- Improving cache efficiency via separate L1 caches per stream, supporting larger working sets without thrashing.
- Maintaining precise timed relationships between streams with low-cost stalls, enhancing performance in split-stream architectures.
- Providing economical encoding for variable-length instructions, making it scalable for modern workloads.
Overall, this design aligns with Mill Computing’s focus on efficient, high-parallelism processors, complementing related patents on double-ended decoding, bypass networks, and operand metadata by optimizing synchronization in dual-stream execution.
U.S. Patent 9,817,669 – Computer processor employing explicit operations that support execution of software pipelined loops and a compiler that utilizes such operations for scheduling software pipelined loops
This patent details a processor architecture and associated compiler techniques that facilitate efficient software pipelining of loops by using explicit operations to manage operand retirement and loop frames, eliminating the need for prologues and epilogues while optimizing for high instruction-level parallelism (ILP).
Key Features of the Operand and Execution Model
- Logical Belt: A conveyor-like structure for storing transient operands, with temporal addressing based on production order and operation latency (e.g., add drops immediately, mul after 3 cycles). Operands are dropped to the front and retired from the rear.
- Operand Metadata: Includes scalarity (scalar/vector), element width, floating-point flags, Not-a-Result (NAR) for errors, and None for missing values.
- None Operand: Represents absent data; propagates through speculable operations (e.g., arithmetic) and causes non-speculable operations (e.g., stores) to skip without side effects.
- Scratchpad Memory: Byte-addressable storage for long-lived operands, with logical-to-physical mapping via rotators (supporting wrap-around for loop-carried values) and operations like SCRATCHF (allocate), SCRATCHD (deallocate), and ROTATE (shift cursor).
- Extended Scratchpad: Hierarchically extends scratchpad into main memory for larger needs.
- Spiller Unit: Handles context saving/restoring for belt and scratchpad across calls/loops.
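The belt's temporal addressing can be modeled as a bounded queue. This is a deliberately simplified sketch (latency-based drop timing and metadata are omitted); the class name and interface are illustrative, not the patent's.

```python
from collections import deque

class Belt:
    """Toy belt: position 0 is the newest drop, older operands shift
    back one place per drop, and the oldest falls off the rear when
    the fixed length is exceeded."""

    def __init__(self, length: int = 8):
        self.slots = deque(maxlen=length)  # index 0 == most recent result

    def drop(self, value):
        self.slots.appendleft(value)  # new results drop to the front

    def read(self, pos: int):
        # Temporal address: "the pos-th most recent result",
        # not a spatial register name.
        return self.slots[pos]
```

Because results are addressed by production order rather than by name, no result ever overwrites another, which is what eliminates the renaming hazards.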
Core Mechanisms
- RETIRE Operation: Explicitly specifies a static count of operands to retire in a machine cycle. If fewer actual operands are produced, None values fill the gap; if more, the operation faults; if the counts match, retirement proceeds unchanged. The RETIRE WITH WIDTH LIST variant adds scalarity/width tags for validation.
- INNER and LEAVE Operations: INNER creates a new loop frame with an empty belt, initialized from arguments, without altering function context. LEAVE restores the prior belt, discards in-flight operations, and drops exit arguments, avoiding epilogues.
- Software Pipelining: Compiler schedules loops to enter steady-state directly using RETIRE at the loop head to simulate missing operands with None during warmup phases. Overlaps iterations by leveraging belt temporal ordering and latency-based drops.
- Loop Frame Management: Supports nested loops via multiple rotators and private belts/scratchpads per frame, with ROTATE advancing physical addresses per iteration while keeping logical addresses static.
- Speculation and Error Handling: None/NAR propagation ensures safe speculative execution; non-speculable ops skip on None, accumulating FP flags only on commits.
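The three cases of RETIRE described above can be captured in a short sketch. The function name and the `NONE` sentinel are illustrative stand-ins, assuming the operand list is what the cycle actually produced.

```python
NONE = object()  # stand-in for the Mill's "None" missing-value operand

def retire(produced: list, static_count: int) -> list:
    """Sketch of RETIRE: pad with None when fewer operands arrive than
    the static count, fault on more, pass through unchanged on equal."""
    if len(produced) > static_count:
        raise RuntimeError("RETIRE fault: more operands than static count")
    # During loop warmup, not-yet-produced values become None, letting
    # the steady-state schedule run unchanged from the first iteration.
    return produced + [NONE] * (static_count - len(produced))
```

Since None skips non-speculable operations without side effects, padding with None is exactly what lets the loop body enter steady state with no prologue.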
Problems Addressed and Benefits
Traditional software pipelining requires prologues to fill the pipeline and epilogues for cleanup, wasting code space, fetch bandwidth, and execution time—especially for low-iteration or nested loops. Fixed rotating registers (e.g., in SPARC/Itanium) consume space inefficiently for variable-sized operands, increase register pressure, and necessitate spills. This design addresses these by:
- Eliminating prologues/epilogues through RETIRE and INNER/LEAVE, reducing code size and latency for short loops.
- Enabling direct steady-state entry, boosting ILP and performance in parallel pipelines.
- Providing flexible storage via belt (transients) and scratchpad (long-lived), minimizing waste and spills with byte-packing and dynamic allocation.
- Supporting nested/search-style loops efficiently with rotators and frame isolation, without fixed-size constraints.
- Enhancing compiler flexibility for optimizations like JIT, with machine-dependent scheduling for operand lifetimes and latencies.
Overall, this invention advances Mill Computing’s high-ILP architecture by integrating with features like operand metadata, bypass networks, and memory optimizations, delivering prologue-free pipelining for robust, efficient loop execution in modern workloads.
U.S. Patent 9,875,106 – Computer processor employing instruction block exit prediction
This patent introduces a processor architecture that predicts exit points of instruction blocks (Extended Basic Blocks or EBBs) using a table of predictors and an associative cache to form a chain guiding prefetch, fetch, decode, and execution. This reduces pipeline stalls from branch mispredictions by enabling run-ahead processing of expected control-flow paths.
Key Features
- Instruction Blocks (EBBs): Sequences with a single entry point and multiple exit points via control transfers (branches, calls, returns); divided into fragments starting at entry/return points and ending at calls or exits.
- Exit Table: Direct-mapped hash table storing predictor entries for fragments, indexed by unique keys (arithmetically derived for collision avoidance).
- Predictor Entries: Include target address (offset or absolute), cache-line count, instruction count, transfer kind (branch, call, return), untaken-call count, quality counter, loop iteration count, and alternate key for misprediction recovery.
- Exit Cache: Small associative cache for recently used predictors, reducing table access latency and supporting one-per-cycle chaining.
- Prediction Chain: Linked sequence of predictors queried from the table/cache, used to drive hardware components.
- Return Stack and Loop Stack: Handle return resolutions and loop iterations, with extensions for in-flight operations.
- Bulk Loading: Toolchain-generated predictor sets embedded in program images, loaded into the table/cache on cold starts to minimize initial mispredictions.
Core Mechanisms
- Prediction Process: On entering a block/fragment, query the Exit Table/Cache with the key to retrieve a predictor; chain subsequent predictors based on expected exits, forming a FIFO queue.
- Prefetch and Fetch: Prefetcher uses predictor address and cache-line count for exact line prefetching (no wasteful speculation); fetcher loads lines into an instruction buffer, resolving returns via the Return Stack.
- Decode and Shifting: Decode control logic consumes the queue head to manage an instruction shifter, isolating instructions based on counts; supports overflow via saturated values and pessimistic fetching.
- Misprediction Recovery: Detect mismatches during execution; discard decode state, insert dummy predictors for continued shifting, rebuild chain from actual exit using alternate keys, and update quality counters (saturating for success/failure tracking).
- Loop Handling: Predictors store toolchain-supplied iteration counts; hardware loop stack decrements on back-branches, switching to exit paths when exhausted.
- Deferred Branches: Annotate branches with deferral cycles; if unresolved, inject new chains into the Exit Cache for speculative prefetch.
- Split-Stream Encoding: Supports dual (forward/backward) address streams with half-predictors for independent decoding.
- Update Queue: Buffers post-execution adjustments to predictors, ensuring non-blocking updates.
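The chaining step in the prediction process above can be sketched as follows. This is a toy model under stated assumptions: the table layout, key scheme, and field names (`target`) are illustrative, and real hardware chains one predictor per cycle rather than in a loop.

```python
def build_chain(exit_table: dict, entry_key, depth: int) -> list:
    """Follow predicted exits to build the FIFO chain of predictors
    that drives prefetch, fetch, and decode ahead of execution."""
    chain, key = [], entry_key
    for _ in range(depth):
        pred = exit_table.get(key)
        if pred is None:      # unknown fragment: the chain stops here
            break
        chain.append(pred)
        key = pred["target"]  # the predicted exit names the next fragment
    return chain
```

A misprediction invalidates the tail of such a chain; recovery rebuilds it from the actual exit via the alternate key.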
Problems Addressed and Benefits
Traditional branch prediction uses per-branch tables, leading to high misprediction penalties (e.g., pipeline flushes costing 30+ cycles), table size bloat, power inefficiency, and wasteful prefetching. This block-level approach mitigates these by:
- Enabling run-ahead chaining for prefetch/fetch/decode, minimizing stalls and improving throughput in high-ILP pipelines.
- Reducing hardware overhead with compact predictors capturing full control-flow context, including loops and calls/returns.
- Enhancing cold-start performance via bulk loading, avoiding repeated mispredictions in new code paths.
- Lowering power and latency through exact prefetching, quality-based maintenance, and lightweight recovery (no full flushes).
- Supporting complex structures like nested loops and variable-length instructions with metadata-rich predictors.
Overall, the invention optimizes control-flow prediction for modern processors, aligning with Mill Computing’s architecture by integrating with features like variable-length decoding and bypass networks for efficient, stall-resistant execution.
U.S. Patent 9,959,119 – Computer processor employing double-ended instruction decoding
This patent describes a processor architecture that uses double-ended decoding to efficiently handle variable-length instructions. Each instruction consists of a fixed-length header and a variable-length bit bundle organized into slots and blocks, divided into forward (head-to-tail) and reverse (tail-to-head) groups. The decode stage processes these in parallel, starting from both ends, to enable fast, simultaneous decoding of operations while minimizing hardware complexity.
Key Features
- Instruction Format: Fixed-length header with fields for overall length and slot counts per block; variable-length bit bundle partitioned into fixed-length slots grouped into blocks (e.g., up to four, like Block 1F, 2F, 3F for forward; Block 3R, 2R for reverse).
- Double-Ended Partitioning: Forward group extends from the head end toward the tail; reverse group (including tail-end block) from the tail toward the head, allowing simultaneous processing from both ends.
- Alternate Encodings: “Svelte” for compact headers with fixed slots; “Skinny” for table-indexed common operations to bypass full decoding.
- Shifter and Decoder Logic: Hardware shifters isolate blocks using header-derived tap values; parallel parser/decoder arrays handle fixed-length parsing and opcode decoding.
Core Mechanisms
- Decoding Process: Instructions are stored in an instruction buffer and aligned in a double shifter. The header is processed first to generate control signals and shifter tap values for block isolation.
- Parallel Decoding: Forward blocks are decoded sequentially in forward order across pipelined sub-stages; reverse blocks in reverse order, starting from the tail-end block. This enables parallel handling (e.g., Block 2F with Block 3R) without full sequential scanning.
- Speculative and Gated Execution: Head-end blocks (e.g., Block 1F) are decoded speculatively assuming maximum slots, with invalid results gated based on actual header counts.
- Shifter Circuitry: Uses N-way mux trees for logarithmic-speed alignment; left-aligns forward blocks and right-aligns reverse ones, supporting variable bundle sizes up to 64.
- Integration: Complements the processor’s execution logic, allowing decoded operations to issue in parallel while maintaining semantic order.
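The double-ended block isolation can be sketched functionally. This is a simplified software analogue, assuming byte-granular block sizes; the hardware does the same partitioning with mux-tree shifters driven by header-derived taps.

```python
def split_bundle(bundle: bytes, fwd_sizes, rev_sizes):
    """Isolate forward blocks from the head and reverse blocks from
    the tail, using block sizes derived from the header (the taps)."""
    fwd, pos = [], 0
    for n in fwd_sizes:                 # head-to-tail order
        fwd.append(bundle[pos:pos + n])
        pos += n
    rev, end = [], len(bundle)
    for n in rev_sizes:                 # tail-to-head order
        rev.append(bundle[end - n:end])
        end -= n
    return fwd, rev  # fwd and rev blocks can be decoded in parallel
```

Because neither traversal depends on the other's boundaries, a forward block and a reverse block can be decoded in the same cycle without scanning the whole bundle serially.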
Problems Addressed and Benefits
Traditional variable-length instruction decoding requires serial parsing to locate boundaries, leading to bottlenecks, high hardware costs, and delays in parallel execution (e.g., in CISC or VLIW architectures). This double-ended approach mitigates these by:
- Enabling parallel decoding from both ends, reducing latency and supporting high instruction-level parallelism without complex sequential logic.
- Improving efficiency and compactness through header-guided isolation and alternate encodings, saving memory bits and power.
- Lowering hardware overhead with fixed-length sub-parsing and logarithmic shifters, making it scalable for modern processors.
- Enhancing performance in out-of-order or multi-threaded environments by minimizing decode stalls and facilitating compact, variable-length formats.
Overall, the invention aligns with Mill Computing’s high-ILP architecture, complementing related patents on operand metadata, bypass networks, and memory optimizations for more efficient, power-aware processing.
U.S. Patent 9,965,274 – Computer processor employing bypass network using result tags for routing result operands
This patent describes a processor architecture featuring a bypass network that uses dynamically generated result tags to efficiently route operand data directly from producing functional units to dependent instructions, enhancing data forwarding in pipelined, multi-cycle execution environments.
Key Features
- Result Tags: Numeric identifiers dynamically assigned to result operands, ensuring unique tracking and collision-free routing.
- Bypass Network: Provides dedicated paths for broadcasting result data along with tags, enabling direct forwarding without register file or memory access.
- Tag Match Mux Control Circuits: Hardware selectors with comparators that match tags to select and route the correct operands to functional units.
- Dynamic Tag Generation: Based on operation slots, latencies, and valid bits, allowing predictable and reusable tags akin to register allocation.
Core Mechanisms
- Tag Generation: Circuitry assigns each result operand (produced by a functional unit over one or more machine cycles) a dynamically generated tag, incrementing from an initial value and incorporating the number of valid results from prior slots or latencies to prevent overlaps.
- Broadcast: Tags, paired with valid bits, are broadcast over the bypass paths alongside the result data.
- Operand Selection: Tag match mux circuits compare broadcast tags against the source tags specified in the instruction’s operation field. Upon a valid match, the corresponding data is forwarded directly to the consuming functional unit, supporting single or multiple matches and including fault detection for mismatches.
- Pipeline Integration: Fast forwarding reduces stalls, integrating seamlessly with out-of-order or speculative execution.
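The selection side of this scheme can be sketched as a simple comparator loop. This is an illustrative software model, not the hardware design: each tuple stands for one broadcast lane carrying a tag, a valid bit, and result data.

```python
def tag_match_select(broadcasts, source_tag):
    """Model of a tag-match mux: each comparator checks a broadcast's
    tag and valid bit against the instruction's source tag."""
    for tag, valid, data in broadcasts:
        if valid and tag == source_tag:
            return data  # forward directly to the consuming unit
    raise RuntimeError("bypass fault: no valid producer matched the tag")
```

In hardware all comparators fire in parallel and the match drives a binary selector; the fault path models the mismatch detection mentioned above.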
Problems Addressed and Benefits
Conventional bypass networks rely on complex wiring, address-based lookups, or multi-stage comparators, resulting in high hardware overhead, routing congestion, propagation delays, and scalability issues in high-performance processors. This invention mitigates these by:
- Employing lightweight, predictable result tags for routing, simplifying logic and reducing interconnect complexity.
- Enhancing instruction throughput via faster data forwarding.
- Lowering power consumption and area usage through binary selectors and fewer wires.
- Improving performance in multi-unit, variable-latency environments and overall efficiency in pipelined architectures.
Overall, it aligns with Mill Computing’s focus on high-ILP designs, complementing related patents on operand metadata and memory optimizations.
U.S. Patent 10,678,700 – CPU Security Mechanisms employing Thread-Specific Protection Domains
This patent details a hardware-based security system for processors, introducing “turfs”—lightweight, thread-specific protection domains that enable fine-grained memory isolation and access control. Turfs allow secure execution of code from varying trust levels within the same thread and address space, minimizing overhead compared to traditional processes or privilege rings.
Key Features of Protection Domains (Turfs)
- Turf Structure: Each turf is identified by a unique turfID and defines memory regions with specific permissions (e.g., read, write, execute, portal/grant). It includes:
- Well-Known Regions (WKRs): Hardware-optimized predefined areas for common accesses like code (cWKR), data/constants (dWKR), stack, thread-local storage (TLS), and null (nWKR) for fast NULL checks.
- Permission Tables: Backed by instruction and data Protection Lookaside Buffers (iPLB and dPLB) for efficient lookups.
- Thread Integration: Security state is tied to thread-turf pairs (threadID + turfID), supporting thousands of domains per process with low resource use.
- Permission Types: Includes transient (temporary, call-scoped) and persistent (longer-term delegation) grants to safely share access without permanent exposure.
Core Mechanisms
- Access Enforcement:
- Memory operations (loads, stores, fetches) first check against turf-specific WKRs for fast-path approval.
- Misses query PLBs and permission tables; violations trigger faults.
- Hardware ensures isolation, with guard bits on pointers for provenance tracking and session IDs to prevent cross-turf leaks.
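The two-level enforcement path above can be sketched as a fast-path check followed by a table lookup. This is a toy model with assumed data shapes: WKRs as `(base, limit, permission-set)` triples and a page-granular PLB dictionary, neither of which is the patent's actual layout.

```python
PAGE_SHIFT = 12  # illustrative page-granular PLB indexing

def check_access(addr: int, op: str, wkrs, plb: dict) -> bool:
    """Fast path: well-known regions approve common accesses without a
    table walk. Slow path: consult the PLB; a miss in both faults."""
    for lo, hi, perms in wkrs:          # e.g. (base, limit, {"r", "x"})
        if lo <= addr < hi and op in perms:
            return True                 # WKR fast-path approval
    entry = plb.get(addr >> PAGE_SHIFT)
    if entry is not None and op in entry:
        return True
    raise PermissionError("protection fault: no grant covers this access")
```

The fast path is what keeps common accesses (code fetch, stack, TLS) from paying the lookup cost on every operation.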
- Domain Switching via Portals:
- Portals are memory objects with a ‘p’ permission bit, containing target turfID, entry point, and setup info.
- A portal CALL switches turfs securely: spills caller state to a spillet, passes arguments, activates new permissions, and revokes transients on return.
- Operations like GRANT, RELAY, and PERSIST handle permission delegation; supports nested calls and returns.
- Stack and State Handling:
- Per-turf stacklets isolate frames to prevent overflows or smashing.
- Hardware spiller manages register/belt state across switches; ARGS reserves argument frames.
- Separate call and data stacks enhance security.
- Integration with Processor Features: Complements Mill’s belt architecture, operand metadata, and high-ILP design by embedding security checks in hardware pipelines.
Problems Addressed and Benefits
Conventional security models (e.g., rings, processes, capabilities) suffer from high context-switch costs, coarse granularity, or complexity, making them inefficient for fine-grained isolation in performance-sensitive code. Turfs address this by:
- Offering lightweight, hardware-enforced domains with minimal overhead for switches (no TLB flushes or full saves).
- Enabling secure intra-thread interactions (e.g., user code calling libraries or services) without privilege escalation or shared vulnerabilities.
- Reducing performance penalties through WKRs and fast checks, while preventing attacks like confused deputy or buffer overflows.
- Enhancing flexibility and scalability for sandboxes, plugins, or microservices in a single address space.
Overall, this invention provides robust, efficient security tailored to modern high-parallelism processors, integrating seamlessly with other Mill innovations for secure, high-performance computing.
U.S. Patent 10,802,987 – Computer processor employing Cache Memory storing Backless Cache Lines
This patent introduces a processor architecture that optimizes virtual memory management through “backless cache lines”—cache entries associated with virtual addresses that lack immediate backing by valid physical memory. This enables lazy physical memory allocation, reducing unnecessary overhead for transient or unused data while maintaining compatibility with standard paging systems.
Key Features of Backless Cache Lines
Backless cache lines are defined by virtual addresses that either:
- Lack a corresponding page table entry (in virtual caches), or
- Have a page table entry pointing to an invalid or “pending” physical address (in physical caches).
These lines reside solely in cache without physical backing until triggered to transform into a “backed” state. Key attributes include:
- Zero-Filled Initialization: Loads from backless lines return zero-value bytes, simulating uninitialized memory without physical access.
- Temporary Single-Line Pages: During transformation, data may be stored in cache-line-sized physical pages before migrating to larger OS-specified pages (e.g., 4KB).
- Page Table Markings: Use indicators like “pending” or dummy addresses to denote backless status, facilitating quick detection and updates.
- Support for Write-Back and Write-Through Caches: Backing is deferred until eviction (write-back) or performed immediately on store (write-through), depending on cache policy.
The system integrates with Translation Lookaside Buffers (TLBs), Protection Lookaside Buffers (PLBs), and Memory Management Units (MMUs) for efficient translation and protection.
Core Mechanisms
- Load and Store Handling:
- Hits: Process normally using cached data.
- Misses: For backless addresses, allocate a zero-filled cache line and apply the operation without physical memory involvement.
- Eviction and Backing Transformation:
- In write-back caches, eviction forces allocation of physical space (e.g., single-line page), data write, and page table update to a valid mapping.
- In write-through caches, stores trigger immediate backing and physical write.
- OS involvement may occur for page resizing or pool replenishment via interrupts.
- Virtual Caches: Rely on missing page table entries for backless identification; use PLBs for protection.
- Physical Caches: Employ “pending” markings or dummy address spaces; MMUs handle translations, with secondary tables for state transitions.
- Optimization Features: Pre-allocated zeroed page pools reduce latency; flows ensure no data loss during evictions.
Detailed flowcharts (e.g., FIGS. 5–7 for virtual caches, FIGS. 9–12 for physical caches) outline these processes, emphasizing minimal OS intervention for cache-resident operations.
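The write-back load/store/eviction flow can be sketched as a toy cache. The class name, line size, and callback interface are illustrative assumptions; the point is that zero-filled lines serve loads and stores with no physical memory involved until eviction.

```python
LINE = 64  # illustrative cache-line size in bytes

class BacklessCache:
    """Toy write-back cache: a miss to an unbacked virtual address just
    allocates a zero-filled line; physical memory is committed only at
    eviction, when the line's data must persist."""

    def __init__(self):
        self.lines = {}  # virtual line address -> bytearray payload

    def _line(self, addr: int) -> bytearray:
        base = addr & ~(LINE - 1)
        return self.lines.setdefault(base, bytearray(LINE))

    def load(self, addr: int) -> int:
        return self._line(addr)[addr % LINE]  # backless loads read zero

    def store(self, addr: int, byte: int) -> None:
        self._line(addr)[addr % LINE] = byte  # still no physical backing

    def evict(self, base: int, allocate_backing) -> None:
        # Only now is physical space allocated (e.g. a single-line page)
        # and the page table updated to a valid mapping.
        allocate_backing(base, bytes(self.lines.pop(base)))
```

Data that lives and dies entirely within the cache never triggers the `evict` path, which is the source of the savings described below.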
Problems Addressed and Benefits
Traditional virtual memory systems eagerly allocate and initialize physical pages on first access, leading to high memory traffic, power consumption, and overhead for sparse or transient data (e.g., large mmap regions or cache-fitting programs). Backless cache lines mitigate this by:
- Enabling lazy allocation—physical memory is committed only when data must persist beyond cache (e.g., on eviction).
- Reducing memory traffic and latency through in-cache handling of backless data with zero-filled responses.
- Lowering power and costs by avoiding unnecessary physical interactions and initializations.
- Improving performance in single-address-space models without aliasing or extension issues.
- Enhancing scalability for dynamic workloads, with seamless integration into existing hierarchies and OS functions.
Overall, this invention aligns with Mill Computing’s focus on efficient, high-performance architectures, complementing related patents on operand metadata and memory optimizations by minimizing virtual memory bottlenecks.
U.S. Patent 11,226,821 – Computer processor employing Operand Data with Associated Meta-data
This patent describes a processor architecture where operands are handled as unitary data elements, each combining the actual data value (payload) with embedded metadata. This metadata provides contextual information that enhances execution efficiency, particularly in speculative operations, error handling, and vectorized (SIMD) processing.
Key Features of the Operand Model
Each unitary operand includes:
- Payload Data: The core value, which can be scalar (single element) or vector (multiple elements).
- Metadata: Descriptive tags that travel with the data, including:
- Type indicator (scalar or vector).
- Elemental width (power-of-two byte sizes, e.g., 1, 2, 4, or 8 bytes).
- Floating-point error flags (per IEEE-754, such as invalid operation, divide-by-zero, overflow, underflow, or inexact).
- Special values:
- Not-a-Result (NAR): Signals errors like invalid memory access or arithmetic issues; includes debugging details (error kind and location) for better diagnostics.
- None: Represents missing or invalid data, useful for masking unused elements in vectors or loops.
Functional units in the processor treat these unitary elements holistically, using metadata to guide operations without relying on instruction opcodes for type or size specifics.
Core Mechanisms
- Width Polymorphism: Operations (e.g., addition) adapt automatically to different data widths based on metadata, reducing the need for multiple instruction variants and simplifying the instruction set architecture (ISA).
- Memory Interaction: Metadata is processor-internal; loads from memory add appropriate metadata, while stores strip it and write only the payload, maintaining compatibility with standard memory systems.
- Speculative Execution: Supports aggressive speculation by propagating NAR or None through operations without immediate side effects. Non-speculative operations (e.g., stores or branches) trigger faults or skip actions when encountering these special values.
- Floating-Point Handling: Error flags accumulate (via logical OR) during speculative chains and update global state only on non-speculative commits, ensuring precise exception handling.
- Error Management: Includes operations like splitNAR/joinNAR for manipulating NAR states, and a no-speculation mode for immediate fault reporting during debugging.
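The speculation rules above can be sketched with a minimal operand type. This is an illustrative model, assuming a simplified metadata layout (payload, width, kind); real operands carry the fuller tag set described in the Key Features.

```python
from dataclasses import dataclass

@dataclass
class Operand:
    payload: object = None
    width: int = 4        # elemental width in bytes
    kind: str = "value"   # "value", "NAR", or "None"

def add(a: Operand, b: Operand) -> Operand:
    """Speculable, width-polymorphic add: metadata drives behavior,
    and NAR/None propagate instead of faulting."""
    for special in ("NAR", "None"):
        if special in (a.kind, b.kind):
            return Operand(kind=special, width=a.width)
    assert a.width == b.width, "width mismatch is a detectable error"
    return Operand(a.payload + b.payload, a.width)

def store(mem: dict, addr: int, op: Operand) -> None:
    """Non-speculable: faults on NAR, skips on None, no side effects."""
    if op.kind == "NAR":
        raise RuntimeError("fault: NAR reached a non-speculable store")
    if op.kind == "None":
        return                  # quietly skipped
    mem[addr] = op.payload      # metadata is stripped on the way to memory
```

Deferring the fault until a NAR reaches a non-speculable operation is what makes aggressive speculation safe without recovery machinery.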
Specialized Vector Operations
The architecture introduces operations to facilitate efficient vectorization of loops that are challenging in traditional processors:
- Pick: A conditional select (ternary-like) that chooses between operands based on a Boolean control (scalar or vector), propagating None to avoid unwanted updates.
- Vector Smear (inclusive/exclusive): Converts a Boolean vector into a mask by propagating the first “true” value, ideal for while-loop termination without scalar checks or cleanup code.
- Remaining: For counting loops, generates a mask for partial vectors, using None to ignore excess elements and prevent side effects.
- Satisfied: Counts leading unsatisfied iterations in search-style loops, aiding precise termination detection in vectorized while loops.
These operations minimize branching, eliminate scalar fallback code, and enable safe SIMD processing of irregular or conditional loops.
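The smear operations in particular are easy to state precisely. The sketch below models them on plain Boolean lists; the hardware versions operate on Boolean vector operands with None-aware masking, which this toy omits.

```python
def smear_inclusive(bools):
    """Inclusive smear: the first True and everything after it is True."""
    out, seen = [], False
    for b in bools:
        seen = seen or b
        out.append(seen)
    return out

def smear_exclusive(bools):
    """Exclusive smear: True only strictly after the first True -- the
    mask of iterations a scalar while-loop would never have executed."""
    return [False] + smear_inclusive(bools)[:-1]
```

Applying the exclusive smear of a loop's exit condition as a None mask is what lets a vectorized while-loop run the final partial vector safely.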
Problems Addressed and Benefits
Conventional processors separate data from metadata, leading to complex ISAs, costly speculation recovery, and difficulties in vectorizing non-uniform loops. By integrating metadata directly with operands, this design:
- Simplifies the ISA and compiler efforts through polymorphism.
- Enhances speculation safety and efficiency with low-overhead error propagation.
- Improves vectorization for dynamic loops, boosting performance in data-parallel tasks.
- Provides better debugging via embedded error info and robust error detection (e.g., mismatched widths).
Overall, the invention promotes higher instruction-level parallelism, resource efficiency, and reliability in modern computing workloads while remaining backward-compatible with existing memory hierarchies.