http://millcomputing.com/w/api.php?action=feedcontributions&user=Jan&feedformat=atomMill Computing Wiki - User contributions [en]2024-03-29T09:54:50ZUser contributionsMediaWiki 1.23.1http://millcomputing.com/wiki/ArchitectureArchitecture2015-01-20T16:37:22Z<p>Jan: /* Introduction */</p>
<hr />
<div>== Introduction ==<br />
<br />
The Mill architecture is a general purpose processor architecture paradigm in the sense that a [http://en.wikipedia.org/wiki/Stack_machine stack machine] or a [http://en.wikipedia.org/wiki/RISC_processor RISC processor] is a processor architecture paradigm.<br />
<br />
It is also a processor family architecture in the sense that [http://en.wikipedia.org/wiki/X86 x86] or [http://en.wikipedia.org/wiki/ARM_architecture ARM] are processor family architectures. Any specific Mill processor can be configured and optimized for a wide variety of different tasks and use profiles.<br />
<br />
=== Features ===<br />
<br />
One guiding principle in the design was to identify the edge areas where current CPU designs struggle, and to find specific solutions for those areas.<br />
<br />
;Memory Bandwidth : Innovations like [[Backless Memory]] and [[Merging Caches]] dramatically cut down memory usage in typical workloads.<br />
;Computing Throughput : The massive parallelism enabled by [[Static Scheduling]] and the flexible wide issue [[Encoding]] push <abbr title="Multiple Instructions, Multiple Data">MIMD</abbr> workloads beyond anything possible previously.<br />
;Security : Rethinking the protection primitives enables fast [[Context Switches]], which for example make true micro kernels viable; and hiding security sensitive control data from program access makes many common exploits physically impossible on the Mill architecture.<br />
;Multi-Processing : These fast context switches, which require no cache or page translation flushes in a single address space, together with fast optimistic synchronization facilities, make managing many diverse processes and workloads on one machine far more effective and efficient.<br />
<br />
To briefly classify the Mill architecture: it is a [[Static Scheduling|statically scheduled]], wide issue, in order [[Belt]] architecture, i.e. as in a [http://en.wikipedia.org/wiki/Digital_signal_processor DSP], all instructions are issued in the order they appear in the binary instruction stream.<br />
<br />
This approach traditionally has problems dealing with common general purpose workload operations and flows like branches, particularly [[Pipelining#While_Loops|while-loop execution]], as well as with hiding [[Memory]] access latency. Those problems have been addressed: the [[Static Scheduling]] by the [[Compiler]] offloads most of the work that had to be done in hardware on every cycle into tasks performed once at compile time. This is where most of the power savings and performance gains come from in comparison to traditional general purpose architectures.<br />
<br />
<br /><br />
<imagemap><br />
File:Architecture.png|alt=Mill Architecture<br />
desc none<br />
rect 455 220 515 470 [[Streamer]]<br />
rect 90 120 150 200 [[Prediction]]<br />
rect 90 260 150 330 [[Protection]]<br />
rect 390 260 460 330 [[Protection]]<br />
rect 0 200 230 290 [[Decode]]<br />
rect 160 150 165 160 [[Metadata]]<br />
rect 160 160 220 180 [[Metadata]]<br />
rect 220 140 380 215 [[Belt]]<br />
rect 150 120 240 200 [[Belt#Belt_Position_Data_Format]]<br />
rect 260 80 380 140 [[ExuCore]]<br />
rect 260 215 380 275 [[FlowCore]]<br />
rect 380 140 460 215 [[Registers]]<br />
rect 515 80 660 110 [[Scratchpad]]<br />
rect 515 150 660 210 [[Spiller]]<br />
rect 0 335 660 480 [[Memory]]<br />
default [http://millcomputing.com/w/images/3/3e/Architecture.svg]<br />
<br />
</imagemap><br />
<br /><br />
<br />
== Overview ==<br />
<br />
The general design philosophy behind the Mill could be described as: remove from the chip, as far as possible, anything that doesn't directly contribute to computation at runtime; perform those tasks once, optimally, in the compiler; and use the freed space for more computation units. This results in vastly improved single core performance through more instruction level parallelism, as well as more room for more cores.<br />
<br />
There are quite a few hurdles for traditional architectures to actually utilize the large amount of instruction level parallelism provided by many <abbr title="Arithmetic Logic Unit">ALU</abbr>s. Some of the most unique and innovative features of the Mill emerged from tackling those hurdles and bottlenecks.<br />
<br />
* The [[Belt]], for example, is the result of having to provide many data sources and drains for all those computational units, interconnecting them without tripping over data dependencies and hazards, and without polynomial growth in the space and power cost of the interconnect.<br />
* The unusual split stream, variable length, very wide issue [[Encoding]] makes the parallel feeding of all those ALUs with instructions possible in a die space and energy efficient way with optimally computed code density.<br />
* The Mill features an exposed latency pipeline. As is usual with static scheduling, the latencies of all operations are fixed and known to the compiler for optimal scheduling. There are no variable latency instructions except [[Instruction_Set/load|load]]; and while the load latency may differ between specific uses, it is explicitly defined for each specific use. This makes it possible to statically schedule loads as well and hide almost all load latencies.<br />
* [[Metadata]] encapsulates normally global program state in operands, eliminating side effects, exposing far more instruction level parallelism, simplifying the instruction set, and increasing code density.<br />
* This heavily aided static scheduling furthermore enables the extensive use of techniques like [[Phasing]], [[Pipelining]], [[Speculation]], and branch [[Prediction]] over several jumps with prefetch and a very short pipeline, which minimizes the occurrence and impact of stalls for unhindered [[Execution]].<br />
* A new [[Memory]] access model, with caches working entirely on virtual addresses and [[Protection]] mechanisms uncoupled from address translation, means a program never waits for address translation unless the wait is masked by a DRAM access anyway.<br />
* The Mill is a processor family with many different member processor cores with very different hardware. It still provides a common binary program format that gets [[Specializer|specialized]] for every specific processor on install. Any operations that are specific to only some of the processors and might not be available in hardware on all members of the Mill family are emulated in software for full compatibility.<br />
* This is also true for any operations that cannot be implemented in hardware with fixed latencies, like division. They are realized by the compiler in terms of the other, real hardware operations, which often performs even better than traditional microcode implementations of such instructions.<br />
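The belt mentioned above can be sketched as a small software model. This is purely illustrative: the type and function names are invented, the belt length of 8 is an arbitrary choice (it varies per family member), and real hardware renames positions through the crossbar rather than copying values around.

```c
#include <assert.h>

#define BELT_LEN 8  /* belt length is member-specific; 8 is an arbitrary choice */

/* A belt is a fixed-length queue of read-only values: every new result is
 * "dropped" at the front, the oldest value falls off the end, and operands
 * are named by their temporal position (0 = newest drop). */
typedef struct {
    long slot[BELT_LEN];
} Belt;

static void belt_drop(Belt *b, long value) {
    for (int i = BELT_LEN - 1; i > 0; i--)   /* conceptually a shift; */
        b->slot[i] = b->slot[i - 1];         /* hardware renames instead */
    b->slot[0] = value;
}

static long belt_get(const Belt *b, int pos) {
    return b->slot[pos];                     /* position 0 is the newest value */
}
```

In this model an add consuming belt positions 0 and 2 would read `belt_get(&b, 0)` and `belt_get(&b, 2)` and then `belt_drop` its result, which in turn shifts every older value one position down.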
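As an example of how such synthesis can work, unsigned division can be built from fixed-latency primitives only (shifts, compares, subtracts). This is the classic restoring shift-and-subtract expansion, shown as a sketch; it is not the Mill's actual documented sequence, and it assumes a nonzero divisor.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned 64-bit division synthesized from fixed-latency operations.
 * Restoring long division: bring down one dividend bit per step, and
 * conditionally subtract the divisor from the running remainder.
 * Precondition: d != 0. */
static uint64_t udiv_emulated(uint64_t n, uint64_t d) {
    uint64_t q = 0, r = 0;
    for (int i = 63; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1);  /* bring down the next dividend bit */
        if (r >= d) {                   /* divisor fits: subtract and */
            r -= d;                     /* record a 1 in the quotient  */
            q |= (uint64_t)1 << i;
        }
    }
    return q;
}
```

Every step uses only operations whose latency a compiler can know statically, which is exactly what a fixed-latency expansion of a "missing" instruction needs.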
<br />
== Implementation ==<br />
<br />
The [[Belt]] is actually implemented as a big [[Crossbar]] that connects the functional unit [[Slot]]s to the sources. How exactly this [[Crossbar]] is implemented is up to the specific processor. Different implementation options work better for different scales.</div>Janhttp://millcomputing.com/wiki/GlossaryGlossary2015-01-20T16:14:38Z<p>Jan: </p>
<hr />
<div><p style="font-size: 12pt;"><br />
[[#0|0]]<br />
[[#a|a]]<br />
[[#b|b]]<br />
[[#c|c]]<br />
[[#d|d]]<br />
[[#e|e]]<br />
[[#f|f]]<br />
[[#g|g]]<br />
[[#h|h]]<br />
[[#i|i]]<br />
[[#j|j]]<br />
[[#k|k]]<br />
[[#l|l]]<br />
[[#m|m]]<br />
[[#n|n]]<br />
[[#o|o]]<br />
[[#p|p]]<br />
[[#q|q]]<br />
[[#r|r]]<br />
[[#s|s]]<br />
[[#t|t]]<br />
[[#u|u]]<br />
[[#v|v]]<br />
[[#w|w]]<br />
[[#x|x]]<br />
[[#y|y]]<br />
[[#z|z]]<br />
</p><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="0">0</div><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="a">a</div><br />
[[Abstract Assembly|Abstract Code]] - general data flow code for the Mill architecture, distribution format<br /><br />
[[Abstract Assembly|Abstract Assembly]] - general data flow code for the Mill architecture in human readable form, mainly used as compiler output<br /><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="b">b</div><br />
[[Backless Memory]] - allocating memory happens in cache initially, and often neither DRAM nor the system bus needs to be involved at all<br /><br />
[[Belt]] - provides the functionality of general purpose registers<br /><br />
[[Belt#Belt_Position_Data_Format|Belt Position/Belt Location]] - the read only data source for machine operations<br /><br />
[[Block]] - a subsection of an instruction that contains a subset of the operations or data in a defined encoding format.<br /><br />
[[Encoding#Instructions_and_Operations_and_Bundles|Bundle]] - a collection of instructions that get fetched from memory together<br /><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="c">c</div><br />
[[Concrete Assembly|Concrete Code]] - specialized executable code for a specific Mill processor<br /><br />
[[Concrete Assembly|Concrete Assembly]] - specialized executable code for a specific Mill processor in human readable form, mainly used for testing and in the debugger<br /><br />
[[Crossbar]] - the interconnecting framework that routes the data sources to the functional units<br /><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="d">d</div><br />
[[Decode]] - turning instruction stream bit patterns into requests to functional units<br /><br />
[[Domains]] - operand value types, i.e. different interpretations of bit patterns by operations<br /><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="e">e</div><br />
[[Encoding#Extended_Basic_Block|EBB]] - extended basic block, a batch or sequence of instructions with one entry point and one or more exit points<br /><br />
[[Encoding]] – the semantic bit patterns representing operations<br /><br />
[[Events|Event]] - an asynchronous diversion from normal program flow<br /><br />
[[Prediction#Exit_Table|Exit]] - a point where the instruction stream can leave the EBB<br /><br />
[[Prediction#Exit_Table|Exit Table]] - a hardware hash table containing exit point usage for EBBs, used to predict control flow<br /><br />
[[ExuCore]] - the collection of functional units and facilities serving operations from the exu instruction stream<br /><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="f">f</div><br />
<br />
[[Events#Faults|Fault]] - an interrupt normal program flow cannot recover from in a meaningful way<br /><br />
[[Execution#fwr|First Winner Rule]] - only the first successful conditional branch operation in an instruction is taken<br /><br />
[[FlowCore]] - the collection of functional units and facilities serving operations from the flow instruction stream<br /><br />
[[Functional Unit|FU, Functional Unit]] - the hardware module that provides the functionality to perform an operation<br /><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="g">g</div><br />
[[Ganging]] - combining more than two belt operands in more than one slot to perform a more complex operation<br /><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="h">h</div><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="i">i</div><br />
[[Memory#Implicit_Zero_and_Virtual_Zero|Implicit Zero]] - loads from new stack frames are implicitly zero<br /><br />
[[Events#Interrupts|Interrupt]] - an event that has predefined but configurable handling code in the form of a function<br /><br />
[[Encoding#Instructions_and_Operations_and_Bundles|Instruction]] - a collection of operations that get executed together<br /><br />
[[Encoding#Split_Instruction_Streams|Instruction Stream]] - a sequence of instructions, the Mill has 2 working in parallel<br /><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="j">j</div><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="k">k</div><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="l">l</div><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="m">m</div><br />
[[Metadata]] - tags attached to belt slots that describe the data in them<br /><br />
[[Decode#Morsels|Morsel]] - the number of bits needed to address all belt locations on a core<br /><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="n">n</div><br />
[[Metadata#None_and_NaR|None]] - undefined data in a slot that is silently ignored by operations<br /><br />
[[Metadata#None_and_NaR|NaR]] - Not a Result, undefined data that traps when used in certain operations<br /><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="o">o</div><br />
[[Encoding#Instructions_and_Operations_and_Bundles|Operation]] – the most basic semantically defined hardware unit of execution<br /><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="p">p</div><br />
[[Phasing|Phase, Phasing]] - sequenced execution of different operations within one instruction<br /><br />
[[Protection#Protection_Lookaside_Buffer|PLB]] - on chip cache for looking up protection regions for a virtual address<br /><br />
[[Pipeline]] - a logical and physical grouping of functional units sharing infrastructure for sequential step by step processing each cycle, emphasis on the physical aspect<br /><br />
[[Pipelining]] - arranging operations in the instruction stream in such a way as to maximize functional unit utilization<br /><br />
[[Protection#Portals|Portal]] - a gateway between different protection domains or turfs a thread can pass through<br /><br />
[[Prediction]] - deciding which branch to take in advance to prefetch the right code<br /><br />
[[Protection#Regions_and_Turfs|Protection Region]] - a specified contiguous memory region with attached permissions<br /><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="q">q</div><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="r">r</div><br />
[[Retire Station]] - the piece of hardware that implements loads from memory, and where those loaded values end up<br /><br />
[[Protection#Region_Table|Region Table]] - the memory backing for the PLB<br /><br />
[[Pipeline#Result_Replay|Replay]] - the way the hardware restores machine state after being interrupted<br /><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="s">s</div><br />
[[Scratchpad]] - Temporary buffer for operands from the belt<br /><br />
[[Virtual Address#Single_Address_Space|SAS]] - Single Address Space<br /><br />
[[Protection#Services|Service]] - a stateful call interface that can cross protection barriers<br /><br />
[[Slot]] - a logical and physical grouping of functional units sharing infrastructure for sequential step by step processing each cycle, emphasis on the logical aspect<br /><br />
[[Specializer]] - turns general/abstract Mill code into concrete hardware specific machine instructions<br /><br />
[[Speculation]] - computing several paths in branches in parallel only to later throw away the unneeded results<br /><br />
[[Spiller]] - securely manages temporary memory used by certain operations in hardware<br /><br />
[[Protection#Stacklets|Stacklet]] - hardware managed memory line used in fragmented stacks<br /><br />
[[Protection#Stacklet_Info_Block|Stacklet Info Block]] - preserves stacklet state for a thread across portal calls<br /><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="t">t</div><br />
[[Protection#Threads|Thread]] - a contained flow of execution with its own ID<br /><br />
[[Memory#Address_Translation|TLB]] - Translation Lookaside Buffer<br /><br />
[[Events#Traps|Trap]] - an interrupt that afterwards is intended to resume normal program flow<br /><br />
[[Protection#Regions_and_Turfs|Turf]] - memory protection domain on the Mill, a collection of regions<br /><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="u">u</div><br />
<div style="font-size: 10pt; font-weight: bold;" id="v">v</div><br />
[[Memory#Implicit_Zero_and_Virtual_Zero|Virtual Zero]] - loads from all uninitialized memory yield zero<br /><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="w">w</div><br />
[[Protection#Well_Known_Regions|WKR, Well Known Region]] - protection regions not defined in the PLB but in registers, automatically managed by hardware<br /><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="x">x</div><br />
<div style="font-size: 10pt; font-weight: bold;" id="y">y</div><br />
<div style="font-size: 10pt; font-weight: bold;" id="z">z</div></div>Janhttp://millcomputing.com/wiki/ExecutionExecution2015-01-07T00:58:44Z<p>Jan: /* Multi-Branch, or the First Winner Rule */</p>
<hr />
<div>This is an overview page briefly explaining several aspects of Mill code execution. Those topics were mainly covered in [http://www.youtube.com/watch?v=43kh4y3Mnhw this talk], but there are some more.<br />
<br />
All aspects of execution on the Mill are geared at improving data flow and control behavior in the hardware.<br />
<br />
== [[Decode|Wide Issue]] ==<br />
<br />
The prerequisite for almost all of the Mill's ways to get good performance at low energy cost is that it is a very wide issue machine. There are many more functional units than in conventional architectures, and everything revolves around feeding them with instructions and data. Each instruction can contain 30 or more operations, although only the high end Mill processors have this extreme width.<br />
<br />
How widely a specific Mill processor can issue operations is determined by the number of [[Slot]]s. They are not all created equal and each slot has its own set of operations it supports.<br />
<br />
== [[Phasing]] ==<br />
<br />
Phasing enables data flow connection, even over branch borders, within one instruction. The execution (and decode) of different kinds of operations within an instruction is tiered and chained in a phase shift over cycles.<br />
<br />
== [[Pipelining]] ==<br />
<br />
All code is statically scheduled to maximize functional unit utilization on each cycle and, as a corollary, to have no stalls or bubbles in the pipeline. This is particularly useful in loops, since the compiler has a lot to work with to unroll them and ultimately execute them in parallel to a large degree, with little or no penalty in code size.<br />
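As a toy illustration of why static scheduling pays off, here is a cycle count model of one load with a fixed, known latency. The model and its numbers are invented for illustration and do not describe any actual Mill member.

```c
#include <assert.h>

/* Naive schedule: issue the load, stall until it retires, then run the
 * independent operations, then the one consumer of the loaded value. */
static int cycles_naive(int load_latency, int independent_ops) {
    return load_latency + independent_ops + 1;
}

/* Static schedule: issue the load first and overlap the independent
 * operations with its latency, so the consumer rarely has to wait. */
static int cycles_scheduled(int load_latency, int independent_ops) {
    int overlap = independent_ops < load_latency ? independent_ops
                                                 : load_latency;
    return load_latency + independent_ops - overlap + 1;
}
```

With a 3-cycle load and 3 independent operations, the naive order costs 7 cycles while the scheduled order costs 4: the load latency is fully hidden behind useful work, which is exactly what scheduling against known latencies buys.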
<br />
== [[Speculation]] ==<br />
<br />
Speculation is another aspect of static scheduling used to increase parallel execution, i.e. <abbr title="Instruction Level Parallelism">ILP</abbr>.<br />
<br />
== [[Ganging|Gangs]] ==<br />
<br />
Gangs combine [[Slot]]s and their inputs to form more complex operations, and facilitate not only data flow between different phases of one instruction, but also data flow between operations within one phase of one instruction.<br />
<br />
== <span id="fwr">Multi-Branch, or the First Winner Rule</span> ==<br />
<br />
One instruction can contain several branches. These are all executed at the same time, checking many different conditions, though only one of them can be taken. While the execution of branches is parallel, the evaluation of branches is in a defined order: from left to right in instruction encoding order, which is issue order. The first conditional branch to be true in that order is the one taken.<br />
<br />
The first successful conditional branch operation in an instruction, and consequently also the first in an [[EBB]], is taken.<br />
<br />
This is called the ''First Winner Rule''.<br />
<br />
Another consequence is that there can only ever be one unconditional branch in an EBB, as the last operation in the last instruction.<br />
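The selection among the branches of one instruction can be modeled in a few lines of C. This is a hypothetical software sketch; in hardware all conditions are evaluated in parallel and a priority pick implements the rule.

```c
#include <assert.h>

/* A conditional branch: its evaluated predicate and a target address. */
typedef struct { int taken; int target; } Branch;

/* First Winner Rule: among the branches of one instruction, considered in
 * encoding (issue) order, the first whose condition holds is taken.
 * Returns -1 for "no branch taken", i.e. fall through. */
static int first_winner(const Branch *br, int n) {
    for (int i = 0; i < n; i++)
        if (br[i].taken)
            return br[i].target;
    return -1;
}
```

So with branches {false→A, true→B, true→C} in encoding order, B wins and C is ignored, even though C's condition also holds.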
<br />
== Media ==<br />
[http://www.youtube.com/watch?v=43kh4y3Mnhw Presentation on Execution by Ivan Godard] - [http://millcomputing.com/blog/wp-content/uploads/2014/02/execution.02.pptx Slides]<br /><br />
[http://www.youtube.com/watch?v=JS5hCjueqQ0 Presentation on Pipelining by Ivan Godard] - [http://millcomputing.com/blog/wp-content/uploads/2014/07/pipelining.06.pptx Slides]<br /><br />
[http://www.youtube.com/watch?v=DZ8HN9Cnjhc Presentation on Metadata and Speculation by Ivan Godard] - [http://millcomputing.com/blog/wp-content/uploads/2013/12/metadata.021.pptx Slides]</div>Janhttp://millcomputing.com/wiki/ExecutionExecution2015-01-07T00:48:37Z<p>Jan: /* Multi-Branch */</p>
<hr />
<div>This is an overview page briefly explaining several aspects of Mill code execution. Those topics were mainly covered in [http://www.youtube.com/watch?v=43kh4y3Mnhw this talk], but there are some more.<br />
<br />
All aspects of execution on the Mill are geared at improving data flow and control behavior in the hardware.<br />
<br />
== [[Decode|Wide Issue]] ==<br />
<br />
The prerequisite for almost all the Mill ways to get good performance at low energy cost is it being a very wide issue machine. There are many many functional units in comparison to conventional architectures, and it all revolves around feeding them with instructions and data. Each instruction can contain up to 30 and more operations, although only the high end Mill processors have this extreme width.<br />
<br />
How widely a specific Mill processor can issue operations is determined by the number of [[Slot]]s. They are not all created equal and each slot has its own set of operations it supports.<br />
<br />
== [[Phasing]] ==<br />
<br />
Phasing enables data flow connection, even over branch borders, within one instruction. The execution (and decode) of different kinds of operations within an instruction is tiered and chained in a phase shift over cycles.<br />
<br />
== [[Pipelining]] ==<br />
<br />
All code is statically scheduled to maximize functional unit utilization on each cycle and as a corollary to have no stall or bubbles in the pipeline. This is particularly useful in loops, since the compiler has a lot to work with to unroll them and ultimately execute them in parallel to a large degree with little or no penalty in code size.<br />
<br />
== [[Speculation]] ==<br />
<br />
Is another aspect of statically scheduling to increase parallel execution, <abbr title="Instruction Level Parallelism">ILP</abbr>.<br />
<br />
== [[Ganging|Gangs]] ==<br />
<br />
Gangs combine [[Slots]] and their inputs to form more complex operations, facilitating data flow not only between different phases of one instruction, but also between operations within one phase of one instruction.<br />
<br />
== <span id="fwr">Multi-Branch, or the First Winner Rule</span> ==<br />
<br />
One instruction can contain several branches. These all execute at the same time, checking many different conditions at once, although only one of them can be the correct one. While execution of the branches is parallel, their evaluation follows a defined order: left to right in instruction encoding order, which is issue order. The first conditional branch whose condition is true in that order is the one taken.<br />
<br />
This is called the ''First Winner Rule''.<br />
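A minimal software model of this rule (the names `branch_t` and `first_winner` are invented for this sketch and do not correspond to real Mill hardware or tooling): conditions may all be evaluated in parallel, but the taken target is decided by the first true condition in encoding order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    bool taken;   /* result of this branch's condition */
    int  target;  /* label/address the branch would transfer to */
} branch_t;

/* First Winner Rule: scan in slot (encoding/issue) order and take
 * the first branch whose condition is true; otherwise fall through. */
static int first_winner(const branch_t *b, size_t n, int fallthrough) {
    for (size_t i = 0; i < n; i++)
        if (b[i].taken)
            return b[i].target;
    return fallthrough;
}
```

With branches {false→10, true→20, true→30} the second branch wins, even though the third condition is also true.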
<br />
== Media ==<br />
[http://www.youtube.com/watch?v=43kh4y3Mnhw Presentation on Execution by Ivan Godard] - [http://millcomputing.com/blog/wp-content/uploads/2014/02/execution.02.pptx Slides]<br /><br />
[http://www.youtube.com/watch?v=JS5hCjueqQ0 Presentation on Pipelining by Ivan Godard] - [http://millcomputing.com/blog/wp-content/uploads/2014/07/pipelining.06.pptx Slides]<br /><br />
[http://www.youtube.com/watch?v=DZ8HN9Cnjhc Presentation on Metadata and Speculation by Ivan Godard] - [http://millcomputing.com/blog/wp-content/uploads/2013/12/metadata.021.pptx Slides]</div>Janhttp://millcomputing.com/wiki/MemoryMemory2015-01-03T09:06:25Z<p>Jan: /* Retire Stations */</p>
<hr />
<div>A lot of the power and performance gains of the Mill, but also many of its [[Protection|security]] improvements over conventional architectures, come from the various facilities of its memory management. Most subsystems have their own dedicated pages; this page is an overview.<br />
<br />
<div style="position: absolute; left: 18em;"><br />
<imagemap><br />
File:Memory-hierarchy.png|alt=Memory/Cache Hierarchy<br />
desc none<br />
</imagemap><br />
</div><br />
<br />
== Overview ==<br />
<br />
The Mill architecture is a 64-bit architecture; there are no 32-bit Mills. For this reason it is possible, and indeed prudent, to adopt a single address space (SAS) memory model. All threads and processes share the same address space. Any address points to the same location for every process. To do this securely and efficiently, the [[Protection|memory access protection]] and address translation have been split into two separate modules, whereas on conventional architectures those two tasks are conflated into one.<br />
<br />
As can be seen from this rough system chart, there is a combined L2 cache, although some low-end implementations may choose to omit it for space and energy reasons. The Mill has facilities that make an L2 cache less critical.<br /><br />
L1 caches are separate for instructions and data already. Furthermore, they are separate for [[ExuCore]] instructions and [[FlowCore]] instructions. Smaller, more specialized caches can be made faster and more efficient chiefly via shorter signal paths.<br /><br />
The D$1 data cache feeds into the retire stations with [[Instruction Set/load|load operations]] and receives the values from the [[Instruction Set/store|store operations]].<br />
<br />
== [[Protection]] ==<br />
<br />
All [[Protection]] happens by defining protection attributes on virtual address regions. This happens above the Level 1 caches and separately for instructions and data with different attributes; execute and portal for instructions, read and write for data. The <abbr title="instruction Protection Lookaside Buffer">iPLB</abbr> and <abbr title="data Protection Lookaside Buffer">dPLB</abbr> lookup tables are specialized and can be small and fast. And even better optimizations exist in the [[Protection#Well_Known_Regions|well known regions]] for the most common cases. More on this under [[Protection]].<br />
<br />
== Address Translation ==<br />
<br />
Because address translation is separated from access protection, and because all processes share one address space, the translation and <abbr title="Translation Lookaside Buffer">TLB</abbr> accesses can be moved below the caches. In fact the TLB only needs to be accessed when there is a cache miss or evict. In that case there is a 300+ cycle stall anyway, which means the TLB can be big, flat, slow, and energy efficient. The few extra cycles for a TLB lookup are largely masked by the system memory access.<br /><br />
On conventional machines the TLB sits right in the critical path between the top level cache and the functional units. This means the TLB must be small, fast, and power hungry, with a complex hierarchy. And you still spend up to 20-30% of your cycles and power budget on TLB stalls and TLB hierarchy shuffling.<br />
<br />
=== Reserved Address Space ===<br />
<br />
The virtual address space is 60 bits, because the top 4 bits of the [[Virtual Address]]es are reserved for system use like garbage collection.<br />
<br />
The top part of this 60-bit address space is reserved to facilitate fast [[Protection#Stacklets|protection domain or turf]] switches with secure stacks. More on this there.<br />
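The split between the 4 reserved bits and the 60 usable address bits can be sketched as plain bit manipulation; the helper names `va_tag` and `va_body` are assumptions of this illustration, not part of any Mill interface.

```c
#include <assert.h>
#include <stdint.h>

/* Top 4 bits of a 64-bit virtual address: reserved for system use
 * (e.g. garbage collection). Low 60 bits: the usable address space. */
static inline unsigned va_tag(uint64_t va)  { return (unsigned)(va >> 60); }
static inline uint64_t va_body(uint64_t va) { return va & ((UINT64_C(1) << 60) - 1); }
```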
<br />
== Retire Stations ==<br />
<br />
Retire stations serve the load/store <abbr title="Functional Unit">FU</abbr>s or [[Slot]]s for the [[Instruction Set/load|load operation]]. They implement the deferred load operation and conceptually are part of the [[FlowCore]]. The load operation is explicitly deferred, i.e. it has a parameter which determines exactly at which point in the future it has to make the value available and drop it on the [[Belt]]. This explicit, static, but parametrized scheduling allows hiding almost all cache latencies in memory access. A DRAM stall still has the same cost, but due to innovations in cache access, specialized mechanisms for the most common memory access patterns, and exact [[Prediction]], the number of DRAM accesses has been vastly reduced.<br /><br />
Another important aspect of this deferred load operation is that it does not yield the value as of the point the load was issued, but as of the point it is scheduled to retire. This makes the load hardware immune to [[Aliasing]], which means the compiler can stop worrying about aliasing completely and optimize aggressively.<br /><br />
This is achieved by having the active retire stations, i.e. the retire stations that have a load pending to return, monitor the store wires for stores to their address. Whenever they see a store to their address, they copy the value for later return.<br />
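The snooping behavior can be sketched as a toy model. Everything here is an assumption of the sketch: the names, the word-addressed toy memory, and the single-station setup stand in for the real store wires and station pool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One retire station holding a deferred load in flight. */
typedef struct {
    size_t   addr;     /* address the load targets (toy: array index) */
    uint64_t value;    /* value to drop on the belt at retire time */
    bool     pending;
} retire_station_t;

/* Load issues: capture the current memory value, then keep watching. */
static void rs_issue(retire_station_t *rs, const uint64_t *mem, size_t addr) {
    rs->addr = addr;
    rs->value = mem[addr];
    rs->pending = true;
}

/* Every store is visible on the store wires; a matching pending
 * station copies the new value, so the load is as-of retire time. */
static void rs_snoop_store(retire_station_t *rs, size_t addr, uint64_t val) {
    if (rs->pending && rs->addr == addr)
        rs->value = val;
}

static uint64_t rs_retire(retire_station_t *rs) {
    rs->pending = false;
    return rs->value;   /* reflects any intervening aliasing store */
}
```

This is why an aliasing store between issue and retire cannot deliver a stale value: the station updates itself instead of the compiler having to prove non-aliasing.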
<br />
== Implicit Zero and Virtual Zero ==<br />
<br />
Loads from uninitialized but accessible memory always yield zero on the Mill. There are two mechanisms to ensure that.<br />
<br />
The first is virtual zero. When a load misses the caches and also misses the <abbr title="Translation Lookaside Buffer">TLB</abbr> with no <abbr title="Page Table Entry">PTE</abbr> in it, it means there have been no stores to the address yet, and in this case the <abbr title="Memory Management Unit">MMU</abbr> returns zero for the load to bring back to the retire station. The big gain for this is that the OS doesn't have to explicitly zero out new pages, which would be a lot of bandwidth and time. And accesses to uninitialized memory only take the time of the cache and TLB lookups instead of having to do memory round trips.<br /><br />
This also has security benefits, since no one can snoop on memory garbage piles.<br />
<br />
An optimization of this for data stacks is the implicit zero. The problems of uninitialized memory and bandwidth waste, which virtual zero addresses for general memory accesses, are even more pronounced for the data stack. This is because of the high frequency of new accesses and the frequency with which recently written data is never used again. On conventional architectures this causes a staggering amount of cache thrashing and superfluous memory accesses.<br /><br />
The [[Instruction Set/Stackf|stackf]] instruction allocates a new stack frame, i.e. a number of new cache lines, but it does so just by putting markers for those cache lines into the implicit zero registers.<br /><br />
When a subsequent load happens on a newly allocated stack frame, the hardware knows it is a stack access due to the [[Protection#Well_Known_Regions|well known region]] and the stack frame [[Registers]]. The hardware doesn't even need to check the [[Protection|dPLB]] or the top level caches; it just returns zero. So while virtual zero returns zero at only the cost of the cache and TLB lookups, for the most frequent case, uninitialized stack accesses, there are no top level cache delays at all, just immediate access. And of course it also makes it impossible to snoop on old stack frames.<br /><br />
Only when a [[Instruction Set/Store|store]] happens on a new stack frame will an actual new cache line be allocated, the new value be written, and those values be marked as written by a flag in the implicit zero registers, all by hardware. Uninitialized loads are still implicitly zero in this cache line; only the actually stored value is pulled from the cache.<br />
<br />
In the majority of cases the stack frame is deallocated before it ever has been written to memory, and the cache line can just be discarded and freed up for future use.<br />
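The semantics of one implicit-zero line can be sketched as follows. This is a deliberately simplified model: the line size, the per-byte `written` array standing in for the implicit zero registers and valid bits, and all names are assumptions of the sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 64

/* Toy model of one stack frame cache line under implicit zero:
 * allocation only clears per-byte "written" flags; no data bytes
 * are zeroed and nothing is fetched from memory. */
typedef struct {
    uint8_t data[LINE_BYTES];
    uint8_t written[LINE_BYTES];
} stack_line_t;

static void line_alloc(stack_line_t *l) {       /* stackf: mark line fresh */
    memset(l->written, 0, LINE_BYTES);
}

static void line_store(stack_line_t *l, int off, uint8_t v) {
    l->data[off] = v;
    l->written[off] = 1;
}

static uint8_t line_load(const stack_line_t *l, int off) {
    return l->written[off] ? l->data[off] : 0;  /* implicit zero */
}
```

Only stored bytes ever come from the line; all neighbors read back as zero until they are themselves stored.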
<br />
== Data Cache ==<br />
<br />
All data caches and shared caches have 9 bits per byte. The additional bit is the valid bit. Whenever a new cache line is allocated, always because of a store to a new location, the new value is set for the bytes of the store and their valid bits are set. All other bytes remain invalid.<br /><br />
<br />
=== Backless Memory ===<br />
<br />
[[File:Cache-lines.png]]<br />
<br />
All stores are to the top of the data cache, and thus are neither write-back nor write-through. And, by definition, stores cannot miss the cache and don't involve memory. Cache lines can be evicted to lower levels. And when they are loaded again they are hoisted up again. When there are cache lines for the same address on multiple levels, they get merged on evict or hoist, with the upper level winning if both bytes are valid. In the above example, after the width 16 load, you would have the full merged string "StKill the wabbit\0" on the top level cache.<br />
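The byte-wise merge rule, upper level wins wherever its copy is valid, can be sketched like this; the function name and flat byte arrays are assumptions of the sketch, not the hardware datapath.

```c
#include <assert.h>
#include <stdint.h>

/* Merge a lower-level line into an upper-level line on evict/hoist:
 * upper-level bytes win where they are valid; invalid upper bytes
 * are filled from valid lower bytes. Bytes invalid in both stay so. */
static void merge_lines(uint8_t *u_data, uint8_t *u_valid,
                        const uint8_t *l_data, const uint8_t *l_valid,
                        int n) {
    for (int i = 0; i < n; i++)
        if (!u_valid[i] && l_valid[i]) {
            u_data[i]  = l_data[i];
            u_valid[i] = 1;
        }
}
```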
<br />
All this usually happens completely in cache without any physical memory involvement. It is backless memory and a vast improvement in all access times. And because all this happens in cache with the valid bit mechanisms, there are also no alignment penalties for loads and stores of data types of different widths. The load and store operations only support power of two widths for the data, but they can be on addresses of any alignment without penalty.<br />
<br />
If there are still invalid bytes left at the lowest cache level, and there is a <abbr title="Page Table Entry">PTE</abbr> for the cache line, then of course the remaining bytes are taken from memory, and the line is hoisted from memory and merged. But for that to happen a line first has to be completely evicted to physical memory, and then new writes without intermediate loads have to have created new lines in cache for the addresses in the line.<br /><br />
As a result the cases where actual access to physical memory is necessary have been vastly reduced. And often the temporary data in smaller subroutines never gets into physical memory at all so the whole lifetime of the objects has been spent in cache.<br />
<br />
=== Memory Allocation ===<br />
<br />
Only when a lowest level cache line is evicted does an actual memory page get allocated. And even this happens completely in hardware with cache line size pages and a bit map allocator from a hierarchy with larger pages. Invalid bits are still set to 0 by the MMU.<br /><br />
It is those larger pages that are managed by the OS in traps raised by the <abbr title="Memory Management Unit">MMU</abbr> when it runs low on backing memory pages.<br /><br />
A big advantage of this allocation behavior is that in the vast majority of cases you only write into memory once a larger number of writes has accumulated in cache, and they are written all at once. This is also the case with write-back caches, but in contrast to them, you don't always need to evict a cache line on a read miss: you can often just merge the memory backing into the invalid bytes of the existing cache line. Only a read from a truly cold, unpredicted and thus unprefetched line triggers an evict that causes a stall. A store into a cold line triggers an evict too, after it has cascaded down the caches, but this evict almost never causes a stall since the evicted line is most likely a cold line.<br />
<br />
== Instruction Cache ==<br />
<br />
The instruction caches are read only. And as mentioned before, they are specialized to their instruction stream. This means they are managed differently from the data caches to facilitate better instruction [[Prediction#Prefetch_and_Fetch|prefetch]] and [[Decode]] without bubbles in the pipeline. More on this on the respective pages.<br />
<br />
== Sequential Consistency ==<br />
<br />
All memory accesses happen in the order they occur in the program. This is sequential consistency. No access reordering happens, and consequently there is no need for memory fences and the like.<br /><br />
Loads and stores may be placed in the same instruction or retire in the same cycle, and as such are issued and executed in parallel. But the order they retire in is still determined by the order they appear in the instruction, and as such by the order of the [[Slot]]s they were issued into.<br /><br />
This order is not only maintained on a single core, but a defined order for all cores on a chip is maintained with the cache coherency protocol.<br />
<br />
== [[Spiller]] ==<br />
<br />
The spiller is a dedicated central hardware module that preserves internal core state for as long as it may be needed. To do so it may save internal core state to dedicated DRAM areas, i.e. the spiller space. This memory is not accessible by any other mechanism, and no other hardware mechanism can interfere with the spiller and meddle with its internal state. Spiller memory accesses don't need to go through the [[Protection]] layer, since no one can make the spiller do anything insecure. It uses the L2 cache as a buffer, and everything still goes through address translation, because special system tools like [[Debugger]]s occasionally need to read spiller state.<br /><br />
<br />
== [[Streamer]] ==<br />
<br />
== Rationale ==<br />
<br />
Memory latency is the main bottleneck that dictates how modern processors are designed. It is the reason why expensive out-of-order hardware is so prevalent on virtually all general purpose processors since the 1960s. If anything, the steadily increasing speed gap between memory and processor cores makes the latency felt even more today.<br />
<br />
So hiding the latency of memory accesses and reducing the number of memory accesses are the primary goals in any processor architecture. Both are mainly achieved with the use of caches. Sophisticated [[Prediction]] and prefetch fill the caches as far in advance as possible. [[Instruction Set/Load|Load]] and [[Instruction Set/Store|store]] deferring as well as [[Pipelining]] and [[Speculation]] hide the cache latencies, and increase the level of <abbr title="Instruction Level Parallelism">ILP</abbr> for loads and stores by making memory accesses less dependent on each other. And the cache management protocols determine the number of actual memory accesses.<br />
<br />
All three aspects have new solutions on the Mill. Generally those solutions are not really more powerful or faster than those of conventional out-of-order architectures; they are just vastly cheaper. Truly random and unpredictable workloads still can't be helped, though.<br />
<br />
== See Also ==<br />
<br />
[[Spiller]], [[Virtual Addresses]]<br />
<br />
== Media ==<br />
[http://www.youtube.com/watch?v=bjRDaaGlER8 Presentation on the Memory Hierarchy by Ivan Godard] - [http://millcomputing.com/blog/wp-content/uploads/2013/12/2013-10-16_mill_cpu_hierarchy_08.pptx Slides]</div>Janhttp://millcomputing.com/wiki/CommunityCommunity2015-01-02T21:57:49Z<p>Jan: Redirected page to Mill Computing Wiki:About</p>
<hr />
<div>#REDIRECT [[Mill Computing Wiki:About]]</div>Janhttp://millcomputing.com/wiki/Main_PageMain Page2015-01-02T21:56:56Z<p>Jan: </p>
<hr />
<div>The Mill is a new general purpose processor architecture.<br />
It is intended to run general purpose code at <abbr title="Digital Signal Processor">DSP</abbr> speed and power requirements.<br />
A number of new innovations, as well as established technologies used in a new context, are employed to achieve this goal.<br />
<br />
<table style="width: 50%; padding: 2% 5%; font-size: 12pt"><br />
<tr><td>'''[[Architecture]]'''</td><td>'''[[Infrastructure]]'''</td><td>'''[[Community]]'''</td></tr><br />
<tr><td>[[Belt]]</td><td>[[Synthesis]]</td></tr><br />
<tr><td>[[Encoding]]</td><td>[[Compiler]]</td></tr><br />
<tr><td>[[Memory]]</td><td>[[Specializer]]</td></tr><br />
<tr><td>[[Metadata]]</td><td>[[Debugger]]</td></tr><br />
<tr><td>[[Protection]]</td><td>[[Simulator]]</td></tr><br />
<tr><td>[[Prediction]]</td><td>[[Specification]]</td></tr><br />
<tr><td>[[Execution]]</td><td>[[conAsm]]</td></tr><br />
<tr><td>[[Instruction Set]]</td><td>[[genAsm]]</td></tr><br />
</table></div>Janhttp://millcomputing.com/wiki/Mill_Computing_Wiki:AboutMill Computing Wiki:About2015-01-02T21:55:28Z<p>Jan: </p>
<hr />
<div>This public wiki serves three primary purposes:<br />
<br />
# It is a reference for the Mill architecture, for anything that can be made available publicly.<br />A lot of reference pages on this wiki are generated automatically and are regularly overwritten, in particular, all pages and subpages under [[Instruction Set]] and [[Cores]] are generated, but also [[Registers]], [[Functional Units]] and [[Instruction Set by Category]]. They are not editable by normal users.<br />
# It is also intended to foster discussion in a more directed context than possible on a forum.<br />All forum users can create and edit wiki pages and start discussions, even for generated pages.<br />
# And last but not least, it exists to facilitate community projects making use of the Mill architecture.<br />Community projects must be kept separate under their own subpage namespaces under the [[Community]] page.</div>Janhttp://millcomputing.com/wiki/SpeculationSpeculation2014-12-31T02:49:14Z<p>Jan: /* Speculable and Realizing Operations */</p>
<hr />
<div>Speculation preemptively does computing work you are not really sure you will need, so that by the time you are sure it is already done.<br />
<br />
== [[Metadata#None_and_NaR|None and NaR]] ==<br />
<br />
The problem is that the work you try to do prematurely can often clobber the actual work you are doing, and can get in the way wherever there is shared state. So the more shared state you avoid, the more you can do in parallel without getting in each other's way.<br />
<br />
The Mill already does a lot in this regard by having <abbr title="Static Single Assignment">SSA</abbr> semantics on the [[Belt]]. This works great for proper data values. Conventional architectures tend to have error and condition codes as global shared state though. [[Metadata]] to the rescue. In particular None and <abbr title="Not a Result">NaR</abbr> and the floating point status flags.<br />
<br />
== Speculable and Realizing Operations ==<br />
<br />
By far most of the operations in the Mill instruction set can be speculated. What this means is: if an operand of the operation is None or NaR, all the operation does is make the result None or NaR, or, for floating point operations, combine the [[Metadata#IEEE_754_Floating_Point_Flags|status flags]] into the result. This is even true for the [[Instruction_Set/load|load]] operation.<br />
<br />
Only when values are put into shared system state does it become relevant whether they are valid or not. This is when those values become realized. There are only comparatively few operations that realize values, in particular [[Instruction_Set/load|load]], [[Instruction_Set/store|store]] and branches. They all are in the [[Phasing#Writer_Phase|writer phase]], or rather in the phases following the [[Phasing#Compute_Phase|compute phase]].<br />
<br />
When a realizing operation encounters a None, it does nothing.<br /><br />
A load from a None address produces a None.<br /><br />
A store with a None value or address doesn't write anything.<br /><br />
A branch or call to an address that is a None doesn't happen.<br />
<br />
When a realizing operation encounters a NaR, it faults, and eventually a fault handler matching the NaR is called.<br /><br />
The exception is a load: from a NaR address it just produces the NaR, like a speculable operation. The main reason load isn't speculable, despite loads being potentially speculative, is that it can cause stalls and the like, which need to be accounted for.<br /><br />
A store with a NaR in any value or address operand doesn't write anything, but raises the appropriate [[Fault]].<br /><br />
Same for branches or calls to a NaR address.<br />
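The rules above can be sketched as a toy model (purely illustrative; the names and the Python `RuntimeError` stand in for Mill metadata tags and hardware faults, nothing here reflects the actual encoding):

```python
# None/NaR propagation: speculable ops pass the tag through;
# realizing ops (here: store) skip None silently and fault on NaR.
NONE, NAR = "None", "NaR"

def add(a, b):
    # A speculable operation: any None/NaR operand poisons the result.
    for v in (a, b):
        if v in (NONE, NAR):
            return v
    return a + b

memory = {}

def store(addr, value):
    # A realizing operation: None does nothing, NaR faults.
    if NONE in (addr, value):
        return  # store silently skipped
    if NAR in (addr, value):
        raise RuntimeError("fault: NaR realized")  # stand-in for a hardware fault
    memory[addr] = value

store(0x10, add(2, 3))     # normal path: writes 5
store(0x14, add(2, NONE))  # None operand poisons the add; store is skipped
print(memory)              # {16: 5}
```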
<br />
== [[Instruction_Set/pick|Pick]] ==<br />
<br />
The pick operation is a special beast. It has the semantics of the C ?: operator, zero latency and three operands. All this is only possible, and cheaply possible, because it doesn't actually need a functional unit: it is implemented in the renaming of [[Belt]] locations at the cycle boundary. And it is speculable, in contrast to true branches. With these attributes it can replace a lot of conditional branches, and tends to be the operation that picks which of all the speculatively computed values is passed on to be realized.<br />
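A minimal sketch of pick's value semantics (again a toy model: the zero-latency renaming implementation cannot be represented here, only the selection and None/NaR behavior):

```python
# pick: C ?: semantics, but speculable -- a None/NaR condition
# propagates as data instead of transferring control.
NONE, NAR = "None", "NaR"

def pick(cond, if_true, if_false):
    if cond in (NONE, NAR):
        return cond  # speculable: a bad condition poisons the result
    return if_true if cond else if_false

# Both arms were already computed speculatively; pick selects one.
print(pick(True, 10, 20))   # 10
print(pick(False, 10, 20))  # 20
print(pick(NAR, 10, 20))    # NaR
```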
<br />
== Rationale ==<br />
<br />
Speculation is one of the few areas where the Mill favours higher energy consumption, because the performance gains are so great. Unneeded computation still costs energy, but since the Mill architecture is very wide issue and has a lot of functional units, it can exploit a lot of instruction level parallelism. It saves time. And on modern fab processes with high-leakage transistors, an idle circuit doesn't use much less energy than a busy one. So computing several branches in parallel, when you have the width, really saves energy compared to doing it in sequence with lots of idle units.<br />
<br />
In general purpose code the problem usually is to find that much <abbr title="Instruction Level Parallelism">ILP</abbr>, because there are so many branches. Most branches only exist to avoid unwanted side effects under certain circumstances.<br />
<br />
The Mill has devised a few ways to avoid the unwanted side effects, which means far fewer of the branches in a program are hard barriers to <abbr title="Instruction Level Parallelism">ILP</abbr>. [[Phasing]] is one of the ways. Software [[Pipelining]] of loops also makes extensive use of the <abbr title="Not a Result">NaR</abbr> and the None [[Metadata]] tags for this purpose.<br />
<br />
Speculation increases <abbr title="Instruction Level Parallelism">ILP</abbr> across branch boundaries independently of loops. <abbr title="Not a Result">NaR</abbr> and None and [[Instruction_Set/pick|pick]] enable if-conversion on a massive scale on the Mill, removing branches altogether from the code by utilizing (meta)data flow instead of control flow. And even without [[Instruction_Set/pick|pick]], but with [[Execution#Multi-Branch|parallel branches]] or with [[Condition Code|condition codes]] the ILP is greatly increased here, too.<br />
<br />
=== Speculation vs. [[Prediction]] ===<br />
<br />
Some might ask what the difference is between speculation and [[Prediction|prediction]]. The two concepts are only superficially connected: while both try to avoid stalls due to branches in the execution pipelines, they go about it very differently.<br />
<br />
Speculation eliminates branches by going down all paths of execution and choosing the right result after the fact. This generally only works for relatively short and small differences in the paths and somewhat similar code, but can cover many paths all at once on wide issue machines and avoids all latencies.<br />
<br />
Prediction chooses one path, and tries to become as good as possible at choosing the correct one. There is no excess computation work done here. But a wrong guess means a penalty of idleness for several cycles, usually 5 cycles on the Mill. With a really wrong guess this can become a full memory latency penalty in rare cases. This works for long paths and very different code down the different paths too.<br />
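The trade-off can be put in back-of-the-envelope numbers. The 5-cycle penalty is the figure quoted above; the mispredict rate is an assumed example value:

```python
def expected_branch_cost(mispredict_rate, penalty=5):
    """Average cycles lost per branch under prediction."""
    return mispredict_rate * penalty

# A branch mispredicted 10% of the time loses 0.5 cycles on average;
# if-converted (speculated) code whose extra arm fits into spare issue
# slots loses none, at the price of the extra computation's energy.
print(expected_branch_cost(0.10))
```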
<br />
== Media ==<br />
[http://www.youtube.com/watch?v=DZ8HN9Cnjhc Presentation on Metadata and Speculation by Ivan Godard] - [http://millcomputing.com/blog/wp-content/uploads/2013/12/metadata.021.pptx Slides]</div>Janhttp://millcomputing.com/wiki/FaultFault2014-12-31T00:02:10Z<p>Jan: Redirected page to Events#Fault</p>
<hr />
<div>#REDIRECT [[Events#Fault]]</div>Janhttp://millcomputing.com/wiki/OperandsOperands2014-12-30T23:37:01Z<p>Jan: /* The Operand Matrix */</p>
<hr />
<div>Pretty much all operations take values from the belt as parameters. Those are the operands. There are also other kinds of arguments, like [[Immediates|immediate values]] or special [[Registers]], but the belt operands are the data values you care about and operate on in your programs. The belt can hold any supported bit pattern along with the attached [[Metadata]] like byte width and scalarity. But not all value types make sense for all operations; a lot of operations fail when they encounter inappropriate data. And the type of the result values is very often completely determined by the types of the input operands.<br />
<br />
For this reason there is a matrix attached to each operation that defines which kinds of operands are accepted and which kinds of results they produce. The [[Compiler]] and [[Specializer]] are aware of those restrictions and can act accordingly. And the hardware generates the appropriate [[Metadata#None_and_NaR|NaR]] when it encounters inappropriate operands for some reason.<br />
<br />
== The Operand Matrix ==<br />
<br />
=== Operand Value Types ===<br />
<br />
The semantic interpretation of the bit patterns in a belt value is completely up to the operation itself; it is not part of the belt value type maintained in the Metadata. The only things that matter are the byte width and the scalarity. Belt values can be 1, 2, 4 or 8 bytes wide on all [[Cores]], and also 16 on the higher end hardware. They can be scalar or vector values. The vector size is fixed for a specific core: if a value is a vector, it always has the same number of elements of the given byte width. This means there are 10 possible operand types, the 5 scalar byte widths and their 5 vector variants. Eight of those are available on all cores.<br />
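Enumerating the types as described gives the counts directly:

```python
# Scalar widths of 1, 2, 4 and 8 bytes on every core, plus 16 on
# higher-end members; each width in a scalar and a vector variant.
widths = [1, 2, 4, 8, 16]
operand_types = [(kind, w) for kind in ("scalar", "vector") for w in widths]

# Types available on every core exclude the 16-byte widths.
universal = [t for t in operand_types if t[1] != 16]

print(len(operand_types), len(universal))  # 10 8
```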
<br />
=== The Matrix ===<br />
<br />
The common case for all operations is two operands and one result. This means the matrix of possible operands and results for each operation is a 10×10 matrix over the operand types, with the result types at the appropriate coordinates.<br /><br />
This even extends to operations with only one operand, where that operand is used for both dimensions and the results sit on the diagonal, and to ganged operations with more operands, which either don't care about the extra values or for which the first two operands are enough.<br />
<br />
=== The Abbreviated Matrix===<br />
<br />
This full 10×10 matrix is of course a bit unwieldy for documentation, so it is usually referred to by a symbolic name, which is usually the name of the operation the matrix was initially created for. Often this name doesn't really explain anything, though, so for purely documentary reasons there is an abbreviated format that shows which kinds of operands are accepted and which kinds of results they produce.<br /><br />
For the simpler cases this should be pretty self-explanatory; for the rare more complicated cases the exact semantics are explained on the respective operation page.<br />
<br />
: - separates the operands from the result<br />
[] - everything in square brackets can be a vector or a scalar value<br />
dfpx - lower case operands are the widths possible of the respective [[Domains|domain]]<br />
(x is any integer)<br />
what is important is that all widths of the domain are the same within the operation<br />
DFPX - upper case must be vectors of the domain<br />
&#189; - half width of the operands, until lowest possible for domain<br />
2 - double width of the operands, up to maximum possible on core<br />
i - bit index/count, can be any width, and width is independent of other operand<br />
n - vector element index, can be any width too and is independent of other operand</div>Janhttp://millcomputing.com/wiki/CoreCore2014-12-30T00:56:03Z<p>Jan: Redirected page to Cores</p>
<hr />
<div>#REDIRECT [[Cores]]</div>Janhttp://millcomputing.com/wiki/OverflowOverflow2014-12-29T23:50:43Z<p>Jan: </p>
<hr />
<div>When integer operations produce values that can't be represented in the bit width of the result, there are different strategies for dealing with it. Which one is used depends on the chosen operation.<br />
<br />
In all cases the [[Condition_Code|condition codes]] are generated and can be queried with the dedicated ganged operations. What the overflow behavior determines is the primary result that is produced.<br />
<br />
# modulo - normal silent overflow or underflow, also called wraparound<br />
# saturating - the highest or lowest representable value in the given width is returned as the result<br />
# excepting - the appropriate [[NaR]] is created as the result<br />
# widening - the scalar byte width of the result is doubled, whether it is needed or not, and the full and exact result is the return value</div>Janhttp://millcomputing.com/wiki/DomainsDomains2014-12-29T17:05:07Z<p>Jan: /* f - Floating Point */</p>
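The four behaviors can be sketched in a small software model. This is not Mill code; an unsigned 8-bit add is assumed for illustration, and a plain string stands in for a NaR:

```python
NAR = "NaR"  # stand-in for the Mill's Not-a-Result metadata

def add_modulo(a, b, bits=8):
    # silent wraparound
    return (a + b) % (1 << bits)

def add_saturating(a, b, bits=8):
    # clamp to the highest representable value (unsigned case)
    return min(a + b, (1 << bits) - 1)

def add_excepting(a, b, bits=8):
    # an out-of-range result becomes a NaR instead of trapping immediately
    s = a + b
    return s if s < (1 << bits) else NAR

def add_widening(a, b, bits=8):
    # the result width is doubled, so the full exact sum always fits
    return (a + b) % (1 << (2 * bits))
```

For example, adding 200 and 100 in 8 bits gives 44 (modulo), 255 (saturating), a NaR (excepting), or the exact 300 (widening).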
<hr />
<div>Domains are the different kinds of scalar values the various Mill operations can work with. A value on the belt is just bits: the [[Metadata]] determines how many bits there are and whether they are arranged in a vector, but how exactly those bits are interpreted is up to the individual operations. The operations therefore tend to fall into different categories, called domains. Some operations don't care at all what kind of values they are dealing with, they just move them around; but anything that produces new bit patterns based on old ones does have a domain it interprets the bit patterns in.<br /><br />
Most of the domains are indicated in the opcode mnemonic of the operation via a suffix.<br />
<br />
{| class="wikitable" style="width:500px;text-align:center;position: absolute; left: 20em; margin-top: 7em"<br />
|+ Byte Widths for Different Domains<br />
! style="width:2.5em" | op<br />
! style="width:2.5em" | u<br />
! style="width:2.5em" | s<br />
! style="width:2.5em" | p<br />
! style="width:2.5em" | f<br />
! style="width:2.5em" | d<br />
! style="width:2.5em" | uf<br />
! style="width:2.5em" | sf<br />
! style="color:#555;width:2.5em" | i<br />
! style="color:#555;width:2.5em" | n<br />
! style="color:#555;width:2.5em" | sel<br />
! style="color:#555;width:2.5em" | pred<br />
|-<br />
| 1-16<br />
| 1-16<br />
| 1-16<br />
| 8<br />
| 4-16<br />
| 8-16<br />
| 1-16<br />
| 1-16<br />
| 1<br />
| 1<br />
| 1<br />
| 1<br />
|}<br />
<br />
= Domains =<br />
<br />
== <span id="op">Logical</span> ==<br />
<br />
The logical domain is the default domain, in the sense that it is not indicated by any operation suffix. It treats bits just as bits, and a whole slew of boolean and logical operations work on pure bits. Quite a few signed and unsigned integer operations also alias to the logical domain when they have modulo/wraparound [[Overflow|overflow]] behavior. In the documentation pages for the operations, operands that are interpreted as logical bits are presented as the op type. This is also the case for operations that don't really change the bit patterns but just move them around in some way.<br />
<br />
== <span id="u">u - Unsigned Integer</span> ==<br />
<br />
Most of the unsigned integer arithmetic with modulo [[Overflow|overflow]] behavior aliases to the logical domain hardware.<br />
<br />
== <span id="s">s - Signed Integer</span> ==<br />
<br />
Most of the signed integer arithmetic with modulo [[Overflow|overflow]] behavior aliases to the logical domain hardware, too.<br />
<br />
== <span id="p">p - Pointers</span> ==<br />
<br />
Pointers are always 64-bit, and the upper 4 bits are ignored in pointer arithmetic because they are reserved for special purposes, like triggering traps for garbage collection. More under [[Virtual Address]]es.<br />
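An illustrative model of this, not the actual hardware behavior: the exact interaction of address carries with the reserved bits is an assumption here, chosen only to show the reserved bits passing through pointer arithmetic untouched.

```python
ADDR_BITS = 60
ADDR_MASK = (1 << ADDR_BITS) - 1            # the low 60 address bits
FLAG_MASK = ((1 << 64) - 1) ^ ADDR_MASK     # the 4 reserved upper bits

def ptr_add(p, offset):
    # arithmetic happens within the address bits only;
    # the reserved upper bits of the pointer pass through unchanged
    addr = (p + offset) & ADDR_MASK
    return (p & FLAG_MASK) | addr
```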
<br />
== <span id="f">f - Floating Point</span> ==<br />
<br />
These are [http://en.wikipedia.org/wiki/IEEE_floating_point IEEE 754] binary floating point values, available in 32, 64 and 128 bits. The [[Instruction_Set/narrowf|narrowf]] operation can produce 16-bit floats, and the [[Instruction_Set/widenf|widenf]] operation can take them as an operand to produce 32-bit floats, but this is a pure interchange format, only loaded and stored; no arithmetic is available on it.<br />
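A rough model of this narrow/widen round trip, using Python's IEEE 754 half-precision pack code <code>'e'</code> as a stand-in for the 16-bit interchange format:

```python
import struct

def narrowf(x):
    # model of narrowf: convert a float to its 2-byte
    # half-precision interchange representation
    return struct.pack('<e', x)

def widenf(h):
    # model of widenf: expand the 2-byte half back to a full float
    # (always exact, since every half value fits in a wider float)
    return struct.unpack('<e', h)[0]
```

Values that are exactly representable in half precision, like 1.5, survive the round trip unchanged; others are rounded on the narrowing step.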
<br />
== <span id="d">d - Decimal Floating Point</span> ==<br />
<br />
[http://en.wikipedia.org/wiki/IEEE_floating_point IEEE 754] decimal floating point values, available in 64 and 128 bits. On most [[Cores]] this is emulated. The [[Instruction_Set/narrowd|narrowd]] operation can produce 32-bit decimal floats, and the [[Instruction_Set/widenf|widenf]] operation can take them as an operand to produce 64-bit decimal floats, but this is a pure interchange format, only loaded and stored; no arithmetic is available on it.<br />
<br />
== <span id="sf">sf - Signed Fixed Point</span> ==<br />
<br />
Generally fixed point arithmetic is the same as integer arithmetic, except for shifts, multiplication, widening, and narrowing.<br />
<br />
== <span id="uf">uf - Unsigned Fixed Point</span> ==<br />
<br />
The same goes for unsigned fixed point arithmetic.<br />
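The multiplication difference can be shown with a small sketch. The Q8 format here (8 fraction bits) is an arbitrary choice for illustration, not something the Mill prescribes:

```python
FRAC = 8  # assumed Q8 fixed-point format: 8 fraction bits

def fixmul(a, b):
    # the raw integer product carries 2*FRAC fraction bits, so unlike
    # a plain integer multiply it must be shifted back down by FRAC
    return (a * b) >> FRAC
```

So 1.0 × 1.0 (both encoded as 256) yields 256 again, and 0.5 × 0.5 yields the encoding of 0.25, which a plain integer multiply would not.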
<br />
<br /><br /><br />
= Pseudo Domains =<br />
<br />
There are also some operands to some operations that get interpreted in a special way without applying to the whole operation. These are not truly full-fledged domains and aren't part of the mnemonics, but they are listed here anyway.<br />
Generally they take the width of the other operands in the operation, but only the lowest few bits, or even just the lowest bit, matter.<br />
<br />
== <span id="i" style="color:#555">i - Vector Index</span> ==<br />
<br />
Some operations build and take apart vector operands and index the vector elements. The index values can actually be any width, but if the index exceeds the vector element count, [[Metadata#None_and_NaR|NaRs]] happen.<br />
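A sketch of that behavior (a string stands in for a NaR; the function name is illustrative, not a Mill mnemonic):

```python
def extract(vec, i):
    # an out-of-range element index yields a NaR rather than a trap
    return vec[i] if 0 <= i < len(vec) else "NaR"
```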
<br />
== <span id="n" style="color:#555">n - Bit Count</span> ==<br />
<br />
Shifts, bit tests and similar operations move or index bits within a value. If the n value is bigger than the width of the other operand, [[Metadata#None_and_NaR|NaRs]] happen here as well.<br />
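The same idea for a left shift, again as an illustrative model with a string standing in for a NaR:

```python
def shiftl(value, n, bits=32):
    # a shift count bigger than the operand width yields a NaR
    if n > bits:
        return "NaR"
    return (value << n) & ((1 << bits) - 1)
```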
<br />
== <span id="sel" style="color:#555">sel - Selector</span> ==<br />
<br />
The [[Instruction_Set/pick|pick]] and [[Instruction_Set/recur|recur]] operations select values based on a select predicate where only the lowest bit is evaluated.<br />
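The selector semantics can be sketched as follows; which operand a set bit selects is an assumption made here for illustration:

```python
def pick(sel, a, b):
    # only the lowest bit of the selector is evaluated;
    # a set bit is assumed to choose the first value
    return a if sel & 1 else b
```

Note that a selector of 3 behaves exactly like a selector of 1, since the higher bits are ignored.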
<br />
== <span id="sel" style="color:#555">pred - Predicate</span> ==<br />
<br />
The conditional branches also have to make choices; their predicates, however, really require 0 or 1 values and don't just evaluate the lowest bit.</div>Janhttp://millcomputing.com/wiki/SpeculableSpeculable2014-12-21T06:25:19Z<p>Jan: Redirected page to Speculation#Speculable and Realizing Operations</p>
<hr />
<div>#REDIRECT [[Speculation#Speculable_and_Realizing_Operations]]</div>Janhttp://millcomputing.com/wiki/SlotSlot2014-12-20T19:13:06Z<p>Jan: </p>
<hr />
<div>A slot is actually two things:<br />
<br />
* for one it is a defined subsection in an instruction block that encodes an operation<br />
* and it is also the piece of [[Decode]] hardware that decodes and dispatches the operations in this subsection of the instruction block (and only this subsection)<br />
<br />
Each slot has a limited set of operations that can occur in it, and only those operations can appear. For a specific [[Cores|core]], every slot can have its own unique set of available operations, but especially on the big cores there are several slots with the same population.<br />
<br />
The operation population of a slot depends on the [[Functional Units]] added to the [[Pipeline]] dedicated to this slot in the [[Specification]]. Often this whole chain of decoder and pipeline with its functional units is called a slot.<br /><br />
Each slot can issue one new operation each cycle, all of which are executed in parallel and independently of all the other operations in the parallel slots. There are a few exceptions, like unconditional branches: only one can be executed in each instruction, though it can appear in any of the branching slots of the instruction. This is an optimization though.</div>Janhttp://millcomputing.com/wiki/GangGang2014-12-16T14:39:32Z<p>Jan: Redirected page to Ganging</p>
<hr />
<div>#REDIRECT [[Ganging]]</div>Janhttp://millcomputing.com/wiki/GlossaryGlossary2014-12-16T12:02:42Z<p>Jan: </p>
<hr />
<div><p style="font-size: 12pt;"><br />
[[#0|0]]<br />
[[#a|a]]<br />
[[#b|b]]<br />
[[#c|c]]<br />
[[#d|d]]<br />
[[#e|e]]<br />
[[#f|f]]<br />
[[#g|g]]<br />
[[#h|h]]<br />
[[#i|i]]<br />
[[#j|j]]<br />
[[#k|k]]<br />
[[#l|l]]<br />
[[#m|m]]<br />
[[#n|n]]<br />
[[#o|o]]<br />
[[#p|p]]<br />
[[#q|q]]<br />
[[#r|r]]<br />
[[#s|s]]<br />
[[#t|t]]<br />
[[#u|u]]<br />
[[#v|v]]<br />
[[#w|w]]<br />
[[#x|x]]<br />
[[#y|y]]<br />
[[#z|z]]<br />
</p><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="0">0</div><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="a">a</div><br />
[[Abstract Assembly|Abstract Code]] - general data flow code for the Mill architecture, distribution format<br /><br />
[[Abstract Assembly|Abstract Assembly]] - general data flow code for the Mill architecture in human readable form, mainly used as compiler output<br /><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="b">b</div><br />
[[Belt]] - provides the functionality of general purpose registers<br /><br />
[[Belt#Belt_Position_Data_Format|Belt Position/Belt Location]] - the read only data source for machine operations<br /><br />
[[Block]] - a subsection of an instruction that contains a subset of the operations or data in a defined encoding format.<br /><br />
[[Encoding#Instructions_and_Operations_and_Bundles|Bundle]] - a collection of instructions that get fetched from memory together<br /><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="c">c</div><br />
[[Concrete Assembly|Concrete Code]] - specialized executable code for a specific Mill processor<br /><br />
[[Concrete Assembly|Concrete Assembly]] - specialized executable code for a specific Mill processor in human readable form, mainly used for testing and in the debugger<br /><br />
[[Crossbar]] - the interconnecting framework that routes the data sources to the functional units<br /><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="d">d</div><br />
[[Decode]] - turning instruction stream bit patterns into requests to functional units<br /><br />
[[Domains]] - operand value types, i.e. different interpretations of bit patterns by operations<br /><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="e">e</div><br />
[[Encoding#Extended_Basic_Block|EBB]] - extended basic block, a batch or sequence of instructions with one entry point and one or more exit points<br /><br />
[[Encoding]] – the semantic bit patterns representing operations<br /><br />
[[Events|Event]] - an asynchronous diversion from normal program flow<br /><br />
[[Prediction#Exit_Table|Exit]] - a point where the instruction stream can leave the EBB<br /><br />
[[Prediction#Exit_Table|Exit Table]] - a hardware hash table containing exit point usage for EBBs, used to predict control flow<br /><br />
[[ExuCore]] - the collection of functional units and facilities serving operations from the exu instruction stream<br /><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="f">f</div><br />
<br />
[[Events#Faults|Fault]] - an interrupt normal program flow cannot recover from in a meaningful way<br /><br />
[[FlowCore]] - the collection of functional units and facilities serving operations from the flow instruction stream<br /><br />
[[Functional Unit|FU, Functional Unit]] - the hardware module that provides the functionality to perform an operation<br /><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="g">g</div><br />
[[Ganging]] - combining more than two belt operands in more than one slot to perform a more complex operation<br /><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="h">h</div><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="i">i</div><br />
[[Memory#Implicit_Zero_and_Virtual_Zero|Implicit Zero]] - loads from new stack frames are implicitly zero<br /><br />
[[Events#Interrupts|Interrupt]] - an event that has predefined but configurable handling code in the form of a function<br /><br />
[[Encoding#Instructions_and_Operations_and_Bundles|Instruction]] - a collection of operations that get executed together<br /><br />
[[Encoding#Split_Instruction_Streams|Instruction Stream]] - a sequence of instructions, the Mill has 2 working in parallel<br /><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="j">j</div><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="k">k</div><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="l">l</div><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="m">m</div><br />
[[Metadata]] - tags attached to belt slots that describe the data in it<br /><br />
[[Decode#Morsels|Morsel]] - the number of bits needed to address all belt locations on a core<br /><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="n">n</div><br />
[[Metadata#None_and_NaR|None]] - undefined data in a slot that is silently ignored by operations<br /><br />
[[Metadata#None_and_NaR|NaR]] - Not a Result, undefined data that traps when used in certain operations<br /><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="o">o</div><br />
[[Encoding#Instructions_and_Operations_and_Bundles|Operation]] – the most basic semantically defined hardware unit of execution<br /><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="p">p</div><br />
[[Phasing|Phase, Phasing]] - sequenced execution of different operations within one instruction<br /><br />
[[Protection#Protection_Lookaside_Buffer|PLB]] - on chip cache for looking up protection regions for a virtual address<br /><br />
[[Pipeline]] - a logical and physical grouping of functional units sharing infrastructure for sequential step by step processing each cycle, emphasis on the physical aspect<br /><br />
[[Pipelining]] - arranging operations in the instruction stream in such a way as to maximize functional unit utilization<br /><br />
[[Protection#Portals|Portal]] - a gateway between different protection domains or turfs a thread can pass through<br /><br />
[[Prediction]] - deciding which branch to take in advance to prefetch the right code<br /><br />
[[Protection#Regions_and_Turfs|Protection Region]] - specified continuous memory region with attached permissions<br /><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="q">q</div><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="r">r</div><br />
[[Retire Station]] - the piece of hardware that implements loads from memory, and where those loaded values end up<br /><br />
[[Protection#Region_Table|Region Table]] - the memory backing for the PLB<br /><br />
[[Pipeline#Result_Replay|Replay]] - the way the hardware restores machine state after being interrupted<br /><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="s">s</div><br />
[[Scratchpad]] - Temporary buffer for operands from the belt<br /><br />
[[Virtual Address#Single_Address_Space|SAS]] - Single Address Space<br /><br />
[[Protection#Services|Service]] - a stateful call interface that can cross protection barriers<br /><br />
[[Slot]] - a logical and physical grouping of functional units sharing infrastructure for sequential step by step processing each cycle, emphasis on the logical aspect<br /><br />
[[Specializer]] - turns general/abstract Mill code into concrete hardware specific machine instructions<br /><br />
[[Speculation]] - computing several paths in branches in parallel only to later throw away the unneeded results<br /><br />
[[Spiller]] - securely manages temporary memory used by certain operations in hardware<br /><br />
[[Protection#Stacklets|Stacklet]] - hardware managed memory line used in fragmented stacks<br /><br />
[[Protection#Stacklet_Info_Block|Stacklet Info Block]] - preserves stacklet state for a thread across portal calls<br /><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="t">t</div><br />
[[Protection#Threads|Thread]] - a contained flow of execution with its own ID<br /><br />
[[Memory#Address_Translation|TLB]] - Translation Lookaside Buffer<br /><br />
[[Events#Traps|Trap]] - an interrupt that afterwards is intended to resume normal program flow<br /><br />
[[Protection#Regions_and_Turfs|Turf]] - memory protection domain on the Mill, a collection of regions<br /><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="u">u</div><br />
<div style="font-size: 10pt; font-weight: bold;" id="v">v</div><br />
[[Memory#Implicit_Zero_and_Virtual_Zero|Virtual Zero]] - loads from all uninitialized memory yield zero<br /><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="w">w</div><br />
[[Protection#Well_Known_Regions|WKR, Well Known Region]] - protection regions not defined in the PLB but in registers, automatically managed by hardware<br /><br />
<br />
<div style="font-size: 10pt; font-weight: bold;" id="x">x</div><br />
<div style="font-size: 10pt; font-weight: bold;" id="y">y</div><br />
<div style="font-size: 10pt; font-weight: bold;" id="z">z</div></div>Janhttp://millcomputing.com/wiki/BlockBlock2014-12-16T11:58:38Z<p>Jan: Redirected page to Encoding#General Instruction Format</p>
<hr />
<div>#REDIRECT [[Encoding#General_Instruction_Format]]</div>Janhttp://millcomputing.com/wiki/Condition_CodeCondition Code2014-12-16T08:14:18Z<p>Jan: </p>
<hr />
<div>A byproduct of many value producing operations is that you get certain properties of the result value for free.<br />
<br />
In traditional architectures those status flags are kept in a global status register, and each operation that produces status flags replaces the previous value.<br /><br />
This approach has two major drawbacks:<br />
<br />
For one those bits are only rarely needed, yet they are part of the thread and process space and always need to be preserved, just in case.<br /><br />
And then of course you are limited to the most recently produced status flags.<br /><br />
Both are, as global mutable data usually is, a big obstacle to parallelism and speculative execution.<br />
<br />
The Mill takes a different approach. There is no global status flag register. A lot of operations still produce these flags as condition codes, but usually they get immediately discarded. Only when the program actually needs one or more of these condition codes, as determined by the compiler, do they get explicitly extracted onto the belt with a dedicated operation, which is [[Ganging|ganged]] with the actual value and condition code producing operation. From the belt, other operations can make use of them as normal arguments.<br />
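The extract-by-ganging idea can be sketched in Python. This is a hypothetical model, not Mill assembly; the dictionary of codes stands in for the wires between the producing unit and the ganged extract slot:

```python
def add_with_codes(a, b, width=32):
    """Model: an add yields its result plus condition codes; the
    codes are discarded unless an extract op asks for them."""
    mask = (1 << width) - 1
    full = (a & mask) + (b & mask)
    result = full & mask
    codes = {
        "carry": full > mask,  # unsigned wraparound occurred
        # signed overflow: result sign differs from both input signs
        "overflow": bool(((a ^ result) & (b ^ result) & mask) >> (width - 1)),
        "eql": result == 0,
    }
    return result, codes

def extract(codes, name):
    """Models the ganged extract op: only now does the chosen
    condition code become an ordinary belt value."""
    return int(codes[name])

result, codes = add_with_codes(0xFFFFFFFF, 1)  # wraps: result is 0
# 'carry' is set; extracting it drops a plain 1 onto the belt
```

The key property the sketch shows: the codes are a local byproduct of one operation, not a global register that later operations overwrite.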
<br />
The condition codes, and their respective extracting operation names are:<br />
<br />
# [[Instruction_Set/carry|carry]]<br />
# [[Instruction_Set/overflows|overflow]]<br />
# [[Instruction_Set/fault|fault]]<br />
# [[Instruction_Set/eql|eql]]<br />
# [[Instruction_Set/neq|neq]]<br />
# [[Instruction_Set/gtr|gtr]]<br />
# [[Instruction_Set/geq|geq]]<br />
# [[Instruction_Set/lss|lss]]<br />
# [[Instruction_Set/leq|leq]]<br />
<br />
The condition codes are generated and are available for extraction no matter what the overflow behavior of the operation itself is. Even modulo wraparound operations produce overflow and carry codes.</div>Janhttp://millcomputing.com/wiki/EBBEBB2014-12-15T19:19:59Z<p>Jan: Redirected page to Encoding#Extended Basic Block</p>
<hr />
<div>#REDIRECT [[Encoding#Extended_Basic_Block]]</div>Janhttp://millcomputing.com/wiki/Condition_CodeCondition Code2014-12-14T10:30:58Z<p>Jan: </p>
<hr />
<div>A byproduct of many value producing operations is that you get certain properties of the result value for free.<br />
<br />
In traditional architectures those status flags are kept in a global status register, and each operation that produces status flags replaces the previous value.<br /><br />
This approach has two major drawbacks:<br />
<br />
For one those bits are only rarely needed, yet they are part of the thread and process space and always need to be preserved, just in case.<br /><br />
And then of course you are limited to the most recently produced status flags.<br /><br />
Both are, as global mutable data usually is, a big obstacle to parallelism and speculative execution.<br />
<br />
The Mill takes a different approach. There is no global status flag register. A lot of operations still produce these flags as condition codes, but usually they get immediately discarded. Only when the program actually needs one or more of these condition codes, as determined by the compiler, do they get explicitly extracted onto the belt with a dedicated operation, which is [[Ganging|ganged]] with the actual value and condition code producing operation. From the belt, other operations can make use of them as normal arguments.<br />
<br />
The condition codes, and their respective extracting operation names are:<br />
<br />
# [[Instruction_Set/carry|carry]] - for unsigned integer this is also the overflow<br />
# [[Instruction_Set/overflows|overflow]]<br />
# [[Instruction_Set/fault|fault]] - when a [[NaR]] is produced<br />
# [[Instruction_Set/eql|eql]]<br />
# [[Instruction_Set/neq|neq]]<br />
# [[Instruction_Set/gtr|gtr]]<br />
# [[Instruction_Set/geq|geq]]<br />
# [[Instruction_Set/lss|lss]]<br />
# [[Instruction_Set/leq|leq]]<br />
<br />
The condition codes are generated and are available for extraction no matter what the overflow behavior of the operation itself is. Even modulo wraparound operations produce overflow and carry codes.</div>Janhttp://millcomputing.com/wiki/Condition_CodeCondition Code2014-12-14T06:51:50Z<p>Jan: Created page with "A byproduct of many value producing operations is that you get certain properties of the result value for free. In traditional architecture those status flags are kept in the..."</p>
<hr />
<div>A byproduct of many value producing operations is that you get certain properties of the result value for free.<br />
<br />
In traditional architectures those status flags are kept in a global status register, and each operation that produces status flags replaces the previous value.<br /><br />
This approach has two major drawbacks:<br />
<br />
For one those bits are only rarely needed, yet they are part of the thread and process space and always need to be preserved, just in case.<br /><br />
And then of course you are limited to the most recently produced status flags.<br />
<br />
The Mill takes a different approach. There is no global status flag register. A lot of operations still produce these flags as condition codes, but usually they get immediately discarded. Only when the program actually needs one or more of these condition codes, as determined by the compiler, do they get explicitly extracted onto the belt with a dedicated operation, which is [[Ganging|ganged]] with the actual value and condition code producing operation. From the belt, other operations can make use of them like normal arguments.<br />
<br />
The condition codes, and their respective extracting operation names are:<br />
<br />
# [[Instruction_Set/carry|carry]] - for unsigned integer this is also the overflow<br />
# [[Instruction_Set/overflows|overflow]]<br />
# [[Instruction_Set/fault|fault]] - when a [[NaR]] is produced<br />
# [[Instruction_Set/eql|eql]]<br />
# [[Instruction_Set/neq|neq]]<br />
# [[Instruction_Set/gtr|gtr]]<br />
# [[Instruction_Set/geq|geq]]<br />
# [[Instruction_Set/lss|lss]]<br />
# [[Instruction_Set/leq|leq]]<br />
<br />
The condition codes are generated and are available for extraction no matter what the overflow behavior of the operation itself is. Even modulo wraparound operations produce overflow and carry codes.</div>Janhttp://millcomputing.com/wiki/Cores/GoldCores/Gold2014-12-14T04:46:20Z<p>Jan: </p>
<hr />
<div>{{DISPLAYTITLE:Gold Core}}<br />
<b>[[Cores]]:</b> [[Cores/Tin|Tin]]&nbsp;[[Cores/Copper|Copper]]&nbsp;[[Cores/Silver|Silver]]&nbsp;[[Cores/Gold|Gold]]&nbsp;[[Cores/Decimal8|Decimal8]]&nbsp;[[Cores/Decimal16|Decimal16]]&nbsp;<br />
<br />
The Gold core was conceived as the high-end product, the most powerful configuration that is still reasonable. It offers massive parallelism for both integer and floating point workloads, even for wide 128bit floating point. It can serve in a big compute server for simulations, or in a media creation workstation.<br />
<br />
<br />
<b>[[Belt]]</b>: 32&nbsp;&nbsp;<b>[[Decode#Morsel|Morsel]]</b>: 5bit&nbsp;&nbsp;<b>[[Operands|Scalar Width]]</b>: 128bit&nbsp;&nbsp;<b>[[Operands|Operand Maximum Size]]</b>: 32B&nbsp;&nbsp;<br />
<br />
<b>[[Pipeline]]s</b>: 37&nbsp;&nbsp;<b>[[Retire Station]]s</b>: 16&nbsp;&nbsp;<b>[[Scratchpad]]</b>: 512B&nbsp;&nbsp;<br />
<br />
<b>[[Spiller|Spill Buffers]]</b>: 16&nbsp;&nbsp;<b>[[Spiller|Spiller Stack Size]]</b>: 256MB&nbsp;&nbsp;<br />
<br />
<b>[[Memory#Instruction_Cache|iCache Line]]</b>: 64B&nbsp;&nbsp;<br />
<br />
<b>8 reader slots</b>, 11bits wide&nbsp;&nbsp;&nbsp;<b>5 writer slots</b>, 8bits wide&nbsp;&nbsp;&nbsp;<b>4 pick slots</b>, 16bits wide&nbsp;&nbsp;&nbsp;<br />
<br />
<b>exu slot 0</b>, 23bits wide, with functional units: [[Functional Unit#alu|alu]]&nbsp;[[Functional Unit#bfp|bfp]]&nbsp;[[Functional Unit#bfpm|bfpm]]&nbsp;[[Functional Unit#bfpmas|bfpmas]]&nbsp;[[Functional Unit#count|count]]&nbsp;[[Functional Unit#mul|mul]]&nbsp;[[Functional Unit#nope|nope]]&nbsp;[[Functional Unit#shift|shift]]&nbsp;[[Functional Unit#shuffle|shuffle]]&nbsp;<br />
<br />
<b>exu slot 1</b>, 23bits wide, with functional units: [[Functional Unit#alu|alu]]&nbsp;[[Functional Unit#bfp|bfp]]&nbsp;[[Functional Unit#bfpm|bfpm]]&nbsp;[[Functional Unit#exuArgs|exuArgs]]&nbsp;[[Functional Unit#mul|mul]]&nbsp;[[Functional Unit#nope|nope]]&nbsp;[[Functional Unit#shift|shift]]&nbsp;<br />
<br />
<b>exu slot 2</b>, 23bits wide, with functional units: [[Functional Unit#alu|alu]]&nbsp;[[Functional Unit#bfp|bfp]]&nbsp;[[Functional Unit#bfpm|bfpm]]&nbsp;[[Functional Unit#bfpmas|bfpmas]]&nbsp;[[Functional Unit#exuArgs|exuArgs]]&nbsp;[[Functional Unit#mul|mul]]&nbsp;<br />
<br />
<b>exu slot 3</b>, 23bits wide, with functional units: [[Functional Unit#alu|alu]]&nbsp;[[Functional Unit#bfp|bfp]]&nbsp;[[Functional Unit#bfpm|bfpm]]&nbsp;[[Functional Unit#exuArgs|exuArgs]]&nbsp;[[Functional Unit#mul|mul]]&nbsp;<br />
<br />
<b>exu slot 4</b>, 19bits wide, with functional units: [[Functional Unit#alu|alu]]&nbsp;[[Functional Unit#exuArgs|exuArgs]]&nbsp;<br />
<br />
<b>exu slot 5</b>, 19bits wide, with functional units: [[Functional Unit#alu|alu]]&nbsp;[[Functional Unit#exuArgs|exuArgs]]&nbsp;<br />
<br />
<b>exu slot 6</b>, 19bits wide, with functional units: [[Functional Unit#alu|alu]]&nbsp;[[Functional Unit#exuArgs|exuArgs]]&nbsp;<br />
<br />
<b>exu slot 7</b>, 19bits wide, with functional units: [[Functional Unit#alu|alu]]&nbsp;[[Functional Unit#exuArgs|exuArgs]]&nbsp;<br />
<br />
<b>flow slot 0</b>, 17bits wide, with functional units: [[Functional Unit#cache|cache]]&nbsp;[[Functional Unit#con|con]]&nbsp;[[Functional Unit#conform|conform]]&nbsp;[[Functional Unit#control|control]]&nbsp;[[Functional Unit#ls|ls]]&nbsp;[[Functional Unit#lsb|lsb]]&nbsp;[[Functional Unit#misc|misc]]&nbsp;[[Functional Unit#nopf|nopf]]&nbsp;<br />
<br />
<b>flow slot 1</b>, 17bits wide, with functional units: [[Functional Unit#con|con]]&nbsp;[[Functional Unit#conform|conform]]&nbsp;[[Functional Unit#control|control]]&nbsp;[[Functional Unit#flowArgs|flowArgs]]&nbsp;[[Functional Unit#ls|ls]]&nbsp;[[Functional Unit#lsb|lsb]]&nbsp;[[Functional Unit#misc|misc]]&nbsp;<br />
<br />
<b>flow slot 2</b>, 17bits wide, with functional units: [[Functional Unit#con|con]]&nbsp;[[Functional Unit#conform|conform]]&nbsp;[[Functional Unit#control|control]]&nbsp;[[Functional Unit#flowArgs|flowArgs]]&nbsp;[[Functional Unit#ls|ls]]&nbsp;[[Functional Unit#lsb|lsb]]&nbsp;<br />
<br />
<b>flow slot 3</b>, 17bits wide, with functional units: [[Functional Unit#con|con]]&nbsp;[[Functional Unit#conform|conform]]&nbsp;[[Functional Unit#control|control]]&nbsp;[[Functional Unit#flowArgs|flowArgs]]&nbsp;[[Functional Unit#ls|ls]]&nbsp;[[Functional Unit#lsb|lsb]]&nbsp;<br />
<br />
<b>flow slot 4</b>, 16bits wide, with functional units: [[Functional Unit#con|con]]&nbsp;[[Functional Unit#flowArgs|flowArgs]]&nbsp;[[Functional Unit#ls|ls]]&nbsp;[[Functional Unit#lsb|lsb]]&nbsp;<br />
<br />
<b>flow slot 5</b>, 16bits wide, with functional units: [[Functional Unit#con|con]]&nbsp;[[Functional Unit#flowArgs|flowArgs]]&nbsp;[[Functional Unit#ls|ls]]&nbsp;[[Functional Unit#lsb|lsb]]&nbsp;<br />
<br />
<b>flow slot 6</b>, 16bits wide, with functional units: [[Functional Unit#con|con]]&nbsp;[[Functional Unit#flowArgs|flowArgs]]&nbsp;[[Functional Unit#ls|ls]]&nbsp;[[Functional Unit#lsb|lsb]]&nbsp;<br />
<br />
<b>flow slot 7</b>, 16bits wide, with functional units: [[Functional Unit#con|con]]&nbsp;[[Functional Unit#flowArgs|flowArgs]]&nbsp;[[Functional Unit#ls|ls]]&nbsp;[[Functional Unit#lsb|lsb]]&nbsp;<br />
<br />
<br />
<br />
[[Cores/Gold/Encoding|Operation Encoding]]</div>Janhttp://millcomputing.com/wiki/OverflowOverflow2014-12-11T10:00:41Z<p>Jan: </p>
<hr />
<div>When integer operations produce values that can't be contained within the bit width of the result, there are different strategies of how to deal with it. Which one is used depends on the chosen operation.<br />
<br />
In all cases the [[Condition_Code|condition codes]] are generated and can be queried with the dedicated ganged operations. What the overflow behavior does determine is the primary result that is produced.<br />
<br />
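The four result strategies listed below can be sketched for an unsigned add of a given bit width. This is plain Python as illustration, not Mill semantics; `NAR` is just a Python sentinel standing in for the hardware's [[NaR]] marker:

```python
NAR = object()  # stand-in marker, not the hardware NaR representation

def add_unsigned(a, b, width, behavior):
    mask = (1 << width) - 1
    full = (a & mask) + (b & mask)
    if behavior == "modulo":       # silent wraparound
        return full & mask
    if behavior == "saturating":   # clamp to the widest value
        return min(full, mask)
    if behavior == "excepting":    # overflow produces a NaR
        return full if full <= mask else NAR
    if behavior == "widening":     # result width doubles, so the
        return full                # full value is always exact
    raise ValueError(behavior)

# 8-bit example, 200 + 100:
# modulo -> 44, saturating -> 255, excepting -> NAR, widening -> 300
```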
# modulo - normal silent overflow or underflow, also called wraparound<br />
# saturating - the highest or lowest representable value in the given width is returned as the result<br />
# excepting - the appropriate [[NaR]] is created as the result<br />
# widening - here the scalar byte width is doubled and the full and exact result is the return value</div>Janhttp://millcomputing.com/wiki/OverflowOverflow2014-12-11T09:59:40Z<p>Jan: </p>
<hr />
<div>When integer operations produce values that can't be contained within the bit width of the result, there are different strategies of how to deal with it.<br />
<br />
In all cases the [[Condition_Code|condition codes]] are generated and can be queried with the dedicated ganged operations. What the overflow behavior does determine is the primary result that is produced.<br />
<br />
# modulo - normal silent overflow or underflow, also called wraparound<br />
# saturating - the highest or lowest representable value in the given width is returned as the result<br />
# excepting - the appropriate [[NaR]] is created as the result<br />
# widening - here the scalar byte width is doubled and the full and exact result is the return value</div>Janhttp://millcomputing.com/wiki/OverflowOverflow2014-12-11T09:59:17Z<p>Jan: Created page with "When integer operations produce values that can't be contained within the bit width of the result, there are different strategies of how to deal with it. In all cases the C..."</p>
<hr />
<div>When integer operations produce values that can't be contained within the bit width of the result, there are different strategies of how to deal with it.<br />
<br />
In all cases the [[Condition_Code|condition codes]] are generated and can be queried with the dedicated ganged operations. What the overflow behavior does determine is the primary result that is produced.<br />
<br />
1. modulo - normal silent overflow or underflow, also called wraparound<br />
2. saturating - the highest or lowest representable value in the given width is returned as the result<br />
3. excepting - the appropriate [[NaR]] is created as the result<br />
4. widening - here the scalar byte width is doubled and the full and exact result is the return value</div>Janhttp://millcomputing.com/wiki/DomainsDomains2014-12-11T05:03:22Z<p>Jan: </p>
<hr />
<div>Domains are the different kinds of scalar values the Mill operations can work with. A value on the belt is just bits: the [[Metadata]] determines how many bits there are and whether they are arranged in vectors, but how exactly those bits are interpreted is up to the individual operations. As such, the operations tend to fall into different categories, called domains. Some operations don't care at all what kind of values they are dealing with and just move them around, but anything that produces new bit patterns based on old ones does have a domain it interprets the bit patterns in.<br /><br />
Most of the domains are indicated in the opcode mnemonic of the operation via suffix.<br />
<br />
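This "just bits" point can be illustrated with Python's struct module: the same 32-bit pattern yields different values depending on which domain's interpretation is applied. This is an analogy, not Mill behavior:

```python
import struct

# One 32-bit pattern, read three ways (little-endian):
bits = struct.pack("<I", 0x40490FDB)
as_unsigned = struct.unpack("<I", bits)[0]  # u domain: 1078530011
as_signed = struct.unpack("<i", bits)[0]    # s domain: same, sign bit clear
as_float = struct.unpack("<f", bits)[0]     # f domain: roughly 3.14159
```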
{| class="wikitable" style="width:500px;text-align:center;position: absolute; left: 20em; margin-top: 7em"<br />
|+ Byte Widths for Different Domains<br />
! style="width:2.5em" | op<br />
! style="width:2.5em" | u<br />
! style="width:2.5em" | s<br />
! style="width:2.5em" | p<br />
! style="width:2.5em" | f<br />
! style="width:2.5em" | d<br />
! style="width:2.5em" | uf<br />
! style="width:2.5em" | sf<br />
! style="color:#555;width:2.5em" | i<br />
! style="color:#555;width:2.5em" | n<br />
! style="color:#555;width:2.5em" | sel<br />
! style="color:#555;width:2.5em" | pred<br />
|-<br />
| 1-16<br />
| 1-16<br />
| 1-16<br />
| 8<br />
| 4-16<br />
| 8-16<br />
| 1-16<br />
| 1-16<br />
| 1<br />
| 1<br />
| 1<br />
| 1<br />
|}<br />
<br />
= Domains =<br />
<br />
== <span id="op">Logical</span> ==<br />
<br />
The logical domain is the default domain, in the sense that it is not indicated by any operation suffix. It treats bits just as bits, and a whole slew of boolean and logical operations work on pure bits. Quite a few signed and unsigned integer operations also alias to the logical domain when they have modulo/wraparound [[Overflow|overflow]] behavior. In the documentation pages for the operations, operands that are interpreted as logical bits are presented as the op type. This is also the case for operations that don't really change the bit patterns but just move them around in some way.<br />
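Why modulo arithmetic can alias to one set of hardware: in two's complement, signed and unsigned wraparound adds produce identical bit patterns, so one adder serves both interpretations. A small Python illustration:

```python
def add_mod(a, b, width=8):
    """One wraparound adder; how the bits are read is up to the
    consumer, not the adder."""
    return (a + b) & ((1 << width) - 1)

bits = add_mod(0xFF, 0x01)  # unsigned view: 255 + 1 wraps to 0
# signed view of the same inputs: -1 + 1 = 0; identical bit pattern
```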
<br />
== <span id="u">u - Unsigned Integer</span> ==<br />
<br />
Most of the unsigned integer arithmetic with modulo [[Overflow|overflow]] behavior aliases to the logical domain hardware.<br />
<br />
== <span id="s">s - Signed Integer</span> ==<br />
<br />
Most of the signed integer arithmetic with modulo [[Overflow|overflow]] behavior aliases to the logical domain hardware, too.<br />
<br />
== <span id="p">p - Pointers</span> ==<br />
<br />
Pointers are always 64bit, and the upper 4 bits are ignored in pointer arithmetic because they are reserved for special purposes, like triggering traps for garbage collection. More under [[Virtual Address]]es.<br />
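As a sketch of what "ignored in pointer arithmetic" could mean in practice: address arithmetic applies only to the low 60 bits, and the reserved top 4 bits pass through untouched. The 60/4 split follows the text; the wrapping behavior at the top of the address bits is an assumption of this illustration:

```python
ADDR_MASK = (1 << 60) - 1                  # low 60 bits carry the address
META_MASK = ~ADDR_MASK & ((1 << 64) - 1)   # reserved top 4 bits

def ptr_add(p, offset):
    """Hypothetical pointer add: the offset wraps within the address
    bits; the reserved upper bits are preserved, never carried into."""
    meta = p & META_MASK
    addr = (p + offset) & ADDR_MASK
    return meta | addr

p = (0xA << 60) | 0xFF0  # made-up pointer: reserved bits 0xA, address 0xFF0
```

The only point the sketch makes is that the reserved bits never participate in the arithmetic.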
<br />
== <span id="f">f - Floating Point</span> ==<br />
<br />
This is [http://en.wikipedia.org/wiki/IEEE_floating_point IEEE 754] binary floating point. Available as 32, 64 and 128 bits.<br />
<br />
== <span id="d">d - Decimal Floating Point</span> ==<br />
<br />
[http://en.wikipedia.org/wiki/IEEE_floating_point IEEE 754] decimal floating point. Available as 64 and 128 bits. On most [[Cores]] this is emulated.<br />
<br />
== <span id="sf">sf - Signed Fixed Point</span> ==<br />
<br />
Generally fixed point arithmetic is the same as integer arithmetic, except for shifts, multiplication, widening and narrowing.<br />
<br />
== <span id="uf">uf - Unsigned Fixed Point</span> ==<br />
<br />
The same goes for unsigned fixed point arithmetic.<br />
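Why multiplication is among the exceptions: in a fixed point format the integer product carries twice the fraction bits and needs renormalizing, while add and subtract are plain integer operations. A Python sketch; the Q8 format is an arbitrary choice for illustration:

```python
FRAC = 8  # Q8: 8 fractional bits, scale factor 256

def to_fixed(x):
    return int(round(x * (1 << FRAC)))

def fixed_mul(a, b):
    # the integer product has 2*FRAC fraction bits; shift back down
    return (a * b) >> FRAC

a, b = to_fixed(1.5), to_fixed(2.25)
total = a + b               # addition is just integer addition
product = fixed_mul(a, b)   # represents 1.5 * 2.25 = 3.375
```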
<br />
<br /><br /><br />
= Pseudo Domains =<br />
<br />
There are also some operands to some operations that get interpreted in a special way without applying to the whole operation. Those are not truly full fledged domains and aren't part of the mnemonics but are listed here anyway.<br />
Generally they take the width of the other operands in the operation, but only the lowest few bits or even just the lowest bit matter.<br />
<br />
== <span id="i" style="color:#555">i - Vector Index</span> ==<br />
<br />
Some operations build and take apart vector operands and index the vector elements. The values can be any width actually, but if the index exceeds the vector element count, [[Metadata#None_and_NaR|NaRs]] happen.<br />
<br />
== <span id="n" style="color:#555">n - Bit Count</span> ==<br />
<br />
Shifts and bit tests and similar operations move or index bits within a value. If the n value is bigger than the width of the other operand, [[Metadata#None_and_NaR|NaRs]] happen here as well.<br />
<br />
== <span id="sel" style="color:#555">sel - Selector</span> ==<br />
<br />
The [[Instruction_Set/pick|pick]] and [[Instruction_Set/recur|recur]] operations select values based on a select predicate where only the lowest bit is evaluated.<br />
<br />
== <span id="pred" style="color:#555">pred - Predicate</span> ==<br />
<br />
The conditional branches also have to make choices; their predicates, however, really require 0 or 1 values and don't just evaluate the lowest bit.</div>Janhttp://millcomputing.com/wiki/MediaWiki:Common.cssMediaWiki:Common.css2014-12-04T15:18:31Z<p>Jan: </p>
<hr />
<div>/* CSS placed here will be applied to all skins */<br />
<br />
th span.mw-collapsible-toggle { font-size: smaller; font-weight: normal; }<br />
<br />
hr { margin-top: 2em; }<br />
<br />
table.encoding { border-spacing: 0; border-collapse: collapse; margin: 1em 0; }<br />
table.encoding td { border: 1px solid #999; text-align: center; }<br />
table.encoding th:first-child { text-align: center; }<br />
table.encoding th { width: 3em; text-align: left; }<br />
table.encoding sub { color: #666; font-size: 70%; }<br />
table.encoding i { color: #060; }</div>Janhttp://millcomputing.com/wiki/EncodingEncoding2014-11-29T19:36:08Z<p>Jan: </p>
<hr />
<div>The Mill architecture employs a unique split-stream instruction encoding that enables sustained decode rates of over 30 operations per cycle by being both wide issue and very dense. It provides those unparalleled numbers at a fraction of the energy and transistor cost of mainstream variable-length instruction sets like x86.<br />
<br />
== Semantics ==<br />
<br />
=== Wide Issue ===<br />
<br />
The Mill architecture is wide issue: each instruction contains multiple operations that are issued together. This is not a fixed width as is customary in <abbr title="Very Long Instruction Word">VLIW</abbr> architectures. It can be as narrow or as wide as the instruction level parallelism allows, with the upper bound obviously being the amount of [[Functional Units]].<br />
<br />
=== <span id="EBB">Extended Basic Block</span> ===<br />
[[File:ebb.png|right|alt=EBB overview]]<br />
Code on the Mill is organized into <abbr title="Extended Basic Block">EBB</abbr>s, i.e. batches of code with one entry point and one or more exit points. There is no implicit fall-through in EBBs. The instruction flow can only leave them with an explicit branch, which means at least the last operation in every EBB is an unconditional branch. An EBB can contain further conditional branches that go either to the top of other EBBs or back to its own top. And in contrast to other architectures, there can even be calls in the EBB; they don't leave it, as long as they return normally. The image to the right is a purely logical view; the right block is even a normal canonical basic block with one entry point and one exit. In reality things are a little more complicated, as will be seen below.<br />
<br />
=== <span id="instructions">Instructions and Operations and Bundles</span> ===<br />
The unusual encoding requires making clear distinctions between instructions and operations and bundles that are not really necessary on traditional machines. In the earliest <abbr title="Reduced Instructions Set Computer">RISC</abbr> architectures, an instruction and an operation and a bundle are usually the same thing: a word size bundle of bits is retrieved from the instruction cache and dropped into the decoder. There, one instruction is retrieved and issued resulting in one operation being performed. On wide issue machines, one instruction can contain several operations that are all issued together. Modern machines drop a bundle containing several instructions at once into the decoder.<br />
<br />
So, a bundle is the batch of memory that gets fetched from memory and dropped into the decoder together.<br /><br />
An instruction is all the operations that get issued to the functional units together.<br /><br />
And an operation is the most basic piece of processing in a functional unit (an add or xor for example).<br />
<br />
=== <span id="streams">Split Instruction Streams</span> ===<br />
Conventionally there are two approaches to instruction encoding: fixed length instructions and variable length instructions. Fixed length instructions are cheap and easy to decode, but don't offer very good code density. Variable length instructions can offer good code density, but decoding them tends to have polynomial cost. The decode rate of the former tends to be limited by instruction cache size and throughput; for the latter, the limiting factor is the cost of processing and interpreting the actual bits, of recognizing the instructions in the stream. The best fixed length instruction decoders can scrape low double digits per cycle, and Intel's heroics can get up to 4 instructions per cycle on x86. How to overcome this bottleneck?<br />
<br />
It has to be variable length encoding, for simple code density reasons; there is no way around that. But since decoding one stream of instructions is n<sup>2</sup> in cost, splitting it into two parallel streams of half the length leaves each decoder with only a quarter of that cost. And when those two streams are also split by functionality, you get further gains in code density and simplicity from being able to give the bit patterns different meanings on each side <ref name="split_stream">[http://millcomputing.com/blog/wp-content/uploads/2013/12/mill_cpu_split-stream_encoding.pdf The Mill: Split Stream Encoding]</ref>.<br />
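The cost argument can be made concrete with a back-of-the-envelope model (a sketch assuming the quadratic cost stated above; real decoder cost depends on the encoding details):

```python
def decode_cost(n):
    """Hypothetical cost model: recognizing n variable-length instructions
    in one stream costs on the order of n squared."""
    return n * n

n = 8                                   # instructions decoded per cycle
single_stream = decode_cost(n)          # one decoder recognizes all 8
per_half_stream = decode_cost(n // 2)   # each of two decoders recognizes 4
# Each split decoder pays a quarter of the single-stream cost, and the two
# run in parallel.
print(single_stream, per_half_stream)
```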
<br />
The functional split on the Mill for the two different streams is into:<br />
* an [[ExuCore]], for all arithmetic and logic, actual computation<br />
* a [[FlowCore]], for flow control and memory accesses<br />
<br />
This gives a fairly even split in the average workload for both sides.<br />
<br />
== Implementation ==<br />
<br />
There are of course a few technical hurdles to overcome for split stream encoding. For one, it requires two <abbr title="program counter">pc</abbr>s. How are branch targets specified with more than one pc? Encoding two addresses, in whatever form, is not very efficient. There could be implicit base addresses in dedicated memory regions for the two streams, with only one offset encoded. But then significant address space is wasted whenever the two streams are of different lengths for the same flow of control.<br />
<br />
The approach the Mill takes is to have only one address as a branch target, where both program counters end up. From there, one stream walks up the memory and the other walks down. Both sides can be of different lengths as needed. So while this branch target logically is the top of an EBB, in memory it actually points into the middle of it.<br />
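A minimal sketch of this fetch scheme (the addresses, half-bundle lengths, and which stream goes in which direction are invented for illustration):

```python
def fetch_half_bundles(memory, entry, flow_len, exu_len):
    """One branch target, two program counters: one stream walks down
    in memory from the entry point, the other walks up."""
    flow_half = memory[entry - flow_len:entry]   # below the entry point
    exu_half = memory[entry:entry + exu_len]     # above the entry point
    return flow_half, exu_half

mem = bytes(range(32))
flow, exu = fetch_half_bundles(mem, entry=16, flow_len=4, exu_len=6)
print(flow.hex(), exu.hex())   # bytes 12..15 and bytes 16..21
```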
<br />
<br /><br />
<imagemap><br />
File:Ebb-memory-layout.png|center<br />
desc none<br />
rect 0 20 180 60 [[Decode#Flow_Stream]]<br />
rect 188 20 442 60 [[Decode#Exu_Stream]]<br />
rect 452 20 564 60 [[Decode#Flow_Stream]]<br />
rect 564 20 668 60 [[Decode#Exu_Stream]]<br />
</imagemap><br />
<br />
The increasingly dark blue bars to each side of the entry points to the EBB are the instructions. The same shade of blue for two instructions on each side means they are pulled from the cache and dropped in their respective decoder together, the [[Flow]] instructions from the decreasing program counter side, the [[Execution]] instructions from the increasing side.<br /> Each instruction is a half-bundle, which together with the instruction on the other side makes a full bundle. Both instructions belong logically together and are issued together. The info byte contains information on how many and which cache lines to pull for the EBB.<br />
<br />
It should be noted that, because of the number of operations that can be packed into one instruction and the [[Phasing|Phased]] execution, the vast majority of EBBs on the Mill consist of only one exu and one flow instruction.<br />
<br />
Also, each instruction can have a different size. While instruction sizes are always a whole number of bytes, there is also a minimum instruction size in bytes. It depends on the processor [[Specification]], in particular on how many [[Slot]]s and functional units and [[Belt]] positions the specific Mill processor has.<br />
<br />
But it is still a variable length [http://en.wikipedia.org/wiki/Very_long_instruction_word VLIW] instruction set, and those are normally very hard to parse.<br />
<br />
== General Instruction Format ==<br />
<br />
The reason this is not the case on the Mill lies in the instruction format. The format is different for the two instruction streams, but both follow a general pattern and idea, mirrored in orientation on each side, which is described here.<br />
<br />
Each instruction is decoded in 3 cycles, each cycle looking at different blocks of the instruction with dedicated hardware. As soon as one block is dealt with for one instruction the corresponding block of the next instruction immediately follows. Those blocks roughly correspond to the different [[Phasing|Phases]]. In particular the first block is dominated by reader phase operations in the exu stream.<br /><br />
But how does the decoder know where to take the next instruction from, when all instructions have different lengths? Simple: each instruction has a fixed length header that contains a shift count, which tells the decoder how long this specific instruction is and where the next instruction starts. And in the case of a branch, it is the [[Prediction|prediction mechanisms]] that determine the instruction pointer in prefetch, so it is available without delay.<br />
<br />
This fixed header also contains the operation counts and size information for each block. Decode goes through the instruction like this:<br />
<br />
# Cycle 1 looks at the header and at block 1, which immediately follows the fixed header. The decoder always knows where block 1 starts, because it is at a fixed location, and immediately starts decoding it assuming the maximum possible block size. In the next cycle it knows how big the block actually is, because the header has been parsed too, and can cancel all the excess.<br />
# Cycle 2 looks at block 2, which is aligned to the end of the instruction, and at block 3, which immediately follows block 1. This is now possible because all size information, for the individual blocks and for the instruction itself, has been retrieved from the header in the previous cycle.<br />
# Cycle 3 finishes up with all the information retrieved from all blocks.<br />
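The header-driven layout can be sketched like this (the field widths and their packing are invented; a real Mill's header format is member-specific):

```python
HEADER_LEN = 2   # hypothetical: byte 0 = shift count, byte 1 = block sizes

def parse_instruction(stream, pc):
    """Split one instruction into its three blocks and return the next pc."""
    total_len = stream[pc]             # shift count: total instruction length
    b1_len = stream[pc + 1] >> 4       # block 1 size, high nibble
    b2_len = stream[pc + 1] & 0x0F     # block 2 size, low nibble
    block1 = stream[pc + HEADER_LEN:pc + HEADER_LEN + b1_len]   # after header
    block2 = stream[pc + total_len - b2_len:pc + total_len]     # at the end
    # Block 3 follows block 1; any gap before block 2 is the alignment hole,
    # which on a real Mill carries a delay count for the opposing stream.
    block3 = stream[pc + HEADER_LEN + b1_len:pc + total_len - b2_len]
    return (block1, block2, block3), pc + total_len

ins = bytes([8, 0x22, 0xA1, 0xA2, 0xC1, 0xC2, 0xB1, 0xB2])
blocks, next_pc = parse_instruction(ins, 0)
print([b.hex() for b in blocks], next_pc)   # ['a1a2', 'b1b2', 'c1c2'] 8
```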
<br />
<br /><br />
[[File:General-instruction.png|center]]<br />
<br /><br />
<br />
Each block can only contain operations and data that belong in it, and each has its own format and dedicated decoding hardware. Because of this specialization, each of those formats is very simple and dense, and the specialized decoders can be fast and cheap.<br />
<br />
This process of decoding isn't limited to 3 blocks and cycles. The general pattern of decoding the head and block 1 on the first cycle and 2 more blocks on each consecutive one can be expanded to arbitrary sizes, but it turns out 3 is enough and meshes well with [[Phasing]]. It could be considered another doubling of the instruction streams.<br />
<br />
The blocks are not limited to byte sizes and boundaries; indeed they rarely fall on them. They contain a number of fixed size fragments/operations, and usually that doesn't add up to whole bytes. This means there is almost always an alignment hole in the middle of an instruction where blocks 2 and 3 meet. This space isn't wasted: it contains a delay count that serves as a <abbr title="No Operation">NOP</abbr> of the given length. But not for the instruction itself; instead for the instruction of the opposing side, i.e. the alignment holes on the flow side encode the delays on the exu side and vice versa. There are also normal NOPs, but you rarely ever need to use them.<br /><br />
This kind of efficient NOP is very important on wide issue machines, and it is also very important in statically scheduled exposed pipeline architectures. The operations in the instruction can have different latencies, and this delay serves to statically sync with the other stream.<br />
<br />
As an interesting side note: in contrast to conventional architectures, especially RISC architectures, on the Mill the program counters or instruction pointers are always explicitly set or altered with values from the instruction stream or from operands. There is no implicit increment. On other processors this only happens with branch instructions.<br />
<br />
== Rationale ==<br />
<br />
The need for statically scheduled wide issue, and for decoding a lot more instructions and operations per cycle, has been apparent for quite a while. In the [http://en.wikipedia.org/wiki/Digital_signal_processor DSP] world this has been standard practice for decades. The problem is that the many and unpredictable branches in general purpose workloads make static scheduling very hard and punish it severely with stalls whenever static scheduling fails.<br />
<br />
Additionally, wide issue machines with many parallel operations in one instruction are very inefficient in code size when instruction level parallelism is limited, as it tends to be in general purpose workloads with small current data flow windows.<br />
<br />
There are two factors that offer a way out of this though:<br />
<br />
# Compilers have unlimited data flow windows and can statically and optimally schedule everything if the instruction set provides those interfaces.<br />
# ~80% of all code is in loops, and statically scheduled loops have unbounded instruction level parallelism.<br />
<br />
So once the problem of statically scheduling most loops is solved (it has been a gnarly problem since the start of computing, but for the Mill it is [[Execution#Loops|solved]]), performance mainly becomes a question of how many functional units you have, and how fast you can feed them with instructions and data.<br />
<br />
You still don't want a fixed width wide issue instruction set; general purpose workloads are far too irregular for that. But you do want the option to be as parallel as you have functional units, whenever the control and data flow allows it.<br />
<br />
And of course you want different processors for different workloads, even for general purpose code. Office desktops and communication servers and scientific computing servers and mobile devices and game consoles and industrial controllers all have vastly different requirements. They need different kinds and numbers of functional units, different cache sizes, different bandwidths etc. It is impossible for one architecture and one instruction set, much less one processor, to serve all those roles equally well. Software compatibility between all of them is still highly beneficial, often even necessary.<br />
<br />
There are other advantages to be gained from split instruction streams and shifted blocks. In effect, memory regions in certain relative positions to each other can only contain certain kinds of content. This is a very effective way to increase entropy and code density.<br />
<br />
Furthermore, this specialization allows you to dedicate and split caches. This is only relevant for the Flow and Exu streams, not for the instruction block shifts, which happen already in the decoder. Apart from letting you pack them much more tightly, dedicated and split caches can be made smaller and faster, with comparable combined capacity, by vastly reducing the wire lengths from the bottom of the cache to the decoder.<br />
<br />
This is actually a trend that can be seen all over the Mill architecture: smaller, more and more specialized caches are more efficient and faster and can be placed better on the chip than fewer, large, non-specialized ones. Division of labor works for machines, too.<br />
<br />
== See Also ==<br />
<br />
[[Decode]], [[Phasing]], [[Instruction Set]]<br />
<br />
== Media ==<br />
[http://www.youtube.com/watch?v=LgLNyMAi-0I Presentation on the Encoding by Ivan Godard] - [http://millcomputing.com/blog/wp-content/uploads/2013/12/2013-07-11_mill_cpu_encoding_08.pptx Slides]<br />
<br />
== References ==<br />
<references /></div>Janhttp://millcomputing.com/wiki/MemoryMemory2014-11-24T07:28:23Z<p>Jan: /* Instruction Cache */</p>
<hr />
<div>A lot of the power and performance gains of the Mill, but also many of its [[Protection|security]] improvements over conventional architectures, come from the various facilities of the memory management. Most subsystems have their own dedicated pages. This page is an overview.<br />
<br />
<div style="position: absolute; left: 18em;"><br />
<imagemap><br />
File:Memory-hierarchy.png|alt=Memory/Cache Hierarchy<br />
desc none<br />
</imagemap><br />
</div><br />
<br />
== Overview ==<br />
<br />
The Mill architecture is a 64bit architecture; there are no 32bit Mills. For this reason it is possible, and indeed prudent, to adopt a single address space (SAS) memory model. All threads and processes share the same address space. Any address points to the same location for every process. To do this securely and efficiently, the [[Protection|memory access protection]] and address translation have been split into two separate modules, whereas on conventional architectures those two tasks are conflated into one.<br />
<br />
As can be seen from this rough system chart, there is a combined L2 cache, although some low end implementations may choose to omit it for space and energy reasons. The Mill has facilities that make an L2 cache less critical.<br /><br />
L1 caches are separate for instructions and data already, and even more, they are already separate for [[ExuCore]] instructions and [[FlowCore]] instructions. Smaller, more specialized caches can be made faster and more efficient in many regards, but chiefly via shorter signal paths.<br /><br />
The D$1 data cache feeds into the retire stations with [[Instruction Set/Load|load operations]] and receives the values from the [[Instruction Set/Store|store operations]].<br />
<br />
== [[Protection]] ==<br />
<br />
All [[Protection]] happens by defining protection attributes on virtual address regions. This happens above the Level 1 caches and separately for instructions and data with different attributes, execute and portal for instructions, read and write for data. The <abbr title="instruction Protection Lookaside Buffer">iPLB</abbr> and <abbr title="data Protection Lookaside Buffer">dPLB</abbr> lookup tables are specialized and can be small and fast. And even better optimizations exist in the [[Protection#Well_Known_Regions|well known regions]] for the most common cases. More on this under [[Protection]].<br />
<br />
== Address Translation ==<br />
<br />
Because address translation is separated from access protection, and because all processes share one address space, the translation and <abbr title="Translation Lookaside Buffer">TLB</abbr> accesses can be moved below the caches. In fact the TLB only ever needs to be accessed when there is a cache miss or evict. In that case there is a 300+ cycle stall anyway, which means the TLB can be big and flat and slow and energy efficient. The few extra cycles for a TLB lookup are largely masked by the system memory access.<br /><br />
On conventional machines the TLB sits right in the critical path between the top level cache and the functional units. This means the TLB must be small and fast and power hungry, with a complex hierarchy. And you still spend up to 20-30% of your cycles and power budget on TLB stalls and TLB hierarchy shuffling.<br />
<br />
=== Reserved Address Space ===<br />
<br />
The virtual address space is 60bit. This is because the top 4 bits of the [[Virtual Address]]es are reserved for system use like garbage collection.<br />
<br />
The top part of this 60bit address space is reserved to facilitate fast [[Protection#Stacklets|protection domain or turf]] switches with secure stacks. More on this there.<br />
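Splitting a 64-bit pointer into the reserved top bits and the 60-bit address is simple masking (a sketch; the tag value used here is arbitrary):

```python
RESERVED_BITS = 4
ADDR_BITS = 64 - RESERVED_BITS              # 60-bit virtual address space

def split_pointer(p):
    """Separate the reserved system/GC bits from the virtual address."""
    tag = p >> ADDR_BITS                    # top 4 bits, reserved for system use
    addr = p & ((1 << ADDR_BITS) - 1)       # low 60 bits, the virtual address
    return tag, addr

tag, addr = split_pointer(0xA000_0000_0000_1234)
print(tag, hex(addr))   # 10 0x1234
```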
<br />
== Retire Stations ==<br />
<br />
Retire stations serve the load/store <abbr title="Functional Unit">FU</abbr>s or [[Slot]]s for the [[Instruction Set/Load|load operation]]. They implement the deferred load operation and conceptually are part of the [[FlowCore]]. The load operation is explicitly deferred, i.e. it has a parameter which determines exactly at which point in the future it has to make the value available and drop it on the [[Belt]]. This explicit, static, but parametrized scheduling allows the hiding of almost all cache latencies in memory access. A DRAM stall still has the same cost, but due to innovations in cache access, specialized mechanisms for the most common memory access patterns, and exact [[Prediction]], the number of DRAM accesses has been vastly reduced, too.<br /><br />
Another important aspect of the deferred load operation is that it does not load the value at the point where the load is issued, but at the point when it is scheduled to yield the value. This makes the load hardware immune to [[Aliasing]], which means the compiler can stop worrying about aliasing completely and optimize aggressively.<br /><br />
This is achieved by having the active retire stations, i.e. the retire stations with a load pending to return, monitor the store wires for stores to their address. Whenever they see a store to their address, they copy the value for later return.<br />
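A toy model of this snooping (grossly simplified; a real retire station works on wires and cycles, not Python objects):

```python
class RetireStation:
    """Holds one pending deferred load and snoops stores to its address."""
    def __init__(self, memory, address):
        self.address = address
        self.value = memory[address]       # value picked up at issue time ...

    def snoop(self, store_address, store_value):
        if store_address == self.address:  # ... replaced by any later store
            self.value = store_value

memory = {0x100: 1}
station = RetireStation(memory, 0x100)
memory[0x100] = 2          # an aliasing store between issue and retire
station.snoop(0x100, 2)    # the station sees it on the store wires
print(station.value)       # 2: the load retires with the program-order value
```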
<br />
== Implicit Zero and Virtual Zero ==<br />
<br />
Loads from uninitialized but accessible memory always yield zero on the Mill. There are two mechanisms to ensure that.<br />
<br />
The first is virtual zero. When a load misses the caches and also misses the <abbr title="Translation Lookaside Buffer">TLB</abbr> with no <abbr title="Page Table Entry">PTE</abbr> in it, there have been no stores to the address yet, and in this case the <abbr title="Memory Management Unit">MMU</abbr> returns zero for the load to bring back to the retire station. The big gain is that the OS doesn't have to explicitly zero out new pages, which would cost a lot of bandwidth and time, and accesses to uninitialized memory only take the time of the cache and TLB lookups instead of requiring memory round trips.<br /><br />
This also has security benefits, since no one can snoop on memory garbage piles.<br />
<br />
An optimization of this for data stacks is the implicit zero. The problems of uninitialized memory and of bandwidth waste that virtual zero addresses for general memory accesses are compounded even further for the data stack, because of the high frequency of new accesses and because recently written data is frequently never used again. On conventional architectures this causes a staggering amount of cache thrashing and superfluous memory accesses.<br /><br />
The [[Instruction Set/Stackf|stackf]] instruction allocates a new stack frame, i.e. a number of new cache lines, but it does so just by putting markers for those cache lines into the implicit zero registers.<br /><br />
When a subsequent load happens on a newly allocated stack frame, the hardware knows it is a stack access due to the [[Protection#Well_Known_Regions|well known region]] and stack frame [[Registers]]. The hardware doesn't even need to check the [[Protection|dPLB]] or the top level caches, it just returns zero. So while virtual zero returns zero with only the cost of the cache accesses for uninitialized memory, for the most frequent case of uninitialized stack accesses you don't even have top level cache delays, but immediate access. And of course it also makes it impossible to snoop on old stack frames.<br /><br />
Only when a [[Instruction Set/Store|store]] happens on a new stack frame will an actual new cache line be allocated, the new value be written and a flag in the implicit zero registers marks those values as written, all by hardware. Uninitialized loads are still implicitly zero in this cache line, only the actually stored value is pulled from the cache.<br />
<br />
In the majority of cases the stack frame is deallocated before it ever has been written to memory, and the cache line can just be discarded and freed up for future use.<br />
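The stackf/implicit-zero behavior can be modeled roughly like this (a sketch; the line size, the marker set, and the per-byte written flags stand in for the implicit zero registers and cache hardware):

```python
LINE = 64   # assumed cache line size in bytes

class StackFrameModel:
    def __init__(self):
        self.marked = set()   # lines allocated by stackf, no cache line yet
        self.lines = {}       # materialized lines: base -> (data, written flags)

    def stackf(self, base, n_lines):
        """Allocate a frame by just marking its cache lines."""
        for i in range(n_lines):
            self.marked.add(base + i * LINE)

    def load(self, addr):
        base, off = addr - addr % LINE, addr % LINE
        if base in self.lines:
            data, written = self.lines[base]
            return data[off] if written[off] else 0   # unwritten bytes read as 0
        return 0 if base in self.marked else None     # marked-only line: pure zero

    def store(self, addr, value):
        base, off = addr - addr % LINE, addr % LINE
        if base not in self.lines:        # first store materializes the line
            self.lines[base] = (bytearray(LINE), [False] * LINE)
        data, written = self.lines[base]
        data[off] = value
        written[off] = True

frame = StackFrameModel()
frame.stackf(0x1000, 2)
print(frame.load(0x1008))                      # 0, before any cache line exists
frame.store(0x1008, 7)
print(frame.load(0x1008), frame.load(0x1009))  # 7 0
```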
<br />
== Data Cache ==<br />
<br />
All data caches and shared caches have 9 bits per byte. The additional bit is the valid bit. Whenever a new cache line is allocated, always because of a store to a new location, the new value is set for the bytes of the store and their valid bits are set. All other bytes remain invalid.<br /><br />
<br />
=== Backless Memory ===<br />
<br />
[[File:Cache-lines.png]]<br />
<br />
All stores are to the top of the data cache, and thus are neither write-back nor write-through. And, by definition, stores cannot miss the cache and don't involve memory. Cache lines can be evicted to lower levels. And when they are loaded again they are hoisted up again. When there are cache lines for the same address on multiple levels, they get merged on evict or hoist, with the upper level winning if both bytes are valid. In the above example, after the width 16 load, you would have the full merged string "StKill the wabbit\0" on the top level cache.<br />
<br />
All this usually happens without any physical memory involvement, completely in cache. It is backless memory and a vast improvement in all access times. And because all this happens in cache with the valid bit mechanisms, there are also no alignment penalties for loads and stores of data types of different widths. The load and store operations only support power of two widths for the data, but they can be on addresses of any alignment without penalty.<br />
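The byte-wise merge with valid bits can be sketched directly (using the "Kill the wabbit" example from above; the line length and the offsets of the two pieces are chosen to fit it):

```python
def merge_lines(upper, lower):
    """Merge two cache lines for the same address on evict or hoist.
    Each entry is (byte, valid); where both are valid, the upper level wins."""
    return [u if u[1] else l for u, l in zip(upper, lower)]

# The upper level holds a fresh store of "St"; the lower level line holds
# "Kill the wabbit\0" at the following offsets, the rest is invalid.
upper = [(ord('S'), True), (ord('t'), True)] + [(0, False)] * 16
lower = [(0, False)] * 2 + [(ord(c), True) for c in "Kill the wabbit\0"]
merged = merge_lines(upper, lower)
print(bytes(b for b, v in merged if v))   # b'StKill the wabbit\x00'
```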
<br />
If there are still invalid bytes left at the lowest cache level, and there is a <abbr title="Page Table Entry">PTE</abbr> for the cache line, then of course the remaining bytes are taken from memory, and the line is hoisted from memory and merged. But for that to happen a line first has to be completely evicted to physical memory, and then new writes without intermediate loads have to have created new lines in cache for the addresses in the line.<br /><br />
As a result the cases where actual access to physical memory is necessary have been vastly reduced. And often the temporary data in smaller subroutines never gets into physical memory at all, the whole lifetime of the objects has been spent in cache.<br />
<br />
=== Memory Allocation ===<br />
<br />
Only when a lowest level cache line is evicted does an actual memory page get allocated. And even this happens completely in hardware, with cache line sized pages and a bitmap allocator drawing from a hierarchy of larger pages. Still invalid bits are set to 0 by the MMU.<br /><br />
It is those larger pages that are managed by the OS in traps raised by the <abbr title="Memory Management Unit">MMU</abbr> when it runs low on backing memory pages.<br /><br />
A big advantage of this allocation behavior is that in the vast majority of cases you only write into memory once a larger number of writes has accumulated in cache, and they are written all at once. And in contrast to write-back caches, where this is also the case, you don't always need to evict a cache line on a read miss, because you can often just merge the memory backing into the invalid bytes of the existing cache line. Only a read from a truly cold and unpredicted, and thus unprefetched, line triggers an evict that causes a stall. A store into a cold line triggers an evict too, after it has cascaded down the caches, but this evict almost never causes a stall, since the evicted line is most likely a cold line.<br />
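The hardware-side bitmap allocator can be sketched as follows (the page and line sizes are assumptions; the real allocator draws cache line sized pages from a hierarchy of larger, OS-managed pages):

```python
class BitmapAllocator:
    """Hands out cache line sized backing pages from one larger page."""
    def __init__(self, base, page_size=4096, line_size=64):
        self.base = base
        self.line_size = line_size
        self.bitmap = [False] * (page_size // line_size)   # False = free

    def alloc_line(self):
        for i, used in enumerate(self.bitmap):
            if not used:
                self.bitmap[i] = True
                return self.base + i * self.line_size
        return None   # page exhausted: the MMU would trap to the OS here

alloc = BitmapAllocator(base=0x40000)
a, b = alloc.alloc_line(), alloc.alloc_line()
print(hex(a), hex(b))   # 0x40000 0x40040
```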
<br />
== Instruction Cache ==<br />
<br />
The instruction caches are of course read only. And as mentioned before, they are specialized to their instruction stream. This means they are managed differently from the data caches, to facilitate better instruction [[Prediction#Prefetch_and_Fetch|prefetch]] and [[Decode]] without bubbles in the pipeline. More on this on the respective pages.<br />
<br />
== Sequential Consistency ==<br />
<br />
All memory accesses happen in the order they occur in the program. This is sequential consistency. No access reordering happens, and consequently there is no need for memory fences and the like.<br /><br />
Loads and stores may be placed in the same instruction or retire in the same cycle, and as such are issued and executed in parallel. But the order they retire in is still determined by the order they appear in the instruction, and as such by the order of the [[Slot]]s they were issued into.<br /><br />
This order is not only maintained on a single core, but a defined order for all cores on a chip is maintained with the cache coherency protocol.<br />
<br />
== [[Spiller]] ==<br />
<br />
The spiller is a dedicated central hardware module that preserves internal core state for as long as it may be needed. As such it may save internal core state to dedicated DRAM areas, the spiller space. This memory is not accessible by any other mechanism, and no other hardware mechanism can interfere with the spiller and meddle with its internal state. Spiller memory accesses therefore don't need to go through the [[Protection]] layer, since no one can make the spiller do anything insecure. It uses the L2 cache as a buffer, and of course everything still goes through address translation, because special system tools like [[Debugger]]s occasionally need to read spiller state.<br /><br />
<br />
== [[Streamer]] ==<br />
<br />
== Rationale ==<br />
<br />
Memory latency is the main bottleneck that dictates how modern processors are designed. Memory latency is the reason why all the expensive out-of-order hardware is so prevalent on virtually all general purpose processors since the 60s. If anything the steadily increasing gap in frequency between memory and processor cores makes the latency even more felt today.<br />
<br />
So hiding the latency of memory accesses and reducing the number of memory accesses are the primary goals in any processor architecture. Both are mainly achieved with the use of caches. Sophisticated [[Prediction]] and prefetch fill the caches as far in advance as possible. [[Instruction Set/Load|Load]] and [[Instruction Set/Store|store]] deferring and [[Pipelining]] and [[Speculation]] hide the cache latencies, and increase the level of <abbr title="Instruction Level Parallelism">ILP</abbr> for loads and stores by making memory accesses less dependent on each other. And the cache management protocols determine the number of actual memory accesses.<br />
<br />
All three aspects have new solutions on the Mill. Generally those solutions are not really more powerful or faster than the solutions of conventional out-of-order architectures; they are just vastly cheaper. Truly random and unpredictable workloads still can't be helped, though.<br />
<br />
== See Also ==<br />
<br />
[[Spiller]], [[Virtual Addresses]]<br />
<br />
== Media ==<br />
[http://www.youtube.com/watch?v=bjRDaaGlER8 Presentation on the Memory Hierarchy by Ivan Godard] - [http://millcomputing.com/blog/wp-content/uploads/2013/12/2013-10-16_mill_cpu_hierarchy_08.pptx Slides]</div>Janhttp://millcomputing.com/wiki/NaRNaR2014-11-06T20:05:03Z<p>Jan: Created page with "NaR means "Not a Result". They primarily serve the role of traps in traditional architecures. The big difference is, they are speculable. Belt values are read-only as lon..."</p>
<hr />
<div>NaR means "Not a Result". NaRs primarily serve the role of traps in traditional architectures. The big difference is that they are speculable.<br />
<br />
[[Belt]] values are read-only for as long as the value exists. Unlike registers, belt positions are not reused for new values. This means that when anything goes wrong in an operation, it isn't indicated in some global state register; instead a special value with special [[Metadata]] is stored in the scheduled belt location and sits there without causing any trouble.<br /><br />
Almost all operations in the architecture are [[Speculation|speculative]], which means when they encounter a NaR as a parameter, they just propagate it to the result.<br /><br />
Only when a non-speculative realizing operation, usually a store or a branch or a call, encounters a NaR and would change global state with it, is a fault or trap [[Event]] raised. Events go through handler tables as is customary.<br />
<br />
== None ==<br />
<br />
The one exception to this is the special None NaR, which is just quietly ignored, i.e. whenever a realizing operation encounters a None operand, it does nothing.<br />
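The propagate-then-fault behavior, including the None exception, can be modeled in a few lines (a toy model; the NaR kinds and the fault mechanism are stand-ins for the real metadata and Event machinery):

```python
class NaR:
    """Not-a-Result marker carried as a belt value."""
    def __init__(self, kind):
        self.kind = kind

NONE = NaR("None")   # the special None NaR, quietly ignored by realizing ops

def spec_add(a, b):
    """Speculative operation: propagates a NaR operand to the result."""
    if isinstance(a, NaR):
        return a
    if isinstance(b, NaR):
        return b
    return a + b

def store(memory, addr, value):
    """Realizing operation: skips None, faults on any other NaR."""
    if value is NONE:
        return                                       # quietly does nothing
    if isinstance(value, NaR):
        raise RuntimeError("fault: " + value.kind)   # raises an Event
    memory[addr] = value

mem = {}
store(mem, 0, spec_add(1, 2))   # normal store: mem[0] == 3
store(mem, 1, NONE)             # ignored, nothing stored at address 1
try:
    store(mem, 2, spec_add(NaR("Integer Overflow"), 5))
except RuntimeError as e:
    print(e)                    # fault: Integer Overflow
```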
<br />
== Possible NaRs ==<br />
<br />
* <b>None</b> - usually explicitly created for data flow reasons, is ignored<br />
* <b>Address Overflow</b><br />
* <b>Bit Number Too Big</b> - when a value of the [[Domains#n|bit count]] pseudo domain doesn't fit into the width of the other argument<br />
* <b>Floating Point Division By Zero</b><br />
* <b>Floating Point Inexact</b> - IEEE 754 inexact<br />
* <b>Floating Point Invalid</b> - IEEE 754 Invalid<br />
* <b>Floating Point Overflow</b> - IEEE 754 Overflow<br />
* <b>Floating Point Underflow</b> - IEEE 754 Underflow<br />
* <b>Index Out Of Range</b> - when a value of the [[Domains#i|index count]] pseudo domain doesn't fit into the vector width of the other argument<br />
* <b>Integer Overflow</b><br />
* <b>Invalid Address</b><br />
* <b>Invalid Belt Operand</b> - a new [[Belt]] frame usually has a few uninitialized values at the end, accessing them raises this<br />
* <b>Invalid MMIO Access</b><br />
* <b>Invalid Special Register</b><br />
* <b>Invalid This</b><br />
* <b>Mismatched Sizes</b><br />
* <b>Multiply Accepted Address</b><br />
* <b>Must Be Scalar</b><br />
* <b>Not Narrowable</b><br />
* <b>Not Widenable</b><br />
* <b>Null Pointer</b><br />
* <b>Pointer Overflow</b><br />
* <b>Unaccepted Address</b><br />
* <b>Undefined Value</b><br />
* <b>User</b></div>Janhttp://millcomputing.com/wiki/Cores/CopperCores/Copper2014-11-05T22:16:58Z<p>Jan: Created page with "{{DISPLAYTITLE:Copper Core}} <b>Cores:</b> Tin&nbsp;Copper&nbsp;Silver&nbsp;Gold&nbsp;Cores/Decimal8|Decim..."</p>
<hr />
<div>{{DISPLAYTITLE:Copper Core}}<br />
<b>[[Cores]]:</b> [[Cores/Tin|Tin]]&nbsp;[[Cores/Copper|Copper]]&nbsp;[[Cores/Silver|Silver]]&nbsp;[[Cores/Gold|Gold]]&nbsp;[[Cores/Decimal8|Decimal8]]&nbsp;[[Cores/Decimal16|Decimal16]]&nbsp;<br />
<br />
The Copper core isn't much bigger than Tin, but here both flow and exu slots are properly populated with functional units, so instruction level parallelism is approximately doubled in comparison. Mobile devices, low power servers or smart devices like printers are the expected primary targets.<br />
<br />
<br />
<b>[[Belt]]</b>: 8&nbsp;&nbsp;<b>[[Decode#Morsel|Morsel]]</b>: 3bit&nbsp;&nbsp;<b>[[Operands|Scalar Width]]</b>: 64bit&nbsp;&nbsp;<b>[[Operands|Operand Maximum Size]]</b>: 8B&nbsp;&nbsp;<br />
<br />
<b>[[Pipeline]]s</b>: 13&nbsp;&nbsp;<b>[[Retire Station]]s</b>: 8&nbsp;&nbsp;<b>[[Scratchpad]]</b>: 128B&nbsp;&nbsp;<br />
<br />
<b>[[Spiller|Spill Buffers]]</b>: 8&nbsp;&nbsp;<b>[[Spiller|Spiller Stack Size]]</b>: 16MB&nbsp;&nbsp;<br />
<br />
<b>[[Memory#Instruction_Cache|iCache Line]]</b>: 16B&nbsp;&nbsp;<br />
<br />
<b>2 reader slots</b>, 9bits wide&nbsp;&nbsp;&nbsp;<b>2 writer slots</b>, 6bits wide&nbsp;&nbsp;&nbsp;<b>1 pick slot</b>, 10bits wide&nbsp;&nbsp;&nbsp;<br />
<br />
<b>exu slot 0</b>, 18bits wide, with functional units: [[Functional Unit#alu|alu]]&nbsp;[[Functional Unit#count|count]]&nbsp;[[Functional Unit#mul|mul]]&nbsp;[[Functional Unit#nope|nope]]&nbsp;[[Functional Unit#shift|shift]]&nbsp;[[Functional Unit#shuffle|shuffle]]&nbsp;<br />
<br />
<b>exu slot 1</b>, 16bits wide, with functional units: [[Functional Unit#alu|alu]]&nbsp;[[Functional Unit#exuArgs|exuArgs]]&nbsp;<br />
<br />
<b>flow slot 0</b>, 15bits wide, with functional units: [[Functional Unit#cache|cache]]&nbsp;[[Functional Unit#con|con]]&nbsp;[[Functional Unit#conform|conform]]&nbsp;[[Functional Unit#control|control]]&nbsp;[[Functional Unit#ls|ls]]&nbsp;[[Functional Unit#misc|misc]]&nbsp;[[Functional Unit#nopf|nopf]]&nbsp;<br />
<br />
<b>flow slot 1</b>, 15bits wide, with functional units: [[Functional Unit#con|con]]&nbsp;[[Functional Unit#conform|conform]]&nbsp;[[Functional Unit#control|control]]&nbsp;[[Functional Unit#flowArgs|flowArgs]]&nbsp;[[Functional Unit#ls|ls]]&nbsp;<br />
<br />
<br />
<br />
[[Cores/Copper/Encoding|Operation Encoding]]</div>Janhttp://millcomputing.com/wiki/Cores/SilverCores/Silver2014-11-05T22:16:57Z<p>Jan: Created page with "{{DISPLAYTITLE:Silver Core}} <b>Cores:</b> Tin&nbsp;Copper&nbsp;Silver&nbsp;Gold&nbsp;Cores/Decimal8|Decim..."</p>
<hr />
<div>{{DISPLAYTITLE:Silver Core}}<br />
<b>[[Cores]]:</b> [[Cores/Tin|Tin]]&nbsp;[[Cores/Copper|Copper]]&nbsp;[[Cores/Silver|Silver]]&nbsp;[[Cores/Gold|Gold]]&nbsp;[[Cores/Decimal8|Decimal8]]&nbsp;[[Cores/Decimal16|Decimal16]]&nbsp;<br />
<br />
The Silver core offers a good amount of parallelism and also has native floating point arithmetic. It could power a typical desktop or laptop computer, or a mid-range server with a more computation-heavy workload.<br />
<br />
<br />
<b>[[Belt]]</b>: 16&nbsp;&nbsp;<b>[[Decode#Morsel|Morsel]]</b>: 4bit&nbsp;&nbsp;<b>[[Operands|Scalar Width]]</b>: 64bit&nbsp;&nbsp;<b>[[Operands|Operand Maximum Size]]</b>: 16B&nbsp;&nbsp;<br />
<br />
<b>[[Pipeline]]s</b>: 25&nbsp;&nbsp;<b>[[Retire Station]]s</b>: 16&nbsp;&nbsp;<b>[[Scratchpad]]</b>: 256B&nbsp;&nbsp;<br />
<br />
<b>[[Spiller|Spill Buffers]]</b>: 16&nbsp;&nbsp;<b>[[Spiller|Spiller Stack Size]]</b>: 256MB&nbsp;&nbsp;<br />
<br />
<b>[[Memory#Instruction_Cache|iCache Line]]</b>: 32B&nbsp;&nbsp;<br />
<br />
<b>6 reader slots</b>, 10bits wide&nbsp;&nbsp;&nbsp;<b>5 writer slots</b>, 7bits wide&nbsp;&nbsp;&nbsp;<b>2 pick slots</b>, 13bits wide&nbsp;&nbsp;&nbsp;<br />
<br />
<b>exu slot 0</b>, 20bits wide, with functional units: [[Functional Unit#alu|alu]]&nbsp;[[Functional Unit#bfp|bfp]]&nbsp;[[Functional Unit#bfpm|bfpm]]&nbsp;[[Functional Unit#bfpmas|bfpmas]]&nbsp;[[Functional Unit#count|count]]&nbsp;[[Functional Unit#mul|mul]]&nbsp;[[Functional Unit#nope|nope]]&nbsp;[[Functional Unit#shift|shift]]&nbsp;[[Functional Unit#shuffle|shuffle]]&nbsp;<br />
<br />
<b>exu slot 1</b>, 20bits wide, with functional units: [[Functional Unit#alu|alu]]&nbsp;[[Functional Unit#bfp|bfp]]&nbsp;[[Functional Unit#bfpm|bfpm]]&nbsp;[[Functional Unit#exuArgs|exuArgs]]&nbsp;[[Functional Unit#mul|mul]]&nbsp;[[Functional Unit#shift|shift]]&nbsp;<br />
<br />
<b>exu slot 2</b>, 16bits wide, with functional units: [[Functional Unit#alu|alu]]&nbsp;[[Functional Unit#exuArgs|exuArgs]]&nbsp;<br />
<br />
<b>exu slot 3</b>, 16bits wide, with functional units: [[Functional Unit#alu|alu]]&nbsp;[[Functional Unit#exuArgs|exuArgs]]&nbsp;<br />
<br />
<b>flow slot 0</b>, 16bits wide, with functional units: [[Functional Unit#cache|cache]]&nbsp;[[Functional Unit#con|con]]&nbsp;[[Functional Unit#conform|conform]]&nbsp;[[Functional Unit#control|control]]&nbsp;[[Functional Unit#ls|ls]]&nbsp;[[Functional Unit#lsb|lsb]]&nbsp;[[Functional Unit#nopf|nopf]]&nbsp;<br />
<br />
<b>flow slot 1</b>, 16bits wide, with functional units: [[Functional Unit#con|con]]&nbsp;[[Functional Unit#conform|conform]]&nbsp;[[Functional Unit#control|control]]&nbsp;[[Functional Unit#flowArgs|flowArgs]]&nbsp;[[Functional Unit#ls|ls]]&nbsp;[[Functional Unit#lsb|lsb]]&nbsp;[[Functional Unit#misc|misc]]&nbsp;<br />
<br />
<b>flow slot 2</b>, 16bits wide, with functional units: [[Functional Unit#con|con]]&nbsp;[[Functional Unit#conform|conform]]&nbsp;[[Functional Unit#control|control]]&nbsp;[[Functional Unit#flowArgs|flowArgs]]&nbsp;[[Functional Unit#ls|ls]]&nbsp;[[Functional Unit#lsb|lsb]]&nbsp;[[Functional Unit#misc|misc]]&nbsp;<br />
<br />
<b>flow slot 3</b>, 14bits wide, with functional units: [[Functional Unit#con|con]]&nbsp;[[Functional Unit#flowArgs|flowArgs]]&nbsp;[[Functional Unit#ls|ls]]&nbsp;[[Functional Unit#lsb|lsb]]&nbsp;[[Functional Unit#misc|misc]]&nbsp;<br />
<br />
<br />
<br />
[[Cores/Silver/Encoding|Operation Encoding]]</div>Janhttp://millcomputing.com/wiki/Cores/Copper.txtCores/Copper.txt2014-11-05T22:16:56Z<p>Jan: Created page with "**Belt**: beltSize&nbsp;&nbsp;**Morsel**: morselWidth&nbsp;&nbsp;"</p>
<hr />
<div>**Belt**: beltSize&nbsp;&nbsp;**Morsel**: morselWidth&nbsp;&nbsp;</div>Janhttp://millcomputing.com/wiki/Cores/Decimal16Cores/Decimal162014-11-05T22:16:55Z<p>Jan: Created page with "{{DISPLAYTITLE:Decimal16 Core}} <b>Cores:</b> Tin&nbsp;Copper&nbsp;Silver&nbsp;Gold&nbsp;Cores/Decimal8|De..."</p>
<hr />
<div>{{DISPLAYTITLE:Decimal16 Core}}<br />
<b>[[Cores]]:</b> [[Cores/Tin|Tin]]&nbsp;[[Cores/Copper|Copper]]&nbsp;[[Cores/Silver|Silver]]&nbsp;[[Cores/Gold|Gold]]&nbsp;[[Cores/Decimal8|Decimal8]]&nbsp;[[Cores/Decimal16|Decimal16]]&nbsp;<br />
<br />
The Decimal cores natively implement the [http://en.wikipedia.org/wiki/IEEE_floating_point IEEE 754] decimal floating point formats. Decimal arithmetic is primarily useful in financial and economic computations, and that is the intended area of use for these cores: mainframes and servers for trading, banking, and bookkeeping.<br />
<br />
Decimal16 natively implements both 64bit and 128bit decimal operations.<br />
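Why dedicated decimal hardware matters can be illustrated in software: binary floating point cannot represent most decimal fractions exactly, which is unacceptable for bookkeeping, while decimal formats can. This sketch uses Python's <code>decimal</code> module purely as a stand-in for what a Decimal core would compute natively in hardware:

```python
from decimal import Decimal

# Binary double: 0.10 + 0.20 picks up representation error,
# because 0.1 and 0.2 have no exact binary expansion.
print(0.10 + 0.20 == 0.30)  # False

# Decimal arithmetic: exact, as financial code requires.
print(Decimal("0.10") + Decimal("0.20") == Decimal("0.30"))  # True
```

On a Decimal8 or Decimal16 core, such operations would run as single native instructions rather than through a software library.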
<br />
<br />
<b>[[Belt]]</b>: 16&nbsp;&nbsp;<b>[[Decode#Morsel|Morsel]]</b>: 4bit&nbsp;&nbsp;<b>[[Operands|Scalar Width]]</b>: 64bit&nbsp;&nbsp;<b>[[Operands|Operand Maximum Size]]</b>: 16B&nbsp;&nbsp;<br />
<br />
<b>[[Pipeline]]s</b>: 25&nbsp;&nbsp;<b>[[Retire Station]]s</b>: 16&nbsp;&nbsp;<b>[[Scratchpad]]</b>: 256B&nbsp;&nbsp;<br />
<br />
<b>[[Spiller|Spill Buffers]]</b>: 16&nbsp;&nbsp;<b>[[Spiller|Spiller Stack Size]]</b>: 256MB&nbsp;&nbsp;<br />
<br />
<b>[[Memory#Instruction_Cache|iCache Line]]</b>: 32B&nbsp;&nbsp;<br />
<br />
<b>6 reader slots</b>, 10bits wide&nbsp;&nbsp;&nbsp;<b>5 writer slots</b>, 7bits wide&nbsp;&nbsp;&nbsp;<b>2 pick slots</b>, 13bits wide&nbsp;&nbsp;&nbsp;<br />
<br />
<b>exu slot 0</b>, 20bits wide, with functional units: [[Functional Unit#alu|alu]]&nbsp;[[Functional Unit#count|count]]&nbsp;[[Functional Unit#dfp|dfp]]&nbsp;[[Functional Unit#dfpm|dfpm]]&nbsp;[[Functional Unit#dfpmas|dfpmas]]&nbsp;[[Functional Unit#mul|mul]]&nbsp;[[Functional Unit#nope|nope]]&nbsp;[[Functional Unit#shift|shift]]&nbsp;[[Functional Unit#shuffle|shuffle]]&nbsp;<br />
<br />
<b>exu slot 1</b>, 20bits wide, with functional units: [[Functional Unit#alu|alu]]&nbsp;[[Functional Unit#dfp|dfp]]&nbsp;[[Functional Unit#dfpm|dfpm]]&nbsp;[[Functional Unit#exuArgs|exuArgs]]&nbsp;[[Functional Unit#mul|mul]]&nbsp;[[Functional Unit#shift|shift]]&nbsp;<br />
<br />
<b>exu slot 2</b>, 16bits wide, with functional units: [[Functional Unit#alu|alu]]&nbsp;[[Functional Unit#exuArgs|exuArgs]]&nbsp;<br />
<br />
<b>exu slot 3</b>, 16bits wide, with functional units: [[Functional Unit#alu|alu]]&nbsp;[[Functional Unit#exuArgs|exuArgs]]&nbsp;<br />
<br />
<b>flow slot 0</b>, 16bits wide, with functional units: [[Functional Unit#cache|cache]]&nbsp;[[Functional Unit#con|con]]&nbsp;[[Functional Unit#conform|conform]]&nbsp;[[Functional Unit#control|control]]&nbsp;[[Functional Unit#ls|ls]]&nbsp;[[Functional Unit#lsd|lsd]]&nbsp;[[Functional Unit#nopf|nopf]]&nbsp;<br />
<br />
<b>flow slot 1</b>, 16bits wide, with functional units: [[Functional Unit#con|con]]&nbsp;[[Functional Unit#conform|conform]]&nbsp;[[Functional Unit#control|control]]&nbsp;[[Functional Unit#flowArgs|flowArgs]]&nbsp;[[Functional Unit#ls|ls]]&nbsp;[[Functional Unit#lsd|lsd]]&nbsp;[[Functional Unit#misc|misc]]&nbsp;<br />
<br />
<b>flow slot 2</b>, 16bits wide, with functional units: [[Functional Unit#con|con]]&nbsp;[[Functional Unit#conform|conform]]&nbsp;[[Functional Unit#control|control]]&nbsp;[[Functional Unit#flowArgs|flowArgs]]&nbsp;[[Functional Unit#ls|ls]]&nbsp;[[Functional Unit#lsd|lsd]]&nbsp;[[Functional Unit#misc|misc]]&nbsp;<br />
<br />
<b>flow slot 3</b>, 14bits wide, with functional units: [[Functional Unit#con|con]]&nbsp;[[Functional Unit#flowArgs|flowArgs]]&nbsp;[[Functional Unit#ls|ls]]&nbsp;[[Functional Unit#lsd|lsd]]&nbsp;[[Functional Unit#misc|misc]]&nbsp;<br />
<br />
<br />
<br />
[[Cores/Decimal16/Encoding|Operation Encoding]]</div>Janhttp://millcomputing.com/wiki/Cores/Silver.txtCores/Silver.txt2014-11-05T22:16:55Z<p>Jan: Created page with "**Belt**: beltSize&nbsp;&nbsp;**Morsel**: morselWidth&nbsp;&nbsp;"</p>
<hr />
<div>**Belt**: beltSize&nbsp;&nbsp;**Morsel**: morselWidth&nbsp;&nbsp;</div>Janhttp://millcomputing.com/wiki/Cores/Tin.txtCores/Tin.txt2014-11-05T22:16:54Z<p>Jan: Created page with "**Belt**: beltSize&nbsp;&nbsp;**Morsel**: morselWidth&nbsp;&nbsp;"</p>
<hr />
<div>**Belt**: beltSize&nbsp;&nbsp;**Morsel**: morselWidth&nbsp;&nbsp;</div>Janhttp://millcomputing.com/wiki/Cores/Decimal16.txtCores/Decimal16.txt2014-11-05T22:16:53Z<p>Jan: Created page with "**Belt**: beltSize&nbsp;&nbsp;**Morsel**: morselWidth&nbsp;&nbsp;"</p>
<hr />
<div>**Belt**: beltSize&nbsp;&nbsp;**Morsel**: morselWidth&nbsp;&nbsp;</div>Janhttp://millcomputing.com/wiki/Cores/TinCores/Tin2014-11-05T22:16:52Z<p>Jan: Created page with "{{DISPLAYTITLE:Tin Core}} <b>Cores:</b> Tin&nbsp;Copper&nbsp;Silver&nbsp;Gold&nbsp;Cores/Decimal8|Decimal8..."</p>
<hr />
<div>{{DISPLAYTITLE:Tin Core}}<br />
<b>[[Cores]]:</b> [[Cores/Tin|Tin]]&nbsp;[[Cores/Copper|Copper]]&nbsp;[[Cores/Silver|Silver]]&nbsp;[[Cores/Gold|Gold]]&nbsp;[[Cores/Decimal8|Decimal8]]&nbsp;[[Cores/Decimal16|Decimal16]]&nbsp;<br />
<br />
Tin is the smallest viable general purpose chip. It has only one properly populated exu slot and one properly populated flow slot, so instruction level parallelism is quite limited. It also doesn't support wider vector operands. It is extremely low power though, so it would lend itself to ultra-mobile devices, or it could serve as an overgrown micro controller.<br />
<br />
<br />
<b>[[Belt]]</b>: 8&nbsp;&nbsp;<b>[[Decode#Morsel|Morsel]]</b>: 3bit&nbsp;&nbsp;<b>[[Operands|Scalar Width]]</b>: 64bit&nbsp;&nbsp;<b>[[Operands|Operand Maximum Size]]</b>: 8B&nbsp;&nbsp;<br />
<br />
<b>[[Pipeline]]s</b>: 13&nbsp;&nbsp;<b>[[Retire Station]]s</b>: 8&nbsp;&nbsp;<b>[[Scratchpad]]</b>: 128B&nbsp;&nbsp;<br />
<br />
<b>[[Spiller|Spill Buffers]]</b>: 8&nbsp;&nbsp;<b>[[Spiller|Spiller Stack Size]]</b>: 16MB&nbsp;&nbsp;<br />
<br />
<b>[[Memory#Instruction_Cache|iCache Line]]</b>: 16B&nbsp;&nbsp;<br />
<br />
<b>2 reader slots</b>, 9bits wide&nbsp;&nbsp;&nbsp;<b>2 writer slots</b>, 6bits wide&nbsp;&nbsp;&nbsp;<b>1 pick slot</b>, 10bits wide&nbsp;&nbsp;&nbsp;<br />
<br />
<b>exu slot 0</b>, 18bits wide, with functional units: [[Functional Unit#alu|alu]]&nbsp;[[Functional Unit#count|count]]&nbsp;[[Functional Unit#mul|mul]]&nbsp;[[Functional Unit#nope|nope]]&nbsp;[[Functional Unit#shift|shift]]&nbsp;[[Functional Unit#shuffle|shuffle]]&nbsp;<br />
<br />
<b>exu slot 1</b>, 7bits wide, with functional units: [[Functional Unit#cc|cc]]&nbsp;[[Functional Unit#exuArgs|exuArgs]]&nbsp;<br />
<br />
<b>flow slot 0</b>, 15bits wide, with functional units: [[Functional Unit#cache|cache]]&nbsp;[[Functional Unit#con|con]]&nbsp;[[Functional Unit#conform|conform]]&nbsp;[[Functional Unit#control|control]]&nbsp;[[Functional Unit#ls|ls]]&nbsp;[[Functional Unit#misc|misc]]&nbsp;[[Functional Unit#nopf|nopf]]&nbsp;<br />
<br />
<b>flow slot 1</b>, 9bits wide, with functional units: [[Functional Unit#con|con]]&nbsp;[[Functional Unit#flowArgs|flowArgs]]&nbsp;<br />
<br />
<br />
<br />
[[Cores/Tin/Encoding|Operation Encoding]]</div>Janhttp://millcomputing.com/wiki/Cores/Decimal8.txtCores/Decimal8.txt2014-11-05T22:16:51Z<p>Jan: Created page with "**Belt**: beltSize&nbsp;&nbsp;**Morsel**: morselWidth&nbsp;&nbsp;"</p>
<hr />
<div>**Belt**: beltSize&nbsp;&nbsp;**Morsel**: morselWidth&nbsp;&nbsp;</div>Jan