Architecture


Introduction

The Mill architecture is a general purpose processor architecture paradigm in the sense that a stack machine or a RISC processor is a processor architecture paradigm.

It is also a processor family architecture in the sense that x86 or ARM are processor family architectures.

To briefly classify the Mill architecture: it is a statically scheduled, wide-issue, in-order Belt architecture, i.e., as in a DSP, all instructions are issued in the order in which they appear in the binary instruction stream.

This approach traditionally has problems dealing with common general purpose workloads and control flows like branches, particularly while-loop execution, as well as with hiding Memory access latency. Those problems have been addressed, and so static scheduling by the Compiler offloads most of the work that traditional hardware has to redo on every cycle into tasks done once, at compile time. This is where most of the power savings and performance gains come from in comparison to traditional general purpose architectures.
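
To make the once-at-compile-time idea concrete, here is a minimal C sketch of how a static scheduler can hide a load's latency behind independent work. The three-cycle load latency and the cycle annotations describe a hypothetical in-order machine, not any particular Mill member.

  #include <stdio.h>

  /* Sketch: how a static scheduler can hide load latency.
   * Assume a hypothetical in-order machine where a load takes 3 cycles.
   * The compiler hoists the load above independent work so the result
   * arrives exactly when it is needed -- no hardware scoreboard required. */
  static int sum_with_hidden_load(const int *p, int a, int b, int c)
  {
      int loaded = *p;        /* cycle 0: issue load, result due in cycle 3  */
      int t1 = a + b;         /* cycle 1: independent add, overlaps the load */
      int t2 = t1 + c;        /* cycle 2: independent add, overlaps the load */
      return t2 + loaded;     /* cycle 3: load retires just in time          */
  }

  int main(void)
  {
      int x = 42;
      printf("%d\n", sum_with_hidden_load(&x, 1, 2, 3)); /* prints 48 */
      return 0;
  }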


[Diagram: Mill architecture overview, linking Streamer, Prediction, Protection, Decode, Metadata, Belt, ExuCore, FlowCore, Registers, Scratchpad, Spiller, and Memory]


Overview

This could be described as the general design philosophy behind the Mill: remove from the chip, as much as possible, anything that doesn't directly contribute to computation at runtime; perform those tasks once, and optimally, in the compiler; and use the freed die space for more computation units. This results in vastly improved single core performance through more instruction level parallelism, as well as more room for more cores.

There are quite a few hurdles that keep traditional architectures from actually utilizing the large amount of instruction level parallelism that many ALUs provide. Some of the most unique and innovative features of the Mill emerged from tackling those hurdles and bottlenecks.

  • The Belt, for example, is the result of having to provide many data sources and drains for all those computational units, interconnecting them without tripping over data dependencies and hazards, and without polynomial growth in interconnect space and power costs; a minimal belt sketch follows this list.
  • The unusual split-stream, variable-length, very wide issue Encoding makes it possible to feed all those ALUs with instructions in parallel, in a die-space and energy efficient way, with optimally computed code density (a decode sketch follows this list).
  • The Mill features an exposed latency pipeline: as is usual with static scheduling, the latencies of all operations are fixed and known to the compiler for optimal scheduling. There are no variable-latency operations, except load. And while the load latency may differ for each specific use, it is explicitly defined for each specific use. This makes it possible to statically schedule loads as well and hide almost all load latency (see the deferred-load sketch below).
  • Metadata encapsulates normally global program state in the operands themselves, eliminating side effects, exposing far more instruction level parallelism, simplifying the instruction set, and increasing code density (see the metadata sketch below).
  • Static scheduling further enables the extensive use of techniques like Phasing, Pipelining and Speculation; together with branch Prediction across several jumps with prefetch, and a very short pipeline, these minimize the occurrence and impact of stalls for unhindered Execution (a speculation sketch follows the list).
  • A new Memory access model, with caches working entirely on virtual addresses and the Protection mechanisms uncoupled from address translation, means a program never waits for address translation unless the wait is masked by a DRAM access anyway (see the region-check sketch below).
  • The Mill is a processor family with many different member processors built on very different hardware. It still provides a common binary program format that gets specialized for each specific processor on install. Any operations that are specific to only some of the processors, and might not be available in hardware on all members of the Mill family, are emulated in software, for full compatibility.
  • This is also true for any operations that cannot be implemented in hardware with fixed latency, like division. The compiler realizes them in terms of the other, real hardware operations, which often performs even better than traditional microcode implementations of such instructions (see the division sketch below).
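
As referenced above, here is a minimal sketch of belt semantics, assuming an invented belt length of 8 (real belt lengths vary by family member): each result drops onto the front of the belt, the oldest operand falls off the end, and operations name their inputs by position relative to the front instead of by register name. The C names here are illustrative only, not any real Mill interface.

  #include <stdio.h>

  #define BELT_LEN 8  /* belt length is fixed per member; 8 is an assumption */

  /* Minimal model of a belt: a circular buffer where results drop onto
   * the front, the oldest value falls off, and operands are addressed
   * by position relative to the front. */
  typedef struct {
      int slot[BELT_LEN];
      int front;                /* index of the newest drop */
  } Belt;

  static void belt_drop(Belt *b, int value)
  {
      b->front = (b->front + BELT_LEN - 1) % BELT_LEN; /* advance the front */
      b->slot[b->front] = value;                       /* oldest is overwritten */
  }

  static int belt_get(const Belt *b, int pos)          /* pos 0 = newest drop */
  {
      return b->slot[(b->front + pos) % BELT_LEN];
  }

  int main(void)
  {
      Belt b = { {0}, 0 };
      belt_drop(&b, 10);                                /* b0 = 10           */
      belt_drop(&b, 20);                                /* b0 = 20, b1 = 10  */
      belt_drop(&b, belt_get(&b, 0) + belt_get(&b, 1)); /* add b0,b1 -> 30   */
      printf("b0=%d b1=%d b2=%d\n",
             belt_get(&b, 0), belt_get(&b, 1), belt_get(&b, 2));
      /* prints b0=30 b1=20 b2=10 */
      return 0;
  }

Because every drop is a fresh, single-assignment value with a short positional lifetime, there is nothing to rename and no write hazard to track, which is what keeps the interconnect cost from growing polynomially.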
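
A toy model of split-stream fetch and decode, under the assumption that both streams share a single entry address and the two decoders walk away from it in opposite directions, so they can run in parallel. The block layout and instruction contents are invented placeholders, not a real Mill encoding.

  #include <stdio.h>

  /* Sketch of split-stream decode: one code block, one entry point, and
   * two decoders walking in opposite directions from it, each consuming
   * one half-bundle per cycle. */
  int main(void)
  {
      const char *code[8] = { "exu2", "exu1", "exu0",           /* below entry */
                              "flow0", "flow1", "flow2",        /* above entry */
                              "flow3", "flow4" };
      int entry = 3;                      /* both cursors start at the entry */
      int exu = entry - 1, flow = entry;  /* and move in opposite directions */

      for (int cycle = 0; cycle < 3; cycle++, exu--, flow++)
          printf("cycle %d: exu decodes %-5s | flow decodes %s\n",
                 cycle, code[exu], code[flow]);
      return 0;
  }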
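
A sketch of an explicitly scheduled load, assuming a load operation that names the cycle at which its result is to retire, with the compiler filling the intervening cycles with independent work. The retire-queue structure and cycle counts are illustrative assumptions.

  #include <stdio.h>

  #define MAX_CYCLES 16

  /* Sketch of a deferred load: the load states when its result is needed,
   * the value is picked up from memory at retire time, and every cycle in
   * between is free for independent work. */
  typedef struct { int pending; const int *addr; int value; } RetireSlot;

  static RetireSlot retire[MAX_CYCLES];

  static void issue_load(const int *addr, int now, int delay)
  {
      retire[now + delay].pending = 1;    /* schedule the retire cycle */
      retire[now + delay].addr = addr;
  }

  static void step(int cycle)
  {
      if (retire[cycle].pending) {
          retire[cycle].value = *retire[cycle].addr; /* value becomes visible */
          printf("cycle %d: load retires with %d\n", cycle, retire[cycle].value);
      } else {
          printf("cycle %d: independent work overlaps the load\n", cycle);
      }
  }

  int main(void)
  {
      int mem = 99;
      issue_load(&mem, 0, 3);   /* issue at cycle 0, retire at cycle 3 */
      for (int c = 0; c < 5; c++)
          step(c);
      return 0;
  }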
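
A sketch of operand metadata, assuming each value carries status bits alongside its payload: a failed speculative operation yields a NaR ("Not a Result") that simply flows through later operations instead of setting a global fault flag, faulting only if a non-speculative consumer such as a store retires it. Field names and layout are invented for illustration.

  #include <stdio.h>

  /* Sketch of per-operand metadata: errors become values, not global state. */
  typedef struct { int payload; int nar; } Operand;

  static Operand op_add(Operand a, Operand b)
  {
      Operand r = { a.payload + b.payload, a.nar || b.nar }; /* NaR propagates */
      return r;
  }

  static Operand op_div(Operand a, Operand b)
  {
      if (b.nar || b.payload == 0) {
          Operand r = { 0, 1 };            /* mark the result as NaR */
          return r;
      }
      Operand r = { a.payload / b.payload, a.nar };
      return r;
  }

  int main(void)
  {
      Operand x = { 10, 0 }, zero = { 0, 0 };
      Operand bad = op_div(x, zero);   /* speculative divide by zero: NaR  */
      Operand sum = op_add(bad, x);    /* no side effect, the NaR just flows */
      printf("sum is %s\n",
             sum.nar ? "NaR (faults only if a real consumer retires it)"
                     : "an ordinary value");
      return 0;
  }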
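
A sketch of the kind of speculation this enables: both arms of an if are computed unconditionally (on a wide machine, in parallel issue slots) and a select picks the wanted result, so the branch disappears. In plain C this is only a model; on a Mill-like machine the metadata above is what makes even potentially faulting arms safe to hoist like this.

  #include <stdio.h>

  /* Sketch of if-conversion via speculation: compute both arms, then select. */
  static int abs_branchless(int x)
  {
      int neg = -x;               /* speculative: computed whether needed or not */
      int pos = x;                /* both arms run in parallel on a wide machine */
      return (x < 0) ? neg : pos; /* a select, not a control-flow branch         */
  }

  int main(void)
  {
      printf("%d %d\n", abs_branchless(-5), abs_branchless(7)); /* prints 5 7 */
      return 0;
  }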
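
A sketch of protection checks decoupled from translation, assuming rights are attached to virtual-address regions and checked directly on access, so a cache hit never has to wait for address translation. The region table and rights bits are invented for illustration.

  #include <stdio.h>
  #include <stdint.h>

  /* Sketch of a region-based rights check on virtual addresses,
   * independent of any virtual-to-physical translation. */
  enum { R = 1, W = 2, X = 4 };

  typedef struct { uintptr_t base, bound; unsigned rights; } Region;

  static const Region regions[] = {
      { 0x1000, 0x2000, R | X },   /* code region */
      { 0x8000, 0x9000, R | W },   /* data region */
  };

  static int access_ok(uintptr_t vaddr, unsigned need)
  {
      for (unsigned i = 0; i < sizeof regions / sizeof regions[0]; i++)
          if (vaddr >= regions[i].base && vaddr < regions[i].bound)
              return (regions[i].rights & need) == need;
      return 0;                    /* no region grants access */
  }

  int main(void)
  {
      printf("write 0x8800: %s\n", access_ok(0x8800, W) ? "ok" : "fault");
      printf("write 0x1800: %s\n", access_ok(0x1800, W) ? "ok" : "fault");
      return 0;
  }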
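
Finally, a sketch of how a divide might be expanded into fixed-latency hardware operations (shift, subtract, compare) on a member without a hardware divider. The restoring algorithm shown is a generic textbook one, not the actual emulation sequence a specializer would emit.

  #include <stdio.h>
  #include <stdint.h>

  /* Restoring division built only from shifts, subtracts and compares:
   * one fixed-latency step per quotient bit. */
  static uint32_t udiv32(uint32_t n, uint32_t d, uint32_t *rem)
  {
      uint32_t q = 0, r = 0;
      for (int i = 31; i >= 0; i--) {
          r = (r << 1) | ((n >> i) & 1);  /* bring down the next dividend bit */
          if (r >= d) {                   /* trial subtract                   */
              r -= d;
              q |= 1u << i;
          }
      }
      *rem = r;
      return q;
  }

  int main(void)
  {
      uint32_t r;
      uint32_t q = udiv32(1000, 7, &r);
      printf("1000 / 7 = %u rem %u\n", q, r);  /* prints 142 rem 6 */
      return 0;
  }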