Memory

Much of the power and performance gains of the Mill, and also many of its security improvements over conventional architectures, come from the various facilities of its memory management. Most subsystems have their own dedicated pages; this page is an overview.

Overview


[Image: Memory/Cache Hierarchy diagram]


The Mill is a 64-bit architecture; there are no 32-bit Mills. For this reason it is possible, and indeed prudent, to adopt a single address space (SAS) memory model: all threads and processes share the same address space, and any address refers to the same location in every process. To do this securely and efficiently, memory access protection and address translation are split into two separate modules, whereas on conventional architectures these two tasks are conflated into one.
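
As a rough illustration of that split (this is not Mill hardware; the names and tables below are invented for the sketch), the model keeps one global address-to-data mapping shared by every protection domain (turf), while access rights live in a separate table that is checked independently of translation:

<syntaxhighlight lang="python">
memory = {}   # one global address space: address -> value
rights = {}   # protection table: (turf, lo, hi) -> set of permissions

def grant(turf, lo, hi, perms):
    """Give a turf permissions ('r', 'w') over the address range [lo, hi)."""
    rights[(turf, lo, hi)] = set(perms)

def allowed(turf, addr, perm):
    return any(lo <= addr < hi and perm in p
               for (t, lo, hi), p in rights.items() if t == turf)

def load(turf, addr):
    if not allowed(turf, addr, 'r'):
        raise PermissionError(f"turf {turf} may not read {addr:#x}")
    return memory.get(addr, 0)        # the same location for every turf

def store(turf, addr, value):
    if not allowed(turf, addr, 'w'):
        raise PermissionError(f"turf {turf} may not write {addr:#x}")
    memory[addr] = value

# A pointer created in one turf means the same thing in another turf,
# provided the other turf has been granted access to the region.
grant(1, 0x1000, 0x2000, 'rw')   # turf 1: read/write
grant(2, 0x1000, 0x2000, 'r')    # turf 2: read only
store(1, 0x1800, 42)
assert load(2, 0x1800) == 42
</syntaxhighlight>

Because the mapping is global, a pointer produced in one turf can be handed to another and dereferenced directly, as long as the receiving turf has been granted rights over that region.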

As can be seen from this rough system chart, there is a combined L2 cache, although some low-end implementations may choose to omit it for space and energy reasons; the Mill has facilities that make an L2 cache less critical.
The L1 caches are separate for instructions and data, and furthermore the instruction caches are split between ExuCore instructions and FlowCore instructions. Smaller, more specialized caches can be made faster and more efficient in many regards, chiefly via shorter signal paths.
The D$1 data cache feeds the retire stations for load instructions and receives the values from store instructions.
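
A minimal sketch of that topology follows, assuming nothing about real line sizes, capacities, or policies; the class and cache names are invented for illustration:

<syntaxhighlight lang="python">
class Cache:
    """Toy cache level: a dict of 64-byte lines plus a pointer to the next level."""
    def __init__(self, name, backing=None):
        self.name = name
        self.lines = {}          # line index -> data
        self.backing = backing   # next cache level, or None for main memory

    def read(self, addr, memory):
        line = addr // 64
        if line in self.lines:
            return self.lines[line]                   # hit at this level
        if self.backing is not None:
            data = self.backing.read(addr, memory)    # miss: ask the next level
        else:
            data = memory.get(line, bytes(64))        # last level: main memory
        self.lines[line] = data                       # fill on the way back
        return data

memory = {}
l2  = Cache("L2")                    # combined second-level cache
d1  = Cache("D$1", backing=l2)       # data side: serves loads, receives stores
i1e = Cache("I$1 exu", backing=l2)   # exu-side instruction stream
i1f = Cache("I$1 flow", backing=l2)  # flow-side instruction stream

d1.read(0x1040, memory)              # a load: D$1 -> L2 -> memory, filling both
</syntaxhighlight>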

Address Translation

Because address translation is separated from access protection, and because all processes share one address space, translation and the TLB accesses can be moved below the caches. In fact, the TLB only needs to be accessed when there is a cache miss or an evict. In that case there is a 300+ cycle stall to main memory anyway, which means the TLB can be big, flat, slow, and energy efficient; the few extra cycles for a TLB lookup are largely masked by the system memory access.
On conventional machines the TLB sits right in the critical path between the top-level cache and the functional units. This means the TLB must be small, fast, and power hungry, with a complex hierarchy, and you still spend up to 20-30% of your cycles and power budget on TLB stalls and TLB hierarchy shuffling.
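
The sketch below shows the idea of translation below a virtually addressed cache: a hit never touches the TLB at all, and a miss pays the translation cost only alongside the already long trip to main memory. The page size, line size, and the trivial identity "page-table walk" are placeholders, not Mill parameters:

<syntaxhighlight lang="python">
PAGE = 4096   # placeholder page size
LINE = 64     # placeholder cache line size

tlb         = {}   # virtual page -> physical page: big, flat, off the fast path
cache       = {}   # virtually addressed: virtual line index -> data
main_memory = {}   # physical line index -> data

def translate(vaddr):
    """Stand-in for a TLB lookup / page-table walk (identity mapping here)."""
    vpage = vaddr // PAGE
    ppage = tlb.setdefault(vpage, vpage)
    return ppage * PAGE + vaddr % PAGE

def load(vaddr):
    line = vaddr // LINE
    if line in cache:
        return cache[line]        # cache hit: no translation happens at all
    paddr = translate(vaddr)      # only a miss translates, and the ~300-cycle
    data = main_memory.get(paddr // LINE, bytes(LINE))  # DRAM access hides it
    cache[line] = data
    return data

load(0x12345)   # first access misses and translates; later hits do not
</syntaxhighlight>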

Reserved Address Space

The top sixteenth of the address space is reserved to facilitate fast protection domain (turf) switches with secure stacks. More on this under Protection.
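
As a back-of-the-envelope check (assuming the reserved sixteenth is the highest-addressed one, as "top" suggests), that region is everything from 15/16 of the 64-bit space upward, i.e. every address whose top four bits are all ones:

<syntaxhighlight lang="python">
RESERVED_BASE = 15 * (1 << 60)            # 15/16 of 2**64
assert RESERVED_BASE == 0xF000_0000_0000_0000

def in_reserved_sixteenth(addr):
    return addr >> 60 == 0xF              # top nibble all ones

assert in_reserved_sixteenth(0xF000_0000_0000_0000)
assert not in_reserved_sixteenth(0xEFFF_FFFF_FFFF_FFFF)
</syntaxhighlight>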

Virtual Zero

Caches

Media

Presentation on the Memory Hierarchy by Ivan Godard - Slides