Forum Replies Created

  • imbecile
    Participant
    Post count: 48
    in reply to: Glossary #974

    specializer

    The specializer is a dedicated, Mill-specific library. It can be considered either the last step of compilation or the first step of execution. The Mill is a processor family with many members for different purposes, differing in parameters like belt slot count, number and kind of functional units, and cache sizes, and the Mill heavily depends on exposing those details of the chip to the compiler for optimal static scheduling; the specializer was introduced to bridge that gap. Software is distributed and deployed in a universal, member-independent internal byte code, and the specializer translates it into the actually executable binary code for the required chip. This can happen at install time or load time, and the specializer also does caching, symbol resolution, and similar tasks done by traditional dynamic linkers.
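
    To make the division of labor concrete, here is a toy sketch in C. It is purely illustrative: the member parameters, the op format, and the spill handling are all invented. The compiler's output references belt positions as if they were unlimited, and the specializer checks them against one concrete member.

        #include <stdio.h>

        /* Hypothetical member description; real Mill members expose many
         * more parameters (functional units, latencies, cache sizes...). */
        struct member {
            const char *name;
            int belt_slots; /* belt length on this member */
        };

        /* One operation in the member-independent distribution format;
         * the compiler emits belt references as if the belt were unlimited. */
        struct abstract_op {
            const char *mnemonic;
            int src_a, src_b; /* belt positions */
        };

        /* "Specialize" for one member: encode ops whose operands fit the
         * belt, flag the rest (a real specializer inserts spill/fill). */
        static void specialize(const struct member *m,
                               const struct abstract_op *ops, int n)
        {
            for (int i = 0; i < n; i++) {
                if (ops[i].src_a >= m->belt_slots ||
                    ops[i].src_b >= m->belt_slots)
                    printf("  %s: operand beyond %d-slot belt, needs spill\n",
                           ops[i].mnemonic, m->belt_slots);
                else
                    printf("  %s b%d, b%d\n",
                           ops[i].mnemonic, ops[i].src_a, ops[i].src_b);
            }
        }

        int main(void)
        {
            const struct member tin = { "Tin", 8 };
            const struct abstract_op prog[] = {
                { "add", 0, 1 }, { "mul", 2, 3 }, { "sub", 9, 1 },
            };
            printf("%s:\n", tin.name);
            specialize(&tin, prog, 3);
            return 0;
        }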

  • imbecile
    Participant
    Post count: 48
    in reply to: Glossary #973

    replay

    On all architectures unexpected things can happen which throw the pipeline, the functional units, and all internal state into a pickle: the normal flow is suspended and must be continued later. This can happen through interrupts, branch mispredictions, or the normal reordering on out-of-order architectures. The Mill is not out of order, but ordinary calls on it present small temporary context changes just as interrupts do (and interrupts are just unscheduled calls on the Mill). There are several strategies for returning to the state before the interruption, all usually called replay.

    Result Replay is used on the Mill. Since every interruption introduces a new belt context in the form of a new frame, all new operations drop their results into their new context (i.e. belt slots), while the already issued operations all finish and drop their results into their old contexts. This happens via belt slot tagging/renaming with frame IDs. On return to the previous flow, all results are presented as if there never was an interruption. The spiller may have had to intervene and temporarily save and restore some results for that to happen.

    Execution Replay is used on pretty much all other major hardware. When an interruption occurs, all results and transient state are thrown away; the issued instructions and their arguments are remembered, though, and on return they are reissued. This can be quite expensive with long pipelines and lots of complicated instructions.
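
    A toy model of result replay, with all structures invented for illustration: results carry the frame ID of the frame that issued them, so late-retiring results from before an interruption still land in their old context and are visible again after the return.

        #include <stdio.h>

        #define BELT 8

        /* Toy model, structures invented: every result carries the
         * frame id of the frame that issued it. */
        struct drop { int frame_id; int value; };

        static struct drop belt[BELT];
        static int head = 0; /* total number of drops so far */

        static void drop_result(int frame_id, int value)
        {
            belt[head % BELT] = (struct drop){ frame_id, value };
            head++;
        }

        /* Belt position pos as seen from frame_id: drops belonging to
         * other frames are invisible (the real spiller saves/restores
         * as needed). */
        static int read_belt(int frame_id, int pos)
        {
            int seen = 0;
            for (int i = head - 1; i >= 0 && i > head - 1 - BELT; i--) {
                if (belt[i % BELT].frame_id != frame_id) continue;
                if (seen++ == pos) return belt[i % BELT].value;
            }
            return -1; /* spilled away or never dropped */
        }

        int main(void)
        {
            drop_result(1, 10); /* issued by frame 1                          */
            /* interrupt arrives: frame 2 starts, frame 1 ops still in flight */
            drop_result(2, 99); /* handler's result goes to frame 2's context */
            drop_result(1, 20); /* late retire, still tagged with frame 1     */
            /* return to frame 1: its belt looks as if nothing ever happened  */
            printf("b0=%d b1=%d\n", read_belt(1, 0), read_belt(1, 1)); /* 20 10 */
            return 0;
        }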

  • imbecile
    Participant
    Post count: 48
    in reply to: Glossary #956

    implicit zero

    All new memory allocations happen in the cache first and are flagged as new for each byte. A load from such a location produces a zero. Only once something is stored to a newly allocated byte is its new flag removed. As a result you get zero initialization for free, and temporary buffers or stacks often never need to go to DRAM at all, living entirely in the cache as virtual addresses.
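
    A sketch of the mechanism, with the flag storage invented for illustration (real hardware keeps such metadata in the cache lines themselves):

        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        /* toy stand-in for the per-byte "new" flags of a cache line */
        struct line {
            uint8_t data[64];
            uint8_t is_new[64]; /* 1 = never written since allocation */
        };

        /* allocation only sets flags -- no DRAM touched, no explicit zeroing */
        static void alloc_line(struct line *l)
        {
            memset(l->is_new, 1, sizeof l->is_new);
        }

        static uint8_t load_byte(const struct line *l, int off)
        {
            return l->is_new[off] ? 0 : l->data[off]; /* new bytes read as zero */
        }

        static void store_byte(struct line *l, int off, uint8_t v)
        {
            l->data[off] = v;
            l->is_new[off] = 0; /* first store clears the flag */
        }

        int main(void)
        {
            struct line l;
            alloc_line(&l);
            printf("%d\n", load_byte(&l, 5)); /* 0: zero initialization for free */
            store_byte(&l, 5, 42);
            printf("%d\n", load_byte(&l, 5)); /* 42 */
            return 0;
        }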

  • imbecile
    Participant
    Post count: 48
    in reply to: Glossary #955

    phase

    One instruction on the Mill can contain many operations that are issued together. Those operations can have data dependencies among each other. For that reason operations were divided into distinct categories called phases, which impose an ordering in consecutive cycles for those operations to execute in, accounting for their data dependencies (see the sketch after the list).
    The phases are:

    1 (Reader) operations that load or create values onto the belt from hardcoded arguments
    2 (Operation) operations that take belt slots as arguments and produce results
    3 (Call) function calls
    4 (Pick) the pick operation
    5 (Writer) stores and branches
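
    A sketch of phased issue (the op encoding and assembly syntax are made up): all three ops below belong to one instruction, but they issue in phase order over consecutive cycles, which is what lets a reader feed an operation whose result a writer then consumes.

        #include <stdio.h>

        enum phase { READER = 1, OPER, CALL, PICK, WRITER };

        struct op { enum phase ph; const char *text; };

        int main(void)
        {
            /* all encoded in ONE instruction (syntax invented); encoding
             * order is irrelevant, phase alone decides the issue cycle */
            const struct op instr[] = {
                { WRITER, "store b0"   },
                { OPER,   "add b0, b1" },
                { READER, "con 42"     },
            };
            const int n = sizeof instr / sizeof *instr;

            for (int p = READER; p <= WRITER; p++)
                for (int i = 0; i < n; i++)
                    if (instr[i].ph == (enum phase)p)
                        printf("cycle %d: %s\n", p - READER, instr[i].text);
            return 0;
        }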

  • imbecile
    Participant
    Post count: 48
    in reply to: Security #944

    I got 3 small questions.

    1. Considering stacklets are only 4kb in size, services probably can’t use too large data structures on the stack. But considering argument passing happens on the belt and call stacks live in the spiller, the stack pressure on the Mill is vastly reduced. I’m assuming normal applications can have stacks larger than 4kb though.

    2. There are a few hardware functionalities that could be useful if exposed to the programmer, like the lookups in the PLB and TLB. I’m not sure how feasible and secure this is, but could those lookup algorithms implemented in hardware be made accessible to the programmer through a service/portal call? Or are they too tightly tied to their dedicated data tables?

    3. Are you aware of the MIT exokernels? I think the Mill architecture lends itself beautifully to them, and the service concept even makes some of the contortions they go through to avoid or secure task switches unnecessary, like dedicated languages that result in code being passed to privileged kernel driver modules.

  • imbecile
    Participant
    Post count: 48
    in reply to: Glossary #943

    turf

    Is the collection of protection regions that share the same turf ID. This turf ID is held in a special register and provides the security context of the current thread. It can be changed for the current thread with portal calls. Memory access is granted as soon as the first region with the current turf ID (or with a matching thread ID, if the region's turf ID is wildcarded) and the required permission is found.
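
    A speculative sketch of that lookup rule; the region encoding and the use of 0 as a wildcard are invented, and the hardware does this in the PLB rather than with a linear scan:

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        #define ANY 0 /* invented wildcard encoding */

        enum perm { PERM_R = 1, PERM_W = 2, PERM_X = 4 };

        struct region {
            uint64_t lo, hi;    /* address range [lo, hi) */
            unsigned perms;
            unsigned turf_id;   /* ANY = matches every turf   */
            unsigned thread_id; /* ANY = matches every thread */
        };

        static bool allowed(const struct region *tab, int n,
                            unsigned turf, unsigned thread,
                            uint64_t addr, unsigned need)
        {
            for (int i = 0; i < n; i++) {
                if (addr < tab[i].lo || addr >= tab[i].hi) continue;
                if (tab[i].turf_id != ANY && tab[i].turf_id != turf) continue;
                if (tab[i].thread_id != ANY && tab[i].thread_id != thread) continue;
                if ((tab[i].perms & need) == need)
                    return true; /* first matching region grants access */
            }
            return false;
        }

        int main(void)
        {
            const struct region tab[] = {
                { 0x1000, 0x2000, PERM_R | PERM_X, 7, ANY }, /* turf 7: code    */
                { 0x8000, 0x9000, PERM_R | PERM_W, ANY, 3 }, /* thread 3's data */
            };
            printf("%d\n", allowed(tab, 2, 7, 3, 0x1800, PERM_X)); /* 1 */
            printf("%d\n", allowed(tab, 2, 9, 3, 0x1800, PERM_X)); /* 0: wrong turf */
            return 0;
        }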

  • imbecile
    Participant
    Post count: 48
    in reply to: Glossary #942

    Translation Lookaside Buffer

    Maps virtual memory addresses to physical addresses. Resides below the caches, i.e. everything in the caches is virtual addresses. Virtual addresses are unique and refer to the same physical address in every context. They only need to be translated when there is a cache miss and a DRAM access becomes necessary.
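
    A toy model (entry format and sizes invented): the cache never translates; translation happens only on the way to DRAM.

        #include <stdint.h>
        #include <stdio.h>

        #define PAGE_BITS 12 /* invented page size */

        struct tlb_entry { uint64_t vpage, ppage; };

        /* Consulted only when a line misses in the cache and DRAM must
         * be reached; cache hits never translate. */
        static uint64_t translate(const struct tlb_entry *tlb, int n, uint64_t va)
        {
            for (int i = 0; i < n; i++)
                if (tlb[i].vpage == va >> PAGE_BITS)
                    return (tlb[i].ppage << PAGE_BITS)
                         | (va & ((1ull << PAGE_BITS) - 1));
            return ~0ull; /* TLB miss: walk the tables (not modeled) */
        }

        int main(void)
        {
            const struct tlb_entry tlb[] = { { 0x12345, 0x00042 } };
            printf("%#llx\n", (unsigned long long)translate(tlb, 1, 0x12345abcull));
            return 0;
        }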

  • imbecile
    Participant
    Post count: 48
    in reply to: Glossary #941

    stacklet

    A hardware-allocated stack segment residing at the top of the address space, used for services. Stacklets are identified by the turf of the service and the thread the service executes in. This prevents fragmentation of the turfs of applications and services.
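
    Because a stacklet is identified by (turf, thread), its address can presumably be computed rather than looked up. A hypothetical layout, with all field widths invented:

        #include <stdint.h>
        #include <stdio.h>

        #define STACKLET_BITS 12 /* 4kB stacklets */
        #define THREAD_BITS   20 /* invented field width */
        #define TOP 0xFFFF000000000000ull /* invented base */

        /* stacklet address derived from (turf, thread) -- no allocation table */
        static uint64_t stacklet_base(uint64_t turf, uint64_t thread)
        {
            return TOP | (turf << (THREAD_BITS + STACKLET_BITS))
                       | (thread << STACKLET_BITS);
        }

        int main(void)
        {
            printf("%#llx\n", (unsigned long long)stacklet_base(7, 3));
            return 0;
        }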

  • imbecile
    Participant
    Post count: 48
    in reply to: Glossary #940

    spiller

    A part of the Mill hardware that is largely invisible to the programmer and can’t be directly accessed. It manages the temporary memory used by certain operations. It has its own separate caches and is ultimately backed by DRAM. Among other things it takes care of the scratchpad, the call stacks, the belts of frames down the call hierarchy, contexts in task switches, etc.
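
    A toy model of just the spiller's call-stack role (the real spiller has its own cache hierarchy and also covers scratchpad and task-switch state): belts are saved transparently on call and restored on return.

        #include <stdio.h>

        #define DEPTH 16
        #define BELT  8

        /* toy spiller storage; structures invented for illustration */
        static int saved[DEPTH][BELT];
        static int top = 0;

        static void spill(const int belt[BELT]) /* on call: save caller's belt */
        {
            for (int i = 0; i < BELT; i++) saved[top][i] = belt[i];
            top++;
        }

        static void fill(int belt[BELT]) /* on return: restore it */
        {
            top--;
            for (int i = 0; i < BELT; i++) belt[i] = saved[top][i];
        }

        int main(void)
        {
            int belt[BELT] = { 1, 2, 3, 4, 5, 6, 7, 8 };
            spill(belt);
            belt[0] = 99;            /* callee clobbers the belt    */
            fill(belt);
            printf("%d\n", belt[0]); /* 1: caller's state is back   */
            return 0;
        }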

  • imbecile
    Participant
    Post count: 48
    in reply to: Glossary #939

    service

    Services are a kind of library, except that the calls happen across protection boundaries through portals. They can be used from applications or other services, and they protect callers and callees from each other. They are the canonical way to provide “privileged” functionality on the Mill. It is not really privileged, though: services merely reside in different turfs with different permissions than the code calling them. There is nothing fundamentally different between turfs, only differing sets of permissions to memory regions.

  • imbecile
    Participant
    Post count: 48
    in reply to: Glossary #938

    Single Address Space

    All processes and threads on the Mill share the same mapping of virtual addresses to physical addresses. This is made possible by 64-bit addresses, which provide an address space large enough for the foreseeable future. Different programs are protected/isolated from each other with permissions in different turfs, not with separate memory mappings. No memory remapping needs to be done on task switches, and task switches themselves are often entirely unnecessary because of this.

  • imbecile
    Participant
    Post count: 48
    in reply to: Glossary #937

    protection region

    A contiguous range of the address space with a set of access permissions. It can be attributed to a turf, a thread, or both via a turf and/or thread ID.

  • imbecile
    Participant
    Post count: 48
    in reply to: Glossary #934

    portal

    A portal is a special data structure of cache line size that holds all the information necessary to call into service code across protection barriers. This happens without context/thread switches and is therefore fast. There are a few operations to manage access to the portals themselves and to the memory used, if necessary, to pass parameters, both permanently and temporarily for one call.
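
    The actual portal layout isn’t public, so the struct below is hypothetical; the point it illustrates is that everything a cross-turf call needs fits in one cache line, and the “call” is just a security-context swap on the same thread, not a thread switch.

        #include <stdint.h>
        #include <stdio.h>

        /* hypothetical layout, padded to one 64-byte cache line */
        struct portal {
            uint64_t entry;   /* code address to transfer to */
            uint32_t turf_id; /* turf the callee executes in */
            uint32_t flags;
            uint64_t reserved[6];
        };

        /* a portal "call": swap the security context, same thread throughout */
        static void portal_call(const struct portal *p, uint32_t *cur_turf)
        {
            uint32_t caller_turf = *cur_turf;
            *cur_turf = p->turf_id; /* enter the callee's permissions */
            printf("running %#llx in turf %u\n",
                   (unsigned long long)p->entry, *cur_turf);
            *cur_turf = caller_turf; /* return restores the caller */
        }

        int main(void)
        {
            _Static_assert(sizeof(struct portal) == 64, "one cache line");
            const struct portal p = { 0x401000, 9, 0, { 0 } };
            uint32_t turf = 7;
            portal_call(&p, &turf);
            return 0;
        }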

  • imbecile
    Participant
    Post count: 48

    > It seems to me that the compiler will consider most loops pipeline-able for the *abstract* Mill and thus emit its intermediate representation for a fully-software-pipelined loop — only to have the specializer potentially need to (partially, or worse fully) “de-pipeline” it, due to the target’s constraint on the number of simultaneous in-flight loads it can do.

    I’m only speculating here, but I get the impression that this resource allocation problem is of the same class as the varying number of belt slots on the different cores, and will be solved the same way: the compiler emits code as if there were no restriction at all, i.e. as if there were an unlimited number of belt slots and load/retire stations. It doesn’t even know of the spill and fill instructions, for example, since it wouldn’t know where to place them.

    The specializer knows those exact limits and then schedules/inserts loads and stores and spills and fills exactly to those limits at the appropriate places. Figuring out how many parallel loads and stores you can have, and how much loop pipelining/unrolling you can do on a core, is pretty much the same as figuring out how many belt slots you have and thus how to link consumers and producers together, except in one case you emit spills and fills at the limits and in the other you emit branches at the limits. The loads and stores in both cases are just placed for the best latency-hiding behavior.
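
    A speculative sketch of that specializer-side scheduling, with all numbers invented: given the member’s retire-station count, the in-flight loads the compiler emitted freely get partitioned down to the limit.

        #include <stdio.h>

        /* partition in-flight loads to the member's retire-station
         * count (numbers invented for illustration) */
        static void schedule_loads(int total_loads, int stations)
        {
            for (int i = 0; i < total_loads; i += stations) {
                int batch = total_loads - i < stations ? total_loads - i
                                                       : stations;
                printf("issue loads %d..%d, then let them retire\n",
                       i, i + batch - 1);
            }
        }

        int main(void)
        {
            schedule_loads(10, 4); /* e.g. a 4-station member: 4 + 4 + 2 */
            return 0;
        }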

  • imbecile
    Participant
    Post count: 48
    in reply to: Security #950

    Yes, the exokernels are pretty much microkernels. The difference from other microkernels like L4 is that the API has even lower-level abstractions. They don’t even really have a concept of threads, for example; they work by granting processor time slices to memory mappings.
