Mill Computing, Inc. Forums › The Mill Architecture › SIMS architecture

  • jstopard
    Participant
    Post count: 7
    #4028

    I would like to tell you about something called Single (Variable) Instruction Multiple Stream; but, before I do, I will need to explain a few things.

    Benefits of SIMD: low latency; moderate throughput; near-isolinear performance increase in many cases.
    Drawbacks of SIMD: must recompile the module when increasing the vector size; only supports power-of-two vector sizes; little support for auto-vectorisation.
    Benefits of SIMT: high throughput; high capacity; supports a wide gamut of DSP applications.
    Drawbacks of SIMT: high latency; masking; requires hand optimization.
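
    To make the first SIMD drawback concrete, here is a minimal sketch of a loop written against 8-wide AVX intrinsics; the width is baked into the source, so moving to a wider unit means rewriting and recompiling. The function and array names are only illustrative.

        #include <immintrin.h>

        /* The vector width (8 floats) is hard-coded into the source. */
        void add_f32_avx(const float *a, const float *b, float *c, int n)
        {
            int i = 0;
            for (; i + 8 <= n; i += 8) {
                __m256 va = _mm256_loadu_ps(a + i);
                __m256 vb = _mm256_loadu_ps(b + i);
                _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
            }
            for (; i < n; ++i)            /* scalar tail for the leftover elements */
                c[i] = a[i] + b[i];
        }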

    Benefits of CRAY vectors: a variable vector size that can be changed at run-time.
    Drawbacks of CRAY vectors: difficult to implement; hard to integrate with software pipe-lining; costly.
    Benefits of SIMS: a variable number of parallel streams (not power-of-two based) that don’t need their own thread; a true isolinear performance increase on nearly any state machine; works with software pipe-lining; compute units might be able to act as spares, increasing yield somewhat.
    Drawbacks of SIMS: compute units sit idle in some cases if portions of the code base cannot be converted to use the SIMS architecture; doesn’t work very well with functions like memcpy, memset, strcpy, strcmp.

    SIMS is designed to replace SIMD, and work harmoniously with software pipe-lining. SIMS works whether or not software pipe-lining can be utilized, and works as follows: In the case where pipe-lining can be utilized, the graph structure of the loop body is collapsed, turning the loop body into a single node. In the case where pipe-lining cannot be utilized, the graph structure of the loop body is maintained, allowing SIMS to parallelise, but losing the benefits of pipe-lining.
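
    As a rough illustration of the two cases (the loop bodies below are hypothetical examples, nothing more): a straight-line body is the pipelinable case, and a body with data-dependent control flow is the case where the graph structure would be kept.

        /* Case 1: straight-line body, no internal control flow; a compiler can
           software-pipeline this, so the body collapses to a single node. */
        void axpy(float *y, const float *x, float a, int n)
        {
            for (int i = 0; i < n; ++i)
                y[i] = a * x[i] + y[i];
        }

        /* Case 2: data-dependent control flow inside the body; classic
           pipe-lining is awkward here, so the body keeps its graph structure
           and SIMS would still run independent iterations in parallel. */
        void clamp_or_square(float *y, const float *x, int n)
        {
            for (int i = 0; i < n; ++i) {
                if (x[i] > 1.0f)
                    y[i] = 1.0f;
                else
                    y[i] = x[i] * x[i];
            }
        }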

    SIMS is flexible and provides the best of both worlds, sometimes together. Given the difficulty of creating a CRAY/pipe-lining architecture, SIMS seems a more sensible choice.

    Let me know what you think. Perhaps I’ve got this completely wrong, and need to rethink things.

  • jstopard
    Participant
    Post count: 7

    Control-flow

    While traditional control flow is used, this is intended only to support flexible flow of control around/between convergence points. A point of convergence is where two sub graphs connect. You can think of a sub graph as being like an EBB, although EBBs are not actually used in this architecture.

    Data-flow

    Data-flow is a term that only makes sense in terms of software pipe-lining; data flows through the operations specified in the software pipe-lining kernel, while the instructions execute in parallel.

    Execution modes

    The modes are as follows: SIMS only; SIMS and software pipe-lining; single stream and software pipe-lining.

    Graph-flow

    Graph-flow occurs within a sub graph, but this type of control flow isn’t achieved by using JUMP instructions; instead, the CPU maintains a real graph representation of the program, and a binary array is used as a lookup table to determine which graph node a compute unit executes next.
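
    A toy software model of the lookup-table idea (types and names are invented for illustration; this is not a description of the actual hardware):

        #include <stdio.h>

        struct node {
            int (*op)(int *state);  /* runs the node's work, returns a condition bit  */
            int next[2];            /* successor chosen by that bit; -1 ends the walk */
        };

        static int step(int *state)   { *state += 1; return *state < 5; }
        static int finish(int *state) { (void)state; return 0; }

        int main(void)
        {
            /* Node 0 repeats while its condition bit is 1, then falls through
               to node 1, which terminates the walk. */
            struct node graph[2] = {
                { step,   { 1, 0 } },   /* bit 1 -> node 0 again, bit 0 -> node 1 */
                { finish, { -1, -1 } },
            };

            int state = 0, cur = 0;
            while (cur != -1) {
                int bit = graph[cur].op(&state);
                cur = graph[cur].next[bit];
            }
            printf("final state: %d\n", state);  /* prints 5 */
            return 0;
        }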

    Software pipe-lining kernel

    The compiler emits a software pipe-lining kernel in cases where it’s possible; the CPU is required to respect program semantics, but can dynamically adjust execution behaviour within CUs, and can borrow functional units if necessary.

    Vectorisation

    There is no vector size in the SIMS architecture; instead, it is the program execution that is ‘vectorised’. The number of program execution streams active at once can be any number up to the CPU limit. If the user acquires a new CPU with a larger compute unit count, it is not necessary to recompile the program.
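
    For contrast with the AVX sketch in the first post, here is the same operation with no width in the source at all; under SIMS the hardware would decide at run time how many streams execute iterations of this loop, so the binary would not change when the compute-unit count grows.

        void add_f32(const float *a, const float *b, float *c, int n)
        {
            for (int i = 0; i < n; ++i)   /* no vector width appears anywhere */
                c[i] = a[i] + b[i];
        }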

  • jstopard
    Participant
    Post count: 7

    Hardware pipe-line

    The pipe-line stages are as follows: fetch, decode, virtual stream assignment, operation, memory, write-back.

    Nested streaming

    It is not possible to nest a SIMS streaming context within another context if the inner context is conditional. Nevertheless, it would be prudent to permit nested for loops to be ‘streamed’, as this grants a significant increase in performance. Using a bizarre technique called ‘outlining’ to remove a loop code block and place it into a separate function body, it is possible to emulate streaming on nested loops even when the inner loop is conditional, as sketched below.
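
    A rough sketch of outlining (the names process_row/process_matrix and the clamping body are invented for illustration): the conditional inner loop is lifted into its own function, so each outer iteration becomes a single call that can be streamed.

        #include <stddef.h>

        static void process_row(float *row, int cols, float limit)
        {
            for (int j = 0; j < cols; ++j) {
                if (row[j] > limit)       /* the conditional part that blocks */
                    row[j] = limit;       /* nesting the streaming context    */
            }
        }

        void process_matrix(float *m, int rows, int cols, float limit)
        {
            for (int i = 0; i < rows; ++i)    /* now a flat sequence of calls, */
                process_row(m + (size_t)i * cols, cols, limit);  /* one per row */
        }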

    Software pipe-lining

    SIMS supports pipe-lining and uses manual register rotation, but it is not necessary to encode a prologue or epilogue; instead, one of two techniques can be used to mask the instructions that would otherwise make up the prologue/epilogue: the first involves simple instruction masking, the other involves side-effect masking using tagged values that indicate whether state should be modified.
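
    A small C model of the masking idea (a sketch only, not the hardware): a two-stage software-pipelined scale loop in which, instead of a separate prologue and epilogue, every kernel iteration runs and per-stage predicates mask out the work that falls outside the valid range.

        void scale_pipelined(float *dst, const float *src, float k, int n)
        {
            float staged = 0.0f;              /* value in flight between stages   */
            for (int i = 0; i <= n; ++i) {    /* n + 1 kernel iterations          */
                int do_store = (i >= 1);      /* predicate replacing the epilogue */
                int do_load  = (i <  n);      /* predicate replacing the prologue */

                if (do_store)
                    dst[i - 1] = staged;      /* stage 2: retire iteration i - 1  */
                if (do_load)
                    staged = src[i] * k;      /* stage 1: start iteration i       */
            }
        }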

    Stackless programming with cached register rotation

    It is not necessary to write register values to the stack when running out of registers; instead, the CPU can stream register values to and from a register caching system (which itself can dynamically stream to and from RAM); in addition, these cached values can be properly rotated as necessary to support software pipe-lining. Lastly, the floating-point padding bits normally found in registers can be cached along with register values.
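
    A toy software model of the register-cache idea (invented names, and only a sketch): a small ring of cached values with a rotating base; when the ring is full, the oldest value is streamed out to a backing store standing in for RAM.

        #include <stdint.h>

        #define RING_SIZE 8

        struct reg_cache {
            uint64_t ring[RING_SIZE];
            int      base;      /* rotating base index                     */
            int      count;     /* live entries currently in the ring      */
            uint64_t *backing;  /* spill area standing in for RAM          */
            int      spilled;   /* values already streamed to the backing  */
        };

        /* Push a new value; if the ring is full, stream the oldest one out
           and rotate the base so logical register names stay consistent. */
        static void rc_push(struct reg_cache *rc, uint64_t v)
        {
            if (rc->count == RING_SIZE) {
                rc->backing[rc->spilled++] = rc->ring[rc->base];
                rc->base = (rc->base + 1) % RING_SIZE;
                rc->count--;
            }
            rc->ring[(rc->base + rc->count) % RING_SIZE] = v;
            rc->count++;
        }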

    Virtual static scheduling

    VLIW is not used in SIMS. The compiler emits a static schedule, and the CPU is expected to conform to the semantics laid out in this schedule; however, the CPU can dynamically assign fewer or more functional units to virtual compute units, if needed.

  • Art
    Participant
    Post count: 15

    Why are you posting this?
    Is there a question coming about the Mill?

    It appears you are using our forum to advertise/discuss/explain ideas without relating them to the Mill Architecture, which this forum exists to discuss.

    If this is some other architecture you would like to interest somebody else in working on, I gently suggest that this is not the place.

  • jstopard
    Participant
    Post count: 7

    This isn’t an advertisement, simply a suggestion that data vectorisation has no future, and that a move needs to be made to execution-level ‘vectorisation’.

    I will comply with your wishes and stop posting about this.
