Forum Replies Created

  • Art
    Participant
    Post count: 11

    There are some differences in approach that don’t give the LLC (Last Level Cache) described in that paper any advantages over the cache we had envisioned for the Mill Architecture.

    The biggest justification for this cache methodology rests on an assumption that DRAM access times will remain at a high multiple of processor clocks. Consider that Dennard scaling has slowed dramatically, leaving processor core clock rates not much higher than in 2004, yet DRAM clock rates have continued to climb since that time, reducing LLC miss penalties – especially with on-chip memory controllers.
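
    As a rough back-of-the-envelope illustration of that trend (a sketch only; the clock rate and access-time numbers below are assumptions chosen for easy arithmetic, not measurements of any particular part):

        #include <cstdio>

        int main() {
            // Assumed numbers: a core clock of ~3 GHz (little changed over the years)
            // and a DRAM access that has dropped from ~100 ns to ~50 ns with faster
            // DRAM interfaces and on-chip memory controllers.
            const double core_ghz = 3.0;
            const double dram_ns_then = 100.0, dram_ns_now = 50.0;

            // Miss penalty in core clocks = access time (ns) * clock rate (GHz).
            printf("then: %.0f core clocks per DRAM access\n", dram_ns_then * core_ghz);
            printf("now:  %.0f core clocks per DRAM access\n", dram_ns_now * core_ghz);
            return 0;
        }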

    The paper describes a transparent physically addressed cache. For a variety of reasons the Mill uses virtually addressed caches. The virtual-to-physical translation is done by TLB logic in the DRAM memory controller. Leaving the translation in the DRAM memory controller does at least a couple of good things: it reduces the number of TLB accesses, and it allows the Mill to avoid needless writes to DRAM of stack-allocated data from functions that have exited.
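
    A minimal conceptual sketch of that ordering (the structures and names below are invented for illustration and are not Mill implementation detail): the cache is indexed and tagged by virtual address, so a hit never touches a TLB; translation happens only when a miss has to go out to the DRAM controller.

        #include <cstdint>
        #include <cstdio>
        #include <optional>
        #include <unordered_map>

        // Toy stand-ins, invented for illustration only.
        struct VirtuallyAddressedCache {
            std::unordered_map<uint64_t, uint64_t> lines;   // keyed by virtual address
            std::optional<uint64_t> lookup(uint64_t vaddr) {
                auto it = lines.find(vaddr);
                if (it == lines.end()) return std::nullopt;
                return it->second;
            }
            void fill(uint64_t vaddr, uint64_t data) { lines[vaddr] = data; }
        };

        struct MemoryController {
            uint64_t translate(uint64_t vaddr) { return vaddr ^ 0x100000; } // TLB lives here
            uint64_t read_dram(uint64_t paddr) { return paddr * 2; }        // fake DRAM read
        };

        // A hit is resolved entirely with the virtual address; translation only on a miss.
        uint64_t load(VirtuallyAddressedCache& cache, MemoryController& mc, uint64_t vaddr) {
            if (auto hit = cache.lookup(vaddr)) return *hit;
            uint64_t data = mc.read_dram(mc.translate(vaddr));
            cache.fill(vaddr, data);
            return data;
        }

        int main() {
            VirtuallyAddressedCache cache;
            MemoryController mc;
            printf("%llu\n", (unsigned long long)load(cache, mc, 0x4000)); // miss: translate + DRAM
            printf("%llu\n", (unsigned long long)load(cache, mc, 0x4000)); // hit: no TLB access
            return 0;
        }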

    The paper also describes the tag structure as an initial 4-way hashed tag lookup, with the tags having N-way associativity to the data memory they map to, followed by a hash chain to accommodate hash collisions. The presumption here is that the cache miss rate improvement is enough to make the hash chain traversal worth the increased latency, but only a simple timing model is cited as evidence of this. This is then compared, using trace analysis, to a 4-way associative cache with LRU replacement. While we could use a 4-way associative LRU cache as a Mill’s LLC, more likely would be an 8- or 16-way associative LLC with pseudo-random replacement – pseudo-random replacement has been shown to have lower levels of replacement thrashing than LRU for 8- or 16-way associative caches. We would also recommend on-chip DRAM controllers for Mill processors, with DRAM access times well under the “Break-even miss latency” given in the last column of Table 3 in the paper.
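
    For concreteness, here is one cheap way pseudo-random victim selection is often modeled: a small linear-feedback shift register stepped on each replacement. The LFSR width, taps, and the 16-way figure below are arbitrary illustrative choices, not a Mill specification.

        #include <cstdint>
        #include <cstdio>

        // A 16-bit Fibonacci LFSR (taps 16, 14, 13, 11; maximal length).
        struct Lfsr16 {
            uint16_t state = 0xACE1u;                  // any non-zero seed
            uint16_t next() {
                uint16_t bit = ((state >> 0) ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1u;
                state = (uint16_t)((state >> 1) | (bit << 15));
                return state;
            }
        };

        int main() {
            constexpr unsigned kWays = 16;             // e.g. a 16-way associative LLC
            Lfsr16 lfsr;
            for (int i = 0; i < 8; ++i) {
                unsigned victim = lfsr.next() % kWays; // pick the way to evict
                printf("replace way %u\n", victim);
            }
            return 0;
        }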

    I could go on, but I think I have spent enough time looking at this to conclude that, while interesting, it is not a likely path to either higher performance or lower power compared with other, more conventional LLC approaches.

  • Art
    Participant
    Post count: 11
    in reply to: Any plans for 2024? #3988

    We would love to be selling Mill computers also.

    What needs to happen to get there: the investment of money to hire the skilled people needed to accelerate Mill development. It’s that simple, and it is not a new need. Likely $12M USD over a 2- to 3-year period to reach some initial form of production silicon.

    Accelerating development will require some additional C++ developers familiar with a range of topics, from micro-kernels to POSIX libraries to C/C++ libraries to LLVM back-end code generation, as well as developers familiar with both C++ and SystemVerilog as it applies to FPGA and ASIC usage. With more people we could also spare the time to refresh the website, do new videos, etc. When we did the videos some years ago, the result was enough investment to file patents, but not much more. So we had the patents, which we needed after the patent law changes, but the video production effort took time away from key parts of development.

    We are working on various approaches to securing the investment needed.

    In the meantime, we have a few people who are continuing on with development, and we are in no immediate danger of failing. We are at the point where we can run thousands of test programs, as well as Coremark, on our internal cycle-accurate instruction set simulator. While the scope of what we can compile and run still needs to be expanded, what we have so far indicates the Mill Architecture’s viability.

  • Art
    Participant
    Post count: 11

    We now have significant C/C++ code compiled with LLVM as the front end and our own code generation as the back end.
    We can now run Coremark, for example, on a functional instruction set simulator of our architecture.
    We still have optimizations to finish implementing in that code-generation back end.
    I am also working on hardware generation from our specification machinery.

    At this point all of the areas we are working on could progress much faster with money for hiring people. That was not always the case…

    Also, we could disclose a lot more if we could file patents for what we still have as internal ideas, but that requires money as well.

  • Art
    Participant
    Post count: 11

    I have thought avionics could be an application for Mill processors. Depending upon the customer, this could be an early application – especially if the customer has money and is patient (likely due to desperation), and is therefore willing to pay NRE charges up front for both the logic implementation and the circuit implementation.

    That said, there are a number of hurdles to overcome with such an application:

    1. Regulatory. Avionics is generally considered life-critical electronics, and is justifiably heavily regulated. Having worked in the aerospace electronics industry, I am very familiar with the types of regulatory barriers to entry, even though my experience is from decades ago.
    2. Bleeding-edge technology outside of Mill Computing’s main domain of expertise. The technologies you talk about concerning different memory types, packaging, radiation hardening, lightning strike hardening, crash survival hardening, unusual packaging and thermal design, and others that you do not mention all fall more within the design space of an avionics systems house. I know from work experience that these technologies, while not insurmountable, are definitely not trivial either.
    3. Accomplishing the IP licensing needed.
    4. Getting potential customer avionics houses interested in Mill Computing.

    So in short, yes – such ideas are feasible (for some value of feasible) as an application for a Mill architecture processor core. Whether it is the ideal initial market for the Mill will depend on a suitable strategic partner being willing to help make that happen. Do you know of any potential strategic partners in that space?

  • Art
    Participant
    Post count: 11
    in reply to: Threads/Coroutines #738

    This is discussed, at least briefly, here: http://millcomputing.com/topic/the-belt/. The description starts at slide 40, at 00:30:46 in the video.

    In short, the state is saved/restored by a hardware mechanism. All the metadata is saved/restored with it.

    The “call stack” on a Mill is not the same as on a conventional machine. On a Mill, data that is not part of programmer-visible state (such as a function or interrupt return address) is saved/restored by the spiller hardware, and if enough levels of interrupt and/or function call occur to push this data to main memory, it is pushed to a completely different memory block. Normally, except for debugging, this “saved state” region is protected from all access except by the spiller hardware. The programmer-visible “call stack” (such as C automatic local function variables) on a Mill is in a memory privilege region containing only programmer-visible variables.

    This same spiller hardware mechanism is used by the OS to change thread contexts. How something like Unix fork is handled is NYF (not yet filed), and I will defer to Ivan to explain that when it is.

  • Art
    Participant
    Post count: 11

    Yes, this is good stuff indeed! I also read the paper, and the conclusions for standard CPUs are scary. I also note that there is specific AES hardware on current Intel E3, E5, and E7 series server processors – likely for both performance and vulnerability concerns, although I have no specific knowledge of an Intel claim that their AES hardware reduces vulnerability to timing attacks on the encryption.

    I also noted in the paper that the tables could be eliminated through direct use of the operations the tables are meant to replace, which on a Mill could actually be faster than a table lookup (or at least sufficiently fast), and certainly could be fixed latency. I have not looked at the complexity of AES in enough detail to see whether that is indeed the case; it might not be. The AES “performance contest” was held on the CPUs of the day, and the Mill has characteristics that may lead to different implementations being optimal.
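
    As one small, concrete example of trading a table for fixed-latency arithmetic (this is just the standard GF(2^8) “xtime” step from AES MixColumns, not a claim about how an actual Mill AES port would be written):

        #include <cstdint>
        #include <cstdio>

        // Multiply by x (i.e., by 2) in GF(2^8) with the AES polynomial 0x11B.
        // Written without a lookup table and without a data-dependent branch,
        // so its latency does not depend on the byte value being processed.
        static inline uint8_t xtime(uint8_t b) {
            return (uint8_t)((b << 1) ^ ((b >> 7) * 0x1B));
        }

        int main() {
            // Spot check: {57}*{02} = {AE} and {AE}*{02} = {47} (FIPS-197 worked example).
            printf("%02X %02X\n", xtime(0x57), xtime(0xAE));
            return 0;
        }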

    As for a hardware AES box for the Mill, I suspect that dedicated (dynamically configurable) hardware to compute the algorithm may well be a better implementation than using tables in fixed-latency SRAM. At least such an implementation should be investigated, rather than assuming that the implementation for old CPU ISAs will also be optimal for direct hardware.

  • Art
    Participant
    Post count: 11
    in reply to: Memory #520

    Unfortunately, all of the online papers I have seen about multi-core interconnect have been on various university sites, and I don’t see any of them with a proper forum for discussion.

  • Art
    Participant
    Post count: 11
    in reply to: The Compiler #2086

    If so, is there an advantage to issuing instructions late as opposed to issuing them as soon as their arguments are available? In this particular case one could, at least naïvely, think that a larger part of the logic could have been clock-gated on cycles 3 and 2, possibly leading to power savings if ‘sub’ and ‘mul’ had been issued together.

    The TL;DR answer: there is no power difference as long as the number of belt value spills/fills remains the same.

    As far as power savings due to clock gating is concerned, there are 2 major factors:

    1. The power consumed in performing the operation itself (the add, sub, mul, etc.)
    2. The power consumed in maintaining operands on the belt

    The power consumed by the operation itself is independent of when the operation is performed. It is what it is: the same every time for a particular set of input values. When an operation is not being performed by a particular functional unit, its clock is gated off. When the clock is gated off, the functional unit’s power consumption is only that due to static leakage current.

    The power consumed in maintaining operands on the belt is roughly constant per cycle, and depends mostly upon the number of new result values arriving each clock times the number of potential destinations for each result.

    The conclusion is that the biggest factor in reducing power is the number of belt spills/fills that must be performed. The lower the number of spills/fills, the lower the power.
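
    A toy model of that claim (all energy numbers below are invented and in arbitrary units; only the arithmetic matters): the per-operation and per-result terms are the same whichever cycles the operations issue in, so the only term that can differ between schedules is the spill/fill count.

        #include <cstdio>

        int main() {
            // Invented, illustrative energy costs in arbitrary units.
            const double e_add = 1.0, e_sub = 1.0, e_mul = 5.0;   // per-operation energy
            const double e_belt_per_result = 0.5;                 // cost of each new belt result
            const double e_spill_fill = 2.0;                      // cost of one spill or fill

            // Schedule A: sub and mul issue as early as possible.
            // Schedule B: sub and mul issue later (more clock gating in between).
            // Both perform the same three operations and produce the same three results,
            // so the first two terms are identical; only spills/fills could differ.
            auto energy = [&](int spills_fills) {
                return (e_add + e_sub + e_mul) + 3 * e_belt_per_result
                     + spills_fills * e_spill_fill;
            };
            printf("same spill count:  A=%.1f  B=%.1f\n", energy(0), energy(0));
            printf("extra spill in B:  A=%.1f  B=%.1f\n", energy(0), energy(1));
            return 0;
        }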

  • Art
    Participant
    Post count: 11

    Yes, Gold is a high end Mill member.

    Here is a more thorough enumeration of Gold’s 33 pipelines:

    • 8 pipes that do single output “reader” operations
    • 4 pipes that can do either integer (including multiply) or binary floating point (also including multiply) operations
    • 4 pipes that can only do integer (not including multiply) operations
    • 4 pipes that can do either immediate constant, load/store or control transfer (branch, call) operations
    • 4 pipes that can do either immediate constant or load/store operations
    • 4 pipes that can do pick operations
    • 5 pipes that can do “writer” operations

    There is a bit more to it than that, but those are the major units of interest.

  • Art
    Participant
    Post count: 11
    in reply to: Memory #517

    From following the link you give, I see the general outline of the idea. The challenge of implementing such a device will be to make it deterministic for larger numbers of cores and banks, which is where it would potentially have the greatest benefit. I did not see sufficient detail to make any determination regarding that aspect of the idea’s feasibility.

    That said, the problem being attacked is determining which core gets access to which bank. There are several sources of latency here: the arbitration mechanism latency, the “bank is busy” latency, routing latency, and RAM access latency. This idea only directly addresses the arbitration mechanism latency. By allowing a larger number of banks, it appears to indirectly help the “bank is busy” or “bank access collision” latency. Unfortunately, a larger number of banks or cores will also increase routing latency. So in the end, routing latency may merely replace arbitration latency as the performance-limiting factor.
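
    A crude numeric sketch of that concern (the cycle counts and the log2 growth model below are assumptions of mine, not figures from the proposal): even if arbitration is cut to a single cycle, routing delay that grows with the number of banks can become the largest non-RAM term.

        #include <cmath>
        #include <cstdio>

        int main() {
            // Invented, illustrative cycle counts. Assume the proposed arbiter cuts
            // arbitration to 1 cycle, and routing grows roughly with log2(banks)
            // as the interconnect gets physically larger.
            const double arbitration = 1.0, bank_busy = 2.0, ram_access = 10.0;
            for (int banks = 8; banks <= 256; banks *= 4) {
                double routing = std::log2((double)banks);   // crude growth model
                double total = arbitration + bank_busy + routing + ram_access;
                printf("%3d banks: routing %.1f of %.1f total cycles\n",
                       banks, routing, total);
            }
            return 0;
        }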

  • Art
    Participant
    Post count: 11
    in reply to: Instruction Encoding #391

    A consequence of the branch offset encoding that Ivan did not point out is that a branch to the entry of the current EBB always has an offset of zero, and therefore requires zero extra offset bits in the encoding. This compactness is independent of the size of the EBB, and it makes the encoding of the branch at the bottom of the loop very small for single-EBB loops.
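
    A tiny sketch of why that falls out of a variable-length offset field (the encoding rule below is invented for illustration and is not the Mill’s actual encoding): if the offset field may be absent entirely, an offset of zero costs zero bits.

        #include <cstdint>
        #include <cstdio>

        // How many offset bits a variable-length field would need for a given
        // (non-negative) branch offset, if the field may be entirely absent.
        static unsigned offset_bits(uint32_t offset) {
            unsigned bits = 0;
            while (offset) { ++bits; offset >>= 1; }
            return bits;                 // 0 for offset == 0
        }

        int main() {
            // A branch back to the entry of the current EBB has offset 0,
            // so it needs no offset bits at all, however large the EBB is.
            printf("offset 0    -> %u bits\n", offset_bits(0));
            printf("offset 1000 -> %u bits\n", offset_bits(1000));
            return 0;
        }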
