Forum Replies Created

Viewing 8 posts - 1 through 8 (of 8 total)
  • Author
    Posts
  • Art
    Participant
    Post count: 8

    I have thought avionics could be an application for Mill processors. Depending upon the customer, this could be an early application – especially if the customer has money and is patient (likely due to desperation), and is therefore willing to pay NRE charges up front for both the logic implementation and the circuit implementation.

    That said, there are a number of hurdles to overcome with such an application:

    1. Regulatory. Avionics is generally considered life-critical electronics, and is justifiably heavily regulated. Having worked in the aerospace electronics industry, I am very familiar with the types of regulatory barriers to entry, even though my experience is from decades ago.
    2. Bleeding edge technology outside of Mill Computing’s main domain of expertise. The technologies you talk about concerning different memory types, packaging, radiation hardening, lightning strike hardening, crash survival hardening, unusual packaging thermal design, and others that you do not mention all fall more in the design space of an avionics systems house. I know from work experience that these technologies, while not insurmountable, are definitely not trivial either.
    3. Accomplishing the IP licensing needed.
    4. Getting potential customer avionics houses interested in Mill Computing.

    So in short, yes – such ideas are feasible (for some value of feasible) as an application for a Mill architecture processor core. Whether it is the ideal initial market for the Mill will depend on a suitable strategic partner being willing to help make that happen. Do you know of any potential strategic partners in that space?

  • Art
    Participant
    Post count: 8
    in reply to: Threads/Coroutines #738

    This is discussed, at least briefly, here: http://millcomputing.com/topic/the-belt/. The description starts at slide 40, at 00:30:46 in the video.

    In short, the state is saved/restored by a hardware mechanism. All the metadata is saved/restored with it.

    The “call stack” on a Mill is not the same as on a conventional processor. On a Mill, data that is not part of programmer visible state (such as a function or interrupt return address) is saved/restored by the spiller hardware. If enough levels of interrupt and/or function call occur to push this data out to main memory, it goes to a completely separate memory block. Normally, except for debugging, this “saved state” region is protected from all access except by the spiller hardware. The programmer visible “call stack” (such as C automatic local function variables) on a Mill is in a memory privilege region containing only programmer visible variables.
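To make the separation concrete, here is a toy software model. It is only a sketch: the class `ToyMachine`, its methods, and the list `_saved_state` are hypothetical names invented for illustration, and nothing here reflects how the spiller hardware is actually built. It shows only the property described above, that program writes reach the data stack while return addresses live elsewhere.

```python
class ToyMachine:
    """Hypothetical model: locals and return addresses live in separate regions."""

    def __init__(self):
        self.data_stack = []    # programmer-visible locals only
        self._saved_state = []  # stand-in for the spiller-only region

    def call(self, return_address, locals_count):
        # The "hardware" saves the return address outside the data stack.
        self._saved_state.append(return_address)
        self.data_stack.extend([0] * locals_count)

    def write_local(self, index, value):
        # Program stores can only ever index into the data stack.
        self.data_stack[index] = value

    def ret(self, locals_count):
        # Drop the callee's locals, then restore the saved return address.
        del self.data_stack[len(self.data_stack) - locals_count:]
        return self._saved_state.pop()


machine = ToyMachine()
machine.call(return_address=0x100, locals_count=2)
machine.write_local(0, 0xDEAD)
machine.write_local(1, 0xBEEF)
visible = list(machine.data_stack)  # contains only the two locals
returned_to = machine.ret(locals_count=2)
```

In this model no sequence of `write_local` calls can change where `ret` goes, because the return address was never placed in memory the program can address.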

    This same spiller hardware mechanism is used by the OS to change thread contexts. How something like Unix fork is handled is NYF, and I will defer to Ivan to explain that when it is.

  • Art
    Participant
    Post count: 8

    Yes, this is good stuff indeed! I also read the paper, and the conclusions on standard CPUs are scary. I also note that there is specific AES hardware on current Intel E3, E5, and E7 series server processors – likely for both performance and vulnerability concerns, although I have no specific knowledge of an Intel claim that their AES hardware reduces vulnerability to timing attacks on encryption.

    I also noted in the paper that the tables could be eliminated through the direct use of the operations the tables are meant to replace, which on a Mill could actually be faster than a table lookup (or sufficiently fast), and certainly could be fixed latency. I have not looked at the complexity of AES in detail to see if that is indeed the case; it might not be. The AES “performance contest” was held on the CPUs of the day, and the Mill has characteristics that may lead to different implementations being optimal.
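As an illustration of what "direct use of the operations" can mean, the AES S-box can be computed without any table: take the multiplicative inverse in GF(2^8) and then apply the standard affine transform. The sketch below does that with a fixed sequence of operations and no data-dependent memory access. The function names are my own, and Python itself makes no timing guarantees, so this only shows the shape of the computation, not a hardened implementation or anything Mill specific.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1, always 8 rounds."""
    result = 0
    for _ in range(8):
        result ^= a & -(b & 1)                  # add a when the low bit of b is set
        high = (a >> 7) & 1
        a = ((a << 1) & 0xFF) ^ (0x1B & -high)  # multiply a by x, reduce if needed
        b >>= 1
    return result


def gf_inv(a: int) -> int:
    """Inverse as a**254; the exponent is a constant, and 0 maps to 0."""
    result, power, exponent = 1, a, 254
    for _ in range(8):
        if exponent & 1:        # depends only on the constant 254, not on the data
            result = gf_mul(result, power)
        power = gf_mul(power, power)
        exponent >>= 1
    return result


def rotl8(value: int, shift: int) -> int:
    """Rotate an 8-bit value left by shift bits."""
    return ((value << shift) | (value >> (8 - shift))) & 0xFF


def sbox(byte: int) -> int:
    """AES S-box entry computed directly: field inverse, then affine transform."""
    inv = gf_inv(byte)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63
```

Whether this beats a table lookup depends on how many of these operations a given machine can issue per cycle, which is exactly the open question raised above.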

    As for a hardware AES box for the Mill, I suspect that dedicated (dynamically configurable) hardware to compute the algorithm may well be a better implementation than using tables in fixed-latency SRAM. At least such an implementation should be investigated, rather than assuming the implementation for old CPU ISAs will also be optimal for direct hardware.

  • Art
    Participant
    Post count: 8
    in reply to: Memory #520

    Unfortunately, all of the places where I have seen online papers about multi-core interconnect have been various university sites. I don’t see any of them with a proper forum for discussion.

  • Art
    Participant
    Post count: 8
    in reply to: The Compiler #2086

    If so, is there an advantage to issuing instructions late as opposed to issuing them as soon as their arguments are available? In this particular case one could, at least naïvely, think that larger part of the logic could have been clock gated on cycles 3 and 2, possibly leading to power savings if ‘sub’ and ‘mul’ would have been issued together.

    The TL;DR answer: there is no power difference as long as the number of belt value spills/fills remains the same.

    As far as power savings due to clock gating are concerned, there are two major factors:

    1. The power consumed in performing the operation itself (the add, sub, mul, etc.)
    2. The power consumed in maintaining operands on the belt

    The power consumed by the operation itself is independent of when the operation is performed. It is what it is, the same every time for a particular set of input values. When an operation is not being performed by a particular functional unit its clock is gated off. When the clock is gated off, the functional unit power consumption is only that due to static leakage current.

    The power consumed in maintaining an operand on the belt is nearly constant and depends greatly upon the number of new result values arriving each clock times the number of potential destinations for each result.

    The conclusion is that the biggest factor in reducing power is the number of belt spills/fills that must be performed. The lower the number of spills/fills, the lower the power.
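The argument can be written down as a toy energy model. Everything numeric here is made up for illustration, and the function names and the simple additive form are my assumptions, not measured Mill behavior. The point is only that issue time does not appear in the formula, while the spill/fill count does.

```python
def belt_update_activity(new_results_per_cycle: int, destinations_per_result: int) -> int:
    """Relative belt activity: arriving results times potential destinations."""
    return new_results_per_cycle * destinations_per_result


def schedule_energy(op_energies, spills_fills: int, energy_per_spill_fill: float) -> float:
    """Total energy: the operations themselves plus the spill/fill traffic.

    Issue cycles are deliberately not a parameter: a gated-off functional
    unit is assumed to cost only leakage, which is ignored here.
    """
    return sum(op_energies) + spills_fills * energy_per_spill_fill


# Hypothetical numbers: the same 'sub' and 'mul', issued early or late.
ops = [1.0, 3.0]
early_issue = schedule_energy(ops, spills_fills=0, energy_per_spill_fill=2.0)
late_issue = schedule_energy(ops, spills_fills=0, energy_per_spill_fill=2.0)
with_spills = schedule_energy(ops, spills_fills=2, energy_per_spill_fill=2.0)
```

Under these assumptions `early_issue` and `late_issue` are equal, and only the schedule that forces extra spills/fills costs more.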

  • Art
    Participant
    Post count: 8

    Yes, Gold is a high end Mill member.

    Here is a more thorough enumeration of Gold’s 33 pipelines:

    • 8 pipes that do single output “reader” operations
    • 4 pipes that can do either integer (including multiply) or binary floating point (also including multiply) operations
    • 4 pipes that can only do integer (not including multiply) operations
    • 4 pipes that can do either immediate constant, load/store or control transfer (branch, call) operations
    • 4 pipes that can do either immediate constant or load/store operations
    • 4 pipes that can do pick operations
    • 5 pipes that can do “writer” operations

    There is a bit more to it than that, but those are the major units of interest.
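As a quick check on the arithmetic, the list above can be tallied in a few lines. The dictionary keys are just my shorthand labels for the bullet items, not official unit names.

```python
# Pipe counts copied from the enumeration above; labels are informal shorthand.
GOLD_PIPES = {
    "reader": 8,
    "integer or binary float (with multiply)": 4,
    "integer only (no multiply)": 4,
    "constant, load/store or control transfer": 4,
    "constant or load/store": 4,
    "pick": 4,
    "writer": 5,
}

total_pipes = sum(GOLD_PIPES.values())

# Derived counts: how many pipes can take a given kind of operation.
integer_capable = (GOLD_PIPES["integer or binary float (with multiply)"]
                   + GOLD_PIPES["integer only (no multiply)"])
load_store_capable = (GOLD_PIPES["constant, load/store or control transfer"]
                      + GOLD_PIPES["constant or load/store"])
```

The total comes to 33, with 8 pipes able to take integer operations and 8 able to take loads or stores.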

  • Art
    Participant
    Post count: 8
    in reply to: Memory #517

    From following the link you give, I see the general outline of the idea. The challenge of implementing such a device will be to make it deterministic for larger numbers of cores and banks, which is where it would potentially have the greatest benefit. I did not see sufficient detail to make any determination regarding that aspect of the idea’s feasibility.

    That said, the problem being attacked is determining which core gets access to which bank. There are several sources of latency here: the arbitration mechanism latency, the “bank is busy” latency, routing latency and RAM access latency. This idea only directly addresses the arbitration mechanism latency. By allowing a larger number of banks, it appears to indirectly help the “bank is busy” or “bank access collision” latency. Unfortunately, a larger number of banks or cores will also increase routing latency. So in the end, routing latency may merely replace arbitration latency as being the performance limiting factor.
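A crude model makes the trade-off visible. All constants and both scaling rules below are assumptions I picked for illustration (collision chance falling as banks are added, routing depth growing with the logarithm of the larger of banks and cores); none of them come from the linked proposal.

```python
import math


def latency_breakdown(banks: int, cores: int,
                      arbitration: float = 1.0,
                      ram_access: float = 4.0,
                      busy_penalty: float = 4.0,
                      routing_per_stage: float = 0.5) -> dict:
    """Return the four latency components in arbitrary cycle units."""
    # Assumed: chance of hitting a busy bank shrinks as banks are added.
    collision_chance = min(1.0, (cores - 1) / banks)
    # Assumed: routing depth grows like a tree over the larger endpoint count.
    routing_stages = math.ceil(math.log2(max(banks, cores)))
    return {
        "arbitration": arbitration,
        "bank_busy": collision_chance * busy_penalty,
        "routing": routing_stages * routing_per_stage,
        "ram_access": ram_access,
    }


few_banks = latency_breakdown(banks=8, cores=8)
many_banks = latency_breakdown(banks=1024, cores=8)
```

With these made-up numbers, going from 8 to 1024 banks nearly removes the bank-busy term but more than triples the routing term, which ends up larger than the arbitration latency the idea set out to remove.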

    • This reply was modified 6 years, 8 months ago by  Art.
  • Art
    Participant
    Post count: 8
    in reply to: Instruction Encoding #391

    A consequence of branch offset encoding that Ivan did not point out is that a branch to the entry of the current EBB always has an offset of zero, and therefore requires zero extra offset bits in the encoding. This compact encoding size is independent of the size of the EBB, and makes the encoding of the branch at the bottom of the loop very small for single EBB loops.
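The effect is easy to see with a small sketch. The helper `offset_bits` and the comparison against a branch-relative offset are my own illustration; the real Mill encoding has more to it than a bare signed field, so treat the bit counts as indicative only.

```python
def offset_bits(offset: int) -> int:
    """Minimum two's-complement bits for a signed offset; zero needs none."""
    if offset == 0:
        return 0
    if offset > 0:
        return offset.bit_length() + 1
    return (~offset).bit_length() + 1


def loop_branch_bits(ebb_size: int) -> tuple:
    """Offset bits for a branch at the bottom of a single-EBB loop.

    Returns (entry_relative, branch_relative). The branch-relative case is a
    hypothetical encoding in which the offset is measured back from the branch.
    """
    entry_relative = offset_bits(0)            # target is the EBB entry itself
    branch_relative = offset_bits(-ebb_size)   # jump back over the whole body
    return entry_relative, branch_relative


small_loop = loop_branch_bits(ebb_size=16)
large_loop = loop_branch_bits(ebb_size=4096)
```

The entry-relative cost stays at zero bits for both loops, while the branch-relative cost grows with the size of the loop body.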
