Optimizing Mill System Design: The view from Low Earth Orbit

  • BobC
    Post count: 10

    Ivan often talks about the view from 40,000 feet. I need to go a bit higher, outside the chip to the system level.

    I’m an embedded real-time developer who also does target system design that overlaps the digital, electrical, electro-mechanical, communications and robotic domains. I’m best at developing algorithms then implementing them in the smallest platform needed to get the job done “on time”. The targets range from an ARM M0 up to a multi-core ARM v8, sometimes with multiple instances connected together.

    My algorithm development often requires tons of simulation/modeling and big-data work (statistical analysis), from large static data archives to real-time streaming video, and this work can easily consume the small-ish Xeon cluster I use. Since I have a cluster available, I often start my implementation using QEMU to emulate ARM-based target system candidates. But it is a pain to do cross-architecture emulation.

    The Mill architecture presents some interesting opportunities for someone like me, with a view to using a single scalable architecture from my cluster all the way down to my embedded systems. In particular, the thought of emulating a number of Mill Minis on a Maxi seems like a dream come true: I’d tell the Mill compiler to emit code for a Maxi, but constrained to use only the resources and timing of a specific Mini specification. Basically, zero-overhead CPU emulation.

    Ideally, this *could* facilitate hardware-software co-design for the embedded target, making the process both faster and simpler, and thus far more practical. At present, the two are often separate project phases.

    Which leads me to the real topic of this post: Let’s say algorithm development and testing has been completed, and I have working code for my system: Can/will the Mill specification tools be expanded to help find/evolve the ideal target hardware solution?

    The only hard constraint would be the target system timing requirements. All other constraints could be expressed as “preferences”. For example, total system power could be expressed as “1-5 watts, with lower power strongly preferred.”

    A typical optimization would concern memory speed: Given the memory demands and the system timing requirements, what memory architecture works best overall? Multiple slow channels or fewer fast ones or something in-between?

    This, in turn, would push back into the Mill itself: For a specific system, would multiple simple Mill cores be preferred over fewer larger cores, or just a single big core? What are the trade-offs between a single Mill core per memory channel, versus multiple Mill cores sharing a single memory channel, versus a larger Mill core using multiple memory channels?

    We have lots of hardware terrain to explore to find the best target system. Can that search be simplified? Can/will the Mill architecture itself contribute to that simplification?
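
    A minimal sketch of the kind of search described above: timing is the only hard constraint, and power is a soft preference used for ranking. This is not a Mill tool. The `Candidate` fields, the timing and power models, and every constant are hypothetical placeholders chosen only to make the example run.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Candidate:
    cores: int         # number of CPU cores
    channels: int      # number of memory channels
    channel_mbps: int  # bandwidth per channel, MB/s


def frame_time_ms(c, work_mops=200.0, traffic_mb=8.0, core_mops_per_ms=1.0):
    """Toy timing model: compute and memory traffic overlap perfectly."""
    compute_ms = work_mops / (c.cores * core_mops_per_ms)
    memory_ms = traffic_mb / (c.channels * c.channel_mbps) * 1000.0
    return max(compute_ms, memory_ms)


def power_watts(c):
    """Toy power model: linear in core count and total memory bandwidth."""
    return 0.4 * c.cores + 0.0005 * c.channels * c.channel_mbps


def preference_cost(c, soft_max_watts=5.0):
    """Soft preference: lower power is better, exceeding the range costs extra."""
    watts = power_watts(c)
    return watts + 10.0 * max(0.0, watts - soft_max_watts)


def search(deadline_ms):
    """Keep candidates that meet the deadline, best preference first."""
    space = [Candidate(n, ch, bw)
             for n, ch, bw in product((1, 2, 4, 8), (1, 2, 4), (400, 800, 1600))]
    feasible = [c for c in space if frame_time_ms(c) <= deadline_ms]
    return sorted(feasible, key=preference_cost)


if __name__ == "__main__":
    for c in search(deadline_ms=60.0)[:3]:
        print(c, round(frame_time_ms(c), 1), "ms", round(power_watts(c), 2), "W")
```

    A real tool would replace the two toy models with simulator runs; the structure (filter on the hard deadline, rank on preferences) is the part this sketch is meant to show.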

    A common problem involves I/O: Sometimes it is best for the CPU to read directly from the hardware, other times it is best for a DMA channel to move data from the hardware to memory. But using DMA means managing memory contention, resulting either in adding a separate memory channel for DMA, or increasing the bandwidth of a shared channel. Not to mention cache issues.

    With the Mill, I also have the option to scale the CPU itself. But this would require having a rich set of options available in the market, since I have no desire to roll my own silicon. And that pushes our optimization strategy out into the market, which in turn involves the Mill Computing business model.

    Will Mill SoCs be available from multiple competing vendors, like ARM, or will they be constrained to a single vendor, like TI DSPs? Will the Mill be available as a family of architecture and RTL licenses, in addition to whatever chips Mill Computing chooses to fabricate?

    As someone in the embedded market, I very much hope the ARM approach is used, with silicon from Mill Computing existing mainly to 1) provide reference implementations, 2) seed the chip market, and 3) provide development systems.

    I’d like to see a rich and diverse Mill SoC market that will provide not only my next cluster, but also the wide range of embedded system targets I need to deploy.

    The key will be not just hardware diversity and availability (and the Mill business model), but also the tools to explore system architectures within that domain, first with emulation, then in hardware selection.

    Is Mill Computing working on such system-level tools?

  • Ivan Godard
    Post count: 689

    Wow! Deep questions. And few answers – but I’ll do my best in no particular order.

    The current business model is Intel (or TI if you prefer): mostly a chip vendor, with substantial side businesses in IP, custom, and boards. An ARM-like model is a fall-back option.

    We expect to expose our specification and sim tools to the customer base and research community, and quite likely to the public. With those tools and a chunk of NRE, we’ll give you your chip, or build hard macros for use in a chip of your own.

    We have no minimax searching tool such as you describe in view. Given customer demand, sure.

    Specializer-driven emulation of one Mill on another is possible, but won’t give you the modelling you are looking for: members differ too much in ways not driven by the code. Nothing we can do in the code can make one Mill act like another with half the icache. For accurate modelling you’d need to run our sim on your big Mill; the sim models the non-code differences.

    Currently our sim greatly abstracts at the system level. In particular we do not try to model anything across the pins. For example, we simply spec a fixed number of picos for DRAM access, and ignore the variation induced by charging, bank switch, and the like. Similarly we do not model i/o devices, so the sim would be no help in trading memory for i/o.
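
    To illustrate what a fixed-cost DRAM abstraction leaves out, here is a toy comparison between a flat per-access latency and a simple open-row model. It is not the Mill sim. All latencies, the row size, and the bank count are hypothetical values picked for the example.

```python
FIXED_DRAM_PS = 60_000  # hypothetical flat cost per access, in picoseconds
ROW_HIT_PS = 30_000     # hypothetical cost when the bank's open row matches
ROW_MISS_PS = 90_000    # hypothetical cost when a new row must be opened
ROW_BYTES = 2048        # hypothetical DRAM row size
NUM_BANKS = 8           # hypothetical bank count


def fixed_latency_ps(addresses):
    """Flat model: every access costs the same, whatever the pattern."""
    return len(addresses) * FIXED_DRAM_PS


def row_aware_latency_ps(addresses):
    """Open-row model: an access is cheap only if its bank has that row open."""
    open_rows = {}  # bank index -> currently open row
    total = 0
    for addr in addresses:
        row = addr // ROW_BYTES
        bank = row % NUM_BANKS
        if open_rows.get(bank) == row:
            total += ROW_HIT_PS
        else:
            total += ROW_MISS_PS
            open_rows[bank] = row
    return total


if __name__ == "__main__":
    sequential = [i * 64 for i in range(64)]                  # walks two rows
    strided = [i * ROW_BYTES * NUM_BANKS for i in range(64)]  # new row each time
    for name, trace in (("sequential", sequential), ("strided", strided)):
        print(name, fixed_latency_ps(trace), row_aware_latency_ps(trace))
```

    The flat model reports the same total for both traces, while the open-row model separates them widely. That gap is the variation a fixed picosecond figure ignores.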

    And all these will no doubt change as the company evolves. In particular, the funders of our next round will have a big say in the answers here.

  • BobC
    Post count: 10

    The embedded target environment does chunk well enough, so a handful of SoCs may cover much of it. Selling hard macros should service the gaps well enough, especially if you use standard interfaces, specifically AMBA. The key will be the timely conversion of a Mill specification to a hard macro that will work on the client’s fab and process.

    I’m interested in where the smallest viable Mill CPU would fit relative to existing embedded processors (especially the ARM M0-4 family), primarily on the power/performance metric. If my embedded system can get more processing for less wattage, I win. The simplification of both the power supply and thermal management can easily offset a significant cost differential for the processor. Can’t wait to see data from the first complete FPGA implementation!

    Speaking of FPGAs: As a nearly incompetent FPGA designer, I have found it difficult to debug using testbench circuitry, or signal pins and a logic analyzer, or via JTAG. My preferred approach is to always include a rudimentary 8-bit processor within the FPGA to test each subsystem. Once the design works, I use the 8-bit processor for monitoring and logging. The resources used are small, the flexibility huge. And I can write 8-bit code better than I can create FPGA testbenches.

    There have been design situations where I’ve had to consider offloading some processing to an FPGA or GPU, though so far I’ve always been able to squeak by via careful optimization and/or a multicore bump. I’ll be very interested to see the workloads an embedded Mill SoC can support.

    When you mention the influence of the next round of funders, I presume you really mean the influence of the resulting Board of Directors. Even the greatest cash-equity deal can (and should) be sunk by poor BoD nominees or by unwarranted intrusions into the C-suite. I’ve seen the price of errors in this area. One tactic I’ve seen work is to identify BoD members and executives you’d like to have (but who won’t take your calls), then select funders who will recruit them. The same applies to lining up other partners and even early customers: Identify them in advance, then find the funders who can bring them aboard. Funders must provide far more than funding!

    Another thing to be wary of is VCs or banks partnering you with other startups. It multiplies your risk rather than reducing it, since their problems become your problems, and vice-versa. The sole exception is when synergy exists in a merger, which can be a great (optimal?) way to get lots of talent in a single transaction. I once helped with the engineering side of such a merger, and it turned out well for all involved.

    • Ivan Godard
      Post count: 689

      The Mill is by definition a single-address-space (SAS) system, so the constraint on the low end is address space. If a chip has no MMU (all addresses are physical), then it is effectively SAS, so if you can fit in that then you can fit in a Mill of the same size. On a 64-bit Mill the spillet matrix occupies 2^50 bytes of address, so the number of distinct thread and turf ids has to drop sharply as the space goes down, but for embedded the number of threads/turfs is probably statically known.
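
      A back-of-envelope sketch of that scaling. It assumes the matrix holds one fixed-size spillet per (thread id, turf id) pair. The 4 KiB spillet size and the 2^20-byte budget for a small part are hypothetical numbers, not Mill specifications; only the 2^50 figure comes from the post above.

```python
def max_id_pairs(matrix_address_bits, spillet_bytes):
    """Number of (thread id, turf id) pairs that fit in the matrix's address range.

    Assumes one fixed-size spillet per pair (an illustrative assumption).
    """
    return (1 << matrix_address_bits) // spillet_bytes


if __name__ == "__main__":
    SPILLET_BYTES = 4096  # hypothetical spillet size

    # 64-bit Mill: 2^50 bytes of address for the matrix (figure from the post).
    print(max_id_pairs(50, SPILLET_BYTES))

    # Hypothetical small embedded part that can spare only 2^20 bytes of address.
    print(max_id_pairs(20, SPILLET_BYTES))
```

      Under these assumptions the small part has room for only 256 pairs, for example 16 threads by 16 turfs, which is workable only when the counts are known statically.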

      There’s also no architectural need for caches if you are going straight to on-chip memory. You’d probably want to keep the I$0 microcache anyway. The FUs and all the operand paths can have arbitrary width, but should probably not be less than pointer size to avoid code explosion. The predictor could be dropped if frequent misses are tolerable.

      There are architectural overheads that are largely independent of address space and operand size: the specRegs, the decoders, others. As the Mill size shrinks these fixed costs start to be more important. It’s unclear when they would tilt a choice against the Mill.
