Mill Computing, Inc. Forums › The Mill › Markets › Is binary translating i386/x86_64 to Mill code practical?

  • mikeakers
    Participant
    Post count: 2
    #1319

    Thinking about laptop/desktop applications for the Mill, I’m wondering how practical it would be to do binary translation from something like i386 or x86_64 to Mill instructions. Would there be too much of a mismatch between the register file the source binary expects and the belt? It seems like lots of other problems would crop up too.

    I’m thinking of something along the lines of Apple’s Rosetta (PPC -> x86) translator or DEC’s FX!32 (i386 -> Alpha).

    I recently heard about a project that converts compiled x86 to LLVM IR, which can then be compiled into a new binary. Maybe this approach would work if there’s an LLVM backend for the Mill. See https://github.com/trailofbits/mcsema for more.

    The Mill is a very interesting design, keep it up!

  • Ivan Godard
    Keymaster
    Post count: 689

    Binary translation is as practical on the Mill as on any other architecture; the task is essentially the same for any S-to-T translation, including with the Mill on either end. You un-model the source to an abstract representation similar to a compiler’s IR, optimize that to a fare-thee-well to remove redundant state setting (like condition codes), and then do a normal target code gen from the result.
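    To make the lift/optimize/codegen pipeline concrete, here is a toy sketch. The mini-ISA, the IR node layout, and the function names are all invented for illustration; the only real idea taken from the post is eliminating condition-code writes that no later instruction reads.

    ```python
    def lift(insns):
        """Lift (op, dst, src) tuples into IR nodes; ALU ops also set FLAGS."""
        return [{"op": op, "dst": dst, "src": src,
                 "sets_flags": op in ("add", "sub", "cmp")}
                for op, dst, src in insns]

    def optimize(ir):
        """Kill flag writes that are overwritten before any flag-reading op."""
        out = []
        pending = None  # index of the last flag write not yet consumed
        for node in ir:
            if node["op"] == "jcc":          # consumes FLAGS
                pending = None
            elif node["sets_flags"]:
                if pending is not None:
                    out[pending]["sets_flags"] = False  # dead flag write
                pending = len(out)
            out.append(node)
        return out

    def codegen(ir):
        """Emit target 'instructions' as strings, for illustration only."""
        return [f'{n["op"]} {n["dst"]}, {n["src"]}'
                + (" ; set flags" if n["sets_flags"] else "")
                for n in ir]

    source = [("add", "a", "b"), ("add", "a", "c"),
              ("cmp", "a", "d"), ("jcc", "L1", "-")]
    print(codegen(optimize(lift(source))))
    # only the cmp keeps its flag update; the two adds' flag writes are dead
    ```

    A real translator does the same thing at scale: because most x86 flag results are never tested, discarding them is one of the biggest wins of translating through an IR rather than interpreting instruction by instruction.
    
    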

    The difficulty, for all translations, is when the source code embeds implicit assumptions about the platform. Self-modifying code is obvious, but code often assumes word and pointer sizes, endianness, the presence (and behavior) of I/O devices, and so on.

    There are platform dependencies beyond the ISA too. You may have a Windows x86 binary and translate it to execute on a SPARC or a Mill, but without a Windows OS to run on, it’s not going to be able to draw pictures on the screen.

    So the issues with binary translation are roughly the same as for any other port even if you have the source to recompile. If a given code can already successfully run on big- and little-endian machines using both 32-bit and 64-bit models then it is probably port-clean enough to translate well. But even with perfect ISA translation, the port may not succeed.

  • eversl
    Participant
    Post count: 3

    If you manage to compile QEMU or Bochs or some other emulator, you should be able to run a complete x86 operating system. It will cost you a great deal in speed, though, unless you go out of your way to produce heavily optimised Mill code from the x86 instructions in a JIT-like fashion. That is essentially what Transmeta did with their Crusoe processors.
    (And maybe that Soft Machines CPU as well)

    From what I understand so far from the published materials, the Mill seems designed well enough not to suffer from the pitfalls that prevented the Transmeta chips from really performing as promised (large VLIW instruction words, a small 64MB translation cache, low memory bandwidth…). On top of that, it would be possible to provide a ‘super speed mode’ that runs applications compiled to the native Mill instruction set right inside an x86 (or x64) Linux or Windows.

    Sounds like that would make for an appealing desktop or server product. Maybe such a system would benefit from some extra instructions on the Mill CPU that take care of some performance-critical stuff (TLB lookups or fast x86 register restores come to mind).

    Would you consider adding some instructions for this purpose?

    • Ivan Godard
      Keymaster
      Post count: 689

      Probably not.

      Directly simulating a general-register machine requires a way to preserve updateable state that in the target would be in registers. The only updateable state on a Mill is memory, so performance would be abysmal. Then there would be problems providing the x86 memory semantics, which are weaker than the Mill’s.
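      The cost Ivan describes is easy to see in a naive interpreter sketch. The instruction format and counters below are invented; the point is only that the simulated machine’s register file must live in updateable storage, so on a Mill every guest register read or write in the loop becomes a memory access.

      ```python
      regs = [0] * 16      # guest register file: memory-resident state on a Mill
      mem_ops = 0          # count of register-file loads/stores

      def step(insn):
          """Interpret one guest instruction, counting register-file traffic."""
          global mem_ops
          op, d, s = insn
          if op == "mov_imm":
              regs[d] = s                      # one store
              mem_ops += 1
          elif op == "add":
              regs[d] = regs[d] + regs[s]      # two loads + one store
              mem_ops += 3

      program = [("mov_imm", 0, 5), ("mov_imm", 1, 7), ("add", 0, 1)]
      for insn in program:
          step(insn)

      print(regs[0], mem_ops)   # 12 result, 5 register-file accesses
      ```

      Three guest instructions already cost five trips to the in-memory register file, which is why translation (which can keep values live on the belt across many guest instructions) beats direct interpretation.
      
      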

      But more to the point: binary translation has gotten pretty good these days, so there seems little reason to directly interpret any other chip’s native instruction set. We expect to include a (verrry slow) interpreter for use with device ROMs that contain x86 code when the device is needed by the BIOS. Or maybe we can avoid the problem some other way; hard to tell until we get further along.

      • eversl
        Participant
        Post count: 3

        Indeed, binary translation would be the way to get reasonable (or even good) speed while running a non-native instruction set. It’s just that there are probably lots of corner cases that would require complete emulation of the x86 CPU and system (precise exception semantics and probably the x86 memory addressing), at least some of the time.
        Anyway, this has been done before on multiple CPU architectures, so there is probably also a way to do it on a Mill CPU. It just made me think that the wholly admirable goal on which Transmeta was based — running x86 code on a simpler, low-power CPU by doing all of the instruction decoding and scheduling in software — might be within reach as just one of the possible applications of a Mill CPU. Running code compiled for the Mill will be even better, but that requires recompiling from source and might not work out of the box, at least initially, so having the option of x86 binary compatibility probably helps to make inroads in markets like desktop and server systems.

        It’s not a trivial thing to build a fully compatible binary translator however, so I can imagine this is not yet on the radar for some time to come.

        • Will_Edwards
          Moderator
          Post count: 98

          It’s possible that the lock-in of legacy proprietary apps is no longer the barrier to new ISAs that it used to be. The next generation of OSes is being built around running a browser and nothing else (Chrome OS, Firefox OS, etc.). The other day I tried to run an old program on the latest version of Windows and it wouldn’t run; I had to resort to running it in Wine on Linux (where it was quite happy!).

          That aside, business is not my forte, so let’s keep things technical 🙂 Binary translation, e.g. McSema from x86-64 to LLVM IR, would allow the code to be optimized and then retargeted through the Mill’s LLVM backend. For on-the-fly emulation of individual programs running in the host environment, you could imagine something more like the hot translation that Valgrind does. But emulating a whole OS (memory management and all) likely needs a conventional VM approach.

          • eversl
            Participant
            Post count: 3

            I agree that with web apps running in JavaScript and wide use of Java and other VMs, legacy binaries are much less of an issue now. If Google or someone else decides to build a Mill-based Chromebook or Android device, we would all be quite happy. And if x86 compatibility becomes an issue, it will be a matter of writing the code for it. It’s been done before…

            But I take it from both of your reactions that x86 (or any other CPU) compatibility is not part of the plan for Mill CPUs (unlike the Soft Machines guys, where it clearly is). If the power and speed numbers come out as planned, I guess there’s plenty of opportunity for a Mill anyway.

    • Witold Baryluk
      Participant
      Post count: 33

      About qemu on Mill: obviously it would be trivial to run qemu or Bochs on a Mill, and it will probably compile out of the box with zero changes. However, qemu is not designed for emulation speed. It does only JIT (so no AoT), takes a long time to start up and warm up, consumes memory for both the source and the translated code, and the generated (JITed) code is of very poor quality (like 5 to 10 times worse than what a normal compiler generates for the original code). There are very few optimizations in qemu to make the JITed code fast, only some minor things like patching direct and indirect jumps and removing some condition-code checks, but no code motion, no control-flow recovery, no advanced register allocator, no instruction reordering, etc.

      The purpose of qemu’s emulator code (TCG) is to be only reasonably fast and VERY portable (the TCG virtual machine has, I think, only 3 registers, for example, which means you underutilize a lot of the hardware, lose data-flow information, and add a lot of extra moves to memory; sure, that can be improved or recovered, but again, that is slow). So it will run on a Mill, just like it runs on 20 other architectures. But don’t expect magic in terms of speed, even on a Mill.
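      The extra memory moves Witold mentions can be sketched in a toy translator. This is not qemu’s actual TCG IR; the instruction strings, the `cpu.rN` state slots, and the two host temporaries are all invented to show how a portable per-instruction translator spills every guest register access through an in-memory CPU state.

      ```python
      def translate(guest_insns):
          """Expand each guest ALU op into load/op/store host pseudo-ops."""
          host = []
          for op, d, s in guest_insns:
              if op == "add":
                  host += [f"ld t0, cpu.r{d}",    # guest reg -> host temp (load)
                           f"ld t1, cpu.r{s}",
                           "add t0, t0, t1",
                           f"st t0, cpu.r{d}"]    # result back to memory (store)
          return host

      out = translate([("add", 0, 1), ("add", 0, 2)])
      print(len(out), "host ops for 2 guest ops")   # 8 host ops for 2 guest ops
      ```

      An optimizing translator would notice that `cpu.r0` is reloaded immediately after being stored and keep the value live across both adds; a simple portable one, built for 20 host targets, generally does not.
      
      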

      Valgrind is extremely slow. Its purpose is debugging, not speed.

      There are other binary translation projects, but most of them don’t focus on speed or cross-emulation; they are more about changing (including runtime optimization of) a native binary on the fly for various purposes.

      Writing a proper translator (that could later be integrated into qemu) is obviously possible, and there have been many hybrid optimizing AoT/JIT translators that showed one can achieve very good results. See FX!32, Rosetta 1, Rosetta 2, box86. Microsoft also has a pretty decent x86-to-ARM dynamic recompiler.

      It would be much better to reuse qemu where it makes sense (Linux virtio, chipset, USB, networking, storage, etc.), but write a specialized JIT module, or optimize a lot out of the TCG in qemu. Some target pairs in qemu do have some extra non-generic optimisations, so it is totally doable to, for example, write amd64-to-Mill-specific code.

      However, at the end of the day, is it really that important?

      If the Mill is 10 times more efficient and faster, then for a lot of applications you don’t really need any binary translation, because you can just compile things for optimal performance. And you would want that anyway to actually consider a Mill, because otherwise you are wasting hardware potential. That is why the Mill is targeting generic server workloads, a bit of HPC maybish, and some generic Linux stuff. 99% of interesting workloads are developed in-house, so they can be recompiled, or are open source, so they can also be recompiled.

      Will you be able to run Oracle or Microsoft databases on a Mill? Probably not initially. Once the hardware is out, either important software will be ported (just look at how quickly the Apple M1 got adopted by software developers: thousands of proprietary programs have already been ported to run natively on the M1, all because a) the hardware is fast, so there is incentive to do so, and b) the hardware is easily accessible to developers), or the open source community will write a dynamic recompiler, or two, or three. Mill Computing, Inc. doesn’t have the resources of Apple to do it on their own at product launch. Damn, even IBM doesn’t have the resources to do that with their POWER and s390x systems.
