Forum Replies Created

Viewing 15 posts - 16 through 30 (of 674 total)
  • Ivan Godard
    Keymaster
    Post count: 689

    DFP is defined in the ISA, so it could be done in hardware. But only if the market demanded it.

  • Ivan Godard
    Keymaster
    Post count: 689

    Wow!

    There are conceptual issues to what you suggest, and implementation ones.

    The biggest problem is that you assume that “add” can be disambiguated with a type. It can’t, because there are multiple flavors of “add” within each type. Floating-point adds have a rounding-mode specification, while integer adds have an overflow behavior (saturate, fault, truncate, widen). To make that specification for any given op requires that the author, and later the generated code, assume a type for which the spec is meaningful. And if you make such a static assumption then you might as well put the type in the static opcode (addf vs addb vs addu vs …) and omit it from the (expensive) metadata.

    Then there’s implementation. Real code has an immense number of silent type transformations, each of which would need to become an explicit op (to change the metadata); that would explode the code size and add to the latency of a dataflow.

    And there are hardware issues. For area reasons, op kinds are grouped so that related operations are done by related hardware, known as a Functional Unit or FU: ALU, FPU, LSU, etc. Instruction parsing breaks the binary apart to determine which of the several available FUs executes the particular instruction, and then routes the rest of the instruction to that FU, where further decode and execution happen.

    However, the nature of what the ops are to do breaks apart naturally by type: integer operations go to the ALU and so on. In a static system, once the FU is decided the rest of decode and setup can be done in parallel with data-argument fetch, using data paths that go only to the selected FU. In a type-dynamic system, the decoder could not choose the FU until the arguments had been fetched and the metadata examined, and then the lot, including the args, would be shipped to the FU. I think you can see that this would add at least a cycle to every operation; you probably cannot appreciate what a rat’s nest it would make of data routing, but take my word that it’s bad.

    Meta-typing looks plausible for single-issue when performance is not an issue. It doesn’t work when there are large numbers of instructions being executed concurrently and performance matters.

    So much for tutorial 🙂 I’m going to let this topic drop at this point.

  • Ivan Godard
    Keymaster
    Post count: 689

    No idea what happened to the original post; it doesn’t show on the Forum logs. However, I can address the question implied by the title: types in metadata.

    A dynamic language like Smalltalk must keep a runtime representation of type in order to resolve the meaning of the executable. This gives second-order power and flexibility: with cleverness and Smalltalk you can brush your teeth. But type resolution is not fast.

    Most programs, even those in second-order languages, are in fact first-order and bindings can be fully resolved at compile time. The rare cases where you really need second-order you can kludge around by having (first-order) programs write (first-order) programs which are then compiled. Our software is full of such generators – it’s how we can retarget for different family members based on a specification. But all steps are compiled, and everything that can be done at compile time is done then, which includes all static typing. So there’s no reason to carry type in the metadata at runtime.

    The compile step also elides type transformations that are no-ops in the execution. With meta-typing, every transform of an operand from int to unsigned would have to be an explicit runtime op to change the metadata, even though it did nothing to change the operand value. Think how many places Foo is changed to const Foo. Ouch.

  • Ivan Godard
    Keymaster
    Post count: 689

    All operand widths of executing code are known at compile time; there are no dynamic-sized operands in the architecture.

    Widening ops widen unconditionally, not just on overflow. The result is a normal operand of the widened width, and subsequent operations take as long as they would for anything of that width. Latency varies by target member; our test targets are specified to do an 8-byte add in one cycle and a 16-byte add in two.

  • Ivan Godard
    Keymaster
    Post count: 689

    There are no such instructions, and the Forum comment you cite still applies.

    SIMD extensions in general are of use primarily for bit polishing. The Mill design approach favors generality and elegance over specialized operations whose benefit may not be visible in the noise of general purpose code. Absent (paid) customer demand we’d wait until the overhead that such instructions would reduce is more than 1% of cycles.

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: Carryless multiply #3854

    There’s no such instruction, and in general the base ISA avoids such specialized usage. Nothing prevents adding it to an FU and that FU to a particular specialized member; you’d get at it via an intrinsic. The auto-generated include files have definitions of intrinsics for all instructions in the target, and the specializer schedules them into the correct slots.

    However, I suspect that anyone adding operations for such a purpose would add a more general Galois product. That would be straightforward for {b*b}[d], i.e. two byte values and a 64-bit rule. Where it gets harder in hardware is for larger arguments and correspondingly larger rules – {8*8}[4096] for example. Then you have tough problems with dynamically specifying the rule, and what to do if different processes both want to use the hardware with different rules and … So the base architecture stays out, and lets the customer with specific needs decide what they want and pay the NRE.

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: pdep and pext #3884

    Find first one is in the ISA, which suffices for scalars. Bit arrays require code to iterate over the underlying array looking for a non-zero element, then FFO to locate the bit.

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: Memory Allocation #3858

    It is not possible to OOM in the scratchpad, any more than it is possible to OOM when writing to the registers in a register architecture. Both scratch and registers are statically allocated and named; there is no dynamic allocation that you can run out of.

  • Ivan Godard
    Keymaster
    Post count: 689

    Quad (128-bit) is optional, transparent at the source language level. Forcing it to be mandatory would impact some attractive markets negatively. All descriptor operations can execute as discrete scalar ops just as well, and the scalar ALUs are useful for other things as well, which descriptor ops are not.

  • Ivan Godard
    Keymaster
    Post count: 689

    Mill is a general purpose architecture. The business model is for us to sell to the companies that build such products, not to build such products ourselves. Some of your listed companies might indeed be customers eventually; whether any would find a compelling reason to invest in building their own processor capability is questionable. Yes, Apple did – but CPUs are a competitive point for Apple. Most others have little competitive reason to innovate in something that is not a selling point, unless and until some of their competitors already have. They will wait until Mill reaches COTS status.

  • Ivan Godard
    Keymaster
    Post count: 689

    Thank you for the suggestions. What would be even more useful would be an introduction to a person within those institutions. Anyone?

  • Ivan Godard
    Keymaster
    Post count: 689

    There is no direct ISA support for {addr, len} style descriptors. Instead, descriptor arithmetic would use the machine width to update both fields independently. The performance would be the same, but it does not require 128-bit data paths into the ALU and on the belt. There are many specialized kinds of smart/fat pointers in use, hidden behind the abstractions of the languages. It’s easy to invent one that satisfies a particular set of usages; it’s hard to make it general. For example, how does one get an {addr, len} pointer to iterate backwards? How are they garbage-collected? The ISA is not the place for these questions; the languages are.

    There is a mechanism to check the validity of a pointer per the C/C++ rules: it points within the object or one element beyond. Violations get you a NaR. However, this is a validity check, not a range check. NYF.

  • Ivan Godard
    Keymaster
    Post count: 689

    One could get the specializer to produce a single-instruction sequence without pipeline overlap. That could be QEMU-emulated, but we have a sim already. The other argument for QEMU is that it can produce translations that are close to native performance on an alien architecture, saving the cost/effort of a port/retarget. But applied to the Mill that would give close to the performance of a single-instruction sequence with no parallelism, or truly awful in other words.

    No doubt someone will someday try to extend QEMU itself to be able to handle Mill-like ISAs. But not us today on our very limited nickels.

  • Ivan Godard
    Keymaster
    Post count: 689

    I personally know next to nothing about QEMU, having never used it. I did write our present sim.

    In my ignorance I anticipate the most trouble with representing in-flight values. Because Mill explicitly separates instruction initiation from instruction retire, a sim must model instructions that are in-flight in the pipeline, and merge the effects of separate instructions that emerge from the pipes at the same time despite having been initiated at different times. Due to phasing, the init and retire can be in the same bundle cycle, with other execution in the middle. In particular, there can be control flow during the in-flight period.

    You can’t just start at an address and assume that everything before is in the belt/memory. There may be an in-flight multiply that will drop to the belt in two cycles, completely unannounced, with a whole function executed since the init. And you can’t just snapshot at every basic block either – in-flights can carry over the branches.

    I’m not saying it’s impossible. I’m saying that it will be difficult and we presently have neither the expertise nor the money to take it on. @QEMU experts – there’s a challenge here available if you’d like to join the Mill effort.

  • Ivan Godard
    Keymaster
    Post count: 689

    Not yet. We don’t expect any new filings until the reorg.
