Forum Replies Created

  • Findecanor

    I’m not sure what you mean. Do you mean bitfield insert/extract instructions, or SIMD instructions for operating on multiple elements of a short vector at a time?

    From previous info, I think The Mill is supposed to have both.

  • Findecanor

    I read a rumour that someone from Mill Computing had met the designers of STRAIGHT and Clockhands in person …

  • Findecanor

    Are there still saveStack and restoreStack instructions as in the old Wiki, so that you could do
    saveStack(); allocStack(…); … ; restoreStack() ?

    Do they have the same semantics as LLVM’s alloca, llvm.stacksave() and llvm.stackrestore() ?

    And will the memory be reused and reset to zero if I do a new allocStack() after a restoreStack()?
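
    For reference, here is the C pattern I have in mind. Clang lowers a VLA in a nested scope to exactly the llvm.stacksave() / alloca / llvm.stackrestore() sequence, and I’m assuming the Mill ops would slot in the same way (the mapping in the comments is my guess, not confirmed):

        void demo(int n) {
            {                /* saveStack()    ~ llvm.stacksave()    */
                char a[n];   /* allocStack(n)  ~ dynamic alloca      */
                a[0] = 1;
            }                /* restoreStack() ~ llvm.stackrestore() */
            {                /* a second saveStack()                 */
                char b[n];   /* a fresh allocStack(n): does it reuse */
                b[0] = 2;    /* (and re-zero?) the same memory?      */
            }                /* restoreStack()                       */
        }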

  • Findecanor

    3. Could the dynamic branch prediction analysis be communicated back to the vendor using some sort of telemetry system or aggregated in some cloud store so other users can benefit?

    I think such telemetry would meet a lot of resistance, with security cited as the reason.

  • Findecanor

    I did not find bitfield instructions mentioned in the Wiki, so I assumed that you had omitted them.

    I’m sorry, my question was not whether both operations could finish in one cycle, but whether the second operation could finish in the next cycle. In other words: if each op would normally have a latency of four cycles, so that executing them in sequence would take eight cycles in total, could that be reduced to five cycles by bundling them together and having them execute in different phases?

  • Findecanor

    On ARM64 and PPC, all the different bitfield operations are implemented by only a couple of instructions that perform a rotation and then a bitwise select using a constructed mask. The rotation amount and the bit range of the mask determine whether it is an extraction or an injection.

    Without dedicated bitfield instructions, extraction (and injection into a zeroed word) can be done with two shifts: first up to the top and then back down.
    What’s missing is a quick and easy way to bitwise-select a shifted value (or zero) into an existing bitfield.
    The Wiki mentions a “Merge” instruction that performs a bitwise select, but you would first have to construct the mask.
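
    To illustrate with generic C (32-bit values; this is not Mill code, just the idioms described above):

        #include <stdint.h>

        /* Extract the unsigned bitfield [hi:lo] of x: shift it up to the
           top, then back down.  Cast to int32_t for a signed variant. */
        static inline uint32_t bf_extract_u(uint32_t x, unsigned lo, unsigned hi) {
            return (x << (31 - hi)) >> (31 - hi + lo);
        }

        /* Inject the low (hi-lo+1) bits of v into bits [hi:lo] of x by
           constructing a mask and doing a bitwise select ("merge"). */
        static inline uint32_t bf_insert(uint32_t x, uint32_t v, unsigned lo, unsigned hi) {
            uint32_t mask = (~0u >> (31 - (hi - lo))) << lo;  /* ones in [hi:lo] */
            return (x & ~mask) | ((v << lo) & mask);
        }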

    • Findecanor

      I’m not entirely sure I understood how The Mill’s “phasing” works, so please excuse me if the following does not make sense:
      Could, in theory, the processor be organised so that a left shift is followed by a right shift of its result in the next phase, with only a single cycle of delay?

  • Findecanor

    I stumbled upon this problem recently, and wondered how it is supposed to be done on the Mill (and other single-address-space systems).

    Handling of local references from within dynamically loaded libraries has hardware support, NYF.

    Has it been filed yet, so we could read about it?

  • Findecanor

    I’ve been following CHERI too for a while… and I would say: not similar at all.

    I think the article you linked to is all over the place. Let me summarise:
    CHERI has tagged memory and “capabilities” on top of ARM/MIPS/RISC-V’s traditional paging. Each capability-sized memory word is tagged to tell whether it holds a “capability” (address, bounds, protection bits) or regular data. Capabilities are used as pointers, but each memory access through one is bounds-checked and checked for privilege (read/write/execute). The tags are stored in a separate memory area instead of in an extra bit of 9-bit bytes (which is what historic capability hardware used). This makes it possible to use off-the-shelf DRAM, and traditional OSes with only small modifications, with one address space per process as usual.

    What this does in practice is add bounds-checking to C/C++ programs, requiring only a recompile instead of a rewrite in a properly type-safe, memory-safe language.
    Temporal safety (protection against dangling pointers) needs a system or kernel service similar to a garbage collector, though, and the overhead is not negligible.

    In particular they mentioned that their capabilities system makes it safe to run everything in one address space and this makes it possible to get speed ups from avoiding context switches.

    I think they mean that CHERI reduces the need to break a program up into multiple isolated processes (a form of “compartmentalisation”) to achieve better security, which is how e.g. Chrome (and the web browsers based on it) is designed.
    So far, I’ve not seen any paper about using CHERI instead of the protection offered by traditional paging, but I’m sure that would be possible, and it is something that historic systems did.

    What The Mill does (as I understand it) is put all programs in the same address range, but not in the same protection domain. Protection is decoupled from paging. I’d think that a Unix-like system on The Mill would work largely the same as on other hardware, just that the memory layout inside each process would not overlap that of any other process (except for shared objects).
    The Mill does have fine-grained address ranges and supports revocation, so it would be feasible to temporarily pass access to individual objects in a Portal call instead of sharing full pages as on other hardware. Revocation in the capability model can get complex and expensive, and I think that in CHERI it would also require a service scheme like the one for dangling pointers.

  • Findecanor

    You could do that with one bitfield-extract instruction per bitfield you’d want to extract: in this case six.

    Those would all be independent, so the maximum parallelism would be the number of execution units that implement that instruction.
    If a bitfield is at the top or the bottom of the word, then a shiftr{s|u} or an andl, respectively, could be used instead, possibly on another execution unit, freeing up a bitfield-extract unit.

    If you have such a layout with one field at bits 31:26 and one at bit 0, and there is a CPU with two execution units that do bitfield extract and one unit that does shiftru/andl, then full extraction should be possible with only two cycles of latency.

    (Speaking very generically of course. Someone from Mill Computing could fill in the details)
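
    For concreteness, a sketch in plain C with a made-up six-field layout (the field positions are invented; each assignment corresponds to one independent extract op):

        #include <stdint.h>

        /* Hypothetical 32-bit instruction-word layout, invented for illustration:
           [31:26] op, [25:21] rd, [20:16] rs1, [15:11] rs2, [10:1] imm10, [0] flag */
        typedef struct { uint32_t op, rd, rs1, rs2, imm10, flag; } fields;

        static fields decode(uint32_t w) {
            fields f;
            f.op    = w >> 26;           /* top field: a single logical shift right  */
            f.rd    = (w >> 21) & 0x1f;  /* middle fields: one bitfield extract each */
            f.rs1   = (w >> 16) & 0x1f;
            f.rs2   = (w >> 11) & 0x1f;
            f.imm10 = (w >>  1) & 0x3ff;
            f.flag  = w & 1;             /* bottom field: a single AND (andl)        */
            return f;
        }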

  • Findecanor
    in reply to: Any plans for 2024? #3991

    I wish there were a simple way to contribute.

    I’m looking forward to computers based on The Mill not just because of the possible performance improvements, but more because of its features, most of which are security-focused. That is an area in which other architectures did not move much for many years, until quite recently.
    I think this is also one area in which the Mill has an edge. Perhaps The Mill’s primary competitor won’t be x86 at all, but something like CHERI.

  • Findecanor

    What other rotating register file architectures are you aware of?

    I think the most famous example would be Itanium, which can be configured to rotate a section of its register file in inner loops. It allows a loop to function in some ways like an unrolled loop without actually having to be unrolled.

    The AMD 29K is also well-known, although rotation is used for a different purpose: as a way to implement register windows in the calling convention.
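
    Here is a toy C model of the rotation (the 8-register region size and the two-stage pipeline are invented, but the renaming follows Itanium’s RRB, which is decremented on each loop branch): a value written as “r0” in one iteration is read back as “r1” in the next, without any unrolling.

        #include <stdio.h>

        #define NROT 8   /* size of the rotating region (invented for this toy) */

        int main(void) {
            int phys[NROT] = {0};
            int rrb = 0;  /* rotating register base, like Itanium's RRB */

            for (int i = 0; i < 5; i++) {
                phys[(rrb + 0) % NROT] = i * i;   /* stage 1: write this iteration's "r0" */
                if (i > 0)                        /* stage 2: last iteration's "r0"...    */
                    printf("r1 = %d\n", phys[(rrb + 1) % NROT]);  /* ...is now "r1"     */
                rrb = (rrb + NROT - 1) % NROT;    /* rotate the mapping */
            }
            return 0;
        }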

    As far as I know, there are no academic papers concerning the Mill, … As such, there is nothing to really cite

    Something doesn’t necessarily have to be an academic paper itself to be cited in an academic paper. The important property is that the citation gives you enough information that you could retrieve your own copy of it, and count on it being identical to what is cited in the paper.

    If you want something formal to cite in this case, you could very well choose the patent.

  • Findecanor

    I agree that this should be at the OS level, and coarse-grained.
    Direct access to fine-grained cycle counters is well known to be used for timing-based attacks (not just Spectre and Meltdown, but also for reducing the time needed to guess passwords, encryption keys, etc.). A whole field of “constant-time algorithms” has arisen to combat it, which is a bit unnecessary IMHO.
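
    The classic example from that field is a comparison whose running time depends only on the length, never on where the first mismatch occurs (generic C):

        #include <stddef.h>
        #include <stdint.h>

        /* Returns 1 if a and b are equal, 0 otherwise, in time that depends
           only on n -- not on the position of the first differing byte,
           unlike memcmp(). */
        static int ct_equal(const uint8_t *a, const uint8_t *b, size_t n) {
            uint8_t diff = 0;
            for (size_t i = 0; i < n; i++)
                diff |= a[i] ^ b[i];
            return diff == 0;
        }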

  • Findecanor

    IMHO, that sounds much better than the “Sutter method” of checking a flag after each function call, which you had mentioned before.

    I hope that you have also considered languages other than C++: for instance, languages which push exception policies down the call chain, and languages with “catch” clauses that inspect the exception object more closely before deciding to handle it.
    To handle such cases, some runtimes perform their stack unwinding in two passes: one to find the target handler, and one to perform the actual unwinding.
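
    A simplified sketch of that two-pass scheme, with invented frame and handler types (this is not any real runtime’s API):

        typedef struct frame frame;
        struct frame {
            frame *caller;
            int  (*can_handle)(frame *, void *exc);  /* "catch"-clause inspection     */
            void (*run_cleanups)(frame *);           /* destructors / finally blocks  */
        };

        static void raise_exception(frame *top, void *exc) {
            /* Pass 1: search for a handler without unwinding anything. */
            frame *handler = NULL;
            for (frame *f = top; f && !handler; f = f->caller)
                if (f->can_handle && f->can_handle(f, exc))
                    handler = f;
            if (!handler)
                return;  /* a C++ runtime would call std::terminate() here */

            /* Pass 2: unwind for real, running cleanups up to the handler. */
            for (frame *f = top; f != handler; f = f->caller)
                if (f->run_cleanups)
                    f->run_cleanups(f);
            /* control would now transfer to the handler's landing pad */
        }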

  • Findecanor

    There is nothing inherently slow about capability-based access control. I think you might be confusing it with slow message-passing in classic microkernels, where capabilities are sometimes used to refer to the message ports. But that is just one way in which capabilities can be used, and it is not the capabilities that make it slow.
    Capabilities are first and foremost a way to model access control as primitives.

    WASI got its model from CloudABI’s libc, which is a libc restricted to use only Capsicum capabilities. Capsicum is part of FreeBSD UNIX (implemented in its monolithic kernel) and has been used by several of its standard utilities for years, with negligible performance impact compared to before.
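
    To give a feel for what that looks like in use, a minimal Capsicum example (FreeBSD; error handling omitted, and the file names are placeholders):

        #include <sys/capsicum.h>
        #include <fcntl.h>
        #include <unistd.h>

        int main(void) {
            int fd = open("data.txt", O_RDONLY);

            cap_rights_t rights;
            cap_rights_init(&rights, CAP_READ, CAP_SEEK);
            cap_rights_limit(fd, &rights);  /* fd now carries read/seek rights only */

            cap_enter();  /* capability mode: no new access to global namespaces */

            char buf[128];
            read(fd, buf, sizeof buf);    /* allowed: fd has CAP_READ */
            open("other.txt", O_RDONLY);  /* fails with ECAPMODE     */
            return 0;
        }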
