Forum Replies Created
- Findecanor (Participant) | September 3, 2023 at 2:35 pm | Post count: 30
Do they have the same semantics as LLVM’s alloca, llvm.stacksave() and llvm.stackrestore() ?
And will the memory be reused and reset to zero if I do a new allocStack() after a restoreStack()?
- Findecanor (Participant) | June 17, 2023 at 12:40 pm | Post count: 30
“3. Could the dynamic branch prediction analysis be communicated back to the vendor using some sort of telemetry system or aggregated in some cloud store so other users can benefit?”
I think you would meet a lot of resistance against such telemetry, citing reasons of security.
- Findecanor (Participant) | May 25, 2023 at 2:37 pm | Post count: 30
I did not find bitfield instructions mentioned in the Wiki so I assumed that you had omitted them.
I’m sorry, my question was not whether both operations could finish in one cycle, but whether the second operation could finish in the next cycle. In other words: if each op normally has a latency of four cycles, so that executed in sequence their total latency would be eight, could that be reduced to five cycles by bundling them together and having them execute in different phases?
- Findecanor (Participant) | May 11, 2023 at 11:53 am | Post count: 30
On ARM64 and PPC, all the different bitfield operations are implemented by only a couple of instructions that perform a rotation and then a bitwise select using a constructed mask. The rotation amount and the bit range of the mask determine whether it is an extraction or an injection.
Without dedicated bitfield instructions, extraction (and injection into zero) can be done with two shifts: first up to the top, then back down.
What’s missing is a quick and easy way to bitwise-select a shifted value (or zero) into an existing bitfield.
The Wiki mentions a “Merge” instruction that performs bitwise select but you would first have to construct the mask.
- Findecanor (Participant) | May 25, 2023 at 1:26 am | Post count: 30
I’m not entirely sure I understood how The Mill’s “phasing” works, so please excuse me if the following does not make sense:
Could, in theory, the processor be organised to have a left shift be followed by a right shift of the result in the next phase with only a single cycle delay?
- Findecanor (Participant) | February 17, 2023 at 3:00 am | Post count: 30
I stumbled upon this problem recently, and wondered how it was supposed to be done in the Mill (and other single-address space systems).
“Handling of local references from within dynamically loaded libraries has hardware support, NYF.”
Has it been filed yet, so we could read about it?
- Findecanor (Participant) | January 21, 2022 at 10:11 pm | Post count: 30
I’ve been following CHERI too for a while… and I would say: not similar at all.
I think the article you linked to is all over the place. Let me summarise:
CHERI has tagged memory and “capabilities” on top of ARM/MIPS/RISC-V’s traditional paging. Each capability-sized memory location is tagged to tell whether it contains a “capability” (address, upper bound, protection bits) or regular data. Capabilities are used as pointers, but each memory access through one is bounds-checked and checked for privilege (read/write/execute). The tags are stored in separate memory areas instead of in an extra bit in 9-bit bytes (which is what historic capability hardware did); this makes it possible to use off-the-shelf DRAM, and traditional OSes with only small modifications, with one address space per process as usual.
What this does in practice is add bounds-checking to C/C++ programs, requiring only a recompile instead of a rewrite in a properly type-safe, memory-safe language.
Temporal safety (protection against dangling pointers) needs a system or kernel service similar to a garbage collector though, and the overhead is not negligible.
In particular, they mentioned that their capability system makes it safe to run everything in one address space, which makes it possible to get speed-ups by avoiding context switches.
I think they mean that CHERI reduces the need to break up a program into multiple isolated processes (a type of “compartmentalisation”) to achieve better security, which is how e.g. Chrome (and web browsers based on it) are designed.
So far, I’ve not seen any paper about using CHERI instead of the protection offered by traditional paging, but I’m sure that would be possible, and it is something that historic systems did.
What (I got the impression that) The Mill does is put all programs in the same address range, but not in the same protection domain. Protection is decoupled from paging. I’d think that a Unix-like system on The Mill would work largely the same as on other hardware, except that the memory layout inside each process would not overlap that of any other process (apart from shared objects).
The Mill does have fine-grained address ranges, and supports revocation, so it would be feasible to temporarily pass access to individual objects in a Portal call instead of sharing full pages as on other hardware. Revocation in the capability model can get complex and expensive, and I think that in CHERI it would also require a service scheme such as the one for dangling pointers.
- Findecanor (Participant) | December 5, 2020 at 8:03 am | Post count: 30
I believe that the most important property that a virtual ISA such as WebAssembly (or JVM, CLR, …) would have for cloud application developers is not to remove the need to recompile but to remove the need to test the code on each type of hardware that it is supposed to be deployed on. This determinism is something that has hitherto not been available for C/C++ developers.
If WebAssembly/WASI develops in the right direction, I think that it would not pose a threat but rather help to reduce the cost of adopting new types of hardware, which would allow them to compete better on aspects such as performance per watt, where The Mill would excel.
Even though WebAssembly has limited performance (as it has been designed for fast code-generation rather than fast code-execution), I think that could be a lesser concern as a lot of an application’s run-time is often spent in imported modules anyway — which could still be in native code. But WebAssembly is also evolving.
I don’t see the advantages of using WASM on the server/desktop for anything but C/C++ code, though; but the Rust community seems to also have adopted it for some reason… Maybe just hype?
- Findecanor (Participant) | October 10, 2020 at 1:23 pm | Post count: 30
Earlier posts about some of the things you mention:
“We don’t do SMT.”.
I think it has also been said, as an answer in a Q&A session after a talk (?), that all multi-core Mills will be single-chip designs sharing the same cache.
Synchronisation primitives (such as those used for buffers shared between producer and consumer threads) are notoriously difficult to get right. As on any platform, those would be best left to the people who write the standard libraries, I think.
A little has been said about synchronisation on The Mill though:
multi-cpu memory model
How to schedule communicating threads and when to move threads between cores is a question for operating system designers, I think. I would be surprised if that is significantly different on the Mill than on any other architecture.
- Findecanor (Participant) | June 20, 2023 at 11:48 am | Post count: 30
“What other rotating register file architectures are you aware of?”
I think the most famous example would be Itanium, which can be configured to rotate a section of its register file in inner loops. It allows a loop to function in some ways like an unrolled loop without actually having to be unrolled.
The AMD 29K is also well-known, although rotation is used for a different purpose: as a way to implement register windows in the calling convention.
“As far as I know, there are no academic papers concerning the Mill, … As such, there is nothing to really cite”
Something doesn’t necessarily have to be an academic paper itself to be cited in an academic paper. The important property is that the citation gives you enough information that you could retrieve your own copy of it, and count on it being identical to what is cited in the paper.
If you’d want something formal to cite in this case, you could very well choose the patent.
- Findecanor (Participant) | January 23, 2023 at 1:26 pm | Post count: 30
I agree that this should be at an OS level, and coarse-grained.
Direct access to fine-grained cycle counters is well known to be used for timing-based attacks (not just Spectre and Meltdown, but also for reducing the time needed to guess passwords, encryption keys, etc.). A whole field of “constant-time algorithms” has arisen to combat this, which is a bit unnecessary IMHO.
- Findecanor (Participant) | January 23, 2023 at 1:23 pm | Post count: 30
IMHO, that sounds much better than the “Sutter method” of checking a flag after each function call, which you had mentioned before.
I hope that you have also considered languages other than C++: for instance, languages that push exception policies down the call chain, and languages with “catch” clauses that inspect the exception object more closely before deciding to handle the exception.
To handle such cases, some runtimes perform their stack unwinding in two passes: one to find the target handler, and one to perform the actual unwinding.
- Findecanor (Participant) | December 6, 2020 at 12:19 pm | Post count: 30
There is nothing inherently slow about capability-based access control. I think you might be confusing it with the slow message passing in classic microkernels, for which capabilities are sometimes used to refer to the message ports. But that is just one way in which capabilities can be used, and it is not the use of capabilities that makes those systems slow.
Capabilities are first and foremost a way to model access control as primitives.
WASI got its model from CloudABI’s libc, which is a libc restricted to use only Capsicum capabilities. Capsicum is part of FreeBSD UNIX (implemented in its monolithic kernel) and has been used by several of its standard utilities for years, with negligible performance impact compared to before.
- Findecanor (Participant) | September 15, 2020 at 12:34 pm | Post count: 30
I don’t see that Fuchsia/Zircon’s way of doing IPC (asynchronous message passing) would map that easily onto the Mill’s Portal calls, though. So the Mill wouldn’t have an advantage over any other architecture on that point… or, put another way: Zircon would have slow IPC anywhere.
(But if I’m wrong, please tell. 😉 )
I can imagine that synchronous IPC, where the caller always blocks (such as in L4 or QNX), would be easier to map, even though it is not a perfect model.
BTW, another thing about Zircon that irks me is that its capabilities are not revocable (except by killing the process/group). The Mill’s temporary grants are precisely what I’ve always wanted.
- Findecanor (Participant) | September 10, 2020 at 6:25 am | Post count: 30