- sakras (Participant) | September 10, 2020 at 8:32 pm | Post count: 2
Hi, I recently saw this new architecture called the Tachyum Prodigy that, as far as I can tell, shares many of the Mill's goals: being cheaper/cooler/faster than existing CPUs and running existing software without a rewrite (both by emulation and binary translation). I also saw in the news that they were planning a 2021 release of their first CPUs. I was wondering whether this architecture could be a potential threat to the business side of the Mill, since it seems to have the same goals but is coming to market much earlier?
- Ivan Godard (Keymaster) | September 11, 2020 at 3:25 am | Post count: 607
I haven’t seen any technical material on their chip, nothing but market puffery, so it’s hard to say how much is real. It’s good that they are exploring translation; it makes the notion more believable to potential customers. We can assume that their translation will be about as good as ours, so the competitive situation will boil down to the native ISA underneath. We’re not worried 🙂
- Veedrac (Participant) | September 13, 2020 at 12:27 pm | Post count: 25
The issue with Tachyum is that their sales pitch isn't that compelling any more because of the new Arm server chips. At best they might have slightly better throughput on a slightly smaller die at slightly lower power, but the Ampere Altra already comes with more cores (soon even 128 cores in the Altra Max), and a 3.3 GHz out-of-order core will deliver more consistent performance than a 4.0 GHz Tachyum, especially as the Tachyum only looks 4-wide, even if Tachyum claim to win some benchmarks here and there. 20% differences won't be enough to steal market share from established architectures.
This is not to say the Tachyum Prodigy isn’t extremely cool. They’ve progressed very quickly and I’m a fan of architectural diversity.
- stephenmw (Participant) | September 14, 2020 at 8:12 am | Post count: 6
I may be in the minority, but I am excited about Mill’s security implications more than the improvement in performance.
On the Mill, stack smashing for ROP is impossible. Integer overflows can fault for free (if the programming model allows), so something like Rust could enforce integer overflow checking outside of debug mode with no performance hit. Micro-kernels can also run about as fast as monolithic kernels. This means something like Zircon could be used for performance-critical server work. I call out Zircon specifically because it is the most likely commercially viable micro-kernel, since it is being developed for practical applications by a large company with a large OS install base.
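As an aside, the overflow-checking behaviour described above can be sketched on conventional hardware with compiler builtins. This is only an illustrative stand-in (the `add_checked` name is mine, and `__builtin_add_overflow` is a GCC/Clang extension): on the Mill the check would be free, while here it costs an explicit test per operation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Checked 32-bit addition: returns true on overflow, standing in for
   a hardware fault. On conventional ISAs this test is extra work on
   every arithmetic op, which is why languages like Rust only enable
   it in debug builds by default. */
static bool add_checked(int32_t a, int32_t b, int32_t *out) {
    return __builtin_add_overflow(a, b, out); /* GCC/Clang builtin */
}
```

The point of the comparison is cost: a compiler targeting the Mill could emit the faulting form everywhere, whereas here each checked operation carries a branch.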
I also imagine that after some optimization the MIMD nature of the Mill will allow bounds checking to be free (in time, not power) in most cases.
Tachyum seems to be concerned only with performance. To be fair, that is probably the most important thing and will likely be the main, or only, factor customers use. I look forward to them actually releasing a product and comparing it to the Altra for general-purpose (non-HPC/AI) workloads.
I am concerned that if they succeed the Mill won’t be able to bring enough to the table to be worth a switch. If Tachyum are on time, I imagine that Mill would be 5 to 10 years behind.
- Ivan Godard (Keymaster) | September 14, 2020 at 11:55 am | Post count: 607
Thank you for the kudos.
A detail: bounds checking on the Mill is free in both time and power. It is not free in space (allocations are rounded up to a power of two), and not free in aggravation for C users who routinely violate bounds.
- stephenmw (Participant) | September 14, 2020 at 7:20 pm | Post count: 6
Maybe I don’t know enough about memory allocation, but it seems to me that a language like Rust or Go would not create a new allocation for every vector or slice. They also wouldn’t want you loading unintended zero-initialized data or “rubble”. This means that bounds checking instructions would still be necessary for all accesses.
Also, what is to stop arr[LARGE_NUM] from accessing memory that is in your turf but from another allocation? Would load(base, index, …) not allow index to be outside the allocation for base? That would be cool, although I am not sure I can come up with a practical use for it. Maybe the allocator itself can use that for when it gives out memory.
In the end, you are going to need a bounds check, but a single comparison of index vs. len plus a pick should be cheap enough to squeeze into most situations without increasing cycle count.
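That compare-and-pick pattern can be sketched in plain C (the `load_checked` name is illustrative, and this is a sequential approximation; on the Mill the compare and the pick would issue in parallel on the wide machine):

```c
#include <stddef.h>
#include <stdint.h>

/* Branchless bounds-checked load: compare index against len, then
   select between the element and zero, the way a Mill pick would.
   Out-of-bounds reads yield 0 instead of touching foreign memory. */
static int32_t load_checked(const int32_t *arr, size_t len, size_t index) {
    int in_bounds = index < len;
    size_t safe = in_bounds ? index : 0;   /* clamp to a slot we may read */
    int32_t v = len ? arr[safe] : 0;       /* empty array: never dereference */
    return in_bounds ? v : 0;              /* the "pick" */
}
```

Compilers will typically lower those ternaries to conditional selects rather than branches, which is the cheap path the post is describing.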
- Ivan Godard (Keymaster) | September 14, 2020 at 10:08 pm | Post count: 607
Virtual Zero (for globals) and Implicit Zero (for stack) preclude rummaging in uninitialized rubble; you always get a zero. What you get from the heap is up to the heap allocator you use; that’s not under hardware control, although there are cache-control ops that make zeroing out an allocation particularly efficient, for those allocators that care.
You will also get a zero for intra-struct and after-array padding. The bounds check is at power-of-two granularity. If your array is not power-of-two sized then there will be padding at the end to bring the allocation up to power-of-two. You can read (and write) into that padding, but will always read a zero (or whatever you wrote).
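The power-of-two granularity described above can be made concrete with a small helper (an illustrative sketch of the rounding, not the Mill allocator itself):

```c
#include <stddef.h>

/* Round an allocation size up to the next power of two, mirroring
   the granularity of the hardware bounds check: a 5-element array
   of bytes would occupy an 8-byte region, with 3 bytes of padding
   that reads back as zero (or whatever was written there). */
static size_t round_up_pow2(size_t n) {
    size_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}
```

The gap between `n` and `round_up_pow2(n)` is exactly the padding a program can legally touch without triggering the hardware check.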
This hardware check is not as comprehensive as a check against the declared bound, whatever it is. If you want the more complete check then you can turn on an option and will get the same kind of inline software checks as any other ISA, at similar cost in code space and power. You are right that the Mill's width may save on the latency cost compared to other architectures, but wide superscalars with OOO can probably do as well for something like bounds checking, albeit at OOO power cost.
Many apps find software checking too expensive for production code, and enable it only in debug builds. For them the hardware power-of-two check does provide complete wild-address and exploit security at no cost, even though it does not completely guarantee conformity with language semantics.
- Findecanor (Participant) | September 15, 2020 at 12:34 pm | Post count: 18
I don’t see how Fuchsia/Zircon’s way of doing IPC (asynchronous message passing) would map easily onto the Mill’s portal calls, though. So the Mill wouldn’t have an advantage over any other architecture on that point … or put another way: Zircon would have slow IPC anywhere.
(But if I’m wrong, please tell. 😉 )
I can imagine that synchronous IPC where the caller always blocks (such as in L4 or QNX) would be easier to do, though, even though it is not a perfect model.
BTW, another thing with Zircon that irks me is that capabilities are not revocable (except by killing the process/group). The Mill’s temporary grants are precisely what I’ve always wanted.
- Ivan Godard (Keymaster) | September 15, 2020 at 4:22 pm | Post count: 607
Mill portals are not IPC (although they can be the entryway to IPC); they are secure service calls, and you remain in the same thread (though with different permissions) both in and out.
A trusted IPC service can define get and put portal functions and do the argument copying between threads, if that is what you mean by IPC. Unfortunately there is no standard semantics for IPC, so each system has to define its own semantics in terms of the underlying Mill primitives, and cross-system communication is likely to be questionable.
Revoke is a classic problem in all permission systems. As currently defined, Mill transient grants cannot be revoked, and persistent grants can be revoked by diddling the PLB. However, there are a host of semantic issues, especially if grants can be re-granted.
- goldbug (Participant) | December 28, 2020 at 2:34 pm | Post count: 52
Technical details are very sparse, but from their presentations, they say they are not VLIW.
They sometimes compare their design to the Itanium (a VLIW), but they claim they don’t stall as much as the Itanium did. Supposedly it has dynamic issue but is not out-of-order. From what I gather, their compiler generates instruction bundles that encode dependencies between instructions. That sounds an awful lot like an EDGE architecture such as TRIPS or Microsoft’s E2.
- Ivan Godard (Keymaster) | December 29, 2020 at 3:40 am | Post count: 607
As far as I know TRIPS was abandoned.
The problem with dataflow architectures is the interconnect when a random producer must be connected to a random consumer. That’s true for wide non-dataflow machines too: the crossbar that feeds the belt to the FUs is the clock constraint on a Mill. Yes, Tachyum can package instructions into call-like blocks, but it’s rare for the blocks to be more than a few instructions. That’s because most code references globals frequently, and those references must preserve order because the languages don’t provide reliable ordering semantics. So one can turn a whole sqrt routine into one of their blocks, but trying to make a block out of a loop body, so separate iterations can be spun out to their own cores, is much harder. The analysis is the same as you have to do for auto-vectorization, and the result is not pretty, although it can be lovely on cherry-picked micro-benchmarks.