Forum Replies Created
- Witold Baryluk | Participant | January 8, 2021 at 7:21 am | Post count: 33
You can easily use AMD GPUs on any architecture, as long as you have a PCIe controller and Linux support for that controller. AMD GPUs have full-stack open source support in the Linux kernel and in the Mesa 3D drivers (for OpenGL, OpenCL, Vulkan, and other things like video decode/encode). Intel’s new dGPUs do too. The state of Nvidia support in Mesa is reasonably good as well, and it can definitely be used, but it is not close to the performance of the proprietary drivers. However, there are no technical reasons it can’t be supported on other architectures (e.g. Nvidia GPUs work on x86, POWER and ARM64, and more are possible if there is a will; it is mostly a software problem).
Most PCIe controllers are off-the-shelf IP blocks that are integrated into the rest of the silicon. This is because PCIe is mostly complex high-speed analog / RF signalling, not just digital logic. I am pretty sure Mill will use one of the known IP blocks for this, such as those from Synopsys or Cadence. They are battle tested, certified, have host drivers in Linux, comply with the PCIe standards, etc. When the time is right it will be there. A Mill without PCIe host controllers would be unusable for anything interesting.
- Witold Baryluk | Participant | January 9, 2021 at 5:13 am | Post count: 33
About qemu on Mill. Obviously it would be trivial to run qemu or bochs on Mill, and it will probably compile out of the box with zero changes. However, qemu is not designed for emulation speed. It only does JIT (no AoT), takes a long time to start up and warm up, consumes memory for both the source and the translated code, and the generated (JITed) code is of very poor quality (something like 5 to 10 times worse than what a normal compiler generates for the original code). There are very few optimizations in qemu to make the JITed code fast, only some minor things like patching direct and indirect jumps and removing some condition code checks; there is no code motion, no control flow recovery, no advanced register re-allocator, no instruction reordering, etc. The purpose of the qemu emulator code (tcg) is to be only reasonably fast and VERY portable. The tcg virtual machine has, I think, only 3 registers, for example, which means you underutilize a lot of hardware, lose data flow information, and add a lot of extra moves to memory. Sure, that can be improved or recovered, but again that is slow. So it will run on Mill, just like it runs on 20 other architectures. But don’t expect magic in terms of speed, even on Mill.
Valgrind is extremely slow. Its purpose is debugging, not speed.
There are other binary translation projects, but most of them don’t focus on speed or cross-emulation. They are more about changing a native binary on the fly for some purpose (including runtime optimization).
Writing a proper translator (that could later be integrated into qemu) is obviously possible, and there have been many hybrid optimizing AoT/JIT translators that showed one can achieve very good results. See FX!32, Rosetta 1, Rosetta 2, box86. Microsoft also has a pretty decent x86-to-ARM dynrec.
It would be much better to reuse qemu where it makes sense (Linux virtio, chipset, USB, networking, storage, etc.), but write a specialized JIT module, or heavily optimize the tcg in qemu. Some target pairs in qemu do have extra non-generic optimisations, so it is totally doable to write, for example, amd64-to-Mill specific code.
However, at the end of the day, is it really that important?
If Mill is 10 times more efficient and faster, then for a lot of applications you don’t really need any binary translation, because you can just compile things for optimal performance. And you would want that anyway to seriously consider Mill, because otherwise you are wasting the hardware’s potential. That is why Mill is targeting generic server workloads, maybe a bit of HPC, and some generic Linux stuff. 99% of interesting workloads are either developed in-house, so they can be recompiled, or are open source, so they can also be recompiled. Will you be able to run Oracle or Microsoft databases on Mill? Probably not initially. Once the hardware is out, either important software will be ported, or the open source community will write a dynrec, or two or three. Just look at how quickly the Apple M1 got adopted by software developers: thousands of proprietary programs have already been ported to run natively on M1, all because a) the hardware is fast, so there is an incentive to do so, and b) the hardware is easily accessible to developers. Mill Computing, Inc. doesn’t have the resources of Apple to do it on their own at product launch. Damn, even IBM doesn’t have the resources to do that for their POWER and s390x systems.
- Witold Baryluk | Participant | January 9, 2021 at 4:51 am | Post count: 33
Thank you for the quick and detailed elaboration!
That is a really fascinating and interesting approach. Definitely better in the end, but harder to implement. Having it in the abstract ISA, and then deferring to software emulation in cAsm or via traps, as you said, is definitely the interesting and better long-term option. I agree with really everything you have said, but I don’t know how common these things are in real code, so I can’t tell whether it is worth making them as optimal as possible. But it looks like you are pretty close, and by having this in the ISA, you actually open up many interesting new options in language and compiler design.
The NYF part will definitely be interesting to learn about one day. I have trouble imagining how one would implement this in this scheme, but I haven’t thought about it long enough yet 😀
PS. I hope you and the team are doing fine, and there are no major roadblocks beyond time. Hope to hear some updates or talks in 2021.
- Witold Baryluk | Participant | December 28, 2018 at 11:53 am | Post count: 33
I had the same issues with post editing. Sometimes I edit a post a minute after posting, and after saving it disappears.
The ability to make some operations privileged (not just by the OS, but by any process) seems nice on the surface (e.g. in some virtualization scenarios it would be nice to disable wide vector FPU ops, not just on the specializer side, but so that they actually trap in the CPU when used; or to disable explicit instruction/data cache flushing). I understand the concern about hiding performance counters. But even with performance counters being privileged and not accessible to normal programs, there are ways to get accurate timings from user space and execute a side channel attack successfully. Example: two threads, one spinning and updating a counter in memory or L3 cache, calibrated using normal timers or a real clock (even one with just millisecond or second accuracy!), and another thread reading this counter at the start and end of whatever it wants to time. You have just recreated a very precise timer (probably accurate to within a few cycles). It is even more accurate if you give your threads affinity so that they sit close together on the chip, share L2 or L3 in a specific way, and do not migrate often. This is very easy to accomplish.
Trying to hide timers is just a workaround, not a real solution. Restricting access to very accurate hardware counters will just make more trouble in normal use.
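To make the two-thread trick concrete, here is a minimal sketch in Python. The names `CounterClock` and `measure` are made up for illustration. Python's interpreter lock makes this far coarser than a native version pinned to neighbouring cores would be, so treat it only as the shape of the idea: one thread does nothing but count, a coarse clock calibrates it, and another thread samples the counter before and after the work it wants to time.

```python
import threading
import time


class CounterClock:
    """A software 'clock': a thread that does nothing but increment a counter."""

    def __init__(self):
        self.ticks = 0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._spin, daemon=True)

    def _spin(self):
        # Only this thread writes self.ticks; readers just sample it.
        while not self._stop.is_set():
            self.ticks += 1

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def calibrate(self, seconds=0.2):
        """Estimate ticks per second against an ordinary (coarse is fine) clock."""
        t0, c0 = time.monotonic(), self.ticks
        time.sleep(seconds)
        t1, c1 = time.monotonic(), self.ticks
        return (c1 - c0) / (t1 - t0)


def measure(clock, work):
    """Return how many ticks elapsed while work() ran."""
    start = clock.ticks
    work()
    return clock.ticks - start


clock = CounterClock()
clock.start()
ticks_per_second = clock.calibrate()
elapsed_ticks = measure(clock, lambda: time.sleep(0.05))
clock.stop()
print(f"~{ticks_per_second:.0f} ticks/s, workload took {elapsed_ticks} ticks")
```

Nothing here needs a privileged counter, which is the point: the only inputs are shared memory and an ordinary clock for calibration.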
Also, notice that I am not asking for an “absolute” timer. I am specifically asking for a timer that only progresses during execution of a specific thread (turf), or for facilities to build one. This way it is to a large extent immune to what other threads are doing, even if the turf was context switched to something else and then back. It will of course still see L1/L2 cache latencies if the data was touched by another thread during its own execution. I see absolutely no way to prevent that in general. Meltdown is easy to fix by other means. Spectre can be addressed by the compiler and hardware too, and Mill does a few tricks to make that work well without any impact on performance.
- Witold Baryluk | Participant | December 27, 2018 at 1:58 pm | Post count: 33
Thanks goldbug. I see they do admit in the paper that the compiler hoisting loads, to do speculative loads before checks, can indeed make Mill vulnerable to Spectre: “In our analysis we found and fixed one such bug in the Mill compiler tool chain.” And the solution was to improve the compiler. I am not sure how it figures out where to make these speculative loads and where not, as it is really hard to predict where untrusted data is being used without annotations or barriers in the code (and without killing performance in other workloads that benefit from the compiler doing speculative loads).
The fact that a Meltdown-like load doesn’t pollute the cache, and puts a NaR as the result, is nice. This isn’t that much different from what AMD is doing on their CPUs, where they also do the protection/TLB check soon enough and do not put stuff in the cache. Intel was doing this too late in the pipeline, and it was polluting the cache. So nothing extremely special here.
For Spectre variant 2, it is interesting that Mill actually restores the entire CPU, including caches, to a correct state even on a misprediction in the branch predictor. I guess this is doable because the misprediction latency is low, so only a few instructions would be fetched and decoded, and it is easy to restore the data cache (because if it was an L1 miss, the data would not arrive from L2 in that time anyway). Similarly, if the misprediction targeted instructions that are not in the instruction cache (which would make it a bad branch predictor anyway, and is unlikely to be a miss), they will not arrive in time from L2 either, so it is easy to cancel the load and go back to the proper path.
There are examples in the paper discussing exactly the same example I was pointing to.
It appears the solution was to not do speculative loads until all the checks that would skip the load have been executed, if possible in a single instruction, i.e.
lsssb(%3, %2) %5, loadtr(%5, %0, 0, %3, w, 3) %6;
This is nice, and has very little performance penalty. In both cases the load here will produce a result on the belt, but if the condition is false, it will put a NaR on the belt. So the rest of the code can still do speculative stuff, or use this value (real or NaR) as input to other operations (including as an index for other loads, for example). I really like NaRs.
And people who write more performance-oriented stuff can probably pass a flag to do more aggressive (and Spectre-prone) scheduling.
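A rough software model of that behaviour, just to show the data flow. This is a sketch, not Mill semantics: the `NaR` class, `load_if`, and `lookup` are hypothetical names, and on the real hardware the guarded load is a predicated operation rather than a branch. The extra `0 <=` check exists only because Python accepts negative indexes.

```python
class NaR:
    """Hypothetical stand-in for a Mill 'Not a Result' value."""

    def __repr__(self):
        return "NaR"

    # Any arithmetic that consumes a NaR yields a NaR.
    def _propagate(self, _other):
        return self

    __add__ = __radd__ = __mul__ = __rmul__ = _propagate


NAR = NaR()


def load_if(pred, memory, index):
    """Model of a guarded load: memory is touched only if pred is true."""
    if not pred:
        return NAR  # no memory access happens at all
    return memory[index]


def lookup(table, bound, i):
    in_bounds = 0 <= i < bound            # the check (cf. lsssb above)
    value = load_if(in_bounds, table, i)  # the guarded load (cf. loadtr above)
    return value * 2 + 1                  # dependent work proceeds either way


table = [10, 20, 30, 40]
print(lookup(table, len(table), 2))     # 61
print(lookup(table, len(table), 1000))  # NaR
```

The out-of-bounds case never reads `table`, yet the dependent arithmetic still runs and simply carries the NaR along.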
I guess this is not bad.
- Witold Baryluk | Participant | December 27, 2018 at 1:42 pm | Post count: 33
Yes. That is the purpose. It will raise the exception on most divisions, but not all. As for additions and multiplications, it is hard to say. I would say it depends on the application. E.g. 2*3 or 0.5+1 will not raise it, but adding values that are vastly different in magnitude, or that have a lot of nonzero significant digits, will.
It is a useful tool in some applications. I have never used it personally though, even when implementing interval arithmetic.
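Which operations are inexact can be checked without any hardware flag by comparing the floating point result against exact rational arithmetic. A small sketch (the helper name `is_exact` is mine; Python floats are IEEE doubles on common platforms):

```python
from fractions import Fraction
import operator


def is_exact(op, a, b):
    """True if the float result of op(a, b) equals the exact rational result."""
    exact = op(Fraction(a), Fraction(b))  # Fraction(float) converts exactly
    return Fraction(op(a, b)) == exact


print(is_exact(operator.mul, 2.0, 3.0))      # True
print(is_exact(operator.add, 0.5, 1.0))      # True
print(is_exact(operator.truediv, 1.0, 4.0))  # True
print(is_exact(operator.truediv, 1.0, 3.0))  # False
print(is_exact(operator.add, 1e20, 1.0))     # False
```

Every `False` here is a case where a trapping inexact exception would fire.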
- Witold Baryluk | Participant | December 27, 2018 at 1:38 pm | Post count: 33
I am not talking about implementing arbitrary-precision arithmetic. This is easy. I am talking about optimization for small values. This can’t be done on Mill right now AFAIK (even when using excepting or widening operations).
- Witold Baryluk | Participant | December 23, 2018 at 7:41 am | Post count: 33
That is completely off topic.
My comment was about the claim that Mill is immune to Spectre because it doesn’t speculate the way OoO machines do. During the talk it was mentioned that fixing OoO requires labor-intensive and error-prone annotation of the code, with the claim that Mill doesn’t require that.
How is that true, in the context of my code example and the compiler hoisting loads before branches, which WILL happen in 95% of open code and loops?
- Witold Baryluk | Participant | December 23, 2018 at 7:37 am | Post count: 33
Oh, I was on that page, but I did not see the information about the cycle counter before. My bad.
Indeed, there is a cycleCounter register that can be read using the rd operation. It is spilled on task switch (turf?), so that would mean it is per thread. Basically, if I read this register it automatically deals with task switches and core migrations, and it should count only cycles spent in the specific task/thread/turf, which is exactly what I would otherwise have to do myself (and which x86/Intel/AMD performance counters DO NOT provide).
So it is possible that this is exactly what is needed to cover real low-overhead per-thread CPU cycle / CPU time accounting from user space.
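For comparison, the closest thing I know of from user space today is the OS-maintained per-thread CPU clock. A minimal sketch using Python's `time.thread_time_ns()` (backed by `CLOCK_THREAD_CPUTIME_ID` on Linux); the helper names are mine. It reports time rather than cycles and goes through the OS instead of a register read, so it only illustrates the per-thread accounting semantics, not the low overhead.

```python
import time


def thread_cpu_ns(work):
    """CPU time (ns) charged to the calling thread while work() runs."""
    start = time.thread_time_ns()
    work()
    return time.thread_time_ns() - start


def spin():
    # Pure CPU work: every iteration is charged to this thread.
    total = 0
    for i in range(2_000_000):
        total += i * i
    return total


busy_ns = thread_cpu_ns(spin)
idle_ns = thread_cpu_ns(lambda: time.sleep(0.2))  # descheduled, barely charged
print(f"busy: {busy_ns} ns of CPU, sleeping: {idle_ns} ns of CPU")
```

The sleeping case takes 0.2 s of wall time but almost no thread CPU time, which is the "only progresses while this thread executes" property.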
The documentation doesn’t specify whether these are issue cycles or actual cycles, i.e. does it increment when the core and pipelines are stalled (e.g. during cache misses, or because of poor instruction scheduling / parallelism)?
An operations counter (per thread, from the decoders for both sides of the instruction stream) would also be extremely helpful for computing IPC.
Per-thread stats for cache loads/hits/misses would also be extremely helpful.
- Witold Baryluk | Participant | December 21, 2018 at 2:09 pm | Post count: 33
Dave, that sounds like a solution that could be implemented on any architecture. It adds complexity in terms of cache coherency, and probably adds latency to check both caches.
I am all for QoS or minimum-maximum ranges of allocated cache space (in the LLC) per turf, so that other memory-intensive, cache-thrashing processes do not completely kill applications that want low latency and a small amount of cache (mostly for code, and a bit of data) that is not constantly evicted on context switches. But I do not think L1 has enough space to do the same for data.
- Witold Baryluk | Participant | December 21, 2018 at 2:06 pm | Post count: 33
I am aware of excepting integer arithmetic in Mill. Unfortunately it is NOT enough to implement arbitrary-precision arithmetic, especially with the properties I mentioned (efficient handling and encoding of small signed integers). Please prove me wrong if you think otherwise.
- Witold Baryluk | Participant | December 19, 2018 at 5:07 am | Post count: 33
Kahan is wrong 🙂 I hope it was just a joke.
I think the intention is that, if you use /QIfist, you make sure to manually set the rounding mode in the relevant code or in main (to chop, so that FIST conforms to the C semantics of float-to-int conversion), and you are aware that this will also change the rounding of normal floating point operations (but that is of much smaller importance in many cases). This way the code emitted by the compiler doesn’t change rounding modes all the time, and you are supposed to make sure that the modes are correct instead.
Anyway, fortunately modern machines have better facilities for dealing with the problem (FISTTP and CVTTSS2SI).
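The difference between the two conversions is easy to see in a few lines. This sketch uses Python stand-ins rather than the x87 instructions themselves: `math.trunc` rounds toward zero, like a C float-to-int cast, and the built-in `round` rounds to nearest with ties to even, like the default FPU rounding mode that a plain FIST would use.

```python
import math

values = [2.5, 3.5, 2.7, -2.7, -0.5]

for x in values:
    chopped = math.trunc(x)  # C semantics for (int)x: round toward zero
    nearest = round(x)       # round-half-to-even, like the default FPU mode
    print(f"{x:5}: chop={chopped:2d} nearest-even={nearest:2d}")
```

For 2.7 the two disagree (2 versus 3), which is exactly the case where code compiled to rely on the current rounding mode silently changes behaviour unless the mode was set to chop.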
- Witold Baryluk | Participant | December 19, 2018 at 4:32 am | Post count: 33
Thanks again Veedrac. I will read up on branch belt reordering (I am already reading about rescue). I am guessing it is very similar to what a call with multiple arguments and a “new belt” does, but without actually creating a new frame, instead dropping multiple new things on the belt during the branch (when taken).
Thanks for the eql hint. I am just freestyling most of this code, as I didn’t really go deep into the wiki, because I wanted the more conceptual ideas clarified first. 🙂
As for the phasing and ganging, the exact semantics still elude me, but I will check the wiki and talks again!
As for rd, immediate constants can be done with rd. It also looks like a good choice for quickly copying belt values (more compact than an or operation, and it works with vector / floating point values on the belt).
I am still not sure why rd has the ability to read from the scratchpad, because fill apparently does the same. One is called the “extended” scratchpad though, so maybe rd on the scratchpad is just for debugging and thread state saving, without touching spilled stuff.