Forum Replies Created
- in reply to: The Compiler #3369
I think the talk could be a bit cleaner.
Ivan used the word “schedule” in a few places to talk about very different things.
For example, the “schedule” at the bottom of slides 16 to 21, in the “scheduling pass”, is not a schedule, and the pass is not a scheduling pass. I would call it a “prioritization pass”, and its result an “operation priority list”. That list is an input to the next pass, the creation of a tableau, which is the actual scheduling pass, and its result is a schedule (i.e. the list of instructions and ops in their intended order). The intent of the prioritization pass is to schedule the ops on the longest dependency chain first, so they get FUs preferentially. This works because the other ops, the ones not on the longest chain, can be moved around (considered later during the scheduling pass, and placed earlier or later in the actual tableau) without changing the total execution time; otherwise they would be on the longest chain by definition.
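As a toy illustration of what I mean by a prioritization pass, here is a small C sketch (the data structures and the tiny DAG are made up for illustration, this is not the actual specializer) that computes, for every op, the length of the longest dependency chain starting at it, and emits the ops in decreasing order of that length as the “operation priority list”:

#include <stdio.h>
#include <stdlib.h>

#define NOPS 6

/* succ[i][j] != 0 means op j consumes the result of op i (a hand-made example DAG) */
static int succ[NOPS][NOPS] = {
    [0] = {[1] = 1, [2] = 1},
    [1] = {[3] = 1},
    [2] = {[5] = 1},
    [3] = {[5] = 1},
    [4] = {[5] = 1},
};

static int prio[NOPS];  /* longest chain length from this op down to any leaf */

static int chain_len(int op) {
    if (prio[op]) return prio[op];          /* memoized */
    int best = 0;
    for (int s = 0; s < NOPS; s++)
        if (succ[op][s] && chain_len(s) > best) best = chain_len(s);
    return prio[op] = best + 1;
}

static int by_prio(const void *a, const void *b) {
    return prio[*(const int *)b] - prio[*(const int *)a];
}

int main(void) {
    int order[NOPS];
    for (int i = 0; i < NOPS; i++) { chain_len(i); order[i] = i; }
    qsort(order, NOPS, sizeof(int), by_prio);   /* the "operation priority list" */
    for (int i = 0; i < NOPS; i++)
        printf("op %d (chain length %d)\n", order[i], prio[order[i]]);
    return 0;
}

The actual tableau creation would then consume this list and place ops into cycles and FU slots; the point here is only that the priority pass itself produces an ordering, not a schedule.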
As for the number of in-flight loads, Ivan sort of answered it, but not very cleanly. The number of retire stations can be smaller or bigger than maxLatency or the size of the belt. I.e. you can have 2 load units, 32 belt positions, maxLatency 32, and yet have 10 retire stations, or 2, or 50, or 64. Having fewer than 1 per load unit would be pointless, and probably actually hard to implement, so 2 would be the minimum. As for the maximum, in theory it can be even higher than 64 (2*32), in the range of 80-100, but it would be hard to actually use due to the way the delays are expressed. (From my benchmarks and tests on Intel hardware, a Sandy Bridge era core can keep about 70 loads in flight, even issued by a single core, but some of the load queues are shared by all cores, in the uncore and memory controllers, so there are limits, and you can’t do more loads in parallel just by using more cores. There are also limits on how many load requests you can actually send to the DRAM/DDR chips, but the extra parallelism still helps with accessing the caches.)
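For reference, the kind of benchmark behind that “about 70 loads in flight” estimate looks roughly like the sketch below (sizes and iteration counts are arbitrary assumptions): several independent pointer-chasing chains are walked in lockstep, and the time per step stays roughly flat as long as the hardware can overlap the misses, then starts growing once the in-flight limit is reached.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define CHAIN_LEN (1u << 22)   /* large enough to miss in all caches (~16 MB per chain) */
#define MAX_CHAINS 16

int main(void) {
    uint32_t *chain[MAX_CHAINS];
    for (int c = 0; c < MAX_CHAINS; c++) {
        /* build a single random cycle (Sattolo's algorithm), so every step in a
           chain is a dependent, cache-missing load */
        chain[c] = malloc(CHAIN_LEN * sizeof(uint32_t));
        for (uint32_t i = 0; i < CHAIN_LEN; i++) chain[c][i] = i;
        for (uint32_t i = CHAIN_LEN - 1; i > 0; i--) {
            uint32_t j = rand() % i;
            uint32_t t = chain[c][i]; chain[c][i] = chain[c][j]; chain[c][j] = t;
        }
    }
    for (int nc = 1; nc <= MAX_CHAINS; nc++) {
        uint32_t p[MAX_CHAINS] = {0};
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long step = 0; step < 1000000; step++)
            for (int c = 0; c < nc; c++)          /* nc independent loads per step */
                p[c] = chain[c][p[c]];
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
        uint32_t sink = 0;
        for (int c = 0; c < nc; c++) sink ^= p[c]; /* keep the loads live */
        printf("%2d chains: %.1f ns/step (sink %u)\n", nc, ns / 1e6, sink);
    }
    return 0;
}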
Scratchpad. The “static” part of the size of the scratchpad is that each function (frame) is analyzed by the compiler, and on entry to the function, a known constant value (for this function) is supplied to the CPU core. That is, on entry to the function foo, the CPU is informed that foo needs a scratchpad of 17 bytes, i.e. it wants to be able to use up to 17 bytes of memory for its spill and fill operations (many functions will use 0 bytes of scratchpad). The byte-addressable part means that “spill b1, 3”, for example, will put the value from b1 (which can be more than one byte) at address 3 of this function’s scratchpad. The actual position in the scratchpad hardware buffer will depend on the context and the offset 3. I.e. the frame base can be 170 (determined on entry to the function by the hardware), with an upper bound of 187 (170 + 17), and offset 3 will put the data at byte 173 of the SRAM. This obviously ignores the fact that there are other facilities in the Mill to also spill to DRAM, and to let the OS save and restore the scratchpad for context switching.

The main idea behind the static size (like 17 here, determined by the compiler and made as small as possible) is that it allows cheap execution of function calls and interrupts from within other functions: a nested call of a new function will get a new scratchpad starting at position 187. If we make deeply recursive calls, we might reach the end of the SRAM buffer (let’s say 8 kB); what happens then is that it acts like a circular buffer: the values from the start of the buffer are asynchronously copied to DRAM by a separate piece of hardware (even before we actually reach the end of the buffer), the scratchpad addresses wrap around the SRAM addresses, and the core starts using the start of the SRAM for the new scratchpads. On returns from functions the opposite happens, and the hardware loads the data back asynchronously as we move toward the beginning of the SRAM. One can think of it as a two-level circular buffer, with in and out queues, and some logic to make these operations have essentially zero impact on the latency of spill and fill.
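A minimal sketch of that frame-base-plus-static-offset addressing; all names, sizes and the simple modulo wraparound are my own illustrative assumptions, not the real Mill hardware interface:

#include <stdint.h>

#define SCRATCH_SRAM_SIZE 8192u   /* assumed on-chip SRAM size, e.g. 8 kB */

typedef struct {
    uint32_t frame_base;   /* set by hardware on function entry          */
    uint32_t frame_size;   /* static size supplied by the compiler, e.g. 17 */
} scratch_frame;

/* Enter a callee whose compiler-determined scratchpad size is `size`.
   The new frame starts right after the caller's frame, wrapping around the SRAM. */
static scratch_frame scratch_enter(const scratch_frame *caller, uint32_t size) {
    scratch_frame f;
    f.frame_base = (caller->frame_base + caller->frame_size) % SCRATCH_SRAM_SIZE;
    f.frame_size = size;
    return f;
}

/* Spill/fill address: static byte offset within the current frame,
   e.g. offset 3 in a frame based at 170 lands at SRAM byte 173. */
static uint32_t scratch_addr(const scratch_frame *f, uint32_t offset) {
    return (f->frame_base + offset) % SCRATCH_SRAM_SIZE;
}

The asynchronous save/restore to DRAM on wraparound is deliberately left out here; in the real design that is what keeps the wrap from ever stalling a spill or fill.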
For the question “are there optimizations on the tableau done by the specializer?”: no, and there is no need for that. The specializer, scheduling to a target machine, produces nearly optimal code, and an optimal schedule in a lot of cases, so by construction the tableau is already optimized. As for other optimizations, there is no need to do them. The specializer’s job is not to optimize the code (that is the job of the compiler and the middle end), but to generate the code for specific operations. I.e. if the middle end creates some redundant operations, or operations whose results are ultimately ignored and not used by anything, the specializer is not going to care, and will still schedule them, with all those operations still present in the final code. And this is fine, because it makes the specializer faster, and whatever is fed into the specializer will in practice have been produced by a compiler that already got rid of such stuff.
As for the other questions and answers: a lot of it is obvious to Ivan, because he sits in this every day, so sometimes it might be hard for him to see that other people have problems with the terminology or even the most obvious answers. Usually giving more context is the best option, rather than simply giving a yes/no answer.
- in reply to: Floating Point Rounding #3361
I think it is really important to a) have control over rounding modes, and b) not make them global state (global state makes performance worse, makes calling functions tricky, and forces context switches to save and restore the state).
I favor explicit rounding modes in the instructions. I.e. the default is round to nearest (as dictated by IEEE 754), but each individual instruction can specify bits to force rounding toward plus infinity, toward minus infinity, or toward zero. Support for random rounding is definitely optional. It can be implemented using a fixed 2 bits if only the standard modes are supported, or using 1+2 bits, with the first bit being zero to indicate round to nearest and one to indicate non-standard rounding, and the two remaining bits indicating which other mode to use; this would of course make the instruction decoding a bit harder and make the instructions variable length.
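As a sketch of the 1+2-bit variant (the encoding values and names below are just my illustration, not anything the Mill has specified):

enum rounding_mode {
    RM_NEAREST = 0x0,            /* 0     : round to nearest, no extra bits needed   */
    RM_TO_ZERO = 0x1 | (0 << 1), /* 0b001 : toward zero (chop)                       */
    RM_TO_POS  = 0x1 | (1 << 1), /* 0b011 : toward +infinity                         */
    RM_TO_NEG  = 0x1 | (2 << 1), /* 0b101 : toward -infinity                         */
    RM_OTHER   = 0x1 | (3 << 1)  /* 0b111 : spare, e.g. away-from-zero or random     */
};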
x87, SSE, SSE2 and AVX all use global flags to control rounding modes. AVX-512 uses explicit rounding control in the instruction itself. This actually makes code much faster, because there is no global state to care about, and if you are frequently switching rounding modes (e.g. for interval arithmetic), there is no cost involved in switching at all.
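For example, with AVX-512 embedded rounding an interval addition can round the lower bounds down and the upper bounds up without ever touching MXCSR. A rough sketch (the function name and interval layout are my own, just for illustration; compile with -mavx512f):

#include <immintrin.h>

/* Adds 8 intervals [alo,ahi] + [blo,bhi] at a time: lower bounds rounded toward
   -inf, upper bounds toward +inf, using per-instruction rounding only. */
void interval_add(const double *alo, const double *ahi,
                  const double *blo, const double *bhi,
                  double *rlo, double *rhi) {
    __m512d lo = _mm512_add_round_pd(_mm512_loadu_pd(alo), _mm512_loadu_pd(blo),
                                     _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC);
    __m512d hi = _mm512_add_round_pd(_mm512_loadu_pd(ahi), _mm512_loadu_pd(bhi),
                                     _MM_FROUND_TO_POS_INF | _MM_FROUND_NO_EXC);
    _mm512_storeu_pd(rlo, lo);
    _mm512_storeu_pd(rhi, hi);
}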
- in reply to: Floating Point Rounding #3371
Adding to my previous reply that explicit per-instruction rounding modes are important and good: I was wondering if it would be possible to extend them even further for vector operations. Even in AVX-512, the rounding mode specified in the instruction applies in the same way to all lanes (vector elements). However, the ability to do 3 different roundings in one vector op (toward minus infinity, toward plus infinity, and toward nearest) could make interval arithmetic implementations even nicer and much faster. 2 additional bits per lane might be a bit too much to fit into an op, but a special “ALL” mode, which for two-element vectors is interpreted as (-INF, +INF), and for 4-element vectors as (-INF, NEAR, +INF, ZERO) for the respective lanes, would be awesome. This, combined with vector shuffles / permutations and expands, would make implementations of interval arithmetic rather straightforward, and really fast.
- in reply to: Mill vs. Spectre: Performance and Security #3358
Hi,
I have been following Mill developments for a few years, and I just watched this video about Spectre vs the Mill today, and I still have an issue. Yes, the Mill is not speculative in hardware, but as you said, the compiler will do speculation in software by reordering loads to get the best performance and hide stalls.
So C code like:
int foo(int x) { if (x >= 100) { return 0; } return A2[A1[x] * 64]; }
will produce op codes like this:
compare x, 100
return if false
load A1[x] to k
shift k
load A2[shifted k] to r
return r
any optimizing compiler will convert this into:
load A1[x] to k, delay=2
compare x, 100
return if false
shift k
load A2[shifted k] to r
return r
or even more aggressively into
load A1[x] to k, delay=0
shift k
load A2[shifted k] to r, delay=2
compare x, 100
return if false
return r
The semantics are the same, even if the loads access invalid memory, produce NaRs, and ultimately a fault in ‘return r’.
So the loads will still pollute the cache, and a timing attack can be performed. I do not see how the Mill is any safer, because the code will still speculate due to the compiler.
To prevent speculation by the compiler, you need to either tell the compiler not to do it at all (losing some performance), or annotate specific fragments of code as critical and not to be speculated, which is as error prone as the mitigations and barriers you would put manually on OoO speculative machines.
- in reply to: Floating Point Rounding #3379
Reading even more of IEEE 754-2008: some operations can also be marked as “exact”, a sort of 6th rounding mode. They produce results normally, as under any other rounding mode (e.g. nearest), but if the result is not exact, the inexact FP exception is raised (e.g. sqrt(2) produces a result that is not exact, it is rounded to something different from the real value, while 2*3 or 1/2 produce exact results). A really cool feature, because it allows you to write a fast path without caring too much about handling all the rounding, and to handle the inexact results in some slow handler. It would be most useful in integer-to-float and float-to-integer conversions. I.e. converting 2.5 to int produces an inexact value and raises the exception, while converting 2.0 to int produces an exact result and execution continues.
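A software analogue of that pattern already works with standard C99 <fenv.h>; this is just an illustration of the fast-path/slow-handler idea, not a claim about how the Mill would expose it:

#include <fenv.h>
#include <math.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON

/* Convert to integer, diverting to a slow path whenever the conversion was inexact. */
long to_int_exact(double x) {
    feclearexcept(FE_INEXACT);
    long r = lrint(x);                  /* rounds to nearest by default */
    if (fetestexcept(FE_INEXACT))
        fprintf(stderr, "inexact conversion of %g, took slow path\n", x);
    return r;
}

Here to_int_exact(2.0) returns 2 silently, while to_int_exact(2.5) takes the slow path.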
- in reply to: Loop compilation #3378
Hi Veedrac. This is really helpful! Thanks.
So, when we also include the ‘y’ and don’t use the spiller, it would be something like this?
L("init") load(&A1, 0, 0, 4, 0) %A1, or(x, 0) %x, xor(0, 0) %i, load(&n, 0, 0, 4, 0) %nsubi, xor(0, 0) %tmp1, xor(0, 0) %tmp2, or(y, 0) %y; % please ignore the exact order of ops here, load delays, or exceeding number of AGU and ALU FUs. <stalling few cycles for loads of A1 and nsubi to finish> eql(%nsubi); brtr("epilog", %A1 %x %i %nsubi %tmp1 %tmp2 %y); % %A1 and %nsubi are not needed, but whatever. L("loop") %A1 %x %i %nsubi %tmp1 %tmp2 %y; load(%A1, 0, $i, 4, 4) %a, % shouldn't be "load(%A1, 0, $i, 4, 5)"? mul(%x, %i) %xi; nop; nop; mul(%xi, %i) %xii; nop; nop; mul(%a, %xi) %tmp1inc, mul(%a, %xii) %tmp2inc; nop; nop; add1u(%i) %i_, sub1u(%nsubi) %nsubi_, eql(), add(%tmp1, %tmp1inc) %tmp1_, add(%tmp2, %tmp2inc) %tmp2_; brtr("loop", %A1 %x %i_ %nsubi_ %tmp1_ %tmp2_ %y); L("epilog") mul(%y, %tmp1) %tmp1y_; nop; nop; mul(%tmp1y_, %tmp2) %r_; nop; nop; ret(%r_);
(with the mul by y done first, so its result is closer on the belt, which might be useful if the loop were even larger).
Is
brtr("epilog", ...)
even allowed like that? The entire thing is no longer an EBB AFAIK, but I think the hardware should be OK with this. I am thinking of writing a simulator / dynamic recompiler for a Mill-like ISA (for research mostly), with the input machine code mostly written by hand at this point.
- in reply to: Floating Point Rounding #3372
One more thing: IEEE 754-2008 actually defines 5 rounding directions for single and double precision float-to-integer conversions: roundTiesToEven (nearest), roundTowardZero (chop), roundTowardPositive (to +infinity), roundTowardNegative (to -infinity), and roundTiesToAway (ties away from zero). So that would require 3 bits. And if the same 3 bits were used in other ops (like adds and multiplications), maybe the remaining 3 combinations could be used for something constructive, like random/stochastic rounding, or an up-and-down-and-near mode for vector operations (which would be useful for interval arithmetic A LOT).
- in reply to: Floating Point Rounding #3370
gideony2, /QIfist is not an optimization; it completely breaks the semantics of C. There is a reason why it is deprecated.
However, the example is definitely a good one. The x87 FIST/FISTP instructions, when used from C to convert float to integer, can’t be used on their own; the rounding mode must be changed to achieve the correct semantics. And as indicated, that is SLOW.
SSE3 still has global rounding modes, but the new instruction FISTTP ignores the rounding mode and always chops (from the x87 80-bit floating point stack into an integer). Yes, FISTTP is an x87 instruction, yet defined in a modern extension.
This is different from the FIST and FISTP instructions, which do use the rounding mode set in the x87 control register, and were defined in the 8087.
SSE2, using CVTTSD2SI (double precision float to 64-bit signed integer), does something similar and ignores the rounding mode.
And it is worth knowing that all these extensions, SSE2, SSE3, etc., use control flags (separate from the x87 control flags) to control rounding modes globally.
So, in general, on a modern x86 CPU you have:
The x87 control register (2 bits), which controls the rounding mode of all x87 FPU ops, including FIST/FISTP.
The SSE control register MXCSR (2 bits), which controls the rounding mode of most SSE2+ instructions (no idea about MMX and SSE).
CVTTSS2SI / CVTTSD2SI, which ignore the rounding mode in MXCSR and always truncate.
FISTTP (SSE3), which ignores the rounding mode in the x87 control register.

As for the latency: yes, changing rounding modes is slow, but the main design problem was that float->int conversion used a rounding mode in the first place, and C uses a different default for it (round toward zero) than the default for floating point arithmetic (round to nearest); see the small demonstration below. How Intel made such a big mistake (in 1980) is beyond me, as FORTRAN and C were already well-established languages (and they use rounding toward zero for float-to-integer conversions; Fortran 77 also has NINT, which rounds to nearest, but it probably wasn’t used that often), and it was obvious that float->int and normal float-op-float operations would require different rounding modes by design. Maybe the penalty for changing rounding modes was smaller in the days of shallower pipelines? Or float->int conversions were expected to happen infrequently (which is often the case: in scientific computation you often just crunch numbers all the time without converting anything back to integers). Or it was the fact that in Algol the only real way to convert from real to int was the ’round’ function, which rounds to nearest. Maybe Ivan or his IEEE 754 friends (William Kahan maybe?) know better about the history behind this. 🙂
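The demonstration of that mismatch, in plain C99 (nothing x86-specific): the cast always truncates regardless of the dynamic rounding mode, while lrint() honours it, which is exactly why compilers had to fiddle with the x87 control word for casts.

#include <fenv.h>
#include <math.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON

int main(void) {
    fesetround(FE_UPWARD);                       /* change the global rounding mode */
    printf("(int)2.3   = %d\n", (int)2.3);       /* prints 2: casts truncate regardless */
    printf("lrint(2.3) = %ld\n", lrint(2.3));    /* prints 3: follows the current mode  */
    fesetround(FE_TONEAREST);
    return 0;
}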
- in reply to: Mill vs. Spectre: Performance and Security #3360
Hi, Ghadi. That particular exploit was actually exploiting a bug in OpenSSL. It was an actual bug (a non-constant-time operation in crypto code) in OpenSSL, which got patched. SMT made it easier to exploit, but even without SMT it could likely have been exploited.
As for SMT: SMT does make a lot of things easier, indeed. It is very unlikely a Mill of any kind will ever be SMT enabled, because it would be really hard to do. SMT is beneficial on OoO cores because it is pretty easy to implement once you already have the OoO machinery. Adding SMT to an in-order core with no OoO is, relatively speaking, more expensive, and at that point it is often better to simply add more cores for an even bigger performance gain. Obviously SMT would probably still be beneficial even on a Mill-like design if the SMT were wide (i.e. 8-way) and the code you are executing is limited by memory latency. Time will tell; maybe you are better off using a GPU / Xeon Phi-like CPU then, or doing software SMT and trying to load as much in parallel as possible by rewriting the code, or developing an SMT-enabled Mill, but I am sure this is really not worth discussing at the current stage of development.