Forum Replies Created
- in reply to: Garbage collectors #1931
Yes, there are. A pointer has three bits which, in conjunction with some mask registers, can be used to detect and trap on up-level references in a generational or concurrent GC. The details are in the Memory talk: MillComputing.com/docs/memory.
You can only DoS yourself, just like you can by coding an infinite loop. A coroutine is not an OS process; it’s part of your program, and coroutines collectively run under your timeslice. If you want a true OS process you have to ask the OS for it as usual. Or you can get five processes from the OS and run 20 coroutines in them; mix and match.
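For illustration only, here is a minimal sketch in portable C of that model: the coroutine is just application code sharing the process’s timeslice, switched by the program itself. It uses POSIX ucontext; nothing here is Mill-specific.

```c
#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, co_ctx;

/* The "coroutine": plain application code that yields by swapping
 * contexts. No OS process is created; an infinite loop here would
 * only burn this program's own timeslice. */
static void coroutine(void) {
    for (int i = 0; i < 3; i++) {
        printf("coroutine step %d\n", i);
        swapcontext(&co_ctx, &main_ctx);   /* yield back to main */
    }
}

int main(void) {
    static char stack[64 * 1024];          /* the coroutine's private stack */
    getcontext(&co_ctx);
    co_ctx.uc_stack.ss_sp   = stack;
    co_ctx.uc_stack.ss_size = sizeof stack;
    co_ctx.uc_link = &main_ctx;            /* where control goes if it ends */
    makecontext(&co_ctx, coroutine, 0);

    for (int i = 0; i < 3; i++) {
        swapcontext(&main_ctx, &co_ctx);   /* resume the coroutine */
        printf("back in main\n");
    }
    return 0;
}
```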
- in reply to: Deferred loads across control flow #1913
It is easy to find cases where OOO is better, and cases where static is better. The only fundamental resolution of your question is measurement of real programs, and we’re not far enough along for that.
To take your particular example: MEM++ will only have visible impact if MEM is in DRAM and MEM++ is executed frequently. That combination is something of an oxymoron: if it is accessed frequently it will be in cache, and if it isn’t then the memory stall will happen but its effect will be in the noise. So the case we have to worry about is when there are lots of different MEMs that are individually infrequently accessed but collectively matter. That’s a description of a “working set”; it is important that the working set fit in cache, or performance dies on any machine. The Mill pays a lot of attention to working set: the high-entropy encoding, backless memory, streamers, whatnot.
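For concreteness, the pattern in question is something like the following; the name bump and the no-inline attribute are just for illustration.

```c
long MEM;                      /* some location in global data */

__attribute__((noinline))      /* the hypothesis: inlining is prevented */
void bump(void) {
    MEM++;                     /* load, increment, store; it stalls only
                                  if MEM is not already in cache */
}
```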
Now if MEM is in fact coming from DRAM then OOO doesn’t buy much. Yes, it can execute ahead of the stall, but in practice in general-purpose code an OOO machine runs out of one resource or another quickly, and the DRAM stall happens anyway. (An exception is when there is a ton of independent unrelated ops available, as in BLAS benchmarks or any other code that SIMDs well – but that’s not your case and we do as well as OOO on that kind of code anyway).
So let’s get more realistic and say that both the OOO and the Mill have the working set in the last-level cache (LLC), and that MEM is unrelated to what follows after the return. Then yes, Mill deferred loads won’t help, because the function is doing nothing but the increment, while the OOO deferral of execution will help because the increment can be overlapped with the caller – so long as nothing happens to drain the pipe, like an interrupt or mispredict. Like I said, it’s easy to find bounded cases where OOO is better.
But customers don’t care about cases, they care about whole programs. You hypothesize that the MEM++ function cannot be inlined. Well, sure, but if you cared about your performance why would you prevent inlining? And there are system-level tradeoffs to consider: in this case OOO might save a 10-cycle LLC delay, but then the cost is a 15-cycle mispredict stall – every time – instead of the Mill’s five cycles, and a heat/power budget that prevents adding another core or two that the Mill can have.
In some cases CPU engineering provides absolute guarantees that one approach is better than another: more instruction cache at the same power and latency is better, for example. But other matters are complex tradeoffs, for which the design answers often boil down to a seat-of-the-pants feeling that “real programs don’t do that”.
We can’t wait to publish our tool chain and sim so people can actually measure real programs. No doubt we will find that some of the things we thought were cool ideas turn out to not buy much, and we will find unsuspected bottlenecks. Measurement will tell. As we have long said as a sort of inside joke: “Only Mr. Sim knows!”
- in reply to: Fundamentals & References #1888
There is a remarkably good overview of the public Mill at http://liberomnia.org/wiki/Computer/The_Mill.en.html. Does anyone have a clue who the author might be? If it’s you, please stick up your hand!
- in reply to: Stochastic rounding for machine learning #1862
For training 16-bit perceptrons you shouldn’t need crypto-quality random; 16 bits sampled from the middle of a 32-bit LFSR should be fine. What I don’t understand (haven’t read the cites yet) is why regular FP rounding doesn’t work.
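For the curious, here is a minimal sketch in C of what that suggestion might look like: a 32-bit maximal-length LFSR (taps 32, 22, 2, 1, from the standard tables) whose middle 16 bits serve as the threshold in a stochastic round. The rounding granularity and its use for perceptron weights are my assumptions for illustration.

```c
#include <math.h>
#include <stdint.h>

/* 32-bit maximal-length Fibonacci LFSR, taps 32,22,2,1.
 * The state must be initialized to a nonzero value. */
static uint32_t lfsr32(uint32_t *s) {
    uint32_t x = *s;
    uint32_t bit = ((x >> 0) ^ (x >> 10) ^ (x >> 30) ^ (x >> 31)) & 1u;
    return *s = (x >> 1) | (bit << 31);
}

/* Stochastically round x to a multiple of step: round up with
 * probability equal to the leftover fraction, else round down,
 * so the expected value of the result equals x. The threshold
 * comes from the middle 16 bits of the 32-bit LFSR state. */
static double stochastic_round(double x, double step, uint32_t *state) {
    double scaled = x / step;
    double lo     = floor(scaled);
    double frac   = scaled - lo;                     /* in [0,1) */
    uint16_t r    = (uint16_t)(lfsr32(state) >> 8);  /* middle 16 bits */
    return (frac * 65536.0 > (double)r ? lo + 1.0 : lo) * step;
}
```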
- in reply to: Stochastic rounding for machine learning #1859
We once had a stochastic rounding mode defined; the idea was to explore the rounding sensitivity of an algorithm. The other members of the IEEE 754 (FP) committee dumped all over the idea, for reasons too technical for here even if I remember them correctly (I am not a math analysis guy).
However, your note is the first I have heard of stochastic rounding being of use in neural networks. Could you post a few links to (elementary, tutorial) material on the subject?
- in reply to: Garbage collectors #1933
The Mill has separate operations for pointers, distinct from those that would be used for integral operands of the same length. There are three masks, under application control. Two are eight bits wide and are used by normal load and store operations; the three-bit GC field in the address indexes a bit in the mask, and the operation traps if the bit is set. For storep (store pointer), the GC bits from the address and the bits in the pointer being stored are concatenated and index a bit in a 64-bit mask, again trapping if the bit is set.
Explicitly coded stack barriers are unnecessary on a Mill given this support; the hardware does the barrier checking. We do not have measurements of the resulting gain yet, but expect it to be significant given the frequency with which GC languages store pointers.
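A software model of the check, for illustration: the field widths come from the description above, but the position of the GC bits within a pointer and all the names here are assumptions.

```c
#include <stdint.h>

extern void gc_trap(void);     /* stands in for the hardware trap */

typedef struct {
    uint8_t  load_mask;        /* indexed by the address's 3-bit GC field */
    uint8_t  store_mask;       /* likewise, for ordinary stores */
    uint64_t storep_mask;      /* indexed by 6 bits for storep */
} gc_masks;                    /* "under application control" */

/* Assumed placement of the 3 GC bits within a 64-bit pointer. */
#define GC_BITS(p) ((unsigned)(((p) >> 56) & 0x7u))

static void check_load(const gc_masks *m, uint64_t addr) {
    if ((m->load_mask >> GC_BITS(addr)) & 1u)
        gc_trap();
}

/* storep: concatenate the address's GC bits with the stored
 * pointer's GC bits and index the 64-bit mask. */
static void check_storep(const gc_masks *m, uint64_t addr, uint64_t ptr) {
    unsigned idx = (GC_BITS(addr) << 3) | GC_BITS(ptr);   /* 0..63 */
    if ((m->storep_mask >> idx) & 1u)
        gc_trap();
}
```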
- in reply to: The Compiler #1911
The Mill members are load-module (.exe) compatible, but not bitwise encoding compatible. The load module contains member-independent genAsm and a cache of previous specializations for particular members. If you run that program on a member for which there is no binary in the cache then the loader specializes it to the target, as well as doing the dynamic linking, relocation, and the rest of what loaders do, and also writes the new specialized code back to the load module cache so it doesn’t have to be specialized again.
We expect that specialization will be a routine part of the install process for most software, but you can omit the install step and just get the specialization the first time you run it. Unless you are debugging the specializer you won’t need to look at the actual binary; this is essentially the same way that Java and other systems that distribute machine-independent code work.
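In outline, the load path just described behaves like the sketch below; every name here is hypothetical, not the real tool-chain API.

```c
typedef struct module module;    /* load module: genAsm + binary cache */
typedef struct binary binary;    /* member-specific specialized code */
typedef int member_id;

extern binary *cache_lookup(module *m, member_id target);
extern binary *specialize(module *m, member_id target);  /* genAsm -> binary */
extern void    cache_writeback(module *m, member_id target, binary *b);
extern void    link_and_relocate(binary *b);

binary *load_for_member(module *m, member_id target) {
    binary *b = cache_lookup(m, target);
    if (!b) {
        /* no binary cached for this member: specialize now, once */
        b = specialize(m, target);
        cache_writeback(m, target, b);   /* the next run skips this step */
    }
    link_and_relocate(b);                /* the usual loader work */
    return b;
}
```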
- in reply to: The Compiler #1907
Exactly. The lowering will happen; the question is whether to do it in the compiler or the specializer, and that gets a case-by-case decision.
- in reply to: The Compiler #1901
Yes, function-scale builtins could be emitted by the compiler and then get member-specific substitution in the specializer. The difficulty is in the compiler and host language: how would a program ask for one of these builtins? In all compilers (including those from third parties) and all languages?
Generally we consider the host language to be off limits. If something is needed by a language’s user community then let the standards process add it to the language and we will support it. Some such additions already exist: std::memcpy and others, and so for those we can do what you suggest. There are other possible candidates in libc and POSIX. However, it’s not clear that there would be much gain over simply having the compiler emit a function call and letting the specializer inline that.
Mind, it might be a good idea; the problem is that we can’t know until measurement. There is a lot of measurement and tuning in each member. It won’t get integer factors the way the basic architecture does, but a few percent here and there adds up too.
- in reply to: The Compiler #1892
Microcode gives me hives 🙂
Microcode is a bad fit with any wide-issue machine. A micro-coded copy routine won’t saturate the width (it’s just a loop, and internal dependencies will keep you from using all the FUs). When the same loop is macrocode, other app code can be scheduled to overlap with at least some of the ops of the copy, giving better utilization.
Most people (Linus included I expect) who prefer micro over a function call or in-line macrocode do so because they expect that a call is expensive and the in-line code will consume architectural resources such as registers, while the micro is cheap to get into and runs in the micro state so it doesn’t collide with app state.
Guess what: Mill calls are cheap and give you a whole private state that doesn’t collide either.
We’re not opposed to putting complex behavior in the hardware, for which the implementation will look a whole lot like a small programmable device: the Spiller is a glaring example, and the memory controller and the exit prediction machinery are too, and there are more NYF. But these are free-standing engines that are truly asynchronous to the main core execution and do not contend for resources with the application. But classical microcode? No, our microcode is macrocode.
- in reply to: The Compiler #1891
There are two different I/O questions: pin-twiddling I/O, and DMA. We expect all the twiddling to be memory-mapped, so (as with the rest of the Mill) there’s no need for privileged ops and all access control is via the existing turf-based protection mechanism.
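Memory-mapped twiddling then looks like an ordinary volatile memory access, with access control supplied by whatever protection covers that address range (turfs, on a Mill). The address and register layout below are invented for illustration.

```c
#include <stdint.h>

/* Hypothetical device register exposed as a plain address; no
 * privileged ops needed, just permission to touch this range. */
#define GPIO_OUT ((volatile uint32_t *)0x40010000u)

static inline void set_pin(unsigned pin)   { *GPIO_OUT |=  (1u << pin); }
static inline void clear_pin(unsigned pin) { *GPIO_OUT &= ~(1u << pin); }
```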
As for DMA, the question is coherence and real-time constraints. We expect that a DMA device will appear to be just another core to the program model. That said, how inter-process communication (other than shared memory) works is still NYF.
You see, nothing up this sleeve… 🙂
- in reply to: Stochastic rounding for machine learning #1867
The problem with a hardware random number generator is that programs want a repeatable sequence, and there may be more than one program concurrently using it. Hence each app must have its own RNG state, which must be saved and restored at every switch. That save/restore cost is paid by everyone, whether they use the RNG or not, and the Mill tends to avoid such things in the basic architecture.
The alternative is to give the RNG the semantics of an I/O device: you open it, you own it. That works for a bunch of embedded-type uses, but not for the general case. Or you can treat it as a functional unit, like an ALU, and the app maintains the state in its own space. That also works, but the load/store cost of calling the unit would swamp the actual RNG compute, and having the state in the app invites back-channel attacks.
We have considered a seed generator in hardware that would use a physical process (a back-biased diode or some such) to give a low-precision value that could be used as a seed for a conventional software RNG. Repeatability of sequence is then available by using a fixed seed, while the physical seed would be unique each time and so not need save/restore. That still might happen, although it’s not in the current sim.
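That division of labor would look roughly like this sketch; hw_physical_seed() is the hypothetical diode-backed source, and Marsaglia’s xorshift32 stands in for “a conventional software RNG”.

```c
#include <stdint.h>

extern uint32_t hw_physical_seed(void);  /* hypothetical physical seed source */

static uint32_t rng_state;

/* Fixed seed -> repeatable sequence; physical seed -> unique each run.
 * Either way the state lives in the app's own memory, so there is
 * nothing extra to save/restore at a switch. */
void rng_init(int repeatable, uint32_t fixed_seed) {
    rng_state = repeatable ? fixed_seed : hw_physical_seed();
    if (rng_state == 0) rng_state = 1;   /* xorshift state must be nonzero */
}

uint32_t rng_next(void) {                /* Marsaglia xorshift32 */
    uint32_t x = rng_state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return rng_state = x;
}
```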
- in reply to: Specification #1866
Nope, it’s the next revision to the standard in general, and may already have happened 🙂