Forum Replies Created
- in reply to: Wide Data / Multiple Return Values on a Short Belt #2064
Sorry Dave; somehow I missed your question and didn’t reply; I only picked it up when serprex answered for me 🙁
Serprex has it right: arguments that are too big or too many, and varargs too, are passed in memory. The tricky part is how to pass them through an RPC (remote procedure call) in which the callee cannot address the caller’s space. That protocol is in hardware, but it needs pictures to explain. The short answer, though, is that the compiler produces generic calls in the .gen file, and the specializer decides how to pass the arguments based on the signature and the target Mill family member as part of producing the .asm file. Ditto for returned values. In both directions, neither party can browse in the other guy’s stack rubble; they see only the passed/returned data, not even pad bytes in large objects.
- in reply to: Equity crowdfunding? #2061
Extensively 🙂
While the newly available Title III crowdfunding is attractive in general, there are a host of sub-categories, each with their own attractions and drawbacks, that must be decided among. In addition, the longer we are able to hold off the better the eventual deal will be. At this point we have decided to hold off any fundraising beyond the current Convertible Notes offering (sign up at http://millcomputing.com/investor-list/ if interested) until after we have numbers on the major benchmarks. Stay tuned 🙂
- in reply to: Living without a Stack #1976
Yes, we expect that for static languages (C, COBOL etc) only excess and overlarge arguments, and locals with address taken, will reside on the data stack.
In general we expect that a GC language will treat the belt as it does registers in a general-register machine: the residence of temporaries, including pointer temporaries. A GC event must identify these as roots. As with registers, one can map each instruction to table entries that identify roots, in the registers or on the belt. However, mapping every instruction gives a lot of table. Some GC systems instead have arrangements by which a GC event can only be triggered at specific points in execution, and maintain maps only for those points; that would also work on a Mill.
Another possibility is for the GC to assume that anything that looks like a pointer is a pointer, and to treat all such as roots, accepting the occasional false positive. This approach works better on a Mill than a conventional, because pointer-sized data can be distinguished from other data by the width tag on every Mill operand. In addition, if only some combinations of the GC-control bits in a pointer are valid then the root finder can rule out candidates with invalid control fields.
Note that the root-finder must also examine the scratchpad, and also the saved belts and scratchpad of frames further down the stack.
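A toy sketch of the conservative approach described above, filtering root candidates first by the width tag and then by the GC-control field. Every name here (the widths, the bit positions, the set of valid control values) is invented for illustration; the real Mill metadata encoding is NYF.

```python
POINTER_WIDTH = 8             # assume pointers are 8-byte operands
VALID_GC_BITS = {0b00, 0b01}  # assume only these control-field values are legal

def gc_control_bits(value):
    """Pretend the top two bits of a pointer hold its GC-control field."""
    return (value >> 62) & 0b11

def find_roots(operands):
    """operands: (width_in_bytes, value) pairs -- a belt, the scratchpad,
    or a saved belt/scratchpad of a frame further down the stack."""
    roots = []
    for width, value in operands:
        if width != POINTER_WIDTH:
            continue        # the width tag rules it out immediately
        if gc_control_bits(value) not in VALID_GC_BITS:
            continue        # invalid control field: cannot be a pointer
        roots.append(value)  # conservative: may still be a false positive
    return roots

# A toy "belt": a 4-byte int, a plausible pointer, and an 8-byte value
# whose control field is invalid.
belt = [(4, 42), (8, 0x00007F00DEADBEEF), (8, 0xC000000000000001)]
print(find_roots(belt))  # only the middle operand survives
```

The point of the two filters is that a conventional machine would have to treat all three operands as candidates; the width tag and control-field check discard two of them without any heap knowledge.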
- in reply to: When should we expect the next talk? #1972
October has been preempted for the company stakeholders meeting (the Mill is not all tech – there’s a business side too). So November at the earliest, but definitely not definite.
- in reply to: Arrakis and the Mill #1971
Not “somewhat”. Extraordinarily easier. 🙂
- in reply to: LLVM and pointers #1959
Yes, others have had this problem too. We have a work-around that works for us in simple cases. Most of the trouble seemed to be in the back-end framework, so we have abandoned that (at unfortunate schedule cost) and are working directly from the IR out of the middle-end. However, we are reasonably confident that there remain pointerhood-losing gotchas in passes that we haven’t exercised yet. We know that others are cleaning up the delinquent code, so it’s a race – do they get the holes filled before we fall into them?
- in reply to: Shared position-independent code #1946
Handling of local references from within dynamically loaded libraries has hardware support, NYF. Handling of inter-library references (the app is just another library) depends on whether hot-swap capability is requested by the ELF file. Hot-swap uses a PLT and GOT for such references, so that the indirection can be changed at run-time. Non-hot-swap uses quasi-static linking at load-time, but no code modification at run-time. Hot-swap code can reference non-hot-swap and vice versa transparently. We expect that the default will be that any explicit use of dlopen will use hot-swap, while dependent libraries (both dependents of the app and dependents of dlopen’d libraries) will be non-hot-swap by default.
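The difference between the two linkage styles can be modeled in a few lines. This is only a conceptual sketch of PLT/GOT-style indirection versus load-time resolution; the class names and the loader behavior are invented, not the actual Mill/ELF mechanism.

```python
class HotSwapLinkage:
    """Calls go through a mutable table (GOT-style), so the target can be
    replaced at run time without modifying any calling code."""
    def __init__(self):
        self.got = {}
    def bind(self, name, fn):
        self.got[name] = fn          # rebindable at any time
    def call(self, name, *args):
        return self.got[name](*args)  # one extra indirection per call

class QuasiStaticLinkage:
    """References resolved once at load time; no run-time indirection,
    but also no re-binding afterwards."""
    def __init__(self, symbols):
        self.resolved = dict(symbols)  # fixed at "load"
    def call(self, name, *args):
        return self.resolved[name](*args)

def v1(x): return x + 1
def v2(x): return x + 100

hot = HotSwapLinkage()
hot.bind("f", v1)
print(hot.call("f", 1))    # 2
hot.bind("f", v2)          # hot-swap: the same call site now reaches v2
print(hot.call("f", 1))    # 101

static = QuasiStaticLinkage({"f": v1})
print(static.call("f", 1))  # 2, and stays v1 for the process lifetime
```

The trade-off shown is the one in the post: hot-swap pays an indirection on every call in exchange for run-time rebinding, while quasi-static linking pays nothing after load but is frozen.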
- in reply to: Garbage collectors #1945
There is no loadp. It would require checking the loaded value after it came in, which is difficult in hardware. In our conversations with GC builders they have not indicated a need for it, given what else is there. If we found we needed the ability to check then it would be an idiom, with a normal load (checked like any plain load), followed by a trap-check of the value on the belt. Such an op would be easy to define and implement.
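The two-op idiom suggested above could look roughly like this. The function names, the barrier-bit encoding, and the trap mechanism are all invented for illustration; nothing here is a defined Mill op.

```python
class GCBarrierTrap(Exception):
    pass

BARRIER_BIT = 1 << 63  # assume one pointer bit means "needs GC attention"

def load(memory, addr):
    """A normal load, checked like any plain load (rights, bounds, ...)."""
    return memory[addr]

def trap_check_pointer(value):
    """The hypothetical follow-on op: trap if the belt value carries a
    barrier-marked pointer, else pass it through unchanged."""
    if value & BARRIER_BIT:
        raise GCBarrierTrap(hex(value))
    return value

memory = {0x10: 0x1234, 0x18: BARRIER_BIT | 0x5678}
print(trap_check_pointer(load(memory, 0x10)))  # clean pointer: falls through
try:
    trap_check_pointer(load(memory, 0x18))
except GCBarrierTrap:
    print("trapped")  # the GC handler would run here
```

The key property is the one the post names: the check happens on the belt after the value has arrived, so the load itself needs no extra hardware.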
As for storep of a None (or a normal store of a non-pointer), that’s a really good question and I had to go look at the sim code to see what it does. It’s even more complicated than just a None issue, c.f. the following comment in the code:
/* There's some question about where to do barrier checking. We don't want to throw a barrier trap on a store that won't actually store (because something was a NaR), so it should be after we have vetted all the arguments. However, we should only trap once even if the store is unaligned, and should deal with the possibility that the trap result is unaligned. Here seems the best place. */
So the answer in the present sim is that we trap even if everything is a None (note that vectors can be part None), and don’t trap on a NaR because the NaR has already aborted the store.
This means that a GC can get a barrier trap on an empty store, and will just have to figure it out. It also must deal with storep with a vector of pointers, where some elements would trap and some would not.
I’m reasonably certain that this area will be rethought before it appears in hardware 🙂
- in reply to: Shared position-independent code #1935
The explanation is too complex for here; it needs pictures. We expect that the upcoming talk will cover it; please be patient.
- in reply to: Prediction #2057
Maybe – we’ll only know with actual measurement of big programs. The new compiler is coming up, so those measurements will be available, although we’re short of people to do them.
- in reply to: Prediction #2055
The prediction in the patent addresses data prediction in OOO machines, which we are not 🙂 For data access, the Mill does not have a better answer than the patent. Instead, the Mill has changed the question, and the patent’s answer is irrelevant.
As for code prediction, the Mill innovations also change the question: we ask not whether a branch is taken, but where control flow will go in the future. This can be wrapped around any mechanism for recognizing similarity with the past; we don’t have anything novel in the predictor itself. Our initial predictor, for testing purposes, is the dumbest possible local predictor – the two-bit saturating counter, known for 30 years. But that predictor could be replaced by a global predictor, or a perceptron, or a … – regardless of the prediction mechanism, the Mill can run fetch ahead without needing to see the code.
Two-bit local predictors run 90-95% accurate (the best modern predictors run 98-99%), so ours will miss roughly once every 10 to 20 taken transfers. But that means that the Mill can run 10-20 EBBs ahead, more than enough to hide memory latency. Should be interesting 🙂
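For reference, the textbook two-bit saturating counter mentioned above, in its classic taken/not-taken form (the Mill wraps a mechanism like this around exit targets rather than per-branch taken bits; the table size and indexing here are arbitrary):

```python
class TwoBitPredictor:
    def __init__(self, size=1024):
        self.counters = [1] * size   # counters range 0..3; start weakly not-taken

    def predict(self, pc):
        return self.counters[pc % len(self.counters)] >= 2  # True = taken

    def update(self, pc, taken):
        i = pc % len(self.counters)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

# A loop branch taken 9 times, then falling through once. After warm-up the
# predictor misses only on the first iteration and the final exit.
p = TwoBitPredictor()
outcomes = [True] * 9 + [False]
hits = 0
for taken in outcomes:
    hits += (p.predict(0x40) == taken)
    p.update(0x40, taken)
print(f"{hits}/{len(outcomes)} correct")  # 8/10
```

The saturation is what gives the roughly 90% figure on loopy code: one miss entering the loop, one miss leaving it, and everything in between predicted correctly.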
- in reply to: The Compiler #1947
It is done in the specializer, based on extra info (not yet) produced by the compiler. The specializer wants to know the initiation interval and which variables will be vectorized and which are loop-scalar. It has proven complicated to get this info out of the compiler, and it doesn’t work yet.
- in reply to: Deferred loads across control flow #1941
NYF 🙂
Often what makes sense from a software point of view makes no sense from the hardware. And vice versa; the Mill design reflects years of work balancing those forces. Here conceptually one could indeed squeeze out belt positions, and thereby cut belt pressure and the need for longer belts and/or more scratchpad traffic.
We have in fact considered self-deleting belt operands, and even more bizarre ideas than that 🙂 Here the problem is hardware: removing the holes requires rather complex circuitry; the complexity increases super-linearly with longer belts; and it has to be done in the D2 logical-to-physical belt mapping stage, which is already clock-critical. The Mill now simply increments all the logical numbers by the number of new drops. The cost of that grows roughly as the log of the length of the belt (fanning out a signal costs more with increasing destinations, even though the increment itself is constant), and we expect that cost will be the major limiter for Mill members larger than Gold.
One might think that the Mill already has logic to reorder the belt (for conform and call), and that logic could be used to remove the holes. However, those ops contain the new mapping to use as a literal, precomputed by the compiler, which can be simply dropped in place in the D2 mapper; they don’t have to figure anything out, as is required by a hole compressor.
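The renaming scheme described above (a uniform increment, no hole compression) can be shown with a toy model. The belt length, data layout, and method names here are invented; this only illustrates why the increment is cheap while hole-squeezing would not be.

```python
class Belt:
    def __init__(self, length=8):
        self.length = length
        self.items = []  # index 0 is the most recent drop (b0)

    def drop(self, *values):
        # All of a cycle's drops retire together: every live logical number
        # is incremented by len(values) -- the uniform add described in the
        # post, with no per-operand decisions and no hole compression.
        for v in reversed(values):
            self.items.insert(0, v)
        del self.items[self.length:]  # oldest operands fall off the belt

    def read(self, logical):
        return self.items[logical]    # b0, b1, ... relative addressing

b = Belt()
b.drop("x")        # x is now b0
b.drop("y", "z")   # two new drops: x's logical number moves from 0 to 2
print(b.read(0), b.read(2))  # y x
```

Note that `drop` never inspects the operands already on the belt; squeezing out dead positions instead would require examining and conditionally shifting every slot, which is the super-linear circuitry the post rules out.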
So your idea is sound from a program-model view, but runs afoul of circuit realities. I’m a software guy, and a great many of my ideas have hit the wastebasket for the same reason. It’s only fair, though; I have shot down my share of the ideas the hardware guys have come up with 🙂
- in reply to: Garbage collectors #1939
Both Azul and Mill were from-scratch hardware designs, unconstrained by prior literature. It’s interesting that we independently came up with similar solutions for GC barriers. Of course GC was critical for Azul as a Java machine, while it’s a minor feature for the more generalist Mill.
You are right that I answered about our read/write barrier. Those barriers are the only GC-specific part of the Mill. There is no specific support for a stack barrier, and none needed; the return sequence is controlled by the spiller, and suitably privileged software can alter that sequence, including altering the return point to a trap, using the API that is intended for debuggers. We are not GC experts, but we thought that was sufficient; we may learn better when we port a GC language 🙂