Forum Replies Created
- in reply to: Inter-process Communication #3550
There are two kinds of potential fragmentation: of the physical memory space, and of the virtual address space. They are tied together in a legacy architecture, but separated on a Mill. The physical space is managed by the TLB, which can do paging and physical consolidation in a manner very similar to a legacy machine's. In contrast, the virtual space is never consolidated; this is a consequence of the fundamental design choice to put the caches in virtual addresses.
There are no obvious advantages to having a 32-bit virtual space in a 60-bit physical space. True, pointers would be four bytes rather than eight, but one can use 32-bit indices just as easily. There's the problem of programs that need more than 4G, but those could use large mode. The big problem, though, is mixed mode. Sandbox code will need to use facilities from outside the sandbox, and those would be reached by 8-byte addresses. Keeping track of near-vs.-far addresses is something we left behind with the 8086.
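To make the index-vs.-pointer point concrete, here is a sketch in ordinary C++ (the names are invented for illustration): a linked structure can carry 4-byte indices into a pool instead of 8-byte pointers, at the cost of one add per access.

```cpp
#include <cstdint>
#include <vector>

// Sketch: 32-bit indices into a pool instead of 64-bit pointers.
// The links take half the space; each access pays a base-plus-index add.
struct Node {
    uint32_t next;    // index into the pool, not an 8-byte pointer
    uint32_t value;
};

constexpr uint32_t NIL = 0xFFFFFFFFu;  // sentinel for "no next node"

uint32_t sum(const std::vector<Node>& pool, uint32_t head) {
    uint32_t total = 0;
    for (uint32_t i = head; i != NIL; i = pool[i].next)
        total += pool[i].value;        // pool base + 4-byte index
    return total;
}
```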
So yours is an interesting question: could it be done in the hardware? I think the answer is yes, the hardware could do it. But it would cause a great gnashing of teeth, both among our software teams and among any customers that used it. Would it sell enough more chips to justify the pain? I don't think so.
- in reply to: Mill vs MXP #3521
Mill SIMD is intended for short, fixed-length array data types, such as 4-color pixels. It is not intended for use with vectors or other long and/or unbounded data structures. For those the Mill has streams, a generalization of the limited variable-length vector notions of other architectures.
And in anticipation of your next question, yes, Mill streams are NYF. Sorry.
- in reply to: I was at it again. #3516
In case it is helpful as a contrast with your own work, here is the Mill tool chain assembler output for Hello World on the Copper configuration. It is two bundles here because Copper is slot-limited on the flow side; on Silver or Gold it would be the same instructions but only one bundle.
Reminder: Mill conAsm is really C++; this file is compilable by Clang/LLVM given appropriate #include's.
```cpp
s8:~/mill/build/testsuite$ cat hello*asm
// THIS FILE IS GENERATED BY THE specialize PROGRAM
// DO *NOT* HAND EDIT OR PLACE IN REPOSITORY
#include "absParse.hh"
int main(int argc, char* argv[]) {
    getOpt(argc, argv);
    targetMember("Copper");
    Section( .data ) {
    }
    Section( .rodata ) {
        const char* __x_str = "Hello, World!\12\0";
        assembler::defData("__x_str", private$, 0, over(__x_str, 15));
    }
    Section( .bss ) {
    }
    Section( .text )
    assembler::defFunc("memcpy", unknown);
    Section( .text )
    assembler::defFunc("memmove", unknown);
    Section( .text )
    assembler::defFunc("memset", unknown);
    Section( .text )
    assembler::defFunc("printf", external);
    Section( .text )
    assembler::defFunc("main", external);
    F("main");
        lea(cpp, gl("__x_str")) %1 ^13;     // V%1
        rd(w(0)) %0 ^12,                    // L6C2F0=/home/ivan/mill/testsuite/hello.c
            call0("printf", b1 %1) ^14,     // L5C2F0
            retn(b0 %0) ^15;                // L6C2F0
                                            // V%0 ^%1 ^%0
    Section( .init_array ) {
    }
    Section( .fini_array ) {
    }
    return 0;
}
```
Would it be a good idea? Yes. Right now? Well, maybe not.
We were all set to make our move this past spring; shutting down the Convertible Notes was the last preliminary. A “state of the company” talk would have been part of the active solicitation. You know what happened then. At some point we’ll have to say enough is enough and just go do it, but when? There’s an argument for not waiting, and there’s an argument for waiting, not until everything is better, but at least until everything is stably bad.
Either way, we won’t do anything during the summer, which is when the whole finance industry goes on vacation.
And yes, we’re frustrated too.
- in reply to: Inter-process Communication #3548
I see you have worked out a fair amount of the details 🙂
A part you are missing is that the local address space as a whole is present in the global space, so no translation is needed to convert between a local and a global address. The mapping is trivial, and does not require tables or look-up machinery such as an MMU. We have stolen a single bit in each pointer to distinguish whether the pointer refers to a local or global address. Locals are converted to global as part of the effective-address calculation in memory-reference operations, and the memory hierarchy sees only global addresses.
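To show the shape of the idea, here is a sketch in ordinary C++. The bit position and the base name are invented for illustration; the actual Mill encoding is not being specified here. The point is that the conversion is a mask and an add, not a table walk.

```cpp
#include <cstdint>

// Sketch only: one stolen pointer bit marks local addresses. Converting
// local -> global is pure arithmetic in the effective-address calculation;
// no MMU or translation table is involved. The bit choice is invented.
constexpr uint64_t LOCAL_BIT = 1ull << 59;      // hypothetical flag position

uint64_t to_global(uint64_t ptr, uint64_t local_base) {
    if (ptr & LOCAL_BIT)                         // local pointer?
        return local_base + (ptr & ~LOCAL_BIT);  // rebase into global space
    return ptr;                                  // already global
}
```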
A 32-bit Mill is possible, or even a 16-bit one, so long as that's enough memory for the application; embedded usage, for example. The rest of the Mill is really too heavy-duty for microcontroller use though, so the Z-80 and 6502 markets are safe from us.
- in reply to: Prediction #3544
On a Mill an indirect branch is the same as a direct one: Mills predict exits, not branches, and the prediction includes the address of the predicted target. The predictor neither knows nor cares how control will get there.
A legacy machine usually has two predictors: a taken/not-taken predictor for direct branches, and another that includes the target address for indirect branches. The latter is often called a Branch Target Buffer. How the two hook together varies.
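A toy model may make the contrast clearer. This is not a description of real hardware on either side, just the shape of the data structure: an exit predictor is keyed by the entry address of an extended basic block and yields the predicted exit target directly, so direct and indirect transfers look identical to it.

```cpp
#include <cstdint>
#include <unordered_map>

// Toy model, not real hardware. The table maps an EBB entry address to the
// predicted exit target; the fields and update policy are invented.
struct ExitPrediction {
    uint64_t target;   // address control is predicted to exit to
};

class ExitPredictor {
    std::unordered_map<uint64_t, ExitPrediction> table;  // keyed by EBB entry
public:
    const ExitPrediction* predict(uint64_t ebbEntry) const {
        auto it = table.find(ebbEntry);
        return it == table.end() ? nullptr : &it->second;
    }
    void train(uint64_t ebbEntry, ExitPrediction actual) {
        table[ebbEntry] = actual;  // last-outcome update; real policies vary
    }
};
```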
- in reply to: Prediction #3542
Excellent response!
You are right that the congruence arguments to branch instructions provide a replacement for the result specifiers of legacy ISAs. You are wrong when you assume that the costs are equivalent.
Legacy result specifiers are needed on any instruction that has a result. A congruence argument is needed only for values that are still live across a branch at the end of a basic block, and then only if the branch is going to a control flow join point. The counts of specifiers and of congruence arguments can and do diverge dramatically in real code. I see codes like sha1 where there are hundreds of results dropped in a block, and only two are passed to the next block. I’ve never actually instrumented the tools to measure, but by eyeball I’d guess that there is one congruence argument for every eight drops on average.
Just to give you a better guess I took a look at the code I happened to have been working on before writing this note: it's five functions from a buddy-allocator package, and I had been chasing a case of less-than-optimal asm coming from the specializer. In the whole package there were exactly three congruence arguments, one each on three branches; as it happened, all three were control variables of loops. The code of the five functions dropped 52 times total.
This ratio is so high for a mix of three reasons. First, many values die before they get to a branch; only the live ones need congruence. Second, the specializer doesn't use congruence for values that are live across loops but not used within the loop: instead it spills at the point of creation and fills at the point of use. That allocator code has five spills and six fills on the Silver config. A legacy machine with lots of registers could have just left the values in register. Of course those would have to be spilled across calls; as Heinlein said, TANSTAAFL. Third, the specializer goes to a lot of work to eliminate branches. That's to reduce the cost of prediction and mispredicts, but as a by-product it increases the benefit of congruence vs. specifiers.
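For intuition about why the counts diverge, consider an ordinary C++ block like the following (illustration only; the comments count what crosses the branch). A legacy ISA needs a result specifier on every instruction that produces t0, t1, t2, or acc; a Mill branch would carry congruence arguments only for acc and i, the values live at the back edge.

```cpp
#include <cstdint>

// Illustration: most results die inside the block; few cross the branch.
uint32_t checksum(const uint32_t* p, uint32_t n) {
    uint32_t acc = 0;
    for (uint32_t i = 0; i < n; ++i) {   // acc and i live across the back edge
        uint32_t t0 = p[i] << 3;         // dies in this block
        uint32_t t1 = t0 ^ 0x9E3779B9u;  // dies in this block
        uint32_t t2 = t1 + (t1 >> 7);    // dies in this block
        acc += t2;                       // only the sum survives
    }
    return acc;
}
```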
There isn't any conform op any more; the functionality was moved to the branch ops so a taken branch can rearrange the belt to match the destination. However, the same answer applies: the rearranging takes zero cycles and has essentially no cost. Nothing is actually moved; all that happens is that the mapping from belt number (static, as seen by decode) to operand location (dynamic) is shuffled.
There’s no real conflict/race. The “drop to the belt” doesn’t involve any real movement. The result stays in the producer’s output latches until the space is needed for something else (when a real move will happen). “Drop” just enters the new value’s physical location into the mapping from belt number to location. It tells the rest of the machine “when you want b3 it’s over there”. You can have multiple drops from different sources; each goes into its own output location, and each gets its own mapping. Yes, there are necessarily multiple updates to that mapping happening, but the values are 3-bit belt numbers, not 128-bit data, and the hardware can handle it.
The belt has no byte size; it's a logical list of locations. Its size is the number of locations it tracks; current configs exist with 8, 16, and 32 positions. It doesn't have to be saved, because each operand self-identifies with a timestamp, which is its modulo index in an infinite sequence of drops; the most recent numbers of that sequence are the current belt.
It’s a bit to get your head around, I know 🙂
We talk about the belt as if it were a shift register, and about drops as if they were copies, but that's all "as if" metaphor.
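If a software model helps, here is a sketch of that bookkeeping in ordinary C++. The structure and names are invented; the hardware is latches and small muxes, but the shape is the same: both a drop and a taken branch update only the small name-to-location map, never the operand data.

```cpp
#include <array>
#include <cstdint>

// Sketch: belt numbers are names, not places. A drop records where the new
// value physically lives; a taken branch permutes the name->location map to
// match the destination's expected belt. No operand data moves either way.
constexpr int BELT = 8;                  // an 8-position configuration

struct BeltMap {
    std::array<uint16_t, BELT> loc{};    // belt number -> output-latch id
    uint64_t drops = 0;                  // running count of all drops

    void drop(uint16_t latch) {          // new front of belt; the rest age
        for (int i = BELT - 1; i > 0; --i) loc[i] = loc[i - 1];
        loc[0] = latch;
        ++drops;                         // modulo index is the value's timestamp
    }

    // Taken branch: new belt position i gets the value currently at order[i].
    void rearrange(const std::array<uint8_t, BELT>& order) {
        std::array<uint16_t, BELT> next{};
        for (int i = 0; i < BELT; ++i) next[i] = loc[order[i]];
        loc = next;                      // pure renaming, zero data movement
    }
};
```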
I wonder too 🙂 Realistically, the virus has to settle down a little before we can do meetings.
As for the project, the tool chain is usable against our four test configurations; it is no longer on the critical path to product. Software is working on the micro-kernel, and hardware is working on getting the C++ expansions to make the right Verilog for all of them. There's a lot to do, but we are out of research and into development, and we can now put both money and talent to use.
Well, life is what happens when you were planning something else. We were starting to search for our next funding round, as announced, when the virus hit and that whole industry put on its hat and went home.
In some ways the plague is much less a problem for the Mill project than for other businesses. We have always been a distributed virtual company, so we already had work-from-home worked out. And as a sweat-equity organization with a burn rate of zero we have an infinite runway, while so many others are shut down and going bust.
So no news from us is good news, sorta. Thanks for your encouragement.
I suppose it depends on what you include in the kernel. Our OS framework is a set of cooperating services, but the great majority of that is app code with no particular privileges – things like a math library, but including a lot that in legacy CPUs has to run in the kernel.
We define the real kernel as that code which has to be trusted because it can unilaterally change the state and condition of other code that is not trusted. This trust is different from being relied upon: if your app uses sqrt in some calculation, you rely on the math library to in fact give you a square root. But sqrt cannot change the state of its caller (courtesy of the Mill protection model), so the app does not have to trust it in this sense.
So what has to be trusted? Not very much on a Mill: some initialization code for boot; the top-level interrupt handler; the dispatcher; a few allocators; and most importantly, the code that updates the protection state. We project ~3k LOC total. And nearly all of that code exists to deal with the way the Mill works, so it can’t be shared with any other platform, in either direction.
Of course, surrounding that microkernel there will be a ton of untrusted (but relied upon) libraries that we expect to lift from L4 and anywhere else. That will include a lot of what the original source thought was part of the kernel, but we don’t.
We anticipate considerable terminological confusion.
- in reply to: Division software and hardware implementation #3527
No, you should use email, not the forum, and email to me directly, not to contact. Looks like you have been treating mail from ivan@millcomputing.com as spam too. It isn’t 🙂
- in reply to: I was at it again. #3518
Yes, the ‘%’ and ‘^’ operators are overloaded, as is the ‘,’ operator. C++ lends itself fairly well to the creation of Application Specific Languages such as our conAsm assembler format.
Mill bundle execution extends over three physical cycles, on which seven phases are overlain. Each op has a spec that tells in which phase each of its arguments is evaluated (which may differ for different args), and in which phase it drops its results. If the drop phase of one op is earlier than the eval phase of another, then data can be passed among ops in the same bundle; that's what's happening in the lea->call dependency you noted.
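Reduced to code, that forwarding rule is just a comparison of the two specs; a minimal sketch (the phase numbering and names are invented for illustration):

```cpp
// Sketch: within one bundle, a consumer can take a producer's result iff
// the producer's drop phase strictly precedes the phase at which the
// consumer evaluates that argument. Phase numbers here are invented.
struct OpSpec {
    int evalPhase;   // phase at which a given argument is read
    int dropPhase;   // phase at which the result drops
};

bool canForwardInBundle(const OpSpec& producer, const OpSpec& consumer) {
    return producer.dropPhase < consumer.evalPhase;
}
```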
The listing includes a comment for each bundle that shows the belt events that happen during the cycle that contains the opPhase of that bundle. These comments are to help the reader understand what is going on, because so much is happening all at once. For example, the comment on the bundle with the call is “// V%0 ^%1 ^%0”. This says that (in that cycle) %0 was dropped (“V”) and %0 and %1 were evaluated (“^”). Looking at the ops, you see that V%0 came from the rd(), while ^%1 went to the call0() and ^%0 to the retn(). The items in the comment are in time order, so a ^%n can see and use a V%n to its left or in the comments of earlier bundles. Hence the retn can use the result of the rd, and so on.
As for the rest, the "section" stuff is organizational material for the linker, and is irrelevant if you are not doing separate compilation. The "main" at the top and the "return" at the bottom are there to make the whole thing into an executable C++ program. In the assemble step of the tool chain, that C++ program is compiled and executed. The execution builds an internal representation of the program and processes it to build either ELF binaries or the source which, when compiled, is the simulator for the program. We're working on going direct from specializer to binary, making the assembler step optional, but that's not up yet.
BTW, it’s always wise and prudent to ask 🙂 Most companies seem to feel that it’s not prudent to answer, but we differ.
- in reply to: Benchmarks #3513
We do all of those, and all the other code we can find that doesn’t require an OS underneath. There’s a couple of thousand now.