ivan
As with other very-long-latency operations, RNG does not fit well as an opcode. There is also the problem that RNG algorithms may be expected to improve with time. Consequently we see such a box to be optional per-member, and attached through a different interface that is not fixed-latency like regular ops.
We do plan to provide a physical seed operation. This is not a RNG, but will provide a few bits of completely unpredictable entropy to start things off.
akohdr
I'm not sure I'd consider a RNG a 'very-long-latency' operation. Suitably pipelined it becomes a generating function producing a value on each tick at the expense of area. Conditioning from an entropy source is an algorithm though, so I see where you are coming from. If you have the budget I'd urge at least a PRNG I think it would be of general value.
In the case of Monte Carlo I see a potential improvement from the temporal addressing provided by the belt. For maximum impact the trial state space must fit on the belt and there would need to be a local source of randomness. If the inner loop reverts to memory access it's no better than streamed vector approach using GPU/DSP.
ivan
The issue with a RNG and other hardware of similar character is not the bandwidth, it's the latency: how long between first request and first result. That needs to be no more than 4-5 cycles, and uniformly consistent for every use. Otherwise the hardware can exist, but it needs to be accessed in a different way or the whole machine suffers whether the RNG is used or not. So the right place for a RNG is as a side hunk of hardware, not reached by the usual operator interface. The timing, bot latency and bandwidth, would be as good as if it were an operation, but without the penalty to machine performance of other code.
Similarly, the belt cannot be indexed or have an address taken. Again, even the possibility of doing do would cut everybody's performance 2X or more, whether they used the facility or not. So this is potentially another hunk of out-of-line hardware for users with special needs. We've identified several others of the sort: large look-up tables, Galois Field operations, etc. They would all use the same interface.
squizzle
It may have been obvious to others, but it's only just "clicked" for me what is meant by an "external interface". My assumption was something like an external PCIe card performing some accelerating functionality.
If my "click" is correct, then a better view would be like an ADC on an embedded microcontroller. It's still part of the chip, but it's not accessed with a "readadc" opcode, it's accessed by setting the ADSC bit in the ADCSRA register, then reading ADC register at some point in the future (atmel AVR example)
So instead of a RdRand opcode like x64, you could have something like (paraphrasing, not certain of the syntax) ...
load(MM_rand,,)
... and if the RNG takes more time than you expected to provide the data, then you stall, and that's ok, because loads are designed to handle stalls. That doesn't mean that you'll have a 300 cycle latency to get a random number, but it does mean the RNG is allowed to be flexible on its timing.
Is my understanding correct?
ivan
Your understanding is correct.
There are further subtleties that make me say "like an i/o device". For testing, programs often need to have repeatable random number sequences, so that it is important that a program be able to glom onto a RNG and take all the numbers generated, without some other thread or process grabbing one in the middle of the sequence. That is, the RNG needs to be (able to be) single-user exclusive use. Most of the other bulk-algorithm hardware has the same characteristic: you don't want someone else getting the result of the MD4 that you started, and so on.
This exclusive use capability is characteristic of many i/o devices. Even those we often think of as having multiple concurrent users (a disk for example) actually has only one user (the driver) when you look closely at the hardware.
In contrast, operations are usable by all, and the hardware does the save-restore that gives the illusion that everybody has their own belt. It is impractical for hardware to provide the illusion that everyone has their own RNG. So the Mill provides an interface that supports exclusive use (using MMIO and the standard memory protection system) and asynchronous (or semi-synchronous) access.
PeterH
A RNG need not be long latency, depending on the requirements. Requirements for cryptography vs. a Monte-carlo simulation are different. For cryptography you ideally want a generator that can't be predicted, even knowing the last N numbers produced. If this takes 300 cycles to get a result, so be it, you aren't asking for than many random numbers. A Monte-carlo simulation, on the other hand, can accept less random results but likes them fast.
A hardware LFSR based generator should be faster than an adder, but is completely unsuitable for cryptography, far too predictable. A software based LFSR in the same code using the numbers I'd estimate running 1 vector of results/cycle on the mill.
Reading from a bank of asynchronous oscillators is fairly fast and pretty good if the sampling is slow compared to the oscillator rate. But this takes power to run the hardware. So high power consumption, slow sampling, or weak randoms. If combined with another independent method you can get top grade randoms.
ivan
We believe that the width of the Mill (on the sorts of family members one would expect to use for sim) makes a software implementation practical for low-grade RNGs like a LFSR. As mentioned, we expect to use an i/o like interface for a crypto box on those members that have one. The purpose of the hardware random seed generator that I mentioned is not for use as a RNG - not enough entropy even for low grade - but to provide a physics-based starting point that cannot be predicted even if an attacker has all the code and the machine circuits to study.
Whether the seed generator uses oscillators, back-biased diodes, or the howling of cats is up to the hardware guys and likely will vary by member. :-)
However, all this projection is very premature, and will no doubt be changed as we get closer to actual market.
jwm
I really like the protection domains and the PLB, I could really use this on my x86 right now. :-)
At about 1:03:40 there's an explanation of passing a graph structure by allocating an arena and passing the arena. There's a comment that the callee can "tromp on your arena structure". Shouldn't we be able to pass() the graph read-only to avoid the callee making any changes? Or does pass implicitly grant read and write access? It seemed that earlier at about 1:01:30 we see a write system call using pass(ptr, len, r), I assume that "r" is read and the absence of "w" means it's read-only.
ivan
You assume correctly; you can pass any subset of the rights you have, and the "r" is explicitly not passing "w".
About the arena: the concern is, if the arena is passed with "w" rights and contains both data and administration (which is the way many heaps are organized) the the recipient can tromp on the administration. However, if the administration is disjoint from the contained data (which is the way some heaps are organized) then only the data and not the administration is exposed. Of course, if the callee needs to add or remove nodes in the arena, the latter approach would require also giving the callee portals to the client routines that do the arena malloc and free, because the service would not be able to play with the administration directly itself.
It's clear that the Mill primitives can be used in many different ways to support many different models, and we expect the users and the field as a whole to spend happy hours exploring the possibilities and discovering what policies and protocols provide the best of both function and performance. Have fun :-)
LarryP
Non-forgeable turf/thread IDs: What can we assume re their size and properties?
In the security talk (circa slide 19, around 21 minutes in), a turf ID is shown as one field in a region descriptor. Turf IDs are also said to be non-forgeable, though how that's achieved isn't stated. If region descriptors must fit into the PLBs, the PLBs must either support variable sized descriptors (hard to do, while keeping the H/W fast and cheap) or must be fixed size.
If region descriptors are of fixed size and are constrained to be a single cache line each (true, or have I over-assumed?), then turf IDs are de-facto constrained in their length by the size of a cache line (on that Mill model), less the bits needed for the lower bound, upper bound and rights. If so, what is the minimum width (#bits) guaranteed for the turfID?
In general, short keys/hashes are less secure against attack, and a key length sufficient to be non-forgeable today is unlikely to remain so for very long. Are Mill turfIDs/ThreadIDs made non-forgeable via hash functions (as are the payloads in NaRs), or is there a different mechanism used to make turf (and thread) IDs unforgeable? What may we assume about the (minimum guaranteed) length and security properties of turf and thread IDs?
ivan
There is considerable member-dependent variation possible in the formats of these things, and there are probably variations that we haven't thought of yet that will be used some day. Consequently you should see this answer as representative rather than graven.
ID sizes (turf, task) are member dependent. The most important constraint on their sizes is that the product of a task and a turf is used as the implicit address to locate the root stacklet in a portal call, and hence must fit together in a virtual address (60 bits) together with a stacklet size and enough bits for non-stacklet memory as well. All these bit fields are per-member fixed, and the member designer can increase one only by shrinking another. A representative number might be 23 bits each for turf and task ids.
We don't seem to be constrained by the size of a portal data structure, although I suspect that when we implement that part of an OS that we'll find we want to put more, or more kinds or, data in the portal than we have now thought of. There is some need for the access to be power-of-two sized and aligned, and for hardware reasons it will probably be a multiple of the vector size even if we don't need the bits.
In general, such member-specific details are of interest only to the OS and will sit behind a member-independent API.
Will_Edwards
The IDs are unforgeable not because they are unguessable but because they are managed by hardware.
indolering
Providing a seed really is good enough to solve most of the
system level problems that we see with random number generators. DJB has actually
argued against adding output from a HRNG to the entropy pool on Linux. Although I don't fully agree with his conclusion (the threat model is pretty unrealistic) the theory behind his reasoning is sound.