Forum Replies Created
I am not sure that my specific question has been answered. I am wondering if it is conceivable that there could be a mechanism whereby data from the running program, indicating how many iterations the loop plans on doing, could be shipped over to the predictor. I know it often runs on historical data, but my question is whether “live” data could be used to override the history-based prediction.

E.g. let’s say I have a loop that is supposed to run 5 times this time; maybe it ran 10 times last time. For simplicity, let’s say I can only run one iteration of the loop at a time, and each iteration takes 4 cycles. So we know the loop will run for 20 cycles, then exit. We know we are going to run the loop 5 times because there is a counter on the belt with `5` in it. Because the loop runs for longer than the pipeline length of 5 cycles, we can send some live data over to the predictor and tell it, “when I sent this value, we had 5 iterations left to go”. Now, because X cycles must have passed since that value was sent, maybe we have 4 iterations left to go, but still, the predictor now knows with certainty when an exit should occur, right? This could potentially be a side effect of some of the Mill’s loop-specific instructions. My apologies if this is completely untenable to implement; I do not know the pains of designing hardware.
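Obviously I have no idea what this would look like in real hardware, but here is a toy software model of the mechanism I mean, using the numbers from above (all names here are invented for illustration):

```zig
const std = @import("std");

// Toy model of "live" loop prediction: the core ships the current trip
// count to the predictor, which can then compute the exact cycle at which
// the exit will be taken. All names are invented for illustration.
const LoopExitHint = struct {
    iterations_left: u64, // live counter value when the hint was sent
    cycles_per_iter: u64,
    sent_at_cycle: u64,

    // Should the predictor predict "exit taken" at cycle `now`?
    fn exitReached(self: LoopExitHint, now: u64) bool {
        const iters_done = (now - self.sent_at_cycle) / self.cycles_per_iter;
        return iters_done >= self.iterations_left;
    }
};

test "predictor knows the exit cycle exactly" {
    // 5 iterations left, 4 cycles each: exit lands 20 cycles after the hint.
    const hint = LoopExitHint{
        .iterations_left = 5,
        .cycles_per_iter = 4,
        .sent_at_cycle = 100,
    };
    try std.testing.expect(!hint.exitReached(119)); // still inside the loop
    try std.testing.expect(hint.exitReached(120)); // exit predicted here
}
```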
- in reply to: pdep and pext #3882
Looks like this got restored at some point after I tried again with:
Anyway, building off the example given here, is there an efficient way to implement the `select` function on the Mill, which gets the index of the k’th set bit?
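For concreteness, this is the function I mean, as a portable Zig sketch. The loop-and-clear implementation is just to pin down the semantics; on x86 with BMI2 the whole thing collapses to tzcnt(pdep(1 << k, x)).

```zig
const std = @import("std");

// `select`: index of the k'th set bit of x, counting from the LSB,
// with k = 0 picking the lowest set bit. Assumes k < @popCount(x).
fn select(x: u64, k: u7) u7 {
    var v = x;
    var i = k;
    while (i > 0) : (i -= 1) {
        v &= v - 1; // clear the lowest set bit
    }
    return @ctz(v);
}

test "select" {
    try std.testing.expectEqual(@as(u7, 1), select(0b0110, 0));
    try std.testing.expectEqual(@as(u7, 2), select(0b0110, 1));
}
```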
- in reply to: Why not encode data types in metadata? #3877
I don’t have any ideas that would be a silver bullet to disambiguate between different overflow/rounding behaviors. You’re right that you’d need at least some info bits either way.

However, with a little creativity and a complete and total ignorance of how to implement hardware, I have a shot in the dark to take.

What if you could track the types in the decoder and leave them there? Is there any way you could perform a type transformation equivalent to the transformation the instructions do? E.g. if I have a vector and turn that into a bitmap, you could figure out what the resulting type is without needing to compute the actual answer. Then you could decide which functional unit to delegate instructions to in the decode phase of the pipeline. Casts would then exist solely in the decode phase and would not need to take up precious functional units. It could theoretically affect how many instructions you can decode per cycle, but what kind of code have you seen that uses a ridiculous number of casts between totally incompatible types? I doubt it would be much more common than explicit no-ops on the Mill. I know doubles get abused for NaN-tagging tricks, but I wonder if there is still a tradeoff that could be worth it if you can reduce the opcode space by another large factor. Maybe that would enable even more ops to be decoded simultaneously, and it could pay for itself.
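To illustrate what I mean by tracking types through decode, the decoder would only need a transfer function over types, never over values. The op and type names below are completely made up:

```zig
// Invented op and type names, purely to illustrate "casts live in the
// decoder": given an opcode and the tracked type of its operand, the
// decoder can compute the result's type (and thus pick a functional
// unit) without ever computing the result's value.
const Ty = enum { u8x16, bitmask16, scalar_u32 };
const Op = enum { cmp_eq_to_mask, popcount_mask };

fn resultType(op: Op, in: Ty) ?Ty {
    return switch (op) {
        // a vector compare turns a vector into a bitmap...
        .cmp_eq_to_mask => if (in == .u8x16) Ty.bitmask16 else null,
        // ...and counting that bitmap yields a scalar
        .popcount_mask => if (in == .bitmask16) Ty.scalar_u32 else null,
    };
}
```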
- in reply to: Why not encode data types in metadata? #3865
What happened to this post? Did anyone receive the contents of my post in an email and would be willing to paste it here?
- This reply was modified 1 year, 8 months ago by Validark.
That’s good to hear! The wiki says “When a result value overflows, it is widened.” That’s why I was scared it might be dynamic.
- in reply to: array bound checking #3859
I am not suggesting that quad be mandatory, just that machines which do support quad could allow `addp` et al to work on quads in the way I’ve described. The specializer could break those up for lower-end Mills, couldn’t it?
- in reply to: Memory Allocation #3851
If it’s possible to OOM upon writing to the scratchpad, is it possible to recover from that? Are there ways to guarantee you won’t OOM? What are the things to avoid?
- in reply to: array bound checking #3849
Don’t higher-end Mills already support native 128-bit operations? Of course, machine width can support this feature as well, but supporting `{addr,len}` pointers directly would allow one to run more operations simultaneously (on higher-end Mills), which could be a big win in some scenarios where you want to do two `{addr+1,len-1}` ops in a cycle while also doing something else. It would also make things fall off the belt more slowly when using multiple fat pointers. The `{addr,len}` scheme doesn’t seem terribly specialized to me. To me it seems quite general-purpose and applicable to a wide variety of programming languages. I think most languages use `{addr,len}` all over the place. In many cases things have a `capacity` field too, but that doesn’t change during iteration, so it can sit elsewhere.

With regards to backwards iteration, I’d assume one could use `subp`, although I don’t particularly care what the exact semantics are. I have personally only used fat pointers for forward iteration, and I do not know how most languages support backwards iteration, if they even do (in Go, C#, and JavaScript, for example, one cannot go backwards with a blessed for..of/foreach loop; you have to index in manually or reverse the data in an array). The languages that do support backwards iteration appear not to give access to any data besides the current element (and perhaps whether there is another element to iterate over next). I think supporting backwards iteration through this mechanism is thus less important, but a language could theoretically construct a fat pointer which points to an element and the things to its left rather than to its right, and then iteration could be done with a `subp` that computes `{addr-1,len-1}`. It wouldn’t support going forwards and backwards randomly (unless `len` is just how many elements one wants to touch before stopping), but I’ve never used an iterator like that anyway, and most iterators can’t do that either. If I were doing something that weird, I wouldn’t expect the benefit of operating directly on pointers; I’d have an index into an array. The point of allowing `addp` to work on 128-bit fields directly (on high-end Mills) is to make the machine wider in the most common case, and iterating forwards is by far the most common case. Of course, on Mills with less width and no 128-bit data paths, they will have to issue two ops and two drops on the belt for each fat-pointer operation. Also, in the case where reversed iteration is abstracted away: great! You can just do it in whatever way is most efficient for the Mill (at least in 99% of cases).

I doubt anyone would want to garbage collect a fat pointer that’s been modified by iteration (one could, but it would take more bookkeeping and effort). E.g. in Zig, when you deallocate you are expected to give back the full slice/fat pointer that was given to you by the allocator originally, without the changes achievable by forward iteration (there is no blessed way to move a fat pointer backwards in Zig; you have to reconstruct it yourself). I think this problem is currently accounted for by most languages, and the `addp` overloading idea I had doesn’t conflict with garbage collection whatsoever.

Anyway, this is just an idea that I thought might be a nice optimization for high-end Mills to improve width and belt space in practice. Thank you for your consideration.
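Concretely, here are the semantics I’m imagining, modeled in Zig. `FatPtr` and the null-for-NaR stand-in are my inventions; `addp`/`subp` on scalars are real Mill ops, but these fat-pointer overloads are purely my proposal:

```zig
// The overloaded addp/subp semantics I'm imagining, modeled in software.
// On a real Mill the pair would occupy one 128-bit belt slot and the
// underflow would produce a NaR; null stands in for the NaR here.
const FatPtr = struct { addr: usize, len: usize };

// Forward iteration: addp on a fat pointer gives {addr+n, len-n}.
fn addp(p: FatPtr, n: usize) ?FatPtr {
    if (n > p.len) return null; // length would go below 0 => NaR
    return .{ .addr = p.addr + n, .len = p.len - n };
}

// Backward iteration: the pointer marks the current element, len counts
// it plus everything to its left, and subp gives {addr-n, len-n}.
// (Assumes addr >= n, i.e. the pointer stays inside the object.)
fn subp(p: FatPtr, n: usize) ?FatPtr {
    if (n > p.len) return null;
    return .{ .addr = p.addr - n, .len = p.len - n };
}
```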
- This reply was modified 1 year, 8 months ago by Validark.
- in reply to: A suggested market for the Mill… #3848
A well-placed tweet (er, woofe? is that the sound doges make?) at Elon Musk might be enough to get it in Teslas, haha.

I think another great market would be e-ink devices (e-readers and note-takers). Elimisteve remarked in the forum that phones and tablets spend most of their battery on screens and that a more efficient CPU wouldn’t necessarily give massive gains. Although I’d still prefer to have a Mill-backed phone, e-ink devices are an interesting, niche market where screens generally take a lot less power. Battery life is typically measured in weeks, not hours or days. From what I can surmise, CPU power consumption is difficult to balance against the high-performance chips that would ideally be in these devices. I follow what devices come out and which companies are in this space, and I have seen a lot of marketing materials for high-end devices where the beefier CPU is the second selling point, behind the screen type, of course. These devices all want to reduce input latency (and improve speed in general), but none of them can compete with the dedicated monitors from Dasung due to CPU/power limitations (Boox has standalone monitors too, but isn’t able to squeeze the same performance out of an identical panel). This is also a space where a lot of devices run custom Linux OSes, although many do use Android, for those who want to “side-load” Android apps to get more functionality outside of what the device ships with by default. Could this be an opportunity for the Mill?
Companies include SuperNote, Onyx Boox, ReMarkable, Kobo, Pocketbook, BigMe, of course Kindle and Nook, Sony, Huawei, Fujitsu Quaderno, Bookeen/Vivlio (just learned of this one now), and Lenovo, and if you check goodereader you’ll find there are even more companies that cater exclusively, or at least mainly, to the Chinese market. Dasung has released some non-monitors too, but thus far they are known for not investing in good software.
I’d be surprised if all of these companies passed on the Mill. I think the Mill would be a good fit for these devices, which want more horsepower at better power efficiency.
- in reply to: array bound checking #3843
Does the Mill have any instructions that can directly support fat pointers? E.g. in Zig I can do `cur = cur[1..];`, where `cur` has a `ptr` and a `len` field, each a machine word, and this will add 1 to the `ptr` and subtract 1 from the `len`. It would be cool if `addp`, e.g., could work directly on fat pointers. I.e. in the case that `addp` is given a fat pointer as an argument, it would simultaneously add to the pointer and subtract from the length, giving back a 16-byte fat pointer with the pointer half increased and the length half decreased (assuming you are on a Mill member where each slot on the belt can hold 16 bytes). If the length goes below 0, it should return a NaR.
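For reference, this is exactly what that Zig line does to the two halves of the slice, runnable as a Zig test:

```zig
const std = @import("std");

test "cur = cur[1..] bumps ptr and shrinks len" {
    const buf = [_]u8{ 10, 20, 30 };
    var cur: []const u8 = buf[0..];
    const old_ptr = cur.ptr;

    cur = cur[1..]; // the one-line update: ptr += 1, len -= 1

    try std.testing.expect(cur.ptr == old_ptr + 1);
    try std.testing.expectEqual(@as(usize, 2), cur.len);
}
```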