Forum Replies Created
- AuthorPosts
- in reply to: spiller work optimisation #2114
Thanks for the reply. Yes, it was longer than I expected, but useful. Maybe I should have said that I had watched all the videos, to make it a little shorter.
Yes, the rescue instruction is a good killer. I have the impression that the spiller, with 16 buffers on a Gold with 32 belt positions, is under a lot of pressure, because the inner instruction also produces “a new belt” and so causes work for the spiller. Considering that 80% of belt values are used immediately and only once, it seems that many belt positions hold values that could be killed to save the spiller work. Oh well, you are right: we have to wait for the hardware to see whether the pressure on the spiller is an issue or not. My patience is just a fraction larger than yours 🙂
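To put rough numbers on that impression, here is a toy model in Python. Only the 32 belt positions, the 16 buffers and the 80% figure come from this thread. Everything else is an assumption made for illustration: that all belt positions are hardware live at the call, that the used-once values are already dead by then, and the hypothetical helpers `values_to_save` and `overflow`. It says nothing about how the real spiller behaves.

```python
# Toy model only: the constants come from this thread, the model is an assumption.
BELT_POSITIONS = 32    # Gold belt length
SPILLER_BUFFERS = 16   # spiller buffers on Silver and Gold
USED_ONCE_PERCENT = 80 # share of belt values used immediately and only once


def values_to_save(hardware_live: int, dead_percent: int, kill_dead: bool) -> int:
    """Belt values the spiller would have to save at a call or inner.

    Assumption: if dead values are killed beforehand, they are simply not saved.
    """
    if not kill_dead:
        return hardware_live
    dead = hardware_live * dead_percent // 100  # integer math, rounds down
    return hardware_live - dead


def overflow(saved: int, buffers: int = SPILLER_BUFFERS) -> int:
    """Values that do not fit in the buffers and must go further down."""
    return max(0, saved - buffers)


if __name__ == "__main__":
    for kill in (False, True):
        saved = values_to_save(BELT_POSITIONS, USED_ONCE_PERCENT, kill)
        print(f"kill_dead={kill}: save {saved}, overflow {overflow(saved)}")
```

Treating the full 80% as dead at the call is the optimistic end of the range; a smaller `dead_percent` gives a smaller saving.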
- in reply to: spiller work optimisation #2129
I did not have fast deep recursion in mind. I raised the question about the 16 buffers that the Mill Silver and Gold have, which seems a small number that may not be sufficient to cope with even a small number of loops (using inner) and calls.
Ivan said that the spiller has SRAM that sits between the buffers and the L2$, so the Mill works fine if the SRAM is sized properly and has sufficient bandwidth. There will be less pressure on the buffers and the SRAM of the spiller if (software) dead operands are declared dead to the hardware (for example using the rescue operation or the suggested forget operation), but it is not yet known whether this is necessary. I think it is wise to look at gate-level sim runs to decide whether something must be done or not.
If it pays off to implement a forget operation and the encoding bits are an issue, one could use a forgeth1 for the first half of the belt positions and a forgeth2 for the second half, to reduce the number of encoding bits. A forget operation does not have to be executed in the instruction immediately before a call or inner; it can be executed earlier, as soon as an operand becomes dead.
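A small sketch of the split, in Python. It assumes that a forget operation would carry a bitmask of dead belt positions, which nobody has specified; with a 32-position belt a single forget would then need a 32-bit mask, while forgeth1 and forgeth2 need 16 bits each. The function name `split_forget` and the bit layout are hypothetical.

```python
BELT_POSITIONS = 32          # Gold belt length mentioned in this thread
HALF = BELT_POSITIONS // 2   # positions covered by each half operation


def split_forget(dead_positions):
    """Return (forgeth1_mask, forgeth2_mask) for a set of dead belt positions.

    Hypothetical encoding: bit i of the forgeth1 mask marks belt position i,
    bit i of the forgeth2 mask marks belt position HALF + i.
    """
    h1 = h2 = 0
    for pos in dead_positions:
        if not 0 <= pos < BELT_POSITIONS:
            raise ValueError(f"belt position out of range: {pos}")
        if pos < HALF:
            h1 |= 1 << pos
        else:
            h2 |= 1 << (pos - HALF)
    return h1, h2


if __name__ == "__main__":
    # Example: positions 10, 11, 14 and 15 are dead; all fall in the first half.
    h1, h2 = split_forget({10, 11, 14, 15})
    print(f"forgeth1 mask = {h1:#06x}, forgeth2 mask = {h2:#06x}")
```

When all dead positions fall in one half, as in the example, only one of the two operations would have to be issued.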
- in reply to: spiller work optimisation #2124
I understand that the Mill has a notion of a value in a latch being live, i.e. “on the belt”. I like to call this “hardware live”. There is also a “software live” notion: the compiler knows, for example, that after a sequence of instructions belt positions b10, b11, b14 and b15 are dead. I understand that if the next instruction contains a call or inner, the spiller will save these four and a number of other latch values in its buffers, and that the assumption is that this will not put too much pressure on the spiller’s performance (thanks for the extra explanation about the skid buffers).
There is not much information in the talks about what exactly happens on a retn. I assume that the spiller puts all hardware live values of the “frame that will become current” back in its buffers. But since the spiller’s buffer on a Gold CPU only has room for 16 values, it seems that having 2 nested loops could easily make the CPU stall while the spiller is getting values back from L2$. The question is also how many belt positions are hardware live and hence saved and restored. Based on the statistic that belt values are used only once in 80% of the cases, I speculate that the number of hardware live values can be, on average, twice the number of software live values, which implies that the number of values to restore could be reduced significantly.
Simply stalling on a retn because many values have to be put back from L2$ into the spiller buffers seems too naive, so I assume that a piece of the spiller puzzle is missing, one that explains why performance is okay here. Ivan, can you explain?
- This reply was modified 8 years, 6 months ago by mhkool. Reason: fix grammar