- GrantParticipantJuly 19, 2017 at 3:41 amPost count: 6
I recently saw a couple of blog posts that had a toy program (removing all ‘ASCII whitespace’ or less-than-33 values from a known-length byte sequence) and measured the performance (per cycle, and compared to a memcpy). On x64 , they managed 2.5 bytes per cycle (memcpy being 12 bytes per cycle), vectorized. But only 0.6 bytes per cycle, non-vector, and only 0.2 bytes per cycle with the most obvious implementation. On ARM , they managed 0.4 bytes per cycle with wide (but not vectorized) processing and 0.6 bytes per cycle with vectorization (NEON). They didn’t measure the basic program on ARM and the second blog post mentions 4 bytes per cycle for x64, without citation.
I’m interested in how the Mill performs on this (and in most anything Mill). I glanced through the instruction set and couldn’t find instructions that match the NEON and SSE3 ones that are used. Particularly, I couldn’t find something that could gather together the non-whitespace parts of a block (which could easily be turned into zero or None, in preparation), so that they can be sequentially written and counted. It looks like the handling of None is biased towards skipping bits of memory on writes, rather than compacting.
So, anyone got any ideas as to how it could be written and what the performance of it (per cycle, and compared to memcpy) would be? Also, the performance of the trivial version would be interesting, as I’d imagine it would manage at least 1 byte per cycle. (I’d expect we’d use only instructions that were already planned before this point, of course, and avoid anything not-yet-filed)
- Ivan GodardKeymasterJuly 20, 2017 at 7:48 pmPost count: 629
This is a nice little test; thank you for suggesting it. We might work it up into a talk like the switch talk. However, it really needs both pipelining and auto-vectorization, both of which are torn apart in the tool chain right now; plain scalar code isn’t really that informative. So let us beat the tools back into shape and I’ll post the result here when it’s ready.
- GrantParticipantJanuary 26, 2018 at 5:37 amPost count: 6
Would it then depend on the compiler spotting and optimizing the pattern, without obvious programmer feedback to let them know whether or not the optimization has been applied? With the ARM/x64 versions, one can at least tell by looking at the code that it has been heavily optmised, though competence can be hard to gauge and the code is hard to understand.
- Ivan GodardKeymasterJanuary 26, 2018 at 11:55 amPost count: 629
That kind of optimization is a middle-end problem, which in our case means LLVM, and LLVM doesn’t seem to know anything about such things; auto-vectorization in general is weak to absent.
You are right that the first step for a vectorized version on the Mill would be to None out the whitespace; that’s easy. There’s no reduction-compaction operation in the ISA at this point. One possibility would be to turn the None-laden vector into a bitmask using the mask() op, and then use the mask as the control in a switch that would execute an appropriate shuffle() op for each mask to do the compaction.
However, for low-height members (vector heights of 8 or 16 bytes) the cost of the armwaving to do vector compaction (absent a new op) probably exceed the cost of doing it in the naive scalar loop, which trivially gets one byte per cycle (var. 1 c/b)on a Mill.
Another approach might be to use the machine width to do several bytes at a time MIMD. Two-way would involve two loads and two stores per cycle, which larger Mills can do. Two compares, two adds, a pick and a branch (needed for the rest of the loop) would also fit in the same instruction in those members, so you’d get two bytes/cycle in scalar MIMD. As this is simple unrolling the compiler might be able to find it.
However, absent a compaction op in the ISA the right way to do this is streamers, but they are NYF.
You must be logged in to reply to this topic.