Mill Computing, Inc. Forums The Mill Architecture UTF-8 Decode Routine

  • Joe Taber
    Participant
    Post count: 25
    #3416

    I (think I) made a ConAsm UTF-8 decoder. It can decode a 1-4 byte UTF-8-encoded codepoint in 14 cycles. I posted it on my blog with some additional context and my thoughts, if you’re interested. There’s also a GitHub Gist with just the code, if you want that. I decided to forgo posting the code on the forum because I wanted to be able to update it as I get feedback and to have better formatting options.
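For readers who want a reference point for what such a routine has to compute, here is a minimal scalar sketch in Python. It is not the ConAsm code from the blog post or the Gist; the function name `decode_utf8_codepoint` is hypothetical, and the bit patterns are the standard UTF-8 lead-byte and continuation-byte layouts. Overlong encodings and surrogates are deliberately not rejected, to keep the sketch short.

```python
def decode_utf8_codepoint(data, pos=0):
    """Decode one code point starting at data[pos].

    Returns (codepoint, length_in_bytes). Raises ValueError on a bad lead
    byte, a truncated sequence, or a bad continuation byte.
    Overlong forms and surrogates are NOT rejected in this sketch.
    """
    first = data[pos]
    if first < 0x80:                  # 0xxxxxxx: ASCII, one byte
        return first, 1
    if first & 0xE0 == 0xC0:          # 110xxxxx: two bytes
        length, cp = 2, first & 0x1F
    elif first & 0xF0 == 0xE0:        # 1110xxxx: three bytes
        length, cp = 3, first & 0x0F
    elif first & 0xF8 == 0xF0:        # 11110xxx: four bytes
        length, cp = 4, first & 0x07
    else:
        raise ValueError(f"invalid lead byte 0x{first:02x}")
    if pos + length > len(data):
        raise ValueError("truncated sequence")
    for cont in data[pos + 1 : pos + length]:
        if cont & 0xC0 != 0x80:       # continuation bytes are 10xxxxxx
            raise ValueError(f"invalid continuation byte 0x{cont:02x}")
        cp = (cp << 6) | (cont & 0x3F)
    return cp, length
```

Calling `decode_utf8_codepoint(b"\xe2\x82\xac")` decodes the three-byte encoding of the euro sign and reports that three bytes were consumed.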

  • Veedrac
    Participant
    Post count: 25

    This is really neat, nice. Some comments on the things I’ve found out about the arch:

    con(v(0xe0, 0xf0, 0xf8)) is a length-3 vector? This is a little confusing; IIUC it needs to be 128 bits.

    andlu(%first, %prefmask) won’t work since the Mill doesn’t splat automatically.

    I don’t think you can return immediates, as in retntr(%onebyte, %first, 1).

    andlu’s immediate is morsel-sized, so andlu(%cont, 0xc0) won’t fit, and I don’t think the Mill will splat immediates either.
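The splat point can be illustrated outside of ConAsm. The Python sketch below models a vector as a tuple and refuses to AND operands with different lane counts, so the scalar lead byte has to be broadcast explicitly first. This is only an illustration of the idea, not Mill semantics: the helper names `splat`, `and_lanes` and `classify_lead_byte` are hypothetical, the masks are the ones quoted in the thread, and the match patterns are the standard UTF-8 lead-byte patterns.

```python
PREFIX_MASKS = (0xE0, 0xF0, 0xF8)     # masks quoted in the thread
PREFIX_PATTERNS = (0xC0, 0xE0, 0xF0)  # standard 2-, 3-, 4-byte lead patterns


def splat(scalar, lanes):
    """Broadcast one scalar into every lane of a vector (modelled as a tuple)."""
    return (scalar,) * lanes


def and_lanes(a, b):
    """Lane-wise AND; both operands must already have the same lane count."""
    if len(a) != len(b):
        raise ValueError("operands must have the same number of lanes")
    return tuple(x & y for x, y in zip(a, b))


def classify_lead_byte(first):
    """Return per-lane matches for the 2-, 3- and 4-byte lead patterns."""
    # The scalar is splatted explicitly; nothing broadcasts it for us.
    masked = and_lanes(splat(first, len(PREFIX_MASKS)), PREFIX_MASKS)
    return tuple(m == p for m, p in zip(masked, PREFIX_PATTERNS))
```

At most one lane matches for any byte, and an ASCII byte or a continuation byte matches none of them.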

    smearx(%picked) will return two elements, so you can dump the any(%picked).

    con(v(0, 0, 0, 0)) can be a rd() of the appropriate constant.

    Overall I don’t know if SIMD was the right choice; using pick and interleaving the different paths would probably be faster.
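One way to read the pick suggestion is: compute the result for every possible length as independent scalar work, then select the right one at the end. The Python sketch below shows that shape only; it says nothing about Mill scheduling or cycle counts. The names `pick` and `decode_with_pick` are hypothetical, `pick` here is just a ternary on already-computed values, and the input is assumed to be well-formed UTF-8, so no validation is done.

```python
def pick(cond, if_true, if_false):
    """Select one of two already-computed values, like a ternary."""
    return if_true if cond else if_false


def decode_with_pick(b0, b1=0, b2=0, b3=0):
    """Decode one code point from up to four bytes without branching on length.

    Assumes well-formed UTF-8; no validation is done.
    Returns (codepoint, length_in_bytes).
    """
    c1, c2, c3 = b1 & 0x3F, b2 & 0x3F, b3 & 0x3F
    # Compute every candidate up front; the four paths are independent.
    one = b0
    two = ((b0 & 0x1F) << 6) | c1
    three = ((b0 & 0x0F) << 12) | (c1 << 6) | c2
    four = ((b0 & 0x07) << 18) | (c1 << 12) | (c2 << 6) | c3
    # Classify the lead byte.
    is_two = (b0 & 0xE0) == 0xC0
    is_three = (b0 & 0xF0) == 0xE0
    is_four = (b0 & 0xF8) == 0xF0
    # Select the matching candidate and its length.
    cp = pick(is_four, four, pick(is_three, three, pick(is_two, two, one)))
    length = pick(is_four, 4, pick(is_three, 3, pick(is_two, 2, 1)))
    return cp, length
```

The trade-off is that all four candidates are always computed, which only pays off when that work can overlap instead of running one path after another.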
