Mill Computing, Inc. › Forums › The Mill › Architecture › UTF8
- mhkool, Participant, November 4, 2015 at 3:34 pm, Post count: 7, #2065
First of all, congratulations on the way Mill Computing works and on what you have shown so far. Very well done!
The problem domain of my work is strings, specifically fast lookups of small strings in large tables. Strings were ASCII for a long time, but UTF8 is now the default character encoding on major Linux distros. We used to see few non-ASCII UTF8 characters, but this is changing: we now see German URLs with u-umlaut (ü) in domain names, and Chinese and Arabic characters in URLs.
The C++ standard's UTF8 support leaves a lot compiler-dependent, which is not good for portability. The questions that I have with regard to UTF8 characters on the Mill are:
– what is the LLVM compiler going to support?
– which version of the Mill ISA will support a native type called utf8 (*)?
(*) I imagine that a native utf8 type can contain any UTF8-encoded character, so its width is between 1 and 4 bytes, and it has a meta-property, which is the width.
With this native type, adding a utf8 character to a pointer increments the pointer by the width of that character, and the load/store are *magic* since no width is specified: the load looks at the bytes, interprets them, loads the appropriate number of bytes (1-4), and sets the meta-property (width).
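For readers who want to see what such a load would have to do, here is a minimal software sketch in C++17. It is only an illustration of the idea above, not anything the Mill provides: `utf8_width`, `Utf8Char`, `load_utf8` and `count_chars` are hypothetical names, and the check covers only the lead byte and the remaining length, not the continuation bytes.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Width in bytes of the UTF-8 sequence introduced by `lead`, or 0 if
// `lead` cannot start a sequence (continuation byte or invalid lead).
constexpr std::size_t utf8_width(std::uint8_t lead) {
    if (lead < 0x80) return 1;                   // 0xxxxxxx: ASCII
    if (lead >= 0xC2 && lead <= 0xDF) return 2;  // 110xxxxx (0xC0/0xC1 are overlong)
    if (lead >= 0xE0 && lead <= 0xEF) return 3;  // 1110xxxx
    if (lead >= 0xF0 && lead <= 0xF4) return 4;  // 11110xxx (up to U+10FFFF)
    return 0;
}

// Hypothetical stand-in for the proposed "utf8" value:
// the raw bytes plus the width meta-property.
struct Utf8Char {
    std::uint8_t bytes[4];
    std::size_t width;  // 1..4, or 0 if nothing could be loaded
};

// The "magic" load: inspect the lead byte, then copy that many bytes.
inline Utf8Char load_utf8(std::string_view s, std::size_t pos) {
    Utf8Char c{};  // zero-initialised, so width == 0 means failure
    if (pos >= s.size()) return c;
    const std::size_t w = utf8_width(static_cast<std::uint8_t>(s[pos]));
    if (w == 0 || w > s.size() - pos) return c;  // bad lead or truncated input
    for (std::size_t i = 0; i < w; ++i)
        c.bytes[i] = static_cast<std::uint8_t>(s[pos + i]);
    c.width = w;
    return c;
}

// "Pointer + utf8 character": advance by the width of each loaded character.
// Stops counting at the first sequence that cannot be loaded.
inline std::size_t count_chars(std::string_view s) {
    std::size_t n = 0;
    std::size_t pos = 0;
    while (pos < s.size()) {
        const Utf8Char c = load_utf8(s, pos);
        if (c.width == 0) break;
        pos += c.width;
        ++n;
    }
    return n;
}
```

For example, `count_chars("m\xC3\xBCnchen.de")` walks 11 bytes but counts 10 characters, because the ü occupies two bytes.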
- Ivan Godard, Keymaster, November 4, 2015 at 4:42 pm, Post count: 679
Thank you 🙂
About UTF-8: in general we do hardware, and hardware doesn’t deal with character sets. The hardware deals with bytes, and has no knowledge of what those bytes hold. We deal with, or don’t deal with, character sets only in our software. There will not be any “native” character type in the architecture itself, UTF-8 or otherwise.
For LLVM you will get whatever LLVM gives us; it’s the same with other possible third-party software for the Mill, such as gcc or Linux itself. In the diagnostics and listings of our own in-house software, I’m afraid that we have been very lax about localization; we use the standard ASCII that C++ gives us. That will have to change, we know, but we have more pressing matters right now.
As for reading UTF8, what you need in the architecture is a funnel shifter attached to a streamer rather than a specialized load. We have some ideas in that direction, but all NYF.
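To make the term concrete: a funnel shifter takes two adjacent words and extracts a full-width window at an arbitrary bit offset across them, which is what lets a reader step through variable-width items without realigning loads. The C++ sketch below shows only that general operation in software; `funnel_shift_right` is a hypothetical name, and nothing here describes the actual Mill mechanism, which the post does not disclose.

```cpp
#include <cstdint>

// Treat hi:lo as one 128-bit value and return the 64 bits that start
// `shift` bits above the bottom of `lo`. `shift` is taken modulo 64.
constexpr std::uint64_t funnel_shift_right(std::uint64_t hi,
                                           std::uint64_t lo,
                                           unsigned shift) {
    shift &= 63;
    // shift == 0 is special-cased: shifting a 64-bit value by 64 is undefined in C++.
    return shift == 0 ? lo : (lo >> shift) | (hi << (64 - shift));
}
```

In a stream reader built this way, consuming an n-byte UTF-8 character would amount to advancing the offset by 8·n bits and refilling `hi` from the input whenever the offset passes a word boundary.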
- benke_g, Participant, August 18, 2016 at 10:23 am, Post count: 2
There could be a point in having a utf8 <-> utf32 pair of instructions: the hardware is trivial and the software isn’t, given that this encoding is gaining ground and bogs down string searches everywhere. Obviously these would be two-output operations, as we need to know how many bytes were consumed/produced. Just the count output would be useful for strlen applications. That said, it is really a wart for an instruction set…
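As a rough picture of what such a pair would compute, here is a software sketch in C++ where each operation returns two outputs: the converted value and the byte count. The names `Decoded`, `Encoded`, `utf8_to_utf32` and `utf32_to_utf8` are hypothetical, and signalling an error with a count of 0 is an assumption of this sketch, not part of the suggestion above.

```cpp
#include <cstddef>
#include <cstdint>

struct Decoded { char32_t code_point; std::size_t consumed; };    // consumed == 0: invalid input
struct Encoded { std::uint8_t bytes[4]; std::size_t produced; };  // produced == 0: not encodable

// utf8 -> utf32: decode one character from p[0..avail).
inline Decoded utf8_to_utf32(const std::uint8_t* p, std::size_t avail) {
    if (avail == 0) return {0, 0};
    const std::uint8_t b0 = p[0];
    if (b0 < 0x80) return {b0, 1};  // ASCII fast path
    std::size_t n;                  // total sequence length
    char32_t cp;                    // payload bits collected so far
    char32_t min;                   // smallest code point allowed at this length
    if ((b0 & 0xE0) == 0xC0)      { n = 2; cp = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { n = 3; cp = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { n = 4; cp = b0 & 0x07; min = 0x10000; }
    else return {0, 0};             // stray continuation byte or invalid lead
    if (avail < n) return {0, 0};   // truncated sequence
    for (std::size_t i = 1; i < n; ++i) {
        if ((p[i] & 0xC0) != 0x80) return {0, 0};  // not a continuation byte
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    // Reject overlong forms, values above U+10FFFF, and surrogates.
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return {0, 0};
    return {cp, n};
}

// utf32 -> utf8: encode one code point into at most 4 bytes.
inline Encoded utf32_to_utf8(char32_t cp) {
    Encoded e{};  // zero-initialised, so produced == 0 unless set below
    if (cp < 0x80) {
        e.bytes[0] = static_cast<std::uint8_t>(cp);
        e.produced = 1;
    } else if (cp < 0x800) {
        e.bytes[0] = static_cast<std::uint8_t>(0xC0 | (cp >> 6));
        e.bytes[1] = static_cast<std::uint8_t>(0x80 | (cp & 0x3F));
        e.produced = 2;
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return e;  // surrogates are not characters
        e.bytes[0] = static_cast<std::uint8_t>(0xE0 | (cp >> 12));
        e.bytes[1] = static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F));
        e.bytes[2] = static_cast<std::uint8_t>(0x80 | (cp & 0x3F));
        e.produced = 3;
    } else if (cp <= 0x10FFFF) {
        e.bytes[0] = static_cast<std::uint8_t>(0xF0 | (cp >> 18));
        e.bytes[1] = static_cast<std::uint8_t>(0x80 | ((cp >> 12) & 0x3F));
        e.bytes[2] = static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F));
        e.bytes[3] = static_cast<std::uint8_t>(0x80 | (cp & 0x3F));
        e.produced = 4;
    }
    return e;
}
```

A strlen-style character count needs only the `consumed` output: decode repeatedly, advance by `consumed`, and ignore `code_point`.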
- Ivan Godard, Keymaster, August 18, 2016 at 11:13 am, Post count: 679
It is not clear whether such conversion should be a scalar operation (which seems to be what you are suggesting), or a stream node; or perhaps both. Variable length stream parsing is a bear (and bit-stream parsing is worse), but we are trying for more general mechanisms rather than spot solutions. However, Mill streams are NYF.