KDevelop-PG-Qt development is continuing slowly, and I need some creative ideas. You do not have to know anything about KDev-PG-Qt or lexers, it is an algorithmic issue about Unicode, I am failing at arranging my thoughts in a way that could be transformed into (maintainable, understandable) code.
The problem is quite simple: Given a range of Unicode characters (e.g. “from a to z”, a-z, or 0xef-0x19fe3), I want to represent this range in UTF-8 or UTF-16 encoding, using 8-bit or 16-bit ranges respectively (e.g. 0×14-0×85 is a 8-bit range). For example any ASCII-range (like a-z) stays the same encoded in UTF-8 or UTF-16, for a more sophisticated example you should know how UTF-16 (the encoding used by QString) works:
- Any UTF-32 codepoint (character) between 0×0 and 0xd800 (including 0×0, excluding 0xd800) and between 0xe000 and 0×10000 is simply converted into a 16-Bit integer, it does not get changed.
- For any larger codepoint subtract 0×10000, now you have got a 20-Bit number, split it into two 10-Bit numbers, add 0xd800 to the first one and 0xdc00 to the second one. Now you have got two 16-Bit numbers, the high-surrogate and the low-surrogate, together those numbers represent the character in UTF-16.
- Any surrogates (between 0xd800 and 0xe000) are invalid.
The example: 0xffef to 0×20000 should be transformed into UTF-16. The starting-codepoint is within UCS-2 (smaller than 0×10000 and it is not a surrogate), the end is not within UCS-2, it has to be encoded with a surrogate pair. So in UTF-16 the range would be 0xffef to (0×80+0xd800, 0×0+0xdc00) = (0xd880, 0xdc00). But I want to have 16-Bit ranges, a range from a single 16-Bit number to a pair of 16-Bit numbers is simply nonsense. Thus the range has to be split into two parts:
- The UCS-2 part: 0xffef-0×10000
- The non-UCS-2 part: 0×1000 = (0xd800, 0xdc00) to (0xd880, 0xdc00), resulting in (0xd800 – 0xd880) followed by (0xdc00 – 0xe000)
In that example there are exactly two parts, but that may be different, a range may have to be split because 0xd800-0xe000 are invalid, the starting codepoint may be non-UCS-2, then there can be up to three ranges, e.g.:
We want to encode the range 0x1000f-0x200ff. 0x1000f is (0xd800, 0xdc0f) in UTF-16, 0x200ff is (0xd880, 0xdcff), so we need three ranges:
- The first few surrogate pairs with the same high-surrogate: 0xd800 followed by (0xdc0f – 0xe000)
- The majority of surrogate pairs in the middle, where every low-surrogate is allowed: (0xd801 – 0xd880) followed by (0xdc00 – 0xe000)
- The rest with the same high-surrogate: 0xd880 followed by (0xdc00 – 0xdcff)
I hope it is clear which output-format I would like to have: a set of sequences of 8/16-Bit ranges (ranges may have only one element). Unfortunately there are a lot of cases, I have implemented the conversion (commit 1dbfb66ca66c392c6ffdaa3ff9d00d399370cf0e) for UTF-16 using a lot of if/else for handling all special cases, involving 130 lines of ugly code, well that was not too bad, but now I need the same for UTF-8: up to four ranges in sequence, and much more special cases, 500 lines of unreadable, boring, unmaintainable code just for UTF-8? I do not like this idea. I need some creative ideas how to implement it in a more abstract way without handling all cases explicitly, and without brute force (checking all codepoints within the range separately). There is one simple idea: do not care about 0xd800-0xe000 and remove it at the end. But I have currently no idea how to simplify the rest, information how many UTF-8 surrogates are needed to represent the start and the beginning and if the surrogates have the maximum or the minimum value should definitely be used. Read this for a comprehensive description of the UTF-8 format, it is not more complicate than UTF-16. Consider it a contest, the winner will recieve eternal glory, I hope somebody has a nice idea.