## Algorithmic ideas needed for Unicode

Hi!

KDevelop-PG-Qt development is continuing slowly, and I need some creative ideas. You do not have to know anything about KDev-PG-Qt or lexers, it is an algorithmic issue about Unicode, I am failing at arranging my thoughts in a way that could be transformed into (maintainable, understandable) code.

The problem is quite simple: Given a range of Unicode characters (e.g. “from a to z”, a-z, or 0xef-0x19fe3), I want to represent this range in UTF-8 or UTF-16 encoding, using 8-bit or 16-bit ranges respectively (e.g. 0×14-0×85 is a 8-bit range). For example any ASCII-range (like a-z) stays the same encoded in UTF-8 or UTF-16, for a more sophisticated example you should know how UTF-16 (the encoding used by QString) works:

• Any UTF-32 codepoint (character) between 0×0 and 0xd800 (including 0×0, excluding 0xd800) and between 0xe000 and 0×10000 is simply converted into a 16-Bit integer, it does not get changed.
• For any larger codepoint subtract 0×10000, now you have got a 20-Bit number, split it into two 10-Bit numbers, add 0xd800 to the first one and 0xdc00 to the second one. Now you have got two 16-Bit numbers, the high-surrogate and the low-surrogate, together those numbers represent the character in UTF-16.
• Any surrogates (between 0xd800 and 0xe000) are invalid.

The example: 0xffef to 0×20000 should be transformed into UTF-16. The starting-codepoint is within UCS-2 (smaller than 0×10000 and it is not a surrogate), the end is not within UCS-2, it has to be encoded with a surrogate pair. So in UTF-16 the range would be 0xffef to (0×80+0xd800, 0×0+0xdc00) = (0xd880, 0xdc00). But I want to have 16-Bit ranges, a range from a single 16-Bit number to a pair of 16-Bit numbers is simply nonsense. Thus the range has to be split into two parts:

• The UCS-2 part: 0xffef-0×10000
• The non-UCS-2 part: 0×1000 = (0xd800, 0xdc00) to (0xd880, 0xdc00), resulting in (0xd800 – 0xd880) followed by (0xdc00 – 0xe000)

In that example there are exactly two parts, but that may be different, a range may have to be split because 0xd800-0xe000 are invalid, the starting codepoint may be non-UCS-2, then there can be up to three ranges, e.g.:
We want to encode the range 0x1000f-0x200ff. 0x1000f is (0xd800, 0xdc0f) in UTF-16, 0x200ff is (0xd880, 0xdcff), so we need three ranges:

• The first few surrogate pairs with the same high-surrogate: 0xd800 followed by (0xdc0f – 0xe000)
• The majority of surrogate pairs in the middle, where every low-surrogate is allowed: (0xd801 – 0xd880) followed by (0xdc00 – 0xe000)
• The rest with the same high-surrogate: 0xd880 followed by (0xdc00 – 0xdcff)

I hope it is clear which output-format I would like to have: a set of sequences of 8/16-Bit ranges (ranges may have only one element). Unfortunately there are a lot of cases, I have implemented the conversion (commit 1dbfb66ca66c392c6ffdaa3ff9d00d399370cf0e) for UTF-16 using a lot of if/else for handling all special cases, involving 130 lines of ugly code, well that was not too bad, but now I need the same for UTF-8: up to four ranges in sequence, and much more special cases, 500 lines of unreadable, boring, unmaintainable code just for UTF-8? I do not like this idea. I need some creative ideas how to implement it in a more abstract way without handling all cases explicitly, and without brute force (checking all codepoints within the range separately). There is one simple idea: do not care about 0xd800-0xe000 and remove it at the end. But I have currently no idea how to simplify the rest, information how many UTF-8 surrogates are needed to represent the start and the beginning and if the surrogates have the maximum or the minimum value should definitely be used. Read this for a comprehensive description of the UTF-8 format, it is not more complicate than UTF-16. Consider it a contest, the winner will recieve eternal glory, I hope somebody has a nice idea.

### 16 Responses to “Algorithmic ideas needed for Unicode”

1. zwirbeltier Says:

I implemented the algorithm in about 70-80 lines of readable (never more then two layers of nested if’s) Python code. The algorithm “scans” through all the different ranges and uses the fact that the exact values for “start” and “end” are only needed when we are in the same range as the value. Perhaps the code speaks for itself (hoping for a nice formatting by the blog software):

def utf16_range(start, end):
# if start > end:
# 	return utf16_range(end, start)
ranges = []

start_in_range = False
end_in_range = True

if start < 0xD800:
start_in_range = True
if end < 0xD800:
return ranges + [(start, end)]
if end < 0xE000:
raise Exception("end has a surrogate value")
ranges += [(start, 0xD7FF)]

if not start_in_range and start < 0xE000:
raise Exception("start has a surrogate value")

if start_in_range:
start = 0xE000

if start < 0x10000:
start_in_range = True
if end < 0x10000:
return ranges + [(start, end)]
ranges += [(start, 0xFFFF)]

if start_in_range:
start = (0xD800, 0xDC00)
else:
start = surrogates(start)
end = surrogates(end)

if start[0] == end[0]:
return ranges + [(start[0], (start[1], end[1]))] # (fixed_high, (low_from, low_to))
else:
ranges += [(start[0], (start[1], 0xDFFF))] # (fixed_high, (low_from, low_to))
start[0] += 1
start[1] = 0xDC00

if start[0] == end[0]:
return ranges + [(start[0], (start[1], end[1]))] # (fixed_high, (low_from, low_to))
else:
e0 = end[0] - 1
ranges += [((start[0], end[0]-1), None)] # ((high_from, high_to), arbitrary_low)

if end[0] == 0xDCFF:
ranges[-1][0][1] += 1
return ranges

return ranges + [(end[0], (0xDC00, end[1]))] # (fixed_high, (low_from, low_to))

def surrogates(utf32):
x = utf32 - 0x10000
# 1111 1111 1100 0000 0000 for the top ten bits
# F    F    C    0    0
# 0000 0000 0011 1111 1111 for the low ten bits
# 0    0    3    F    F
return ((x & 0xFFC00) + 0xD800, (x & 0x003FF) + 0xDC00)

Perhaps there still is a smarter way to handle this. There surely are some optimizations that can be implemented if the code that handles the ranges is modified somewhat.

regards,
Roman

2. zwirbeltier Says:

So much for the nice formatting

A better version of the code can be seen here: http://nopaste.info/f23760e5a5.html

3. hate-engine Says:

Do you really need it? 99.5% of programming languages use basic latin characters, except for string constants and comments => no need for extra complexity.

4. The User Says:

@hate-engine
Well, it should be a Unicode-aware-parser, and more and more languages are starting to support it. It already works with Unicode, that code is only needed if you do not want to convert the UTF-8/UTF-16 into UTF-32, such a conversion is already integrated, but the implementation should be “complete” and allow direct usage of UTF-8/UTF-16 – I want to try it in practice if the UTF-32-approach ist faster or slower. Btw I would not consider Java to be less than 0.5%.

@zwirbeltier
Try <pre lang=”python”>. Fixed it.
But your code already looks nice, start_in_range is a nice flag, which could be used for each UTF-8 surrogate. Every variable/whatever is nice if it is the same for each surrogate. Thank you.

5. Francesco R. Says:

something along these lines?

RANGE_START=0
RANGE_END=1

def utf16_range(start, end):
ranges = [[0x1, 0xD7FF], [0xE800, 0xFFFF], [0xA0000, 0xFFFFF]]
a = b = -1
for i in xrange(0, len(ranges)):
if ranges[i][RANGE_START] &lt;= start &lt;= ranges[i][RANGE_END]:
a = i
if ranges[i][RANGE_START] &lt;= end  start:
raise Exception("start is broken")
if b == -1 or end &gt; ranges[b][RANGE_END]:
raise Exception("end is broken and 2012 is near")
ranges = ranges[a:b+1]
ranges[0][RANGE_START] = start
ranges[-1][RANGE_END] = end

return ranges

print utf16_range(0x11, 0xD7F)
6. Francesco R. Says:

oops, again


RANGE_START=0
RANGE_END=1

def utf16_range(start, end):
ranges = [[0x1, 0xD7FF], [0xE800, 0xFFFF], [0xA0000, 0xFFFFF]]
a = b = -1
for i in xrange(0, len(ranges)):
if ranges[i][RANGE_START] <= start <= ranges[i][RANGE_END]:
a = i
if ranges[i][RANGE_START] <= end  start:
raise Exception("start is broken")
if b == -1 or end > ranges[b][RANGE_END]:
raise Exception("end is broken and 2012 is near")
ranges = ranges[a:b+1]
ranges[0][RANGE_START] = start
ranges[-1][RANGE_END] = end

return ranges

print utf16_range(0x11, 0xD7F)
7. The User Says:

@Francesco
That may be useful to “select” the ranges. Well, the code does not implement the pair-generation, but the idea may be useful. 0xE800 should be 0xE000, right? But what is 0xA0000 and 0xFFFFF?

8. Tim R Says:

Would a QMap do the trick ? Making the correspondance between utf-8 -> utf-32 and an other one for utf-16 -> utf-32 ?

9. Esben Mose Hansen Says:

*Surely* there exists a library that can handle this? I’d suggest looking for such.

10. Kevin Kofler Says:

I’d not validate anything. If it’s invalid, what you do with it is undefined. (Well, the standard says it’s not, but do you really need to care about e.g. lookalike character attacks in a piece of code? If something is an overlong sequence for ‘i’ and you treat it as an ‘i’, what’s really going to break?) Garbage in, garbage out. Without validation, UTF-8 parsing is very straightforward, you can use a loop instead of case distinctions.

11. zwirbeltier Says:

I was able to cut another ten lines off the code.

Actually $start_in_range is not needed at all, but it makes one concept clearer: Once you’ve considered the range of$start you don’t need to know it’s exact value anymore. All you need to know is that you already passed \$start.

For the first two ranges (0×0 – 0xD7FF and 0xE000 – 0xFFFF) you could use a loop. But I think it does not make the code clearer or faster. It may be a good idea for UTF-8 though – I didn’t take a look yet though.

So here we go (version 2):

def utf16_range(start, end):
# if start &gt; end:
# 	return utf16_range(end, start)
ranges = []

if start &lt; 0xD800:
if end &lt; 0xD800:
return ranges + [(start, end)]
if end &lt; 0xE000:
raise Exception(&quot;end has a surrogate value&quot;)
ranges += [(start, 0xD7FF)]
start = 0xE000 # start of the next range
else:
if start &lt; 0xE000:
raise Exception(&quot;start has a surrogate value&quot;)

if start &lt; 0x10000:
if end  0xDC00: # do we need a separate first line
ranges += [(start[0], (start[1], 0xDFFF))]
start[0] += 1

if end[1] = start[0]: # is there still a block left
ranges += [((start[0], end[0]), (0xDC00, 0xDFFF))]

return ranges

def surrogates(code):
x = code - 0x10000
# 1111 1111 1100 0000 0000 for the top ten bits
# F    F    C    0    0
# 0000 0000 0011 1111 1111 for the low ten bits
# 0    0    3    F    F
return ((x &amp; 0xFFC00) + 0xD800, (x &amp; 0x003FF) + 0xDC00)
12. zwirbeltier Says:

Sorry, still no useful formatting (quotes and are filtered badly): http://nopaste.info/ae48dbfbe7.html

13. The User Says:

@Esben
I am not sure, it is not a very common task. I am not familiar with ICU, but neither Quex nor JFlex seem to need such functionality, because (not 100% sure) Quex uses UTF32 und JFlex uses UCS2.

@Kevin
I am not using any normalization (currently), thus overlong sequences looking like an i wil not be treated as an i. When there are invalid codepoints I usually do not care a lot, I just skip them, yes, garbage in, garbage out. But this problem is not about UTF-8 parsing, I am currently trying to implement it like zwirbeltier, I hope that will work.

@Tim R
No.

14. The User Says:

I have got a solution for UTF-8 similiar to zwirbeltier’s code using templates such that I can reuse some code for each surrogate. Will commit it later…

15. The User Says:

I have commited it, well, variadic templates, not very nice, but the UTF-8 critical code is not duplicated, and over all it is a little bit shorter than the UTF-16 implementation. Thank you all.

16. The User Says:

@Tim R
No, there are so many possibilities, a QMap could not help.

XHTML: Use <blockquote cite="name"> for quotations, <pre lang="text    ∨ cpp-qt ∨ cpp ∨ bash ∨ other language"> for code, [latex] for formulas and <em> for em. Contact me if the comment does not get published, it may have accidentally been marked as spam.

Anti-Spam Quiz: