Archive for the ‘Unicode’ Category

Switched to LuaLaTeX

Monday, November 28th, 2011


Have you ever tried to write a slightly complex command in LaTeX? I have on some occasions, and in the end it somehow worked, but it has always been ugly. However, there is LuaTeX/LuaLaTeX, which provides real scripting within your documents:

  for i=0, 15 do
    tex.print("Math: $x_{" .. i .. "}$")
  end

That is just awesome. In plain LaTeX this would be ugly, and it gets even uglier if you have to deal with floating-point numbers, external files etc. Well, for now I do not need any complex macros, so I cannot talk about actual experience with the Lua stuff, but I encountered some problems when converting my LaTeX document to LuaLaTeX two weeks ago.

unicode-math does not work properly

When enabling the unicode-math package to use Unicode characters in formulas (⊂∀∃∂ etc.), I have to select a font with the \setmathfont command, otherwise Unicode symbols will not be displayed. I tried “Latin Modern Math”, “STIXGeneral”, “XITS Math”, “Neo Euler” and “Asana Math”. However, with all of these fonts formulas do not look as good as with the standard LaTeX lmodern package, which is usable from LuaLaTeX too, but \setmathfont overrides it. Some of the fonts are too bold, some have ugly ℝ, ℚ, ℂ (\mathbb) and \mathcal symbols etc. Thus I decided to port the uniinput package provided by the Neo project (they created the keyboard layout I am using) to LuaLaTeX. I thought it would be a nice opportunity to try out the Lua capabilities, but then I faced the next problem.

Lua is not Unicode aware

That is really annoying: LuaLaTeX claims to offer a) sophisticated scripting and b) native Unicode support. However, they chose Lua as the scripting language, which does not support Unicode natively. I could not find the functions I needed to write a succinct macro for declaring a Unicode character in math mode (for example, ℝ should be replaced with \mathbb{R}): simply something to split an 8-bit UTF-8 string into its Unicode characters and to convert between character codes and strings. I did not want to write it myself. Thus I chose a quick and dirty way: using some regexp magic and a Ruby script to convert uniinput.sty into uniinput-lualatex.sty. It works now, you can use it if you want to…
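To illustrate what such a helper has to do, here is a minimal sketch in Python (not Lua, and not part of uniinput) that splits a UTF-8 byte string into code points by hand. The function name `utf8_to_codepoints` is made up for this example, and the sketch does not reject overlong encodings or surrogates.

```python
def utf8_to_codepoints(data: bytes) -> list:
    """Split a UTF-8 byte string into a list of Unicode code points."""
    codepoints = []
    i = 0
    while i < len(data):
        lead = data[i]
        # The lead byte tells how many bytes the character occupies.
        if lead < 0x80:                # 0xxxxxxx: plain ASCII
            length, cp = 1, lead
        elif lead >> 5 == 0b110:       # 110xxxxx: two bytes
            length, cp = 2, lead & 0x1F
        elif lead >> 4 == 0b1110:      # 1110xxxx: three bytes
            length, cp = 3, lead & 0x0F
        elif lead >> 3 == 0b11110:     # 11110xxx: four bytes
            length, cp = 4, lead & 0x07
        else:
            raise ValueError("invalid lead byte at offset %d" % i)
        if i + length > len(data):
            raise ValueError("truncated sequence at offset %d" % i)
        # Every following byte must look like 10xxxxxx and adds six bits.
        for byte in data[i + 1:i + length]:
            if byte >> 6 != 0b10:
                raise ValueError("invalid continuation byte")
            cp = (cp << 6) | (byte & 0x3F)
        codepoints.append(cp)
        i += length
    return codepoints


# Example call: the code points of "x ∈ ℝ", printed as hexadecimal numbers.
print([hex(cp) for cp in utf8_to_codepoints("x ∈ ℝ".encode("utf-8"))])
```

Converting a code point back into a string is the reverse operation; in Python that is simply `chr(cp)`.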

Making it work with KileIP

KileIP currently has the latex command hard-coded for displaying previews of formulas. I was too lazy to fix that, and I wanted to be able to fall back to LaTeX in case of unexpected problems, so I made my document work with both LaTeX and LuaLaTeX:


Well, next time I need a complex macro I will certainly use Lua and it will hopefully work with my setup. :)

Graphical KDevelop-PG-Qt Output

Tuesday, April 26th, 2011

KDevelop-PG-Qt is really boring, it just generates some boring C++ code. Well, I have not implemented QML-based animations or multitouch gestures for KDevelop-PG-Qt, but now you can get .dot output: graphs which can be visualized using GraphViz, e.g. with dot on the command line or with KGraphViewer. That way you can visualize the finite-state machines used for generating the lexer. I guess everybody knows what this is:

[Figure: Overview of the UTF-8 DFA]

You have not got it? Let us zoom in:
[Figure: Rectangular view (excerpt of the UTF-8 DFA)]

[Figure: Square view (excerpt of the UTF-8 DFA)]

You can also download the .dot file and browse it using KGraphViewer, or browse the .svg file (generated with dot) with Gwenview or whatever. What is this automaton about? It is actually quite simple: the automaton reads single bytes from UTF-8-encoded input and recognizes whether the input represents a single alphabetic character (e.g. A or some non-ASCII letter, whatever). That is quite complicated, because there are many ranges in Unicode representing alphabetic characters, and the UTF-8 encoding makes it more complicated. This DFA is an optimal (minimum number of states) Moore automaton (previous versions used Mealy for output but Moore for optimization, which was stupid), and it needs 206 states. It is really minimal, no heuristics. You are right: Unicode is really complicated. Unfortunately it took 65 seconds to generate the lexer for this file:

%token_stream Lexer ; -- necessary name
%input_encoding "utf8" -- encoding used by the deterministic finite automaton
%token ALPHABETIC ; -- a token
%lexer ->
  {alphabetic} ALPHABETIC ; -- a lexer-rule

I think automata like {alphabetic} should be cached (currently unimplemented), because they are responsible for most of the runtime. The implementation of the .dot output is still both buggy and hackish; that has to change. But the graphs look nice and they help with spotting errors in the lexer.

Algorithmic ideas needed for Unicode

Monday, March 21st, 2011


KDevelop-PG-Qt development is continuing slowly, and I need some creative ideas. You do not have to know anything about KDev-PG-Qt or lexers; it is an algorithmic issue about Unicode, and I am failing to arrange my thoughts in a way that could be transformed into (maintainable, understandable) code.

The problem is quite simple: given a range of Unicode characters (e.g. “from a to z”, a-z, or 0xef-0x19fe3), I want to represent this range in UTF-8 or UTF-16 encoding, using 8-bit or 16-bit ranges respectively (e.g. 0x14-0x85 is an 8-bit range). For example, any ASCII range (like a-z) stays the same when encoded in UTF-8 or UTF-16. For a more sophisticated example you should know how UTF-16 (the encoding used by QString) works:

  • Any UTF-32 codepoint (character) between 0x0 and 0xd800 (including 0x0, excluding 0xd800) or between 0xe000 and 0x10000 (including 0xe000, excluding 0x10000) is simply stored as a 16-bit integer; it does not get changed.
  • For any larger codepoint, subtract 0x10000. Now you have a 20-bit number; split it into two 10-bit numbers, add 0xd800 to the upper one and 0xdc00 to the lower one. Now you have two 16-bit numbers, the high surrogate and the low surrogate; together they represent the character in UTF-16.
  • Any surrogates (between 0xd800 and 0xe000) are invalid as codepoints.
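The three rules above can be written down directly. The following is a minimal Python sketch; the function name `encode_utf16` is made up for this illustration and has nothing to do with the KDevelop-PG-Qt sources.

```python
def encode_utf16(codepoint: int) -> tuple:
    """Return the UTF-16 code units (one or two 16-bit numbers) for a codepoint."""
    if not 0 <= codepoint < 0x110000:
        raise ValueError("not a Unicode codepoint")
    if 0xD800 <= codepoint < 0xE000:
        raise ValueError("surrogates are not valid codepoints")
    if codepoint < 0x10000:
        # Within UCS-2: the value is stored unchanged.
        return (codepoint,)
    offset = codepoint - 0x10000        # a 20-bit number
    high = 0xD800 + (offset >> 10)      # upper 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # lower 10 bits -> low surrogate
    return (high, low)


# Example calls: "a" stays a single unit, 0x1000f becomes a surrogate pair.
print([hex(u) for u in encode_utf16(0x61)])
print([hex(u) for u in encode_utf16(0x1000F)])
```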

The example: 0xffef to 0x20000 should be transformed into UTF-16. The starting codepoint is within UCS-2 (smaller than 0x10000 and not a surrogate); the end is not within UCS-2, it has to be encoded as a surrogate pair. So in UTF-16 the range would be 0xffef to (0x40+0xd800, 0x0+0xdc00) = (0xd840, 0xdc00). But I want to have 16-bit ranges, and a range from a single 16-bit number to a pair of 16-bit numbers is simply nonsense. Thus the range has to be split into two parts:

  • The UCS-2 part: 0xffef-0x10000
  • The non-UCS-2 part: 0x10000 = (0xd800, 0xdc00) to (0xd840, 0xdc00), resulting in (0xd800 – 0xd840) followed by (0xdc00 – 0xe000)

In that example there are exactly two parts, but that may be different: a range may have to be split because 0xd800-0xe000 are invalid, and the starting codepoint may be non-UCS-2 as well, then there can be up to three ranges, e.g.:
We want to encode the range 0x1000f-0x200ff. 0x1000f is (0xd800, 0xdc0f) in UTF-16, 0x200ff is (0xd840, 0xdcff), so we need three ranges:

  • The first few surrogate pairs with the same high surrogate: 0xd800 followed by (0xdc0f – 0xe000)
  • The majority of surrogate pairs in the middle, where every low surrogate is allowed: (0xd801 – 0xd840) followed by (0xdc00 – 0xe000)
  • The rest with the same high surrogate: 0xd840 followed by (0xdc00 – 0xdcff)
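The splitting described above can be sketched in a few lines of Python. This is only an illustration under some assumptions, not the code from the commit mentioned below: the name `utf16_ranges` is made up, and all ranges (input and output) are half-open, i.e. the upper bound is excluded, like "0xdc00 – 0xe000" in the lists above.

```python
def utf16_ranges(start: int, end: int) -> list:
    """Split the codepoint range [start, end) into sequences of 16-bit ranges.

    Each result entry is a list of one or two half-open (low, high) ranges:
    one range for UCS-2 characters, two (high and low surrogate) otherwise.
    """
    result = []
    # UCS-2 part: clip against the two valid blocks, which skips the surrogates.
    for block_start, block_end in ((0x0, 0xD800), (0xE000, 0x10000)):
        a, b = max(start, block_start), min(end, block_end)
        if a < b:
            result.append([(a, b)])
    # Non-UCS-2 part: work on the 20-bit offsets of first and last codepoint.
    a, b = max(start, 0x10000), min(end, 0x110000)
    if a >= b:
        return result
    first, last = a - 0x10000, b - 1 - 0x10000
    hi_first, lo_first = first >> 10, first & 0x3FF
    hi_last, lo_last = last >> 10, last & 0x3FF
    H, L = 0xD800, 0xDC00
    if hi_first == hi_last:
        # Everything shares one high surrogate.
        result.append([(H + hi_first, H + hi_first + 1),
                       (L + lo_first, L + lo_last + 1)])
        return result
    if lo_first != 0:
        # First high surrogate is only partially covered.
        result.append([(H + hi_first, H + hi_first + 1), (L + lo_first, 0xE000)])
        hi_first += 1
    tail = None
    if lo_last != 0x3FF:
        # Last high surrogate is only partially covered.
        tail = [(H + hi_last, H + hi_last + 1), (L, L + lo_last + 1)]
        hi_last -= 1
    if hi_first <= hi_last:
        # Fully covered high surrogates: every low surrogate is allowed.
        result.append([(H + hi_first, H + hi_last + 1), (L, 0xE000)])
    if tail is not None:
        result.append(tail)
    return result


# Example call for the second example range.
for sequence in utf16_ranges(0x1000F, 0x200FF):
    print([(hex(low), hex(high)) for low, high in sequence])
```

Clipping against the two valid UCS-2 blocks first means the surrogate gap never shows up as a special case later on.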

I hope it is clear which output format I would like to have: a set of sequences of 8/16-bit ranges (a range may have only one element). Unfortunately there are a lot of cases. I have implemented the conversion for UTF-16 (commit 1dbfb66ca66c392c6ffdaa3ff9d00d399370cf0e) using a lot of if/else to handle all special cases, resulting in 130 lines of ugly code. Well, that was not too bad, but now I need the same for UTF-8: up to four ranges in sequence and many more special cases. 500 lines of unreadable, boring, unmaintainable code just for UTF-8? I do not like this idea.

I need some creative ideas for implementing it in a more abstract way, without handling all cases explicitly and without brute force (checking all codepoints within the range separately). There is one simple idea: do not care about 0xd800-0xe000 and remove it at the end. But I currently have no idea how to simplify the rest. Information about how many UTF-8 bytes are needed to represent the start and the end, and whether those bytes have their maximum or minimum value, should definitely be used. Read this for a comprehensive description of the UTF-8 format, it is not more complicated than UTF-16. Consider it a contest, the winner will receive eternal glory. I hope somebody has a nice idea. :D
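One possible way to avoid the explosion of special cases is a short recursion. The Python sketch below illustrates the idea only and is not what KDevelop-PG-Qt implements. The names `utf8_ranges` and `_split` are made up; ranges are half-open as above; and it leans on Python's built-in encoder to obtain the bytes of the two end points.

The idea: first cut the range into pieces with a fixed encoded length (leaving out the surrogates). Then keep cutting a piece until, wherever start and end differ in some byte, all following bytes are at their minimum for the start and at their maximum for the end. Such a piece is exactly the product of its per-byte ranges.

```python
def utf8_ranges(start: int, end: int) -> list:
    """Split the codepoint range [start, end) into sequences of byte ranges.

    Each result entry is a list of one to four half-open (low, high) ranges.
    """
    result = []
    # Pieces with a fixed encoded length; 0xd800-0xdfff is simply left out.
    for lo, hi in ((0x0, 0x7F), (0x80, 0x7FF), (0x800, 0xD7FF),
                   (0xE000, 0xFFFF), (0x10000, 0x10FFFF)):
        a, b = max(start, lo), min(end - 1, hi)
        if a <= b:
            _split(a, b, result)
    return result


def _split(a: int, b: int, result: list) -> None:
    """Handle the inclusive range a..b, both ends having the same UTF-8 length."""
    length = len(chr(a).encode("utf-8"))
    for i in range(1, length):
        mask = (1 << (6 * i)) - 1   # payload bits of the last i continuation bytes
        if a & ~mask != b & ~mask:  # the bytes in front of them differ
            if a & mask != 0:
                # Start is not at the minimum: cut off the rest of its block.
                _split(a, a | mask, result)
                _split((a | mask) + 1, b, result)
                return
            if b & mask != mask:
                # End is not at the maximum: cut off the beginning of its block.
                _split(a, (b & ~mask) - 1, result)
                _split(b & ~mask, b, result)
                return
    # Now the piece is a plain product of per-byte ranges.
    first, last = chr(a).encode("utf-8"), chr(b).encode("utf-8")
    result.append([(x, y + 1) for x, y in zip(first, last)])


# Example call for the range 0xef-0x19fe3 mentioned at the beginning.
for sequence in utf8_ranges(0xEF, 0x19FE3):
    print([(hex(low), hex(high)) for low, high in sequence])
```

The same recursion should work for UTF-16 with 10-bit instead of 6-bit groups, because nothing in it depends on how many bytes there are.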

What is special about the number 36?

Saturday, March 5th, 2011

It is the smallest natural number which is not an element of the list of special numbers at the German Wikipedia. By the way: the smallest special number seems to be -2, while -0.5 seems to be the smallest value of a Unicode digit (༳, U+0F33, TIBETAN DIGIT HALF ZERO).
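The Unicode part can be checked with Python's unicodedata module. This is a small sketch; the helper name `smallest_numeric_value` is made up, and the result of the full scan depends on the Unicode version shipped with the Python interpreter.

```python
import unicodedata

half_zero = "\u0f33"
half_zero_name = unicodedata.name(half_zero)      # official character name
half_zero_value = unicodedata.numeric(half_zero)  # numeric value as a float


def smallest_numeric_value() -> tuple:
    """Scan all codepoints and return (value, codepoint) of the smallest numeric value."""
    candidates = []
    for cp in range(0x110000):
        value = unicodedata.numeric(chr(cp), None)  # None if the character has no value
        if value is not None:
            candidates.append((value, cp))
    return min(candidates)


print(half_zero_name, half_zero_value)
value, cp = smallest_numeric_value()
print(value, hex(cp))
```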