The KDev-PG-Qt Lexer-Generation-Algorithm

Notice: This post may require some background-information about lexers/regexp-parsing/automatons.

Today I want to describe the essential ideas for lexer-generation in KDevelop-PG-Qt. Each lexer-rule is described with a regular expression. The generated code has to determine which regular expressions match the following input-characters and which one to choose. A common approach for choosing one of multiple matching regular expressions is choosing the longest match (eat characters greedily), if there are multiple longest matches, priorities from the definition are used. That approach is easy to implement and very appropriate for real-world-languages, that is not the interesting part.
The usual approach for detecting the regular expressions is a translation to finite-state-machines. Each regular expression can be easily transformed into a nondeterministic-finite-automaton (NFA), accepting the words described by the regular expression, wich can be transformed into an equivalent deterministic-finite-automaton. For multiple tokens you can simply create a NFA accepting each of the expressions, while taking care about the rules which may match the current input using some additional states. Now the interesting part about KDev-PG-Qt’s implementation: The usual approach for handling the transitions between the states of the finite-state-machines is a table mapping the input character to the next state:

Foreach state store: table (character → next state(s))

But KDev-PG-Qt should support Unicode, it should be possible to process even UTF-32 input, more than a million possible inputs, memory usage, size of generated tables? You can imagine what would happen with a few hundred states. But it is possible to take advantage of a property of Unicode: similar characters are grouped in blocks, and usually tokens will consist of similar tokens. So it is a good approach to save the transitions blockwise, e.g. A-F, X-Z, a-i and m-p, such representations are obviously much more efficient than table-based approaches. But KDev-PG-Qt should support both approaches, but the used algorithms should always be the same (generic programming, yeah!), so I need some kind of abstraction: sets of characters which may be represented using either bits or blocks/sequences of characters. That for I have also choosen a different internal format for transitions:

Foreach state store: table (next state → set of characters for this transition)

The algorithms are now using set-based-arithmetics (union, difference, intersection, complement) to work with any representation of character-sets. For tables it can be easily implemented using bit-operations, for the interval-based representation it is a bit more complicated, but possible in Ο(n), where n is the number of intervals in the sets you are currently dealing with. So it can run without using Gibibytes of memory and within reasonable time for creating a Unicode-aware lexer.
However, not only KDev-PG-Qt should not require too much memory, but also the generated lexer should be short and efficient. That is why there should be differnt code-generation-strategies for the different representations of sets of characters. For the tables you can simply generate tables or switch-statements for the state-transitions. But for the sequence-based-approach you can do better: The sets of characters get translated into nested if-else-statements, some kind a static binary search, they can decide in which interval the character from the input is. So the lexer will use a very small number of comparisons to decide which state to choose next instead of using a large table, that makes the generated code even more readable (although that is not the purpose).
Everything is programmed using a lot of templates, some of you might not like this approach, but it is really nice to be able to create one algorithm accepting different character-encodings (ASCII, Latin1, UTF-16, UTF-32, …) supporting different internal storage representations and different output-formats for the decisions.

Well, that is it, I hope this post is interesting for at least a few people. ;)

The User

PS:
Of course minimization is also implemented, so you will not get useless states in the generated code.

Leave a Reply

XHTML: Use <blockquote cite="name"> for quotations, <pre lang="text    ∨ cpp-qt ∨ cpp ∨ bash ∨ other language"> for code, [latex] for formulas and <em> for em. Contact me if the comment does not get published, it may have accidentally been marked as spam.

Anti-Spam Quiz: