Archive for the ‘KDevelop-PG-Qt’ Category

KDevelop-PG-Qt: Generalised Expression Parsing

Sunday, February 26th, 2012

Hi guys and non-guys,

apparently I had no motivation for blogging in the last few months, however…

Currently (once in a while) I am porting KDevelop-PG-Qt's own grammar from Flex+Bison to KDevelop-PG-Qt: a parser generator should trust itself, and I hope it will increase the quality of the code (and of course it will allow using Unicode). Some time ago I implemented specialised parsing for (arithmetic) expressions: the recursive descent approach is not very good in those cases, you have to take care not to use left-recursion, and you get very, very deep syntax trees, taking a lot of memory for simple expressions like “0” and thus increasing memory usage significantly (e.g. in KDevelop’s PHP plugin). What was the approach? You can specify postfix, prefix, binary and ternary operators and parentheses. For each operator you have to specify the type of the tokens which should be interpreted as that operator and a priority (higher ≙ stronger binding); binary and ternary operators additionally have an associativity. I implemented it using a shunting yard-like approach (actually without knowing that term at that time).
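The shunting yard idea can be sketched as follows. This is not KDevelop-PG-Qt's generated code, just a minimal illustration with an invented operator table and node type, showing how a priority plus an associativity flag keeps trees shallow without left-recursion:

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// A tiny AST node: leaves hold operands, inner nodes hold operators.
struct Node {
    std::string value;
    std::unique_ptr<Node> left, right; // null for leaves
};

struct OpInfo { int priority; bool rightAssoc; };

// Hypothetical operator table: higher priority = stronger binding.
static const std::map<std::string, OpInfo> ops = {
    {"+", {1, false}}, {"-", {1, false}},
    {"*", {2, false}}, {"^", {3, true}}, // '^' binds strongest, right-assoc
};

std::unique_ptr<Node> parse(const std::vector<std::string>& tokens) {
    std::vector<std::unique_ptr<Node>> operands;
    std::vector<std::string> opStack;
    auto reduce = [&] { // pop one operator and build a node from it
        auto n = std::make_unique<Node>();
        n->value = opStack.back(); opStack.pop_back();
        n->right = std::move(operands.back()); operands.pop_back();
        n->left = std::move(operands.back()); operands.pop_back();
        operands.push_back(std::move(n));
    };
    for (const auto& tok : tokens) {
        auto it = ops.find(tok);
        if (it == ops.end()) { // operand: a single leaf, no recursion needed
            operands.push_back(std::make_unique<Node>(Node{tok, nullptr, nullptr}));
            continue;
        }
        // Pop operators that bind at least as strongly (respecting associativity).
        while (!opStack.empty()) {
            const OpInfo& top = ops.at(opStack.back());
            if (top.priority > it->second.priority ||
                (top.priority == it->second.priority && !it->second.rightAssoc))
                reduce();
            else
                break;
        }
        opStack.push_back(tok);
    }
    while (!opStack.empty()) reduce();
    return std::move(operands.back());
}
```

For the input 1 + 2 * 3 this yields a tree with + at the root and * below it, and a lone literal like “0” stays a single leaf instead of a chain of rule nodes — exactly the memory saving mentioned above.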

KDevelop-PG-Qt 1.0 Beta

Tuesday, October 4th, 2011

Today KDevelop-PG-Qt 1.0 Beta (aka 0.9.82) got released, the parser generator used by several KDevelop (and Quanta) language plugins. There are some new features, and various bugs reported in the months since 0.9.5 are now fixed.

New features

Most effort has been spent on implementing lexer generation. You can now write a simple specification of the lexical structure using regular expressions; the generated token stream class, which produces tokens from the input data, can be used directly by the generated parser, but you can also create a lexer without a parser. Thus, despite the name, KDevelop-PG-Qt is no longer just a parser generator. My motivation for writing the lexer generator was the lack of decent Unicode support in most lexers. Quex is quite good, but it is not free software (despite its LGPL-based licence) because it excludes military usage. The lexer can not only read different encodings, but also use different encodings for internal processing, e.g. it can either convert a UTF-8 stream to UTF-32 first or operate directly on UTF-8 bytes. A simple example:

%lexer ->
for FOR ;
"some keyword" SOME_KEYWORD ;
{alphabetic}* WORD ;
;

It also includes limited support for look-ahead (full support like in Flex would make the generated code more complicated and inefficient), and a similar feature called “barriers”, in my experience more useful for modern languages, which will be explained in the documentation soon.

The API used by the generated files has been cleaned up: some method names were very counterintuitive, some methods were not used at all. Because of this break you have to update KDevelop-PG-Qt to build the latest versions of the PHP, QMake, SGML and CSS plugins (you may have to do a clean build).

There are also some minor new features: AST structs now get forward declared, reducing dependencies; code which relied on full definitions in some places was fixed some time ago. The token types are no longer declared by the parser. Additionally, you no longer need “;;” in the grammar files, though it is still supported for compatibility.

Bug fixes

The bug fixes include proper line numbering. With GCC you can now see correct line numbers referring to the location of wrong code in the grammar file; for other compilers you may have to activate --use-line-directive, which uses the #line directive instead of the GCC-specific syntax, but then you will not see the location in the generated code. The CMake macro will do that automatically. Some compatibility errors reported by Fedora and Debian packagers have been fixed, special thanks to them. KDevelop-PG-Qt now builds with QT_STRICT_ITERATORS and also builds under Windows with MSVC (MSVC is not perfect for C++11 and Flex is not perfect for Windows). Annoying wrongly formatted error messages in the lexer and parser of KDevelop-PG-Qt have been fixed, too.

Some further bug fixes might be necessary, hence this beta version first. There is some work to do now: the documentation and the Kate grammar file for proper syntax highlighting have to be updated.

C++11

This is the second release using C++11 features, especially the auto keyword and STL hashes (unordered_map and unordered_set). I had used variadic templates before to construct UTF-8 automata, but for compatibility with MSVC they have been replaced with a Ruby script generating some code; the variadic templates had been quite ugly anyway.

Future development of KDevelop-PG-Qt

• Parser and lexer should be rewritten using KDevelop-PG-Qt
• Long requested: there should be a way to mark conflicts as resolved; for that, the conflict reporting has to be refactored a bit
• The next release will hopefully support LL(k) parsing, making some stupid look-aheads obsolete
• Cleaning up further code

Special thanks to Milian Wolff for all his patience, bug reports and support with the release.

I will post the link to the tarball as soon as it is uploaded; for now there is only the tag. Update: the tarball can be found here.

Unexpected Bottleneck: vector&lt;bool&gt;

Tuesday, August 23rd, 2011

Hi folks!

The last paragraph is actually the interesting one for every C++ programmer; the lead-in might not be understandable without context.

The new lexer component of KDevelop-PG-Qt is not very fast; it can take several minutes with complex grammars, especially when using UTF-8 as the internal encoding for the automata. Of course I first want to finish the most important features, and to that end I am writing an example lexer for PHP to check its capabilities and spot bugs and missing features, but now I was curious and asked Valgrind/Callgrind for an analysis. I expected a lot of useless copying/assignments, because I have not done any optimisation yet and tried to be “defensive” by copying too much. Indeed: for my first example about 78% of the runtime was spent on copying. But then I tried a full (as far as it is finished) PHP example with UTF-8, and it was a real surprise:

KCacheGrind Overview

It spends most of the time comparing instances of std::vector&lt;bool&gt;! For the powerset construction I have to maintain subsets of the NFA's states (the states reachable so far), and after that I map those subsets to new unique IDs for the states of the DFA. In both cases I wanted to use a map with vector&lt;bool&gt; (a subset of states) as key_type. I could not decide between map and unordered_map; I used map, but there is also some code to make it possible to use QBitArray with unordered_map, just in case. However: I expected the minimisation to be a bottleneck (disclaimer: I stopped Valgrind after some time, maybe it becomes a bottleneck later), but for now the comparisons used in the map implementation are crucial. I was not very economical with those map operations — I could have done the mapping while computing the necessary subsets. It has to be optimised, and I may use unordered_map and unordered_set (hashes) instead of map and set, but that is not the reason why I am writing this blog post.

The interesting thing about it is the usage of std::_Bit_reference. std::vector&lt;bool&gt; has a specialised implementation: the booleans get packed, so it only needs size()/8 + O(1) bytes when it is full. There is a special hashing implementation directly using the integers that store the bits, but only a generic comparison implementation. Thus comparison iterates over all the bits instead of comparing whole chunks (32 or 64 bit, depending on the CPU). Of course that is much slower; keep in mind that each iteration requires some bit manipulation. Especially in my case of a red-black tree, a vector gets compared to very similar ones next to it in the tree, and then comparison takes an especially long time. Be aware of vector&lt;bool&gt;: it is often awesome (saves a lot of memory), but something like this slow comparison may happen, and always keep in mind that the index operator might be slower than for vector&lt;char&gt;. Unfortunately there seems to be no API for accessing the integer chunks. By the way, QBitArray does not support comparison at all. I have filed a bug report with GCC/libstdc++.
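A sketch of the workaround hinted at above: store the bits in machine-word chunks yourself, so that equality and ordering compare whole 64-bit words. The Bitset class here is a hypothetical stand-in, not KDevelop-PG-Qt code:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Word-chunked bit set: comparisons run word by word instead of bit by bit,
// unlike the generic lexicographic compare std::map gets for vector<bool>.
class Bitset {
public:
    explicit Bitset(std::size_t bits)
        : m_words((bits + 63) / 64, 0) {}
    void set(std::size_t i)        { m_words[i / 64] |= std::uint64_t(1) << (i % 64); }
    bool test(std::size_t i) const { return (m_words[i / 64] >> (i % 64)) & 1; }
    // One 64-bit compare handles 64 set members at once.
    bool operator==(const Bitset& o) const { return m_words == o.m_words; }
    bool operator<(const Bitset& o) const  { return m_words < o.m_words; }
private:
    std::vector<std::uint64_t> m_words;
};
```

The ordering is not the same as vector&lt;bool&gt;'s bit-lexicographic order, but as a map key any consistent strict ordering suffices, so that does not matter here.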

Graphical KDevelop-PG-Qt Output

Tuesday, April 26th, 2011

KDevelop-PG-Qt is really boring, it just generates some boring C++ code. Well, I have not implemented QML-based animations or multitouch gestures for KDevelop-PG-Qt, but now you can get .dot output: graphs which can be visualised using GraphViz, e.g. dot on the command line or KGraphViewer. That way you can visualise the finite-state machines used for generating the lexer. I guess everybody knows what this is:

Overview of the DFA

You have not got it? Let us zoom in:

Rectangular view

Square view

You can also download the .dot file and browse it using KGraphViewer, or browse the .svg file (generated using dot) with Gwenview or whatever. What is this automaton about? It is actually quite simple: the automaton reads single bytes from UTF-8-encoded input and recognises whether the input represents a single alphabetic character (e.g. A, ä, α or whatever). That is quite complicated, because there are many ranges in Unicode representing alphabetic characters, and the UTF-8 encoding makes it even more complicated. This DFA is an optimal (minimum number of states) Moore automaton (previous versions used Mealy for output but Moore for optimisation, which was stupid), and it needs 206 states; it is really minimal, no heuristics. You are right: Unicode is really complicated. Unfortunately it took 65 seconds to generate the lexer for this file:

%token_stream Lexer ; -- necessary name

%input_encoding "utf8" -- encoding used by the deterministic finite automaton

%token ALPHABETIC ; -- a token

%lexer ->
{alphabetic} ALPHABETIC ; -- a lexer-rule
;

I think automatons like {alphabetic} should be cached (currently unimplemented), because they are responsible for most of the runtime. The implementation of the .dot output is still both buggy and hackish; that has to change. But the graphs look nice and they help with spotting errors in the lexer.

KDevelop-PG-Qt lexer-branch merged into master

Saturday, April 23rd, 2011

Hi folks!

I have finally merged my private lexer branch of KDevelop-PG-Qt into the master branch and pushed it to git.kde.org – after having run the KDevelop PHP plugin unit tests ;). Well, what are the changes?

• There is now support for lexer generation, including Unicode support (UTF-8, UCS-2, UTF-16, UTF-32/UCS-4). I have talked about those features in some previous blog posts.
• You do not have to use “;;” any longer, just “;” will work, but for compatibility “;;” will be like a single semicolon.
• Fixed a very old overflow when reading the grammar from an input stream
• With --error-aware-code (the default) error messages now work correctly: when you are compiling the generated file and there is an error in your hand-written code, you get correct references to the corresponding line in the grammar file; unfortunately kdev-pg-qt’s own error messages are still not perfect (wrong column numbers etc.)
• Added some command line options, marginal changes in their behaviour
• Changed some messages (spelling etc.)

Maybe, some day, there will be KDev-PG-Qt 1.0 (currently it is called 0.9.5 ;)), so, what is still to do?

• Make it possible to control what should happen when the lexer fails (it should not just be exit(-1))
• Update the documentation at Techbase
• Update the Kate-syntax-file
• Make new command-line-options available in the CMake-macro
• Find out how to enable C++0x support in CMake while allowing different compilers (yes, I am using variadic templates ;))
• Tokens should no longer be declared inside the parser-class (of course it should still be possible for compatibility)
• Forward-declarations for AST-classes
• Integration with QIODevice/KIO (optional)
• Dumping automata (optional)

Of course there could be more nice stuff, but that would not be related to the recent changes, e.g. LL(k)/LL(*)-parsing, look-ahead, possibility to switch rules off, packed AST-nodes, making it usable as a library etc.

I hope there are really no regressions, and the changes will be useful for the small elite of KDevelop-PG-Qt users.

Algorithmic ideas needed for Unicode

Monday, March 21st, 2011

Hi!

KDevelop-PG-Qt development is continuing slowly, and I need some creative ideas. You do not have to know anything about KDev-PG-Qt or lexers; it is an algorithmic issue about Unicode, and I am failing at arranging my thoughts in a way that could be transformed into (maintainable, understandable) code.

The problem is quite simple: given a range of Unicode characters (e.g. “from a to z”, a-z, or 0xef-0x19fe3), I want to represent this range in UTF-8 or UTF-16 encoding, using 8-bit or 16-bit ranges respectively (e.g. 0x14-0x85 is an 8-bit range). Any ASCII range (like a-z) stays the same encoded in UTF-8 or UTF-16; for a more sophisticated example you should know how UTF-16 (the encoding used by QString) works:

• Any UTF-32 codepoint (character) between 0x0 and 0xd800 (including 0x0, excluding 0xd800) or between 0xe000 and 0x10000 is simply converted into a 16-bit integer; it does not get changed.
• For any larger codepoint subtract 0x10000; now you have a 20-bit number. Split it into two 10-bit numbers, add 0xd800 to the first one and 0xdc00 to the second one. Now you have two 16-bit numbers, the high surrogate and the low surrogate; together they represent the character in UTF-16.
• Any surrogates (between 0xd800 and 0xe000) are invalid as codepoints.
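The rules above as a small sketch (the function name is made up for illustration):

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Encode a codepoint as UTF-16 following the rules above.
// Returns one or two 16-bit units; the second is 0 when unused.
std::pair<std::uint16_t, std::uint16_t> toUtf16(std::uint32_t cp) {
    assert(cp < 0x110000 && !(cp >= 0xd800 && cp < 0xe000)); // no surrogates
    if (cp < 0x10000)
        return {static_cast<std::uint16_t>(cp), 0};  // UCS-2: unchanged
    cp -= 0x10000;                                   // 20-bit number
    return {static_cast<std::uint16_t>(0xd800 + (cp >> 10)),    // high surrogate
            static_cast<std::uint16_t>(0xdc00 + (cp & 0x3ff))}; // low surrogate
}
```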

The example: 0xffef to 0x20000 should be transformed into UTF-16. The starting codepoint is within UCS-2 (smaller than 0x10000 and not a surrogate), the end is not within UCS-2, it has to be encoded with a surrogate pair. So in UTF-16 the range would be 0xffef to (0x40+0xd800, 0x0+0xdc00) = (0xd840, 0xdc00). But I want to have 16-bit ranges; a range from a single 16-bit number to a pair of 16-bit numbers is simply nonsense. Thus the range has to be split into two parts:

• The UCS-2 part: 0xffef-0x10000
• The non-UCS-2 part: 0x10000-0x20000 = (0xd800, 0xdc00) to (0xd840, 0xdc00), resulting in (0xd800 – 0xd840) followed by (0xdc00 – 0xe000)

In that example there are exactly two parts, but that may be different: a range may have to be split because 0xd800-0xe000 is invalid, or the starting codepoint may be non-UCS-2; then there can be up to three ranges, e.g.:
We want to encode the range 0x1000f-0x200ff. 0x1000f is (0xd800, 0xdc0f) in UTF-16, 0x200ff is (0xd840, 0xdcff), so we need three ranges:

• The first few surrogate pairs with the same high-surrogate: 0xd800 followed by (0xdc0f – 0xe000)
• The majority of surrogate pairs in the middle, where every low-surrogate is allowed: (0xd801 – 0xd840) followed by (0xdc00 – 0xe000)
• The rest with the same high-surrogate: 0xd840 followed by (0xdc00 – 0xdcff)
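The splitting itself can be sketched like this, using half-open ranges (end codepoint excluded, matching the convention in the examples above); the names and the output representation are invented for this sketch:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

using Range = std::pair<std::uint32_t, std::uint32_t>; // half-open [first, second)
using Sequence = std::vector<Range>;                   // ranges to be read in order

// Split a half-open codepoint range [lo, hi) into sequences of 16-bit
// ranges. Codepoints in [lo, hi) are assumed not to touch the surrogate
// block 0xd800-0xe000.
std::vector<Sequence> splitUtf16(std::uint32_t lo, std::uint32_t hi) {
    std::vector<Sequence> out;
    if (lo >= hi) return out;
    if (lo < 0x10000) { // UCS-2 part: single 16-bit units
        out.push_back({{lo, std::min<std::uint32_t>(hi, 0x10000)}});
        lo = 0x10000;
        if (lo >= hi) return out;
    }
    // Surrogate pairs for the first and the last codepoint of the rest.
    std::uint32_t a = lo - 0x10000, b = hi - 1 - 0x10000;
    std::uint32_t hiA = 0xd800 + (a >> 10), loA = 0xdc00 + (a & 0x3ff);
    std::uint32_t hiB = 0xd800 + (b >> 10), loB = 0xdc00 + (b & 0x3ff);
    if (hiA == hiB) { // one high surrogate covers the whole rest
        out.push_back({{hiA, hiA + 1}, {loA, loB + 1}});
        return out;
    }
    std::uint32_t midLo = hiA, midHi = hiB + 1;
    if (loA != 0xdc00) { // partial block for the first high surrogate
        out.push_back({{hiA, hiA + 1}, {loA, 0xe000}});
        ++midLo;
    }
    Sequence last;
    if (loB != 0xdfff) { // partial block for the last high surrogate
        last = {{hiB, hiB + 1}, {0xdc00, loB + 1}};
        --midHi;
    }
    if (midLo < midHi) // full middle block: every low surrogate allowed
        out.push_back({{midLo, midHi}, {0xdc00, 0xe000}});
    if (!last.empty()) out.push_back(last);
    return out;
}
```

The partial blocks at both ends are only emitted when they are not full, so the two-part and three-part examples both fall out of the same code path.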

I hope it is clear which output format I would like to have: a set of sequences of 8/16-bit ranges (ranges may have only one element). Unfortunately there are a lot of cases. I have implemented the conversion (commit 1dbfb66ca66c392c6ffdaa3ff9d00d399370cf0e) for UTF-16 using a lot of if/else to handle all special cases, involving 130 lines of ugly code; well, that was not too bad, but now I need the same for UTF-8: up to four ranges in sequence, and many more special cases. 500 lines of unreadable, boring, unmaintainable code just for UTF-8? I do not like this idea. I need some creative ideas how to implement it in a more abstract way without handling all cases explicitly, and without brute force (checking all codepoints within the range separately). There is one simple idea: do not care about 0xd800-0xe000 and remove it at the end. But I currently have no idea how to simplify the rest; information about how many UTF-8 bytes are needed to represent the start and the end, and whether those bytes have the maximum or the minimum value, should definitely be used. Read this for a comprehensive description of the UTF-8 format; it is no more complicated than UTF-16. Consider it a contest, the winner will receive eternal glory. I hope somebody has a nice idea.
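One building block towards an abstract approach: first split the codepoint range by UTF-8 encoded length, since ranges of equal length can then be processed uniformly, byte position by byte position. A sketch, again with half-open ranges and an invented name:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// UTF-8 length boundaries: 1 byte below 0x80, 2 below 0x800,
// 3 below 0x10000, 4 below 0x110000.
std::vector<std::pair<std::uint32_t, std::uint32_t>>
splitByUtf8Length(std::uint32_t lo, std::uint32_t hi) {
    static const std::uint32_t bounds[] = {0x0, 0x80, 0x800, 0x10000, 0x110000};
    std::vector<std::pair<std::uint32_t, std::uint32_t>> parts;
    for (int len = 1; len <= 4; ++len) {
        std::uint32_t a = std::max(lo, bounds[len - 1]);
        std::uint32_t b = std::min(hi, bounds[len]);
        if (a < b) parts.push_back({a, b}); // all codepoints here need 'len' bytes
    }
    return parts;
}
```

After this step, each part could be handled recursively on the first byte: if the first bytes of the encoded endpoints are equal, recurse into the tail; otherwise emit a partial sequence for each endpoint plus a full middle block, analogous to the UTF-16 case above. This is just one idea, not a finished design.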

The KDev-PG-Qt Lexer-Generation-Algorithm

Saturday, February 19th, 2011

Notice: This post may require some background-information about lexers/regexp-parsing/automatons.

Today I want to describe the essential ideas for lexer generation in KDevelop-PG-Qt. Each lexer rule is described by a regular expression. The generated code has to determine which regular expressions match the following input characters and which one to choose. A common approach for choosing among multiple matching regular expressions is taking the longest match (eat characters greedily); if there are multiple longest matches, priorities from the definition are used. That approach is easy to implement and very appropriate for real-world languages, so that is not the interesting part.
The usual approach for detecting the regular expressions is a translation to finite-state machines. Each regular expression can easily be transformed into a nondeterministic finite automaton (NFA) accepting the words described by the regular expression, which can then be transformed into an equivalent deterministic finite automaton (DFA). For multiple tokens you can simply create an NFA accepting each of the expressions, while keeping track of which rules may match the current input using some additional states. Now the interesting part about KDev-PG-Qt’s implementation: the usual way to handle the transitions between the states of the finite-state machine is a table mapping the input character to the next state:

For each state, store: table (character → next state(s))

But KDev-PG-Qt should support Unicode; it should be possible to process even UTF-32 input, with more than a million possible input characters. What about memory usage and the size of the generated tables? You can imagine what would happen with a few hundred states. But it is possible to take advantage of a property of Unicode: similar characters are grouped in blocks, and usually tokens will consist of similar characters. So it is a good approach to save the transitions blockwise, e.g. A-F, X-Z, a-i and m-p; such representations are obviously much more efficient than tables. But KDev-PG-Qt should support both approaches, while the algorithms should always be the same (generic programming, yeah!), so I need some kind of abstraction: sets of characters which may be represented using either bits or blocks/sequences of characters. For that reason I have also chosen a different internal format for transitions:

For each state, store: table (next state → set of characters for this transition)

The algorithms now use set arithmetic (union, difference, intersection, complement) to work with any representation of character sets. For tables it can easily be implemented using bit operations; for the interval-based representation it is a bit more complicated, but possible in O(n), where n is the number of intervals in the sets you are currently dealing with. So it can run without using gibibytes of memory and within reasonable time for creating a Unicode-aware lexer.
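The interval representation and its O(n) union can be sketched like this; the types are stand-ins for the real character-set classes, not KDev-PG-Qt's code:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// A character set as sorted, disjoint, half-open intervals.
using Interval = std::pair<std::uint32_t, std::uint32_t>;
using CharSet = std::vector<Interval>;

// Union in O(n): merge the two sorted lists, coalescing overlaps.
CharSet setUnion(const CharSet& a, const CharSet& b) {
    CharSet out;
    std::size_t i = 0, j = 0;
    while (i < a.size() || j < b.size()) {
        // Take the interval that starts first.
        Interval next = (j == b.size() || (i < a.size() && a[i].first <= b[j].first))
                        ? a[i++] : b[j++];
        if (!out.empty() && next.first <= out.back().second) // overlap or touch
            out.back().second = std::max(out.back().second, next.second);
        else
            out.push_back(next);
    }
    return out;
}
```

Intersection and difference work the same way, with different rules for which parts of the overlapping intervals survive.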
However, not only should KDev-PG-Qt itself not require too much memory, the generated lexer should also be short and efficient. That is why there are different code-generation strategies for the different representations of character sets. For the tables you can simply generate tables or switch statements for the state transitions. But for the sequence-based approach you can do better: the sets of characters get translated into nested if-else statements, a kind of static binary search deciding in which interval the input character lies. So the lexer uses a very small number of comparisons to decide which state to choose next instead of using a large table, which makes the generated code even more readable (although that is not the purpose).
Everything is programmed using a lot of templates. Some of you might not like this approach, but it is really nice to be able to create one algorithm accepting different character encodings (ASCII, Latin-1, UTF-16, UTF-32, …) while supporting different internal storage representations and different output formats for the decisions.

Well, that is it, I hope this post is interesting for at least a few people.

The User

PS:
Of course minimization is also implemented, so you will not get useless states in the generated code.

RFC: Regular Expressions in KDev-PG-Qt

Thursday, February 17th, 2011

Hi folks!

Now I want to tell you about the progress in KDevelop-PG-Qt, and I am interested in comments about the syntax of the regular expressions/lexer declarations I recently implemented.

But first of all for those who do not know (probably most Planet KDE readers): What is KDev-PG-Qt (KDevelop-PG-Qt)?
“KDev-PG-Qt” is the short form of “KDevelop-PG-Qt”, “KDevelop-PG-Qt” is the long form of “KDev-PG-Qt”; KDevelop is our favourite IDE, Qt is our favourite C++ framework, and “PG” means “Parser Generator”. So, it is KDE’s parser generator (like Bison or AntLR), written some time ago for KDevelop. Long ago it was called “KDevelop-PG” and used STL strings and streams, but then apaku (iirc) ported it to Qt; KDevelop-PG is dead nowadays, but KDevelop-PG-Qt is still alive and used by some KDevelop language plugins. If you are interested, you can find an introduction here.

What am I currently doing? When writing a parser you usually divide it into two components: the scanner/lexer/lexical analysis and the parser/syntactic analysis, because it is quite comfortable to add an abstraction between single characters and the syntax: tokens. KDevelop-PG-Qt currently does not provide lexer generation; it is used in combination with hand-written lexers or lexers generated by Flex. Writing lexers manually is usually not very complicated, but it is not the optimal way, and those lexers are usually relatively slow. Flex with Unicode is a really painful experience. There is also Quex, a nice lexer generator supporting Unicode and a lot of extra features, but it uses a modified LGPL “excluding military usage”; whatever this condition may mean, it is problematic. I do not have a problem with armies using my lexer generator, and writing my own lexer generator makes integration with KDev-PG-Qt context-free grammars easier.

Current state:

• Algorithms work very well: I can construct the union, intersection, complement, difference, concatenation etc. of regular languages, and optimal automatons get generated
• Code generation for the automatons works, too; the generated lexer is easy to integrate with the parser
• A weird (?) input syntax for the declaration of the tokens is implemented
• Regular expressions are usable, but I am not using the Perl syntax

Todo:

• There are some compiler-warnings indicating bugs in the algorithms
• Unicode-character-classes are a must-have feature for a Unicode-aware lexer-generator
• Escape-sequences are incomplete, 0-9 and foo{1,3} syntax not yet supported
• Extra-features like word/line-boundaries, inheritance, actions in state-transitions etc.

Well, now I want to hear your opinion about the regexp syntax. The lexer declaration should support the same syntactic elements as the parser declaration (wherever it makes sense), so there are the postfix operators * (Kleene star, like in Perl) and + (like in Perl), the binary operators @ (e.g. “bla”@”,” accepts comma-separated lists of blas), & (intersection), | (union/alternative, like in Perl) and ^ (difference, like in Perl, but not only for characters inside [ ], but for any regular expressions), the prefix operators ? (optionality, like in Perl, but prefix, like in the parser) and ~ (negation, complement), and the “empty operator” for concatenation. You usually have to quote words, but for simple alphanumeric strings it works without quotation marks. Unquoted/unescaped whitespace is ignored; that is maybe unfamiliar, but it allows structuring the regular expressions. The empty operator has the highest priority. Now a syntactic element which may be interesting, weird or whatever: the brackets ([ ]) are not actually a notation for character sets, but they change the meaning of the empty operator to union (|). That sounds really weird, but it has some benefits, e.g. it makes it possible to work with multi-byte characters. Everything between normal parentheses is in normal mode (empty operator means concatenation). That makes ugly expressions possible, but it is also very flexible: imagine “ab” should be a digit, then you could simply write [0123456789"ab"] for digits. Normally everything looks like normal Perl syntax, but it would not be [^ab], it would be [.^ab], and you have to escape whitespace. I am not sure about that issue. It is currently implemented here.

About the general syntax of lexer-declarations I am even more unsure, but let us discuss this later, I have already written too much.

The User

PS:
Omitting quotation marks only works for characters inside parentheses or brackets; otherwise they could be confused with the symbol class (name of the token).

PPS:
An example that is currently working with my KDev-PG-Qt version: foolisp, which uses S-expressions, the keywords “foo”, “bar” and “baz”, string literals, integer literals and identifiers:

%token_stream Lexer ;
%token LPAREN, RPAREN, FOO, BAR, BAZ, NUMBER, SHEBANG, STRING, IDENTIFIER ;
%input_encoding "ascii"
%table_lexer
%input_stream "KDevPG::QByteArrayIterator"

%lexer ->
"("                           LPAREN ;
")"                           RPAREN ;
"foo"                         FOO ;
"bar"                         BAR ;
"baz"                         BAZ ;
"0"|"1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9" -> digit ; -- should be predefined with Unicode character classes
"0" | ( {digit}^"0" ) {digit}*    NUMBER ;
--   [{ASCII}&{LETTER}\-{digit}]+ IDENTIFIER ; -- That's how it should look
[abc\-][abc\-{digit}]*        IDENTIFIER ; -- That's how it is currently working
"#!foolisp"                   SHEBANG ;
[\ \n\t\f]+                   [: /* blank */ :] ;
\" [.(\\.)^\"]* \"            STRING ;
;

?SHEBANG sexp=sexp -> start ;

LPAREN #sexp=sexp* RPAREN
| foo=FOO | bar=BAR | baz=BAZ | number=NUMBER | string=STRING | identifier=IDENTIFIER
-> sexp ;

I hope that clarifies the regular expressions for the lexer.