RFC: Regular Expressions in KDev-PG-Qt

Hi folks!

Now I want to tell you about the progress in KDevelop-PG-Qt and I am interested in comments about the syntax of regular expressions/lexer-declarations, I recently implemented.

But first of all for those who do not know (probably most Planet KDE readers): What is KDev-PG-Qt (KDevelop-PG-Qt)?
“KDev-PG-Qt” is the shortform for “KDevelop-PG-Qt”, “KDevelop-PG-Qt” is the longform for “KDev-PG-Qt”, KDevelop is our favourite IDE, Qt is our favourite C++-framework, and “PG” means “Parser Generator”. So, it is KDE’s parser-generator (like Bison or AntLR) written some time ago for KDevelop, long time ago it was called “KDevelop-PG” and it used STL-strings and -streams, but then apaku (iirc) ported it to Qt, and KDevelop-PG is dead nowadays, but KDevelop-PG-Qt is still alive and used for some KDevelop-language-plugins. If you are interested, you can find an introduction here.

What am I currently doing? When writing a parser you usually divide it into two components: the scanner/lexer/lexical analysis and the parser/syntactic analysis, because it is quite comfortable to add an abstraction between single characters and the syntax: tokens. KDevelop-PG-Qt currently does not provide lexer-generation, it is used in combination with handwritten lexers or lexers generated by flex. Writing lexers manually is usually not very complicate, but it is not the optimal way, and those lexers are usually relatively slow. Flex with Unicode is really a painfull experience. There is also quex, a nice lexer-generator, supporting Unicode and a lot of extra-features, but it uses a modified LGPL-version “excluding military usage”, whatever this condition may mean, it is problematic. I do not have a problem with armies using my lexer-generator, and writing an own lexer-generator makes integration with KDev-PG-Qt context-free grammars easier.

Current state:

  • Algorithms work very well, I can construct the union, intersection, complement, difference, concatenation etc. of regular languages, optimal automatons get generated
  • Code-generation for automatons works, too, the generated lexer is easily integratable with the parser
  • A weird (?) input-syntax for the declaration of the tokens is implemented
  • Regular expressions are usable, but I am not using the Perl-syntax

Todo:

  • There are some compiler-warnings indicating bugs in the algorithms
  • Unicode-character-classes are a must-have feature for a Unicode-aware lexer-generator
  • Escape-sequences are incomplete, 0-9 and foo{1,3} syntax not yet supported
  • Extra-features like word/line-boundaries, inheritance, actions in state-transitions etc.

Well, now I want to hear your opinion about the regexp-syntax: The lexer-declaration should support the same syntactic elements as the parser-declaration (wherever it makes sense), so there are the postfix-operators * (Kleene star, like in Perl), + (like in Perl), the binary-operators @ (e.g. “bla”@”,” accepts comma-separated lists of blas), & (intersection), | (union alternative, like in Perl), ^ (difference, like in Perl, but not only for characters inside [ ], but for any regular expressions), the prefix-operators ? (optinality, like in Perl, but prefix, like in the parser) and ~ (negation, complement), and the “empty operator” for concatenation. You usually have to quote words, but for simple alphanumeric strings it works without quotation marks. Unquoted/unescaped whitespace will be ignored, that is maybe unfamiliar, but it allows more structuring of the regular expressions. The empty operator has the highest priority. Now a syntactic element which may be interesting, weird or whatever: The brackets ([ ]) are not actually a notation for character-sets, but they change the meaning of the empty operator to the union (|). That sounds really weird, but it has some benefits, e.g. it is possible to work with multi-byte-characters. Everything between normal parentheses will be in normal mode (empty operator means concatenation), that makes ugly expressions possible, but it is also very flexible, e.g. imagine “ab” should be a digit, you could simply write [0123456789"ab"] for digits, but normally everything looks like normal Perl-syntax, but it would not be [^ab], but [.^ab], and you have to escape whitespace. I am not sure about that issue. It is currently implemented here.

About the general syntax of lexer-declarations I am even more unsure, but let us discuss this later, I have already written too much.

The User

PS:
Omitting quotation marks only works for characters inside of parentheses or brackets, otherwise they could be confused with the symbolclass (name of the token).

PPS:
An example, it is currently working with my KDev-PG-Qt version: foolisp, it uses S-expressions, the keywords “foo”, “bar” and “baz”, string-literals, integer-literals and identifiers:

%token_stream Lexer ;
%token LPAREN, RPAREN, FOO, BAR, BAZ, NUMBER, SHEBANG, STRING, IDENTIFIER ;
%input_encoding "ascii"
%table_lexer
%input_stream "KDevPG::QByteArrayIterator"
 
%lexer ->
  "("                           LPAREN ;
  ")"                           RPAREN ;
  "foo"                         FOO ;
  "bar"                         BAR ;
  "baz"                         BAZ ;
  "0"|"1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9" -> digit ; -- Should be predefined wit unicode character classes
  "0" | ( {digit}^"0" ) {digit}*    NUMBER ;
--   [{ASCII}&{LETTER}\-{digit}]+ IDENTIFIER ; -- That's how it shold look like
  [abc\-][abc\-{digit}]*        IDENTIFIER ; -- That's how it is currently working
  "#!foolisp"                   SHEBANG ;
  [\ \n\t\f]+                   [: /* blank */ :] ;
  \" [.(\\.)^\"]* \"            STRING ;
  ;
 
   ?SHEBANG sexp=sexp -> start ;
 
   LPAREN #sexp=sexp* RPAREN
 | foo=FOO | bar=BAR | baz=BAZ | number=NUMBER | string=STRING | identifier=IDENTIFIER
-> sexp ;

I hope that clarifies the regular expressions for the lexer.

2 Responses to “RFC: Regular Expressions in KDev-PG-Qt”

  1. Svempa Says:

    Interesting. I take it this is a “new” regex syntax, then (not Perl, not sed’s –regexp-extended or any other existing one)?

  2. The User Says:

    Well, I tried to be similar to Perl and Flex. See the example, I hope that clarifies it.

Leave a Reply

XHTML: Use <blockquote cite="name"> for quotations, <pre lang="text    ∨ cpp-qt ∨ cpp ∨ bash ∨ other language"> for code, [latex] for formulas and <em> for em. Contact me if the comment does not get published, it may have accidentally been marked as spam.

Anti-Spam Quiz: