Hi guys and non-guys,
apparently I had no motivation for blogging in the last few months, however…
Currently (once in a while) I am porting the KDevelop-PG-Qt grammar from Flex+Bison to KDevelop-PG—Qt, a parser generator should trust itself and I hope it will increase the quality of the code (and of course it will allow using Unicode).
Long time ago I implemented specialised parsing for (arithmetic) expressions: the recursive descent approach is not very good in those cases, you have to care about not using left-recursion and you will get very, very deep syntax trees, taking a lot of memory for simple expressions like “0” etc. thus increasing memory usage significantly (e. g. in KDevelop’s PHP-plugin). What was the approach? You can specify postfix, prefix, binary and ternary operators and parentheses. For each operator you have to specify the type of the tokens which should be interpreted as operator and a priority (higher≙stronger binding), additionally, binary and ternary operators have an associativity. I implemented it using a shunting yard-like approach (actually without knowing that termat that time).
However, when porting the grammar of KDevelop-PG-Qt I noticed that this approach did not suffice, but of course I wanted to use the specialised expression parsing for regular expressions etc.: I needed an “empty binary operator” for the usual concatenations in regular expressions (you can just write ab and do not have to use an operator to join them, e.g. a.b). It had been on my TODO-list for long time, and now I had to implement it. Well, when you have to change something, of course you want to choose a more general approach, thus instead of adding a special syntax for “concat-operators” or something like that—I tinkered with the idea—I allowed using arbitrary expressions as operator (already tinkered with that idea long time ago): The parser will parse whatever syntactic unit specified as operator instead of eating a single token. The interesting observation I want to tell about–let us start with an example: With this new feature we could implement Knuth’s hyperoperator: 3↑3, 3↑↑3, 3↑↑↑3 etc., the operator would be specified by the expression “UP_ARROW+” or something like that. Interestingly this is kinda a ternary operator: we can specify an arbitrary integer for the number of arrows, and the number is encoded in unary representation. Let us choose a more comfortable notation: 33 for twelve arrows, or why not 3[1+3*2]3? In terms of the classical ternary operator: “cond ? opt1 : opt2“ is a binary expression, where “? opt1 :” is the operator. Now it is obviously a normal ternary operator taking the arithmetic expressions as parameters, but we can interprete it as binary operator where the operator symbol itself is an expression surrounded by brackets. In fact we can handle any n-ary infix operators that way as binary infix operators, n-ary prefix operators as unary prefix operators and n-ary postfix operators as unary postfix operators, since for precedence the inner arguments do not matter, and the expression parsing problem gets reduced to these three basic cases and parentheses—of course excluding certain kinds of clashes between the symbols used (colon for both a ternary (cond? opt1 : opt2) and binary (x:y) operators etc.) where probably all usual linear time parsing algorithms will fail and which are unsuitable for programming languages. This allows a great simplification of the parsing algorithm, I had a lot of code dedicated to handle ternary operators, when composing the syntax tree the code had to look for them explicitly (of course now it is obvious that this was unnecessary). For now I have removed the support for ternary operators, later I might add the syntactic sugar to implement such operators easily without caring about these technique, that for I have to allow custom fields in the AST for specific operators, but the shunting yard routine will not have to be aware of this—I will stop with those details, I hope you got the interesting point how shunting yard can be used in a very general way and how expression parsing gets simplified to the few basic cases found in most programming languages by allowing everything to be an “operator”.