Archive for the ‘KDE’ Category

Do we want to be 1991 forever?

Monday, April 25th, 2011


KDE’s licensing policies do not allow GPLv3+, LGPLv3+ and AGPLv3+ software in KDE’s repositiories (I guess it is for git, too, not only for SVN). But do we really want to keep that policy? There are more and more web-applications, ugly cloud-stuff, “software as a service”, is growing, and developers want to protect their Free Software by using the GNU Affero General Public Licens e. KDE is going to be adapted on embedded devices, and we should not care about tivoization? We should only use a 1991 license not targeting a lot of important issues of our times? Why should a KDE-application not be relicensed under the conditions of the GLPv3+ or AGPLv3+? In many cases that may be good. Why should there not be new GPLv3/AGPL development? Should developers fearing cloud-services not be integrated into KDE-community? We are not BSD, we want to protect ourselves, WebOS is coming soon, GPLv2′s copyleft may become completely noneffective, and we – usually accepting copyleft – should be stuck in GPLv2? That may be really, really bad. Should we care about GPLv2 only software which would not be able to integrate (A)GPLv3 software? No, they should start caring about the problems of GPLv2, and we should also think about the (A)GPLv3 only software we are currently excluding from being integrated into KDE. Btw: ownCloud already violates the licensing policy (though it may be invalid, because it does not mention git), and I am sure we will find more “mislicensed” software in the repositories.


KDevelop-PG-Qt lexer-branch merged into master

Saturday, April 23rd, 2011

Hi folks!

I have finally merged my private lexer-branch of KDevelop-PG-Qt into the master-branch and pushed it to – after having run KDevelop-php-plugin unit-tests ;). Well, what are the changes?

  • Now there is support for lexer-generation, including unicode-support (UTF-8, UCS-2, UTF-16, UTF-32/UCS-4). I have talked about those features in some previous blog-posts.
  • You do not have to use “;;” any longer, just “;” will work, but for compatibility “;;” will be like a single semicolon.
  • Fixed a very old overflow when reading the grammar from an input-stream
  • With –error-aware-code (the default) error-messages will now work correctly, when you are compiling the generated file and there is an error in your hand-written code, you will get correct references to the corresponding line in the grammar-file, unfortunately kdev-pg-qt’s error-messages are still not perfect (wrong column-numbers etc.).
  • Added some command line options, marginal changes in their behaviour
  • Changed some messages (spelling etc.)

Maybe, some day, there will be KDev-PG-Qt 1.0 (currently it is called 0.9.5 ;)), so, what is still to do?

  • Make it possible to control what should happen when the lexer fails (it should not just be exit(-1))
  • Update the documentation at Techbase
  • Update the Kate-syntax-file
  • Decide which additional unicode-character classes are useful, add some aliases
  • Make new command-line-options available in the CMake-macro
  • Find out, how to enable C++0x support in CMake allowing different compilers (yes, I am using variadic templates ;))
  • Tokens should no longer be declared inside the parser-class (of course it should still be possible for compatibility)
  • Forward-declarations for AST-classes
  • Integration with QIODevice/KIO (optional)
  • Dumping automata (optional)

Of course there could be more nice stuff, but that would not be related to the recent changes, e.g. LL(k)/LL(*)-parsing, look-ahead, possibility to switch rules off, packed AST-nodes, making it usable as a library etc.

I hope there are really no regressions, and the changes will be useful for the small elite of KDevelop-PG-Qt users. ;)

Unification and Orthogonalisation: Part 2, Plugins and Bindings

Sunday, April 17th, 2011

Hi folks!

Part 2 of the series, there will be hopefully some more rational discussions. I want to review KDE’s plugin system.

There are currently three mainly used Plugn-technologies used by KDE projects:

  • KPluginFactory
  • Kross
  • QtScript

QtScript and Kross work similarly, both use qt-metacalls to allow scripts to access C++-objects. Thus it does only work for QObjects and only for q-properties, signals, slots and functions marked as scriptable. Additionally some data-types get automatically converted (containers, strings) and there is usually a fixed set of supported datatypes, e.g. QPainter etc. Both Kross and QtScript are easy to use, you can load script files and call functions using QVariants/QScriptValues and it reuses the QMetaObject-stuff. However, in my opinion there are a lot of downsides of those approaches:

  • You have to change the C++-code to allow interfacing with it (subclassing QScriptable, marking methods as scriptable etc.)
  • No easy way to expose functionality of non-QObjects
  • No plugins written in C++, which would be natural to have, although it would not be that hard to allow at least C-plugins in Kross
  • KDE application are usually programmed using object-orientation, even scripts are handled as objects, but inside the scripts the object-oriented architecture does not get continued, e.g. kwin-scripts are written in a procedural style, sometimes that is nice, but sometimes it would be nice if the script would follow the overall-architecture with plugin-classes or whatever

Some downsides are specific for QtScript:

  • It does not work with other languages than EcmaScript/JavaScript (no Ruby, Python, Perl…)
  • Sometimes QScriptable has to be subclassed thus making core-classes and plugin-functionality unnecassarily interdependent

Well, there is an alternative: KPluginLoader. Wasn’t that an uncool, unnecessarily complicated, old technology for C++-plugins? No, it is not. The idea is old (similar API had been included in KDE 2.0), but there are good reasons that it is still used although it has been indeed focused on C++: It is really flexible and it is really cool. Let me describe the idea: It is integrated into XDG-configuration-system, each plugin is described by a .desktop-file referencing a shared object file. A factory-object inside this shared object file is used to create instances of a plugin-class, which may subclass a class used to allow plugins inside the application (e.g. a Plasma::Applet, a KoShape or whatever). The factory is the reason for the flexibility of the architecture: There are specialised factories not simply allocating a C++-object but loading script-files, e.g. there are KRubyPluginFactory, KPythonPluginFactory and KPerlPluginFactory creating an object inside the scripting-language subclassing the plugin-class and exposing it to C++, but even more specialised classes libraries like plasma_appletscript_simple_javascript allowing a procedural interface. Those classes make arbitrary applications like Konqueror and KDevelop scriptable.

That is really cool, but there are limitations: Because the meta-object-system is not used there has to be a binding exposing the API of the application to the script. Of course it is nice that such bindings are orthogonal to the application itself and even non-QObjects can be accessed, but unfortunately it is currently complicated to add such bindings to an application. E.g. most bindings are using Smoke to access the libraries and the Smoke-Generator to generate the code for bindings, but you have to invoke it for each language, and the Python binding is using SIP. Hence it is comprehensible that many applications are using Kross or QtScript (though I do not see any good arguments for QtScript).

What could be done for unification? How could the dependencies between the plugin-functionality and the core-stuff be reduced? In my opinion it should be made easier to create a binding since that is currently the difficult task about scripting usin KPluginFactory, creating a plugin-class and invokin KPluginLoader to get plugins is not that complicated. A comprehensive set of cmake-macros would be cool, allowing to create bindings for Ruby, Python, Perl and QtScript using a single xml-file, it should detect pregenerated binding-files, invoke installed smoke-generators, convert it to SIP-information, and install it were the specific scripting-language expects the files. Creating such a XML-file would certainly not be more complicated than adding Q_SCRIPTABLE everywhere.

Maybe some of Plasma’s efforts to make scripting easy could be generalised to be usable everywhere in KDE-based projects (they did a good job ;)). Of course there should be a general purpose QtScriptPluginFactory (there is no such class, isn’t that strange?), but there are also some special features in Plasma, e.g. limitating the accessible API is a nice feature, it should be implemented in the plugin-factories, usually scripting-languages provide features to restrict importing of modules, accessing files etc., there should be a unified set of xdg-properties to specify what a script should be allowed to do. Using X-KDE-PluginKeyword to specify the script to load is also not very elegant, and passing arguments to plugins is also not perfectly implemented. Procedural scripting would be nice to have in many cases, too, e.g. there could be a KProceduralRubyPluginFactory automatically wrapping the functions in the scripting-file into a plugin-object in a generic way.

With such improvements we may get rid of QtScript and Kross, the architecture of some applications could be improved easily, and everybody could choose if he wants to use EcmaScript, QML, C++, Ruby, Python, Perl or whatever.

Disclaimer: There are always some people confusing thoughts and announcements, although the number of disclaimers does not seem to matter for them, there should be at least one: That is a review, some thoughts, no announcement for whatever. And I am familiar with the codebase.

KDE 5: Unification and Orthogonalisation, Part 1: Plasma and KWin

Tuesday, April 12th, 2011


In the next few blog-posts I want to tell you about some thoughts about the state of KDE, about things which will certainly not change in KDE SC 4, that are not concrete development-plans, but things we could talk about, where orthogonality is missing in current KDE SC.

Obviously Plasma has become much more than a unification of Kicker, KDesktop and SuperKaramba, Plasma-Desktop is not only a dashboard for controlling the desktop, there are games running in Plasma, a micro-blogging-client, a comic-viewer etc. And it has become a playground for new user-interfaces. The dashboard-concept might still apply for many desktop-systems, but when looking at Plasma Active we realise: Plasma and KWin are no longer orthogonal. People will probably like the widget-navigation for small touchscreen-devices, and why should there be a separate KWin-window-navigation? In my opinion Plasma-Desktop and KWin should be merged in KDE 5. First of all: do not say it is impossible because xembed etc. are ugly. A window manager has great capabilities, why should it not organise the windows using Plasma-libraries. This will require big changes in KWin and even in Plasma-libraries (Wayland- or Xorg-windows are currently not handable like Plasmoids). Of course it would be tricky to make it still embeddable into classic QWidget-environments (e.g. Amarok), but I think it would be worth it.

The strengths of both should be combined: GPU-usage from KWin, flexibility of Plasmoid-organisation from Plasma, elaborated effects and animations from both, new layouting/tiling capabilities for both (both currently have some limited capabilities), a new unified scripting interface for advanced configuration, activities and virtual desktops (already incorporating Plasma-Desktop and KWin) should become more general and more powerful concepts (may happen in KDE SC 4). Flexible handling of window-decoration and menu-bars should become possible (there do not have to be window-decorations, but the window manager must provide the possibility to close a window etc.!). It is a limitation of current window-managers that applications cannot talk to them, that is why some people make propaganda for client-side-decorations. Of course client-side-decorations are a very bad idea, but in Plasma Plasmoids can talk with the environment, the main-reason for it is probably the single-process-policy, but I am sure it could be adopted by using a DBus-interface or (even better) aplication-specific config/scripting-files. QGraphicsView should no longer be used because a window-manager can do better using the GPU, not incorporating QWidget-QGraphicsView/-QGraphicsItem-QObject-QGraphicsWidget-overhead (of course it should still provide a high-level API using QObject, but QGraphicsWidget etc. are really useless overhead, the bad thing is the redundance, not the abstraction), thus not only the window-manager should use Plasma-APIs, but Plasma-libraries would have to be adapted.

I hope you have got my point: Such a unification would be good for all form-factors.

That are just hallucinations, random thoughts for discussion, no concrete plans.

Algorithmic ideas needed for Unicode

Monday, March 21st, 2011


KDevelop-PG-Qt development is continuing slowly, and I need some creative ideas. You do not have to know anything about KDev-PG-Qt or lexers, it is an algorithmic issue about Unicode, I am failing at arranging my thoughts in a way that could be transformed into (maintainable, understandable) code.

The problem is quite simple: Given a range of Unicode characters (e.g. “from a to z”, a-z, or 0xef-0x19fe3), I want to represent this range in UTF-8 or UTF-16 encoding, using 8-bit or 16-bit ranges respectively (e.g. 0×14-0×85 is a 8-bit range). For example any ASCII-range (like a-z) stays the same encoded in UTF-8 or UTF-16, for a more sophisticated example you should know how UTF-16 (the encoding used by QString) works:

  • Any UTF-32 codepoint (character) between 0×0 and 0xd800 (including 0×0, excluding 0xd800) and between 0xe000 and 0×10000 is simply converted into a 16-Bit integer, it does not get changed.
  • For any larger codepoint subtract 0×10000, now you have got a 20-Bit number, split it into two 10-Bit numbers, add 0xd800 to the first one and 0xdc00 to the second one. Now you have got two 16-Bit numbers, the high-surrogate and the low-surrogate, together those numbers represent the character in UTF-16.
  • Any surrogates (between 0xd800 and 0xe000) are invalid.

The example: 0xffef to 0×20000 should be transformed into UTF-16. The starting-codepoint is within UCS-2 (smaller than 0×10000 and it is not a surrogate), the end is not within UCS-2, it has to be encoded with a surrogate pair. So in UTF-16 the range would be 0xffef to (0×80+0xd800, 0×0+0xdc00) = (0xd880, 0xdc00). But I want to have 16-Bit ranges, a range from a single 16-Bit number to a pair of 16-Bit numbers is simply nonsense. Thus the range has to be split into two parts:

  • The UCS-2 part: 0xffef-0×10000
  • The non-UCS-2 part: 0×1000 = (0xd800, 0xdc00) to (0xd880, 0xdc00), resulting in (0xd800 – 0xd880) followed by (0xdc00 – 0xe000)

In that example there are exactly two parts, but that may be different, a range may have to be split because 0xd800-0xe000 are invalid, the starting codepoint may be non-UCS-2, then there can be up to three ranges, e.g.:
We want to encode the range 0x1000f-0x200ff. 0x1000f is (0xd800, 0xdc0f) in UTF-16, 0x200ff is (0xd880, 0xdcff), so we need three ranges:

  • The first few surrogate pairs with the same high-surrogate: 0xd800 followed by (0xdc0f – 0xe000)
  • The majority of surrogate pairs in the middle, where every low-surrogate is allowed: (0xd801 – 0xd880) followed by (0xdc00 – 0xe000)
  • The rest with the same high-surrogate: 0xd880 followed by (0xdc00 – 0xdcff)

I hope it is clear which output-format I would like to have: a set of sequences of 8/16-Bit ranges (ranges may have only one element). Unfortunately there are a lot of cases, I have implemented the conversion (commit 1dbfb66ca66c392c6ffdaa3ff9d00d399370cf0e) for UTF-16 using a lot of if/else for handling all special cases, involving 130 lines of ugly code, well that was not too bad, but now I need the same for UTF-8: up to four ranges in sequence, and much more special cases, 500 lines of unreadable, boring, unmaintainable code just for UTF-8? I do not like this idea. I need some creative ideas how to implement it in a more abstract way without handling all cases explicitly, and without brute force (checking all codepoints within the range separately). There is one simple idea: do not care about 0xd800-0xe000 and remove it at the end. But I have currently no idea how to simplify the rest, information how many UTF-8 surrogates are needed to represent the start and the beginning and if the surrogates have the maximum or the minimum value should definitely be used. Read this for a comprehensive description of the UTF-8 format, it is not more complicate than UTF-16. Consider it a contest, the winner will recieve eternal glory, I hope somebody has a nice idea. :D

Nokia’s risks

Monday, March 14th, 2011


In a recent report to the US Securities and Exchange Commission Nokia is talking about the risks of their new strategy. First of all, they, seem to forget about Qt and Free Software Communities, only few words about Qt:

We have also endeavored to offer a better experience to developers through the unified Qt development environment. By using Qt’s programming interface, both our own and third party developers are able to build an application once and simultaneously make it available for our Symbian and future MeeGo-based products as well as many products supported by other mobile and desktop operating systems without having to rewrite the source code.

Yes, thet was a really nice idea, but they forgot to mentioned that that is over now, at least for them, Android and Necessitas will now become the mobile platfor suitable for simple cross-platform development.

For developers, we believe that we can create new and highly attractive monetization opportunities. By leveraging Microsoft’s proven developer tools and support, based on Silverlight, with our operator billing, merchandising and global application store, we intend to offer new monetization mechanisms for developers while providing access to Nokia’s global scale. We will continue to promote Qt as the sole application development framework for our Symbian smartphone platform on which we expect to sell approximately 150 million more devices in the years to come. For our Series 40-based feature phones, we will continue to support a Java-based development environment.

Is that a joke? It should be “attrictive” for developers, if they are changing their programming languages and APIs all the time, completely? And they forgot that not every developer wants to have “monetization opportunities”, but there are vibrant Free Software/Open Source communities, and for them this great platform – part of this great “strategy” – is certainly not very nice: No GPL, but Microsoft’s own semi-free, copyleft license? One word: immoral! Silverlight? Yes, we want to port all the existing C++-code to an entirely different proprietary platform, of course. Really surprising that they have forgotten us in that short time.


Today, industry participants are creating competing ecosystems of mutually beneficial partnerships to combine the hardware, software, services and application environment to create high-quality differentiated winning smartphones.

Well, and they will simply stop to innovate with software.

Well, they have pointed out some risks we all knew about, too:

  • It may be a good deal for Microsoft, but they may damage Nokia directly for various reasons (the brand, privacy)
  • Microsoft does not guarantees anything
  • Windows Phone may be just bad
  • It takes long time to switch to that platform (two years)
  • Bad support for low-end phones
  • Symbian will not be attractive for anybody in the meantime
  • They will fail to make MeeGo a nice system for mobile-phones
  • They will not be able to innovate, all innovation will have gone
  • Nobody will want to work for Nokia, because it is just uninnovative
  • The developers get discouraged, of course, because a lot of them will be fired (reducing R&D costs), and their work of years will be screwed
  • Reducing costs might fail, because they will still need a lot of support for Symbian and Microsoft will recieve a lot of money
  • Target-platforms are very limited (they are not talking about the desktop, but about missing tablets)
  • Nokia-phones will not have any distinguishing features
  • Integration of their services might fail because of Microsoft’s interests

Few more things they are not talking about:
  • The “next generation user devices”-story is a farce, after that bad, bad Microsoft-deal, years later all the FLOSS-people will come back to use MeeGo, which will be awesome because of Nokia’s research? Seriously…
  • They are now totally noncredible not anly for business-partners, banks and investors, but also for all developers
  • With less research and development (“R&D”) they want to release more innovative products, that is strange
  • Nokia will die. This short and simple statement has been forgotten

It is horrible…

Writing Konqueror-Plugins with Ruby

Monday, March 7th, 2011


You may not know about it: But you can write Konqueror-plugins, and you can use a scripting-language like Ruby. KDE/KParts/Konqueror have a very nice plugin-system, you can write plugins using C++, but you can also use a scripting-language like Ruby or Python. It should even work with Perl (untested), unfortunately for KJS and QtScript there is no plugin-factory implementing the support. Well, let us have a look at a simple plugin written in Ruby. There are four important files:

A desktop-file containing some meta-data

It works like a normal plugin-desktop-file (see e.g. /usr/share/kde4/apps/khtml/kpartplugins/plugin_adblock.desktop), but you need two special lines:


Well, X-KDE-PluginKeyword is a not very expressive name for the name of the script to use (because it may be used for other stuff as well), but what is krubypluginfactory? It is a very cool invention, like the usual KDE-plugin-factories it instantiates plugin-objects, but this one instantiates special objects, were virtual method- and slot-invocations will be delegated to Ruby.

As usual you can use X-KDE-ParentApp (in our case konqueror) to tell KParts not to load the plugin in other applications using the KPart.

A .rc-GUI-XMl-file

You may have heard about KXmlGui – you need it to define actions in the menubar etc., but you will even need it if you do not access any GUI-elements, otherwise Konqueror (or any application using KParts-plugins) will not find it:

<!DOCTYPE kpartgui>
<kpartplugin name="MyPlugin" library="krubypluginfactory" version="0.1">
  <Menu name="tools"><text>&amp;Tools</text>
    <separator group="tools_operations" />
    <Action name="tools_MyPlugin" group="tools_operations" />

A CMake-file (CMakeLists.txt)

find_package(KDE4 REQUIRED)
include (KDE4Defaults)
install(FILES myplugin.rc myplugin.desktop DESTINATION ${DATA_INSTALL_DIR}/khtml/kpartplugins)

That one is really trivial, some elementary checks, then it will be installed to the correct directories, Ruby is an interpreted language. ;)

The Ruby-Code

The name of the Ruby-class has to match the filename, otherwise KRubyPluginFactory will not be able to find it (KPythonPluginFactory works differently, there you have to create a separate function to instantiate the plugin-object). Like normal plugins they have to inherit KParts::Plugin, the constructor can take a parent-object (that is the KParts::Part-instance), a parent-widget and a list of arguments (which will usually be empty). There it can initialize the actions declared in the XML-file and add signal-slot-connections such that something will happen. What should happen in my example-plugin? It will find bad-links to sites of evil companies (Microsoft, Apple, SAP, Oracle, Facebook, Google, but the list can be easily extended to your own moral standards :D), and when it finds such a link, it will get angry and wil replace all the links on the website with good links, e.g. to Planet KDE, GNU, Wikipedia or :D That for it uses the DOM-API exposed by KHTML.

#!/usr/bin/env ruby
# Qt
require 'Qt'
# KDE-libraries
require 'korundum4'
require 'khtml'
# Ruby-CGI-library (for url-handling)
require 'cgi'
include KDE
include Qt
module MyPlugin # has to match the directory-name
class MyPlugin < KParts::Plugin # has to match the file-name
  # Slot-Declarations look that way, “a bit” C++-ish
  slots 'toggle(bool)', 'hoverUrl(const QString&)'
  def initialize(parent, parentWidget, args)
    if !parent.is_a? KDE::HTMLPart
      qWarning("MyPlugin: Not a KHTML-Part")
    @part = parent
    @action ="Enable bad link detection", self)
    @action.checkable = true
    # the action declared in the XML-file
    actionCollection().addAction("tools_MyPlugin", @action)
    # unfortunately you have to provide parameter lists
    connect(@action, SIGNAL('toggled(bool)'),
            self,    SLOT('toggle(bool)'))
    # good links
    @favLinks = ["",
    # bad links
    @badLinks = ["microsoft", "apple", "sap",
                 "oracle", "facebook", "google"]
  def toggle(active)
    if active
      connect(@part, SIGNAL('onURL(const QString&)'),
              self,  SLOT('hoverUrl(const QString&)'))
      # normal Qt API
      Qt::MessageBox::information(nil, "Success!",
          "Now you are safe, you will only visit good sites!")
      disconnect(@part, SIGNAL('onURL(const QString&)'),
                 self,  SLOT('hoverUrl(const QString&)'))
      Qt::MessageBox::information(nil, "Success!",
          "Bad links will no longer be removed")
  # replace all the links recursively using the DOM-API
  def replaceLinks(dom)
    href ="href")
    if     dom.nodeName.string == "A"
        && dom.attributes.getNamedItem(href) != nil
      oldUrl = dom.attributes.getNamedItem(href).nodeValue.string
      newUrl =   @favLinks[rand(@favLinks.length)]
               + "/#" + CGI::escape(oldUrl)
    # recursion
    children = dom.childNodes
    for i in 0..(children.length-1)
      child = children.item(i)
  # slot invoked when hovering a link
  def hoverUrl(url)
    # this will happen when you move the cursor away
    return if url == nil
    found = false
    for badLink in @badLinks
      pos = url.index(badLink)
      sharpIndex = url.index("#")
      if pos != nil && (sharpIndex == nil || pos < sharpIndex)
        found = true
    # maybe the link is not evil
    return unless found
    # replace all the links

You may have noticed it: It is not that hard to write a Konqueror-plugin in a scripting-language like Ruby, it is not perfect, sometimes debugging-output is bad, sometimes you need C++-style-stuff, but that is handable without knowing anything about C++. And there is unfortunately no KHotNewStuff-integration. In theory it should work with KWebKit the same way (those techniques work for all applications using the KPluginFactory for creating their plugins in theory, and of course for all KParts), but unfortunately there is no Ruby-binding for KWebKit, so the Ruby-script will be invoked, but Ruby will not recognize that it is a KWebKitPart, it will be just a KParts::ReadOnlyPart, you can access KWebKitPart’s overridden virtual functions, its signals and slots and its meta-object, but the most important method (KWebKitPart::view()) returning the QWebView, which can be used to manipulate all the stuff, will not be accessible. It would require some confguration-files to build Smoke and Korundum with KWebKit support.

Have fun! Maybe somebody will write a useful plugin. ;)

Kile Inline Preview

Friday, March 4th, 2011


I just want to inform you about a really nice Kile-Feature-Fork: Kile Inline Preview.
When you install this patched version of Kile you will get embedded preview-images for your formulas. That is really nice, because you want to have WYSIWYM, and a rendered formula looks more like what you mean than LaTeΧ-formula-code, while “normal” text-mode commands are more expressive than the rendered version. When you move the cursor into the preview-image, it will disappear and you can edit the formula or view the code. I think it is really nice, it even works properly. Maybe it could be merged one day into git, because it is not so nice to have to wait for the developer (nice guy ;)) updating the Kile-version used by the repository (I think he is still using a SVN-version). I love this feature! In my opinion that makes Kile better than LyΧ or TeΧMacs. Try it yourself!

The User

Too lazy and tired, to create a screenshot, maybe tomorrow. :D

Here it is! The cursor is currently at line 21.

The KDev-PG-Qt Lexer-Generation-Algorithm

Saturday, February 19th, 2011

Notice: This post may require some background-information about lexers/regexp-parsing/automatons.

Today I want to describe the essential ideas for lexer-generation in KDevelop-PG-Qt. Each lexer-rule is described with a regular expression. The generated code has to determine which regular expressions match the following input-characters and which one to choose. A common approach for choosing one of multiple matching regular expressions is choosing the longest match (eat characters greedily), if there are multiple longest matches, priorities from the definition are used. That approach is easy to implement and very appropriate for real-world-languages, that is not the interesting part.
The usual approach for detecting the regular expressions is a translation to finite-state-machines. Each regular expression can be easily transformed into a nondeterministic-finite-automaton (NFA), accepting the words described by the regular expression, wich can be transformed into an equivalent deterministic-finite-automaton. For multiple tokens you can simply create a NFA accepting each of the expressions, while taking care about the rules which may match the current input using some additional states. Now the interesting part about KDev-PG-Qt’s implementation: The usual approach for handling the transitions between the states of the finite-state-machines is a table mapping the input character to the next state:

Foreach state store: table (character → next state(s))

But KDev-PG-Qt should support Unicode, it should be possible to process even UTF-32 input, more than a million possible inputs, memory usage, size of generated tables? You can imagine what would happen with a few hundred states. But it is possible to take advantage of a property of Unicode: similar characters are grouped in blocks, and usually tokens will consist of similar tokens. So it is a good approach to save the transitions blockwise, e.g. A-F, X-Z, a-i and m-p, such representations are obviously much more efficient than table-based approaches. But KDev-PG-Qt should support both approaches, but the used algorithms should always be the same (generic programming, yeah!), so I need some kind of abstraction: sets of characters which may be represented using either bits or blocks/sequences of characters. That for I have also choosen a different internal format for transitions:

Foreach state store: table (next state → set of characters for this transition)

The algorithms are now using set-based-arithmetics (union, difference, intersection, complement) to work with any representation of character-sets. For tables it can be easily implemented using bit-operations, for the interval-based representation it is a bit more complicated, but possible in Ο(n), where n is the number of intervals in the sets you are currently dealing with. So it can run without using Gibibytes of memory and within reasonable time for creating a Unicode-aware lexer.
However, not only KDev-PG-Qt should not require too much memory, but also the generated lexer should be short and efficient. That is why there should be differnt code-generation-strategies for the different representations of sets of characters. For the tables you can simply generate tables or switch-statements for the state-transitions. But for the sequence-based-approach you can do better: The sets of characters get translated into nested if-else-statements, some kind a static binary search, they can decide in which interval the character from the input is. So the lexer will use a very small number of comparisons to decide which state to choose next instead of using a large table, that makes the generated code even more readable (although that is not the purpose).
Everything is programmed using a lot of templates, some of you might not like this approach, but it is really nice to be able to create one algorithm accepting different character-encodings (ASCII, Latin1, UTF-16, UTF-32, …) supporting different internal storage representations and different output-formats for the decisions.

Well, that is it, I hope this post is interesting for at least a few people. ;)

The User

Of course minimization is also implemented, so you will not get useless states in the generated code.

RFC: Regular Expressions in KDev-PG-Qt

Thursday, February 17th, 2011

Hi folks!

Now I want to tell you about the progress in KDevelop-PG-Qt and I am interested in comments about the syntax of regular expressions/lexer-declarations, I recently implemented.

But first of all for those who do not know (probably most Planet KDE readers): What is KDev-PG-Qt (KDevelop-PG-Qt)?
“KDev-PG-Qt” is the shortform for “KDevelop-PG-Qt”, “KDevelop-PG-Qt” is the longform for “KDev-PG-Qt”, KDevelop is our favourite IDE, Qt is our favourite C++-framework, and “PG” means “Parser Generator”. So, it is KDE’s parser-generator (like Bison or AntLR) written some time ago for KDevelop, long time ago it was called “KDevelop-PG” and it used STL-strings and -streams, but then apaku (iirc) ported it to Qt, and KDevelop-PG is dead nowadays, but KDevelop-PG-Qt is still alive and used for some KDevelop-language-plugins. If you are interested, you can find an introduction here.

What am I currently doing? When writing a parser you usually divide it into two components: the scanner/lexer/lexical analysis and the parser/syntactic analysis, because it is quite comfortable to add an abstraction between single characters and the syntax: tokens. KDevelop-PG-Qt currently does not provide lexer-generation, it is used in combination with handwritten lexers or lexers generated by flex. Writing lexers manually is usually not very complicate, but it is not the optimal way, and those lexers are usually relatively slow. Flex with Unicode is really a painfull experience. There is also quex, a nice lexer-generator, supporting Unicode and a lot of extra-features, but it uses a modified LGPL-version “excluding military usage”, whatever this condition may mean, it is problematic. I do not have a problem with armies using my lexer-generator, and writing an own lexer-generator makes integration with KDev-PG-Qt context-free grammars easier.

Current state:

  • Algorithms work very well, I can construct the union, intersection, complement, difference, concatenation etc. of regular languages, optimal automatons get generated
  • Code-generation for automatons works, too, the generated lexer is easily integratable with the parser
  • A weird (?) input-syntax for the declaration of the tokens is implemented
  • Regular expressions are usable, but I am not using the Perl-syntax


  • There are some compiler-warnings indicating bugs in the algorithms
  • Unicode-character-classes are a must-have feature for a Unicode-aware lexer-generator
  • Escape-sequences are incomplete, 0-9 and foo{1,3} syntax not yet supported
  • Extra-features like word/line-boundaries, inheritance, actions in state-transitions etc.

Well, now I want to hear your opinion about the regexp-syntax: The lexer-declaration should support the same syntactic elements as the parser-declaration (wherever it makes sense), so there are the postfix-operators * (Kleene star, like in Perl), + (like in Perl), the binary-operators @ (e.g. “bla”@”,” accepts comma-separated lists of blas), & (intersection), | (union alternative, like in Perl), ^ (difference, like in Perl, but not only for characters inside [ ], but for any regular expressions), the prefix-operators ? (optinality, like in Perl, but prefix, like in the parser) and ~ (negation, complement), and the “empty operator” for concatenation. You usually have to quote words, but for simple alphanumeric strings it works without quotation marks. Unquoted/unescaped whitespace will be ignored, that is maybe unfamiliar, but it allows more structuring of the regular expressions. The empty operator has the highest priority. Now a syntactic element which may be interesting, weird or whatever: The brackets ([ ]) are not actually a notation for character-sets, but they change the meaning of the empty operator to the union (|). That sounds really weird, but it has some benefits, e.g. it is possible to work with multi-byte-characters. Everything between normal parentheses will be in normal mode (empty operator means concatenation), that makes ugly expressions possible, but it is also very flexible, e.g. imagine “ab” should be a digit, you could simply write [0123456789"ab"] for digits, but normally everything looks like normal Perl-syntax, but it would not be [^ab], but [.^ab], and you have to escape whitespace. I am not sure about that issue. It is currently implemented here.

About the general syntax of lexer-declarations I am even more unsure, but let us discuss this later, I have already written too much.

The User

Omitting quotation marks only works for characters inside of parentheses or brackets, otherwise they could be confused with the symbolclass (name of the token).

An example, it is currently working with my KDev-PG-Qt version: foolisp, it uses S-expressions, the keywords “foo”, “bar” and “baz”, string-literals, integer-literals and identifiers:

%token_stream Lexer ;
%input_encoding "ascii"
%input_stream "KDevPG::QByteArrayIterator"
%lexer ->
  "("                           LPAREN ;
  ")"                           RPAREN ;
  "foo"                         FOO ;
  "bar"                         BAR ;
  "baz"                         BAZ ;
  "0"|"1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9" -> digit ; -- Should be predefined wit unicode character classes
  "0" | ( {digit}^"0" ) {digit}*    NUMBER ;
--   [{ASCII}&{LETTER}\-{digit}]+ IDENTIFIER ; -- That's how it shold look like
  [abc\-][abc\-{digit}]*        IDENTIFIER ; -- That's how it is currently working
  "#!foolisp"                   SHEBANG ;
  [\ \n\t\f]+                   [: /* blank */ :] ;
  \" [.(\\.)^\"]* \"            STRING ;
   ?SHEBANG sexp=sexp -> start ;
   LPAREN #sexp=sexp* RPAREN
 | foo=FOO | bar=BAR | baz=BAZ | number=NUMBER | string=STRING | identifier=IDENTIFIER
-> sexp ;

I hope that clarifies the regular expressions for the lexer.