Unexpected Bottle Neck: vector<bool>

Hi folks!

The last paragraph is actually the interesting one for every C++-programmer, the leading-in might not be understandable.

The new lexer component of KDevelop-PG-Qt is not very fast, it can take several minutes with complex grammars, especially when using UTF-8 as internal encoding for the automata. Of course, first I want to finish the most important features and that for I am writing an example lexer for PHP to check its capabilities to spot bugs and missing features, but now I was curious and asked Valgrind/Callgrind for analysis. I expected a lot of useless copying/assignments because I have not done any optimisation yet and tried to be “defensive” by using too much copying. Indeed: For my first example about 78% of the runtime were spent on copying. But then I tried a full (as far as it is finished) PHP example with UTF-8 and it was a real surprise:

KDevelop-PG-Qt: Callgrind results: a lot of comparisons on vector<bool>

KCacheGrind Overview

It spends most of the time with comparing instances of std::vector<bool>! For the powerset construction I have to maintain following states for subsets of states in the NFA and after that I map those subsets to new unique IDs for the states of the DFA. For both cases I wanted to use a map with vector<bool> (a subset of states) as key_type, I could not decide if I should use map or unordered_map – I used map, but there is also some code to make it possible to use QBitArray with unordered_map, just for the case. However: I expected the minimisation to be a bottle neck (disclaimer: I stopped valgrind after some time, maybe it becomes a bottle neck later), but now the comparisons used in the map-implementation are crucial. I was not very economical with those map-operations – I could have done the mapping while computing the necessary subsets, it has to be optimised and I may use unordered_map and unordered_set (hashes) instead of map and set, but that is not the reason why I am writing this blog post. ;)

The interesting thing about it is the usage of std::_Bit_reference. std::vector<bool> has a specialised implementation, the booleans get packed, thus it will only need size()/8+const Bytes when it is full. There is a special hashing-implementation directly using the integers used to store the bits, but there is only a generic comparison implementation. Thus it will iterate over all the bits, instead of using comparisons of chunks (like 32 or 64 bit, depending on the CPU). Of course it will be much slower, keep in mind that each iteration requires some bit-manipulation. Especially in my case of a red-black-tree the vector will get compared to very similar ones next to it in the tree, and in that case it takes especially long time. Be aware of vector<bool>, it is often awesome (safe a lot of memory), but something like this slow comparison may happen, and always keep in mind that the index-operator might be slower than for vector<char&gt. Unfortunately there seems to be no API for accessing the integer-chunks. Btw. QBitArray does not support comparison at all. I have filed a bug-report to GCC/libstdc++.

8 Responses to “Unexpected Bottle Neck: vector<bool>”

  1. Giorgos Tsiapaliwkas Says:


    nice work!

    I looked at kcachegrind and i saw that is a dead project..
    Is there any maintained programm similar to this one?

  2. The User Says:

    Maybe it is unmaintained, but it does always work for me, very stable and provides the features I want to have. I do not think it is a problem in that case that it is unmaintained.

  3. Albert Astals Cid Says:

    Quick check, did you run callgrind over a release/optimized build (i.e. compiled with -O2) or not? Because it will cause major changes

  4. The User Says:

    That is a debug-build. Of course it will cause major changes, but without inlining we can see the usage of std::_Bit_reference all the time, even with optimisation it will still iterate bitwise, but we can’t see that in the output.

    Probably after some time minimisation gets more relevant. But the bad performance of operator< is the same with -O2. The total numbers are not the important thing here (it is not even a complete run). PPS:

    KDevelop-PG-Qt: Callgrind results: a lot of comparisons on vector<bool> (with -O2)

    RelWitDebInfo (-O2)

    That is the result of a complete run of an optimised version. You cannot see the reason for the bad performance any longer. The powerset construction needs most of the time (the dfa() function), including all the comparisons, some time is spent on minimisation, output (generateLexer) and output-formatting (the astyle stuff).

  5. Tommy Says:

    The rule is to never use std::vector as it was misimplemented in the C++ standard library (this is not GNU C++’s problem, but an issue in the standard itself). It will probably be fixed soon, as the C++11 standard seems to be finished (finally).

    The problem itself is actually widely known, see for example

    There are also chapters about this issue in Scott Meyer’s effective C++ and Herb Sutter’s C++ coding standards. (If you want to vastly improve your C++ skills, the latter is highly recommended!)

  6. Tommy Says:

    The first line should of course read: The rule is to never use std::vector ….

    For all other types the usage of std::vector is of course highly recommended.

  7. Rsh Says:

    I suppose it would be easy to implement your own minimal vector ;) I think applying memcmp to its memory could be pretty damn fast – GCC optimizes memcmp with SIMD x86-alike platforms but should also use some specialized versions on ARM and others.

  8. The User Says:

    I am now using an unordered_set. ;)

Leave a Reply

XHTML: Use <blockquote cite="name"> for quotations, <pre lang="text    ∨ cpp-qt ∨ cpp ∨ bash ∨ other language"> for code, [latex] for formulas and <em> for em. Contact me if the comment does not get published, it may have accidentally been marked as spam.

Anti-Spam Quiz: