A profiling war story
Monday, May 4th, 2009

The other day I ran across an XML schema that took the XSD compiler unusually long to translate to C++. While the schema itself was quite large, almost 2MB of definitions spread over 80 files, the compilation time on my machine was excessive: about 160 seconds. We also had a report from several months ago where XSD seemed to run indefinitely while trying to compile a really large schema (15K element declarations across about 200 files). Unfortunately, the reporter couldn’t provide us with that schema for debugging due to distribution restrictions. All we could do was give them a new version of the compiler and have them tell us whether it worked or not.
I set out to figure out what was going on using the schema I had, suspecting that it was the same problem as the one affecting the other schema. The first step was to understand which part of the compilation process takes so much time. A good profiler can be very helpful in accomplishing this. I personally prefer OProfile, mainly because: 1) it is system-wide, so you can see the complete picture (your application, libraries, even the kernel); 2) it is transparent in the sense that you don’t need a special build of your application or of the libraries it depends on; and 3) it has low overhead, which means that your application doesn’t take 10 times as long to execute when being profiled.
I ran OProfile while the XSD compiler was compiling the schema. My first step in analyzing the program execution was to get a flat symbol list that shows how much time is spent in each function. Here is the top of the listing for XSD:
samples  %        image name  symbol name
405527   14.4254  libc.so     _int_malloc
334761   11.9081  libc.so     _int_free
287232   10.2174  libc.so     free
285081   10.1409  xsd         Cult::MM::allocate
199565    7.0989  libc.so     malloc
124309    4.4219  xsd         Schema::find
119779    4.2608  xsd         operator delete
104893    3.7312  xsd         std::map<...>::find
 94964    3.3780  libc.so     wmemcmp
 85689    3.0481  xsd         Cult::MM::Key::size
 80308    2.8567  xsd         operator new
 62390    2.2193  libc.so     strlen
 37687    1.3406  xsd         std::string::_S_create
 35205    1.2523  libc.so     memcpy
When you read articles about profiling you are normally presented with a listing similar to the one above, in which one or two functions at the top account for 90% of the execution time. This, however, doesn’t happen very often in complex programs built by competent engineers. Complex programs perform several operations in a pipeline manner instead of just one algorithmic task (image conversion is a good example of an algorithmic application where you may see a single function at the top taking 90% of the time). Competent engineering means that there are no silly programming or architectural mistakes that result in single-function bottlenecks. Instead, you normally see a listing like the one above: the execution time is spread over many functions without any one of them standing out as unreasonable. Such information by itself is of little use in understanding what’s going on in the program.
What we need instead are the aggregate times for each function. That is, the time spent in the function itself plus the time spent in all the functions that it calls. We also need to know which other functions each function calls, along with their respective times. With this information we can find which operation in our program takes most of the execution time, as well as figure out which parts of that operation are responsible for the slowdown.
The best way to get this information from OProfile is to collect the samples with call graph information (see the --callgraph option), then convert them to the gprof format with the opgprof command, and finally generate the call graph with the gprof -q command. One limitation of the call graph information is that it is very inaccurate if your application is optimized. To get a useful result you will need to recompile your application with optimization turned off.
The call graph output is quite a bit more complex and can be a lot more verbose than the flat symbol list presented above. The gprof call graph format is explained in detail in the gprof documentation.
One way to get a summary of the gprof call graph that just tells us which functions have the largest aggregate times is to run the following sed script on the call graph output:
sed -e '/^\[/p;d' callgraph >summary
Here is the top of the summary listing for XSD:
% time   self    children   called            name
  99.8       0    645554                      main
  94.0   16021    591541    588759+3889477    Schema::find
  82.6    1024    533243    512793+36104907   <cycle 4 as a whole>
  82.4       0    532686    512443            Parser::parse
  55.5       8    358965    465329            Node<Element>::trampoline
  49.5   27896    291869    265608+36968      <cycle 1 as a whole>
  47.9       0    309389    299838            Resolver::resolve_element
  42.6   18427    256696    215619            operator new
  38.6       1    249727    242001            Resolver::traverse
From this listing it is immediately apparent that the problem is most likely in the Schema::find operation. It takes up 94% of the program execution time and is called half a million times from other functions and almost four million times recursively.
The Schema::find() function looks up namespaces with the specified XML namespace name in the schema graph. To understand how it works one needs a basic understanding of how the schema graph is organized. Each schema file is represented by a Schema object which contains a list of references to the included and imported schema files (also represented as Schema objects) along with the Namespace object which contains the actual type, element, etc., declarations. Schema::find() returns a list of Namespace objects corresponding to a namespace name. To collect this list, it traverses the include/import graph to make sure it visits every Schema object visible, directly or indirectly, from the initial schema.
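
To make this concrete, here is a minimal sketch of such a traversal. It is not the actual XSD source: the member names (namespace_, includes_, visiting_) are assumptions made for illustration, and the plain bool stands in for the Context flag discussed below.

#include <list>
#include <string>
#include <vector>

struct Namespace
{
  std::string name; // XML namespace name; also holds the declarations.
};

struct Schema
{
  Namespace namespace_;           // Declarations made by this schema file.
  std::vector<Schema*> includes_; // Included and imported schema files.
  bool visiting_ = false;         // Stand-in for the Context flag that marks
                                  // a schema as being traversed.

  // Collect every visible Namespace with the given name.
  void find (const std::string& name, std::list<Namespace*>& result)
  {
    if (visiting_) // Recursive include/import: stop here.
      return;

    visiting_ = true;

    if (namespace_.name == name)
      result.push_back (&namespace_);

    for (Schema* s: includes_)
      s->find (name, result);

    visiting_ = false; // The flag only detects cycles on the current path;
                       // nothing stops a schema reachable through several
                       // include/import paths from being visited again.
  }
};

The Context::set and Context::remove calls that show up in the call graph fragment below correspond to setting and clearing this flag.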
There could be two general reasons why Schema::find() takes a disproportionate amount of time. It could be that it is called too many times by other functions. Or the problem could be in its implementation. Or it could be both. Seeing that Schema::find() is called almost 8 times more often recursively than from outside, the suspicion is that it is the implementation. Here is the fragment of the call graph listing corresponding to this function:
 self    children   called           name
16021     591541    588759+3889477   Schema::find
 8275     217657    208177/208268        Context::set
 3528     174629    158008/158008        std::list<...>::insert
 3190      45372     65030/65055         Context::remove
  346      47712     43879/43880         std::list<...>::clear
 3500      28218     32082/37821         Scope::find
 3560       9619     20461/20586         Context::count
As you can see, the time is dominated by two sets of operations. The first is the Context operations, which are used to detect recursive import/inclusion (that is, a flag is set in the Schema’s context to indicate that it is being traversed). The second is the list insertion and removal operations. This tells us two things: a lot of Schema objects are being traversed and a lot of Namespace objects are being inserted into the resulting list.
The size of the schema, and in particular the number of files (which corresponds to the number of Schema objects), could not by itself explain the long compilation time, since there are comparable or even larger schemas that compile much faster. There had to be something else special about this schema. As it turned out, and this is something I just had to figure out on my own since no profiling tool is available for this last step, the schema had a very expansive include/import tree. Most files in this schema included or imported a handful of other files which in turn did the same. The end result was that each schema file was included or imported, directly and indirectly, hundreds of times. That meant that the Schema::find() implementation had to traverse the same files over and over again (thus all the Context operations) and insert the same Namespace objects into the resulting list over and over again (thus all the list operations).
The solution was fairly straightforward. I added a set that keeps track of all the Schema objects that have already been seen in this execution of Schema::find(). The compilation time dropped from 160 seconds to 1.5 seconds, which is a more than 100-fold speedup. And, as it turned out, this was the same problem that affected the larger schema which we could not get.
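
In terms of the earlier sketch (and, again, only as an illustration with assumed names rather than the actual XSD code), the fix amounts to threading a set of already-seen Schema objects through the traversal:

#include <list>
#include <set>
#include <string>
#include <vector>

struct Namespace
{
  std::string name;
};

struct Schema
{
  Namespace namespace_;
  std::vector<Schema*> includes_;

  void find (const std::string& name, std::list<Namespace*>& result)
  {
    std::set<const Schema*> seen;
    find (name, result, seen);
  }

private:
  void find (const std::string& name,
             std::list<Namespace*>& result,
             std::set<const Schema*>& seen)
  {
    // Each Schema object is visited at most once per lookup, no matter how
    // many include/import paths lead to it. This also subsumes the old
    // recursion check.
    if (!seen.insert (this).second)
      return;

    if (namespace_.name == name)
      result.push_back (&namespace_);

    for (Schema* s: includes_)
      s->find (name, result, seen);
  }
};

With the set in place the work per lookup is proportional to the number of distinct schema files rather than to the number of include/import paths, which is consistent with the drop from 160 to 1.5 seconds.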