Parsing C++ with GCC plugins, Part 1

Monday, May 3rd, 2010

You have probably heard about the recent release of GCC 4.5.0. One of the new features in this version is the support for plugins. You can now write a shared object (.so) that can be loaded into GCC and hooked into various stages of the compilation process.

In the past couple of months I have been working on a new project (what it’s about is a secret, for now; UPDATE: no longer a secret) that uses GCC and the new plugin feature to parse C++ and then generate some code based on it.

Writing a plugin to accomplish this was both fun and frustrating. Fun because GCC has a very rich abstract syntax tree (AST, sometimes called C++ Tree in GCC documentation). The amount of information available about parsed C++ is amazing; there isn’t much you can’t infer about the code. Frustrating because this AST is very complex and very poorly documented, and so is the plugin API. Most of the time I was reading the AST headers to learn more about the API and studying the GCC compiler source code to understand how to use it.

While there are a few other plugins around (and more will probably be written in the future), most of them concentrate on either optimizations or code generation (a good example of the latter is LLVM’s DragonEgg plugin). The only exception is probably Mozilla’s Dehydra/Treehydra set of plugins. However, Dehydra simply exposes a flattened subset of GCC’s AST as a set of JavaScript objects (for example, there is no namespace or #include information). Treehydra relies on GIMPLE, which is a representation one level below the parsed C++ (that is, closer to the machine code).

As a result, there isn’t much information, and there are few source code examples, that show how to work with GCC’s C++ AST. Since I have already figured out most of the basics, I am thinking about writing a series of blog posts that show how to use GCC plugins to parse C++. What you do based on this information is up to you. Some of the potential applications include static analysis, (source) code generation, documentation generation, binding to other languages, editor/IDE support, etc. In today’s post I am going to show how to set up the plugin infrastructure for this kind of task. Future posts would cover various aspects of working with GCC’s AST, so if you would like to read more on this topic, drop a line in the comments; if there is enough interest, I will write more on GCC plugins.

The GCC plugin API is covered in Chapter 23, “Plugins”, of the GCC Internals documentation. As described in this chapter, there are several compilation events (or phases) that the plugin can register for. Unfortunately, none of the existing events is suitable for the kind of task that we want to perform. What we want is to be called just after the AST has been constructed and before any other passes are performed. We don’t want to perform any other passes since that would only be a waste of time. All we need is the C++ AST. At first it may seem that PLUGIN_FINISH_UNIT is a good place to run our code. However, a number of passes are performed before this event is triggered (you can test this by registering a callback for the PLUGIN_OVERRIDE_GATE event, which will allow you to see all the passes that are being executed).

One way to achieve what we want would be to register a callback for the PLUGIN_OVERRIDE_GATE event. This callback is called before every pass and it allows the plugin to decide whether to run the pass in question. The first call to this callback will then by definition be before any other pass has run. We can then call our code from this first execution of the callback and then terminate GCC. Here is the skeleton for this callback:

extern "C" void
gate_callback (void* gcc_data, void*)
{
  // If there were errors during compilation,
  // let GCC handle the exit.
  //
  if (errorcount || sorrycount)
    return;
 
  int r (0);
 
  //
  // Process AST. Issue diagnostics and set r
  // to 1 in case of an error.
  //
 
  // Terminate GCC.
  //
  exit (r);
}

errorcount and sorrycount are GCC counters for errors and for “sorry, unimplemented” diagnostics, respectively. The plugin API includes all the internal GCC headers, so a plugin can access all the data and call all the functions that the code in the GCC compiler itself can.

Now we have set up the entry point for our plugin in the overall compilation process. There is, however, another thing that we need to take care of: the compiler output. When you execute something like this:

g++ -fplugin=plugin.so -c test.cxx

g++ isn’t the executable that will actually load plugin.so. g++ is a compiler driver that runs several other programs under the hood in order to translate test.cxx to test.o (use the -v option to see what’s actually being executed by g++). It first runs the program called cc1plus which is the actual C++ compiler and which will load the plugin. The output of cc1plus is an assembly file. Once the assembly file is generated, g++ invokes as to translate the assembly file to test.o.

Our plugin is altering the GCC compilation process. Instead of the assembly file we want to generate something else (or maybe no output files at all in case of a static analysis tool). Do you see the problem now? While our plugin is producing some other output, g++ assumes it will produce an assembly file which it will then try to pass to the assembler.

While we can try to invoke cc1plus directly, it is an internal program of GCC and is invoked by g++ with some additional options which we would rather not deal with. Instead, we can ask g++ to produce an assembly file by passing -S instead of -c. In this case g++ is not going to invoke the assembler and nobody will care that the output assembly file does not exist.

So this part is sorted out then. Well, not quite. While we terminate GCC quite early, before any assembly can actually be generated, the output assembly file is still created. To get rid of this file we need to add the following line in our plugin_init():

asm_file_name = HOST_BIT_BUCKET;

HOST_BIT_BUCKET is defined as "/dev/null". Here is the complete source code for the skeleton of our plugin:

// GCC header includes to get the parse tree
// declarations. The order is important and
// doesn't follow any kind of logic.
//
 
#include <stdlib.h>
#include <gmp.h>
 
#include <cstdlib> // Include before GCC poisons
                   // some declarations.
 
extern "C"
{
#include "gcc-plugin.h"
 
#include "config.h"
#include "system.h"
#include "coretypes.h"
#include "tree.h"
#include "intl.h"
 
#include "tm.h"
 
#include "diagnostic.h"
#include "c-common.h"
#include "c-pragma.h"
#include "cp/cp-tree.h"
}
 
#include <iostream>
 
using namespace std;
 
int plugin_is_GPL_compatible;
 
extern "C" void
gate_callback (void*, void*)
{
  // If there were errors during compilation,
  // let GCC handle the exit.
  //
  if (errorcount || sorrycount)
    return;
 
  int r (0);
 
  //
  // Process AST. Issue diagnostics and set r
  // to 1 in case of an error.
  //
  cerr << "processing " << main_input_filename << endl;
 
  exit (r);
}
 
extern "C" int
plugin_init (plugin_name_args* info,
             plugin_gcc_version* ver)
{
  int r (0);
 
  cerr << "starting " << info->base_name << endl;
 
  //
  // Parse options if any.
  //
 
  // Disable assembly output.
  //
  asm_file_name = HOST_BIT_BUCKET;
 
  // Register callbacks.
  //
  register_callback (info->base_name,
                     PLUGIN_OVERRIDE_GATE,
                     &gate_callback,
                     0);
  return r;
}

You can compile and try it out like so:

$ g++-4.5 -I`g++-4.5 -print-file-name=plugin`/include \
-fPIC -shared plugin.cxx -o plugin.so
 
$ g++-4.5 -S -fplugin=./plugin.so test.cxx
starting plugin
processing test.cxx

Update: Starting with version 4.7.0, GCC can be built either in C or C++ mode. And starting with version 4.8.0, it is always built as C++. If you try to run the above example using GCC built in the C++ mode, you will get an error saying that the plugin cannot be loaded because one or more symbols are undefined. The reason for this error is that now all the GCC symbols have C++ linkage while we include them as extern "C". The solution to this problem is to remove the extern "C" { } block around the include directives at the beginning of our plugin source code (note that the functions that follow the includes, gate_callback() and plugin_init(), should still remain extern "C").

Another option that you will probably want to add to the plugin invocation is -x c++. It tells GCC that what’s being compiled is C++ regardless of the file extension. This is useful if you plan to compile, for example, C++ header files (in this case and without this option, GCC will try to generate a precompiled header instead of an assembly file). Having to remember to specify the two options (-S -x c++) could be quite inconvenient for the users of our plugin.

The plugin can also have options of its own which are specified on the g++ command line in the following form:

-fplugin-arg-<plugin-name>-<key>[=<value>]

This is quite verbose and can also become a major inconvenience for the users of our plugin. To address the above two problems it makes sense to create a driver for our plugin, similar to how g++ is a driver for cc1plus. The driver will automatically pass the -S -x c++ -fplugin=./plugin.so options to g++ and convert plugin options to the -fplugin-arg- format before passing them to g++.
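Whichever way the options reach g++, on the plugin side they arrive in the plugin_name_args structure that is passed to plugin_init(): argc is the number of -fplugin-arg- options and argv is an array of key/value pairs, where value is null if the =<value> part was omitted. Here is a minimal sketch of collecting them into a map. So that it compiles outside of GCC, the two structs below are cut-down stand-ins that mirror only the relevant fields from gcc-plugin.h; in a real plugin you would include the header instead. The collect_options() helper and the choice of std::map are my assumptions, made purely for illustration.

```cpp
#include <map>
#include <string>

// Stand-ins that mirror the relevant fields from gcc-plugin.h so
// that this sketch compiles standalone. In a real plugin, include
// "gcc-plugin.h" and drop these two definitions.
//
struct plugin_argument
{
  char* key;   // Option name, for example, "foo".
  char* value; // Option value or null if none was given.
};

struct plugin_name_args
{
  char* base_name;       // Plugin name, for example, "plugin".
  int argc;              // Number of -fplugin-arg-... options.
  plugin_argument* argv; // Array of argc key/value pairs.
};

typedef std::map<std::string, std::string> options;

// Hypothetical helper: copy the plugin options into a map. Options
// without a value are stored with an empty string.
//
options
collect_options (const plugin_name_args& info)
{
  options r;

  for (int i (0); i < info.argc; ++i)
  {
    const plugin_argument& a (info.argv[i]);
    r[a.key] = (a.value != 0 ? a.value : "");
  }

  return r;
}
```

In plugin_init() this would be called as collect_options (*info) in the "Parse options if any" spot of the skeleton above.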

For my project I wrote a plugin driver that uses the following conventions. The driver recognizes commonly used options such as -I, -D, etc., and passes them to g++ as is. In addition, the -x option can be used to pass extra options to g++ (for example, -x -m32). If the argument to -x does not start with ‘-’, then it is treated as the g++ executable name. Everything else is converted to the -fplugin-arg- format and passed as plugin options, which are then handled in the plugin code with the help of cli. So if you execute:

driver -x g++-4.5 -x -m32 --foo bar test.cxx

Then the g++ command line will look like this:

g++-4.5 -m32 -S -x c++ -fplugin=./plugin.so \
-fplugin-arg-plugin-foo=bar test.cxx
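To make the conventions more concrete, here is a rough sketch of how such a translation could be written. This is not the code of my actual driver: the make_command() function, its parameters, and the value_options set (the names of plugin options that take a value) are all hypothetical. It also assumes that plugin options are spelled with a leading -- and that anything else not recognized is an input file.

```cpp
#include <set>
#include <string>
#include <vector>

typedef std::vector<std::string> strings;

// Hypothetical sketch: translate driver arguments into a g++ command
// line. value_options lists the plugin options that take a value (a
// real driver would get this from its option specification).
//
strings
make_command (const strings& args,
              const std::string& plugin, // For example, "plugin".
              const std::set<std::string>& value_options)
{
  std::string gxx ("g++");
  strings extra, plugin_args, files;

  for (strings::size_type i (0); i < args.size (); ++i)
  {
    const std::string& a (args[i]);

    if (a == "-x" && i + 1 < args.size ())
    {
      // Either an extra g++ option or the g++ executable name.
      //
      const std::string& v (args[++i]);

      if (!v.empty () && v[0] == '-')
        extra.push_back (v);
      else
        gxx = v;
    }
    else if (a.compare (0, 2, "-I") == 0 || a.compare (0, 2, "-D") == 0)
    {
      // Common options (joined form only, for example, -Idir) are
      // passed as is.
      //
      extra.push_back (a);
    }
    else if (a.compare (0, 2, "--") == 0)
    {
      // Plugin option: --foo bar becomes -fplugin-arg-<plugin>-foo=bar.
      //
      std::string k (a, 2);
      std::string o ("-fplugin-arg-" + plugin + "-" + k);

      if (value_options.count (k) != 0 && i + 1 < args.size ())
        o += "=" + args[++i];

      plugin_args.push_back (o);
    }
    else
      files.push_back (a); // Assumed to be an input file.
  }

  strings r;
  r.push_back (gxx);
  r.insert (r.end (), extra.begin (), extra.end ());
  r.push_back ("-S");
  r.push_back ("-x");
  r.push_back ("c++");
  r.push_back ("-fplugin=./" + plugin + ".so");
  r.insert (r.end (), plugin_args.begin (), plugin_args.end ());
  r.insert (r.end (), files.begin (), files.end ());
  return r;
}
```

With foo listed in value_options, the arguments -x g++-4.5 -x -m32 --foo bar test.cxx yield the g++ command line shown above. A real driver would then pass the resulting vector to something like execvp().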

And that’s it for today. Remember to drop a line in the comments if you would like to read more about parsing C++ with GCC plugins.