Parsing C++ with GCC plugins, Part 3

This is the third installment in the series of posts about parsing C++ with GCC plugins. In the previous post we covered the basics of the GCC AST (abstract syntax tree) as well as learned how to traverse all the declarations in the translation unit. This post is dedicated to types. In particular, we will learn how to access various parts of the class definition, such as its bases, member variables, member functions, nested type declarations, etc. At the end we will have a working plugin that prints all this information for every class defined in the translation unit.

All type nodes in the GCC AST have tree codes that end with _TYPE. To get a type node from a declaration node we use the TREE_TYPE macro. If a declaration has no type, as is the case for NAMESPACE_DECL, then this macro returns NULL. Here is how we can improve the print_decl() function from the previous post to also print the tree code of the declaration’s type:

void
print_decl (tree decl)
{
  int tc (TREE_CODE (decl));
  tree id (DECL_NAME (decl));
  const char* name (id
                    ? IDENTIFIER_POINTER (id)
                    : "<unnamed>");
 
  cerr << tree_code_name[tc] << " " << name;
 
  if (tree t = TREE_TYPE (decl))
    cerr << " type " << tree_code_name[TREE_CODE (t)];
 
  cerr << " at " << DECL_SOURCE_FILE (decl)
       << ":" << DECL_SOURCE_LINE (decl) << endl;
}

If we now run the modified plugin on the following C++ code fragment:

class c {};
typedef const c* p;
int i;

We will get the following output:

type_decl c type record_type at test.cxx:1
type_decl p type pointer_type at test.cxx:2
var_decl i type integer_type at test.cxx:3

The most commonly seen AST types can be divided into three categories:

Fundamental Types
  • VOID_TYPE
  • REAL_TYPE
  • BOOLEAN_TYPE
  • INTEGER_TYPE
Derived Types
  • POINTER_TYPE
  • REFERENCE_TYPE
  • ARRAY_TYPE
User-Defined Types
  • RECORD_TYPE
  • UNION_TYPE
  • ENUMERAL_TYPE

Some node types, such as REAL_TYPE and INTEGER_TYPE, cover several fundamental types. In this case the AST has a separate node instance for each specific fundamental type. For example, the integer_type_node is a global variable that holds a pointer to the INTEGER_TYPE node corresponding to the int type. For the derived types (here the term derived type means pointer, reference, or array type rather than C++ class inheritance), the TREE_TYPE macro returns the pointed-to, referenced, or element type, respectively. The RECORD_TYPE nodes represent struct and class types.
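
To make the "peel one layer at a time" behavior of TREE_TYPE on derived types more concrete, here is a small self-contained sketch. Note that toy_code, toy_type, and describe() are stand-ins invented for this example so that it compiles without the GCC headers; in a real plugin you would switch on TREE_CODE (t) and recurse on TREE_TYPE (t) instead.

```cpp
#include <cassert>
#include <string>

// Toy stand-ins for GCC tree codes and type nodes (hypothetical,
// not GCC's actual representation).
enum toy_code
{
  toy_integer_type,
  toy_pointer_type,
  toy_reference_type,
  toy_array_type
};

struct toy_type
{
  toy_code code;
  const toy_type* inner; // Plays the role of TREE_TYPE for derived types.
};

// Describe a type by peeling off one derived layer per recursive call.
// The only fundamental type in this toy model is int.
std::string
describe (const toy_type* t)
{
  assert (t != 0);

  switch (t->code)
  {
  case toy_pointer_type:
    return "pointer to " + describe (t->inner);
  case toy_reference_type:
    return "reference to " + describe (t->inner);
  case toy_array_type:
    return "array of " + describe (t->inner);
  default:
    return "int";
  }
}
```

For an array whose element type is a pointer to int, describe() walks array, then pointer, then int, which mirrors applying TREE_TYPE twice to the ARRAY_TYPE node.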

You might also expect that GCC has a separate node kind to represent const/volatile/restrict-qualified (cvr-qualified) types. This is not the case. Instead, each type node carries its own set of cvr-qualifiers. So when the source code uses a const variant of some type, GCC creates a copy of the original type node and sets the const-qualifier on the copy. To check whether a type has one of the qualifiers set, you can use the CP_TYPE_CONST_P, CP_TYPE_VOLATILE_P, and CP_TYPE_RESTRICT_P macros.

The above design decision has one important implication: the AST can contain multiple type nodes for the same C++ type. In fact, according to the GCC documentation, the copies may not even have different cvr-qualifiers. In other words, the AST can use two identical nodes to represent the same type for no apparent reason. As a result, you shouldn’t use tree node pointer comparison to decide whether you are dealing with the same type. Instead, the GCC documentation recommends that you use the same_type_p predicate.

One macro that is especially useful in dealing with the multiple nodes situation is TYPE_MAIN_VARIANT. This macro returns the primary, cvr-unqualified type from which all the cvr-qualified and other copies have been made. In particular, this macro allows you to use the type node pointer in a set or as a map key, which is not possible with same_type_p.
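
Here is a rough sketch of the map key idea. It is a self-contained toy model rather than plugin code: qual_type, main_variant(), remember(), and lookup() are names made up for this example, and the main_variant member merely plays the role that TYPE_MAIN_VARIANT plays for real tree nodes.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy stand-in for a GCC type node (hypothetical; a plugin would
// use tree nodes instead).
struct qual_type
{
  bool is_const;
  const qual_type* main_variant; // Plays the role of TYPE_MAIN_VARIANT.
};

// Canonicalize to the unqualified node before using the pointer as a key.
const qual_type*
main_variant (const qual_type* t)
{
  assert (t != 0 && t->main_variant != 0);
  return t->main_variant;
}

typedef std::map<const qual_type*, std::string> type_names;

// Record a name for a type, whichever variant we were handed.
void
remember (type_names& m, const qual_type* t, const std::string& name)
{
  m[main_variant (t)] = name;
}

// Return the recorded name or an empty string if there is none.
std::string
lookup (const type_names& m, const qual_type* t)
{
  type_names::const_iterator i (m.find (main_variant (t)));
  return i != m.end () ? i->second : std::string ();
}
```

Because both the plain node and its const copy map to the same main variant, a name recorded through one is found through the other, and the map ends up with a single entry for the type.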

Let’s now concentrate on the RECORD_TYPE nodes which represent the class types. The first thing that you will probably want to do once you are handed a class node is to find its name. Well, that’s actually a fairly tricky task in the GCC AST. In fact, I would say it is the most convoluted area, outdone, maybe, only by the parts of the AST dealing with C++ templates. Let’s try to unravel this from the other side, namely the type declaration side.

In the GCC AST types don’t have names. Instead, types are declared to have names using type declarations (TYPE_DECL tree node). This may seem unnatural to you since in C++ user-defined types do have names, for example:

class c {};

While that’s true, the AST treats the above declaration as if it were written like this:

typedef class {} c;

The problem with this approach is how to distinguish the following two cases:

class c {}; // AST: typedef class {} c;
typedef c t;

To distinguish such cases, the TYPE_DECL nodes that are “imagined” by the compiler are marked as artificial, which can be tested with the DECL_ARTIFICIAL macro. Let’s add the print_class() function and modify print_decl() to test this out:

void
print_class (tree type)
{
  cerr << "class ???" << endl;
}
 
void
print_decl (tree decl)
{
  tree type (TREE_TYPE (decl));
  int dc (TREE_CODE (decl));
  int tc;
 
  if (type)
  {
    tc = TREE_CODE (type);
 
    if (dc == TYPE_DECL && tc == RECORD_TYPE)
    {
      // If DECL_ARTIFICIAL is true this is a class
      // declaration. Otherwise this is a typedef.
      //
      if (DECL_ARTIFICIAL (decl))
      {
        print_class (type);
        return;
      }
    }
  }
 
  tree id (DECL_NAME (decl));
  const char* name (id
                    ? IDENTIFIER_POINTER (id)
                    : "<unnamed>");
 
  cerr << tree_code_name[dc] << " " << name;
 
  if (type)
    cerr << " type " << tree_code_name[tc];
 
  cerr << " at " << DECL_SOURCE_FILE (decl)
       << ":" << DECL_SOURCE_LINE (decl) << endl;
}

If we now run this modified version of our plugin on the above two declarations, we will get:

class ???
type_decl t type record_type at test.cxx:2

OK, so this works as expected. Now how can we get the name of the class from the RECORD_TYPE node? In the above code we could have passed the declaration node along with the type node to the print_class() function. But that’s not very elegant and is not always possible, as we will see in a moment. Instead, we can use the TYPE_NAME macro to get to the type’s declaration.

There are a couple of caveats, however. First, remember that the same type can have multiple tree nodes in the AST, and you can get different declarations for different type nodes denoting the same type. Second, the same type node can be declared with multiple declarations. For example, there could be multiple typedef names for the same type. So which declaration are we going to get? There is no simple answer to this question. However, if you get the primary type with TYPE_MAIN_VARIANT and then get its declaration with TYPE_NAME, and the type was named in the source code, then this will be the artificial declaration that we talked about before. Here is the new implementation of print_class() that uses this technique:

void
print_class (tree type)
{
  type = TYPE_MAIN_VARIANT (type);
 
  tree decl (TYPE_NAME (type));
  tree id (DECL_NAME (decl));
  const char* name (IDENTIFIER_POINTER (id));
 
  cerr << "class " << name << " at "
       << DECL_SOURCE_FILE (decl) << ":"
       << DECL_SOURCE_LINE (decl) << endl;
}

Running this version of the plugin on the above code fragment produces the expected output:

class c at test.cxx:1
type_decl t type record_type at test.cxx:2

Let’s now print some more information about the class. Things that we may be interested in include base classes, member variables, member functions, and nested type declarations. We will start with the list of base classes. The direct base classes of a class are represented as a vector of BINFO tree nodes, which hangs off the class’s own BINFO node obtained with the TYPE_BINFO macro. To get the number of elements in this vector we use the BINFO_N_BASE_BINFOS macro. To get the Nth element we use the BINFO_BASE_BINFO macro. The macros that we can use on a base’s BINFO node include BINFO_VIRTUAL_P, which returns true if the base is virtual, and BINFO_TYPE, which returns the tree node for the base type itself.

Naturally, you may also expect that there is a macro named something like BINFO_ACCESS which returns the access specifier (public, protected, or private) for the base. If so, then you haven’t really gotten the spirit of the GCC AST design yet: if something would feel simple and intuitive, then find a way to make it convoluted and surprising. So, no, there is no macro to get the access specifier from the base’s BINFO node. In fact, this information is not even stored in that node. Rather, it is stored in a vector that runs parallel to the vector of base BINFO nodes. The Nth element in this vector can be accessed with the BINFO_BASE_ACCESS macro. The following code fragment shows how to put all this information together:

enum access_spec
{
  public_, protected_, private_
};
 
const char* access_spec_str[] =
{
  "public", "protected", "private"
};
 
void
print_class (tree type)
{
  type = TYPE_MAIN_VARIANT (type);
 
  ...
 
  // Traverse base information.
  //
  tree biv (TYPE_BINFO (type));
  size_t n (biv ? BINFO_N_BASE_BINFOS (biv) : 0);
 
  for (size_t i (0); i < n; i++)
  {
    tree bi (BINFO_BASE_BINFO (biv, i));
 
    // Get access specifier.
    //
    access_spec a (public_);
 
    if (BINFO_BASE_ACCESSES (biv))
    {
      tree ac (BINFO_BASE_ACCESS (biv, i));
 
      if (ac == 0 || ac == access_public_node)
        a = public_;
      else if (ac == access_protected_node)
        a = protected_;
      else
        a = private_;
    }
 
    bool virt (BINFO_VIRTUAL_P (bi));
    tree b_type (TYPE_MAIN_VARIANT (BINFO_TYPE (bi)));
    tree b_decl (TYPE_NAME (b_type));
    tree b_id (DECL_NAME (b_decl));
    const char* b_name (IDENTIFIER_POINTER (b_id));
 
    cerr << "\t" << access_spec_str[a]
         << (virt ? " virtual" : "")
         << " base " << b_name << endl;
  }
}

The list of member variable and nested type declarations can be obtained with the TYPE_FIELDS macro. It is a chain of *_DECL nodes, similar to namespaces. The declarations that can appear on this list include FIELD_DECL (non-static member variable declaration), VAR_DECL (static member variables), and TYPE_DECL (nested type declarations).

The list of member functions can be obtained with the TYPE_METHODS macro and can only contain the FUNCTION_DECL nodes. To determine if a function is static, use the DECL_STATIC_FUNCTION_P predicate. Other useful member function predicates include: DECL_CONSTRUCTOR_P, DECL_COPY_CONSTRUCTOR_P, and DECL_DESTRUCTOR_P.

To determine the access specifier for a member declaration you can use the TREE_PRIVATE and TREE_PROTECTED macros (note that TREE_PUBLIC appears to be used for a different purpose).
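
Since there are only two flags, a declaration with neither of them set is public. This mapping can be sketched as follows. The member_flags struct and the member_access() function are hypothetical stand-ins so that the example compiles without the GCC headers; in a plugin the two values would come from TREE_PRIVATE (decl) and TREE_PROTECTED (decl).

```cpp
#include <cassert>
#include <string>

// Stand-in for the two access flags of a member declaration
// (hypothetical; not a GCC type).
struct member_flags
{
  bool private_flag;   // What TREE_PRIVATE (decl) would return.
  bool protected_flag; // What TREE_PROTECTED (decl) would return.
};

// Map the two flags to an access specifier. This sketch assumes that
// at most one flag is set and that no flag at all means public.
std::string
member_access (const member_flags& f)
{
  assert (!(f.private_flag && f.protected_flag));

  if (f.private_flag)
    return "private";

  if (f.protected_flag)
    return "protected";

  return "public";
}
```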

As with namespaces, the order of declarations on these lists is not preserved, so if we want to traverse them in the source code order, we need to employ the same technique that we used for traversing namespaces. The following code fragment shows how we can print some information about class members:

void
print_class (tree type)
{
  type = TYPE_MAIN_VARIANT (type);
 
  ...
 
  // Traverse members.
  //
  decl_set set;
 
  for (tree d (TYPE_FIELDS (type));
       d != 0;
       d = TREE_CHAIN (d))
  {
    switch (TREE_CODE (d))
    {
    case TYPE_DECL:
      {
        if (!DECL_SELF_REFERENCE_P (d))
          set.insert (d);
        break;
      }
    case FIELD_DECL:
      {
        if (!DECL_ARTIFICIAL (d))
          set.insert (d);
        break;
      }
    default:
      {
        set.insert (d);
        break;
      }
    }
  }
 
  for (tree d (TYPE_METHODS (type));
       d != 0;
       d = TREE_CHAIN (d))
  {
    if (!DECL_ARTIFICIAL (d))
      set.insert (d);
  }
 
  for (decl_set::iterator i (set.begin ()), e (set.end ());
       i != e; ++i)
  {
    print_decl (*i);
  }
}
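
The decl_set type used above comes from the previous post and is not repeated here. As a reminder of the idea, here is a minimal self-contained sketch of a container that hands declarations back in source order. The toy_decl struct and its location member are stand-ins invented for this example; in the plugin the set would hold tree nodes and the comparator would presumably compare their source locations (for instance, as returned by DECL_SOURCE_LOCATION).

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Toy stand-in for a declaration node (hypothetical; not a GCC type).
struct toy_decl
{
  std::string name;
  unsigned int location; // Position in the source, e.g., a line number.
};

// Order declarations by their position in the source.
struct toy_decl_comparator
{
  bool
  operator() (const toy_decl* x, const toy_decl* y) const
  {
    assert (x != 0 && y != 0);
    return x->location < y->location;
  }
};

// A multiset so that two declarations at the same location are both kept.
typedef std::multiset<const toy_decl*, toy_decl_comparator> toy_decl_set;

// Collect the names in the order the set iterates them, that is,
// in source order regardless of insertion order.
std::vector<std::string>
names_in_source_order (const toy_decl_set& s)
{
  std::vector<std::string> r;

  for (toy_decl_set::const_iterator i (s.begin ()), e (s.end ());
       i != e; ++i)
    r.push_back ((*i)->name);

  return r;
}
```

Inserting the members from two separate chains (fields first, then functions) into one such set is what lets the final loop print them interleaved, in the order they appear in the class definition.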

We can now try to run all this code on a C++ class that has some bases and members, for example:

class b1 {};
class b2 {};
class c: protected b1,
         public virtual b2
{
  int i;
  static int s;
  void f ();
  c (int);
  ~c ();
  typedef int t;
  class n {};
};

And below is the output from our plugin. Here we use the version that prints fully-qualified names for declarations:

class ::b1 at test.cxx:1
class ::b2 at test.cxx:2
var_decl ::_ZTI1c type record_type at test.cxx:5
class ::c at test.cxx:5
        protected base ::b1
        public virtual base ::b2
field_decl ::c::i type integer_type at test.cxx:6
var_decl ::c::s type integer_type at test.cxx:7
function_decl ::c::f type method_type at test.cxx:8
function_decl ::c::c type method_type at test.cxx:9
function_decl ::c::__base_ctor  type method_type at test.cxx:9
function_decl ::c::__comp_ctor  type method_type at test.cxx:9
function_decl ::c::c type method_type at test.cxx:10
function_decl ::c::__base_dtor  type method_type at test.cxx:10
function_decl ::c::__comp_dtor  type method_type at test.cxx:10
type_decl ::c::t type integer_type at test.cxx:11
class ::c::n at test.cxx:12

Figuring out what the _ZTI1c, __base_ctor, __comp_ctor, __base_dtor, and __comp_dtor declarations are is left as an exercise for the reader.

And that’s it for today as well as for the series. There are a number of GCC AST areas, such as C++ templates, function declarations, function bodies, #include information, custom #pragmas and attributes, etc., that haven’t been covered. However, I believe the GCC plugin and AST basics that were discussed in this and the two previous posts should be sufficient to get you started should you need to parse some C++.

If you have any questions, comments, or know the answer to the exercise above, you are welcome to leave them below. The complete source code for the plugin we have developed in this post is available as the plugin-3.cxx file.
