Archive for September, 2009

CLI in C++: Separate vs Embedded Runtime

Monday, September 21st, 2009

This is the ninth installment in the series of posts about designing a Command Line Interface (CLI) parser for C++. The previous posts were:

In the last post we discussed the mapping of our CLI language to C++ and established that there will be some common support code, such as exception definitions, that we will need to place somewhere. There are two places where we can keep this code: we can either create a separate runtime library that contains the support code and on which the generated code depends, or we can embed this code directly into the generated C++ files. Today we are going to consider the pros and cons of each approach.

The embedded runtime has the following advantages compared to the separate runtime library:

  • No external dependencies
  • Simple cases will not require extra generated files
  • Can have source code (compared to a header-only library)
  • Can easily support various naming conventions
  • Can minimize the code by only generating what’s needed
  • No runtime/generated code version mismatches
  • Makes the use of the generated code in CLI much easier

Let’s consider each of these points in order. The embedded runtime will not require inclusion of any external headers or linking to any external libraries other than the C++ standard library. This will make the adoption of CLI very easy, in fact, easier than adopting a header-only library. All that needs to be done is to generate the C++ files from the CLI definition and add them to the application source code. This is especially important for relatively inconsequential functionality such as command line parsing. The requirement to add an extra dependency, even a header-only one, may outweigh all the benefits that the CLI compiler will bring.

Most applications that use the CLI language will only have one options file. When we have only one file we can generate the runtime code into the same set of C++ files as the one containing the option class(es). Things are a bit more complicated when we have multiple options files. Here we cannot generate the runtime code directly into the resulting C++ files because this will lead to re-definitions (if two generated header files are included into the same translation unit) or duplicate symbols at link time. Instead, we will need to generate the runtime code into a separate set of C++ files and then include the resulting header into the other generated files. For example, we could have the --generate-runtime file option which instructs the compiler to generate the runtime code in a separate set of C++ files and the --runtime file option which tells the compiler that the runtime is in these C++ files.
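To make the multi-file case a bit more concrete, here is a rough sketch of what a separately generated runtime header could contain. The include guard name, the cli namespace, and the unknown_option class are assumptions made for illustration; the actual generated code may look quite different. The include guard is what prevents re-definitions when several generated headers pull the runtime into the same translation unit.

```cpp
// Hypothetical contents of a separately generated runtime header.
// All names here are illustrative assumptions, not actual compiler output.
#ifndef CLI_RUNTIME_HXX
#define CLI_RUNTIME_HXX

#include <exception>
#include <string>

namespace cli
{
  // Common base for exceptions thrown by the generated parsing code.
  class exception: public std::exception
  {
  };

  // Thrown when the command line contains an option that is not
  // described in the options file.
  class unknown_option: public exception
  {
  public:
    explicit unknown_option (const std::string& option)
        : option_ (option)
    {
    }

    const std::string& option () const
    {
      return option_;
    }

    const char* what () const noexcept override
    {
      return "unknown option";
    }

  private:
    std::string option_;
  };
}

#endif // CLI_RUNTIME_HXX
```

Every other generated header would then start with an #include of this file instead of carrying its own copy of the definitions.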

The embedded runtime can have C++ source code unlike a header-only external runtime library. We would want to restrict the external runtime to be a header-only library in order to simplify adoption, since a header-only library does not require building. However, this restriction may force us to declare certain functions inline even if they shouldn’t normally be inlined because of the potential code bloat.

One common complaint about generated code in general is that it fits poorly with the hand-written code. The major reason for this is that the generated code often doesn’t follow the same identifier naming convention as the one used in the project. For example, the project may be using “upper camel case” for type names (e.g., SimpleName) while the generated code uses the standard C++ lower case and underscores (e.g., simple_name). There is no technical reason (except for, maybe, complexity) why a code generator can’t support configurable naming conventions. In fact, that’s what we did in the C++/Tree mapping in XSD and it made a lot of people very happy. The only problem is that it is virtually impossible to support configurable naming conventions in a hand-written runtime library. But it should be quite easy to do with the embedded runtime since it is also generated by the compiler.
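As a small illustration of why this is easy for a code generator, the sketch below converts an identifier from the lower case and underscores convention to upper camel case before it is written to the output. The naming enumeration and the convert_name function are hypothetical names invented for this example; they are not taken from XSD or from the CLI compiler.

```cpp
#include <cctype>
#include <string>

// Naming conventions that this hypothetical generator can emit.
enum class naming
{
  lower_underscore, // simple_name
  upper_camel       // SimpleName
};

// Convert a name written in the lower_underscore convention to the
// requested convention. The generator would call this for every
// identifier that it emits, including the ones in the runtime code.
std::string convert_name (const std::string& name, naming style)
{
  if (style == naming::lower_underscore)
    return name;

  std::string r;
  bool upper = true; // Capitalize the first character of each word.

  for (char c: name)
  {
    if (c == '_')
    {
      upper = true; // Drop the underscore, capitalize what follows.
      continue;
    }

    r += upper
      ? static_cast<char> (std::toupper (static_cast<unsigned char> (c)))
      : c;
    upper = false;
  }

  return r;
}
```

With this in place, convert_name ("simple_name", naming::upper_camel) yields SimpleName, while the lower_underscore style returns the name unchanged.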

Because the code for the embedded runtime is generated for each application, we can minimize the output by omitting unused optional components. We can also decide whether to generate certain functions inline based on the application developer preferences.
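The following sketch shows how a generator could make both decisions while emitting the runtime. The runtime_options structure, the generate_runtime function, and the emitted class and function names are all assumptions made for this example. It also gives a feel for what runtime code looks like when it lives inside the compiler as string literals.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical knobs that control what runtime code gets emitted.
struct runtime_options
{
  bool unknown_option = true;    // Emit the unknown_option exception.
  bool inline_functions = false; // Mark emitted functions as inline.
};

// Write the runtime code to the output stream, skipping the optional
// parts that this particular application does not need.
void generate_runtime (std::ostream& os, const runtime_options& o)
{
  const char* inl = o.inline_functions ? "inline " : "";

  os << "namespace cli\n"
     << "{\n"
     << "  class exception {};\n";

  if (o.unknown_option)
    os << "  class unknown_option: public exception {};\n";

  // A made-up helper, used only to show the inline preference.
  os << "  " << inl << "bool is_option (const char* s)\n"
     << "  {\n"
     << "    return s[0] == '-';\n"
     << "  }\n"
     << "}\n";
}
```

An application that never needs unknown_option would simply not get it in its generated files, which is something a hand-written library cannot offer without preprocessor tricks.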

Since with the embedded runtime there are no external dependencies, there are also no version mismatches that can occur when one of the components (generated code or runtime library) was upgraded and the other was not.

Finally, the embedded runtime approach makes it much easier to use the generated code in the CLI implementation itself. With the separate runtime library we will either have to keep an old copy around or risk breaking the generated code with backwards-incompatible changes that occur during development.

The embedded runtime approach also has a number of disadvantages:

  • Hard to develop and maintain
  • Bug fixes to the runtime require compiler rebuild
  • Impractical for large runtimes

The embedded runtime is harder to develop and maintain than a separate runtime library. This is because the code has to be emitted by the compiler instead of simply sitting in a file. In particular, because the runtime code is embedded into the compiler source code as a collection of strings, it is a lot harder to read and write.

Fixing any bug that is found in the embedded runtime code will require a compiler rebuild. In the case of a header-only runtime library the same can be accomplished by patching a few files and recompiling the application.

Finally, the embedded runtime approach quickly becomes impractical as the size of the runtime code grows. The difficulty of development and maintenance is one reason. The other reason is the lack of separate compilation. All of the embedded runtime code is contained in a single generated C++ source file. As the amount of code in the runtime grows, this file takes longer and longer to compile.

Now, which approach should we use in our case? The CLI runtime is going to be pretty small, or, at least, I expect it to be. Initially it will contain a few exception definitions and maybe a few helper classes. So the size shouldn’t be an issue. On the other hand, as we discussed above, it is very important to make the generated code as easy to adopt as possible. Making it self-sufficient and dependency-free sounds very attractive. So it looks like in our situation the advantages of the embedded runtime significantly outweigh its disadvantages.

At this point we have covered enough ground to make the first usable release of the CLI compiler. Since the last status update I have started working on the backend infrastructure and at this stage the compiler is able to generate the output C++ files with all the #include directives and the proper namespace structure. It is all in the source repository if you would like to take a look. From now on I will be working on generating the C++ mapping. Once this is done, we should be in good shape to release cli-1.0.0. As always, if you have any thoughts, feel free to add them in the comments.

CLI in C++: Status Update

Sunday, September 6th, 2009

The semantic graph for the CLI language and the traversal mechanism are ready. I have also updated the parser to build the graph as the parsing progresses. If you would like to check it out, you can download the source distribution via these links:

See the first status update for more information on what to do with these files. This release adds a dependency on libcutl which is a small C++ utility library that I developed. If you are not using the cli+dep package, you will also need to get this library:

You can follow the development, including accessing a more detailed log of changes, via the git repository:

The semantic graph and the traversal mechanism are quite interesting pieces of software. I have developed the ideas behind them over several years, starting in CCF (CORBA Compiler Frontend), and then continuing in XSD and XSD/e. The semantic graph is an in-memory data structure that captures the semantics of a particular language in a way convenient for analysis and code generation. The traversal mechanism is similar to the visitor pattern except it is inheritance-aware and more flexible (more about this below). We can use it to implement passes over the semantic graph which perform various analyses, graph transformations, or code generation.

The semantic graph is a heterogeneous directed graph. Its nodes are concepts of a particular language. For our CLI language we have: namespace, class, option, C++ type, and expression. Edges represent relationships between nodes as dictated by the language semantics. For example, in our case, a namespace names one or more namespaces or classes. A class names one or more options. An option belongs to a type and can be initialized by an expression. So we have the following edges: names, belongs, and initialized. Notice that all nodes are nouns and all edges are verbs.

Some concepts in the language have a common behavior or trait. For example, both namespace and class are a kind of scope. Namespace, class, and option all have a name (or, more accurately, are named by a scope). To capture this, we define a few base nodes, such as scope, and nameable. The following listing shows the inheritance relationship between nodes:

scope: nameable
class: scope
namespace: scope
option: nameable

Each node and edge in the graph provides a way to navigate to other nodes/edges to which it is connected. For example, having a class node, we can get all its outgoing names edges and from each such edge get to the option node. While such manual navigation can be used to, say, generate code, it is quite tedious. It is much more convenient to use the traversal mechanism.
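To show what such manual navigation might look like, here is a much simplified sketch of the node hierarchy with a single edge type. This is not the actual CLI or libcutl code: the semantics namespace, the member names, and the option_names function are assumptions made for illustration, and the real graph also has the belongs and initialized edges.

```cpp
#include <string>
#include <vector>

namespace semantics
{
  struct node
  {
    virtual ~node () = default;
  };

  // Anything that can be named by a scope.
  struct nameable: node {};

  struct scope;

  // Edge: a scope names a nameable node under a particular name.
  struct names
  {
    std::string name;
    scope* from;
    nameable* to;
  };

  struct scope: nameable
  {
    std::vector<names*> names_edges; // Outgoing names edges.
  };

  struct class_: scope {};
  struct namespace_: scope {};
  struct option: nameable {};

  // Manual navigation: start at a class node, follow each outgoing
  // names edge, and collect the names of the option nodes we reach.
  std::vector<std::string> option_names (const class_& c)
  {
    std::vector<std::string> r;

    for (const names* e: c.names_edges)
    {
      if (dynamic_cast<const option*> (e->to) != nullptr)
        r.push_back (e->name);
    }

    return r;
  }
}
```

Even in this tiny example the loop and the type check have to be written by hand for every pass, which is the tedium that the traversal mechanism removes.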

For each node and edge type in the semantic graph there is a traverser class that implements common navigation patterns for this node or edge. For example, a namespace traverser by default iterates over and traverses all its names edges. The traversal mechanism also does automatic, inheritance-aware type matching. An example will help illustrate what this means. Suppose we are traversing a namespace node. It can name two kinds of nodes: classes and namespaces. If we don’t care which kind of node it is, we can pass the generic scope-based traverser. Because both namespace and class are a kind of scope, our scope traverser will be called for both kinds of objects.

But suppose we need to do something special for classes. Then we can pass a class-based traverser in addition to the scope-based one. Now classes will be handled by the class traverser (because it is a better match than the scope traverser) and the rest will be handled by the scope traverser. If you would like to see how this type matching works in a very simple example, take a look at the tests/compiler/traverser test in libcutl.
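The idea of picking the best match can be sketched in a few dozen lines. Note that this is not how libcutl implements it: here each traverser states its depth in the inheritance hierarchy by hand and the dispatcher simply picks the deepest traverser that matches, which is a simplification chosen to keep the example short. All the names below (traverser_base, dispatcher, scope_logger, and so on) are made up for this sketch.

```cpp
#include <string>
#include <vector>

struct node
{
  virtual ~node () = default;
};

struct scope: node {};
struct class_: scope {};
struct namespace_: scope {};

struct traverser_base
{
  virtual ~traverser_base () = default;
  virtual bool matches (const node&) const = 0;
  virtual int depth () const = 0; // Distance of the node type from node.
  virtual void dispatch (const node&) = 0;
};

// Traverser for nodes of type T (or derived from T). Depth is supplied
// by hand in this sketch; a real implementation would derive it from
// type information.
template <typename T, int Depth>
struct traverser: traverser_base
{
  bool matches (const node& n) const override
  {
    return dynamic_cast<const T*> (&n) != nullptr;
  }

  int depth () const override
  {
    return Depth;
  }

  void dispatch (const node& n) override
  {
    traverse (static_cast<const T&> (n));
  }

  virtual void traverse (const T&) = 0;
};

// Picks the most derived (deepest) matching traverser for a node.
struct dispatcher
{
  std::vector<traverser_base*> traversers;

  void dispatch (const node& n) const
  {
    traverser_base* best = nullptr;

    for (traverser_base* t: traversers)
    {
      if (t->matches (n) &&
          (best == nullptr || t->depth () > best->depth ()))
        best = t;
    }

    if (best != nullptr)
      best->dispatch (n);
  }
};

// Two example traversers that record which one was called.
struct scope_logger: traverser<scope, 1>
{
  explicit scope_logger (std::vector<std::string>& l): log (l) {}
  void traverse (const scope&) override { log.push_back ("scope"); }
  std::vector<std::string>& log;
};

struct class_logger: traverser<class_, 2>
{
  explicit class_logger (std::vector<std::string>& l): log (l) {}
  void traverse (const class_&) override { log.push_back ("class"); }
  std::vector<std::string>& log;
};
```

If both loggers are registered with the dispatcher, a class_ node goes to class_logger while a namespace_ node falls back to scope_logger. With only scope_logger registered, both kinds of nodes end up in the scope traverser.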

The implementation of the semantic graph and the traversal mechanism completes the so-called frontend part of our compiler. Now we are ready to move to the backend part which is where we generate the code. Next time we will have the long promised discussion of the pros and cons of self-sufficient generated code vs generated code that depends on a runtime library. I will also start working on the backend infrastructure, such as opening the output C++ files, writing include guards, generating #include directives, etc.