CLI in C++: Separate vs Embedded Runtime

Monday, September 21st, 2009

This is the ninth installment in the series of posts about designing a Command Line Interface (CLI) parser for C++.

In the last post we discussed the mapping of our CLI language to C++ and established that there will be some common support code, such as exception definitions, that we will need to place somewhere. There are two places where we can keep this code. We can create a separate runtime library that contains the support code and on which the generated code depends, or we can embed this code directly into the generated C++ files. Today we are going to consider the pros and cons of each approach.

The embedded runtime has the following advantages compared to the separate runtime library:

  • No external dependencies
  • Simple cases will not require extra generated files
  • Can have source code (compared to a header-only library)
  • Can easily support various naming conventions
  • Can minimize the code by only generating what’s needed
  • No runtime/generated code version mismatches
  • Makes the use of the generated code in CLI much easier

Let’s consider each of these points in order. The embedded runtime will not require inclusion of any external headers or linking to any external libraries other than the C++ standard library. This will make the adoption of CLI very easy, in fact, easier than adopting a header-only library. All that needs to be done is to generate the C++ files from the CLI definition and add them to the application source code. This is especially important for relatively inconsequential functionality such as command line parsing. The requirement to add an extra dependency, even a header-only one, may outweigh all the benefits that the CLI compiler brings.

Most applications that use the CLI language will only have one options file. When we have only one file we can generate the runtime code into the same set of C++ files as the one containing the option class(es). Things are a bit more complicated when we have multiple options files. In this case we cannot generate the runtime code directly into the resulting C++ files because this will lead to re-definitions (if two generated header files are included into the same translation unit) or duplicate symbols. In this case we will need to generate the runtime code into a separate set of C++ files and then include the resulting header into other generated files. For example, we could have the --generate-runtime file option which instructs the compiler to generate the runtime code in a separate set of C++ files and the --runtime file option which tells the compiler that the runtime is in these C++ files.

The embedded runtime can have C++ source code unlike a header-only external runtime library. We would want to restrict the external runtime to be a header-only library in order to simplify adoption, since a header-only library does not require building. However, this restriction may force us to declare certain functions inline even if they shouldn’t normally be inlined because of the potential code bloat.

One common complaint about generated code in general is that it fits poorly with the hand-written code. The major reason for this is that the generated code often doesn’t follow the same identifier naming convention as the one used in the project. For example, the project may be using “upper camel case” for type names (e.g., SimpleName) while the generated code uses the standard C++ lower case and underscores (e.g., simple_name). There is no technical reason (except for, maybe, complexity) why a code generator can’t support configurable naming conventions. In fact, that’s what we did in the C++/Tree mapping in XSD and it made a lot of people very happy. The only problem is that it is virtually impossible to support configurable naming conventions in a hand-written runtime library. But it should be quite easy to do with the embedded runtime since it is also generated by the compiler.
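To give an idea of how little is needed on the generator side, here is a minimal sketch of a name transformation that the compiler could apply to every type name it emits, runtime included. The function name to_upper_camel is hypothetical and the sketch only handles type names written in lower case with underscores; it is not how XSD or the CLI compiler actually implements this.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: convert a lower-case-with-underscores type name
// (e.g., simple_name) to upper camel case (e.g., SimpleName).
//
inline std::string
to_upper_camel (const std::string& name)
{
  std::string r;
  bool upper = true; // Capitalize the first character and any after '_'.

  for (std::string::size_type i = 0; i < name.size (); ++i)
  {
    char c = name[i];

    if (c == '_')
    {
      // Drop the underscore and capitalize whatever follows it.
      //
      upper = true;
      continue;
    }

    r += upper
      ? static_cast<char> (std::toupper (static_cast<unsigned char> (c)))
      : c;

    upper = false;
  }

  return r;
}
```

With a function like this in the generator, the same runtime exception could be emitted as unknown_option in one project and as UnknownOption in another.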

Because the code for the embedded runtime is generated for each application, we can minimize the output by omitting unused optional components. We can also decide whether to generate certain functions inline based on the application developer preferences.

Since with the embedded runtime there are no external dependencies, there are also no version mismatches that can occur when one of the components (generated code or runtime library) was upgraded and the other was not.

Finally, the embedded runtime approach makes it much easier to use the generated code in the CLI implementation itself. With the separate runtime library we will either have to keep an old copy around or risk breaking the generated code with backwards-incompatible changes that occur during development.

The embedded runtime approach also has a number of disadvantages:

  • Hard to develop and maintain
  • Bug fixes to the runtime require compiler rebuild
  • Impractical for large runtimes

The embedded runtime is harder to develop and maintain than a separate runtime library. This is because the code has to be emitted by the compiler instead of simply sitting in a file. In particular, because the runtime code is embedded into the compiler source code as a collection of strings, it is a lot harder to read and write.
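The following sketch illustrates both sides of this trade-off: the runtime lives in the compiler as string literals, which is awkward to read, but the compiler can also leave out the parts that a particular application does not need. The function names and the has_value_options parameter are assumptions made for illustration, and the emitted exception types are reduced to empty structs.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical generator function inside the CLI compiler. The runtime
// code is stored as string literals and written to the output stream.
//
inline void
generate_runtime (std::ostream& os, bool has_value_options)
{
  os << "namespace cli\n"
     << "{\n"
     << "  struct unknown_option {};\n";

  // If every option is a flag, there are no values to be missing or
  // invalid, so these two exceptions can be omitted.
  //
  if (has_value_options)
    os << "  struct missing_value {};\n"
       << "  struct invalid_value {};\n";

  os << "}\n";
}

// Convenience wrapper that returns the generated text as a string.
//
inline std::string
runtime_text (bool has_value_options)
{
  std::ostringstream os;
  generate_runtime (os, has_value_options);
  return os.str ();
}
```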

Fixing any bug that is found in the embedded runtime code will require a compiler rebuild. In case of a header-only runtime library the same can be accomplished by patching a few files and recompiling the application.

Finally, the embedded runtime approach quickly becomes impractical as the size of the runtime code grows. The difficulty of development and maintenance is one reason. The other reason is the lack of separate compilation. All of the embedded runtime code is contained in a single generated C++ source file. As the amount of code in the runtime grows, this file takes longer and longer to compile.

Now, which approach should we use in our case? The CLI runtime is going to be pretty small, or, at least, I expect it to be. Initially it will contain a few exception definitions and maybe a few helper classes. So the size shouldn’t be an issue. On the other hand, as we discussed above, it is very important to make the generated code as easy to adopt as possible. Making it self-sufficient and dependency-free sounds very attractive. So it looks like in our situation the advantages of the embedded runtime significantly outweigh its disadvantages.

At this point we have covered enough ground to make the first usable release of the CLI compiler. Since the last status update I have been working on the backend infrastructure, and at this stage the compiler is able to generate the output C++ files with all the #include directives and the proper namespace structure. It is all in the source repository if you would like to take a look. From now on I will be working on generating the C++ mapping. Once this is done, we should be in good shape to release cli-1.0.0. As always, if you have any thoughts, feel free to add them in the comments.

CLI in C++: CLI to C++ Mapping

Tuesday, August 4th, 2009

This is the eighth installment in the series of posts about designing a Command Line Interface (CLI) parser for C++.

In the last post we designed our CLI language which is a domain-specific language (DSL) for defining a program’s command line interface. Today we are going to see how to map the CLI language constructs to C++ constructs.

At the end of the previous post we had a list of high-level CLI language features which I am going to repeat here:

  • comments
  • namespaces
  • option class
  • option declaration
  • option inheritance
  • C++ inclusion
  • CLI inclusion
  • using declarations/directives and typedef’s
  • option documentation

We also agreed that only a subset of them will end up being supported in the initial release. But for the same reasons as with the CLI language itself, we are going to discuss the mapping of all of these constructs to C++ even though initially we are only implementing a small subset of them.

On the file level, the CLI compiler will map each CLI file to a set of C++ files, such as C++ header file, C++ inline file, and C++ source file. In other words, if we compile options.cli, we will end up with options.hxx, options.ixx, options.cxx (or the same files but with alternative extensions).

Comments are ignored. In some cases it could make sense to copy comments over into the generated code. However, there is no way to distinguish between such “documentation” comments and comments that are for the CLI definition itself. For example:

class options
{
  // Show help.
  //
  bool --help|-h;
 
  // Note: version has two aliases.
  //
  bool --version|-v|-V;
};

In this example it could be useful to copy the first comment to the generated code but not the second. The first comment should actually be made a documentation string which can then be reproduced as a comment in the generated C++ code.

CLI namespaces are mapped to C++ namespaces. This one is simple.

Similarly, option classes are mapped to C++ classes. We will need to provide the copy constructor and assignment operator. We will also need to provide a constructor to instantiate this class from the argc, argv pair.

Since the options may be followed by a number of arguments, this last constructor will need a way to tell the caller where the options end and the arguments begin. There are two ways this can be done. The first is to pass by reference an index argument which is set by the constructor to the position of the first argument. The second approach is to modify the argc and argv data by removing the entries that were consumed by the options class. The second approach is more convenient but is not usable if we need to re-examine the argv elements corresponding to the options. Finally, both versions will have one additional argument which will allow us to specify where in the argv array we should start. By default it will be the second element, after the program name. Here is how all this will look in C++:

class options
{
public:
  options (int& argc,
           char** argv,
           size_t start = 1);
 
  options (int argc,
           char** argv,
           size_t& end,
           size_t start = 1);
 
  options (const options&);
  options& operator= (const options&);
 
  ...
};
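To make the two ways of reporting where the options end more concrete, here is a minimal sketch. It uses free functions with hypothetical names (is_option, scan_with_index, scan_and_erase) instead of the constructors above, and it assumes that anything starting with '-' is a flag without a value, which is a simplification made only for illustration.

```cpp
#include <cstddef>

// Simplifying assumption for this sketch: anything that starts with '-'
// (other than a lone "-") is an option and takes no value.
//
inline bool
is_option (const char* s)
{
  return s[0] == '-' && s[1] != '\0';
}

// First approach: leave argc/argv untouched and set 'end' to the
// position of the first argument.
//
inline void
scan_with_index (int argc,
                 char** argv,
                 std::size_t& end,
                 std::size_t start = 1)
{
  std::size_t i = start;

  while (i < static_cast<std::size_t> (argc) && is_option (argv[i]))
    ++i;

  end = i;
}

// Second approach: remove the consumed entries from argv and adjust
// argc so that the arguments start at position 'start'.
//
inline void
scan_and_erase (int& argc, char** argv, std::size_t start = 1)
{
  std::size_t end;
  scan_with_index (argc, argv, end, start);

  std::size_t n = end - start; // Number of consumed entries.

  for (std::size_t i = end; i < static_cast<std::size_t> (argc); ++i)
    argv[i - n] = argv[i];

  argc -= static_cast<int> (n);
}
```

Distinct function names also sidestep an overload issue that the constructor form would need to watch out for: a call that passes an int variable, argv, and a size_t variable matches both signatures equally well, so the compiler would reject it as ambiguous.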

Another aspect that we will need to take care of is error handling. In particular, the argc/argv parsing constructors may fail for a number of reasons, including an unknown option, a missing option value, or an invalid option value. The user of our class will need a way to access the declarations of the corresponding exceptions. To keep everything relevant to the functionality of our options parser in one place, we can add them to the generated options class, for example:

class options
{
public:
  // Exceptions.
  //
  typedef ... unknown_option;
  typedef ... missing_value;
  typedef ... invalid_value;
 
  // Constructors.
  //
  options (int& argc,
           char** argv,
           size_t start = 1);
 
  options (int argc,
           char** argv,
           size_t& end,
           size_t start = 1);
 
  options (const options&);
  options& operator= (const options&);
 
  ...
};
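One possible way to fill in the elided typedef targets is to define the exceptions once in the common support code and have each generated class refer to them. The sketch below assumes a namespace called cli and exception classes that carry the offending option name; both are illustrative choices, not something we have settled on. The report() function is likewise a hypothetical example of user code.

```cpp
#include <exception>
#include <string>

namespace cli // Hypothetical namespace for the common support code.
{
  class unknown_option: public std::exception
  {
  public:
    explicit
    unknown_option (const std::string& option): option_ (option) {}

    const std::string& option () const {return option_;}
    virtual const char* what () const noexcept {return "unknown option";}

  private:
    std::string option_;
  };

  class missing_value: public std::exception
  {
  public:
    explicit
    missing_value (const std::string& option): option_ (option) {}

    const std::string& option () const {return option_;}
    virtual const char* what () const noexcept {return "missing value";}

  private:
    std::string option_;
  };
}

class options
{
public:
  // The user refers to the exceptions through the options class and
  // does not need to know where they are actually defined.
  //
  typedef cli::unknown_option unknown_option;
  typedef cli::missing_value missing_value;
};

// Example of user code that only mentions the options class.
//
inline std::string
report (const options::unknown_option& e)
{
  return std::string (e.what ()) + " '" + e.option () + "'";
}
```

Because the typedefs name the same types, an exception thrown by the support code as cli::unknown_option is caught by a handler written for options::unknown_option.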

Now let’s consider the central construct of our language, the option declaration. For each option we will generate a set of accessors and, optionally, modifiers to access/modify this option’s value. Most applications won’t need to modify option values after parsing so we will only generate modifiers if explicitly requested by the user with a compiler flag (e.g., --generate-modifiers). We will also need to generate a member variable which will store the option’s value. The names of the accessors, modifiers, and member variable will be derived from the option name. Finally, if the option has a default value or if it is a flag, we will need to add initializers to the constructors. For example:

class options
{
public:
  ...
 
  options (int& argc, char** argv, size_t start = 1)
    : help_ (false), compression_ (5)
  {
  }
 
  ...
 
  bool help () const;
  void help (bool); // optional
 
  short compression () const;
  void compression (short); // optional
 
protected:
  bool help_;
  short compression_;
};
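To see how the initializers, member variables, and accessors fit together with parsing, here is a self-contained sketch of what the generated class could look like for these two options. It is an illustration under several assumptions: only the index-reporting constructor is shown, errors are reported with std::runtime_error rather than the dedicated exceptions, and anything that is not a known option is treated as the first argument.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>

class options
{
public:
  options (int argc, char** argv, std::size_t& end, std::size_t start = 1)
    : help_ (false), compression_ (5) // Flag and default value.
  {
    std::size_t n = static_cast<std::size_t> (argc);
    std::size_t i = start;

    for (; i < n; ++i)
    {
      std::string o (argv[i]);

      if (o == "--help" || o == "-h")
        help_ = true;
      else if (o == "--compression" || o == "-c")
      {
        if (i + 1 >= n)
          throw std::runtime_error ("missing value for " + o);

        // Parse the value and make sure nothing is left over.
        //
        std::istringstream is (argv[++i]);

        if (!(is >> compression_) || !is.eof ())
          throw std::runtime_error ("invalid value for " + o);
      }
      else
        break; // Not a known option: the arguments start here.
    }

    end = i;
  }

  bool help () const {return help_;}
  void help (bool v) {help_ = v;} // Only with --generate-modifiers.

  short compression () const {return compression_;}
  void compression (short v) {compression_ = v;} // Same as above.

protected:
  bool help_;
  short compression_;
};
```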

Option inheritance is naturally mapped to C++ public inheritance. For example, these CLI definitions:

class common_options
{
  bool --help|-h;
  bool --version|-v;
};
 
class options: common_options
{
  short --compression = 5;
};

will be mapped to the following C++ definitions:

class common_options
{
  ...
};
 
class options: public common_options
{
  ...
};

The C++ inclusion is mapped pretty much verbatim to the C++ preprocessor #include. The only thing that we may need to do is to strip the ‘cxx:’ prefix from the path.

The CLI inclusion is a bit more complex. The purpose of CLI inclusion is to make CLI class declarations in one file visible in another file. This is necessary to support option inheritance. Since option inheritance is mapped to C++ class inheritance, the derived C++ class declaration in one file will also need to “see” the base class declaration in another file. As a result, we will need to map CLI inclusions to C++ header inclusions. Consider the following two CLI files:

// file: common.cli
//
class common_options
{
  bool --help|-h;
  bool --version|-v;
};
 
// file: options.cli
//
include "cli:common.cli"
 
class options: common_options
{
  short --compression = 5;
};

When we compile these files, the generated C++ header files would look like this:

// file: common.hxx
//
class common_options
{
  ...
};
 
// file: options.hxx
//
#include "common.hxx"
 
class options: public common_options
{
  ...
};

Here, the CLI include is mapped to the C++ preprocessor #include with the CLI file name being transformed to the corresponding C++ header file name.
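The two inclusion mappings can be sketched as a single function in the compiler. The name map_include and its parameters are hypothetical, and the sketch assumes that the header suffix is .hxx unless told otherwise and that an include without a prefix refers to a C++ file.

```cpp
#include <string>

// Hypothetical helper: map a CLI include path to a C++ #include line.
// 'angle' is true if the path was written as <...> rather than "...".
//
inline std::string
map_include (std::string path,
             bool angle,
             const std::string& hxx_suffix = ".hxx")
{
  bool cli = false;

  if (path.compare (0, 4, "cli:") == 0)
  {
    cli = true;
    path.erase (0, 4);
  }
  else if (path.compare (0, 4, "cxx:") == 0)
    path.erase (0, 4); // C++ inclusion: only strip the prefix.

  if (cli)
  {
    // Replace the CLI file extension, if any, with the header suffix.
    // A dot that belongs to a directory name is left alone.
    //
    std::string::size_type p = path.rfind ('.');

    if (p != std::string::npos &&
        path.find ('/', p) == std::string::npos)
      path.erase (p);

    path += hxx_suffix;
  }

  return std::string ("#include ") +
    (angle ? '<' : '"') + path + (angle ? '>' : '"');
}
```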

Using declarations and directives as well as typedef’s are copied verbatim to the generated C++ code.

Option documentation can be used to produce several kinds of output. Outside of C++ it can be used to generate man pages and HTML-formatted documentation (or fragments thereof). In C++, the user of the options class may want to print the usage information. To support this we can add the static usage() function to our class which prints the usage information to std::ostream, for example:

class options
{
public:
  ...
 
  static void usage (std::ostream&);
};
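For the three options used in the documentation examples, the generated usage() implementation could be as simple as the following sketch. The column layout and the usage_string() helper are assumptions made for illustration; the actual output format is yet to be decided.

```cpp
#include <ostream>
#include <sstream>
#include <string>

class options
{
public:
  static void usage (std::ostream&);
};

// One line per option: the option names followed by the short
// documentation string.
//
void options::
usage (std::ostream& os)
{
  os << "--help|-h         Show usage and exit." << '\n'
     << "--version|-v      Show version and exit." << '\n'
     << "--compression|-c  Set compression level." << '\n';
}

// Hypothetical helper that captures the usage text in a string, for
// example, to include it in an error message.
//
inline std::string
usage_string ()
{
  std::ostringstream os;
  options::usage (os);
  return os.str ();
}
```

Taking std::ostream by reference lets the application decide where the text goes: std::cout for --help, std::cerr after a parsing error, or a string stream as in the helper above.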

Some applications may also need to access individual option documentation in which case we can generate a set of static functions that will allow one to access this information. Finally, the short option documentation strings can be added as comments for the corresponding accessor and modifier functions.

And that covers the basics of the mapping between CLI and C++. Next time we will consider the pros and cons of self-sufficient generated code vs generated code that depends on a runtime library. Then we will need to decide which approach to use. In the meantime I am going to start working on the CLI language parser. Hopefully by next time I will have some code to show. As always, if you have any thoughts, feel free to add them in the comments.

CLI in C++: CLI Definition Language

Sunday, July 26th, 2009

This is the seventh installment in the series of posts about designing a Command Line Interface (CLI) parser for C++.

In the last post we decided that the only way for us to achieve the ideal solution is to design our own domain-specific language (DSL) for command line interface definition. And this is what we are going to be doing today.

As I mentioned in one of the previous posts, I like to explore and think things through even though there may be no plans to implement everything in the early versions of the program. This helps to ensure that once I decide to implement more advanced features, they will fit into the existing model and won’t require a complete redesign. So today I am going to try to cover as many features of a CLI definition language as I can come up with. At the end of the post I will narrow the number of features to a much smaller subset that will be supported in the initial version.

Let’s start with the overall design principles. While it would be more expressive to introduce custom keywords, it is more practical to reuse C++ keywords whenever possible. By introducing new keywords we make more identifiers unavailable to the user. For example, we could use options as a keyword:

options foo
{
  ...
};

But then options cannot be used as an identifier and for most applications it is natural to call the options containers options. So it makes sense to use the already reserved class keyword, for example:

class options
{
  ...
};

We should also strive to model the CLI constructs on conceptually similar C++ constructs. For example, the option declaration needs to capture a type, a name, as well as the default value. The closest C++ construct is probably a variable declaration with initialization. For example:

class options
{
  bool --help;
  short --compression = 5;
};

An option can have a number of aliases and the idiomatic way to represent alternatives in C++ is the OR operator (|). So we can extend our option declaration syntax to allow several names:

class options
{
  bool --help|-h;
  short --compression|-c = 5;
};

It also seems natural to reuse the C++ type system for option types. The fundamental C++ types such as bool or short will be recognized by the CLI compiler. However, the user of our CLI language will most likely also want to use user-defined C++ types. While the CLI compiler may not need to do any type analysis (such as whether the type is actually defined), we need to provide a mechanism for inclusion of such user-defined type definitions into the generated C++ code. The most natural way to do this is to mimic the C++ preprocessor #include mechanism without actually doing any preprocessing. However, if at some later stage we decide to run a preprocessor on the CLI definition files, this choice will cause problems. The next best thing is then to use include without #. Here we have no choice but to introduce a new keyword since there are no existing C++ keywords with a similar meaning. [There is a module proposal for C++ which introduces the import keyword. However, the semantics of this new keyword will be very different from what we are trying to achieve here.] Plus, the use of include as an identifier does not seem very common. Here is an example:

include <string>
include <vector>
 
class options
{
  std::string --name = "foo";
  std::vector<std::string> --names;
};

Since we support user-defined types, the default value can actually be more complex than a single literal initialization, for example:

include <complex>
 
class options
{
  std::complex<float> --value = std::complex<float> (0, -1);
};

While this approach works, it is verbose. We can, therefore, support the construction syntax in addition to the assignment, for example:

include <complex>
 
class options
{
  std::complex<float> --value (0, -1);
};
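On the C++ side, either form of the default value could end up as a member initializer in the generated class. The sketch below is only an illustration of that idea: it uses a default constructor instead of the argc/argv one and assumes the accessor and member naming discussed so far.

```cpp
#include <complex>

class options
{
public:
  // Construction syntax: the arguments are passed through as is. The
  // assignment syntax could instead be emitted as
  //
  //   value_ (std::complex<float> (0, -1))
  //
  // which yields the same value.
  //
  options (): value_ (0, -1) {}

  const std::complex<float>& value () const {return value_;}

protected:
  std::complex<float> value_;
};
```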

Since we will be generating C++ code that may be used throughout the application, we will need to support namespaces sooner or later. Naturally, we will reuse the namespace C++ keyword:

namespace example
{
  class options
  {
    bool --help|-h;
    short --compression|-c = 5;
  };
}

Another feature that might come in handy in more complex applications is option inheritance. For example, the XSD and XSD/e compilers that I am working on support a number of XML Schema to C++ mappings. Each mapping has some unique command line options but also a large set of common options, for example, --output-dir, --namespace-map, --reserved-name, etc. It makes sense to factor such common options out into a separate option class that is then inherited by each mapping-specific option class. Here is an example:

include <string>
include <vector>
 
class common_ops
{
  std::string --output-dir;
  std::vector<std::string> --namespace-map;
  std::vector<std::string> --reserved-name;
};
 
class cxx_tree_ops: common_ops
{
  bool --generate-serialization;
};
 
class cxx_parser_ops: common_ops
{
  bool --generate-print-impl;
};

Once we start splitting option declarations into several classes, the next thing we will want to do is to place them into different files. And for that to work we will need an inclusion mechanism for CLI definition files.

It would be straightforward to reuse the include keyword that we already use to “include” C++ files. However, there is one problem. Since we are not actually parsing the C++ files but merely including them in the generated C++ code, it will be impossible to know whether we are including a C++ file (which we don’t need to parse) or a CLI file (which we do need to parse). As a result, we will need a way to distinguish between different include types. One way to achieve this would be to introduce a new keyword for CLI inclusion. Or we can add an inclusion type prefix to the file path, similar to the scheme part in URIs. For example:

include <cxx:string>
include "cli:common.cli"
 
class cxx_tree_ops: common_ops
{
  bool --generate-serialization;
};
 
class cxx_parser_ops: common_ops
{
  bool --generate-print-impl;
};

The type prefix approach is preferable because we don’t need to introduce yet another keyword. It also looks more consistent. Since there will most likely be more C++ inclusions than CLI, we should default to C++ when the prefix is not specified.

Other constructs that would be nice to have are comments, using declarations/directives, and typedef’s, for example:

include <string>
include <vector>
 
namespace example
{
  using namespace std;
 
  // Application options.
  //
  class options
  {
    typedef vector<string> strings;
 
    string --name = "foo";
    strings --names; /* List of names. */
  };
}

The last big feature that we need to consider is option documentation. In its simplest form we would like to associate a documentation string or two with each option. The first string may provide a short description that is used, for example, in the usage information. The second string may contain a more detailed description for, say, automatic man page generation. The use of {} feels appropriate here, something along these lines:

namespace example
{
  class options
  {
    bool --help|-h {"Show usage and exit."};
    bool --version|-v {"Show version and exit."};
 
    short --compression|-c = 5
    {
      "Set compression level.",
      "Set compression level between 0 (no compression) "
      "and 9 (maximum compression). 5 is the default."
    };
  };
}

For applications that need to support multiple languages, a separate file for each language or locale would be appropriate. Such a file would use a special CLI documentation format. Something along these lines:

include "cli:options.cli"
 
namespace example
{
  documentation options ("en-US")
  {
 
    --help {"Show usage and exit."};
    --version {"Show version and exit."};
 
    --compression
    {
      "Set compression level.",
      "Set compression level between 0 (no compression) "
      "and 9 (maximum compression). 5 is the default."
    };
  }
}

Now that we have identified every major feature that could be useful in a CLI definition language, we can try to narrow them down to a set that is minimal but still complete enough to be usable by a typical application. We will then use this set of features for the initial implementation. Here are the core features that I have identified:

  • option class
  • option declaration (without documentation)
  • C++ inclusion
  • namespaces

And here is the list of features to be added in subsequent releases:

  • option inheritance
  • option documentation
  • CLI inclusion
  • using declarations/directives and typedef’s
  • comments

Next time we will start thinking about how to map these CLI definitions to C++. As always, if you have any thoughts, feel free to add them in the comments.