Archive for August, 2009

CLI in C++: Status Update

Monday, August 24th, 2009

Over the past two weekends I implemented the parser for our CLI language. If you would like to check it out, you can download the source distribution via these links:

See the previous status update for more information on what to do with these files. You can also follow the development, including accessing a more detailed log of changes, via the git repository:

At this stage the parser just checks that the stream of tokens supplied by the lexical analyzer is a valid definition in the CLI language. If there is an error, it is reported and the parser tries to recover by skipping past the next ‘;’. There is also a formal description of the language in the doc/language.txt file, and the function names in the parser class roughly correspond to the production rules.
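To illustrate the recovery strategy, here is a minimal sketch of skipping past the next ‘;’ after an error. The token type and the recover() function are made up for this example and are not taken from the actual parser source:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token representation; the real lexer's types may differ.
struct token
{
  enum kind_type { identifier, punctuation, eos } kind;
  std::string text;
};

// Skip tokens starting at pos until just past the next ';' or until the
// end-of-stream token, whichever comes first. Returns the position at
// which parsing could resume.
std::size_t recover (const std::vector<token>& ts, std::size_t pos)
{
  while (pos < ts.size () && ts[pos].kind != token::eos)
  {
    const token& t = ts[pos++];

    if (t.kind == token::punctuation && t.text == ";")
      break;
  }

  return pos;
}
```

With this approach a single malformed declaration costs at most the tokens up to its terminating ‘;’, and the declarations that follow can still be checked.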

With the initial version of the parser complete, the next step is to build the so-called semantic graph from the parsed definitions. We can then traverse this semantic graph to generate the C++ mapping. I should have the semantic graph ready by the end of next weekend. I will let you know as soon as I have something working.

CLI in C++: Status Update

Wednesday, August 12th, 2009

Over the weekend I went ahead and implemented the lexical analyzer for the CLI language. If you would like to check it out, you can download the source distribution via these links:

The +dep version includes the project’s dependencies. See the INSTALL file inside for details. Alternatively, you can follow the development in the git repository:

There are a couple of interesting things to note about the lexer design: First is the handling of the include paths. To mimic the C++ preprocessor as closely as possible, we decided to allow both "foo" and <foo> styles of paths. However, in our case, the include statement is part of the language and, therefore, is tokenized by the lexer. The dilemma then is how to handle < and >, which can be used in both paths and, later, in option types that contain template-ids (for example, std::vector<int>) or expressions (for example, 2 < 3). If we always treat them as separate tokens, then handling of the include paths becomes very tricky. For example, the <foo/bar.hxx> path would be split into several tokens. On the parser level it will be indistinguishable from, say, <foo / bar.hxx>, which may not be the same or even a valid path.

To overcome this problem, the lexer treats < after the include keyword as a start of a path literal instead of as a separate token (path literals are handled in the same way as string literals except for having < > instead of " "). That’s one area where we had to bring a little bit of language semantics knowledge into the lexical analyzer.
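The idea can be sketched as a lexer that remembers whether the previous token was the include keyword. This is only an illustration: the token model and the lex() function are invented for this example, and numbers, string literals outside of include, and relaxed identifiers are not handled:

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token model, not the actual lexer's.
enum class token_kind { keyword, path, identifier, punctuation };

struct token
{
  token_kind kind;
  std::string text;
};

std::vector<token> lex (const std::string& s)
{
  std::vector<token> r;
  bool after_include = false; // True right after the include keyword.

  for (std::size_t i = 0; i < s.size (); )
  {
    unsigned char c = static_cast<unsigned char> (s[i]);

    if (std::isspace (c))
    {
      ++i;
      continue;
    }

    if (after_include && (c == '<' || c == '"'))
    {
      // Scan the whole path literal, delimiters included, as one token.
      char close = (c == '<' ? '>' : '"');
      std::size_t e = s.find (close, i + 1);
      std::size_t n = (e == std::string::npos ? s.size () : e + 1);
      r.push_back (token {token_kind::path, s.substr (i, n - i)});
      i = n;
      after_include = false;
      continue;
    }

    if (std::isalpha (c) || c == '_')
    {
      std::size_t b = i;
      while (i < s.size () &&
             (std::isalnum (static_cast<unsigned char> (s[i])) ||
              s[i] == '_'))
        ++i;

      std::string w = s.substr (b, i - b);
      after_include = (w == "include");
      r.push_back (token {after_include ? token_kind::keyword
                                        : token_kind::identifier, w});
      continue;
    }

    // Anything else, including a '<' that does not follow include, is a
    // single-character punctuation token.
    r.push_back (token {token_kind::punctuation, std::string (1, s[i])});
    ++i;
    after_include = false;
  }

  return r;
}
```

Here the input include <foo/bar.hxx>; yields three tokens (keyword, path literal, ‘;’) while the < in a < b is still an ordinary punctuation token.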

Another interesting aspect is the handling of option names. To be convenient to use, option names should allow various additional symbols such as -, /, etc., that are not allowed in C++ identifiers. Consider this option definition as an example:

class options
{
  bool --help-me|-h|/h;
};

The question is how to treat these additional symbols: as part of the option identifier or as separate tokens? Handling them as separate tokens presents the same problem as in the include path handling. Namely, the option name can be written in many different ways which will all result in the same token sequence. Making these additional symbols a part of an option identifier can be done in two ways. We can either recognize option names as a special kind of identifier or we can “relax” all the identifiers in the language to include these symbols. The first approach would be cleaner but is hard to implement. The lexer would need to recognize the places where option names can appear and scan them accordingly. Since there is no keyword to easily identify such places, the lexer would need to implement pretty much the same language parsing logic as the parser itself. This is a bit more semantics knowledge than I am willing to bring into the lexical analyzer.

This leaves us with the relaxed identifier option. One major drawback of this approach is the difficulty of handling expressions that involve -, /, etc. Consider this definition:

class options
{
  int -a-b = -a-b;
};

Semantically, the first -a-b is the option name while the second is an expression. In the initial version of the CLI compiler I decided not to support expressions other than literals (negative integer literals, such as -5, are supported by recognizing them as a special case). More complex expressions can be defined as constants in C++ headers and included into the CLI file, for example:

// types.hxx
//
const int a = 1;
const int b = 2;
const int a_b = -a-b;

// options.cli
//
include "types.hxx";
 
class options
{
  int -a-b = a_b;
};
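As a rough sketch of the relaxed identifier approach, the following code scans a run of relaxed identifier characters and then classifies the result, treating a negative integer such as -5 as a literal. The function names and the exact set of extra characters (- and /) are assumptions made for this example:

```cpp
#include <cctype>
#include <cstddef>
#include <string>

enum class word_kind { int_literal, identifier };

// True for characters allowed in a relaxed identifier in this sketch.
bool relaxed_char (char c)
{
  unsigned char u = static_cast<unsigned char> (c);
  return std::isalnum (u) || c == '_' || c == '-' || c == '/';
}

// Extract the run of relaxed identifier characters starting at pos.
std::string scan_word (const std::string& s, std::size_t pos)
{
  std::size_t e = pos;
  while (e < s.size () && relaxed_char (s[e]))
    ++e;
  return s.substr (pos, e - pos);
}

// Classify a scanned word: an optional leading '-' followed only by digits
// is an integer literal, everything else is a (relaxed) identifier.
word_kind classify (const std::string& w)
{
  std::size_t i = (w.size () > 1 && w[0] == '-') ? 1 : 0;

  if (i == w.size ())
    return word_kind::identifier; // Empty word.

  for (; i < w.size (); ++i)
    if (!std::isdigit (static_cast<unsigned char> (w[i])))
      return word_kind::identifier;

  return word_kind::int_literal;
}
```

Note that both occurrences of -a-b in the earlier example come out as the same identifier, which is exactly why an expression cannot be told apart from an option name at this level.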

In the future we can support full expressions by recognizing places in the language where they can appear. That is, after = as well as between ( ) and < >. For example:

class options
{
  int -a = 2-1;
  int -b (2-1);
  foo<2-1> -c;
};

On the parser level, we will also need to tighten these relaxed identifiers back to the C++ subset for namespaces, classes, and option types.
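Such a tightening check could be as simple as the following sketch. The function name is made up, and the check does not reject C++ keywords, which a complete implementation would probably also need to do:

```cpp
#include <cctype>
#include <string>

// Return true if s is lexically a plain C++ identifier: a letter or
// underscore followed by letters, digits, or underscores.
bool is_cxx_identifier (const std::string& s)
{
  if (s.empty ())
    return false;

  unsigned char f = static_cast<unsigned char> (s[0]);
  if (!(std::isalpha (f) || f == '_'))
    return false;

  for (char ch : s)
  {
    unsigned char c = static_cast<unsigned char> (ch);
    if (!(std::isalnum (c) || c == '_'))
      return false;
  }

  return true;
}
```

The parser could then call it for namespace and class names and report an error for something like help-me, while still accepting the same spelling as an option name.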

With the initial version of the lexer complete, the next thing to implement is the parser. I will let you know as soon as I have something working.

CLI in C++: CLI to C++ Mapping

Tuesday, August 4th, 2009

This is the eighth installment in the series of posts about designing a Command Line Interface (CLI) parser for C++. The previous posts were:

In the last post we designed our CLI language which is a domain-specific language (DSL) for defining a program’s command line interface. Today we are going to see how to map the CLI language constructs to C++ constructs.

At the end of the previous post we had a list of high-level CLI language features which I am going to repeat here:

  • comments
  • namespaces
  • option class
  • option declaration
  • option inheritance
  • C++ inclusion
  • CLI inclusion
  • using declarations/directives and typedef’s
  • option documentation

We also agreed that only a subset of them will end up being supported in the initial release. But for the same reasons as with the CLI language itself, we are going to discuss the mapping of all of these constructs to C++ even though initially we are only implementing a small subset of them.

On the file level, the CLI compiler will map each CLI file to a set of C++ files, such as a C++ header file, a C++ inline file, and a C++ source file. In other words, if we compile options.cli, we will end up with options.hxx, options.ixx, and options.cxx (or the same files but with alternative extensions).
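The file name mapping can be sketched as follows. The output_files struct and the map_file() function are hypothetical names, and the alternative extensions are passed as plain arguments here rather than through any particular compiler option:

```cpp
#include <cstddef>
#include <string>

// Names of the C++ files generated for one CLI file.
struct output_files
{
  std::string header;
  std::string inline_;
  std::string source;
};

// Derive the output file names by replacing the extension of the CLI file
// (if any) with the given C++ extensions.
output_files map_file (const std::string& cli,
                       const std::string& hxx = ".hxx",
                       const std::string& ixx = ".ixx",
                       const std::string& cxx = ".cxx")
{
  std::string base (cli);

  std::size_t dot = base.rfind ('.');
  std::size_t slash = base.find_last_of ("/\\");

  // Only strip the extension if the dot belongs to the last path component.
  if (dot != std::string::npos &&
      (slash == std::string::npos || dot > slash))
    base.erase (dot);

  return output_files {base + hxx, base + ixx, base + cxx};
}
```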

Comments are ignored. In some cases it could make sense to copy comments over into the generated code. However, there is no way to distinguish between such “documentation” comments and comments that are for the CLI definition itself. For example:

class options
{
  // Show help.
  //
  bool --help|-h;
 
  // Note: version has two aliases.
  //
  bool --version|-v|-V;
};

In this example it could be useful to copy the first comment to the generated code but not the second. The first comment should actually be made a documentation string which can then be reproduced as a comment in the generated C++ code.

CLI namespaces are mapped to C++ namespaces. This one is simple.

Similarly, option classes are mapped to C++ classes. We will need to provide the copy constructor and assignment operator. We will also need to provide a constructor to instantiate this class from the argc, argv pair.

Since the options may be followed by a number of arguments, this last constructor will need a way to tell the caller where the options end and the arguments begin. There are two ways this can be done. The first is to pass by reference an index argument which is set by the constructor to the position of the first argument. The second approach is to modify the argc and argv data by removing the entries that were consumed by the options class. The second approach is more convenient but is not usable if we need to re-examine the argv elements corresponding to the options. Finally, both versions will have one additional argument which will allow us to specify where in the argv array we should start. By default it will be the second element, after the program name. Here is how all this will look in C++:

class options
{
public:
  options (int& argc,
           char** argv,
           size_t start = 1);
 
  options (int argc,
           char** argv,
           size_t& end,
           size_t start = 1);
 
  options (const options&);
  options& operator= (const options&);
 
  ...
};
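To make the difference between the two approaches concrete, here is a hand-written sketch for a single --help flag. It uses free functions with distinct, made-up names (parse_keep and parse_erase) instead of constructors, partly because a call such as options o (argc, argv, end) with a non-const argc and a size_t variable end matches both constructor signatures above equally well. The sketch also assumes that argv[argc] is a null pointer, as it is for the argv passed to main():

```cpp
#include <cstddef>
#include <cstring>

// Stand-in for the parsed option values: a single --help flag.
struct flags
{
  bool help = false;
};

// Scan options starting at argv[start] and return the index of the first
// element that is not a recognized option.
std::size_t scan (int argc, char** argv, std::size_t start, flags& f)
{
  std::size_t i = start;

  for (; i < static_cast<std::size_t> (argc); ++i)
  {
    if (std::strcmp (argv[i], "--help") == 0)
      f.help = true;
    else
      break;
  }

  return i;
}

// First approach: report the position of the first argument via end and
// leave argc/argv untouched.
flags parse_keep (int argc,
                  char** argv,
                  std::size_t& end,
                  std::size_t start = 1)
{
  flags f;
  end = scan (argc, argv, start, f);
  return f;
}

// Second approach: remove the consumed entries from argc/argv.
flags parse_erase (int& argc, char** argv, std::size_t start = 1)
{
  flags f;
  std::size_t end = scan (argc, argv, start, f);
  std::size_t n = end - start;

  // Shift the remaining entries, including the terminating null pointer
  // at argv[argc], down over the consumed options.
  for (std::size_t i = end; i <= static_cast<std::size_t> (argc); ++i)
    argv[i - n] = argv[i];

  argc -= static_cast<int> (n);
  return f;
}
```

After parse_keep() the caller iterates over the arguments from end to argc, while after parse_erase() the arguments simply start at the start position again.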

Another aspect that we will need to take care of is error handling. In particular, the argc/argv parsing constructors may fail for a number of reasons, including because of an unknown option, missing option value, or invalid option value. The user of our class will need a way to access the declarations of the corresponding exceptions. To keep everything relevant to the functionality of our options parser in one place, we can add them to the generated options class, for example:

class options
{
public:
  // Exceptions.
  //
  typedef ... unknown_option;
  typedef ... missing_value;
  typedef ... invalid_value;
 
  // Constructors.
  //
  options (int& argc,
           char** argv,
           size_t start = 1);
 
  options (int argc,
           char** argv,
           size_t& end,
           size_t start = 1);
 
  options (const options&);
  options& operator= (const options&);
 
  ...
};
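From the user's point of view, handling these exceptions could look something like the sketch below. Everything in it is an assumption made for illustration: the cli_* exception types, the constructor that takes a vector of strings in place of argc/argv, and the single --compression option it understands:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical exception types; the actual definitions are left open.
struct cli_unknown_option: std::runtime_error
{
  explicit cli_unknown_option (const std::string& o)
      : std::runtime_error (o) {}
};

struct cli_missing_value: std::runtime_error
{
  explicit cli_missing_value (const std::string& o)
      : std::runtime_error (o) {}
};

struct cli_invalid_value: std::runtime_error
{
  explicit cli_invalid_value (const std::string& v)
      : std::runtime_error (v) {}
};

class options
{
public:
  typedef cli_unknown_option unknown_option;
  typedef cli_missing_value missing_value;
  typedef cli_invalid_value invalid_value;

  // Stand-in for the argc/argv constructor: only understands
  // "--compression <integer>".
  explicit options (const std::vector<std::string>& args)
      : compression_ (5)
  {
    for (std::size_t i = 0; i < args.size (); ++i)
    {
      if (args[i] != "--compression")
        throw unknown_option (args[i]);

      if (++i == args.size ())
        throw missing_value ("--compression");

      try
      {
        std::size_t n = 0;
        int v = std::stoi (args[i], &n);
        if (n != args[i].size ())
          throw invalid_value (args[i]);
        compression_ = static_cast<short> (v);
      }
      catch (const std::logic_error&) // Thrown by std::stoi.
      {
        throw invalid_value (args[i]);
      }
    }
  }

  short compression () const { return compression_; }

private:
  short compression_;
};

// Parse the arguments and turn each failure into a diagnostic string.
std::string diagnose (const std::vector<std::string>& args)
{
  try
  {
    options o (args);
    return "ok";
  }
  catch (const options::unknown_option& e)
  {
    return std::string ("unknown option: ") + e.what ();
  }
  catch (const options::missing_value& e)
  {
    return std::string ("missing value: ") + e.what ();
  }
  catch (const options::invalid_value& e)
  {
    return std::string ("invalid value: ") + e.what ();
  }
}
```

The point of the nested typedefs is that the catch clauses only need the options class itself, wherever the exception types end up being defined.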

Now let’s consider the central construct of our language, the option declaration. For each option we will generate a set of accessors and, optionally, modifiers to access/modify this option’s value. Most applications won’t need to modify option values after parsing so we will only generate modifiers if explicitly requested by the user with a compiler flag (e.g., --generate-modifiers). We will also need to generate a member variable which will store the option’s value. The names of the accessors, modifiers, and member variable will be derived from the option name. Finally, if the option has a default value or if it is a flag, we will need to add initializers to the constructors. For example:

class options
{
public:
  ...
 
  options (int& argc, char** argv, size_t start = 1)
    : help_ (false), compression_ (5)
  {
  }
 
  ...
 
  bool help () const;
  void help (bool); // optional
 
  short compression () const;
  void compression (short); // optional
 
protected:
  bool help_;
  short compression_;
};
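How exactly the C++ names are derived from the option name is not spelled out here. One plausible set of rules, used purely as an assumption in this sketch, is to take the first name of the option, strip the leading - and / characters, and replace every remaining character that is not valid in a C++ identifier with an underscore:

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Derive an accessor/modifier name from an option name such as
// "--help-me|-h|/h". Only the first alias is used.
std::string accessor_name (const std::string& option)
{
  std::string n (option.substr (0, option.find ('|')));

  // Strip the leading '-' and '/' characters.
  std::size_t b = n.find_first_not_of ("-/");
  n = (b == std::string::npos ? std::string () : n.substr (b));

  // Replace everything that is not valid in a C++ identifier.
  for (char& c : n)
    if (!std::isalnum (static_cast<unsigned char> (c)))
      c = '_';

  // A C++ identifier cannot be empty or start with a digit.
  if (n.empty () || std::isdigit (static_cast<unsigned char> (n[0])))
    n.insert (0, 1, '_');

  return n;
}

// Member variable names follow the trailing underscore convention used in
// the example above (help_, compression_).
std::string member_name (const std::string& option)
{
  return accessor_name (option) + "_";
}
```

Under these rules --help-me|-h|/h gives the help_me accessor and the help_me_ member variable. Two options whose names differ only in such symbols would clash, so a real implementation would need to detect or resolve that.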

Option inheritance is naturally mapped to C++ public inheritance. For example, these CLI definitions:

class common_options
{
  bool --help|-h;
  bool --version|-v;
};
 
class options: common_options
{
  short --compression = 5;
};

will be mapped to the following C++ definitions:

class common_options
{
  ...
};
 
class options: public common_options
{
  ...
};

The C++ inclusion is mapped pretty much verbatim to the C++ preprocessor #include. The only thing that we may need to do is to strip the ‘cxx:’ prefix from the path.

The CLI inclusion is a bit more complex. The purpose of CLI inclusion is to make CLI class declarations in one file visible in another file. This is necessary to support option inheritance. Since option inheritance is mapped to C++ class inheritance, the derived C++ class declaration in one file will also need to “see” the base class declaration in another file. As a result, we will need to map CLI inclusions to C++ header inclusions. Consider the following two CLI files:

// file: common.cli
//
class common_options
{
  bool --help|-h;
  bool --version|-v;
};
 
// file: options.cli
//
include "cli:common.cli"
 
class options: common_options
{
  short --compression = 5;
};

When we compile these files, the generated C++ header files would look like this:

// file: common.hxx
//
class common_options
{
  ...
};
 
// file: options.hxx
//
#include "common.hxx"
 
class options: public common_options
{
  ...
};

Here, the CLI include is mapped to the C++ preprocessor #include with the CLI file name being transformed to the corresponding C++ header file name.
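Both kinds of inclusion can be handled by one small path transformation, sketched below. The map_include() name is made up, and treating a path without a prefix as a C++ inclusion is an assumption of this sketch:

```cpp
#include <cstddef>
#include <string>

// Map the path of a CLI include statement (without the surrounding quotes
// or angle brackets) to the path used in the generated #include.
std::string map_include (const std::string& path)
{
  const std::string cxx ("cxx:");
  const std::string cli ("cli:");
  const std::string ext (".cli");

  // C++ inclusion: strip the prefix and keep the rest verbatim.
  if (path.compare (0, cxx.size (), cxx) == 0)
    return path.substr (cxx.size ());

  // No prefix: assumed to be a C++ inclusion and copied as is.
  if (path.compare (0, cli.size (), cli) != 0)
    return path;

  // CLI inclusion: strip the prefix and switch to the header extension.
  std::string p (path.substr (cli.size ()));

  if (p.size () >= ext.size () &&
      p.compare (p.size () - ext.size (), ext.size (), ext) == 0)
    p.erase (p.size () - ext.size ());

  return p + ".hxx";
}
```

If the generated headers use an alternative extension, the same extension would have to be used here instead of the hard-coded .hxx.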

Using declarations and directives as well as typedef’s are copied verbatim to the generated C++ code.

Option documentation can be used to produce several kinds of output. Outside of C++ it can be used to generate man pages and HTML-formatted documentation (or fragments thereof). In C++, the user of the options class may want to print the usage information. To support this we can add a static usage() function to our class which prints the usage information to std::ostream, for example:

class options
{
public:
  ...
 
  static void usage (std::ostream&);
};
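A generated usage() body might look roughly like this. The layout and the documentation strings other than "Show help." are invented for this sketch, as is the usage_text() helper:

```cpp
#include <ostream>
#include <sstream>
#include <string>

class options
{
public:
  // Print one line per option: the option names followed by the short
  // documentation string.
  static void usage (std::ostream& os)
  {
    os << "--help|-h            Show help.\n"
       << "--version|-v         Show version.\n"
       << "--compression <num>  Set the compression level (default 5).\n";
  }
};

// Capture the usage information in a string, for example, to embed it
// into a longer error message.
std::string usage_text ()
{
  std::ostringstream os;
  options::usage (os);
  return os.str ();
}
```

Taking std::ostream& rather than always writing to std::cerr lets the application decide where the text goes, for example to std::cout for --help and to std::cerr after a parsing error.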

Some applications may also need to access individual option documentation in which case we can generate a set of static functions that will allow one to access this information. Finally, the short option documentation strings can be added as comments for the corresponding accessor and modifier functions.

And that covers the basics of the mapping between CLI and C++. Next time we will consider the pros and cons of self-sufficient generated code vs generated code that depends on a runtime library. Then we will need to decide which approach to use. In the meantime I am going to start working on the CLI language parser. Hopefully by next time I will have some code to show. As always, if you have any thoughts, feel free to add them in the comments.