CLI in C++: Status Update

Wednesday, August 12th, 2009

Over the weekend I went ahead and implemented the lexical analyzer for the CLI language. If you would like to check it out, you can download the source distribution via these links:

The +dep version includes the project’s dependencies. See the INSTALL file inside for details. Alternatively, you can follow the development in the git repository:

There are a couple of interesting things to note about the lexer design: First is the handling of the include paths. To mimic the C++ preprocessor as closely as possible, we decided to allow both "foo" and <foo> styles of paths. However, in our case, the include statement is part of the language and, therefore, is tokenized by the lexer. The dilemma then is how to handle < and >, which can be used in both paths and, later, in option types that contain template-ids (for example, std::vector<int>) or expressions (for example, 2 < 3). If we always treat them as separate tokens, then handling of the include paths becomes very tricky. For example, the <foo/bar.hxx> path would be split into several tokens. On the parser level it will be indistinguishable from, say, <foo / bar.hxx>, which may not be the same or even a valid path.

To overcome this problem, the lexer treats < after the include keyword as the start of a path literal instead of as a separate token (path literals are handled in the same way as string literals except for having < > instead of " "). That’s one area where we had to bring a little bit of language semantics knowledge into the lexical analyzer.
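To illustrate the idea, here is a minimal sketch of a tokenizer that remembers only one piece of context: whether the previous token was the include keyword. The names token_kind, token, and tokenize are made up for this example and are not taken from the CLI compiler sources. Numbers, comments, and ordinary string literals are left out to keep it short.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token kinds; the names are assumptions made for this sketch.
enum class token_kind { keyword, identifier, path_literal, punctuation };

struct token
{
  token_kind kind;
  std::string text;
};

// Tokenize a single line. The only context kept between tokens is whether
// the previous token was the include keyword.
std::vector<token> tokenize (const std::string& s)
{
  std::vector<token> r;
  bool after_include = false;

  for (std::size_t i = 0; i < s.size (); )
  {
    unsigned char c = static_cast<unsigned char> (s[i]);

    if (std::isspace (c))
    {
      ++i;
      continue;
    }

    if ((c == '<' || c == '"') && after_include)
    {
      // Scan up to the closing delimiter and keep both delimiters so that
      // the parser can still tell the two include styles apart.
      char close = (c == '<' ? '>' : '"');
      std::size_t e = s.find (close, i + 1);
      assert (e != std::string::npos); // A real lexer would report an error.
      r.push_back ({token_kind::path_literal, s.substr (i, e - i + 1)});
      i = e + 1;
      after_include = false;
      continue;
    }

    if (std::isalpha (c) || c == '_')
    {
      std::size_t b = i;
      while (i < s.size () &&
             (std::isalnum (static_cast<unsigned char> (s[i])) || s[i] == '_'))
        ++i;

      std::string id = s.substr (b, i - b);
      after_include = (id == "include");
      r.push_back (
        {after_include ? token_kind::keyword : token_kind::identifier, id});
      continue;
    }

    // Everything else, including < and > outside of an include, becomes a
    // single-character punctuation token.
    r.push_back ({token_kind::punctuation, std::string (1, s[i])});
    ++i;
    after_include = false;
  }

  return r;
}
```

With this sketch, include <foo/bar.hxx>; yields three tokens (keyword, path literal, and ;) while foo<int> x still yields < and > as separate tokens.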

Another interesting aspect is the handling of option names. To be convenient to use, option names should allow various additional symbols such as -, /, etc., that are not allowed in C++ identifiers. Consider this option definition as an example:

class options
{
  bool --help-me|-h|/h;
};

The question is how do we treat these additional symbols: as part of the option identifier or as separate tokens? Handling them as separate tokens presents the same problem as in the include path handling. Namely, the option name can be written in many different ways which will all result in the same token sequence. Making these additional symbols a part of an option identifier can be done in two ways. We can either recognize option names as a special kind of identifier or we can “relax” all the identifiers in the language to include these symbols. The first approach would be cleaner but is hard to implement. The lexer would need to recognize the places where option names can appear and scan them accordingly. Since there is no keyword to easily identify such places, the lexer would need to implement pretty much the same language parsing logic as the parser itself. This is a bit more semantics knowledge than I am willing to bring into the lexical analyzer.
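The second of these approaches, relaxing all identifiers, can be sketched in a few lines. The function names relaxed_char and split, as well as the exact set of extra characters (just - and / here), are assumptions of this sketch rather than what the actual lexer does. Comments and literals are not handled.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Characters allowed in a relaxed identifier: the usual C++ identifier
// characters plus '-' and '/'. The exact set is an assumption of this sketch.
bool relaxed_char (char c)
{
  return std::isalnum (static_cast<unsigned char> (c)) ||
         c == '_' || c == '-' || c == '/';
}

// Split a declaration into relaxed identifiers and single-character
// punctuation tokens, skipping whitespace.
std::vector<std::string> split (const std::string& s)
{
  std::vector<std::string> r;

  for (std::size_t i = 0; i < s.size (); )
  {
    if (std::isspace (static_cast<unsigned char> (s[i])))
    {
      ++i;
      continue;
    }

    std::size_t b = i;

    if (relaxed_char (s[i]))
    {
      while (i < s.size () && relaxed_char (s[i]))
        ++i;
    }
    else
      ++i;

    assert (i > b); // Every token consumes at least one character.
    r.push_back (s.substr (b, i - b));
  }

  return r;
}
```

Here bool --help-me|-h|/h; comes out as bool, --help-me, |, -h, |, /h, and ; with each option name kept as a single token.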

This leaves us with the relaxed identifier option. One major drawback of this approach is the difficulty of handling expressions that involve -, /, etc. Consider this definition:

class options
{
  int -a-b = -a-b;
};

Semantically, the first -a-b is the option name while the second is an expression. In the initial version of the CLI compiler I decided not to support expressions other than literals (negative integer literals, such as -5, are supported by recognizing them as a special case). The more complex expressions can be defined as constants in C++ headers and included into the CLI file, for example:

// types.hxx
//
const int a = 1;
const int b = 2;
const int a_b = -a-b;

// options.cli
//
include "types.hxx";
 
class options
{
  int -a-b = a_b;
};
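The special case for negative integer literals could be handled at the point where a scanned lexeme is classified. The following is only a sketch of that idea; lexeme_kind and classify are hypothetical names and only decimal integers are considered.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical classification of a lexeme that was scanned using the
// relaxed identifier rules.
enum class lexeme_kind { int_literal, identifier };

// Treat an optional leading '-' followed only by digits as an integer
// literal. Everything else stays a (relaxed) identifier.
lexeme_kind classify (const std::string& s)
{
  assert (!s.empty ());

  std::size_t i = (s[0] == '-' ? 1 : 0);

  if (i == s.size ())
    return lexeme_kind::identifier; // A lone '-'.

  for (; i < s.size (); ++i)
    if (!std::isdigit (static_cast<unsigned char> (s[i])))
      return lexeme_kind::identifier;

  return lexeme_kind::int_literal;
}
```

With this rule -5 and 5 are literals, while -a-b and also 2-1 are identifiers, which is exactly why anything beyond a literal has to go into a C++ header for now.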

In the future we can support full expressions by recognizing places in the language where they can appear. That is, after = as well as between ( ) and < >. For example:

class options
{
  int -a = 2-1;
  int -b (2-1);
  foo<2-1> -c;
};

On the parser level, we will also need to tighten these relaxed identifiers back to the C++ subset for namespaces, classes, and option types.

With the initial version of the lexer complete, the next thing to implement is the parser. I will let you know as soon as I have something working.

CLI in C++: CLI to C++ Mapping

Tuesday, August 4th, 2009

This is the eighth installment in the series of posts about designing a Command Line Interface (CLI) parser for C++. The previous posts were:

In the last post we designed our CLI language which is a domain-specific language (DSL) for defining a program’s command line interface. Today we are going to see how to map the CLI language constructs to C++ constructs.

At the end of the previous post we had a list of high-level CLI language features which I am going to repeat here:

  • comments
  • namespaces
  • option class
  • option declaration
  • option inheritance
  • C++ inclusion
  • CLI inclusion
  • using declarations/directives and typedef’s
  • option documentation

We also agreed that only a subset of them will end up being supported in the initial release. But for the same reasons as with the CLI language itself, we are going to discuss the mapping of all of these constructs to C++ even though initially we are only implementing a small subset of them.

On the file level, the CLI compiler will map each CLI file to a set of C++ files, such as C++ header file, C++ inline file, and C++ source file. In other words, if we compile options.cli, we will end up with options.hxx, options.ixx, options.cxx (or the same files but with alternative extensions).
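As a small illustration of this file-level mapping, here is a sketch that derives the output names from the CLI file name. The output_files struct and the map_file function are hypothetical, and the default extensions are simply the ones used in the example above.

```cpp
#include <cassert>
#include <string>

// Names of the C++ files produced for one CLI file (hypothetical struct).
struct output_files
{
  std::string header;
  std::string inline_file;
  std::string source;
};

// Derive the output file names by replacing the .cli extension. The
// extension arguments allow for the alternative extensions.
output_files map_file (const std::string& cli,
                       const std::string& hxx = ".hxx",
                       const std::string& ixx = ".ixx",
                       const std::string& cxx = ".cxx")
{
  const std::string ext (".cli");

  // This sketch only accepts names that end with .cli.
  assert (cli.size () > ext.size () &&
          cli.compare (cli.size () - ext.size (), ext.size (), ext) == 0);

  std::string base (cli, 0, cli.size () - ext.size ());
  return output_files {base + hxx, base + ixx, base + cxx};
}
```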

Comments are ignored. In some cases it could make sense to copy comments over into the generated code. However, there is no way to distinguish between such “documentation” comments and comments that are for the CLI definition itself. For example:

class options
{
  // Show help.
  //
  bool --help|-h;
 
  // Note: version has two aliases.
  //
  bool --version|-v|-V;
};

In this example it could be useful to copy the first comment to the generated code but not the second. The first comment should actually be made a documentation string which can then be reproduced as a comment in the generated C++ code.

CLI namespaces are mapped to C++ namespaces. This one is simple.

Similarly, option classes are mapped to C++ classes. We will need to provide the copy constructor and assignment operator. We will also need to provide a constructor to instantiate this class from the argc, argv pair.

Since the options may be followed by a number of arguments, this last constructor will need a way to tell the caller where the options end and the arguments begin. There are two ways this can be done. The first is to pass by reference an index argument which is set by the constructor to the position of the first argument. The second approach is to modify the argc and argv data by removing the entries that were consumed by the options class. The second approach is more convenient but is not usable if we need to re-examine the argv elements corresponding to the options. Finally, both versions will have one additional argument which will allow us to specify where in the argv array we should start. By default it will be the second element, after the program name. Here is how all this will look in C++:

class options
{
public:
  options (int& argc,
           char** argv,
           size_t start = 1);
 
  options (int argc,
           char** argv,
           size_t& end,
           size_t start = 1);
 
  options (const options&);
  options& operator= (const options&);
 
  ...
};
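To make the difference between the two constructors concrete, here is a hand-written sketch of what such a class could look like for a single --help|-h flag. The parse() helper and the rule that parsing stops at the first unrecognized element are assumptions of this sketch, not something the generated code is committed to. One thing to watch for: with these exact signatures, a call that passes a modifiable int for argc together with a size_t variable for end matches both constructors equally well and is ambiguous, so the second constructor has to be called with a const (or temporary) argc.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

class options
{
public:
  // Erasing version: the consumed entries are removed from argv and argc
  // is adjusted accordingly.
  options (int& argc, char** argv, std::size_t start = 1)
    : help_ (false)
  {
    std::size_t end = parse (argc, argv, start);
    std::size_t n = end - start; // Number of consumed entries.

    // Shift the remaining arguments down over the consumed options.
    for (std::size_t i = end; i < static_cast<std::size_t> (argc); ++i)
      argv[i - n] = argv[i];

    argc -= static_cast<int> (n);
  }

  // Non-erasing version: argv is left intact and end is set to the
  // position of the first argument.
  options (int argc, char** argv, std::size_t& end, std::size_t start = 1)
    : help_ (false)
  {
    end = parse (argc, argv, start);
  }

  bool help () const { return help_; }

private:
  // Scan options beginning at start and return the position of the first
  // element that is not a recognized option (a simplification).
  std::size_t parse (int argc, char** argv, std::size_t start)
  {
    assert (argc >= 0 && start <= static_cast<std::size_t> (argc));

    std::size_t i = start;
    for (; i < static_cast<std::size_t> (argc); ++i)
    {
      if (std::strcmp (argv[i], "--help") == 0 ||
          std::strcmp (argv[i], "-h") == 0)
        help_ = true;
      else
        break;
    }

    return i;
  }

  bool help_;
};
```

For example, given an argv of prog, --help, file.txt, the first constructor leaves argc at 2 with argv[1] pointing to file.txt, while the second leaves argv alone and sets end to 2.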

Another aspect that we will need to take care of is error handling. In particular, the argc/argv parsing constructors may fail for a number of reasons, including an unknown option, a missing option value, or an invalid option value. The user of our class will need a way to access the declarations of the corresponding exceptions. To keep everything relevant to the functionality of our options parser in one place, we can add them to the generated options class, for example:

class options
{
public:
  // Exceptions.
  //
  typedef ... unknown_option;
  typedef ... missing_value;
  typedef ... invalid_value;
 
  // Constructors.
  //
  options (int& argc,
           char** argv,
           size_t start = 1);
 
  options (int argc,
           char** argv,
           size_t& end,
           size_t start = 1);
 
  options (const options&);
  options& operator= (const options&);
 
  ...
};

Now let’s consider the central construct of our language, the option declaration. For each option we will generate a set of accessors and, optionally, modifiers to access/modify this option’s value. Most applications won’t need to modify option values after parsing so we will only generate modifiers if explicitly requested by the user with a compiler flag (e.g., --generate-modifiers). We will also need to generate a member variable which will store the option’s value. The names of the accessors, modifiers, and member variable will be derived from the option name. Finally, if the option has a default value or if it is a flag, we will need to add initializers to the constructors. For example:

class options
{
public:
  ...
 
  options (int& argc, char** argv, size_t start = 1)
    : help_ (false), compression_ (5)
  {
  }
 
  ...
 
  bool help () const;
  void help (bool); // optional
 
  short compression () const;
  void compression (short); // optional
 
protected:
  bool help_;
  short compression_;
};
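The derivation of these names from the option name could work roughly as follows. This is only a sketch: accessor_name and member_name are hypothetical helpers, and the rules (strip the leading - and / characters, replace anything that is not alphanumeric with an underscore) are an assumption. Names that end up starting with a digit or clashing with a C++ keyword would need extra handling that is not shown.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Derive the accessor/modifier name from an option name, for example,
// --generate-modifiers becomes generate_modifiers.
std::string accessor_name (const std::string& option)
{
  std::string::size_type b = option.find_first_not_of ("-/");
  assert (b != std::string::npos); // There must be a name after the prefix.

  std::string r;
  for (std::string::size_type i = b; i < option.size (); ++i)
  {
    unsigned char c = static_cast<unsigned char> (option[i]);
    r += std::isalnum (c) ? option[i] : '_';
  }

  return r;
}

// The member variable name is the accessor name with a trailing underscore.
std::string member_name (const std::string& option)
{
  return accessor_name (option) + "_";
}
```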

Option inheritance is naturally mapped to C++ public inheritance. For example, these CLI definitions:

class common_options
{
  bool --help|-h;
  bool --version|-v;
};
 
class options: common_options
{
  short --compression = 5;
};

will be mapped to the following C++ definitions:

class common_options
{
  ...
};
 
class options: public common_options
{
  ...
};

The C++ inclusion is mapped pretty much verbatim to the C++ preprocessor #include. The only thing that we may need to do is to strip the ‘cxx:’ prefix from the path.

The CLI inclusion is a bit more complex. The purpose of CLI inclusion is to make CLI class declarations in one file visible in another file. This is necessary to support option inheritance. Since option inheritance is mapped to C++ class inheritance, the derived C++ class declaration in one file will also need to “see” the base class declaration in another file. As a result, we will need to map CLI inclusions to C++ header inclusions. Consider the following two CLI files:

// file: common.cli
//
class common_options
{
  bool --help|-h;
  bool --version|-v;
};
 
// file: options.cli
//
include "cli:common.cli"
 
class options: common_options
{
  short --compression = 5;
};

When we compile these files, the generated C++ header files would look like this:

// file: common.hxx
//
class common_options
{
  ...
};
 
// file: options.hxx
//
#include "common.hxx"
 
class options: public common_options
{
  ...
};

Here, the CLI include is mapped to the C++ preprocessor #include with the CLI file name being transformed to the corresponding C++ header file name.
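Both kinds of inclusion can be handled by one small translation step. The sketch below takes the path literal as written in the CLI file (delimiters included) and produces the line for the generated header. The map_include name is hypothetical, and the sketch assumes that a path without a prefix is a C++ inclusion and that the generated header uses the .hxx extension.

```cpp
#include <cassert>
#include <string>

// Translate the path literal of a CLI include statement into the
// corresponding #include line of the generated C++ header.
std::string map_include (const std::string& literal)
{
  assert (literal.size () >= 2);

  char open = literal.front ();
  char close = literal.back ();
  assert ((open == '<' && close == '>') || (open == '"' && close == '"'));

  std::string path = literal.substr (1, literal.size () - 2);

  if (path.compare (0, 4, "cli:") == 0)
  {
    path.erase (0, 4);

    // Replace a trailing .cli with the header extension.
    std::string::size_type p = path.rfind (".cli");
    if (p != std::string::npos && p + 4 == path.size ())
      path.replace (p, 4, ".hxx");
  }
  else if (path.compare (0, 4, "cxx:") == 0)
    path.erase (0, 4); // C++ inclusion: just strip the prefix.

  return std::string ("#include ") + open + path + close;
}
```

So "cli:common.cli" becomes #include "common.hxx", while <cxx:string> and <vector> become #include <string> and #include <vector>, respectively.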

Using declarations and directives as well as typedef’s are copied verbatim to the generated C++ code.

Option documentation can be used to produce several kinds of output. Outside of C++ it can be used to generate man pages and HTML-formatted documentation (or fragments thereof). In C++, the user of the options class may want to print the usage information. To support this we can add the static usage() function to our class which prints the usage information to std::ostream, for example:

class options
{
public:
  ...
 
  static void usage (std::ostream&);
};

Some applications may also need to access individual option documentation in which case we can generate a set of static functions that will allow one to access this information. Finally, the short option documentation strings can be added as comments for the corresponding accessor and modifier functions.

And that covers the basics of the mapping between CLI and C++. Next time we will consider the pros and cons of self-sufficient generated code vs generated code that depends on a runtime library. Then we will need to decide which approach to use. In the meantime I am going to start working on the CLI language parser. Hopefully by next time I will have some code to show. As always, if you have any thoughts, feel free to add them in the comments.

CLI in C++: CLI Definition Language

Sunday, July 26th, 2009

This is the seventh installment in the series of posts about designing a Command Line Interface (CLI) parser for C++. The previous posts were:

In the last post we decided that the only way for us to achieve the ideal solution is to design our own domain-specific language (DSL) for command line interface definition. And this is what we are going to be doing today.

As I mentioned in one of the previous posts, I like to explore and think things through even though there may be no plans to implement everything in the early versions of the program. This helps to ensure that once I decide to implement more advanced features, they will fit into the existing model and won’t require a complete redesign. So today I am going to try to cover as many features of a CLI definition language as I can come up with. At the end of the post I will narrow the number of features to a much smaller subset that will be supported in the initial version.

Let’s start with the overall design principles. While it would be more expressive to introduce custom keywords, it is more practical to reuse C++ keywords whenever possible. By introducing new keywords we make more identifiers unavailable to the user. For example, we could use options as a keyword:

options foo
{
  ...
};

But then options cannot be used as an identifier and for most applications it is natural to call the options containers options. So it makes sense to use the already reserved class keyword, for example:

class options
{
  ...
};

We should also strive to model the CLI constructs on conceptually similar C++ constructs. For example, the option declaration needs to capture a type, a name, as well as the default value. The closest C++ construct is probably a variable declaration with initialization. For example:

class options
{
  bool --help;
  short --compression = 5;
};

An option can have a number of aliases and the idiomatic way to represent alternatives in C++ is the OR operator (|). So we can extend our option declaration syntax to allow several names:

class options
{
  bool --help|-h;
  short --compression|-c = 5;
};
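Once the whole name sequence has been read, breaking it into the individual names is straightforward. Here is a minimal sketch; split_names is a hypothetical helper and it assumes the names have already been extracted from the declaration as a single string.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Split an option name sequence such as --compression|-c into the
// individual names. The first name is the primary one, the rest are aliases.
std::vector<std::string> split_names (const std::string& names)
{
  std::vector<std::string> r;

  for (std::size_t b = 0;;)
  {
    std::size_t e = names.find ('|', b);
    std::string n = names.substr (
      b, e == std::string::npos ? std::string::npos : e - b);

    assert (!n.empty ()); // An empty name, as in --help||-h, is an error.
    r.push_back (n);

    if (e == std::string::npos)
      break;

    b = e + 1;
  }

  return r;
}
```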

It also seems natural to reuse the C++ type system for option types. The fundamental C++ types such as bool or short will be recognized by the CLI compiler. However, the user of our CLI language will most likely also want to use user-defined C++ types. While the CLI compiler may not need to do any type analysis (such as whether the type is actually defined), we need to provide a mechanism for inclusion of such user-defined type definitions into the generated C++ code. The most natural way to do this is to mimic the C++ preprocessor #include mechanism without actually doing any preprocessing. However, if at some later stage we decide to run a preprocessor on the CLI definition files, this choice will cause problems. The next best thing is then to use include without #. Here we have no choice but to introduce a new keyword since there are no existing C++ keywords with a similar meaning. [There is a module proposal for C++ which introduces the import keyword. However, the semantics of this new keyword will be very different from what we are trying to achieve here.] Plus, the use of include as an identifier does not seem very common. Here is an example:

include <string>
include <vector>
 
class options
{
  std::string --name = "foo";
  std::vector<std::string> --names;
};

Since we support user-defined types, the default value can actually be more complex than a single literal initialization, for example:

include <complex>
 
class options
{
  std::complex<float> --value = std::complex<float> (0, -1);
};

While this approach works, it is verbose. We can, therefore, support the construction syntax in addition to the assignment, for example:

include <complex>
 
class options
{
  std::complex<float> --value (0, -1);
};

Since we will be generating C++ code that may be used throughout the application, we will need to support namespaces sooner or later. Naturally, we will reuse the namespace C++ keyword:

namespace example
{
  class options
  {
    bool --help|-h;
    short --compression|-c = 5;
  };
}

Another feature that might come in handy in more complex applications is option inheritance. For example, the XSD and XSD/e compilers that I am working on support a number of XML Schema to C++ mappings. Each mapping has some unique command line options but also a large set of common options, for example, --output-dir, --namespace-map, --reserved-name, etc. It makes sense to factor such common options out into a separate option class that is then inherited by each mapping-specific option class. Here is an example:

include <string>
include <vector>
 
class common_ops
{
  std::string --output-dir;
  std::vector<std::string> --namespace-map;
  std::vector<std::string> --reserved-name;
};
 
class cxx_tree_ops: common_ops
{
  bool --generate-serialization;
};
 
class cxx_parser_ops: common_ops
{
  bool --generate-print-impl;
};

Once we start splitting option declarations into several classes, the next thing we will want to do is to place them into different files. And for that to work we will need an inclusion mechanism for CLI definition files.

It would be straightforward to reuse the include keyword that we already use to “include” C++ files. However, there is one problem. Since we are not actually parsing the C++ files but merely including them in the generated C++ code, it will be impossible to know whether we are including a C++ file (which we don’t need to parse) or a CLI file (which we do need to parse). As a result, we will need a way to distinguish between different include types. One way to achieve this would be to introduce a new keyword for CLI inclusion. Or we can add an inclusion type prefix to the file path, similar to the scheme part in URIs. For example:

include <cxx:string>
include "cli:common.cli"
 
class cxx_tree_ops: common_ops
{
  bool --generate-serialization;
};
 
class cxx_parser_ops: common_ops
{
  bool --generate-print-impl;
};

The type prefix approach is preferable because we don’t need to introduce yet another keyword. It also looks more consistent. Since there will most likely be more C++ inclusions than CLI, we should default to C++ when the prefix is not specified.

Other constructs that would be nice to have are comments, using declarations/directives, and typedef’s, for example:

include <string>
include <vector>
 
namespace example
{
  using namespace std;
 
  // Application options.
  //
  class options
  {
    typedef vector<string> strings;
 
    string --name = "foo";
    strings --names; /* List of names. */
  };
}

The last big feature that we need to consider is option documentation. In its simplest form we would like to associate a documentation string or two with each option. The first string may provide a short description that is used, for example, in the usage information. The second string may contain a more detailed description for, say, automatic man page generation. The use of {} feels appropriate here, something along these lines:

namespace example
{
  class options
  {
    bool --help|-h {"Show usage and exit."};
    bool --version|-v {"Show version and exit."};
 
    short --compression|-c = 5
    {
      "Set compression level.",
      "Set compression level between 0 (no compression) "
      "and 9 (maximum compression). 5 is the default."
    };
  };
}

For applications that need to support multiple languages, a separate file for each language or locale would be appropriate. Such a file would use a special CLI documentation format. Something along these lines:

include "options.cli"
 
namespace example
{
  documentation options ("en-US")
  {
 
    --help {"Show usage and exit."};
    --version {"Show version and exit."};
 
    --compression
    {
      "Set compression level.",
      "Set compression level between 0 (no compression) "
      "and 9 (maximum compression). 5 is the default."
    };
  }
}

Now that we have identified every major feature that could be useful in a CLI definition language, we can try to narrow them down to a set that is minimal but still complete enough to be usable by a typical application. We will then use this set of features for the initial implementation. Here are the core features that I have identified:

  • option class
  • option declaration (without documentation)
  • C++ inclusion
  • namespaces

And here is the list of features to be added in subsequent releases:

  • option inheritance
  • option documentation
  • CLI inclusion
  • using declarations/directives and typedef’s
  • comments

Next time we will start thinking about how to map these CLI definitions to C++. As always, if you have any thoughts, feel free to add them in the comments.