Archive for July, 2009

CLI in C++: CLI Definition Language

Sunday, July 26th, 2009

This is the seventh installment in the series of posts about designing a Command Line Interface (CLI) parser for C++.

In the last post we decided that the only way for us to achieve the ideal solution is to design our own domain-specific language (DSL) for command line interface definition. And this is what we are going to be doing today.

As I mentioned in one of the previous posts, I like to explore and think things through even though there may be no plans to implement everything in the early versions of the program. This helps to ensure that once I decide to implement more advanced features, they will fit into the existing model and won’t require a complete redesign. So today I am going to try to cover as many features of a CLI definition language as I can come up with. At the end of the post I will narrow the number of features to a much smaller subset that will be supported in the initial version.

Let’s start with the overall design principles. While it would be more expressive to introduce custom keywords, it is more practical to reuse C++ keywords whenever possible. By introducing new keywords we make more identifiers unavailable to the user. For example, we could use options as a keyword:

options foo
{
  ...
};

But then options cannot be used as an identifier, and for most applications it is natural to name the option container options. So it makes sense to use the already reserved class keyword, for example:

class options
{
  ...
};

We should also strive to model the CLI constructs on conceptually similar C++ constructs. For example, the option declaration needs to capture a type, a name, as well as the default value. The closest C++ construct is probably a variable declaration with initialization. For example:

class options
{
  bool --help;
  short --compression = 5;
};

An option can have a number of aliases and the idiomatic way to represent alternatives in C++ is the OR operator (|). So we can extend our option declaration syntax to allow several names:

class options
{
  bool --help|-h;
  short --compression|-c = 5;
};
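To see what aliases mean for the parser, here is a minimal sketch of how the generated code could resolve every spelling of an option to a single canonical name. The make_alias_table() and resolve() functions are made up for illustration, and the table contents assume the aliases --help|-h and --compression|-c.

```cpp
#include <map>
#include <string>

// Hypothetical alias table: every spelling of an option maps to one
// canonical name.
std::map<std::string, std::string>
make_alias_table ()
{
  std::map<std::string, std::string> t;
  t["--help"] = "help";
  t["-h"] = "help";
  t["--compression"] = "compression";
  t["-c"] = "compression";
  return t;
}

// Return the canonical option name or an empty string if the argument
// is not a known option.
std::string
resolve (const std::map<std::string, std::string>& t,
         const std::string& arg)
{
  std::map<std::string, std::string>::const_iterator i (t.find (arg));
  return i != t.end () ? i->second : std::string ();
}
```

With such a table the rest of the parser only needs to deal with one name per option, no matter how many aliases the definition lists.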

It also seems natural to reuse the C++ type system for option types. The fundamental C++ types such as bool or short will be recognized by the CLI compiler. However, the user of our CLI language will most likely also want to use user-defined C++ types. While the CLI compiler may not need to do any type analysis (such as whether the type is actually defined), we need to provide a mechanism for inclusion of such user-defined type definitions into the generated C++ code. The most natural way to do this is to mimic the C++ preprocessor #include mechanism without actually doing any preprocessing. However, if at some later stage we decide to run a preprocessor on the CLI definition files, this choice will cause problems. The next best thing is then to use include without #. Here we have no choice but to introduce a new keyword since there are no existing C++ keywords with a similar meaning. [There is a module proposal for C++ which introduces the import keyword. However, the semantics of this new keyword will be very different from what we are trying to achieve here.] Plus, the use of include as an identifier does not seem very common. Here is an example:

include <string>
include <vector>
 
class options
{
  std::string --name = "foo";
  std::vector<std::string> --names;
};

Since we support user-defined types, the default value can actually be more complex than a single literal initialization, for example:

include <complex>
 
class options
{
  std::complex<float> --value = std::complex<float> (0, -1);
};

While this approach works, it is verbose. We can, therefore, support the construction syntax in addition to the assignment, for example:

include <complex>
 
class options
{
  std::complex<float> --value (0, -1);
};
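One reason the construction syntax is attractive is that it could be copied more or less verbatim into a member initializer of the generated class. The following is only a sketch of what such output might look like. The mapping to C++ has not been designed yet, so the accessor and the value_ member are assumptions.

```cpp
#include <complex>

// Hypothetical output for:
//
//   std::complex<float> --value (0, -1);
//
class options
{
public:
  options ()
    : value_ (0, -1) // Arguments taken as-is from the CLI definition.
  {
  }

  const std::complex<float>&
  value () const
  {
    return value_;
  }

private:
  std::complex<float> value_;
};
```

Note that the CLI compiler would not need to understand the arguments. It only has to pass them through to the C++ compiler.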

Since we will be generating C++ code that may be used throughout the application, we will need to support namespaces sooner or later. Naturally, we will reuse the namespace C++ keyword:

namespace example
{
  class options
  {
    bool --help|-h;
    short --compression|-c = 5;
  };
}

Another feature that might come in handy in more complex applications is option inheritance. For example, the XSD and XSD/e compilers that I am working on support a number of XML Schema to C++ mappings. Each mapping has some unique command line options but also a large set of common options, for example, --output-dir, --namespace-map, --reserved-name, etc. It makes sense to factor such common options out into a separate option class that is then inherited by each mapping-specific option class. Here is an example:

include <string>
include <vector>
 
class common_ops
{
  std::string --output-dir;
  std::vector<std::string> --namespace-map;
  std::vector<std::string> --reserved-name;
};
 
class cxx_tree_ops: common_ops
{
  bool --generate-serialization;
};
 
class cxx_parser_ops: common_ops
{
  bool --generate-print-impl;
};
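Option inheritance maps quite naturally to C++ class inheritance. Below is a rough sketch of what the generated classes could look like for a subset of the options above. The accessor names and the has_output_dir() helper are assumptions made for illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical generated code for common_ops (two options shown).
class common_ops
{
public:
  const std::string&
  output_dir () const {return output_dir_;}

  const std::vector<std::string>&
  namespace_map () const {return namespace_map_;}

private:
  std::string output_dir_;
  std::vector<std::string> namespace_map_;
};

// Hypothetical generated code for cxx_tree_ops. The base is public so
// that the common accessors remain available.
class cxx_tree_ops: public common_ops
{
public:
  cxx_tree_ops (): generate_serialization_ (false) {}

  bool
  generate_serialization () const {return generate_serialization_;}

private:
  bool generate_serialization_;
};

// Code that only cares about the common options can accept the base.
bool
has_output_dir (const common_ops& o)
{
  return !o.output_dir ().empty ();
}
```

One detail worth noting: in C++ a class inherits privately by default, so the CLI syntax class cxx_tree_ops: common_ops would most likely have to be translated to public inheritance.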

Once we start splitting option declarations into several classes, the next thing we will want to do is to place them into different files. And for that to work we will need an inclusion mechanism for CLI definition files.

It would be straightforward to reuse the include keyword that we already use to “include” C++ files. However, there is one problem. Since we are not actually parsing the C++ files but merely including them in the generated C++ code, it will be impossible to know whether we are including a C++ file (which we don’t need to parse) or a CLI file (which we do need to parse). As a result, we will need a way to distinguish between different include types. One way to achieve this would be to introduce a new keyword for CLI inclusion. Or we can add an inclusion type prefix to the file path, similar to the scheme part in URIs. For example:

include <cxx:string>
include "cli:common.cli"
 
class cxx_tree_ops: common_ops
{
  bool --generate-serialization;
};
 
class cxx_parser_ops: common_ops
{
  bool --generate-print-impl;
};

The type prefix approach is preferable because we don’t need to introduce yet another keyword. It also looks more consistent. Since there will most likely be more C++ inclusions than CLI, we should default to C++ when the prefix is not specified.
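The prefix rule is easy to implement in the CLI compiler. Here is a minimal sketch, assuming the compiler has already stripped the <> or "" delimiters from the path. The include_kind, include_path, and classify() names are made up for illustration.

```cpp
#include <string>

enum include_kind {include_cxx, include_cli};

struct include_path
{
  include_kind kind;
  std::string path; // Path without the type prefix.
};

// Split an include path into its type and the actual path. A path
// without a recognized prefix is treated as a C++ inclusion.
include_path
classify (const std::string& s)
{
  include_path r;
  r.kind = include_cxx;
  r.path = s;

  if (s.compare (0, 4, "cli:") == 0)
  {
    r.kind = include_cli;
    r.path = s.substr (4);
  }
  else if (s.compare (0, 4, "cxx:") == 0)
    r.path = s.substr (4);

  return r;
}
```

Only paths classified as include_cli would be opened and parsed. Everything else would be copied into the generated C++ code as an #include directive.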

Other constructs that would be nice to have are comments, using declarations/directives, and typedefs, for example:

include <string>
include <vector>
 
namespace example
{
  using namespace std;
 
  // Application options.
  //
  class options
  {
    typedef vector<string> strings;
 
    string --name = "foo";
    strings --names; /* List of names. */
  };
}

The last big feature that we need to consider is option documentation. In its simplest form we would like to associate a documentation string or two with each option. The first string may provide a short description that is used, for example, in the usage information. The second string may contain a more detailed description for, say, automatic man page generation. The use of {} feels appropriate here, something along these lines:

namespace example
{
  class options
  {
    bool --help|-h {"Show usage and exit."};
    bool --version|-v {"Show version and exit."};
 
    short --compression|-c = 5
    {
      "Set compression level.",
      "Set compression level between 0 (no compression) "
      "and 9 (maximum compression). 5 is the default."
    };
  };
}
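To illustrate how the first documentation string could end up in the usage information, here is a small sketch of code that the CLI compiler might generate from such a definition. The option_doc structure, the docs table, and the usage() function are assumptions, as is the output layout.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

struct option_doc
{
  const char* names; // Option names as written in the definition.
  const char* brief; // The first (short) documentation string.
};

// Hypothetical table generated from the documentation strings.
static const option_doc docs[] =
{
  {"--help|-h", "Show usage and exit."},
  {"--version|-v", "Show version and exit."},
  {"--compression|-c", "Set compression level."}
};

// Format the usage information, one option per line.
std::string
usage ()
{
  std::ostringstream os;

  for (std::size_t i (0); i != sizeof (docs) / sizeof (docs[0]); ++i)
    os << docs[i].names << "  " << docs[i].brief << '\n';

  return os.str ();
}
```

The second, more detailed string would be ignored here and picked up by a different generator, for example, the one producing man pages.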

For applications that need to support multiple languages, a separate file for each language or locale would be appropriate. Such a file would use a special CLI documentation format. Something along these lines:

include "options.cli"
 
namespace example
{
  documentation options ("en-US")
  {
 
    --help {"Show usage and exit."};
    --version {"Show version and exit."};
 
    --compression
    {
      "Set compression level.",
      "Set compression level between 0 (no compression) "
      "and 9 (maximum compression). 5 is the default."
    };
  }
}

Now that we have identified every major feature that could be useful in a CLI definition language, we can try to narrow them down to a set that is minimal but still complete enough to be usable by a typical application. We will then use this set of features for the initial implementation. Here are the core features that I have identified:

  • option class
  • option declaration (without documentation)
  • C++ inclusion
  • namespaces

And here is the list of features to be added in subsequent releases:

  • option inheritance
  • option documentation
  • CLI inclusion
  • using declarations/directives and typedefs
  • comments

Next time we will start thinking about how to map these CLI definitions to C++. As always, if you have any thoughts, feel free to add them in the comments.

CLI in C++: DSL-based Designs

Sunday, July 19th, 2009

This is the sixth installment in the series of posts about designing a Command Line Interface (CLI) parser for C++.

In the last post we analyzed design approaches which have the command line interface defined in the C++ source code. Today we will start exploring designs that rely on domain-specific languages (DSL).

A DSL is a special-purpose language tailored for a specific domain or problem. We have two broad choices when it comes to the DSL-based designs. We can try to reuse or retrofit an existing language to describe the command line interface. Or we can design our own command line interface definition language. The main advantage of the first approach is the ability to base our implementation on an existing compiler implementation. The main disadvantage lies in the difficulty of reusing an existing language for a different purpose. If a language is fairly generic, then the resulting CLI definition will most likely end up overly verbose. On the other hand, if a language is tailored to address a more specific problem, we may be unable to use it to capture some of the aspects of the command line interface. A good example of this problem would be a hypothetical language that describes objects containing typed name-value pairs. We could use the pair’s name to capture the option name. However, options may have aliases (e.g., --help and -h) and it would be impossible to capture them in such a language. If we decide to design our own language for CLI definition, then we can make it a perfect fit for our requirements. However, we will have to implement the compiler from scratch.

One existing DSL that was suggested by Malisha Mogilny is YANG. YANG is a data modeling language used to describe configuration and state data. Here is how we could model the CLI definition using YANG:

module example
{
  container options
  {
    leaf help
    {
      type boolean;
    }
 
    leaf version
    {
      type boolean;
    }
 
    leaf compression
    {
      type uint16;
      default 5;
    }
  }
}

This definition would be mapped to C++ code along these lines:

namespace example
{
  class options
  {
  public:
    options ()
      : help_ (false),
        version_ (false),
        compression_ (5)
    {
    }
 
    bool help () const;
    bool version () const;
    unsigned short compression () const;
 
  private:
    bool help_;
    bool version_;
    unsigned short compression_;
  };
}

There are a number of problems with reusing YANG for command line interface definition. The language is very big and 90% of it does not apply to CLI. There is no easy way to define name aliases for options (we could use the extension mechanism, but it gets quite verbose). The YANG type system uses names for built-in types that differ from those in C++. As a result, we will need to provide a mapping between YANG types and C++ types. Finally, the definition presented above is verbose: it has too much syntax. Compare it to the following definition which we can achieve with our own language:

namespace example
{
  class options
  {
    bool --help|-h;
    bool --version;
    unsigned short --compression = 5;
  };
}

Which brings us to the custom DSL design alternative. The above example is the most elegant and concise CLI definition that we have seen so far. We can also support user-defined C++ types, which won’t be possible if we are reusing an existing language. For example:

#include <string>
#include <vector>
#include <boost/regex.hpp>
 
namespace example
{
  class options
  {
    std::vector<std::string> --names;
    boost::regex --expr (".*", boost::regex::perl);
  };
}

Until now we have identified and analyzed three broad design alternatives: the native design, reusing an existing DSL, and creating our own language for CLI definition. The first approach is the simplest but, as we have discussed in the previous posts, it has a number of problems, including verbosity and implementation issues. Reusing an existing DSL will most likely also result in a sub-optimal solution as we have seen today. Designing our own language involves the largest amount of work but gives us complete control and theoretically allows us to design a truly ideal solution. Since we are after an ideal solution, having our own DSL appears to be the only viable way to achieve this. So next time we will start designing our own CLI definition language. As always, you are welcome to add your thoughts on this in the comments.

CLI in C++: Native Designs

Sunday, July 12th, 2009

This is the fifth installment in the series of posts about designing a Command Line Interface (CLI) parser for C++.

Today we will start exploring the possible design alternatives for a CLI parser. But first, let’s divide all the possible designs into two categories. In the first category there are designs that define the command line interface in the C++ source code itself. We will call them native. In the second category there are designs that define the command line interface outside of C++, in the so-called domain-specific language (DSL). Such a definition is then translated to C++ using a DSL compiler. We will call these types of design DSL-based. The first approach is preferable since it is more flexible, easier to maintain, and, overall, keeps things simple. If we cannot achieve the ideal solution using this design, then we will need to decide whether the drawbacks of the best solutions from the first category outweigh the trouble of going the DSL route. Today we will concentrate on the native designs.

Let’s also reiterate the properties of the ideal solution that we have established so far:

  1. Aggregation: options are stored in an object
  2. Static naming: option accessors have names derived from option names
  3. Static typing: option accessors have return types fixed to option types
  4. No repetition: the option name and option type are specified only once for each option

The two native solutions that we have seen so far and that have come closest to the ideal are the functor-based design and the template-based design. Here is the recap of the functor-based CLI definition:

struct options: cli::options
{
  options ()
    : help (false, "--help"),
      version (false, "--version"),
      compression (5, "--compression")
  {
  }
 
  cli::option<bool> help;
  cli::option<bool> version;
  cli::option<unsigned short> compression;
};

And here is the template-based version:

extern const char help[] = "help";
extern const char version[] = "version";
extern const char compression[] = "compression";
 
typedef
cli::options<help, bool,
             version, bool,
             compression, unsigned short>
options;
 
typedef cli::options_spec<options> options_spec;
 
int main ()
{
  options_spec spec;
  spec.option<compression> ().default_value (5);
  ...
}

Both solutions satisfy the first three properties but fail the “No repetition” one. In both cases we have to repeat the option name at least three times.

To see whether we can improve on the functor-based design, we can try to analyze it on a more elementary level. To satisfy the second rule (static naming), we will have to have a C++ identifier (i.e., a function or a functor name) corresponding to the option name. We will also need to have a string representation of the option name so that we can compare it to command line array elements during parsing. Since there is no easy way to get one from the other (the easiest method would probably be to use the debug information), we will have to repeat the option name at least twice. Thus the best definition that we can hope to achieve would be something along these lines (pseudo C++):

struct options: cli::options
{
  cli::option<bool, "--help"> help;
  cli::option<bool, "--version"> version;
  cli::option<unsigned short, 
              "--compression",
              5> compression;
};

Unfortunately, string literals cannot be template arguments, neither in the current C++98 nor in the upcoming C++0x. As a result, the function/functor declaration and the place where it is “connected” to the string representation of the option name have to be separated, which brings the number of required option name repetitions to three.
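What does work as a template argument is the address of an array with external linkage, which is what the template-based design relies on. A minimal sketch (the option_name template is made up for illustration):

```cpp
// Arrays with external linkage. Their addresses are valid template
// arguments.
extern const char help[] = "help";
extern const char version[] = "version";

template <const char* Name>
struct option_name
{
  static const char*
  get () {return Name;}
};

// OK: help names an object with external linkage.
typedef option_name<help> help_name;

// Error if uncommented: a string literal is not a valid template
// argument.
//
// typedef option_name<"help"> bad_name;
```

Note how the name already appears twice in each array definition, once as the C++ identifier and once as the string.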

With the template-based design, even if we could use string literals directly as template arguments, it would violate the second property (static naming). The use of variable names in accessing the option values guarantees that if we misspell any of them, it will be detected by the compiler.

Each approach also has a number of implementation-related problems. In the functor-based design the use of functors instead of normal member functions makes the resulting options class harder to understand. Functors cannot be easily overridden should we decide to make some of the accessors virtual. This design also needs a global (or thread-local) variable to implement automatic option registration. There is nothing we can do about either of these drawbacks without greatly increasing the verbosity of the CLI definition.
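To make the need for a global variable more concrete, here is one way automatic registration could be implemented. This is a simplified sketch under my own assumptions and not necessarily how the design from the earlier posts does it. The current and map_ members and the registered() function are made up, and the code is not thread-safe.

```cpp
#include <map>
#include <string>

namespace cli
{
  class option_base
  {
  public:
    virtual ~option_base () {}
  };

  class options
  {
  public:
    // The base is constructed before any data member of the derived
    // class, so when the option members are constructed, current
    // already points to the enclosing object.
    options () {current = this;}

    bool
    registered (const std::string& name) const
    {
      return map_.find (name) != map_.end ();
    }

    static options* current; // The global variable in question.
    std::map<std::string, option_base*> map_;
  };

  options* options::current = 0;

  template <typename T>
  class option: public option_base
  {
  public:
    option (const T& default_value, const std::string& name)
      : value_ (default_value)
    {
      options::current->map_[name] = this; // Automatic registration.
    }

    // The call operator makes the member look like an accessor.
    const T&
    operator() () const {return value_;}

  private:
    T value_;
  };
}

struct options: cli::options
{
  options ()
    : help (false, "--help"),
      compression (5, "--compression")
  {
  }

  cli::option<bool> help;
  cli::option<unsigned short> compression;
};
```

The sketch also shows why the functors are awkward: help and compression are objects, not member functions, so they cannot be made virtual, and copying the options object would leave the map pointing to the members of the original.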

As we have discussed in the previous post, the template-based approach does not scale to a large number of options. But can its implementation be improved using C++0x? At first glance, variadic templates look promising. However, this feature only supports a single unbounded template argument. In other words, there is no way to have a “parallel” pair of unbounded template arguments (option type and option name in our case). One way to resolve this is to wrap each option declaration into a separate type, for example:

typedef
cli::options<cli::option<help, bool>,
             cli::option<version, bool>,
             cli::option<compression, unsigned short>>
options;
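Here is a minimal sketch of how the cli::options and cli::option templates could be implemented with variadic templates to support this definition. It is only an illustration: storing each option as a base class and the cli::get() accessor are my assumptions, and parsing is left out entirely.

```cpp
extern const char help[] = "help";
extern const char version[] = "version";
extern const char compression[] = "compression";

namespace cli
{
  // A single option. The name is a template argument while the value
  // is stored in the object and value-initialized (false, 0, etc).
  template <const char* Name, typename T>
  struct option
  {
    option (): value () {}
    T value;
  };

  // A single unbounded list of option wrappers. Each wrapper becomes
  // a base class of the container.
  template <typename... O>
  struct options: O...
  {
  };

  // Name is specified explicitly while T is deduced from the only
  // base class that matches option<Name, T>.
  template <const char* Name, typename T>
  T&
  get (option<Name, T>& o)
  {
    return o.value;
  }
}

typedef
cli::options<cli::option<help, bool>,
             cli::option<version, bool>,
             cli::option<compression, unsigned short>>
options;
```

With this in place, cli::get<compression> (o) returns a reference to an unsigned short, and a misspelled option name fails to compile, so static naming and static typing are both preserved.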

So with the help of C++0x we can make the template-based implementation scale, but this comes at the cost of increased verbosity.

In the next post we will explore possible DSL-based design alternatives. Once this is done we will have to weigh the pros and cons of using native vs DSL-based designs and decide which way to go. If you have any thoughts or maybe another promising native design that I have missed, feel free to add them as comments.