Archive for the ‘Design’ Category

Options documentation in CLI

Sunday, November 1st, 2009

After announcing CLI 1.0.0, the feature that was requested the most was the automatic documentation generation in the form of the program usage information and man/html pages. I myself wished for this feature while writing essentially the same description of the CLI compiler options in three different places. This also seems to be the last point of defense for the Boost program_options advocates ;-).

We have already considered support for documentation when we first talked about the CLI language. At that point the goal was to think about it just enough to make sure it will be possible without a major language redesign. Now that we are ready to implement this, we will need to think things through more thoroughly. Based on my past experience of documenting a large number of options for the XSD and XSD/e compilers, I have identified the following requirements for this feature:

  • Support for both short (usage) and long (man/html pages) descriptions.
  • Support for basic text formatting, namely, italic, bold, and monospace (code) fonts as well as paragraphs.
  • The documentation in the .cli file should look as close to plain text as possible.
  • The CLI language syntax used to capture the documentation should model C++ as closely as possible.

The first requirement stems from the fact that the usage information printed by the application is usually an abridged version of the complete documentation found in man/html pages. There are several ways in which this can be achieved: We can provide two versions of the documentation: short and long. Or we can use the first sentence from the long description as the short version. Finally, for simple options, the short and long descriptions can be the same. All these alternatives can make sense in different situations and we will need to support all three of them.
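The second alternative, using the first sentence of the long description as the short one, can be sketched as follows. This is only an illustration: the first_sentence() name and the rule that a sentence ends at a period followed by whitespace (or the end of the text) are assumptions, and abbreviations such as "i.e. " would need special handling.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: derive the short (usage) description from the long
// one by taking everything up to and including the first period that is
// followed by whitespace or ends the text. A period inside a number such
// as "0.5" is therefore not treated as a sentence end.
std::string first_sentence (const std::string& long_doc)
{
  for (std::size_t i = 0; i < long_doc.size (); ++i)
  {
    if (long_doc[i] != '.')
      continue;

    if (i + 1 == long_doc.size () ||
        long_doc[i + 1] == ' ' ||
        long_doc[i + 1] == '\n')
      return long_doc.substr (0, i + 1);
  }

  return long_doc; // No sentence end found: use the whole text.
}
```

With such a helper, an option that only provides the long description would still get a sensible line in the usage output.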

When it comes to providing basic formatting support, there are many ways to implement this. We could use the HTML tag system but it is fairly obtrusive. Alternatively, we could use one of the Wiki notations, for example, ''italic'', '''bold''', etc., but that is also quite verbose. I am leaning towards a LaTeX-like notation that can also be viewed as an extension of the C++ character escaping mechanism: \i{italic}, \b{bold}, \c{code}, \bc{boldcode}, etc. It is also fairly light on the eyes when viewed in the source code. For the paragraph separation, a blank line seems like a natural choice. There is also the option argument that is normally set out with the italic style (man/html pages) or by enclosing it in angle brackets (e.g., <name>). While we could use the above formatting mechanism for this, it would be convenient to provide a shortcut for this special case by automatically recognizing the angle brackets and replacing them with italicized text where possible.
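To get a feel for how little machinery this notation needs, here is a minimal sketch of translating it to HTML. The to_html() name and the exact tag mapping are assumptions made for illustration; paragraph breaks and escaping of characters such as & are left out.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Translate the escape notation into HTML. Assumed (hypothetical) mapping:
//   \i{..}  -> <i>..</i>          \b{..}  -> <b>..</b>
//   \c{..}  -> <code>..</code>    \bc{..} -> <b><code>..</code></b>
//   <arg>   -> <i>arg</i>
std::string to_html (const std::string& s)
{
  std::string r;
  std::vector<std::string> closers; // Closing tags of the open \x{ spans.

  for (std::size_t i = 0; i < s.size (); ++i)
  {
    char c = s[i];

    if (c == '\\')
    {
      std::size_t b = s.find ('{', i);
      std::string name;

      if (b != std::string::npos)
        name = s.substr (i + 1, b - i - 1);

      std::string open, close;

      if (name == "i")       {open = "<i>";       close = "</i>";}
      else if (name == "b")  {open = "<b>";       close = "</b>";}
      else if (name == "c")  {open = "<code>";    close = "</code>";}
      else if (name == "bc") {open = "<b><code>"; close = "</code></b>";}

      if (open.empty ())
        r += c;              // Not a known escape: keep the backslash.
      else
      {
        r += open;
        closers.push_back (close);
        i = b;               // Continue after the '{'.
      }
    }
    else if (c == '}' && !closers.empty ())
    {
      r += closers.back ();  // Close the innermost open span.
      closers.pop_back ();
    }
    else if (c == '<')
    {
      std::size_t e = s.find ('>', i);

      if (e == std::string::npos)
        r += "&lt;";         // Stray '<': emit it as an entity.
      else
      {
        r += "<i>" + s.substr (i + 1, e - i - 1) + "</i>";
        i = e;               // Continue after the '>'.
      }
    }
    else
      r += c;
  }

  return r;
}
```

A man page backend would follow the same structure with a different table of opening and closing sequences.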

One part of ensuring that the option documentation looks as close to plain text as possible is to carefully select the formatting mechanism, which we have already done. The other part is to make sure the language syntax is not too obtrusive. Ideally, we would allow straight plain text in certain parts of the language but that makes it difficult to figure out where the text stops. Plus, such a mechanism would be fairly foreign to C++ and thus require some getting used to. Furthermore, one of the reasons for keeping the CLI language as syntactically close to C++ as possible is to allow the use of existing C++ code editors and indenters on .cli files.

To represent arbitrary text in C++ we would use a string literal. Since we may need to provide more than one string, the string array initialization syntax seems like a good choice. For example:

class options
{
  bool --help {"Show usage and exit."}
 
  int --compression = 5
  {
    "Set compression level.",
    "Set compression level between 0 (no compression)
     and 9 (maximum compression). 5 is the default.
 
     Setting the level to a higher value \i{may}
     result in smaller output but may also require
     more memory and CPU time."
  }
};

Notice that in the long documentation for the last option we use a multi-line string literal, which is illegal in C++ due to the way the C++ preprocessor works. Since we don’t have a preprocessor, we can allow such multi-line strings, which are quite convenient.

When we print the usage information for the above options, we would expect an output along these lines:

--help               Show usage and exit.
--compression <num>  Set compression level.

As you can see, we haven’t specified the argument name (<num>) for the second option anywhere in the documentation. To capture this information we will need to introduce a third string for non-flag options (those of a type other than bool). For example:

class options
{
  bool --help {"Show usage and exit."}
 
  int --compression = 5
  {
    "<num>",
    "Set compression level.",
    "Set the compression level to <num> which should
     be between 0 (no compression) and 9 (maximum
     compression). 5 is the default.
 
     Setting the level to a higher value \i{may}
     result in smaller output but may also require
     more memory and CPU time."
  }
};

The <num> word will be automatically converted to an italicized num in the option description when producing the man and html output.

While the option documentation mechanism should be sufficient for the majority of cases, there will be situations where more advanced formatting is required. To support such cases we can provide a compiler option which would allow specifying a pre-formatted description for individual options.

When we need to print usage, the option description is only a part of the output. There is normally at least the command line synopsis, for example:

Usage: program [options] argument
 
Options:
--help               Show usage and exit.
--compression <num>  Set compression level.

While the options class will only print the option information, the rest can be printed manually, for example:

cerr << "Usage: program [options] argument" << endl
     << endl
     << "Options:" << endl;
 
options::print_usage (cerr);
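What the generated print_usage() could do internally is not decided yet, but the column alignment in the output above is easy to sketch. The option_doc structure and the table-driven approach below are assumptions for illustration; the real function would be generated from the .cli file rather than driven by a runtime table.

```cpp
#include <algorithm>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical table entry describing one option.
struct option_doc
{
  std::string name; // Option name, e.g. "--compression".
  std::string arg;  // Argument name, e.g. "<num>"; empty for flags.
  std::string text; // Short description.
};

// Print one line per option with the descriptions aligned in a column
// that starts two spaces after the widest "name arg" entry.
void print_usage (std::ostream& os, const std::vector<option_doc>& docs)
{
  std::vector<std::string> left;
  std::size_t width = 0;

  for (const option_doc& d : docs)
  {
    left.push_back (d.arg.empty () ? d.name : d.name + " " + d.arg);
    width = std::max (width, left.back ().size ());
  }

  for (std::size_t i = 0; i < docs.size (); ++i)
    os << left[i]
       << std::string (width - left[i].size () + 2, ' ')
       << docs[i].text << '\n';
}

// Convenience wrapper that returns the usage text as a string.
std::string usage_string (const std::vector<option_doc>& docs)
{
  std::ostringstream os;
  print_usage (os, docs);
  return os.str ();
}
```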

A similar situation arises when we create the man/html pages. That is, the beginning of a man page as well as the end would normally be written manually. The CLI compiler can output just the options description which can then be combined with the prologue and epilogue manually. We can also provide two options, --prologue and --epilogue, which would allow the caller to specify the documentation prologue and epilogue files that will be automatically copied to the output.
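The prologue/epilogue handling amounts to simple concatenation. A minimal sketch, assuming a hypothetical assemble_page() helper that receives the two files as already opened streams:

```cpp
#include <istream>
#include <sstream>
#include <string>

// Build the complete page: prologue, then the generated options
// description, then epilogue. The prologue and epilogue are copied
// character by character without any interpretation.
std::string assemble_page (std::istream& prologue,
                           const std::string& options,
                           std::istream& epilogue)
{
  std::ostringstream out;
  char c;

  while (prologue.get (c))
    out.put (c);

  out << options;

  while (epilogue.get (c))
    out.put (c);

  return out.str ();
}
```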

I am going to think about this feature for a few more days and hopefully implement it over the next weekend. As always, if you have any thoughts, feel free to add them in the comments.

CLI in C++: Separate vs Embedded Runtime

Monday, September 21st, 2009

This is the ninth installment in the series of posts about designing a Command Line Interface (CLI) parser for C++.

In the last post we discussed the mapping of our CLI language to C++ and established that there will be some common support code, such as exception definitions, that we will need to place somewhere. There are two places where we can keep this code: We can either create a separate runtime library that will contain the support code and on which the generated code will depend. Or we can embed this code directly into the generated C++ files. Today we are going to consider the pros and cons of each approach.

The embedded runtime has the following advantages compared to the separate runtime library:

  • No external dependencies
  • Simple cases will not require extra generated files
  • Can have source code (compared to a header-only library)
  • Can easily support various naming conventions
  • Can minimize the code by only generating what’s needed
  • No runtime/generated code version mismatches
  • Makes the use of the generated code in CLI much easier

Let’s consider each of these points in order. The embedded runtime will not require inclusion of any external headers or linking to any external libraries other than the C++ standard library. This will make the adoption of CLI very easy, in fact, easier than adopting a header-only library. All that needs to be done is to generate the C++ files from the CLI definition and add them to the application source code. This is especially important for relatively inconsequential functionality such as command line parsing. The requirement to add an extra dependency, even a header-only one, may outweigh all the benefits that the CLI compiler will bring.

Most applications that use the CLI language will only have one options file. When we have only one file we can generate the runtime code into the same set of C++ files as the one containing the option class(es). Things are a bit more complicated when we have multiple options files. In this case we cannot generate the runtime code directly into the resulting C++ files because this will lead to re-definitions (if two generated header files are included into the same translation unit) or duplicate symbols (when the generated source files are linked together). Instead, we will need to generate the runtime code into a separate set of C++ files and then include the resulting header into the other generated files. For example, we could have the --generate-runtime <file> option which instructs the compiler to generate the runtime code in a separate set of C++ files and the --runtime <file> option which tells the compiler that the runtime is in these C++ files.

The embedded runtime can have C++ source code unlike a header-only external runtime library. We would want to restrict the external runtime to be a header-only library in order to simplify adoption, since a header-only library does not require building. However, this restriction may force us to declare certain functions inline even if they shouldn’t normally be inlined because of the potential code bloat.

One common complaint about generated code in general is that it fits poorly with the hand-written code. The major reason for this is that the generated code often doesn’t follow the same identifier naming convention as the one used in the project. For example, the project may be using “upper camel case” for type names (e.g., SimpleName) while the generated code uses the standard C++ lower case and underscores (e.g., simple_name). There is no technical reason (except for, maybe, complexity) why a code generator can’t support configurable naming conventions. In fact, that’s what we did in the C++/Tree mapping in XSD and it made a lot of people very happy. The only problem is that it is virtually impossible to support configurable naming conventions in a hand-written runtime library. But it should be quite easy to do with the embedded runtime since it is also generated by the compiler.
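As an illustration of how little is needed on the generator side, here is a sketch of deriving an upper camel case name from the standard lower case and underscores form. The to_upper_camel() function is hypothetical and is not taken from XSD or CLI.

```cpp
#include <cctype>
#include <string>

// Convert a lower-case-with-underscores identifier to upper camel case:
// the first character and every character that follows an underscore are
// capitalized, and the underscores themselves are dropped.
std::string to_upper_camel (const std::string& name)
{
  std::string r;
  bool upper = true;

  for (char c : name)
  {
    if (c == '_')
    {
      upper = true;
      continue;
    }

    r += upper
      ? static_cast<char> (std::toupper (static_cast<unsigned char> (c)))
      : c;

    upper = false;
  }

  return r;
}
```

The generator would pass every emitted identifier, including those of the runtime, through whichever conversion the user selected.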

Because the code for the embedded runtime is generated for each application, we can minimize the output by omitting unused optional components. We can also decide whether to generate certain functions inline based on the application developer preferences.

Since with the embedded runtime there are no external dependencies, there are also no version mismatches that can occur when one of the components (generated code or runtime library) was upgraded and the other was not.

Finally, the embedded runtime approach makes it much easier to use the generated code in the CLI implementation itself. With the separate runtime library we will either have to keep an old copy around or risk breaking the generated code with backwards-incompatible changes that occur during development.

The embedded runtime approach also has a number of disadvantages:

  • Hard to develop and maintain
  • Bug fixes to the runtime require compiler rebuild
  • Impractical for large runtimes

The embedded runtime is harder to develop and maintain than a separate runtime library. This is because the code has to be emitted by the compiler instead of simply sitting in a file. In particular, because the runtime code is embedded into the compiler source code as a collection of strings, it is a lot harder to read and write.
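To make this concrete, here is roughly what a piece of the runtime looks like from inside the compiler. The exception class being emitted and the unknown_option_source() function are made up for this sketch; the point is that even a trivial class becomes a sequence of string fragments, although in return its name can be chosen at generation time.

```cpp
#include <string>

// Return the source text of a hypothetical "unknown option" exception
// class, assembled from string fragments the way an embedded runtime
// would be. The class name is a parameter so that it can follow the
// project's naming convention (unknown_option, UnknownOption, etc).
std::string unknown_option_source (const std::string& name)
{
  std::string s;

  s += "class " + name + "\n";
  s += "{\n";
  s += "public:\n";
  s += "  explicit " + name + " (const std::string& option)\n";
  s += "    : option_ (option) {}\n";
  s += "\n";
  s += "  const std::string& option () const {return option_;}\n";
  s += "\n";
  s += "private:\n";
  s += "  std::string option_;\n";
  s += "};\n";

  return s;
}
```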

Fixing any bug that is found in the embedded runtime code will require a compiler rebuild. In case of a header-only runtime library the same can be accomplished by patching a few files and recompiling the application.

Finally, the embedded runtime approach quickly becomes impractical as the size of the runtime code grows. The difficulty of development and maintenance is one reason. The other reason is the lack of separate compilation. All of the embedded runtime code is contained in a single generated C++ source file. As the amount of code in the runtime grows, this file takes longer and longer to compile.

Now, which approach should we use in our case? The CLI runtime is going to be pretty small, or, at least, I expect it to be. Initially it will contain a few exception definitions and maybe a few helper classes. So the size shouldn’t be an issue. On the other hand, as we discussed above, it is very important to make the generated code as easy to adopt as possible. Making it self-sufficient and dependency-free sounds very attractive. So it looks like in our situation the advantages of the embedded runtime significantly outweigh its disadvantages.

At this point we have covered enough ground to make the first usable release of the CLI compiler. Since the last status update I have started working on the backend infrastructure and at this stage the compiler is able to generate the output C++ files with all the #include directives and the proper namespace structure. It is all in the source repository if you would like to take a look. From now on I will be working on generating the C++ mapping. Once this is done, we should be in good shape to release cli-1.0.0. As always, if you have any thoughts, feel free to add them in the comments.

CLI in C++: CLI to C++ Mapping

Tuesday, August 4th, 2009

This is the eighth installment in the series of posts about designing a Command Line Interface (CLI) parser for C++.

In the last post we designed our CLI language which is a domain-specific language (DSL) for defining a program’s command line interface. Today we are going to see how to map the CLI language constructs to C++ constructs.

At the end of the previous post we had a list of high-level CLI language features which I am going to repeat here:

  • comments
  • namespaces
  • option class
  • option declaration
  • option inheritance
  • C++ inclusion
  • CLI inclusion
  • using declarations/directives and typedef’s
  • option documentation

We also agreed that only a subset of them will end up being supported in the initial release. But for the same reasons as with the CLI language itself, we are going to discuss the mapping of all of these constructs to C++ even though initially we are only implementing a small subset of them.

On the file level, the CLI compiler will map each CLI file to a set of C++ files, such as a C++ header file, a C++ inline file, and a C++ source file. In other words, if we compile options.cli, we will end up with options.hxx, options.ixx, and options.cxx (or the same files but with alternative extensions).
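The file name derivation could look along these lines. The output_files() function is a hypothetical helper and the extensions are hard-coded to the defaults mentioned above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Derive the names of the generated C++ files from the CLI file name by
// replacing its extension (if any) with .hxx, .ixx, and .cxx.
std::vector<std::string> output_files (const std::string& cli_file)
{
  std::string base (cli_file);

  std::size_t dot = base.rfind ('.');
  std::size_t slash = base.find_last_of ("/\\");

  // Only treat the dot as an extension separator if it belongs to the
  // file name and not to one of the directory components.
  if (dot != std::string::npos &&
      (slash == std::string::npos || dot > slash))
    base.erase (dot);

  return {base + ".hxx", base + ".ixx", base + ".cxx"};
}
```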

Comments are ignored. In some cases it could make sense to copy comments over into the generated code. However, there is no way to distinguish between such “documentation” comments and comments that are for the CLI definition itself. For example:

class options
{
  // Show help.
  //
  bool --help|-h;
 
  // Note: version has two aliases.
  //
  bool --version|-v|-V;
};

In this example it could be useful to copy the first comment to the generated code but not the second. The first comment should actually be made a documentation string which can then be reproduced as a comment in the generated C++ code.

CLI namespaces are mapped to C++ namespaces. This one is simple.

Similarly, option classes are mapped to C++ classes. We will need to provide the copy constructor and assignment operator. We will also need to provide a constructor to instantiate this class from the argc, argv pair.

Since the options may be followed by a number of arguments, this last constructor will need a way to tell the caller where the options end and the arguments begin. There are two ways this can be done. The first is to pass by reference an index argument which is set by the constructor to the position of the first argument. The second approach is to modify the argc and argv data by removing the entries that were consumed by the options class. The second approach is more convenient but is not usable if we need to re-examine the argv elements corresponding to the options. Finally, both versions will have one additional argument which will allow us to specify where in the argv array we should start. By default it will be the second element, after the program name. Here is how all this will look in C++:

class options
{
public:
  options (int& argc,
           char** argv,
           size_t start = 1);
 
  options (int argc,
           char** argv,
           size_t& end,
           size_t start = 1);
 
  options (const options&);
  options& operator= (const options&);
 
  ...
};
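The difference between the two approaches is easier to see in a small sketch. It makes several simplifying assumptions: anything that starts with "--" is treated as an option, option values are not handled, and the find_end() and erase_options() names are made up. It also uses two distinctly named free functions instead of the overloaded constructors, both for brevity and because a call that passes a non-const int and a non-const size_t variable can match both constructor overloads shown above.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical rule for this sketch: an entry that starts with "--" is an
// option; the first entry that does not is the first argument.
static bool is_option (const char* s)
{
  return std::strncmp (s, "--", 2) == 0;
}

// Index approach: leave argv untouched and return the position of the
// first argument (which equals argc if there are no arguments).
std::size_t find_end (int argc, char** argv, std::size_t start = 1)
{
  std::size_t i = start;

  while (i < static_cast<std::size_t> (argc) && is_option (argv[i]))
    ++i;

  return i;
}

// Erasing approach: remove the consumed options from argv and adjust
// argc so that only the entries before start and the arguments remain.
void erase_options (int& argc, char** argv, std::size_t start = 1)
{
  std::size_t end = find_end (argc, argv, start);
  std::size_t n = static_cast<std::size_t> (argc);

  // Shift the arguments down over the consumed options.
  for (std::size_t i = end; i < n; ++i)
    argv[start + (i - end)] = argv[i];

  argc -= static_cast<int> (end - start);
}
```

After erase_options() the caller can simply iterate from start to argc, while with find_end() the original option entries are still available for re-examination.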

Another aspect that we will need to take care of is error handling. In particular, the argc/argv parsing constructors may fail for a number of reasons, including because of an unknown option, missing option value, or invalid option value. The user of our class will need a way to access the declarations of the corresponding exceptions. To keep everything relevant to the functionality of our options parser in one place, we can add them to the generated options class, for example:

class options
{
public:
  // Exceptions.
  //
  typedef ... unknown_option;
  typedef ... missing_value;
  typedef ... invalid_value;
 
  // Constructors.
  //
  options (int& argc,
           char** argv,
           size_t start = 1);
 
  options (int argc,
           char** argv,
           size_t& end,
           size_t start = 1);
 
  options (const options&);
  options& operator= (const options&);
 
  ...
};

Now let’s consider the central construct of our language, the option declaration. For each option we will generate a set of accessors and, optionally, modifiers to access/modify this option’s value. Most applications won’t need to modify option values after parsing so we will only generate modifiers if explicitly requested by the user with a compiler flag (e.g., --generate-modifiers). We will also need to generate a member variable which will store the option’s value. The names of the accessors, modifiers, and member variable will be derived from the option name. Finally, if the option has a default value or if it is a flag, we will need to add initializers to the constructors. For example:

class options
{
public:
  ...
 
  options (int& argc, char** argv, size_t start = 1)
    : help_ (false), compression_ (5)
  {
  }
 
  ...
 
  bool help () const;
  void help (bool); // optional
 
  short compression () const;
  void compression (short); // optional
 
protected:
  bool help_;
  short compression_;
};
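To see how these pieces could fit together, here is a hand-written sketch of what the generated class might do for the --help and --compression options. Everything that the declarations above leave open is an assumption: the exception types are plain structs, the value is validated with strtol, and only the index-returning constructor is shown.

```cpp
#include <cstddef>
#include <cstdlib>
#include <limits>
#include <string>

class options
{
public:
  // Exceptions (hypothetical definitions for this sketch).
  //
  struct unknown_option {std::string option;};
  struct missing_value {std::string option;};
  struct invalid_value {std::string option, value;};

  // Parse options starting at argv[start] and set end to the position
  // of the first argument.
  //
  options (int argc, char** argv, std::size_t& end, std::size_t start = 1)
    : help_ (false), compression_ (5)
  {
    std::size_t i = start, n = static_cast<std::size_t> (argc);

    for (; i < n; ++i)
    {
      std::string o (argv[i]);

      if (o == "--help")
        help_ = true;
      else if (o == "--compression")
      {
        if (++i == n)
          throw missing_value {o};

        char* e;
        long v = std::strtol (argv[i], &e, 10);

        // Reject empty values, trailing junk, and out of range numbers.
        if (e == argv[i] || *e != '\0' ||
            v < std::numeric_limits<short>::min () ||
            v > std::numeric_limits<short>::max ())
          throw invalid_value {o, argv[i]};

        compression_ = static_cast<short> (v);
      }
      else if (o.compare (0, 2, "--") == 0)
        throw unknown_option {o};
      else
        break; // First argument.
    }

    end = i;
  }

  bool help () const {return help_;}
  short compression () const {return compression_;}

protected:
  bool help_;
  short compression_;
};
```

The caller would wrap the construction in a try block and, on success, treat the argv elements from end onwards as arguments.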

Option inheritance is naturally mapped to C++ public inheritance. For example, these CLI definitions:

class common_options
{
  bool --help|-h;
  bool --version|-v;
};
 
class options: common_options
{
  short --compression = 5;
};

will be mapped to the following C++ definitions:

class common_options
{
  ...
};
 
class options: public common_options
{
  ...
};

The C++ inclusion is mapped pretty much verbatim to the C++ preprocessor #include. The only thing that we may need to do is to strip the ‘cxx:’ prefix from the path.

The CLI inclusion is a bit more complex. The purpose of CLI inclusion is to make CLI class declarations in one file visible in another file. This is necessary to support option inheritance. Since option inheritance is mapped to C++ class inheritance, the derived C++ class declaration in one file will also need to “see” the base class declaration in another file. As a result, we will need to map CLI inclusions to C++ header inclusions. Consider the following two CLI files:

// file: common.cli
//
class common_options
{
  bool --help|-h;
  bool --version|-v;
};
 
// file: options.cli
//
include "cli:common.cli"
 
class options: common_options
{
  short --compression = 5;
};

When we compile these files, the generated C++ header files would look like this:

// file: common.hxx
//
class common_options
{
  ...
};
 
// file: options.hxx
//
#include "common.hxx"
 
class options: public common_options
{
  ...
};

Here, the CLI include is mapped to the C++ preprocessor #include with the CLI file name being transformed to the corresponding C++ header file name.
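Both inclusion mappings boil down to a small path transformation, which could be sketched like this. The map_include_path() name is hypothetical, the .hxx extension is hard-coded, and paths without a recognized prefix are left alone since their treatment is not covered here.

```cpp
#include <cstddef>
#include <string>

// Map the path of a CLI include directive to the path that should appear
// in the generated #include directive.
std::string map_include_path (const std::string& path)
{
  // C++ inclusion: strip the prefix and copy the rest verbatim.
  if (path.compare (0, 4, "cxx:") == 0)
    return path.substr (4);

  // CLI inclusion: strip the prefix and replace the extension with that
  // of the generated header.
  if (path.compare (0, 4, "cli:") == 0)
  {
    std::string p (path.substr (4));
    std::size_t dot = p.rfind ('.');

    if (dot != std::string::npos)
      p.erase (dot);

    return p + ".hxx";
  }

  return path; // No recognized prefix: left unchanged in this sketch.
}
```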

Using declarations and directives as well as typedef’s are copied verbatim to the generated C++ code.

Option documentation can be used to produce several kinds of output. Outside of C++ it can be used to generate man pages and HTML-formatted documentation (or fragments thereof). In C++, the user of the options class may want to print the usage information. To support this we can add the static usage() function to our class which prints the usage information to std::ostream, for example:

class options
{
public:
  ...
 
  static void usage (std::ostream&);
};

Some applications may also need to access individual option documentation in which case we can generate a set of static functions that will allow one to access this information. Finally, the short option documentation strings can be added as comments for the corresponding accessor and modifier functions.

And that covers the basics of the mapping between CLI and C++. Next time we will consider the pros and cons of self-sufficient generated code vs generated code that depends on a runtime library. Then we will need to decide which approach to use. In the meantime I am going to start working on the CLI language parser. Hopefully by next time I will have some code to show. As always, if you have any thoughts, feel free to add them in the comments.