CLI in C++: Status Update
Over the weekend I went ahead and implemented the lexical analyzer for the CLI language. If you would like to check it out, you can download the source distribution via these links:
The +dep version includes the project’s dependencies. See the INSTALL file inside for details. Alternatively, you can follow the development in the git repository:
There are a couple of interesting things to note about the lexer design. First is the handling of the include paths. To mimic the C++ preprocessor as closely as possible, we decided to allow both "foo" and <foo> styles of paths. However, in our case, the include statement is part of the language and, therefore, is tokenized by the lexer. The dilemma then is how to handle < and >, which can be used in both paths and, later, in option types that contain template-ids (for example, std::vector<int>) or expressions (for example, 2 < 3). If we always treat them as separate tokens, then handling of the include paths becomes very tricky. For example, the <foo/bar.hxx> path would be split into several tokens. On the parser level it will be indistinguishable from, say, <foo / bar.hxx>, which may not be the same or even a valid path.
To overcome this problem, the lexer treats < after the include keyword as the start of a path literal instead of as a separate token (path literals are handled in the same way as string literals except for having < > instead of " " as delimiters). That’s one area where we had to bring a little bit of language semantics knowledge into the lexical analyzer.
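The include-aware handling of < described above can be sketched roughly as follows. This is a minimal illustration, not the actual CLI lexer: the token_type, token, scan_delimited, and tokenize names are hypothetical, identifiers are scanned with plain C++ rules, and every other character (digits included) becomes a one-character punctuation token.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token kinds; the real lexer may classify tokens differently.
enum class token_type { keyword, identifier, string_lit, path_lit, punctuation };

struct token
{
  token_type type;
  std::string text;
};

// Scan a literal that starts at s[i] and runs up to and including the
// closing delimiter. An unterminated literal simply runs to the end of
// the input (this sketch does no error reporting).
std::string scan_delimited (const std::string& s, std::size_t& i, char close)
{
  assert (i < s.size ());
  std::size_t b (i++); // Skip the opening delimiter.
  while (i < s.size () && s[i] != close)
    ++i;
  if (i < s.size ())
    ++i; // Consume the closing delimiter.
  return s.substr (b, i - b);
}

std::vector<token> tokenize (const std::string& s)
{
  std::vector<token> r;
  bool after_include (false); // True only right after the include keyword.

  for (std::size_t i (0); i < s.size (); )
  {
    unsigned char c (static_cast<unsigned char> (s[i]));

    if (std::isspace (c))
    {
      ++i;
      continue;
    }

    if (c == '"')
    {
      r.push_back ({token_type::string_lit, scan_delimited (s, i, '"')});
      after_include = false;
    }
    else if (c == '<' && after_include)
    {
      // The one place where language knowledge leaks into the lexer.
      r.push_back ({token_type::path_lit, scan_delimited (s, i, '>')});
      after_include = false;
    }
    else if (std::isalpha (c) || c == '_')
    {
      std::size_t b (i);
      while (i < s.size () &&
             (std::isalnum (static_cast<unsigned char> (s[i])) || s[i] == '_'))
        ++i;
      std::string id (s.substr (b, i - b));
      after_include = (id == "include");
      r.push_back (
        {after_include ? token_type::keyword : token_type::identifier, id});
    }
    else
    {
      r.push_back ({token_type::punctuation, std::string (1, s[i++])});
      after_include = false;
    }
  }

  return r;
}
```

With this sketch, include <foo/bar.hxx>; yields a single path literal token for the path, while the < in std::vector<int> is still returned as a separate punctuation token.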
Another interesting thing to note is the handling of option names. To be convenient to use, option names should allow various additional symbols such as -, /, etc., that are not allowed in C++ identifiers. Consider this option definition as an example:
class options { bool --help-me|-h|/h; };
The question is how to treat these additional symbols: as part of the option identifier or as separate tokens? Handling them as separate tokens presents the same problem as in the include path handling: the option name can be written in many different ways that all result in the same token sequence. Making these additional symbols part of an option identifier can be done in two ways. We can either recognize option names as a special kind of identifier or we can “relax” all the identifiers in the language to include these symbols. The first approach would be cleaner but is hard to implement. The lexer would need to recognize the places where option names can appear and scan them accordingly. Since there is no keyword to easily identify such places, the lexer would need to implement pretty much the same language parsing logic as the parser itself. This is a bit more semantics knowledge than I am willing to bring into the lexical analyzer.
This leaves us with the relaxed identifier option. One major drawback of this approach is the difficulty of handling expressions that involve -, /, etc. Consider this definition:
class options { int -a-b = -a-b; };
Semantically, the first -a-b is the option name while the second is an expression. In the initial version of the CLI compiler I decided not to support expressions other than literals (negative integer literals, such as -5, are supported by recognizing them as a special case). More complex expressions can be defined as constants in C++ headers and included into the CLI file, for example:
// types.hxx
//
const int a = 1;
const int b = 2;
const int a_b = -a-b;

// options.cli
//
include "types.hxx";

class options
{
  int -a-b = a_b;
};
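To make the relaxed identifier rule and the negative literal special case more concrete, here is a minimal sketch of how a single lexeme could be scanned. It is an illustration under assumptions rather than the actual implementation: the lexeme_type, lexeme, and scan names are hypothetical, and the set of extra identifier characters is limited to - and / for brevity.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical lexeme kinds used only by this sketch.
enum class lexeme_type { identifier, int_literal, other };

struct lexeme
{
  lexeme_type type;
  std::string text;
};

bool is_digit (char c)
{
  return std::isdigit (static_cast<unsigned char> (c)) != 0;
}

// A relaxed identifier may start with a letter, '_', '-', or '/'
// (assumed set; the real lexer may allow more symbols).
bool relaxed_start (char c)
{
  return std::isalpha (static_cast<unsigned char> (c)) != 0 ||
         c == '_' || c == '-' || c == '/';
}

bool relaxed_part (char c)
{
  return relaxed_start (c) || is_digit (c);
}

// Scan one lexeme starting at s[i] and advance i past it.
// Whitespace is not skipped; the caller is expected to do that.
lexeme scan (const std::string& s, std::size_t& i)
{
  assert (i < s.size ());

  std::size_t b (i);
  char c (s[i]);

  // Special case: '-' immediately followed by a digit starts a negative
  // integer literal rather than a relaxed identifier.
  bool neg (c == '-' && i + 1 < s.size () && is_digit (s[i + 1]));

  if (neg || is_digit (c))
  {
    if (neg)
      ++i; // Skip the sign.
    while (i < s.size () && is_digit (s[i]))
      ++i;
    return {lexeme_type::int_literal, s.substr (b, i - b)};
  }

  if (relaxed_start (c))
  {
    while (i < s.size () && relaxed_part (s[i]))
      ++i;
    return {lexeme_type::identifier, s.substr (b, i - b)};
  }

  ++i;
  return {lexeme_type::other, std::string (1, c)};
}
```

Under these rules -a-b is scanned as one identifier whether it appears as the option name or as the initializer, which is exactly why the initializer has to be written as a plain name such as a_b instead.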
In the future we can support full expressions by recognizing the places in the language where they can appear, that is, after = as well as between ( ) and < >. For example:
class options { int -a = 2-1; int -b (2-1); foo<2-1> -c; };
On the parser level, we will also need to tighten these relaxed identifiers back to the C++ subset for namespaces, classes, and option types.
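One way the parser could tighten a relaxed identifier back to the C++ subset is a simple per-character check along these lines. This is only a sketch: is_cxx_identifier is a hypothetical helper, it handles ASCII identifiers only, and it does not reject C++ keywords.

```cpp
#include <cctype>
#include <string>

// Return true if s is a plain (ASCII) C++ identifier: a letter or '_'
// followed by letters, digits, or '_'. Keywords are not rejected here.
bool is_cxx_identifier (const std::string& s)
{
  if (s.empty ())
    return false;

  unsigned char f (static_cast<unsigned char> (s[0]));
  if (!(std::isalpha (f) || f == '_'))
    return false;

  for (char c: s)
  {
    unsigned char u (static_cast<unsigned char> (c));
    if (!(std::isalnum (u) || u == '_'))
      return false;
  }

  return true;
}
```

The parser would then apply such a check to namespace names, class names, and the components of option types, while leaving option names such as -a-b relaxed.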
With the initial version of the lexer complete, the next thing to implement is the parser. I will let you know as soon as I have something working.