Intuitive explanation of the Monty Hall problem

May 11th, 2008

Yesterday I went to see 21, in which one of the scenes brings up the Monty Hall problem: there are three doors, behind which are a car and two goats. You choose a door with the goal of winning the car. The host then opens one of the remaining doors, revealing a goat. The question is whether it is to your advantage to switch your choice to the other unopened door. The answer is yes (and yes, while watching the movie I thought it did not matter). When asked to explain why it is a good idea to change the door, the main character utters some gibberish about all the variables being changed, etc. Afterwards I checked this problem out on Wikipedia (follow the link above), which gave a few rigorous proofs making it clear that switching does indeed increase your chances of winning the car from 1/3 to 2/3. While the formal proofs do their job just fine, I always prefer to have an intuitive feeling for why a seemingly counter-intuitive answer is actually correct. In this case I wanted to understand what changes once the host opens one of the doors: what extra information is added that makes the difference.

The extra information comes from the part of the rules which says that the host has to reveal a goat. If you initially selected a door with a goat behind it, the host is forced to eliminate the other goat: he cannot open the door you have selected and he cannot reveal where the car is. In other words, there are two possible outcomes:

  • if you selected a goat then the remaining door hides the car
  • if you selected the car then the remaining door hides a goat

The probability of initially selecting a goat is 2/3 (two doors out of three hide goats) and of selecting the car is 1/3. Thus it is more likely that you will first select a goat rather than the car. And in this more likely case the host is forced to single out the door which hides the car. Changing your selection therefore gives you a better chance of winning the car.

Note also that the probability of your initial choice being the car remains 1/3 even after the host has opened one of the doors. It is the probability of the other remaining door hiding the car that changes (from 1/3 to 2/3), due to the rules of the game forcing the host not to reveal the car.
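
For those who, like me, want extra reassurance, here is a quick Monte Carlo sketch (my own check, not part of the Wikipedia proofs) that simulates a large number of rounds and counts how often staying with the initial choice and how often switching would win; the two frequencies should come out close to 1/3 and 2/3, respectively:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int
main ()
{
  srand (static_cast<unsigned int> (time (0)));

  const int rounds = 1000000;
  int stay_wins = 0;
  int switch_wins = 0;

  for (int i = 0; i < rounds; ++i)
  {
    int car = rand () % 3;    // Door hiding the car.
    int choice = rand () % 3; // Our initial choice.

    // Staying wins only if the initial choice is the car. Switching
    // wins exactly when the initial choice is a goat: the host is
    // then forced to reveal the other goat, leaving the car behind
    // the remaining door.
    //
    if (choice == car)
      ++stay_wins;
    else
      ++switch_wins;
  }

  printf ("stay:   %.3f\n", static_cast<double> (stay_wins) / rounds);
  printf ("switch: %.3f\n", static_cast<double> (switch_wins) / rounds);

  return 0;
}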

Virtual inheritance overhead in g++

April 17th, 2008

By now every C++ engineer worth her salt knows that virtual inheritance is not free. It incurs object code, runtime (both CPU and memory), as well as compilation time and memory overheads (for an in-depth discussion of how virtual inheritance is implemented in C++ compilers, see “Inside the C++ Object Model” by Stanley Lippman). In this post I would like to consider the object code as well as the compilation time and memory overheads, since in modern C++ implementations these are normally sacrificed for runtime speed and can present major surprises. Unlike existing studies on this subject, I won’t bore you with “academic” metrics such as per-class or per-virtual-function overhead, or with synthetic tests. Such metrics and tests have two main problems: they don’t give a feeling for the overhead experienced by real-world applications, and they don’t factor in the extra code necessary to make up for the lack of functionality otherwise provided by virtual inheritance.

It is hard to come by a non-trivial application that can provide the same functionality with and without virtual inheritance. I happened to have access to such an application, and what follows is a quick description of the problem virtual inheritance was used to solve. I will then present some measurements of the overhead, obtained by comparing this implementation to one providing the same functionality without virtual inheritance.

The application in question is XSD/e, a validating XML parser/serializer generator for embedded systems. Given a definition of an XML vocabulary in XML Schema, it generates a parser skeleton (a C++ class) for each type defined in that vocabulary. Types in XML Schema can derive from each other, and if two types are related by inheritance it is often desirable to be able to reuse the base parser implementation in the derived one. To support this requirement, the current implementation of XSD/e uses the C++ mixin idiom, which relies on virtual inheritance:

// Parser skeletons. Generated by XSD/e.
//
struct base
{
  virtual void
  foo () = 0;
};
 
struct derived: virtual base
{
  virtual void
  bar () = 0;
};
 
// Parser implementations. Hand-written.
//
struct base_impl: virtual base
{
  virtual void
  foo ()
  {
    ...
  }
};
 
struct derived_impl: virtual derived,
                     base_impl
{
  virtual void
  bar ()
  {
    ...
  }
};

This approach works well but we quickly found out that for large vocabularies with hundreds of types the resulting object code produced by g++ was unacceptably large. Furthermore, on a schema with a little more than a thousand types, g++ with optimization turned on (-O2) runs out of memory on a machine with 2GB of RAM.

After some analysis we determined that virtual inheritance was to blame. To resolve this problem we have developed an alternative, delegation-based implementation reuse method (it will appear in the next release of XSD/e) that is almost as convenient to use as the mixin approach (largely because all the support code is automatically generated by the XSD/e compiler). The idea behind the delegation-based approach is illustrated in the following code fragment:

// Parser skeletons. Generated by XSD/e.
//
struct base
{
  virtual void
  foo () = 0;
};
 
struct derived: base
{
  derived (base* impl)
    : impl_ (impl)
  {
  }
 
  virtual void
  bar () = 0;
 
  virtual void
  foo ()
  {
    assert (impl_);
    impl_->foo ();
  }
 
private:
  base* impl_;
};
 
// Parser implementations. Hand-written.
//
struct base_impl: base
{
  virtual void
  foo ()
  {
    ...
  }
};
 
struct derived_impl: derived
{
  derived_impl ()
    : derived (&base_impl_)
  {
  }
 
  virtual void
  bar ()
  {
    ...
  }
 
private:
  base_impl base_impl_;
};
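
To make the dispatch path concrete, here is a self-contained toy version of the delegation-based approach with the elided bodies replaced by simple printouts (the printouts are stand-ins, not the actual XSD/e validation and parsing code). Calling foo () on a derived_impl instance ends up in base_impl::foo () through the impl_ pointer:

#include <stdio.h>
#include <assert.h>

struct base
{
  virtual
  ~base () {}

  virtual void
  foo () = 0;
};

struct derived: base
{
  derived (base* impl)
    : impl_ (impl)
  {
  }

  virtual void
  bar () = 0;

  virtual void
  foo ()
  {
    assert (impl_);
    impl_->foo (); // Delegate to the base implementation.
  }

private:
  base* impl_;
};

struct base_impl: base
{
  virtual void
  foo ()
  {
    printf ("base_impl::foo\n");
  }
};

struct derived_impl: derived
{
  derived_impl ()
    : derived (&base_impl_) // The member is constructed later; only its
                            // address is taken here, which is fine.
  {
  }

  virtual void
  bar ()
  {
    printf ("derived_impl::bar\n");
  }

private:
  base_impl base_impl_;
};

int
main ()
{
  derived_impl p;
  derived& d = p;

  d.foo (); // Ends up in base_impl::foo () via the delegation.
  d.bar (); // Dispatched directly to derived_impl::bar ().

  return 0;
}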

The test executable for the above-mentioned thousand-type schema, optimized for size (-Os) and stripped, is 15MB when built using virtual inheritance. It takes 19 minutes to build, with the C++ compiler’s peak memory usage reaching 1.6GB. For comparison, the same executable built using the delegation-based approach is 3.7MB in size, takes 14 minutes to build, and peaks at 348MB of compiler memory. That’s right, the executable is 4 times smaller. Note also that the generated parser skeletons are not just a bunch of pure virtual function signatures: they include XML Schema validation, data conversion, and dispatch code. The measurements also showed that the runtime performance of the two reuse approaches is about the same (most likely because g++ performs a similar delegation under the hood, except that it has to handle all possible use cases, hence the object code overhead).

End user or development-oriented build system?

March 24th, 2008

I spent the past three weeks working on Xerces-C++ 3.0.0, which uses an automake-based build system. Our own projects here at Code Synthesis all use the build system called build. The work on Xerces-C++ made me realize just how awkward automake-based build systems are to develop with. It also made me realize that most build systems can be placed into one of two categories: those optimized for the end user and those optimized for development (the Boost build system is a notable exception, for it is a pain to use both for end users and, I suspect, for the Boost developers).

The primary goal of an end user-oriented build system is to make one-off builds from scratch as straightforward as possible. Because the user may choose to build the software on any platform, and because installing additional tools is an inconvenience, the following requirements are imposed on user-oriented build systems:

  • Support for a wide range of platforms
  • Reliance on the least common denominator in the tools and features used

On the other hand, the primary goal of a development-oriented build system is to make the common development tasks as easy and fast as possible. This translates to the following requirements:

  • Ease of adding/removing files from the build
  • Complete dependency tracking for fast incremental builds

To realize how big a difference a development-oriented build system can make, let’s examine the fairly common development task of implementing a new feature in a library and adding a test for it. Assuming we have already made the changes in the library source code and added the directory with the new test, here is the list of steps required in an automake-based project:

  1. Add the new test directory to the higher-level Makefile.am
  2. Add the new test Makefile.am to configure.ac
  3. Run the bootstrapping script to generate configure, Makefile.in, etc.
  4. Run configure
  5. Run make in the library source directory to update the library
  6. Run make in the test directory

Instead of the last two steps one can run make in the top-level directory, which will update the library, update (or at least relink) all the tests and examples, and finally run all the tests. In my experience, some people prefer this method because, while it takes longer, it requires less manual work and ensures that everything the test may depend on is up to date. In contrast, here is the list of steps required in a build-based project:

  1. Add the new test directory to the higher-level Makefile
  2. Run make in the test directory

The last step automatically updates the library as well as any other parts of the project on which this test depends and which are out of date.

The steps in the build-based project take barely a tenth of the time required in the automake-based project. Someone may say that adding a new test is not a very frequent task in most projects. Let’s then consider another common task: making a change in the library source code and running a specific test. For automake the list is as follows:

  1. Run make in the library source directory to update the library
  2. Run make in the test directory

As in the previous example, instead of these two steps some people prefer to simply run make check from the top-level directory. The equivalent in the build-based project is a single step:

  1. Run make in the test directory

The automake steps take at least several times longer to complete, and much longer still if make is run from the top-level directory. In my experience these delays result in far fewer development iterations on a project, as well as a reluctance to make changes that are not absolutely necessary (e.g., code quality improvements).

It is clear that the constraints imposed by the two orientations are often incompatible: the development-oriented build system requires powerful tools while the user-oriented one requires us not to depend on anything but the bare minimum.

It is hard to say which kind of build system a project should prefer if the goal is to be successful. On the one hand, if the build system slows development to a crawl, you are unlikely to produce something worth using in a reasonable time. On the other hand, if potential users are bogged down by the numerous build-time dependencies that your project imposes, they are less likely to try it.

Another alternative, which we use in some of our projects, is to provide two parallel build systems. The obvious drawback of this approach is the need to maintain both. In our case the second build system is provided only for a small subset of the project (the examples), which helps minimize the negative impact of this approach.

A natural improvement on the two build systems idea is a development-oriented build system that can automatically generate makefiles for the end-user build system. Note that this is not the same as the solution offered by some build system generators (for example, CMake and MPC), since the overhead of running the generator every time a file is added to or removed from the project makes them much less suitable as development-oriented build systems.