libstudxml – modern XML API for C++

My talk at this year’s C++Now was about an XML API for modern C++. An API that I believe should have already been in Boost or even in the C++ standard library. Presenting an API without an implementation would be rather lame, so during my talk I also announced libstudxml, which is an open source (MIT), compact, external dependency-free, and reasonably efficient XML library for modern, standard C++. In other words, a library that you can use in pretty much any project and on any platform without much fuss.

A piece of code is worth a thousand words, so let me give you a taste of the API. For this XML:

<person id="123">
  <name>John Doe</name>
  <age>23</age>
  <gender>male</gender>
</person>

The parsing code could look like this:

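enum class gender {...};

ifstream ifs (argv[1]);
parser p (ifs, argv[1]); // Stream plus document name (used in diagnostics).

// Expect the root element and declare its content model.
p.next_expect (parser::start_element, "person", content::complex);

long id = p.attribute<long> ("id"); // Attribute value converted to long.

string n = p.element ("name");           // Element text as a string.
short a = p.element<short> ("age");      // Element text converted to short.
gender g = p.element<gender> ("gender"); // Converted to a user-defined type.

p.next_expect (parser::end_element); // person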

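And that’s with all the validation necessary for this XML vocabulary. But I don’t see any exceptions being thrown, you might say. And that’s exactly the point: the checks happen inside calls such as next_expect() and element(), which throw if the input does not match what was expected. Here is the list of interesting features this API has: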
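To make the "no visible throw" idea concrete, here is a minimal, self-contained sketch of how an expectation-checking pull interface can work. It is not the libstudxml implementation: toy_parser, token, and event are hypothetical names, and the input is a pre-tokenized list rather than real XML.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy types for illustration only; not the libstudxml classes.
enum class event { start_element, end_element, characters, eof };

struct token
{
  event type;
  std::string text; // Element name or character data.
};

class toy_parser
{
public:
  explicit toy_parser (std::vector<token> tokens)
      : tokens_ (std::move (tokens)) {}

  // Pull the next token; report eof once the input is exhausted.
  token next ()
  {
    if (pos_ == tokens_.size ())
      return token {event::eof, ""};
    return tokens_[pos_++];
  }

  // Pull the next token and throw unless it is the expected event.
  void next_expect (event e)
  {
    token t (next ());
    if (t.type != e)
      throw std::runtime_error ("unexpected event");
  }

  // As above, but also check the element name.
  void next_expect (event e, const std::string& name)
  {
    token t (next ());
    if (t.type != e || t.text != name)
      throw std::runtime_error ("expected element '" + name + "'");
  }

  // Parse <name>text</name> and return the text.
  std::string element (const std::string& name)
  {
    next_expect (event::start_element, name);
    token t (next ());
    if (t.type != event::characters)
      throw std::runtime_error ("expected text in '" + name + "'");
    next_expect (event::end_element);
    return t.text;
  }

private:
  std::vector<token> tokens_;
  std::size_t pos_ = 0;
};
```

With this shape, the calling code stays a straight sequence of expectations, and a single try/catch around it is enough to handle any mismatch.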

Streaming pull parser and streaming serializer
Two-level API: minimum overhead low-level & more convenient high-level
Content model-aware (empty, simple, complex, mixed)
Whitespace processing based on content model
Validation based on content model
Validation of missing/extra attributes
Validation of unexpected events (elements, etc)
Data extraction to value types
Attribute map with extended lifetime (high-level API)

The XML parser in libstudxml is a conforming, non-validating XML 1.0 implementation that is based on tested and proven code (see Implementation Notes for details). A lot of people ask me why not use one of the new, claimed to be super fast and/or compact XML libraries for C++ that are already out there (RapidXML, PugiXML, TinyXML, etc)? The main reason is that they are not real, as in conforming, XML parsers. I discuss why you should stick to real XML parsers in my talk. Hopefully the videos will be posted soon.

Interested? For more information on the API you can jump directly to the Introduction which shows a lot of examples. Or you can grab and build the source code distribution from the libstudxml project page. On Unix, building the library is a matter of ./configure && make. On Windows, projects/solutions are provided for VC++ 9, 10, 11, and 12. There are also quite a few interesting examples inside the distribution.

This entry was posted on Tuesday, May 20th, 2014 at 2:03 pm and is filed under XML, C++.

4 Responses to “libstudxml – modern XML API for C++”

Arseny Kapoulkine Says:
May 21st, 2014 at 10:02 am
As an author of pugixml, I can’t help but comment on the “real XML parser” thing… Your link is very misleading - you stumbled upon a parser that can’t read content from valid XML at all, CDATA of all things. Great.

Fast/compact XML libraries solve real problems. As long as a parser can read valid XML disregarding DTD entities, it’s applicable to 99% of real-world problems.

If I were to consider libstudxml vs pugixml in a project, I would *not* use this as a distinction point. A much more important thing is that all three parsers that you specified are DOM, and this parser is pull-based - so depending on the task at hand one is more applicable.

Boris Kolpackov Says:
May 21st, 2014 at 10:30 am
Arseny, as I mentioned in my post, I discuss the issue of real XML parsers in my talk in much more detail, so I suggest that you check it out when the video is available. In a nutshell, the argument boils down to this: The intended use of XML is as a data interchange format, not just a data storage format. If your application is the sole producer and consumer of the data, then you might as well choose a more natural and efficient format than XML. So assuming we use XML for data interchange, while your code may not use any of the CDATA’s or DTD’s, it is only a matter of time before someone sends you a perfectly valid XML that your application won’t be able to parse. In fact, most of the “subset” parsers, including pugixml, don’t even document what happens when valid but unsupported XML constructs are encountered. Are they ignored? Is there an error? Crash? Nobody knows. In fact, you don’t even document that your XML parser only supports a subset of XML, which is what I find misleading.

So in my talk I suggested that people don’t corner themselves and instead stick to real XML parsers. There are plenty of conforming and fast implementations out there. And not a single person in the audience raised your “but it works in 99% of use cases” objection.

Regarding the in-memory vs streaming API argument (which is also covered in the talk extensively), most people think they need DOM but I think this is just because of the really bad streaming APIs that were available up to this point. So I tried to convince the audience that streaming is actually sufficient for the majority of cases. Plus, it is easy to go from streaming to in-memory but not the other way around. In fact, libstudxml has the ‘hybrid’ example which shows how to do hybrid, partially streaming/partially in-memory parsing and serialization.

DeadMG Says:
May 21st, 2014 at 3:09 pm
Why on earth does the parser need both the filename and the stream? Shouldn’t it only need the stream?

And how do you convert from “male” to gender::male? C++ does not support reflection.

Boris Kolpackov Says:
May 21st, 2014 at 3:16 pm
The parser needs the document name for diagnostics (error messages will look like “input.xml:12:23 …”).
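A stream carries no name of its own, which is why a name has to be passed separately. As a rough sketch of how such a diagnostic could be assembled (parse_error is a hypothetical type, and the exact text after the position is an assumption, not the library's documented format):

```cpp
#include <stdexcept>
#include <string>

// Hypothetical exception type for illustration; not the libstudxml class.
class parse_error: public std::runtime_error
{
public:
  // Build a "name:line:column: error: description" message.
  parse_error (const std::string& name,
               unsigned long line,
               unsigned long column,
               const std::string& description)
      : std::runtime_error (name + ":" + std::to_string (line) + ":" +
                            std::to_string (column) + ": error: " +
                            description) {}
};
```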

As for conversion of “male” to gender, that’s a good question that is answered in the documentation.
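For readers who want the general idea without the documentation at hand: one common way to get this kind of conversion without reflection is a traits class template that is specialized per type. The sketch below is self-contained and uses hypothetical names (value_traits, element_value, and the female enumerator are assumptions for illustration); the library's actual customization point may differ.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Enumerators are assumed for this sketch; the post elides them.
enum class gender { male, female };

// Generic conversion: rely on operator>> for the target type.
template <typename T>
struct value_traits
{
  static T parse (const std::string& s)
  {
    std::istringstream is (s);
    T v;
    // Require a successful read that consumes the whole string.
    if (!(is >> v && is.eof ()))
      throw std::invalid_argument ("invalid value '" + s + "'");
    return v;
  }
};

// Specialization: spell out the string-to-enumerator mapping by hand.
template <>
struct value_traits<gender>
{
  static gender parse (const std::string& s)
  {
    if (s == "male")
      return gender::male;
    if (s == "female")
      return gender::female;
    throw std::invalid_argument ("invalid gender '" + s + "'");
  }
};

// Hypothetical helper standing in for a typed accessor like element<T>().
template <typename T>
T element_value (const std::string& text)
{
  return value_traits<T>::parse (text);
}
```

The mapping is written once per type, and a typed accessor picks the right specialization at compile time, so a call such as element_value<gender> ("male") needs no reflection.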