Validating against external schemas in Xerces-C++
One of the recurring questions on the Xerces-C++ mailing lists is this: given an XML document and a schema as two separate files, how do you validate one against the other? The examples that come with Xerces-C++ are of no help. They only support validation if the schema file is specified in the XML document with the xsi:schemaLocation or xsi:noNamespaceSchemaLocation attributes.
If you think about it, the schemaLocation attributes are quite useless as a location specification mechanism. The schemas that the application is going to use are, in 99.9% of cases, bundled with the application itself. Otherwise, what use is it to know that the XML the application is about to process is valid with regard to some schema? We need to make sure it is valid against the schema that the application was built to handle. So in essence we are asking the producers of XML documents to embed location information that points to schemas that are installed somewhere with our application. Doesn’t make much sense, does it?
Most production applications would use the following strategy instead: (1) pre-load the schemas that are supplied with the application, and (2) disable loading of schemas specified using any other methods (e.g., the schemaLocation attributes). The second part of this strategy is there to remove security concerns that arise when an application tries to load something specified by an untrusted source.
So how does one achieve this with Xerces-C++? All the parsers that come with Xerces-C++ (DOM, SAX, SAX2) include an API that allows you to load schemas from the filesystem and then use them for validation during parsing. There is also a way to disable loading of any external schemas. To demonstrate all this, I have created two simple command-line programs that load a set of schemas and then validate a set of XML files against them. The SAX2 version is load-grammar-sax.cxx and the DOM one is load-grammar-dom.cxx. To build them, compile and link with the Xerces-C++ library.