[xsd-users] Returning data by criteria

Wed Feb 17 08:50:57 EST 2010

Hi Bidski,

Bidski <bidski at bigpond.net.au> writes:

> I was wondering if it is possible to selectively return certain data 
> from a xml file, the same as you would if you were using SQL statements 
> with a database. As an example, say this is our xml file.
> 
> <?xml version="1.0"?>
> <people>
>     <person> 
>         <first-name>John</first-name>
>         <last-name>Doe</last-name>
>         <gender>male</gender>
>         <age>32</age>
>     </person>
> 
>     <person>
>         <first-name>Jane</first-name>
>         <last-name>Doe</last-name>
>         <gender>female</gender>
>         <age>28</age>
>     </person>
> </people>
> 
> Is it possible to just get the data pertaining to John Doe's record? 
> Or to return, say, the age of every male that is listed in the xml 
> file (assuming that we had more records). I know that with the 
> cxx/parser you can retrieve the data of specific fields (i.e. age 
> and gender only) but from what I can tell you still have to get 
> that information for every record. Am I wrong here? Is it possible 
> to return only the information pertaining to 1 (or less than all) 
> records?

With C++/Parser you can do some selective parsing, though the XML
will still need to be parsed completely: there is no way to parse
only the records that contain certain data.

In essence, C++/Parser performs two operations: (1) the XML is
parsed into a stream of "raw" events (element, attribute, text),
and (2) these raw events are converted to vocabulary-specific
types (e.g., the age string "28" is converted to the int 28) and
the resulting data is dispatched to the callbacks. There is no way
to get rid of the first step, but you can avoid the second step for
data that you don't need by not providing parsers for the
corresponding elements and attributes. To use your example of
getting the age of every male, you would provide parsers only for
the age and gender elements and leave first-name and last-name
without parsers.
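
To illustrate the two steps, here is a minimal, self-contained sketch.
It is not the C++/Parser API or generated code: raw_event, dispatch,
male_ages, sample_events, and the "/person" end-of-record marker are
all hypothetical names made up for this example. The event list stands
in for step (1), which always runs over the whole document, while the
handler map stands in for step (2), where only elements with a
registered handler are converted.

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for one "raw" event produced by step (1).
struct raw_event
{
  std::string name; // element name, or "/person" for end of a record
  std::string text; // unconverted character data, e.g. "32"
};

using handler = std::function<void (const std::string&)>;

// Step (2): every event is visited, but conversion work is done only
// for names that have a handler. Everything else is skipped.
void dispatch (const std::vector<raw_event>& events,
               const std::map<std::string, handler>& handlers)
{
  for (const raw_event& e : events)
  {
    auto i = handlers.find (e.name);

    if (i != handlers.end ())
      i->second (e.text);
  }
}

// Collect the age of every male. No handlers are registered for
// first-name and last-name, so their text is never touched.
std::vector<int> male_ages (const std::vector<raw_event>& events)
{
  std::vector<int> result;
  std::string gender;
  int age = 0;

  std::map<std::string, handler> handlers;

  handlers["gender"] = [&] (const std::string& t) { gender = t; };
  handlers["age"] = [&] (const std::string& t) { age = std::stoi (t); };
  handlers["/person"] = [&] (const std::string&)
  {
    if (gender == "male")
      result.push_back (age);

    gender.clear (); // reset state for the next record
    age = 0;
  };

  dispatch (events, handlers);
  return result;
}

// The events that the two records from the original question would
// produce in this simplified model.
std::vector<raw_event> sample_events ()
{
  return {
    {"first-name", "John"}, {"last-name", "Doe"},
    {"gender", "male"}, {"age", "32"}, {"/person", ""},
    {"first-name", "Jane"}, {"last-name", "Doe"},
    {"gender", "female"}, {"age", "28"}, {"/person", ""}
  };
}
```

Calling male_ages (sample_events ()) returns a vector containing the
single value 32. Note that the loop in dispatch still sees all ten
events, which mirrors the point above: the selection happens during
dispatch, not during the XML parsing itself.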

> My problem is that each record in my xml file has a minimum of 39 
> fields (with an extra 4 fields being able to occur anywhere from 
> 0-????? times) and, currently, I have about 1,800 records. I 
> originally tried using the cxx/tree method, just to see how it 
> worked, with the results being slow and very memory consuming 
> (this file in particular took up about 25MB memory) and caused 
> me to re-shuffle some of my variables to avoid stack overflows. 
> Now I am using the cxx/parser method, which has reduced memory 
> consumption down to about 3MB or so, but the parsing is still 
> sluggish (takes roughly 4-7 seconds to parse the file, depending 
> on how many fields I am interested in).

That's unusually slow, assuming the document is about 25MB. There 
are several things you can do to improve this:

1. Compile both Xerces-C++ and your application with optimization
   turned on (I assume you are already doing this, but just to make
   sure).

2. Disable XML Schema validation.

3. If you cannot disable validation and you plan to parse more
   than one document, consider pre-loading the schemas. For a
   guide on how to do this see the 'performance' example in
   the examples/cxx/parser/ directory. 

4. Consider using Expat instead of Xerces-C++ as the underlying
   XML parser. Expat is a bit faster than Xerces-C++, though note
   that with Expat there is no support for wchar_t as the
   character type.

Let us know if this helps.

Boris