[xsd-users] streaming example: memory consumption increases over time

Boris Kolpackov boris at codesynthesis.com
Tue May 25 11:01:34 EDT 2010


Hi Erik,

Erik Sjölund <erik.sjolund at gmail.com> writes:

> I tried making a very big (250 MB) version of
> 
> examples/cxx/tree/streaming/position.xml
> 
> I then ran
> ./driver position.xml 2> /dev/null
> 
> and saw that the memory consumption of the "driver" process increases 
> over time.
> 
> Maybe there is some deallocation missing?

This has to do with the way XML Schema validation is implemented in
Xerces-C++. If you disable validation (pass false as the last argument
to start() on line 63), then the memory used by the application stays
constant.

Some more background on the validation case: in Xerces-C++ the parser
and validator are separate entities (there are actually two validator
implementations: DTD and XML Schema). The parser collects information
describing a document fragment and then, at certain points, calls
the validator to validate that part of the document. Relevant to our
case is the information describing the content model: the parser
creates a list of the nested elements seen and then passes this list
to the validator when it sees the closing tag of the outer element.
If an XML document contains a long sequence of elements, this list
can grow quite large, and that is exactly what happens in our case.
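
To make the mechanism concrete, here is a toy model of it in C++. It
is only a sketch: ContentModelTracker and its functions are
hypothetical names invented for illustration and are not Xerces-C++
classes. It shows one index being recorded per nested element and the
list being released only when the outer element is closed.

```cpp
#include <cstddef>
#include <vector>

// Toy model only; hypothetical type, not part of Xerces-C++.
struct ContentModelTracker
{
  // One entry per nested element seen so far (an index into an
  // element pool).
  std::vector<std::size_t> children;

  // Called for every nested element inside the open outer element.
  void child_seen (std::size_t pool_index)
  {
    children.push_back (pool_index);
  }

  // Called at the closing tag of the outer element. A real validator
  // would check the list against the content model here. Returns the
  // number of entries that were held, then releases the list.
  std::size_t close_outer ()
  {
    std::size_t n (children.size ());
    std::vector<std::size_t> ().swap (children); // Free the storage.
    return n;
  }

  // Payload bytes currently held by the list.
  std::size_t bytes_held () const
  {
    return children.size () * sizeof (std::size_t);
  }
};
```

In this model bytes_held() grows linearly with the number of nested
elements and drops back to zero only after close_outer() is called.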

Now the XML presented in the streaming example is quite extreme in
that it contains a large number of position elements, each with very
little data. For example, I created a 400 MB document and it took
10M position elements to reach this size. Most real-world documents
will have fewer elements and more data.

On my 64-bit GNU/Linux box, the example used about 80 MB of extra
memory to parse this 400 MB document. This translates to about
8 bytes per element, which is about right, since the element list
mentioned above contains an index into the element pool and on
64-bit machines the index uses 8 bytes.
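
The estimate can be checked with a few lines of C++. This is just the
arithmetic from the paragraph above; list_bytes is a hypothetical
helper, and megabytes are taken as 10^6 bytes.

```cpp
#include <cstdint>

// Memory held by a list that stores one index per element.
constexpr std::uint64_t
list_bytes (std::uint64_t elements, std::uint64_t index_size)
{
  return elements * index_size;
}

// 10M position elements with an 8-byte index each: 80,000,000 bytes,
// that is, about 80 MB.
static_assert (list_bytes (10000000, 8) == 80000000,
               "10M elements * 8 bytes = 80 MB");
```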

There are a number of ways this memory usage can be reduced, which
is something we may do for the next release of Xerces-C++.

Boris


