partial in-memory streaming example (was: [xsd-users] XSD 3.3.0 released)

Thu Apr 29 10:40:17 EDT 2010

Hi Eric,

Eric Niebler <eric at boostpro.com> writes:

> I got very excited when I read this because my application is running
> out of memory when serializing large DOM's. I had a look at this
> example, but sadly it doesn't look like it will solve my problem.
> 
> IIUC, you're doing incremental DOM serialization by loading a shell for
> the root, and then serializing children of the root one at a time. That
> works for large documents that are flat; that is, most of their content
> are direct children of the root. But what about deeply nested documents?
> Is there a way to incrementally parse/serialize such a document using
> the C++/Tree mapping?

It is a bit more work but the principle is the same. Most (if not all)
large documents have repetitive elements that account for most of their
size. The idea is to get down to that repetitive "level" and then
parse/serialize one chunk at a time.

If your document looks like this:

<root>
  <nested>
    <level>
      <data>...</data>
        ...
      <data>...</data>
    </level>
  </nested>
</root>

Then you would construct the parser (the parser here is
vocabulary-specific) to handle the <root>, <nested>, and <level>
elements internally and start returning <data> elements as document
fragments.
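
To make the control flow concrete, here is a toy sketch that uses only
the standard library. It is not the XSD streaming example itself and it
is not a real XML parser: fragment_reader and its members are
hypothetical names, and the scanner assumes plain markup (no comments,
CDATA, processing instructions, '>' inside attribute values, whitespace
inside end tags, or <data> nested inside <data>). It consumes the
wrapper elements internally and returns one <data> element at a time:

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a vocabulary-specific streaming parser.
class fragment_reader
{
public:
  fragment_reader (std::istream& is, std::string name)
    : is_ (is), name_ (std::move (name))
  {
    assert (!name_.empty ());
  }

  // Read the next fragment into out; return false when there are no more.
  bool next (std::string& out)
  {
    std::string tag;
    while (read_tag (tag))
    {
      if (!is_start (tag))
        continue; // wrapper element or end tag: handled internally

      out = tag;
      if (tag[tag.size () - 2] == '/') // empty element, e.g. <data/>
        return true;

      // Accumulate everything up to and including the matching end tag.
      const std::string end ("</" + name_ + ">");
      char c;
      while (is_.get (c))
      {
        out += c;
        if (out.size () >= end.size () &&
            out.compare (out.size () - end.size (), end.size (), end) == 0)
          return true;
      }
      return false; // truncated input
    }
    return false;
  }

private:
  // Skip character data, then read the next "<...>" tag.
  bool read_tag (std::string& tag)
  {
    char c;
    while (is_.get (c) && c != '<') ;
    if (!is_)
      return false;

    tag = "<";
    while (is_.get (c))
    {
      tag += c;
      if (c == '>')
        return true;
    }
    return false;
  }

  // True if tag is a start tag for name_: "<name" followed by '>', '/',
  // or whitespace.
  bool is_start (const std::string& tag) const
  {
    if (tag.compare (1, name_.size (), name_) != 0)
      return false;
    const char d (tag[1 + name_.size ()]);
    return d == '>' || d == '/' ||
           d == ' ' || d == '\t' || d == '\n' || d == '\r';
  }

  std::istream& is_;
  std::string name_;
};
```

The caller would loop on next() and hand each fragment to the generated
parsing code, so only one <data> object is in memory at a time. Because
the wrappers are skipped rather than matched against a fixed path, the
same loop keeps working when the repetitive content shows up in more
than one place in the document.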

With serialization, you will have to recreate the 

<root>
  <nested>
    <level>

and 

    </level>
  </nested>
</root>

fragments manually and serialize <data> one at a time. This will
also work if your document looks like this:

<root>
  <nested>
    <level>
      <data>...</data>
        ...
      <data>...</data>
    </level>
  </nested>

  <other>
    ...
  </other>

  <nested>
    <level>
      <data>...</data>
        ...
      <data>...</data>
    </level>
  </nested>
</root>

That is, you have multiple places with repetitive content.
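
The serialization side can be sketched in the same spirit. The class
below is hypothetical (chunked_writer is not part of XSD or Xerces-C++)
and uses only the standard library. It writes the enclosing start and
end tags by hand and emits one already-serialized <data> fragment per
call. It assumes the element names need no escaping and ignores the XML
declaration, namespaces, and attributes:

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper that recreates the wrapper elements manually.
class chunked_writer
{
public:
  explicit chunked_writer (std::ostream& os): os_ (os) {}

  // Write a start tag by hand and remember it for close().
  void open (const std::string& name)
  {
    indent ();
    os_ << '<' << name << ">\n";
    stack_.push_back (name);
  }

  // Write one already-serialized fragment, e.g. a single <data> element.
  void write (const std::string& fragment)
  {
    assert (!stack_.empty ());
    indent ();
    os_ << fragment << '\n';
  }

  // Write the end tag for the innermost open element.
  void close ()
  {
    assert (!stack_.empty ());
    const std::string name (stack_.back ());
    stack_.pop_back ();
    indent ();
    os_ << "</" << name << ">\n";
  }

private:
  void indent () { os_ << std::string (2 * stack_.size (), ' '); }

  std::ostream& os_;
  std::vector<std::string> stack_;
};
```

In a real application the string passed to write() would presumably
come from serializing a single <data> object on its own, so the
intermediate DOM never holds more than one chunk. For the second
document shape, you would call open("nested") and open("level") again
after closing the first pair, while <root> stays open throughout.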

> Aside: it looks like the XSD serialization routines are first
> serializing XSD DOM to xerces DOM, and then using xerces routines to
> serialize this to XML. Serializing to xerces DOM is where my app runs
> out of memory -- sad because it seems like this step could be eliminated
> entirely if there were some way to go straight from XSD DOM to XML.
> Thoughts?

Yes, that is theoretically possible, though (1) there is some flexibility
in being able to "post-process" the DOM before writing it out to XML and
(2) your case belongs to a (small) subset of situations where the
document is still small enough to fit into memory as either the object
model or the DOM, but not both. The much bigger set of cases is where
the document simply cannot fit into memory in either representation.
The streaming approach tries to address this more general set of
problems. Also, I don't think it will be long before your case migrates
to the bigger set ;-).

Boris