[xsd-users] dealing with xml written/read on-the-fly

Boris Kolpackov boris at codesynthesis.com
Mon Oct 19 11:13:50 EDT 2009


Hi Cerion,

Cerion Armour-Brown <cerion at kestrel.ws> writes:

> Boris Kolpackov wrote:
>
> > If the stream ends with EOF then the parser assumes there is no more
> > data available. And if the document is incomplete, then you will get
> > a parsing error. In your case, I guess, you will need to provide a
> > custom std::istream implementation (or xercesc::InputSource -- that  
> > could actually be easier) that doesn't return on EOF but instead keeps
> > polling the file for more data (e.g., you could save the offset of the
> > last byte read, wait some time, re-open the file, seek to that saved  
> > offset, and see if there is more data). I assume you will need to
> > implement this logic somewhere in the application in any case. With
> > this approach it will just be in the stream.
> >   
>
> I had a look at doing this, but I'm not happy about this direction.
> Xerces buffers the file data, and if the buffer gets low, it reads  
> ahead. This means there may be data available to xerces (in its buffer),  
> but we're going to block on the file anyway.

What actually happens is this: if the raw character buffer has fewer than
100 bytes left when Xerces-C++ tries to transcode the next batch of
characters, then it will try to read some more. There is actually a
technical reason for this besides efficiency: with multi-byte encodings
the buffer may contain only some of the bytes that make up a code point.
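
To illustrate the multi-byte point: when a buffer ends partway through a
UTF-8 sequence, the last code point cannot be transcoded until the
remaining bytes arrive. The following is only a sketch and is not taken
from Xerces-C++; the function name incomplete_tail is hypothetical, and it
assumes the input is otherwise well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Xerces-C++): returns how many bytes at
// the end of buf belong to a UTF-8 sequence whose remaining bytes have not
// arrived yet. Returns 0 if the buffer ends on a code point boundary.
std::size_t incomplete_tail (const std::string& buf)
{
  std::size_t n = buf.size ();

  // Walk back from the end until we find the lead byte of the last sequence.
  for (std::size_t back = 1; back <= 4 && back <= n; ++back)
  {
    unsigned char c = static_cast<unsigned char> (buf[n - back]);

    if ((c & 0xC0) == 0x80)
      continue; // Continuation byte (10xxxxxx), keep looking.

    // Lead byte: work out how long the sequence is supposed to be.
    std::size_t len = (c & 0x80) == 0x00 ? 1
                    : (c & 0xE0) == 0xC0 ? 2
                    : (c & 0xF0) == 0xE0 ? 3
                    : (c & 0xF8) == 0xF0 ? 4
                    : 1; // Invalid lead byte; treat it as a single byte.

    return back < len ? back : 0;
  }

  return 0; // Empty buffer, or no lead byte found in the last 4 bytes.
}
```

A non-zero result means the transcoder has to wait for (or read) more
bytes before it can produce the final character.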

Because Xerces-C++ won't keep trying to read more if the stream returned
fewer than 100 bytes, one way to mitigate this would be to return the data
from readBytes() (on the BinInputStream that your InputSource creates) in
small chunks. If you return it one byte at a time, there will be no
buffering at all.


> Plus I would need to take a look at the data last read from the file (i.e.
> in xerces buffer, or seek back in the file), to see if EOF has been reached 
> correctly (closing tag has been read in).

You mean you will need to check if "real" EOF has been reached, not the
"fake" one ;-)? This is what happens when you try to "reuse" the same
concept for different things. I wonder if there is a better design for
this? Can't you use a pipe or socket instead?


> If I can avoid it, I'd prefer not to work with separate threads at all  
> (the above blocking read solution would need that). I imagined my Qt app  
> could be the driver, with a loop to pull in the next (few) top level  
> tags,  and then update the GUI, and so on. This simplifies the whole  
> setup, and keeps Qt in control.

Hm, that's hard to achieve. You want to pass the data and query the next
construct. Something like this:

parser p;
p.here_is_more_data (buf, n);
chunk c = p.give_me_next_construct ();

The problem is that you may not pass enough data so there is no construct 
to return. While it is probably possible to implement an XML parser like
this, it will complicate the design significantly since the parser must
be prepared to stop parsing at any point, return control to the user and
then resume parsing from that point again.
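
To make the shape of such an interface concrete, here is a toy sketch. It
is not an XML parser: a "construct" is just a newline-terminated record
standing in for an XML element, and the class name push_splitter is
hypothetical. The function names follow the pseudo-code above.

```cpp
#include <cstddef>
#include <string>

// Toy push-style "parser". The caller feeds it data as it arrives and
// asks for the next complete construct, which may not be available yet.
class push_splitter
{
public:
  // Append newly arrived bytes to whatever has not been consumed yet.
  void here_is_more_data (const char* buf, std::size_t n)
  {
    pending_.append (buf, n);
  }

  // Returns true and fills out if a complete record is available.
  // Returns false if more data is needed; nothing is consumed then.
  bool give_me_next_construct (std::string& out)
  {
    std::string::size_type p = pending_.find ('\n');

    if (p == std::string::npos)
      return false; // Not enough data yet.

    out.assign (pending_, 0, p);
    pending_.erase (0, p + 1); // Drop the record and its terminator.
    return true;
  }

private:
  std::string pending_; // State carried over between calls.
};
```

Here the only state that has to survive between calls is a byte buffer.
A real XML parser would have to save and restore its whole parsing state
(open elements, partially read tokens, and so on) at any point, which is
the complication described above.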


> Qt solves this EOF problem by returning an UnexpectedEOF error, but makes
> this recoverable, so we can continue parsing.
>
> From what I understand from the docs and source code, XSD / Xerces don't 
> (yet) support recovery from this?

No, and probably never will. I don't think such "EOF overloading" is a 
very common practice (or good design, for that matter).


> If they do, how is this possible, and is this a way forward?

I think the way forward would be to lower the chunk size returned by
readBytes() as suggested above. If Qt must be in control, then I don't
see any way to achieve this other than using a separate thread.

I would also suggest that you use something other than a file to
communicate the data between the two processes so that you don't 
need to play this real/fake EOF game.


> P.S. Do you have plans to make an XML binder for the Qt parsers? ;-)

We may implement the "Qt/Tree" mapping one day which will use the "Qt 
way of doing things", including XML parsers. But there are no immediate
plans.

Boris


