[xsd-users] dealing with xml written/read on-the-fly

Mon Oct 19 12:50:43 EDT 2009

Hiya Boris,

Boris Kolpackov wrote:
>> Boris Kolpackov wrote
>>> If the stream ends with EOF then the parser assumes there is no more
>>> data available. And if the document is incomplete, then you will get
>>> a parsing error. In your case, I guess, you will need to provide a
>>> custom std::istream implementation (or xercesc::InputSource -- that  
>>> could actually be easier) that doesn't return on EOF but instead keeps
>>> polling the file for more data (e.g., you could save the offset of the
>>> last byte read, wait some time, re-open the file, seek to that saved  
>>> offset, and see if there is more data). I assume you will need to
>>> implement this logic somewhere in the application in any case. With
>>> this approach it will just be in the stream.
>>>   
>>>       
>> I had a look at doing this, but I'm not happy about this direction.
>> Xerces buffers the file data, and if the buffer gets low, it reads  
>> ahead. This means there may be data available to xerces (in its buffer),  
>> but we're going to block on the file anyway.
>>     
> What actually happens is this: if the raw character buffer has less than 
> 100 bytes when Xerces-C++ tries to transcode the next batch of characters,
> then it will try to read some more. There is actually a technical reason 
> for this other than efficiency (it has to do with multi-byte encodings 
> and the buffer containing only some of the bytes constituting a code
> point).
>   
Indeed.
> Because Xerces-C++ won't keep trying to read more if the stream returned
> less than 100 bytes, one way to mitigate this would be to return the data 
> from BinInputStream::readBytes() in small chunks. If you return it one byte
> at a time, there will be no buffering at all.
>   
Eugh - that's horrible! :-)
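
For reference, here is roughly what the polling idea boils down to. This
is only a sketch with no Xerces dependency: PollingFileReader and
read_bytes() are made-up names, and read_bytes() merely mimics the shape
of a readBytes()-style callback. It re-opens the file on every call,
seeks to the saved offset and hands back at most max_chunk bytes. A real
stream would sleep and retry instead of returning 0, since (as far as I
understand) a zero-length read is what tells the parser the input ended.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper, not part of Xerces-C++ or XSD.
class PollingFileReader
{
public:
  // max_chunk caps how many bytes a single call may return.
  PollingFileReader (const std::string& path, std::size_t max_chunk = 1)
      : path_ (path), offset_ (0), max_chunk_ (max_chunk)
  {
    assert (max_chunk_ > 0);
  }

  // Copy up to min (n, max_chunk) bytes found past the saved offset into
  // buf. Returns the number of bytes copied, or 0 if the file has not
  // grown (or could not be opened) since the previous call.
  std::size_t read_bytes (char* buf, std::size_t n)
  {
    // Re-open each time so we see data appended by the other process.
    std::ifstream in (path_.c_str (), std::ios::in | std::ios::binary);
    if (!in)
      return 0;

    in.seekg (static_cast<std::streamoff> (offset_), std::ios::beg);
    if (!in)
      return 0;

    const std::size_t want = n < max_chunk_ ? n : max_chunk_;
    in.read (buf, static_cast<std::streamsize> (want));

    const std::size_t got = static_cast<std::size_t> (in.gcount ());
    offset_ += got; // remember where to resume next time
    return got;
  }

  // Offset of the first byte that has not been handed out yet.
  std::size_t offset () const { return offset_; }

private:
  std::string path_;
  std::size_t offset_;
  std::size_t max_chunk_;
};
```

With max_chunk left at 1 this gives the one-byte-at-a-time behaviour
described above, at the cost of one open/seek per byte.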

>> Plus I would need to take a look at the data last read from the file (i.e.
>> in xerces buffer, or seek back in the file), to see if EOF has been reached 
>> correctly (closing tag has been read in).
>>     
> You mean you will need to check if "real" EOF has been reached, not the
> "fake" one ;-)? This is what happens when you try to "reuse" the same
> concept for different things.
Exactly.
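
To make the "real" EOF check concrete, a minimal sketch of what I have in
mind: look at the tail of the data read so far and see whether it ends
with the closing tag of the root element. ends_with_closing_tag() is a
made-up helper, and it assumes the closing tag is written exactly as
</name> and that only whitespace follows it (no trailing comments or
processing instructions).

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: returns true if `tail` (the most recently read
// data) ends with </root>, ignoring trailing whitespace.
bool ends_with_closing_tag (const std::string& tail, const std::string& root)
{
  assert (!root.empty ());

  const std::string tag = "</" + root + ">";

  // Position of the last non-whitespace character, if any.
  const std::string::size_type last = tail.find_last_not_of (" \t\r\n");
  if (last == std::string::npos)
    return false; // nothing but whitespace so far

  const std::string::size_type len = last + 1;
  if (len < tag.size ())
    return false; // too short to contain the closing tag

  return tail.compare (len - tag.size (), tag.size (), tag) == 0;
}
```

For the Valgrind output the root name would be passed in by the caller
(I believe it is "valgrindoutput", but treat that as an assumption).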

> I wonder if there is a better design for this? Can't you use a pipe or
> socket instead?
>   
Moving to a pipe/socket might be the only way to go, indeed.
I would _really_ have liked to use a file for debugging purposes: I 
don't trust the XML source (Valgrind) not to mess things up, and I 
wanted to allow users of my program to send me the XML file so I could 
reproduce the error.

>> If I can avoid it, I'd prefer not to work with separate threads at all  
>> (the above blocking read solution would need that). I imagined my Qt app  
>> could be the driver, with a loop to pull in the next (few) top level  
>> tags, and then update the GUI, and so on. This simplifies the whole  
>> setup, and keeps Qt in control.
>>     
> Hm, that's hard to achieve. You want to pass the data and query the next
> construct. Something like this:
>
> parser p;
> p.here_is_more_data (buf, n);
> chunk c = p.give_me_next_construct ();
>
> The problem is that you may not pass enough data so there is no construct 
> to return. While it is probably possible to implement an XML parser like
> this, it will complicate the design significantly since the parser must
> be prepared to stop parsing at any point, return control to the user and
> then resume parsing from that point again.
>   
What I did before with Qt3 was fairly straightforward: SAX reader, 
callbacks on the end-tags to construct a DOM model.
The Qt SAX parser gives 'parse' and 'parseContinue' functions, which 
keep track of the file position and buffer the XML data until it's 
handed off via the end-tag callback.
All works well, and is simple.
Unfortunately, there's just no binding, so updates to the XML protocol 
are horrible to maintain  :-(
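
To show the kind of "feed data, ask for the next construct" interface
discussed above, here is a toy splitter that hands out the direct
children of the root element one at a time and simply reports "not yet"
when it runs out of data. It is a sketch only: ChunkSplitter, feed() and
next() are made-up names, it is not a real XML parser, and it assumes
there are no comments, CDATA sections or '>' characters inside attribute
values.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical push-style splitter (not Qt, Xerces-C++ or XSD code).
class ChunkSplitter
{
public:
  ChunkSplitter () : pos_ (0), start_ (0), depth_ (0) {}

  // Append n more bytes of the document.
  void feed (const char* buf, std::size_t n)
  {
    assert (buf != 0 || n == 0);
    if (n != 0)
      buffer_.append (buf, n);
  }

  // Store the next complete child of the root element in `out` and
  // return true, or return false if more data is needed. The scan state
  // is kept, so the caller can feed() more data and call next() again.
  bool next (std::string& out)
  {
    for (;;)
    {
      const std::string::size_type lt = buffer_.find ('<', pos_);
      if (lt == std::string::npos)
      {
        pos_ = buffer_.size (); // only text so far, skip it next time
        return false;
      }

      const std::string::size_type gt = buffer_.find ('>', lt);
      if (gt == std::string::npos)
      {
        pos_ = lt; // tag is cut off, rescan it once more data arrives
        return false;
      }
      pos_ = gt + 1;

      const char c = buffer_[lt + 1];
      if (c == '?' || c == '!')
        continue; // XML declaration or DOCTYPE

      const bool closing = (c == '/');
      const bool empty = !closing && buffer_[gt - 1] == '/';

      if (closing)
      {
        if (--depth_ != 1)
          continue; // did not close a direct child of the root
      }
      else
      {
        if (depth_ == 1)
          start_ = lt; // a direct child of the root starts here
        if (!empty)
        {
          ++depth_;
          continue;
        }
        if (depth_ != 1)
          continue; // empty element, but not a direct child
      }

      // A direct child of the root ends at gt: hand it out and drop
      // everything up to and including it from the buffer.
      out.assign (buffer_, start_, gt + 1 - start_);
      buffer_.erase (0, gt + 1);
      pos_ = 0;
      return true;
    }
  }

private:
  std::string buffer_;          // data not yet handed out
  std::string::size_type pos_;   // where scanning resumes
  std::string::size_type start_; // start of the current child element
  int depth_;                    // number of currently open elements
};
```

A driver loop in the GUI thread would call feed() whenever new data
shows up and then call next() until it returns false.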

>> Qt solves this EOF problem by returning an UnexpectedEOF error, but makes
>> this recoverable, so we can continue parsing.
>>
>> From what I understand from the docs and source code, XSD / Xerces don't 
>> (yet) support recovery from this?
>>     
> No, and probably never will. I don't think such "EOF overloading" is a 
> very common practice (or good design, for that matter).
>   
Fair enough, although I'm not sure you understand - Qt4 doesn't use EOF 
overloading: just as Xerces does, the parser throws the error, but it 
isn't _fatal_, and is easy to recover from. One just needs to handle 
that EOF exception, wait for more data, and continue parsing.
>> If they do, how is this possible, and is this a way forward?
>>     
> I think the way forward would be to lower the chunk size returned by
> readBytes() as suggested above. If Qt must be in control, then I don't
> see any way to achieve this other than using a separate thread.
>
> I would also suggest that you use something other than a file to
> communicate the data between the two processes so that you don't 
> need to play this real/fake EOF game.
>   
Ok, I will ponder upon this a little more.

>> P.S. Do you have plans to make a xml binder for the Qt parsers? ;-)
>>     
>
> We may implement the "Qt/Tree" mapping one day which will use the "Qt 
> way of doing things", including XML parsers. But there are no immediate
> plans.
>
> Boris
>   
Thanks Boris,
Cerion