[xsd-users] How to get around invalid characters in UTF-8 string

Boris Kolpackov boris at codesynthesis.com
Fri Mar 18 09:42:51 EDT 2011


Hi,

Homer J S <js.homer at yahoo.com> writes:

> Is there a way to get the parser to bypass those characters, strip them 
> out, or replace them with something else?

There is no out-of-the-box support for this. And I agree with Florian
that this is better handled before XML parsing, since "bypassing",
"stripping", and "replacing" can be very application-specific. Also note
that such stripping can render the resulting XML malformed (e.g., by
removing '<' from a closing tag).

The best way to do this would be to filter the input by providing a
custom input stream (e.g., an implementation of std::istream or
xercesc::InputSource; the latter is probably easier). In this
implementation you can either use some existing library or validate
and "correct" UTF-8 yourself. You can base this on the 'compression'
example from the XSD distribution which uses this technique to inflate
compressed XML on the fly.
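
As a rough illustration of the "validate and correct UTF-8 yourself"
option, here is a minimal sketch of the filtering step on its own. The
function name sanitize_utf8 and the "?" replacement are hypothetical
choices for this example, not part of XSD or Xerces-C++. It checks only
UTF-8 well-formedness; it does not check that the decoded characters are
allowed by XML, and it assumes the whole buffer is available at once. A
streaming input source would also have to carry over a sequence that is
split across two reads.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not an XSD or Xerces-C++ API): return a copy of
// 'in' in which every byte that does not begin a well-formed UTF-8
// sequence is replaced with 'repl'. Well-formed sequences are copied
// through unchanged.
std::string
sanitize_utf8 (const std::string& in, const std::string& repl = "?")
{
  std::string out;
  out.reserve (in.size ());

  for (std::size_t i = 0; i < in.size (); )
  {
    unsigned char c = static_cast<unsigned char> (in[i]);
    std::size_t len = 0;         // Expected sequence length, 0 if invalid.
    unsigned long cp = 0;        // Decoded code point.
    unsigned long min = 0;       // Smallest code point valid for 'len'.

    if (c < 0x80)                { len = 1; cp = c; }
    else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; min = 0x80; }
    else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; min = 0x800; }
    else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; min = 0x10000; }

    // Invalid lead byte or sequence truncated by the end of the buffer.
    bool ok = len != 0 && i + len <= in.size ();

    // Each trailing byte must have the form 10xxxxxx.
    for (std::size_t j = 1; ok && j < len; ++j)
    {
      unsigned char t = static_cast<unsigned char> (in[i + j]);

      if ((t & 0xC0) != 0x80)
        ok = false;
      else
        cp = (cp << 6) | (t & 0x3F);
    }

    // Reject overlong forms, UTF-16 surrogates, and values above U+10FFFF.
    if (ok && (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)))
      ok = false;

    if (ok)
    {
      out.append (in, i, len);
      i += len;
    }
    else
    {
      out += repl;  // Replace only the offending byte, then resynchronize.
      ++i;
    }
  }

  return out;
}
```

Because only the single offending byte is replaced, a stray lead byte in
front of '<' does not swallow the '<' itself. In a custom input source
you would apply something like this to the data before handing it to the
parser.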

Boris
