[xsd-users] Non-xml data in attributes with xsd:string, xsd:normalizedString

Bill Pringlemeir bpringle at sympatico.ca
Thu Feb 19 11:17:56 EST 2009


On 19 Feb 2009, boris at codesynthesis.com wrote:

> Bill Pringlemeir <bpringle at sympatico.ca> writes:

>> Harness code with the likes of,

>> normTest copy("file"); //...
>> copy.e1().append("\t\t\t\t\t\tno\ttabs?");   
>> copy.a1().append("\t\t\t\t\t\tno\ttabs?");

>> Can generate serialization errors (actually parser errors on the other
>> side, but it was actually an error to generate invalid XML unless it
>> is program error not to put non-printables in the std::string?).

> I tried the above fragment with XSD 3.2.0 and both Xerces-C++ 2.8.0
> and 3.0.0. I get the following XML:

> <normTest a1="0&#x9;&#x9;&#x9;&#x9;&#x9;&#x9;no&#x9;tabs?">
> <e1>file no tabs?</e1>
> </normTest>

> Is this not what you expect to get? Which version of Xerces-C++ are you
> using?

You are correct.  I think that I had put an invalid character entity
like "&#x9" without a semi-colon and didn't pay attention to the
Xerces message (and believed it was the same problem).

I am using XSD 3.2.0 and Xerces-C++ 3.0.0.  And this is the actual case,

 copy.e1().append("\x4");   
 copy.a1().append("\x4");

It appears that 'tab', etc are escaped.  I guess that there is no
allowable XML character reference for some values.  I thought that
only 'null'/zero would be disallowed.

I see here,

 http://www.w3.org/TR/REC-xml/#NT-Char

that XML has this restriction.  In theory even '\0' can be put in a
std::string; so it appears that decimal values of 0-8,11,12,14-31 are
not allowed in the strings.  It appears that 'CDATA' also does not
allow escaping of these values.
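
As a rough sketch of that range check (the function names here are
hypothetical and not part of XSD or Xerces-C++), something like the
following could scan a std::string for the disallowed values.  It only
covers the control-character range listed above, not the other code
points the XML Char production excludes, and it assumes the string is
ASCII or UTF-8, where bytes below 0x20 can only be these characters.

```cpp
#include <string>

// Hypothetical helper: true for the control values that XML 1.0
// forbids (decimal 0-8, 11, 12, 14-31).  Tab (9), line feed (10)
// and carriage return (13) are allowed.
inline bool is_forbidden_xml10_control(unsigned char c)
{
  return c < 0x20 && c != 0x09 && c != 0x0A && c != 0x0D;
}

// Hypothetical helper: index of the first forbidden byte in s,
// or std::string::npos if there is none.
inline std::string::size_type
find_forbidden_xml10_control(const std::string& s)
{
  for (std::string::size_type i = 0; i < s.size(); ++i)
    if (is_forbidden_xml10_control(static_cast<unsigned char>(s[i])))
      return i;

  return std::string::npos;
}
```

A caller could test the result against std::string::npos and throw or
log before the value ever reaches the serializer.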

So it seems if the data contains this range, you must use
'base64Binary'?  A problem is that the serializer doesn't bother to
tell you that the value is illegal.  If we are scanning for character
entity escaping, can't an exception be thrown when this value range
is encountered?  That would be more friendly in that the error would
be closer to the source (as opposed to the opposite side, which will
retrieve the bad XML).  Alternatively the values could be stripped,
selectable via a flag to the serializer or a class-specific modifier.

I.e.,

  copy.a1().stripInvalidXml(true);  // or some other simple_type method.
  copy.e1().stripInvalidXml(true);

Sorry, my mistake with the 'semi-colon' confused me and made my
original report nebulous.  The invalid stripping could just as well be
accomplished by user code, so perhaps an 'everything is stripped' or
an exception is the best approach; especially as the modifier probably
implies extra memory overhead.  I will try to strip, in my code, the
values that come from 'untrusted sources'.  This is difficult to
guarantee, and an XSD/serializer mechanism would be nicer.
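
For what it is worth, the user-code stripping could look roughly like
this.  It is only a sketch: strip_invalid_xml and forbidden_in_xml10
are hypothetical names, not XSD or Xerces-C++ functions, and only the
control values 0-8, 11, 12 and 14-31 are removed.

```cpp
#include <algorithm>
#include <string>

// Hypothetical predicate: true for the control values that XML 1.0
// does not allow (everything below 0x20 except tab, LF and CR).
static bool forbidden_in_xml10(char ch)
{
  const unsigned char c = static_cast<unsigned char>(ch);
  return c < 0x20 && c != 0x09 && c != 0x0A && c != 0x0D;
}

// Hypothetical helper: returns a copy of s with the forbidden
// control values removed (erase-remove idiom).
inline std::string strip_invalid_xml(std::string s)
{
  s.erase(std::remove_if(s.begin(), s.end(), forbidden_in_xml10),
          s.end());
  return s;
}
```

With that, the harness line would become something like
copy.a1().append(strip_invalid_xml(untrusted)); where 'untrusted' is
whatever std::string came from outside.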

Thanks,
Bill Pringlemeir.

-- 
How  do we  make long-term  thinking automatic  and common  instead of
difficult and rare? - Stewart Brand



