[xsd-users] Non-xml data in attributes with xsd:string, xsd:normalizedString

Fri Jun 19 13:18:54 EDT 2009

On 22 Feb 2009, boris at codesynthesis.com wrote:

> Bill Pringlemeir <bpringle at sympatico.ca> writes:

>> It appears that 'tab', etc are escaped.  I guess that there is no
>> allowable XML character reference for some values.  I thought that
>> only 'null'/zero would be disallowed.

> XML 1.0 disallows quite a few control characters. XML 1.1 only 
> disallows zero.

I have used the FAQ,

 http://wiki.codesynthesis.com/Tree/FAQ

section "3.2 How do I serialize a Xerces-C++ DOM document to XML?" to
specify XML 1.1 as the output version.  This works well.

>> So it seems if the data contains this range, you must use
>> 'base64Binary'?

> Correct. Or XML 1.1 which is supported by Xerces-C++.

There are still some issues if one just switches to XML 1.1.  It seems
that 'transcode()' converts to either UTF-8 or UTF-16.  Are these the
default encodings for the XMLCh* strings that Xerces expects?

The 'transcode()' method throws exceptions for a large number of
(perhaps all) characters above 0x7f.  Bytes in this range are the lead
and continuation bytes of UTF-8 multi-byte sequences.  Section 4.4 of
the user manual does not document this (i.e., the transcode()
exceptions).
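
A minimal sketch of why arbitrary 8-bit data tends to be rejected when
it is read as UTF-8.  The helper name is_valid_utf8 is hypothetical; it
is not part of Xerces-C++ or XSD and does not reproduce what
transcode() does internally.  It only checks the lead/continuation
byte structure:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `s` has valid UTF-8 byte structure.
// Simplified: it does not reject overlong 3/4-byte forms or surrogates.
bool is_valid_utf8(const std::string& s)
{
  std::size_t i = 0;
  while (i < s.size())
  {
    unsigned char c = static_cast<unsigned char>(s[i]);
    std::size_t extra;  // continuation bytes expected after this lead byte

    if (c < 0x80)                    extra = 0;  // plain ASCII
    else if (c >= 0xC2 && c <= 0xDF) extra = 1;  // 2-byte sequence
    else if (c >= 0xE0 && c <= 0xEF) extra = 2;  // 3-byte sequence
    else if (c >= 0xF0 && c <= 0xF4) extra = 3;  // 4-byte sequence
    else return false;  // 0x80..0xC1 and 0xF5..0xFF cannot start a sequence

    if (s.size() - i <= extra)
      return false;  // sequence is truncated

    for (std::size_t k = 1; k <= extra; ++k)
    {
      unsigned char t = static_cast<unsigned char>(s[i + k]);
      if ((t & 0xC0) != 0x80)
        return false;  // not a continuation byte (10xxxxxx)
    }
    i += extra + 1;
  }
  return true;
}
```

With this check, the ISO-8859-1 bytes "caf\xE9" fail (0xE9 announces a
3-byte sequence that never arrives), while the UTF-8 bytes
"caf\xC3\xA9" pass.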

I have read the FAQ on encodings.  I think that there are several
different places that encodings can be used.

  std::string         -> locale or straight binary?
  xml_schema::string  -> UTF-8 or identical to std::string.

If I am using the generated accessors and place '8-bit' data there, is
it the caller's responsibility to replace these with entities?  The
transcode() functionality seems to interpret them as UTF-8 multi-byte
sequences.

  Xerces              -> UTF-8 or UTF-16, dependent on wchar_t.
  resultant XML       -> as specified in xml header.
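
To make the layering concrete, here is a sketch of one possible
conversion step between the first two layers.  It assumes (this is
unconfirmed, and is exactly what is being asked) that
xml_schema::string carries UTF-8, and that the std::string holds
ISO-8859-1 bytes, so each byte value equals its Unicode code point.
The name latin1_to_utf8 is hypothetical:

```cpp
#include <string>

// Hypothetical helper: re-encode ISO-8859-1 bytes as UTF-8.
// Bytes below 0x80 are copied; bytes 0x80..0xFF become two-byte sequences.
std::string latin1_to_utf8(const std::string& in)
{
  std::string out;
  out.reserve(in.size() * 2);  // worst case: every byte doubles

  for (std::string::size_type i = 0; i < in.size(); ++i)
  {
    unsigned char c = static_cast<unsigned char>(in[i]);
    if (c < 0x80)
    {
      out += static_cast<char>(c);
    }
    else
    {
      out += static_cast<char>(0xC0 | (c >> 6));    // lead byte: 110000xx
      out += static_cast<char>(0x80 | (c & 0x3F));  // continuation: 10xxxxxx
    }
  }
  return out;
}
```

Under these assumptions every input byte maps to a distinct code point
in U+0000..U+00FF, so the original bytes could be recovered after
parsing by reversing the mapping.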

I guess that Xerces mandates a UTF-? encoding.  Does Xerces then
convert to the specified encoding, for instance 'US-ASCII' [called
'USASCII' by XSD information **1]?  It is still unclear to me if this
is simply a string put in the XML header or if Xerces actually
performs the translation.

Is the best tack to create a DOM tree and walk it, converting all
characters above 0x7f to entities, or can one replace/extend the
transcode functionality to do this?  Is this the most effective way
to do it if the source and destination for the XML are in the same
locale and binary preservation is a goal?
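
A sketch of the per-string conversion being asked about, independent
of Xerces.  It assumes ISO-8859-1 input, so that a byte value can be
used directly as the code point in a numeric character reference.  The
name escape_non_ascii is hypothetical:

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: rewrite every byte above 0x7f as an XML numeric
// character reference (e.g. 0xE9 -> "&#xE9;").  Assumes ISO-8859-1 input,
// where the byte value equals the Unicode code point.
std::string escape_non_ascii(const std::string& in)
{
  std::string out;
  out.reserve(in.size());

  for (std::string::size_type i = 0; i < in.size(); ++i)
  {
    unsigned char c = static_cast<unsigned char>(in[i]);
    if (c > 0x7f)
    {
      char buf[8];  // "&#xFF;" plus the terminating NUL needs 7 bytes
      std::snprintf(buf, sizeof buf, "&#x%X;", static_cast<unsigned>(c));
      out += buf;
    }
    else
    {
      out += static_cast<char>(c);  // ASCII passes through unchanged
    }
  }
  return out;
}
```

Note that a DOM serializer would normally escape the '&' of such a
reference again when it appears in a text node, so a function like
this only applies to text that is written out directly.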

Thanks,
Bill Pringlemeir.

** 1 The IANA recommendation is to use 'US-ASCII', and this seems to
   be popular with other XML decoders.  Where is 'USASCII' from?  It
   seems to be a de facto encoding name and is not specified by W3C.

-- 
Little girls, like butterflies need no excuses.  - Robert Heinlein