[xsd-users] Non-xml data in attributes with xsd:string,
xsd:normalizedString
Bill Pringlemeir
bpringle at sympatico.ca
Fri Jun 19 13:18:54 EDT 2009
On 22 Feb 2009, boris at codesynthesis.com wrote:
> Bill Pringlemeir <bpringle at sympatico.ca> writes:
>> It appears that 'tab', etc are escaped. I guess that there is no
>> allowable XML character reference for some values. I thought that
>> only 'null'/zero would be disallowed.
> XML 1.0 disallows quite a few control characters. XML 1.1 only
> disallows zero.
I have used the FAQ,
http://wiki.codesynthesis.com/Tree/FAQ
section "3.2 How do I serialize a Xerces-C++ DOM document to XML?" to
specify XML 1.1 as the destination version. This works well.
>> So it seems if the data contains this range, you must use
>> 'base64Binary'?
> Correct. Or XML 1.1 which is supported by Xerces-C++.
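For reference, the XML 1.0 versus XML 1.1 difference discussed above can
be sketched as a few predicates. This is only an illustration; the
function names are mine and are not part of XSD or Xerces-C++. The ranges
are taken from the "Char" and "RestrictedChar" productions of the two
recommendations.

```cpp
// Hypothetical helpers (not part of XSD or Xerces-C++).

// XML 1.0 "Char": tab, LF, CR and everything from 0x20 up,
// minus surrogates and 0xFFFE/0xFFFF.
bool is_xml10_char(unsigned long cp) {
    return cp == 0x9 || cp == 0xA || cp == 0xD ||
           (cp >= 0x20 && cp <= 0xD7FF) ||
           (cp >= 0xE000 && cp <= 0xFFFD) ||
           (cp >= 0x10000 && cp <= 0x10FFFF);
}

// XML 1.1 "Char": everything except zero, surrogates and 0xFFFE/0xFFFF.
bool is_xml11_char(unsigned long cp) {
    return (cp >= 0x1 && cp <= 0xD7FF) ||
           (cp >= 0xE000 && cp <= 0xFFFD) ||
           (cp >= 0x10000 && cp <= 0x10FFFF);
}

// XML 1.1 "RestrictedChar": allowed in a 1.1 document, but only when
// written as a character reference such as &#x1;.
bool is_xml11_restricted(unsigned long cp) {
    return (cp >= 0x1 && cp <= 0x8) || cp == 0xB || cp == 0xC ||
           (cp >= 0xE && cp <= 0x1F) ||
           (cp >= 0x7F && cp <= 0x84) ||
           (cp >= 0x86 && cp <= 0x9F);
}
```

So a value such as 0x01 fails the XML 1.0 test, passes the XML 1.1 test,
and falls in the restricted set that has to be escaped.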
There are still some issues if one just converts to XML 1.1. It seems
that 'transcode()' converts to either UTF-8 or UTF-16. Are these the
default encodings for the XMLCh* strings that Xerces expects?
The 'transcode()' method throws exceptions for a large number of
(perhaps all) characters at or above 0x80. These byte values are the
lead and continuation bytes of UTF-8 multi-byte sequences. There is no
documentation on this in section 4.4 of the user manual (i.e., the
transcode() exceptions).
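To illustrate why a single 8-bit character can be rejected, here is a
minimal sketch of a UTF-8 well-formedness check. It assumes the input is
being decoded as UTF-8; is_valid_utf8() is a hypothetical helper written
for this message, not the actual transcode() implementation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if s is well-formed UTF-8.
bool is_valid_utf8(const std::string& s) {
    // Smallest code point that legitimately needs 2, 3 or 4 bytes.
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    std::size_t i = 0;
    const std::size_t n = s.size();
    while (i < n) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t len;
        unsigned long cp;
        if (b < 0x80)                { len = 1; cp = b; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
        else return false;             // stray continuation or bad lead byte
        if (i + len > n) return false; // sequence is cut short
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false; // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (len > 1 && cp < min_cp[len]) return false;  // overlong form
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogate
        if (cp > 0x10FFFF) return false;                // out of range
        i += len;
    }
    return true;
}
```

With this check, the ISO-8859-1 string "caf\xE9" is rejected because 0xE9
announces a three-byte sequence that never arrives, while the UTF-8 form
"caf\xC3\xA9" is accepted.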
I have read the FAQ on encodings. I think that there are several
different places that encodings can be used.
std::string -> locale or straight binary?
xml_schema::string -> UTF-8 or identical to std::string.
If I am using the generated accessors and place '8-bit' data there, is
it the caller's responsibility to replace these by character
references? The transcode() functionality seems to interpret them as
UTF-8 multi-byte sequences.
Xerces -> UTF-8 or UTF-16, dependent on wchar_t.
resultant XML -> as specified in the XML header.
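If the '8-bit' data is really ISO-8859-1 and the string passed to the
accessors is expected to be UTF-8 (both are assumptions on my part), one
option is to convert before storing the value. A minimal sketch, with
latin1_to_utf8() being a hypothetical helper and not an XSD function:

```cpp
#include <string>

// Hypothetical helper: re-encode ISO-8859-1 bytes as UTF-8.
// Each Latin-1 byte value equals its Unicode code point, so bytes
// 0x80..0xFF always become a two-byte UTF-8 sequence.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        if (b < 0x80) {
            out += static_cast<char>(b);                 // plain ASCII
        } else {
            out += static_cast<char>(0xC0 | (b >> 6));   // lead byte
            out += static_cast<char>(0x80 | (b & 0x3F)); // continuation
        }
    }
    return out;
}
```

For example, latin1_to_utf8("caf\xE9") yields "caf\xC3\xA9". This only
helps when the data is text in a known encoding; arbitrary binary data
would still need something like 'base64Binary'.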
I guess that Xerces mandates a UTF-? encoding. Does Xerces then
convert to the specified encoding, for instance 'US-ASCII' [called
'USASCII' by XSD information, see **1]? It is still unclear to me if
this is simply a string put in the XML header or if Xerces actually
performs translation.
Is the best tack to create a DOM tree and walk it, converting all
characters above 0x7f to character references, or can one
replace/extend the transcode functionality to do this? Is this the
most effective approach if the source and destination for the XML are
in the same locale and binary preservation is a goal?
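What I have in mind for the conversion is roughly the following. It is
only a sketch: escape_non_ascii() is a hypothetical helper, and it
assumes the bytes are ISO-8859-1 so that each byte value is also its
code point.

```cpp
#include <string>

// Hypothetical helper: replace every byte above 0x7f with a hexadecimal
// character reference, leaving ASCII untouched.
std::string escape_non_ascii(const std::string& latin1) {
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (std::string::size_type i = 0; i < latin1.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(latin1[i]);
        if (b < 0x80) {
            out += static_cast<char>(b);
        } else {
            out += "&#x";
            out += hex[b >> 4];
            out += hex[b & 0x0F];
            out += ';';
        }
    }
    return out;
}
```

For example, escape_non_ascii("caf\xE9") gives "caf&#xE9;". One caveat:
this would have to be applied to the serialized text, since a string
stored in a DOM text node would have its '&' escaped again as '&amp;'
on output.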
Thanks,
Bill Pringlemeir.
** 1 The IANA recommendation is to use 'US-ASCII', and this seems to
be popular with other XML decoders. Where is 'USASCII' from? It seems
to be a de facto encoding name and is not specified by W3C.
--
Little girls, like butterflies need no excuses. - Robert Heinlein