[xsd-users] Non-xml data in attributes with xsd:string, xsd:normalizedString

Boris Kolpackov boris at codesynthesis.com
Fri Jun 26 15:13:28 EDT 2009


Hi Bill,

Bill Pringlemeir <bpringle at sympatico.ca> writes:
 
> There are still some issues, if one just converts to XML 1.1.  It seems
> that 'transcode()' converts to either UTF-8 or UTF-16.  Are these the
> default encoding for XMLChar* that Xerces expects?

I assume you are talking about xercesc::XMLString::transcode(). XMLCh
in Xerces-C++ always contains UTF-16. XMLString::transcode() converts
to the "local code page" encoding. What that means depends on the
operating system. On some platforms (e.g., UNIX-like) you can set
the local code page with a call to setlocale(). On other platforms
(e.g., Windows), the local code page is preset and cannot be changed.
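
As a rough sketch of the setlocale() route on a UNIX-like system: the
helper name set_local_code_page() below is made up for illustration and
is not part of Xerces-C++ or XSD. Whether a given Xerces-C++ build and
transcoder honors the change, and whether the call has to happen before
the library is initialized, is an assumption to verify on your platform.

```cpp
#include <cassert>
#include <clocale>

// Hypothetical helper: ask the C library to switch the character
// conversion locale (LC_CTYPE) to the named locale. Returns true if
// the C library accepted the name, false if the locale is unknown.
bool set_local_code_page(const char* name)
{
    return std::setlocale(LC_CTYPE, name) != nullptr;
}
```

The "C" locale is always available; any other locale name has to be
installed on the system, so the return value is worth checking.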

Then there are xsd::cxx::xml::transcode() (from XMLCh* to char* or
wchar_t*) and xsd::cxx::xml::transcode_to_xmlch() (from char* or
wchar_t* to XMLCh*; the xsd::cxx::xml::string class is normally
used instead of using this function directly). These two functions
assume that char* is UTF-8 unless the application is compiled with
the XSD_USE_LCP macro. See FAQ #1.2 for more information:

http://wiki.codesynthesis.com/Tree/FAQ


> The 'transcode()' method will throw exceptions for a large amount
> (perhaps all) characters over 0x80. These are the UTF-8 escapes.

Yes, by default (and when the character type is 'char') all the text
in the object model is expected to be in UTF-8.
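
Since invalid UTF-8 only shows up as an exception later on, it can help
to check a string before placing it in the object model. The following
is a minimal, stand-alone sketch; is_valid_utf8() is a name introduced
here for illustration and is not an XSD or Xerces-C++ function.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns true if s is well-formed UTF-8. Rejects stray or missing
// continuation bytes, truncated sequences, overlong forms, UTF-16
// surrogates, and code points above U+10FFFF.
bool is_valid_utf8(const std::string& s)
{
    std::size_t i = 0;
    const std::size_t n = s.size();

    while (i < n)
    {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len;
        unsigned long cp;

        if (c < 0x80)                { len = 1; cp = c; }
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else return false; // continuation byte or invalid lead byte

        if (i + len > n)
            return false; // sequence is cut off at the end of the string

        for (std::size_t k = 1; k < len; ++k)
        {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80)
                return false; // expected a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (len == 2 && cp < 0x80)    return false; // overlong
        if (len == 3 && cp < 0x800)   return false; // overlong
        if (len == 4 && cp < 0x10000) return false; // overlong
        if (cp > 0x10FFFF)            return false; // out of range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogate

        i += len;
    }

    return true;
}
```

A lone byte such as 0xE9 (Latin-1 'e' with acute accent) fails this
check, which matches the kind of input that makes transcode() throw.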


> There is no documentation on this in section 4.4 of the user manual
> (ie, the transcode() exceptions).

Will address for the next release, thanks.


> I have read the FAQ on encodings.  I think that there are several
> different places that encodings can be used.
> 
>   std::string         -> locale or straight binary?
>   xml_schema::string  -> UTF-8 or identical to std::string.

Not sure what you mean here. All the text data in the object model
is stored in types that are derived (directly or indirectly) from 
std::string. By default they all contain, and are expected to contain,
UTF-8-encoded text.


> If I am using the generated accessors and place '8-bit' data there,
> is it the caller's responsibility to replace these by entities?

No, entities are handled by the serializer. The caller's responsibility
is to place valid UTF-8 data in there. I guess in your situation, if
you can control the local code page, it might be more straightforward
to use XSD_USE_LCP and set the code page to something like US-ASCII.

 
> I guess that Xerces mandates a UTF-? encoding.  Does Xerces then
> convert to the specified encoding, for instance 'US-ASCII'?

There are three encodings at play here; the second is normally not
visible to the end-user, except in some situations:

1. The encoding in the object model. This is by default UTF-8.

2. The UTF-16 encoding used in Xerces-C++.

3. The encoding of the resulting XML document. This can be specified
   in the serialization function.

Conversion between (1) and (2) is performed by the object model,
between (2) and (3) -- by Xerces-C++.
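
To make the (1) to (2) step concrete, here is a stand-alone sketch of
a UTF-8 to UTF-16 conversion. It is an illustration only: in XSD this
conversion is done for you, utf8_to_utf16() is a name invented here,
and char16_t is used as a stand-in for XMLCh (whose underlying type
depends on how Xerces-C++ was built).

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode UTF-8 and re-encode as UTF-16. Throws std::invalid_argument
// on a bad lead byte, a missing continuation byte, a truncated
// sequence, or a value above U+10FFFF. For brevity it does not reject
// overlong forms or encoded surrogates.
std::u16string utf8_to_utf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    const std::size_t n = in.size();

    while (i < n)
    {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        std::size_t len;
        char32_t cp;

        if (c < 0x80)                { len = 1; cp = c; }
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > n)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k < len; ++k)
        {
            const unsigned char cc = static_cast<unsigned char>(in[i + k]);
            if ((cc & 0xC0) != 0x80)
                throw std::invalid_argument("invalid continuation byte");
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of range");

        if (cp < 0x10000)
        {
            // Fits in a single UTF-16 code unit.
            out.push_back(static_cast<char16_t>(cp));
        }
        else
        {
            // Needs a surrogate pair: high surrogate, then low.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }

        i += len;
    }

    return out;
}
```

The exceptions thrown here play the same role as the transcode()
exceptions discussed earlier: the input was not UTF-8 to begin with.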

 
> Is the best tack to create a DOM tree and walk this converting all
> strings to entity declarations above 0x7f or can one replace/extend
> the transcode functionality to do this?

No, the entities are handled automatically by the XML serializer.
All you need to do is provide the correctly encoded text in the 
object model.
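
For illustration, this is roughly what a serializer has to do when the
target encoding (e.g., US-ASCII) cannot represent a character: emit a
numeric character reference instead. The sketch below is not the
Xerces-C++ implementation; escape_for_ascii() is a hypothetical name,
and a real serializer also handles quotes in attribute values and
other cases omitted here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Turn UTF-16 text into US-ASCII XML character data. '&' and '<' are
// escaped; anything above 0x7F becomes a hexadecimal character
// reference such as &#xE9;.
std::string escape_for_ascii(const std::u16string& in)
{
    std::string out;

    for (std::size_t i = 0; i < in.size(); ++i)
    {
        char32_t cp = in[i];

        // Combine a high/low surrogate pair into one code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF)
        {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        }

        if (cp == U'&')
            out += "&amp;";
        else if (cp == U'<')
            out += "&lt;";
        else if (cp < 0x80)
            out += static_cast<char>(cp);
        else
        {
            char buf[16];
            std::snprintf(buf, sizeof buf, "&#x%lX;",
                          static_cast<unsigned long>(cp));
            out += buf;
        }
    }

    return out;
}
```

The point is that the application never writes these references
itself; it only has to hand over correctly encoded text.
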

> Is this the most effective way to do this if the source/destination
> for the XML are in the same locale and binary preservation is a goal?

I think the simplest way would be to use XSD_USE_LCP and set the
local code page to US-ASCII. But the availability of this approach
depends on the OS(es) you are targeting. Otherwise you will need
to make sure your binary data is properly UTF-8-encoded (i.e.,
each byte value above 0x7F is replaced with the corresponding
two-byte UTF-8 sequence).
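
A minimal sketch of that second option, assuming each input byte is
treated as the code point with the same value (which is what ISO-8859-1
does). The names bytes_to_utf8() and utf8_to_bytes() are made up for
this example. Note that it does not deal with control characters below
0x20, most of which XML 1.0 does not allow at all.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Encode arbitrary 8-bit data as UTF-8: bytes 0x00-0x7F are copied,
// bytes 0x80-0xFF become two-byte sequences.
std::string bytes_to_utf8(const std::string& in)
{
    std::string out;
    out.reserve(in.size() * 2);

    for (char ch : in)
    {
        const unsigned char c = static_cast<unsigned char>(ch);
        if (c < 0x80)
            out += static_cast<char>(c);
        else
        {
            out += static_cast<char>(0xC0 | (c >> 6));   // lead byte
            out += static_cast<char>(0x80 | (c & 0x3F)); // continuation
        }
    }

    return out;
}

// Inverse of bytes_to_utf8(). Throws if the text contains anything
// that bytes_to_utf8() could not have produced.
std::string utf8_to_bytes(const std::string& in)
{
    std::string out;

    for (std::size_t i = 0; i < in.size(); ++i)
    {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        if (c < 0x80)
            out += static_cast<char>(c);
        else if ((c == 0xC2 || c == 0xC3) && i + 1 < in.size() &&
                 (static_cast<unsigned char>(in[i + 1]) & 0xC0) == 0x80)
        {
            const unsigned char next =
                static_cast<unsigned char>(in[++i]);
            out += static_cast<char>(((c & 0x03) << 6) | (next & 0x3F));
        }
        else
            throw std::invalid_argument("not an 8-bit value");
    }

    return out;
}
```

Applying bytes_to_utf8() before storing data in the object model and
utf8_to_bytes() after reading it back gives a byte-exact round trip
for values the XML version in use can carry.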

 
> ** 1 The IANA recommendation is to use 'US-ASCII'. And this seems to
>    be popular with other XML decoders.  Where is 'USASCII' from?  This
>    seems to be a de facto encoding name and is not specified by W3C.

Xerces-C++ supports both US-ASCII and USASCII. I am not sure why they
use USASCII in their documentation. I have updated the XSD documentation
to use US-ASCII. Thanks for letting me know.


Boris



