[xsd-users] customising XSD to use ICU

Bradley Beddoes beddoes at intient.com
Mon Jan 8 04:23:18 EST 2007


Hi Boris,
Thanks for the information. I will poke around a bit more and see what
I can do before sending in any code examples. What you have said below
about enums is the answer I was expecting; I just wanted to check. I
feel I should be able to make it work the way I need, even if it takes
a little hand customization.

Some comments below:

Boris Kolpackov wrote:
> Hi Bradley,
> 
> Some quick notes before you send the code.
> 
> Bradley Beddoes <beddoes at intient.com> writes:
> 
>> At the moment I am attempting to customize XSD generated output to take
>> advantage of the ICU library from (http://icu.sourceforge.net) to ensure
>> we have true unicode support cross platform and aren't relying on the
>> terrible wchar_t.
> 
> What is so terrible about wchar_t? It is true it can be 2 bytes long
> (e.g., Windows) or 4 bytes (most UNIXes). XSD detects the size of
> wchar_t and uses UTF-16 for 2-byte wchar_t and UTF-32/UCS-4 for
> 4-byte ones. If you don't search/test for characters outside the
> Basic Multilingual Plane (that is, characters that need a 4-byte
> surrogate pair in UTF-16) then you should be fine. You can also write
> a small wrapper for ICU if you do need to work with chars outside of
> the Basic Multilingual Plane.

There is a possibility that in the future we'll need to look outside
the BMP. It's a minor one, but it is there. More important is the fact
that we need to use UTF-16 on every platform we run on and be able to
reliably perform a bunch of string manipulation and regex operations on
those systems. ICU provides a very decent library for doing this, along
with some extensions which are present in Boost.
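
For reference, a minimal sketch of why the BMP matters when counting or
searching in UTF-16. It uses only the standard library; the helper names
(is_high_surrogate, is_low_surrogate, count_code_points) are made up for
illustration and are not part of XSD or ICU.

```cpp
#include <cstddef>
#include <string>

// True if the UTF-16 code unit is the first (high) half of a surrogate pair.
inline bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

// True if the UTF-16 code unit is the second (low) half of a surrogate pair.
inline bool is_low_surrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Count code points rather than code units. A well-formed surrogate pair
// counts once; an unpaired surrogate is counted as one code point.
inline std::size_t count_code_points(const std::u16string& s) {
  std::size_t n = 0;
  for (std::size_t i = 0; i < s.size(); ++i, ++n) {
    if (is_high_surrogate(s[i]) && i + 1 < s.size() &&
        is_low_surrogate(s[i + 1]))
      ++i;  // skip the low half of the pair
  }
  return n;
}
```

For text that stays inside the BMP the two counts agree, which is the
case Boris describes as fine. They only diverge once a character such as
U+1D11E appears, since it occupies two code units.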

There is a pretty good list of stuff I am looking to avoid here: 
http://icu.sourceforge.net/userguide/posix.html

> 
> Another alternative would be to use char with UTF-8 encoding.
> 
> 
>> In particular at the moment I am redefining xsd:string to be represented
>> by UnicodeString (
>> http://icu.sourceforge.net/apiref/icu4c/classUnicodeString.html ), I may
>> look at UDate amongst others as well.
> 
> This is not going to be easy. The XSD runtime and generated code assume
> an std::basic_string-based string and use string literals (e.g., "foo",
> L"foo"). The best you could probably do is to customize all (or most)
> of the user-visible API to use ICU UnicodeString but still use a char
> (UTF-8) or wchar_t (UTF-16/32) based encoding in the runtime, which may
> not be too bad actually. You will probably need to derive from
> UnicodeString and provide some constructors to allow implicit
> construction from std::basic_string and string literals.

Yes, this is about where I have gotten to right now with pretty much
everything, and it seems to be working OK so far, though I have yet to
perform extensive multilingual checks, which will probably take some time.
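
A rough sketch of the "derive and add implicit constructors" idea, so
that values arriving from a wchar_t-based runtime convert without casts.
To keep it compilable without ICU, std::u16string stands in for
UnicodeString here; that substitution, the class name ustring, and the
helper append_wide are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <string>

// Stand-in for icu::UnicodeString so the sketch builds without ICU.
// Real code would derive from UnicodeString instead (assumption).
typedef std::u16string utf16_base;

// Hypothetical wrapper with non-explicit constructors from wide strings
// and wide literals, converting to UTF-16 on the way in.
class ustring : public utf16_base {
public:
  ustring() {}
  ustring(const wchar_t* s) {
    append_wide(s, std::char_traits<wchar_t>::length(s));
  }
  ustring(const std::wstring& s) { append_wide(s.data(), s.size()); }

private:
  // With a 2-byte wchar_t every unit is already UTF-16 and is copied
  // as-is. With a 4-byte wchar_t, code points above the BMP are split
  // into a surrogate pair.
  void append_wide(const wchar_t* s, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
      char32_t c = static_cast<char32_t>(s[i]);
      if (c < 0x10000) {
        push_back(static_cast<char16_t>(c));
      } else {
        c -= 0x10000;
        push_back(static_cast<char16_t>(0xD800 + (c >> 10)));
        push_back(static_cast<char16_t>(0xDC00 + (c & 0x3FF)));
      }
    }
  }
};
```

With this in place, "ustring s = L"foo";" and passing a std::wstring
where a ustring is expected both compile. A char (UTF-8) runtime would
need an equivalent pair of constructors with a UTF-8 decoder, which is
left out here.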

> 
> Another alternative would be to use std::basic_string with ICU Unicode
> character type by using --char-type option (you will need to specialize
> std::char_traits for this type). There could still be issues with
> character literals though.

I was looking to test this as well.
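
A sketch of what the traits for a 16-bit character type could look like.
Two assumptions: std::uint16_t stands in for ICU's code unit type, and
the traits are written as a separate class (uchar_traits, a made-up
name) passed explicitly to std::basic_string. Code that names only
std::basic_string<C> would instead need the same members in a
specialization of std::char_traits, as described above.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <cwchar>  // std::mbstate_t
#include <ios>     // std::streamoff, std::streampos
#include <string>

typedef std::uint16_t uchar;  // stand-in for a UTF-16 code unit type

struct uchar_traits {
  typedef uchar char_type;
  typedef std::uint_least32_t int_type;
  typedef std::streamoff off_type;
  typedef std::streampos pos_type;
  typedef std::mbstate_t state_type;

  static void assign(char_type& d, const char_type& s) { d = s; }
  static bool eq(char_type a, char_type b) { return a == b; }
  static bool lt(char_type a, char_type b) { return a < b; }

  // Lexicographic comparison by code unit value.
  static int compare(const char_type* a, const char_type* b, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
      if (lt(a[i], b[i])) return -1;
      if (lt(b[i], a[i])) return 1;
    }
    return 0;
  }
  // Length of a zero-terminated sequence.
  static std::size_t length(const char_type* s) {
    std::size_t n = 0;
    while (s[n] != 0) ++n;
    return n;
  }
  static const char_type* find(const char_type* s, std::size_t n,
                               const char_type& c) {
    for (std::size_t i = 0; i < n; ++i)
      if (eq(s[i], c)) return s + i;
    return 0;
  }
  static char_type* move(char_type* d, const char_type* s, std::size_t n) {
    if (n != 0) std::memmove(d, s, n * sizeof(char_type));
    return d;
  }
  static char_type* copy(char_type* d, const char_type* s, std::size_t n) {
    if (n != 0) std::memcpy(d, s, n * sizeof(char_type));
    return d;
  }
  static char_type* assign(char_type* d, std::size_t n, char_type c) {
    for (std::size_t i = 0; i < n; ++i) d[i] = c;
    return d;
  }
  static int_type to_int_type(char_type c) { return c; }
  static char_type to_char_type(int_type i) {
    return static_cast<char_type>(i);
  }
  static bool eq_int_type(int_type a, int_type b) { return a == b; }
  static int_type eof() { return static_cast<int_type>(-1); }
  static int_type not_eof(int_type i) {
    return eq_int_type(i, eof()) ? 0 : i;
  }
};

typedef std::basic_string<uchar, uchar_traits> ustring16;
```

The literal problem shows up immediately: there is no literal syntax for
a std::uint16_t string, so a value has to be built from an array such as
"const uchar hi[] = {'h', 'i', 0};".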

> 
> 
>> Firstly xml:lang, as it seems simple enough. I can't seem to work out
>> how to customize that type: "--custom-type lang" does not seem to
>> provide anything in my generated header at all. It seems to be
>> generated as a struct internal to, for example, localizedURIType in
>> the below schema.
> 
> Are you using --morph-anonymous option? You will need to compile xml.xsd
> with this option and --custom-type lang. The result will be a forward
> declaration of struct lang; you will have to provide custom implementation.
> 

Excellent, this is what I was after, thanks.

> 
>> With regard to enums, these appear to have the same constructor
>> problem as noted above (use of basic_string<char>). Additionally I
>> seem to be missing a non-equivalence operator and I can't seem to
>> figure out exactly what I need to implement (UnicodeString defines
>> this method, so it's not that).
> 
> The problem with enums is that they use std::basic_string internally
> as well as an array of string literals for enumerators. The only way
> to overcome this (without using one of the strategies outlined above)
> is to completely customize every string-based enum.

I'll look into this in more depth, manually at first, as I only have
about 4 small enums at present, but it may be worthwhile investigating
putting that support into the schema compilation process.
Would you be interested in a patch?
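
As a rough idea of what a fully hand-customized string-based enum might
look like, including the inequality operator, here is a standalone
sketch. The type name colour and its enumerators are invented, and
std::u16string again stands in for UnicodeString so it compiles without
ICU. It does not reproduce the interface XSD generates.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical hand-written replacement for one string-based enum.
class colour {
public:
  enum value { red, green, blue };

  // Implicit, so that "c == colour::red" works through operator== below.
  colour(value v) : v_(v) {}

  // Parse from UTF-16 text; throws if no enumerator matches.
  explicit colour(const std::u16string& s) : v_(parse(s)) {}

  value get() const { return v_; }

  // UTF-16 text of the current enumerator.
  const char16_t* text() const { return literals()[v_]; }

private:
  // Enumerator literals kept as UTF-16 rather than char literals.
  static const char16_t* const* literals() {
    static const char16_t* const table[] = {u"red", u"green", u"blue"};
    return table;
  }

  static value parse(const std::u16string& s) {
    for (std::size_t i = 0; i < 3; ++i)
      if (s == literals()[i]) return static_cast<value>(i);
    throw std::invalid_argument("unknown enumerator");
  }

  value v_;
};

inline bool operator==(const colour& a, const colour& b) {
  return a.get() == b.get();
}

inline bool operator!=(const colour& a, const colour& b) {
  return !(a == b);
}
```

There is deliberately no implicit conversion back to the underlying enum
here: combined with the implicit constructor it would make comparisons
like "c == colour::red" ambiguous.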

regards,
Bradley

> 
> hth,
> -boris
> 


-- 
Bradley Beddoes
Lead Software Architect

Intient - "Open Source, Open Standards"



