[xsd-users] String length

Boris Kolpackov boris at codesynthesis.com
Mon Oct 27 10:26:28 EDT 2008


Hi David,

David Kelvin <dktemp at hotmail.co.uk> writes:

> For example, if the character in the XML file is a normal letter, 
> digit or symbol, then the length is 1.
>  
> However, we quite often use the character '»' ("0xbb" or Alt + 0187) 
> and this returns the length of 2 and v.c_str() is "0xc2bb", although 
> when the output is redirected to a text file (UTF-8), only the single 
> character "»" appears.  If left to print to the screen, it is a 
> different combination.

When the mapping character type is char (currently the only option
with Expat as the underlying parser), all text is passed as UTF-8. UTF-8
is a variable-length encoding which represents each Unicode character
as a sequence of 1 to 4 octets. In your case, the character U+00BB ('»')
is represented as the 2-octet sequence 0xC2 0xBB. The standard C++
string class (std::string) knows nothing about the encoding of the
string it holds and operates in terms of char elements (octets). This
is why length() returns 2.
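
To make the octet counts concrete, here is a minimal sketch of how a
single code point maps to UTF-8. The function name encode_utf8 is
hypothetical (it is not part of XSD or Expat), and the code assumes a
C++11 compiler for char32_t:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not an XSD or Expat API): encodes one Unicode
// code point as a UTF-8 octet sequence.
std::string encode_utf8 (char32_t cp)
{
  // Surrogates and values above U+10FFFF are not valid code points.
  assert (cp <= 0x10FFFF && (cp < 0xD800 || cp > 0xDFFF));

  std::string s;

  if (cp < 0x80)            // 1 octet:  0xxxxxxx
  {
    s += static_cast<char> (cp);
  }
  else if (cp < 0x800)      // 2 octets: 110xxxxx 10xxxxxx
  {
    s += static_cast<char> (0xC0 | (cp >> 6));
    s += static_cast<char> (0x80 | (cp & 0x3F));
  }
  else if (cp < 0x10000)    // 3 octets: 1110xxxx 10xxxxxx 10xxxxxx
  {
    s += static_cast<char> (0xE0 | (cp >> 12));
    s += static_cast<char> (0x80 | ((cp >> 6) & 0x3F));
    s += static_cast<char> (0x80 | (cp & 0x3F));
  }
  else                      // 4 octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  {
    s += static_cast<char> (0xF0 | (cp >> 18));
    s += static_cast<char> (0x80 | ((cp >> 12) & 0x3F));
    s += static_cast<char> (0x80 | ((cp >> 6) & 0x3F));
    s += static_cast<char> (0x80 | (cp & 0x3F));
  }

  return s;
}
```

With this sketch, encode_utf8 (0xBB) yields the two octets 0xC2 0xBB,
so calling length() on the result gives 2, while encode_utf8 ('A')
yields a single octet.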

If you need the string length in terms of characters (Unicode code
points) rather than octets, you will need to either implement such a
function yourself or use a third-party library, for example ICU
(International Components for Unicode):

http://icu-project.org/
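
If you decide to implement it yourself, one possible approach is to
count every octet that is not a UTF-8 continuation octet (10xxxxxx).
The sketch below uses a hypothetical function name, utf8_length, and
assumes the input is well-formed UTF-8 (it performs no validation):

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not an XSD or ICU API): returns the number of
// Unicode code points in a UTF-8 encoded string. Assumes s is
// well-formed UTF-8; no validation is performed.
std::size_t utf8_length (const std::string& s)
{
  std::size_t n = 0;

  for (std::string::size_type i = 0; i < s.size (); ++i)
  {
    unsigned char c = static_cast<unsigned char> (s[i]);

    // Continuation octets have the form 10xxxxxx. Every other octet
    // starts a new code point.
    if ((c & 0xC0) != 0x80)
      ++n;
  }

  return n;
}
```

For a string holding only the octets 0xC2 0xBB, utf8_length returns 1
while length() returns 2. Note that this counts code points, not
user-perceived characters: a base letter followed by a combining
accent is still counted as two. If you need that distinction, a
library such as ICU is the better choice.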

Boris



