[xsd-users] Ignoring unknown elements

Thu Oct 16 05:58:16 EDT 2014

Hi Boris,

I see now that I need to clarify what I described in my previous email.
It seems I already tried to do so in an earlier thread:
http://codesynthesis.com/pipermail/xsd-users/2014-May/004292.html

Following that discussion, I disabled XML validation, and it worked fine for
unknown XML attributes. It did not work for unknown XML elements, which is what
we are discussing now.

Now from the very beginning. We have two sets of schema files: one for Office
Open XML and one for ODF. The Office Open XML specification is bundled with its
XML Schema files, whereas ODF provides only a RELAX NG schema, which we had to
convert to XML Schema. Both schemas are very big and, consequently, XSD
generates a lot of code from them.

Back to the problem. Both specifications allow extensions to the file format,
and it is true that those extension elements will be in a different namespace.
However, preprocessing the XML DOM would require adding that preprocessing code
(implying a higher maintenance cost) and more processing at runtime. :( Adding
wildcards will certainly work, but this will also add more generated code and
probably increase memory consumption for things that we don't need. What we
need instead is to process the elements that we know (i.e., copy data from
those elements to our representation) and ignore everything else.
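
To make the desired behaviour concrete, here is a rough sketch. It is a
simplified model and not actual XSD-generated code: the Element and Paragraph
types, the parse_paragraph() function, the ignore_unknown flag and the
urn:example:doc namespace are all made up for illustration. The strict branch
mimics a parse loop that stops at the first child it does not recognize; the
lenient branch skips that child and keeps going.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a DOM child element (not a Xerces-C++ type).
struct Element {
    std::string ns;    // namespace URI
    std::string name;  // local name
    std::string text;  // text content
};

// Hypothetical in-memory representation holding only the data we care about.
struct Paragraph {
    std::string style;
    std::vector<std::string> runs;
};

// Copy data from known children into Paragraph.
// ignore_unknown == false: stop at the first unknown child (strict behaviour).
// ignore_unknown == true:  skip unknown children and continue with the rest.
Paragraph parse_paragraph(const std::vector<Element>& children,
                          bool ignore_unknown) {
    const std::string known_ns = "urn:example:doc";  // assumed namespace
    Paragraph p;
    for (const Element& e : children) {
        if (e.ns == known_ns && e.name == "style") {
            p.style = e.text;
            continue;
        }
        if (e.ns == known_ns && e.name == "run") {
            p.runs.push_back(e.text);
            continue;
        }
        if (ignore_unknown)
            continue;  // unknown content: ignore it and move on
        break;         // unknown content: stop processing children
    }
    return p;
}
```

With an extension element placed between a known "style" and a known "run",
the strict call loses the run, while the lenient call still picks it up.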

There is another, related problem. When we compile the generated code in debug
mode we obtain two libraries (OOXML and ODF) of comparable size, about 2.5GB
each, which results in long compilation times and dramatically longer link
times. On MacOS the linker crashes (it asserts on some internal value that
exceeds the int32 range), and on Linux the linker requires so much memory that
the only way to link the executable is to do it in 1 job (compilation in, say,
8 jobs is fine). So, you see, we would rather remove some code than add
more. :)

In order to achieve this we are ready to rewrite our XML Schema files (we do
this anyway for ODF) and remove from them the things that we don't currently
support, so that what was in the specification becomes unknown content to the
generated code. We also remove any derivation (by extension or by restriction)
for all complex types with complex content (i.e., we copy the content from the
base type to the derived type). We are fine with this, so the change in XSD
that we originally proposed will work for us and will not break anything.
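
As an illustration of the flattening step, here is a minimal sketch over a toy
schema model. The ComplexType and Schema types and both functions are
hypothetical and are not our actual tooling or any XSD API. The sketch covers
derivation by extension only, where the derived content model is the base
content followed by the local content. For restriction the derived type
already restates its content model, so there one would only drop the base
reference.

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical, simplified description of a complex type with complex content.
struct ComplexType {
    std::string base;                   // empty if the type has no base
    std::vector<std::string> elements;  // locally declared child elements
};

using Schema = std::map<std::string, ComplexType>;

// Full element list of `name`: content of the root-most base first, then each
// derived type's local content in derivation order.
std::vector<std::string> flattened_content(const Schema& schema,
                                           const std::string& name) {
    std::vector<std::string> chain;  // most derived type first
    std::string cur = name;
    while (!cur.empty()) {
        Schema::const_iterator it = schema.find(cur);
        if (it == schema.end())
            throw std::runtime_error("unknown type: " + cur);
        chain.push_back(cur);
        if (chain.size() > schema.size())
            throw std::runtime_error("cyclic derivation at: " + cur);
        cur = it->second.base;
    }
    std::vector<std::string> result;
    for (auto rit = chain.rbegin(); rit != chain.rend(); ++rit) {
        const std::vector<std::string>& els = schema.at(*rit).elements;
        result.insert(result.end(), els.begin(), els.end());
    }
    return result;
}

// Rewrite the schema so that no type refers to a base type any more.
Schema flatten(const Schema& schema) {
    Schema out;
    for (const auto& entry : schema)
        out[entry.first] =
            ComplexType{"", flattened_content(schema, entry.first)};
    return out;
}
```

After flatten() every type carries its complete content and an empty base, so
a code generator working on the result never sees derivation.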

A note on a possible implementation. The mode in which the generated code
ignores unknown elements must be turned on with an XSD command line option, and
this mode is certainly not compatible with XML Schema files in which complex
types with complex content use derivation. I think code generation should fail
in this case.

Thanks,
Vladimir Zykov
Software Engineer
New Cloud Technologies, LLC

On Oct 14, 2014, at 8:38 AM, Boris Kolpackov <boris at codesynthesis.com> wrote:

> Hi Vladimir,
> 
> Vladimir Zykov <vladimir.zykov at ncloudtech.ru> writes:
> 
>> In our problem domain we cannot be sure about validity (in XML sense) of
>> input XML documents.
>> 
>> [...]
>> 
>> Shallow inspection of generated code shows that the problem can be fixed
>> by mere removal of 'break' statement at the end of a for-loop that
>> processes child elements in generated parse() function. Certainly this
>> should be managed by command line option.
> 
> I am not certain about this. You are trying to parse an invalid XML
> instance and this is quite unusual.
> 
> While the "fix" seems simple, the problem is that it will break (at
> runtime) if someone derives from such a type. So the XSD compiler
> would probably need to check that this doesn't happen. But that is
> not always possible (e.g., if the derived type is in another schema).
> So the "fix" becomes a lot less trivial and robust.
> 
> So before we go deeper into this, isn't there a cleaner way to
> achieve the same end result. The two options that I see:
> 
> 1. Add wildcards to your XML Schema types. I remember you told me that
>   you auto-generate the schema so maybe this can be done fairly easily?
> 
> 2. If the "extra" elements have a regular form (e.g., they are in a
>   different namespace), then what you could do is pre-process the
>   DOM representation of your document before handing it off to the
>   XSD-generated code. The idea is to remove all the extra elements
>   during this step.
> 
> Boris