[xsd-users] import, include, namespaces, restriction and schema versioning

Eric Niebler eric at boostpro.com
Fri Aug 21 11:52:58 EDT 2009


Boris Kolpackov wrote:
> Hi Eric,
> 
> Eric Niebler <eric at boostpro.com> writes:
> 
>> Right. Once we detect a missing element on read, we'll programmatically  
>> fill in a default. That will happen in our serialization routines.
> 
> Is it going to be done as a reaction to a validation error during
> serialization or proactively before the validation?

Well, the idea is that the element will be optional, so its absence on 
read will not be a schema violation. But I've already told you that, so 
I feel like I must not be understanding your question. Can you clarify?

In our tool, after we (de-)serialize from XML to DOM, we walk the DOM 
and find the missing elements and fill in defaults. Does that clear it up?

Ideally, we would mark up the xsd and have CodeSynth fill in the 
defaults for us, but that's something else entirely.

>>> If all you need is more strict write validation (there is actually
>>> no support for validation during writing in C++/Tree so you will
>>> need to re-parse the XML to detect any errors)
>> 
>> This surprises me. If I write a schema that enforces that a particular  
>> sequence have 3 or more elements, and I only insert 2 elements into the  
>> sequence, you're saying that this schema violation won't be detected on  
>> write, but only when reading the instance document back in?
> 
> Yes, that's what will happen by default. If you need validation on
> serialization, the only way to get it with C++/Tree is to re-parse
> the resulting document after serialization. For example, serialize
> it into a memory buffer if it is not very big, re-parse it (perhaps
> using SAX2 for speed), and, if everything is ok, write the memory
> buffer to a file, etc.

Ouch. These documents are large (10-100 MB uncompressed XML) and writing 
them is slow. It seems to me that an extra write/read/validate is not 
going to be a satisfactory solution, except as a debugging tool.

> On the surface, validation during serialization often seems like a good
> idea. However, once you start thinking about what to do in case of
> an error, its usefulness becomes questionable, except, maybe, for
> debugging

I disagree, and this is where I see a situation where CodeSynth can 
provide a value-add over plain mapping and validation. Imagine that we 
have optional elements with special CodeSynth markup for the value 
the element should take when it's missing. If CodeSynth did a schema 
validation pass on serialization, this is where the value for missing 
elements could be filled in automatically. That is, CodeSynth can 
proactively correct simple schema violations with a little guidance from 
the xsd author.

> , in which case the re-parsing approach works just fine.
> For more information see the following post. It is about in-memory
> validation but a lot of questions raised there also apply to
> validation during serialization:
> 
> http://www.codesynthesis.com/pipermail/xsd-users/2008-January/001443.html

If the issue is merely one of error detection and reporting, then yes I 
agree that validation on serialization doesn't make much sense except 
for debugging. But if you're willing to consider error correction, then 
it does make sense.

>>> Then you can convert this schema to get a "write version" by adjusting
>>> minOccurs for elements with the writeRequired attribute. In fact, you
>>> don't even need to have two files: you can process this schema on-the-fly
>>> with a simple DOM function, serialize the result to an in-memory buffer
>>> and then load it into a grammar cache to be used by Xerces-C++ for  
>>> validation.
>> 
>> Interesting suggestion! What do you mean by a "simple DOM function"? Is  
>> this something simpler than a full-blown XSLT transform?
> 
> By simple DOM function I mean a function in your program that will load
> the schema as a DOM tree, find all the elements with the writeRequired
> attribute, change their minOccurs to 1, and serialize the modified DOM
> tree into a memory buffer. This memory buffer can then be passed
> directly to loadGrammar().
> 
> Full-blown XSLT will also work and is probably simpler and quicker
> to implement (especially if you have multiple schema files connected
> via include/import). But the DOM approach is tidier since you don't
> need to carry two sets of schemas with your application.

I'm confused. If I use XSLT to process the schema, then I can still ship 
only one set of schemas, right?

>> I looked into default values for schema elements but support in XML  
>> Schema is very weak. The element type must be primitive or a simpleType,  
>> IIRC. And the behavior of CodeSynth XSD wasn't appropriate. Empty  
>> elements are handled differently than missing elements, which differs  
>> from how defaulted attributes are handled. I don't understand the  
>> reasoning here, but it seems to make default values for elements useless  
>> for versioning. Please correct me if I'm wrong.
> 
> Default elements in XML Schema are a misnomer. The spec requires the
> empty element to be present in the XML instance in order for it to
> have the default value. This makes default elements practically
> unusable.

That's the conclusion I reached. I see this as an opportunity for CodeSynth.

>> In-tool support for versioning would rock. Consider this a +1 for an  
>> xse:writeRequired attribute that CodeSynth XSD recognizes and does  
>> something sensible with.
> 
> Yes, I agree. We just need to figure out what it is that we can do 
> that is sensible ;-).

So maybe xse:writeRequired isn't quite what I want. I'd like a way to 
provide default values for optional elements in a way that they are 
filled in automatically on read; and on write, either (a) fill them in 
automatically, or (b) flag their absence as an error.

This would address the large majority of our versioning scenarios.
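
To make that concrete, here is a minimal sketch of the behavior I have in 
mind, written against plain C++ rather than anything XSD generates today. 
The optional_element container, the measurement type, its units element, 
the "mm" default, and the write_policy switch are all made up for 
illustration; in the real thing the default would come from markup in the 
xsd.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a generated optional-element container.
template <typename T>
class optional_element {
public:
  optional_element() : present_(false), value_() {}

  bool present() const { return present_; }
  const T& get() const { assert(present_); return value_; }
  void set(const T& v) { value_ = v; present_ = true; }
  void reset() { present_ = false; }

private:
  bool present_;
  T value_;
};

// Hypothetical object model: "units" was added in a later schema version,
// so older instance documents do not contain it.
struct measurement {
  double value;
  optional_element<std::string> units;

  measurement() : value(0.0) {}
};

// What to do on write when an element that has a default is still absent.
enum write_policy { fill_on_write, error_on_write };

// Read side: call after parsing to supply the default for a missing element.
inline void fill_defaults(measurement& m) {
  if (!m.units.present())
    m.units.set("mm");  // assumed default, known at schema compilation time
}

// Write side: call just before serialization.
inline void prepare_for_write(measurement& m, write_policy policy) {
  if (m.units.present())
    return;
  if (policy == fill_on_write)
    fill_defaults(m);  // option (a): fill in automatically
  else
    throw std::runtime_error(  // option (b): flag the absence as an error
        "measurement: 'units' must be set before writing");
}
```

With fill_on_write, an object built from an old document is upgraded 
silently; with error_on_write, a forgotten element is caught before any 
XML is produced, so no re-parse is needed for this particular check.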

>> Yes, it's a complicated problem. In our tool, we've identified several  
>> versioning scenarios. Far and away the most common versioning scenario  
>> is the one I described above: adding a new element to an existing schema  
>> type. That's the narrow problem I'm currently trying to address.
> 
> I think that's the only scenario that could be practically addressed
> or helped by the tool. For example, for elements that are optional
> but marked as required during serialization we could generate an
> interface that is something between optional and required. That is,
> the user can still query whether the element is present or not but
> it cannot set it to the "not present" state. 

Right.

> For such elements we
> could also require initialization during construction. 

Ah, but we are using the --generate-default-ctor option, so that 
wouldn't help us.

> In other
> words, it could be just like a required element except it may not
> be set during parsing and there is a way to detect this situation.
> 
> We could also implement a check during serialization to make sure
> these elements are actually specified.

Yes. Or to fill in a default.
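
As I understand the proposal, the in-between interface could look roughly 
like the sketch below. write_required and get_for_write are hypothetical 
names, not part of the XSD runtime; the point is only that the element 
can start out absent after parsing, can be queried, can be set, but 
offers no way to go back to "not present".

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical container for an element that is optional on read
// but must be present on write.
template <typename T>
class write_required {
public:
  // Default-constructed state models "not found during parsing".
  write_required() : present_(false), value_() {}

  // The user can still detect that parsing did not supply the element.
  bool present() const { return present_; }

  // Precondition: present().
  const T& get() const { assert(present_); return value_; }

  void set(const T& v) { value_ = v; present_ = true; }

  // Deliberately no reset(): once set, the element stays present.

  // Serialization-time check: throws if the element was never set.
  const T& get_for_write(const char* element_name) const {
    if (!present_)
      throw std::logic_error(std::string(element_name) +
                             ": required on write but not set");
    return value_;
  }

private:
  bool present_;
  T value_;
};
```

A serializer would call get_for_write() where it calls get() today, which 
gives the check during serialization without a full validation pass.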

> But I feel that it is only a part of the solution. The user of the
> mapping still has to detect the missing elements and provide some
> default values manually. 

We're already doing that in our tool, so that's ok.

> I am wondering if there is a way to maybe
> automate it somehow.

Me too!

> Are the default values that you assign to missing elements known
> during schema compilation or are they computed based on the other
> parts of the document at runtime?

For the most part, they would be known during schema compilation. In the 
cases where they must be computed from other parts of the document, we 
can fall back to doing that part ourselves. No big deal.

-- 
Eric Niebler
BoostPro Computing
http://www.boostpro.com



