[xsd-users] Large XSD-schema, speed and identity constraint validation
Stefan de Konink
stefan at konink.de
Tue May 12 10:41:17 EDT 2020
Dear Boris,
Thanks for your in-depth reply.
On Tuesday, May 12, 2020 1:11:57 PM CEST, Boris Kolpackov wrote:
> Based on your reference to identity constraints further down in your
> email I am going to assume that by "constraint validation" above you
> mean "identity constraint validation".
Correct.
> Over all these years, I can't remember seeing many cases where this was
> an issue. Which probably explains why there is no better/faster way.
I don't mind getting my hands dirty on this subject.
> Also, keep in mind that CodeSynthesis XSD delegates XML Schema validation,
> including identity constraint validation, to Xerces-C++.
Does this mean in practice that, if I only care about XSD validation,
there is no net benefit in using the XSD toolset, because the generated
code is not used to build a schema-specific parser that is employed
during XSD validation? I am thinking in the direction of the XML
Screamer research.
> There is no way to get a single header/source set from multiple schema
> files. And, for a large schema, you probably wouldn't want to, since you
> may not be able to compile the resulting source file (yes, we have run
> into this and have a mechanism to split single source file into multiple
> parts; see the --parts* options).
>
> But seeing that you depend on GML 3.2, file-per-type is probably your
> only option (you could try to compile your schemas in the file-per-schema
> mode to minimize the number of files, but getting that to work is more
> of an art than science). See these release notes for background on this
> mode:
Understood. As you may have seen, the "art" of designing light XSDs that
define only a single profile (where the net effect is that a validator
would complain about extra elements) is something that could greatly
improve the performance of the validator. Obviously this is expected
behavior, but not many XSD tools support cutting the unused bloat in a
consistent manner, meaning that a smaller XSD typically has to be designed
bottom-up again.
>> real 17m6.611s
>> user 17m1.399s
>> sys 0m3.917s
>>
>> real 5m21.199s
>> user 5m19.587s
>> sys 0m1.450s
>
> I am confused, what are these two results for? Hot vs cold?
Same machine, same data, same code, multiple runs; these are the min and max.
From my benchmarking background I would consider them both cold. I cannot
explain (other than by hardware reasons; I tested on a laptop with a Ryzen
2500U) why the results show such huge outliers for both libxml2 and
Xerces-C++. I cannot rule out the initial loading (I/O) of the XSD schema
either.
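To put a number on the spread between the two runs quoted above, the `real` values can be converted to seconds and divided. This is only a minimal sketch; `parse_time_seconds` and `spread` are hypothetical helper names, and the input is assumed to always have the `<minutes>m<seconds>s` form that the shell's `time` prints.

```cpp
#include <cstddef>
#include <string>

// Convert a duration such as "17m6.611s" into seconds.
// Hypothetical helper; assumes the "<minutes>m<seconds>s" form.
double parse_time_seconds(const std::string& text) {
  const std::size_t m = text.find('m');
  // std::stod stops at the first character that cannot be part of a
  // number, so the trailing 's' of the seconds field is ignored.
  const double minutes = std::stod(text.substr(0, m));
  const double seconds = std::stod(text.substr(m + 1));
  return minutes * 60.0 + seconds;
}

// Ratio between the slowest and the fastest run.
double spread(const std::string& slowest, const std::string& fastest) {
  return parse_time_seconds(slowest) / parse_time_seconds(fastest);
}
```

With the figures above, `spread("17m6.611s", "5m21.199s")` gives roughly 3.2, i.e. the slowest run took a bit more than three times as long as the fastest one.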
> Overall, if you know that the identity constraint validation is your
> bottleneck, I wouldn't expect pre-loading the schemas to help much.
I am considering creating an alternative identity constraint validation
mechanism. But I would first have to dive into the current mechanism to see
whether a novel approach would actually improve anything, given the lack of
work on the subject in the last 10 years.
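As a starting point for such an experiment, a hash-based index is one possible shape. The sketch below is not how Xerces-C++ or libxml2 implement identity constraints (I have not checked that); it only illustrates the idea. `IdentityIndex` and `KeyRef` are hypothetical names, the field values of a key are assumed to be joined into one opaque string, and a single key set per scope is assumed, whereas a real schema ties each keyref to one specific key.

```cpp
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// One keyref occurrence, remembered until the whole scope has been read.
struct KeyRef {
  std::string constraint;  // e.g. "StopArea_KeyRef"
  std::string value;       // field values joined into one string (assumption)
  long line;               // line on which the reference was seen
};

// Hypothetical index for a single identity constraint scope.
class IdentityIndex {
 public:
  // Returns false when the key value was already present (duplicate key).
  bool add_key(const std::string& value) {
    return keys_.insert(value).second;
  }

  void add_keyref(KeyRef ref) { refs_.push_back(std::move(ref)); }

  // To be called at the end of the scope: returns every keyref whose
  // value has no matching key, keeping its own name, value and line.
  std::vector<KeyRef> unresolved() const {
    std::vector<KeyRef> missing;
    for (const KeyRef& ref : refs_) {
      if (keys_.find(ref.value) == keys_.end()) missing.push_back(ref);
    }
    return missing;
  }

 private:
  std::unordered_set<std::string> keys_;
  std::vector<KeyRef> refs_;
};
```

Keys are inserted as they are parsed, keyrefs are queued, and `unresolved()` does one average constant-time lookup per keyref at the end of the scope. Because each queued keyref keeps its own line and value, a report built from it could name the offending reference rather than only the end tag of the scoping element.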
>> Question 3;
>>
>> One of the other things I noticed is that the Codesynthesis identity
>> constraint validation only reports the location of the end-tag (and
>> therefore an ocean of duplicates), missing the exact location that xmllint
>> does produce, for example: ...
>
> That would most likely require improvements to Xerces-C++.
I think this statement was partially wrong.
Within Xerces-Java the line number does point at the expected tag:
org.xml.sax.SAXParseException; lineNumber: 131; columnNumber: 27; Not
enough values specified for <key name="VehicleType_AnyVersionedKey">
identity constraint specified for element "PublicationDelivery".
The output I get from the XSD validation, and thus probably from Xerces-C++:
/var/tmp/NeTEx_SYNTUS_20200422_New_NDOV-pushed.xml:131:27 error: element
'PublicationDelivery' does not have enough values for identity constraint
key 'VehicleType_AnyVersionedKey'
But specifically the Java version is capable of doing this:
org.xml.sax.SAXParseException; lineNumber: 1499081; columnNumber: 23; Key
'StopArea_KeyRef' with value 'SYNTUS:StopArea:60103,20200422' not found for
identity constraint of element 'PublicationDelivery'.
org.xml.sax.SAXParseException; lineNumber: 1499081; columnNumber: 23; Key
'ScheduledStopPoint_KeyRef' with value
'SYNTUS:ScheduledStoppoint:50203005,20200422' not found for identity
constraint of element 'PublicationDelivery'.
org.xml.sax.SAXParseException; lineNumber: 1499081; columnNumber: 23; Key
'TransportAdministrativeZone_KeyRef' with value
'NL:AdministrativeZone:AL,any' not found for identity constraint of element
'PublicationDelivery'.
org.xml.sax.SAXParseException; lineNumber: 1499081; columnNumber: 23; Key
'Operator_KeyRef' with value 'SYNTUS,20200422' not found for identity
constraint of element 'PublicationDelivery'.
While the C++ version does:
/var/tmp/NeTEx_SYNTUS_20200422_New_NDOV-pushed.xml:1499081:23 error:
identity constraint key for element 'PublicationDelivery' not found
(duplicated: 1196 times)
So I am missing the key/value report, and instead get an ocean of duplicates
whose cause I cannot find out. I'll drop the Xerces-C++ mailing list a line.
--
Stefan