[xsd-users] Large XSD-schema, speed and identity constraint validation
Stefan de Konink
stefan at konink.de
Mon May 11 08:58:42 EDT 2020
Hello,
I am part of the standardisation group that works on NeTEx, a public
transport standard for network and timetable exchange. It is available as
XSD on GitHub <https://github.com/NeTEx-CEN/NeTEx> under a GPL license. I
noticed that RailML is part of the wiki; I hope we can do the same for NeTEx.
One of the main problems we face is the syntax validation of 100MB+ XML
documents with this schema, and especially identity constraint validation.
In practice I am looking for better speed than libxml2/xmllint, and I notice
that many tools, if not all, have a single-threaded performance bottleneck.
I am trying to find a generic way to overcome this, and I am surprised that
it is difficult to find one. Parallel syntax validation using sharding could
work for us, but identity constraint validation needs all parts of the
document, hence I would expect there to be a "better way".
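To illustrate why sharding alone is not enough for identity constraints, here is a minimal C++ sketch of a two-phase check. Each shard reports the key values it defines and the keyref values it uses, and only a final merge step can decide whether a reference resolves. The names `ShardIds`, `IdentityReport` and `check_identity` are hypothetical, the shard contents would in reality come from a parser, and the sketch assumes a single key space and ignores the XSD scoping rules.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical per-shard summary. Collecting it needs no knowledge of the
// other shards, so this part could run on one thread per shard.
struct ShardIds {
  std::vector<std::string> keys;  // values declared by key fields
  std::vector<std::string> refs;  // values used by keyref fields
};

struct IdentityReport {
  std::vector<std::string> duplicate_keys;
  std::vector<std::string> unresolved_refs;
};

// Serial merge step: this is the part that needs the whole document.
IdentityReport check_identity(const std::vector<ShardIds>& shards) {
  IdentityReport report;
  std::unordered_set<std::string> all_keys;

  // First gather every key, flagging values that occur more than once.
  for (const auto& shard : shards)
    for (const auto& key : shard.keys)
      if (!all_keys.insert(key).second)
        report.duplicate_keys.push_back(key);

  // Only now can a keyref be resolved: its key may live in another shard.
  for (const auto& shard : shards)
    for (const auto& ref : shard.refs)
      if (all_keys.count(ref) == 0)
        report.unresolved_refs.push_back(ref);

  return report;
}
```

A reference in one shard to a key in another shard resolves only in the merge step, which is the reason a shard cannot be validated for identity constraints in isolation.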
Question 1;
My first question concerns the C++ code generation. I am currently using
the following command to generate the XML interface:
xsdcxx cxx-tree --file-per-type --generate-polymorphic --generate-wildcard \
  --namespace-map "http://www.opengis.net/gml/3.2=gml" \
  /home/skinkie/Sources/NeTEx/xsd/NeTEx_publication.xsd
How can I generate a single file instead of 4011 individual files? When I
omit --file-per-type I don't get all the types in the single file.
In order to run the above command successfully, our schema had to be
modified. I assume the root cause lies in duplicated QNames, which should
obviously be investigated. I also noticed some compilation errors in the
generated code involving "any". I might file a bug report later.
Question 2;
When I compare the cold performance of the following code, the mileage
may vary. I would say the performance is similar to Xerces in Java, which
makes me wonder whether the 'hot' performance would be much better, or
whether I am trying to do something that is not optimised even with the
generated C++ code. I am aware of the performance example in the source
code, which could preload the schema once and validate against it many
times.
xml_schema::properties props;
props.schema_location (
  "http://www.netex.org.uk/netex",
  "file:///home/skinkie/Sources/NeTEx/xsd/NeTEx_publication.xsd");
netex::PublicationDelivery (
  "/var/tmp/NeTEx_SYNTUS_20200422_New_NDOV.xml", 0, props);
This process takes:
time ./test 1>/tmp/xsd.txt 2>&1
real 17m6.611s
user 17m1.399s
sys 0m3.917s
real 5m21.199s
user 5m19.587s
sys 0m1.450s
As opposed to:
time xmllint --noout --schema \
  /home/skinkie/Sources/NeTEx/xsd/NeTEx_publication.xsd \
  /var/tmp/NeTEx_SYNTUS_20200422_New_NDOV.xml 1>/tmp/xmllint.txt 2>&1
real 18m13.272s
user 18m8.838s
sys 0m3.097s
real 9m3.236s
user 9m1.706s
sys 0m1.259s
If I change the XSD to a much lighter version, tailored to the information
profile we exchange, validation completes within 8 seconds. Such an XSD can
be found here: <https://github.com/BISONNL/NeTEx-NL/tree/master/xsd>
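To separate cold from hot performance when measuring, the parse call could be wrapped in a small timing harness like the one below. This is only a sketch: `Timings` and `time_cold_and_hot` are hypothetical names, and `work` stands in for whatever call is being measured (for example the parse above, ideally with the schema already loaded outside of it).

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <vector>

struct Timings {
  double cold_seconds;  // first run, includes any one-time setup cost
  double hot_seconds;   // mean of all later runs
};

// Runs `work` `runs` times and reports the first run separately from the rest.
Timings time_cold_and_hot(const std::function<void()>& work, int runs) {
  assert(runs >= 2);  // need at least one cold and one hot run
  using clock = std::chrono::steady_clock;

  std::vector<double> seconds;
  for (int i = 0; i < runs; ++i) {
    const auto start = clock::now();
    work();
    const std::chrono::duration<double> elapsed = clock::now() - start;
    seconds.push_back(elapsed.count());
  }

  double hot_total = 0.0;
  for (int i = 1; i < runs; ++i) hot_total += seconds[i];
  return Timings{seconds[0], hot_total / (runs - 1)};
}
```

If the hot figure turns out close to the cold one, schema loading is probably not the dominant cost and the time is spent in validation itself.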
Question 3;
One of the other things I noticed is that the CodeSynthesis identity
constraint validation only reports the location of the end tag (and
therefore an ocean of duplicates), missing the exact location that xmllint
does produce. For example (CodeSynthesis XSD first, then xmllint):
/var/tmp/NeTEx_SYNTUS_20200422_New_NDOV.xml:1817356:42 error: element
'PublicationDelivery' does not have enough values for identity constraint
key 'Journey_AnyVersionedKey'
/var/tmp/NeTEx_SYNTUS_20200422_New_NDOV.xml:13495: Schemas validity error :
Element '{http://www.netex.org.uk/netex}FromPointRef': No match found for
key-sequence ['SYNTUS:RoutePoint:30000018'] of keyref
'{http://www.netex.org.uk/netex}FromPointRef'.
Is there a way to get the invalid line, with the correct type?
I already found this discussion:
<https://lists.w3.org/Archives/Public/xmlschema-dev/2007May/0020.html>
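As a stopgap for the ocean of duplicates, the diagnostics could be collapsed after the fact. The sketch below is a plain post-processing step and not part of any validator API; `collapse_duplicates` is a hypothetical helper that assumes each diagnostic arrives as one complete line of text.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Collapses identical diagnostic lines into (line, count) pairs,
// keeping the order in which each distinct line was first seen.
std::vector<std::pair<std::string, int>>
collapse_duplicates(const std::vector<std::string>& lines) {
  std::vector<std::pair<std::string, int>> out;
  std::unordered_map<std::string, std::size_t> index;  // line -> slot in out

  for (const auto& line : lines) {
    const auto it = index.find(line);
    if (it == index.end()) {
      index.emplace(line, out.size());
      out.emplace_back(line, 1);
    } else {
      ++out[it->second].second;
    }
  }
  return out;
}
```

This only shortens the report; it does not recover the location of the offending element, which is what the question above is really about.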
--
Stefan