[xsd-users] Large XSD-schema, speed and identity constraint validation
Boris Kolpackov
boris at codesynthesis.com
Tue May 12 07:11:57 EDT 2020
Stefan de Konink <stefan at konink.de> writes:
> One of the main problems that we face is the syntax validation of 100MB+
> XML-document with this schema, but especially: constraint validation.
> Practically I am looking for a better than libxml2/xmllint speed, where I
> notice that many - if not all - tools have a direct single threaded
> performance bottleneck. I am trying to find a generic form to overcome this,
> I am surprised that it is difficult to find one. Practically parallel syntax
> validation using sharding could work for us, but identity constraint
> validation needs all parts of the document, hence I would expect a "better
> way".
Based on your reference to identity constraints further down in your
email I am going to assume that by "constraint validation" above you
mean "identity constraint validation".
Over all these years, I can't remember seeing many cases where this was
an issue, which probably explains why there is no better/faster way.
Also, keep in mind that CodeSynthesis XSD delegates XML Schema validation,
including identity constraint validation, to Xerces-C++.
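To illustrate why sharding helps with structural validation but not directly with identity constraints, here is a minimal sketch of a two-phase uniqueness check. It is not how Xerces-C++ or CodeSynthesis XSD implement validation; the names Shard, ShardResult, scan_shard, and find_duplicate_keys are hypothetical, and a shard is reduced to the list of key values already extracted from it. Phase 1 depends on a single shard only, so a caller could run it on separate threads; the sketch calls it in a plain loop for simplicity. Phase 2 needs the results from every shard, which is the part that cannot be sharded away.

```cpp
#include <algorithm>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for one shard of a document: the key values
// (for example, those selected by an xs:key or xs:unique field) found in it.
using Shard = std::vector<std::string>;

struct ShardResult
{
  std::unordered_set<std::string> keys;  // distinct keys seen in the shard
  std::vector<std::string> duplicates;   // keys repeated within the shard
};

// Phase 1: scan one shard in isolation. This touches no shared state,
// so calls for different shards are independent of each other.
ShardResult scan_shard (const Shard& shard)
{
  ShardResult r;
  for (const std::string& k : shard)
    if (!r.keys.insert (k).second)
      r.duplicates.push_back (k);
  return r;
}

// Phase 2: merge the per-shard key sets. A key that is unique within
// every shard can still collide across shards, so this step needs the
// results from all parts of the document.
// Returns the sorted list of distinct key values that occur more than once.
std::vector<std::string> find_duplicate_keys (const std::vector<Shard>& shards)
{
  std::unordered_set<std::string> seen;
  std::vector<std::string> dups;

  for (const Shard& s : shards)
  {
    ShardResult r (scan_shard (s));

    dups.insert (dups.end (), r.duplicates.begin (), r.duplicates.end ());

    for (const std::string& k : r.keys)
      if (!seen.insert (k).second)
        dups.push_back (k);
  }

  std::sort (dups.begin (), dups.end ());
  dups.erase (std::unique (dups.begin (), dups.end ()), dups.end ());
  return dups;
}
```

In this sketch the merge cost grows with the number of distinct keys rather than with document size, but it remains a sequential step over data from the whole document.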
> Question 1;
>
> My first question is concerning the c++ code generation. I am currently
> using the following command for generating the XML interface. How can I
> generate a single file instead of 4011 individual files? When I omit
> --file-per-type I don't get all the types in the single file.
There is no way to get a single header/source set from multiple schema
files. And, for a large schema, you probably wouldn't want to, since you
may not be able to compile the resulting source file (yes, we have run
into this and have a mechanism to split a single source file into multiple
parts; see the --parts* options).
But seeing that you depend on GML 3.2, file-per-type is probably your
only option (you could try to compile your schemas in the file-per-schema
mode to minimize the number of files, but getting that to work is more
of an art than a science). See these release notes for background on this
mode:
http://www.codesynthesis.com/~boris/blog/2008/02/13/codesynthesis-xsd-3-1-0-released/
> Question 2;
>
> When I am comparing the cold performance of the following code, the mileage
> may vary. I would state the performance is similar to Xerces in Java. Which
> makes me wonder if the 'hot' performance would be much better? Or that I am
> trying to do something that even with the generated C++ code is not
> optimised. I am aware of the performance example in the source code, that
> could preload the schema once and run from it many times.
>
> [...]
>
> real 17m6.611s
> user 17m1.399s
> sys 0m3.917s
>
> real 5m21.199s
> user 5m19.587s
> sys 0m1.450s
I am confused, what are these two results for? Hot vs cold?
Overall, if you know that the identity constraint validation is your
bottleneck, I wouldn't expect pre-loading the schemas to help much.
> Question 3;
>
> One of the other things I noticed is that the Codesynthesis identity
> constraint validation only reports the location of the end-tag (and
> therefore an ocean of duplicates), missing the exact location that xmllint
> does produce, for example:
>
> [...]
>
> Is there a way to get the invalid line, with the correct type?
That would most likely require improvements to Xerces-C++.