<?xml version="1.0" encoding="UTF-8"?><!-- generator="wordpress/2.2.2" -->
<rss version="2.0" 
	xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
	<title>Comments on: Are you using a real XML parser</title>
	<link>https://codesynthesis.com/~boris/blog//2008/05/19/real-xml-parser/</link>
	<description>Boris Kolpackov's blog about software</description>
	<pubDate>Sat, 28 Oct 2023 15:53:46 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.2.2</generator>

	<item>
		<title>By: Lars D</title>
		<link>https://codesynthesis.com/~boris/blog//2008/05/19/real-xml-parser/#comment-450</link>
		<author>Lars D</author>
		<pubDate>Tue, 27 May 2008 13:35:23 +0000</pubDate>
		<guid>https://codesynthesis.com/~boris/blog//2008/05/19/real-xml-parser/#comment-450</guid>
		<description>Google turns up a few examples; here is one from Apple, which strongly recommends against using CDATA:

http://www.apple.com/dk/itunes/store/podcaststechspecs.html

However, most of the examples I have seen are unpublished standards used internally between vendors in large organizations. They're basically open, but only relevant to two or three companies and therefore not published.</description>
		<content:encoded><![CDATA[<p>Google turns up a few examples; here is one from Apple, which strongly recommends against using CDATA:</p>
<p><a href="http://www.apple.com/dk/itunes/store/podcaststechspecs.html" rel="nofollow">http://www.apple.com/dk/itunes/store/podcaststechspecs.html</a></p>
<p>However, most of the examples I have seen are unpublished standards used internally between vendors in large organizations. They&#8217;re basically open, but only relevant to two or three companies and therefore not published.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Boris Kolpackov</title>
		<link>https://codesynthesis.com/~boris/blog//2008/05/19/real-xml-parser/#comment-449</link>
		<author>Boris Kolpackov</author>
		<pubDate>Thu, 22 May 2008 19:59:49 +0000</pubDate>
		<guid>https://codesynthesis.com/~boris/blog//2008/05/19/real-xml-parser/#comment-449</guid>
		<description>Lars, are any of these cases perhaps open standards that you can refer me to? I truly have never seen anybody restricting the physical XML representation mechanisms offered by the XML spec.</description>
		<content:encoded><![CDATA[<p>Lars, are any of these cases perhaps open standards that you can refer me to? I truly have never seen anybody restricting the physical XML representation mechanisms offered by the XML spec.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Lars D</title>
		<link>https://codesynthesis.com/~boris/blog//2008/05/19/real-xml-parser/#comment-448</link>
		<author>Lars D</author>
		<pubDate>Thu, 22 May 2008 19:20:22 +0000</pubDate>
		<guid>https://codesynthesis.com/~boris/blog//2008/05/19/real-xml-parser/#comment-448</guid>
		<description>I have seen such restrictions many times, from many different organizations.</description>
		<content:encoded><![CDATA[<p>I have seen such restrictions many times, from many different organizations.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Boris Kolpackov</title>
		<link>https://codesynthesis.com/~boris/blog//2008/05/19/real-xml-parser/#comment-447</link>
		<author>Boris Kolpackov</author>
		<pubDate>Thu, 22 May 2008 05:13:36 +0000</pubDate>
		<guid>https://codesynthesis.com/~boris/blog//2008/05/19/real-xml-parser/#comment-447</guid>
		<description>Lars, the mechanisms I described (like entity references, CDATA sections, DTD) are available to any XML vocabulary. They are there to allow you to efficiently encode your data in XML.

I have never seen an XML vocabulary specification that says something like "You cannot use entity references." That's also the reason why vocabulary specification languages (e.g., DTD, XML Schema) don't have any mechanisms to control the use of these features.</description>
		<content:encoded><![CDATA[<p>Lars, the mechanisms I described (like entity references, CDATA sections, DTD) are available to any XML vocabulary. They are there to allow you to efficiently encode your data in XML.</p>
<p>I have never seen an XML vocabulary specification that says something like &#8220;You cannot use entity references.&#8221; That&#8217;s also the reason why vocabulary specification languages (e.g., DTD, XML Schema) don&#8217;t have any mechanisms to control the use of these features.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Lars D</title>
		<link>https://codesynthesis.com/~boris/blog//2008/05/19/real-xml-parser/#comment-445</link>
		<author>Lars D</author>
		<pubDate>Wed, 21 May 2008 20:21:49 +0000</pubDate>
		<guid>https://codesynthesis.com/~boris/blog//2008/05/19/real-xml-parser/#comment-445</guid>
		<description>It doesn't make sense to make an application that can parse any XML file, unless it's a generic XML tool. It makes much more sense to specify that you use XML, and then some more information. For instance, you could specify that you're using Google's KML format.

In the same way, many business applications can specify that they're exporting XML data with specific tags and a specific character set. In that case, your application needs to support specific XML tags and a specific character set, and then it makes sense to use an XML parser with certain features, if it performs significantly better than a standards-compliant XML parser.</description>
		<content:encoded><![CDATA[<p>It doesn&#8217;t make sense to make an application that can parse any XML file, unless it&#8217;s a generic XML tool. It makes much more sense to specify that you use XML, and then some more information. For instance, you could specify that you&#8217;re using Google&#8217;s KML format.</p>
<p>In the same way, many business applications can specify that they&#8217;re exporting XML data with specific tags and a specific character set. In that case, your application needs to support specific XML tags and a specific character set, and then it makes sense to use an XML parser with certain features, if it performs significantly better than a standards-compliant XML parser.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: John Snelson</title>
		<link>https://codesynthesis.com/~boris/blog//2008/05/19/real-xml-parser/#comment-443</link>
		<author>John Snelson</author>
		<pubDate>Wed, 21 May 2008 11:03:47 +0000</pubDate>
		<guid>https://codesynthesis.com/~boris/blog//2008/05/19/real-xml-parser/#comment-443</guid>
		<description>When parsing XML documents, the performance bottleneck is usually the part which reads the character content of elements. The slowest operation that a conforming XML parser has to perform in that inner loop (besides the mandatory memory read to get the character) is the check against the "Char" grammar production.

To check the "Char" production using simple if() statements, you need 9 comparisons (2 comparisons for each range, 1 for each single char). A binary search over the grammar production as sorted ranges only requires worst case 4 comparisons. Even so, 4 comparisons in the inner loop is a lot of extra overhead.</description>
		<content:encoded><![CDATA[<p>When parsing XML documents, the performance bottleneck is usually the part which reads the character content of elements. The slowest operation that a conforming XML parser has to perform in that inner loop (besides the mandatory memory read to get the character) is the check against the &#8220;Char&#8221; grammar production.</p>
<p>To check the &#8220;Char&#8221; production using simple if() statements, you need 9 comparisons (2 comparisons for each range, 1 for each single char). A binary search over the grammar production as sorted ranges only requires worst case 4 comparisons. Even so, 4 comparisons in the inner loop is a lot of extra overhead.</p>
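<p>The binary search over sorted ranges mentioned above might look roughly like the following C sketch. It assumes the character has already been decoded to a Unicode code point. The names char_ranges and is_xml_char_bsearch are hypothetical, and merging #x9 and #xA into a single range is an assumption made here for illustration; the exact number of comparisons depends on how the search is laid out.</p>

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One inclusive code point range. */
struct range { uint32_t lo, hi; };

/* The XML 1.0 Char production as sorted, non-overlapping ranges.
   #x9 and #xA are adjacent, so they are merged here (an assumption
   of this sketch, not something the production itself requires). */
static const struct range char_ranges[] = {
    {0x9, 0xA},
    {0xD, 0xD},
    {0x20, 0xD7FF},
    {0xE000, 0xFFFD},
    {0x10000, 0x10FFFF}
};

/* Returns true if c falls inside one of the ranges above. */
static bool is_xml_char_bsearch(uint32_t c)
{
    size_t lo = 0;
    size_t hi = sizeof char_ranges / sizeof char_ranges[0];

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (c < char_ranges[mid].lo)
            hi = mid;          /* c lies below this range */
        else if (c > char_ranges[mid].hi)
            lo = mid + 1;      /* c lies above this range */
        else
            return true;       /* c is inside this range */
    }
    return false;
}
```

<p>With this layout the first probe lands on the #x20-#xD7FF range, so ordinary text is accepted after two comparisons; only rare or invalid characters walk further down the search.</p>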
]]></content:encoded>
	</item>
	<item>
		<title>By: boris</title>
		<link>https://codesynthesis.com/~boris/blog//2008/05/19/real-xml-parser/#comment-442</link>
		<author>boris</author>
		<pubDate>Tue, 20 May 2008 15:55:58 +0000</pubDate>
		<guid>https://codesynthesis.com/~boris/blog//2008/05/19/real-xml-parser/#comment-442</guid>
		<description>Hm, not sure I understand what you mean here. The rule specifies three individual characters and three ranges which should be pretty straightforward and fast to check with simple if() statements. This will probably get a bit slower if the document (and application) encoding is, say, UTF-8 and the parser needs to decode each multi-byte sequence before it can check the rule.

I've seen the NameChar rule implemented as a lookup table, though.</description>
		<content:encoded><![CDATA[<p>Hm, not sure I understand what you mean here. The rule specifies three individual characters and three ranges which should be pretty straightforward and fast to check with simple if() statements. This will probably get a bit slower if the document (and application) encoding is, say, UTF-8 and the parser needs to decode each multi-byte sequence before it can check the rule.</p>
<p>I&#8217;ve seen the NameChar rule implemented as a lookup table, though.</p>
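<p>As a minimal sketch of the if()-based check described above, assuming the input has already been decoded to a Unicode code point: the function name is_xml_char is hypothetical, and testing the common text range first is an illustrative choice rather than something taken from any particular parser.</p>

```c
#include <stdbool.h>
#include <stdint.h>

/* XML 1.0 Char production:
   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] */
static bool is_xml_char(uint32_t c)
{
    /* Ordinary text falls in this range, so test it first. */
    if (c >= 0x20 && c <= 0xD7FF)
        return true;

    /* The three individual characters: tab, line feed, carriage return. */
    if (c == 0x9 || c == 0xA || c == 0xD)
        return true;

    /* The two remaining ranges, which skip the surrogates,
       #xFFFE and #xFFFF. */
    if (c >= 0xE000 && c <= 0xFFFD)
        return true;

    return c >= 0x10000 && c <= 0x10FFFF;
}
```

<p>For typical text the function returns after the first two comparisons; the full set of nine is only evaluated for characters outside every earlier range.</p>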
]]></content:encoded>
	</item>
	<item>
		<title>By: John Snelson</title>
		<link>https://codesynthesis.com/~boris/blog//2008/05/19/real-xml-parser/#comment-441</link>
		<author>John Snelson</author>
		<pubDate>Tue, 20 May 2008 11:25:38 +0000</pubDate>
		<guid>https://codesynthesis.com/~boris/blog//2008/05/19/real-xml-parser/#comment-441</guid>
		<description>Everyone (including me) thinks that writing an XML parser will be relatively simple when they start off. However, parsing elements and attributes is probably only a quarter of the work needed to get things like DTD parsing and entity replacement done. It took a long time for me to even understand how entity replacement was meant to work!

http://snelson.org.uk/archives/2008/03/fast_xml_pull_p.php

However, I now think that the greatest barrier to writing faster XML parsers is checking that every character in it matches the "Char" grammar production:

http://www.w3.org/TR/2006/REC-xml-20060816/#charsets

I don't see any way to do this other than a hash-lookup or binary search for every relevant character - and that's not going to be fast.</description>
		<content:encoded><![CDATA[<p>Everyone (including me) thinks that writing an XML parser will be relatively simple when they start off. However, parsing elements and attributes is probably only a quarter of the work needed to get things like DTD parsing and entity replacement done. It took a long time for me to even understand how entity replacement was meant to work!</p>
<p><a href="http://snelson.org.uk/archives/2008/03/fast_xml_pull_p.php" rel="nofollow">http://snelson.org.uk/archives/2008/03/fast_xml_pull_p.php</a></p>
<p>However, I now think that the greatest barrier to writing faster XML parsers is checking that every character in it matches the &#8220;Char&#8221; grammar production:</p>
<p><a href="http://www.w3.org/TR/2006/REC-xml-20060816/#charsets" rel="nofollow">http://www.w3.org/TR/2006/REC-xml-20060816/#charsets</a></p>
<p>I don&#8217;t see any way to do this other than a hash-lookup or binary search for every relevant character - and that&#8217;s not going to be fast.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
