Xerces-C++ DOM Potholes

If you are using the Xerces-C++ DOM, there are a couple of functions that you probably shouldn’t use or, at least, should think twice about before using: getChildNodes and getTextContent.

There is nothing wrong with getChildNodes per se. It returns a DOMNodeList, which has the DOMNode* item (XMLSize_t index) member function. The problem is with item itself, which does its job in O(n) instead of the O(1) one would expect. A loop that calls item for every index is therefore quadratic in the number of children. You are better off rewriting your DOMNodeList-based iterations to walk the sibling chain directly, like this:

for (DOMNode* n (e.getFirstChild ());
     n != 0;
     n = n->getNextSibling ())
{
    ...
}
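
To get a feel for the difference, here is a small toy model. It is only a sketch: toy_node, toy_parent, touched_by_index and touched_by_sibling are made-up names, not Xerces-C++ types or functions, and the toy item simply assumes the "restart from the first child on every call" behavior described above. It counts how many nodes each style of iteration touches.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a DOM node; NOT a Xerces-C++ type.
struct toy_node
{
  toy_node* next_sibling = nullptr;
};

// Toy parent that owns its children and links them into a sibling chain.
struct toy_parent
{
  explicit toy_parent (std::size_t count) : children (count)
  {
    for (std::size_t i = 0; i + 1 < count; ++i)
      children[i].next_sibling = &children[i + 1];
  }

  // Copying would leave the sibling pointers aimed at the old vector.
  toy_parent (const toy_parent&) = delete;
  toy_parent& operator= (const toy_parent&) = delete;

  toy_node* first_child ()
  {
    return children.empty () ? nullptr : &children[0];
  }

  // Assumed model of an O(n) item(): walk from the first child each time.
  // 'touched' accumulates the number of nodes visited along the way.
  toy_node* item (std::size_t index, std::size_t& touched)
  {
    toy_node* n = first_child ();
    for (std::size_t i = 0; n != nullptr; ++i, n = n->next_sibling)
    {
      ++touched;
      if (i == index)
        return n;
    }
    return nullptr;
  }

  std::vector<toy_node> children;
};

// Index-based iteration: one item() call per child.
std::size_t touched_by_index (toy_parent& p)
{
  std::size_t touched = 0;
  for (std::size_t i = 0; i < p.children.size (); ++i)
  {
    toy_node* n = p.item (i, touched);
    assert (n != nullptr);
    (void) n; // Silence the unused warning when NDEBUG is defined.
  }
  return touched;
}

// Sibling-based iteration: each child is visited exactly once.
std::size_t touched_by_sibling (toy_parent& p)
{
  std::size_t touched = 0;
  for (toy_node* n = p.first_child (); n != nullptr; n = n->next_sibling)
    ++touched;
  return touched;
}
```

In this model, iterating by index over N children touches N(N+1)/2 nodes, while following the sibling chain touches N.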

The problem with getTextContent is memory management. This function goes over the child nodes, accumulating text in a buffer which it returns to you at the end. The important thing to know is that this buffer is allocated on the document heap and will only be freed when you destroy the document. Imagine an application that loads a DOM document at the beginning and then performs multiple queries (which involve calling getTextContent) on this single document: every call leaves another buffer behind, so memory use keeps growing for as long as the document is alive.
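
The effect can be modeled with a toy "document heap". This is a sketch only: toy_document and its member functions are hypothetical names, and the class imitates the allocate-now, free-at-destruction pattern described above rather than the actual Xerces-C++ allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy document whose "heap" only releases memory on destruction.
// NOT a Xerces-C++ class, just a model of the behavior.
class toy_document
{
public:
  explicit toy_document (std::string text) : text_ (std::move (text)) {}

  // Models a getTextContent-style call: the returned buffer is owned by
  // the document and stays allocated until the document is destroyed.
  const char* text_content_on_heap ()
  {
    std::unique_ptr<char[]> buf (new char[text_.size () + 1]);
    std::memcpy (buf.get (), text_.c_str (), text_.size () + 1);
    heap_.push_back (std::move (buf));
    return heap_.back ().get ();
  }

  // Models the alternative: the caller owns the copy, which is freed
  // when the returned string goes out of scope.
  std::string text_content_by_value () const
  {
    return text_;
  }

  // Bytes currently held by the toy document heap.
  std::size_t heap_bytes () const
  {
    return heap_.size () * (text_.size () + 1);
  }

private:
  std::string text_;
  std::vector<std::unique_ptr<char[]>> heap_;
};
```

Calling text_content_on_heap in a loop makes heap_bytes grow by one buffer per call, while text_content_by_value leaves it unchanged.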

Here is my implementation of text_content, which does its job without leaking memory. Note that its semantics differ slightly from the standard getTextContent: it only looks at the immediate child text nodes (text and CDATA sections), and it throws if it sees a nested DOMElement (no mixed content):

#include <string>
 
#include <xercesc/dom/DOMNode.hpp>
#include <xercesc/dom/DOMText.hpp>
#include <xercesc/dom/DOMElement.hpp>
 
#include <xercesc/util/XMLString.hpp>
 
struct mixed_content {};
 
std::string
text_content (const xercesc::DOMElement& e)
{
  std::string r;
 
  using xercesc::DOMNode;
  using xercesc::DOMText;
  using xercesc::XMLString;
 
  for (DOMNode* n (e.getFirstChild ());
       n != 0;
       n = n->getNextSibling ())
  {
    switch (n->getNodeType ())
    {
    case DOMNode::TEXT_NODE:
    case DOMNode::CDATA_SECTION_NODE:
      {
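        // DOMCDATASection derives from DOMText, so the cast covers both.
        DOMText* t (static_cast<DOMText*> (n));
 
        // transcode() returns a buffer that the caller must release.
        char* str (XMLString::transcode (t->getData ()));
        r += str;
        XMLString::release (&str);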
 
        break;
      }
    case DOMNode::ELEMENT_NODE:
      {
        throw mixed_content ();
      }
    default:
      break; // Ignore comments, processing instructions, etc.
    }
  }
 
  return r;
}
