A Study of XPath Performance in .NET Programming

One day, I received an e-mail from a customer complaining that our program, the EDC (Engineering Data Collection) service, was running at 100% CPU usage while handling certain XPath queries. The XPath in question was admittedly rather complicated, as you can see:

//CDResults[../../../TargetName/@Value=//SiteInformation[TargetName/@Value!=//SiteInformation[1]/TargetName/@Value and TargetName/@Value!=//SiteInformation[TargetName/@Value!=//SiteInformation[1]/TargetName/@Value][1]/TargetName/@Value][1]/TargetName/@Value]/BottomCD/@Value

I decided to test the program along with some alternative solutions. I set two goals for the tests:

  1. To verify if the XML parser is the part causing 100% CPU usage.
  2. If so, to try to find alternative solutions for better performance.

Methodology
A test program was built to implement four different solutions that all provide the same functionality: retrieving a value from a given XML document based on a certain XPath query. The four solutions included the current implementation in the EDC service and three alternatives. The major difference among these four solutions was:

Timestamps were recorded at the beginning and end of each solution, and the time span for each solution was calculated from them. All of this information was stored in a log file. A CPU usage history graph was captured to illustrate the difference between the solutions. Data analysis and further research were conducted after each test was done and the data became available.
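The timing approach described above can be sketched roughly as follows. This is a minimal illustration, not the article's actual test harness: the class name SolutionTimer, the method Measure, and the log line format are all hypothetical, and Stopwatch is used for the elapsed time as an assumption about how one might measure it.

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class SolutionTimer
{
    // Runs one solution, writes its start time, end time, and elapsed time
    // to the log, and returns the elapsed time.
    // The name parameter and the log format are hypothetical.
    public static TimeSpan Measure(string name, Action solution, TextWriter log)
    {
        DateTime start = DateTime.Now;
        Stopwatch watch = Stopwatch.StartNew();

        solution();

        watch.Stop();
        DateTime end = DateTime.Now;

        log.WriteLine("{0}: start={1:HH:mm:ss} end={2:HH:mm:ss} elapsed={3}",
            name, start, end, watch.Elapsed);
        return watch.Elapsed;
    }
}
```

A call such as SolutionTimer.Measure("Solution 2", RunSolution2, logWriter) would then produce one log line per solution, where RunSolution2 and logWriter stand in for whatever method and log file the test program really uses.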

Test Environment

Raw Data
The source code can be downloaded from here.

//CDResults[../../../TargetName/@Value=//SiteInformation[TargetName/@Value!=//SiteInformation[1]/TargetName/@Value and TargetName/@Value!=//SiteInformation[TargetName/@Value!=//SiteInformation[1]/TargetName/@Value][1]/TargetName/@Value][1]/TargetName/@Value]/BottomCD/@Value

Test Result and Analysis
CPU Usage
The CPU usage rose to 100% immediately after the test application started. This confirms that the 100% CPU usage issue is caused by the XML parser (see Figure 1).

Result of Each Solution
All four solutions ran correctly and got the same result: 9.161745E-02. So all the solutions are workable.

All four solutions drove CPU usage to 100%, but they took dramatically different times to finish. I ran the test program twice. Table 1 shows the time used by each solution during the two runs.

  1. The time format is HH:MM:SS.
  2. The first run was executed in Visual Studio debug mode.
  3. The second run was executed after the program was compiled as a standalone executable.

Figure 2 shows the CPU usage of the current solution. The time span is 1 hour 10 minutes and 18 seconds.

Figure 3 shows the CPU usage of our proposed new solution. The time span is 2 minutes and 56 seconds.

XmlDocument versus XPathDocument
The major differences between XmlDocument and XPathDocument are:

  1. XmlDocument is editable, while XPathDocument is read-only
  2. They are based on different data models.

XmlDocument is based on the W3C XML DOM, an object model that covers essentially all XML syntax, including low-level syntactic sugar such as entities, CDATA sections, DTDs, notations, etc. It is a document-centric model that allows for full fidelity when loading and saving XML documents.

XPathDocument is based on the XPath 1.0 data model, a read-only, XML Infoset-compatible, data-centric object model that covers only the semantically significant parts of XML and leaves out insignificant syntax details - no DTD, no entities, no CDATA, no adjacent text nodes, only significant data expressed as a tree with seven types of nodes. It is simple and lightweight. That's why XPathDocument is the preferred data store for read-only scenarios, especially when XPath or XSLT is involved.
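To make the difference between the two data models concrete, the following small sketch loads the same fragment both ways. The sample XML, the class name DataModelDemo, and its method names are hypothetical and chosen only for illustration. The fragment places text, a CDATA section, and more text side by side: the DOM keeps them as separate child nodes, while the XPath data model exposes them as a single text node.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.XPath;

public static class DataModelDemo
{
    // Hypothetical sample: text, a CDATA section, and more text side by side.
    public const string Xml = "<a>x<![CDATA[y]]>z</a>";

    // Number of child nodes the DOM keeps under <a>
    // (the CDATA section survives as its own node).
    public static int DomChildCount()
    {
        XmlDocument dom = new XmlDocument();
        dom.LoadXml(Xml);
        return dom.DocumentElement.ChildNodes.Count;
    }

    // Number of text nodes the XPath data model exposes under <a>
    // (adjacent text and CDATA content are merged).
    public static int XPathTextNodeCount()
    {
        XPathDocument doc = new XPathDocument(new StringReader(Xml));
        XPathNavigator nav = doc.CreateNavigator();
        return Convert.ToInt32(nav.Evaluate("count(/a/text())"));
    }
}
```

For this fragment, DomChildCount() reports three child nodes and XPathTextNodeCount() reports one, which is the "no CDATA, no adjacent text nodes" simplification described above.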

Converting XmlDocument into XPathDocument
There is a wide variety of ways to convert an XmlDocument to an XPathDocument. I tried two of them. Both methods are very straightforward. Their general algorithms are:

Method 1:

Method 2:

For implementation details refer to the source code in the Appendix.

I tested these two methods with a dummy XML file of 11,267,545 bytes. This is a huge XML file, and one of this size is unlikely to occur in real production. The results are shown in Table 2:

Memory consumption for both methods was almost equal.

In Figures 4 and 5, the red line shows the CPU and memory use of Method 1, and the blue line shows the CPU and memory use of Method 2.

Conclusion and Recommendations
As you can see, both Solutions 2 and 3, which use XPathDocument, are about 30 times faster than the current solution, which uses XmlDocument, while producing the same result. XPathDocument is therefore recommended.

If the XML needs to be updated or modified, XmlDocument can be used first, before the program reaches the XPath query part. The XmlDocument can then be converted to an XPathDocument before the program proceeds to the XPath query.

Below is a piece of sample code in C# showing how to convert XmlDocument to XPathDocument.

using System.Xml;
using System.Xml.XPath;
using System.IO;

// variables definition
XPathDocument xpathDoc = null;
XmlDocument xmlDoc = null;

// xml file
const string FILE = "<xml file path goes here>";

// initialize XmlDocument, then load the xml file
// (XmlDocument has no constructor that takes a file path)
xmlDoc = new XmlDocument();
xmlDoc.Load( FILE );

// save XmlDocument into a memory stream
MemoryStream memStream = new MemoryStream();
xmlDoc.Save( memStream );

// rewind the stream so it can be read from the beginning
memStream.Position = 0;

// create XPathDocument with memory stream
xpathDoc = new XPathDocument( memStream );
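As a point of comparison, another commonly used way to perform the conversion is to read the in-memory XmlDocument through an XmlNodeReader, which avoids serializing it to a stream first. The sketch below is an assumption about how this could be written; it is not necessarily either of the two methods measured above, and no performance claim is made for it. The class name XPathConversion and the method FromXmlDocument are hypothetical.

```csharp
using System.Xml;
using System.Xml.XPath;

public static class XPathConversion
{
    // Builds a read-only XPathDocument by reading the nodes of an
    // in-memory XmlDocument directly, without an intermediate stream.
    public static XPathDocument FromXmlDocument(XmlDocument xmlDoc)
    {
        using (XmlNodeReader reader = new XmlNodeReader(xmlDoc))
        {
            // The constructor consumes the whole reader before returning.
            return new XPathDocument(reader);
        }
    }
}
```

Whether this is faster or slower than the memory stream approach would need to be measured in the same way as the two methods in Table 2.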

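Solution 2 is our proposed solution. Compared to the Select method used in Solution 3, the Evaluate method supports more features: it can evaluate XPath expressions that return a number, string, or Boolean, such as function calls, whereas Select accepts only expressions that return a node-set. See the Appendix for the implementation details of Solution 2.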

Below is a piece of sample code in C# showing how to query XML with XPathDocument.

using System.Xml.XPath;

// variables definition
XPathDocument xpathDoc = null;
XPathNavigator nav = null;
XPathExpression expression = null;
XPathNodeIterator iterator = null;

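// xml file and xpath query
const string FILE = "<xml file path goes here>";
const string XPATH = "<xpath string goes here>";

// load the file into a read-only XPathDocument and compile the query
xpathDoc = new XPathDocument( FILE );
nav = xpathDoc.CreateNavigator();
expression = nav.Compile( XPATH );

// the cast is valid only when the query returns a node-set
iterator = (XPathNodeIterator) nav.Evaluate( expression );
while( iterator.MoveNext() )
{
    // iterator.Current is positioned on the matched node
    string value = iterator.Current.Value;
}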
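The difference between Evaluate and Select mentioned earlier can be illustrated with a small sketch. The sample XML, the class name EvaluateVersusSelect, and its method names are hypothetical. The sketch assumes that Select rejects a non-node-set expression by throwing; since the exact exception type may be XPathException or ArgumentException depending on the source consulted, both are caught.

```csharp
using System;
using System.IO;
using System.Xml.XPath;

public static class EvaluateVersusSelect
{
    // Hypothetical sample document.
    public const string Xml = "<root><item>1</item><item>2</item></root>";

    public static XPathNavigator CreateNavigator()
    {
        return new XPathDocument(new StringReader(Xml)).CreateNavigator();
    }

    // Evaluate returns a boxed double for a numeric expression such as count().
    public static double CountWithEvaluate(XPathNavigator nav)
    {
        return (double)nav.Evaluate("count(/root/item)");
    }

    // Select handles only node-set expressions. Returns false when the
    // expression is rejected. Both exception types are caught because the
    // type thrown is treated here as an assumption.
    public static bool SelectAccepts(XPathNavigator nav, string xpath)
    {
        try
        {
            nav.Select(xpath);
            return true;
        }
        catch (XPathException)
        {
            return false;
        }
        catch (ArgumentException)
        {
            return false;
        }
    }
}
```

With this sample, CountWithEvaluate returns 2, SelectAccepts accepts "/root/item", and SelectAccepts rejects "count(/root/item)" because its result is a number rather than a node-set.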

Proof of Our Recommendation
To prove that our recommended solution is practical, I repeated this test in a different hardware environment, one closer to a powerful server in a real production environment.

The hardware and software configuration was:

This was a powerful, recently built server. Solutions 2 and 3 were tested in one go. No significant difference in performance was found between the different versions of the .NET Framework. The data in Table 3 shows the test results under .NET Framework 3.0.

  1. The time format is HH:MM:SS

A maximum CPU usage of 29% was observed by watching the CPU usage meter in Windows Task Manager. Figure 6 shows the CPU usage history during the whole test.

Compared to the previous implementation, this result is rather exciting and satisfying. A processing time of roughly 20 seconds and CPU usage below 30% is practical, so I'd recommend this solution.

How to Write Better XPath Queries
Avoiding the double slash "//" is an important factor, since it searches the whole tree recursively and returns matching elements no matter where they are in the document. That's really time consuming.
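As a rough illustration of this advice, the sketch below runs a "//" query and an explicit path against the same small document. The sample XML, its element names, and the class name PathStyleDemo are hypothetical. The two queries are interchangeable only because every item element in this sample lives at one known location; no timing is shown, since any speed difference depends on the document and would have to be measured.

```csharp
using System.IO;
using System.Xml.XPath;

public static class PathStyleDemo
{
    // Hypothetical sample document; element names are illustrative only.
    public const string Xml =
        "<root><group><item>a</item><item>b</item></group><other><note/></other></root>";

    // Returns how many nodes the given XPath query selects.
    public static int Count(string xpath)
    {
        XPathNavigator nav = new XPathDocument(new StringReader(Xml)).CreateNavigator();
        return nav.Select(xpath).Count;
    }
}
```

Here Count("//item") and Count("/root/group/item") select the same two nodes, but the explicit path only has to follow the named children instead of visiting every element in the tree.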

Concerning the "[]" index, we discovered something interesting. Microsoft IE5 and later implement [0] as the first node, but according to the W3C standard the first node should be [1].
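The W3C behavior is what the .NET XPath classes follow, which a small sketch can confirm. The sample XML, the class name PositionDemo, and the method ValueOf are hypothetical.

```csharp
using System.IO;
using System.Xml.XPath;

public static class PositionDemo
{
    // Hypothetical sample document.
    public const string Xml = "<root><item>first</item><item>second</item></root>";

    // Returns the string value of the first node the query selects,
    // or null when the query selects nothing.
    public static string ValueOf(string xpath)
    {
        XPathNavigator nav = new XPathDocument(new StringReader(Xml)).CreateNavigator();
        XPathNavigator node = nav.SelectSingleNode(xpath);
        return node == null ? null : node.Value;
    }
}
```

With this sample, ValueOf("/root/item[1]") returns "first", while ValueOf("/root/item[0]") returns null, because in XPath 1.0 positions start at 1 and no node has position 0.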

Writing better XPath queries is a big and relatively open topic, and I have only done some very preliminary studies. XPath is powerful and gives the user a great deal of flexibility in constructing queries. Finding better ways to write them is something I need to keep working on.

© 2008 SYS-CON Media