Really, Catalogs Matter

This week I learned that XML Catalogs are very important.

This started when I updated Marc Liyanage’s PHP binary for Mac OS X on my development machine.

Pages went from taking milliseconds to over a minute to render. To say I was puzzled would be an understatement. I rolled back to an earlier version.

Looking for Clues

Some initial testing on another machine showed that the slowdown was in PHP’s DOMXML extension, which exposes the Gnome XML and XSLT libraries to PHP as functions and objects.

After searching Google, php.net, and xmlsoft.com, I sent an email to Christian Stocker in Zurich. Christian works on the DOMXML extension, so he might know of a bug.

I had gotten it in my head that the problem lay in nesting XInclude statements. XInclude is a specification for including one XML document inside another. We use XInclude to keep content for one of our sites isolated to a well-formed, valid XHTML document that can be edited in BBEdit.

A section of the intranet is described as an Atom feed, and each article’s contents are included in the feed. The Atom feed is in turn included in an envelope document that contains the rest of the XML needed to render any page in the section.

I had jumped to the conclusion that LibXML2 had somehow changed and become inefficient at resolving nested XIncludes.

Christian wrote back that there weren’t any issues he knew of, but asked me to send a test case.

The Wrong Test

I devised this test case:

foo.xml:

<?xml version="1.0" encoding="utf-8"?>
<foo xmlns:xi="http://www.w3.org/2001/XInclude">
<xi:include href="bar.xml" />
</foo>

bar.xml:

<?xml version="1.0" encoding="utf-8"?>
<bar xmlns:xi="http://www.w3.org/2001/XInclude">
<xi:include href="baz.xml" />
</bar>

baz.xml:

<?xml version="1.0" encoding="utf-8"?>
<baz>Content!</baz>

When run with:

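<?php
// Load the outer document with the PHP 4 DOMXML extension.
$dom = domxml_open_file("foo.xml");

// Time only the XInclude processing step.
$start1 = gettimeofday();
$dom->xinclude();
$end1 = gettimeofday();

// Combine the seconds and microseconds fields into elapsed seconds.
$totaltime1 = (float)($end1['sec'] - $start1['sec'])
  + ((float)($end1['usec'] - $start1['usec']) / 1000000);

echo "Time to handle includes: $totaltime1<br>";
echo $dom->dump_mem();
?>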

That should return:

<?xml version="1.0" encoding="utf-8"?>
<foo xmlns:xi="http://www.w3.org/2001/XInclude">
<bar xmlns:xi="http://www.w3.org/2001/XInclude">
<baz>Content!</baz>
</bar>
</foo>

It did, and faster than I expected: it timed at less than a second instead of over a minute.
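DOMXML only exists in PHP 4. For anyone repeating the test on PHP 5 or later, a rough equivalent using the DOM extension might look like the sketch below. The scratch directory and the inlined fixture files are assumptions made to keep the example self-contained; they are not part of the original test.

```php
<?php
// Sketch only: rebuilds the three-file test in a scratch directory and times
// XInclude processing with the DOM extension (PHP 5+) instead of DOMXML.
$dir = sys_get_temp_dir() . '/xinclude-test-' . uniqid();  // hypothetical location
mkdir($dir);

$xi = 'xmlns:xi="http://www.w3.org/2001/XInclude"';
file_put_contents("$dir/foo.xml", "<foo $xi><xi:include href=\"bar.xml\" /></foo>");
file_put_contents("$dir/bar.xml", "<bar $xi><xi:include href=\"baz.xml\" /></bar>");
file_put_contents("$dir/baz.xml", "<baz>Content!</baz>");

$dom = new DOMDocument();
$dom->load("$dir/foo.xml");

$start = microtime(true);   // float seconds, replaces the gettimeofday() arithmetic
$dom->xinclude();           // nested includes are resolved relative to each file
$elapsed = microtime(true) - $start;

$xml = $dom->saveXML();
printf("Time to handle includes: %.4f s\n", $elapsed);
echo $xml;
```

Depending on the LibXML2 version, the included elements may also carry an xml:base attribute that the listing above does not show.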

I changed bar.xml to:

<?xml version="1.0" encoding="utf-8"?>
<bar xmlns:xi="http://www.w3.org/2001/XInclude">
<xi:include href="baz.html" />
</bar>

and baz.html was:

<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
	<title>Untitled</title>
</head>
<body>
<p>New document</p>
</body>
</html>

That version did take several seconds, as I thought it would.

The Right Test

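That’s when it dawned on me: somewhere between the versions of the libraries the two PHP builds used, XInclude processing had started validating included documents by default, which means fetching the DTD named in each DOCTYPE.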

The XHTML DTD URL, http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd, is a busy place. Try loading it and see. I bet it is so busy because a lot of people don’t know their tools call over there every time they load or validate a document.

Commenting out the DTD declaration in baz.html and re-running the test brings back the earlier level of performance. However, I don’t want to comment out the DTD references in my documents.
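Another way to check the network hypothesis without touching the documents is to forbid network access during the include. The sketch below uses the PHP 5+ DOM extension rather than DOMXML, and assumes the bundled LibXML2 applies LIBXML_NONET to documents pulled in through XInclude. The scratch directory and inlined files are stand-ins for the test files above.

```php
<?php
// Sketch only: include an XHTML file that has a DOCTYPE, but forbid network
// access, so a slow run cannot be blamed on fetching the DTD from w3.org.
$dir = sys_get_temp_dir() . '/xinclude-nonet-' . uniqid();  // hypothetical location
mkdir($dir);

file_put_contents("$dir/bar.xml",
    '<bar xmlns:xi="http://www.w3.org/2001/XInclude"><xi:include href="baz.html" /></bar>');
file_put_contents("$dir/baz.html",
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"' . "\n" .
    '    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">' . "\n" .
    '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>Untitled</title></head>' .
    '<body><p>New document</p></body></html>');

libxml_use_internal_errors(true);   // collect parser messages instead of printing warnings

$dom = new DOMDocument();
$dom->load("$dir/bar.xml");

$start = microtime(true);
$dom->xinclude(LIBXML_NONET);       // LIBXML_NONET: no network access while loading
$elapsed = microtime(true) - $start;

// Print whatever the parser reported, if anything (for example a blocked DTD fetch).
foreach (libxml_get_errors() as $error) {
    echo trim($error->message), "\n";
}
printf("Time to handle includes without network: %.4f s\n", $elapsed);
$xml = $dom->saveXML();
```

If this run is fast while the same include without the flag is slow, the time is going into the DTD download rather than into XInclude itself.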

Going to Catalogs

I wrote back to Christian asking if LibXML, as built for PHP, honored XML Catalog files.

With a catalog file, I can tell my validating processor to resolve any reference to “http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd” as a local file. Catalogs can do more than that, but the local resolution of DTD files is important.

Christian replied that by default, LibXML looks for a catalog at /etc/xml/catalog. So I created a catalog there.

<?xml version="1.0"?>
<!DOCTYPE catalog PUBLIC "-//OASIS//DTD Entity Resolution XML Catalog V1.0//EN"
"http://www.oasis-open.org/committees/entity/release/1.0/catalog.dtd">
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
   <public publicId="-//W3C//DTD XHTML 1.0 Transitional//EN"
       uri="file:///etc/xml/xhtml/DTD/xhtml1-transitional.dtd" />
   <public publicId="-//W3C//DTD XHTML 1.0 Strict//EN"
       uri="file:///etc/xml/xhtml/DTD/xhtml1-strict.dtd" />
</catalog>

I pointed “http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd” to a local directory Apache could read from, put a copy of the DTD files there, and ran the tests again. The result was not as fast as without validation, but certainly faster, since the parser no longer had to go over the Internet to validate the included file.
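To double-check what a catalog actually maps, it can be read as ordinary XML. The sketch below uses the PHP 5+ DOM extension; the lookup() helper, the inlined catalog, and the local xhtml1-strict.dtd path are hypothetical and for illustration only. It mimics a simple exact-match lookup and is not how LibXML itself resolves catalog entries.

```php
<?php
// Sketch only: look up identifiers in a catalog by reading it as plain XML.
$catalogXml = <<<'XML'
<?xml version="1.0"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
   <public publicId="-//W3C//DTD XHTML 1.0 Strict//EN"
       uri="file:///etc/xml/xhtml/DTD/xhtml1-strict.dtd" />
   <system systemId="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
       uri="file:///etc/xml/xhtml/DTD/xhtml1-strict.dtd" />
</catalog>
XML;

$catalog = new DOMDocument();
$catalog->loadXML($catalogXml);

$xpath = new DOMXPath($catalog);
$xpath->registerNamespace('c', 'urn:oasis:names:tc:entity:xmlns:xml:catalog');

// Hypothetical helper: return the local URI mapped to a public or system
// identifier, or null when the catalog has no exact match.
function lookup(DOMXPath $xpath, string $id): ?string
{
    foreach ($xpath->query('//c:public | //c:system') as $entry) {
        $key = $entry->localName === 'public' ? 'publicId' : 'systemId';
        if ($entry->getAttribute($key) === $id) {
            return $entry->getAttribute('uri');
        }
    }
    return null;
}

echo lookup($xpath, '-//W3C//DTD XHTML 1.0 Strict//EN'), "\n";
```

A public entry matches the identifier in the DOCTYPE, while a system entry matches the DTD URL itself. Listing both covers documents that only give the URL.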

So there you go: catalog files are really important. I am suitably chastened now.

Thanks to Christian for getting me pointed in the right direction on this.

Update: The W3C would appreciate it if you stopped hitting their servers every time you parse a document.