Using LINQ To XML… and Handling Namespaces
August 31, 2010
While working on the Biodiversity Heritage Library (BHL), I have incorporated the open source book viewer software from OpenLibrary. This book viewer offers an interface for viewing a set of images of a book’s pages.
The basic page metadata that is required by the book viewer is the path to the page image and the dimensions of that image. The majority of the books available in BHL have been scanned by Internet Archive, and one of the outputs of their scanning process is an XML document that contains the dimensions of each page image.
So, we have an XML document that contains a bunch of information about the pages in a book, including the height and width of each page. And we have a process (a web service) implemented in C# that needs to return the height and width metadata for each page.
Enter LINQ to XML.
Microsoft says “LINQ to XML provides an in-memory XML programming interface that leverages the .NET Language-Integrated Query (LINQ) Framework. LINQ to XML uses the latest .NET Framework language capabilities and is comparable to an updated, redesigned Document Object Model (DOM) XML programming interface.” In other words, it’s the perfect tool for parsing the information that is needed from the XML files provided by Internet Archive.
AN INCOMPLETE SOLUTION
Let’s take a look at a simplified view of one of the XML files. The complete file can be found here.
You can see that within the <pageData> element are multiple <page> elements, one for each page. Within each <page> element are <w> and <h> elements that contain the width and height of that page. One more thing to note is the <addToAccessFormats> element that appears within each <page> element. If the text value of <addToAccessFormats> is false, then that page is not displayed in the book viewer, and the dimensions of that page don’t need to be read.
Here’s a simple function that will parse the height and width metadata for each displayed page.
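What follows is a minimal sketch of such a function, not the original listing. The sample XML is invented to match the structure described above. The tuple return type, the string parameter (a real service would more likely call XDocument.Load on a file or URL), and the class name NaiveScanDataReader are assumptions made for illustration. ReadPageDimensionsBAD is the name this post uses for the first version.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class NaiveScanDataReader
{
    // Assumed sample data, shaped like the simplified document described above.
    public const string SampleXml = @"<book>
  <pageData>
    <page>
      <addToAccessFormats>false</addToAccessFormats>
      <w>2500</w>
      <h>3700</h>
    </page>
    <page>
      <addToAccessFormats>true</addToAccessFormats>
      <w>2400</w>
      <h>3600</h>
    </page>
    <page>
      <addToAccessFormats>true</addToAccessFormats>
      <w>2300</w>
      <h>3500</h>
    </page>
  </pageData>
</book>";

    // Returns the width and height of every page marked for display.
    // Element names are used WITHOUT a namespace, which is the flaw
    // discussed in the next section.
    public static List<(int Width, int Height)> ReadPageDimensionsBAD(string xml)
    {
        XDocument doc = XDocument.Parse(xml);

        var pages =
            from page in doc.Element("book").Element("pageData").Elements("page")
            // The string cast yields null (not an exception) if the element is missing.
            where (string)page.Element("addToAccessFormats") == "true"
            select (Width: (int)page.Element("w"), Height: (int)page.Element("h"));

        return pages.ToList();
    }
}
```

Calling NaiveScanDataReader.ReadPageDimensionsBAD(NaiveScanDataReader.SampleXml) returns two entries, because the first sample page has <addToAccessFormats> set to false.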
Pretty simple stuff. Loading the XML document and navigating its elements work much like the usual methods for parsing the XML DOM. The one major difference is the LINQ query itself, which selects all of the <page> elements whose <addToAccessFormats> element has a value of “true”.
THE BETTER SOLUTION (OR, DON’T FORGET THE NAMESPACE)
Unfortunately, there’s a problem. The example just given will parse the example XML document just fine, but some of the XML documents at Internet Archive are slightly different. Consider the following example (full file found here):
Looks pretty much the same, right? Not quite. Notice the very first line of the file: a default namespace (http://www.archive.org/scribe/xml) has been supplied. If you attempt to parse this XML document with the function above, the LINQ query will throw an exception. Because the query uses element names without a namespace, the elements “book”, “pageData”, and “page” are not found (in LINQ to XML, a lookup by name matches only when both the local name and the namespace agree), and the query fails as soon as it tries to use them.
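The mismatch is easy to see in isolation. This is a small sketch, with an invented sample document that declares the default namespace mentioned above. The class and method names are hypothetical.

```csharp
using System.Xml.Linq;

public static class NamespaceLookupDemo
{
    // Assumed sample: same shape as before, but with a default namespace on the root.
    public const string NamespacedXml = @"<book xmlns=""http://www.archive.org/scribe/xml"">
  <pageData>
    <page>
      <addToAccessFormats>true</addToAccessFormats>
      <w>2400</w>
      <h>3600</h>
    </page>
  </pageData>
</book>";

    // Looks for <book> by its bare name, with no namespace.
    public static bool FindsBookUnqualified(XDocument doc)
    {
        return doc.Element("book") != null;
    }

    // Looks for <book> by its namespace-qualified name.
    public static bool FindsBookQualified(XDocument doc)
    {
        XNamespace ns = "http://www.archive.org/scribe/xml";
        return doc.Element(ns + "book") != null;
    }
}
```

For the sample above, FindsBookUnqualified returns false and FindsBookQualified returns true. Element returns null rather than throwing when nothing matches, so a chained call such as doc.Element("book").Element("pageData") is where a NullReferenceException would surface.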
Here is a corrected version of the original function:
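The listing below is a minimal sketch of how such a function could look, not the original code. It reuses the name ReadPageDimensionsGOOD and the fallback namespace given in this post. The tuple return type, the string parameter, and the wrapper class ScanDataReader are assumptions made for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class ScanDataReader
{
    // Hardcoded fallback namespace, as named in this post.
    private static readonly XNamespace FallbackNamespace =
        "http://www.archive.org/scribe/xml";

    public static List<(int Width, int Height)> ReadPageDimensionsGOOD(string xml)
    {
        XDocument doc = XDocument.Parse(xml);

        // Start with no namespace: XNamespace.None + "book" is the plain name "book".
        XNamespace ns = XNamespace.None;

        if (doc.Element(ns + "book") == null)
        {
            // The root was not found without a namespace. Prefer the default
            // namespace declared on the root; otherwise use the fallback.
            XNamespace declared = doc.Root.GetDefaultNamespace();
            ns = declared != XNamespace.None ? declared : FallbackNamespace;
        }

        // Every element name in the query is qualified with the chosen namespace.
        var pages =
            from page in doc.Element(ns + "book")
                            .Element(ns + "pageData")
                            .Elements(ns + "page")
            where (string)page.Element(ns + "addToAccessFormats") == "true"
            select (Width: (int)page.Element(ns + "w"),
                    Height: (int)page.Element(ns + "h"));

        return pages.ToList();
    }
}
```

This sketch still assumes the root element is named book. A document with a different root would fail on the chained Element calls, so production code would want an explicit check there.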
The difference between this function and the first is the use of a namespace when accessing the elements of the XML document. The original ReadPageDimensionsBAD function queries the XML document without consideration for any namespaces that might be defined in the document. ReadPageDimensionsGOOD, on the other hand, first tries to access only the root of the document without using a namespace. If that works, great; it continues without defining any namespaces. However, if it cannot access the root without a namespace, it defines a namespace for use in querying the document. It does this by first checking for a default namespace defined on the root element, then falling back to a hardcoded value of “http://www.archive.org/scribe/xml” if none is declared.
Once a namespace is defined, the ReadPageDimensionsGOOD function can use that namespace to access the elements of the XML document. Notice how the name of each element is prefixed with the namespace definition in the LINQ to XML query.
The ReadPageDimensionsGOOD function will work for either of the XML documents that were presented here, and so it should be able to read any of the page metadata XML documents at Internet Archive.
Using methods similar to the ones shown here, I was able to successfully implement the book viewer at BHL. An example of a book containing pages of various sizes can be found here.