Using LINQ To XML… and Handling Namespaces
August 31, 2010
OVERVIEW
While working on the Biodiversity Heritage Library (BHL), I incorporated the open-source book viewer software from OpenLibrary. This book viewer offers an interface for viewing a set of images of a book’s pages.
The basic requirements for using the BookViewer are 1) to embed the JavaScript library that implements the book viewer in a web page and 2) to write JavaScript functions that supply the book viewer with page metadata. That is a bit of an oversimplification, but it gives the general idea of the effort required.
The basic page metadata that is required by the book viewer is the path to the page image and the dimensions of that image. The majority of the books available in BHL have been scanned by Internet Archive, and one of the outputs of their scanning process is an XML document that contains the dimensions of each page image.
In BHL, the web page that contains the book viewer is a plain HTML page. JavaScript and jQuery are used to call a web service implemented in C# to get the page metadata required by the viewer.
So, we have an XML document that contains a bunch of information about the pages in a book, including the height and width of each page. And, we have a process (the web service) implemented in C# that needs to return the height and width metadata for each page.
Enter LINQ to XML.
Microsoft says “LINQ to XML provides an in-memory XML programming interface that leverages the .NET Language-Integrated Query (LINQ) Framework. LINQ to XML uses the latest .NET Framework language capabilities and is comparable to an updated, redesigned Document Object Model (DOM) XML programming interface.” In other words, it’s the perfect tool for parsing the information that is needed from the XML files provided by Internet Archive.
AN INCOMPLETE SOLUTION
Let’s take a look at a simplified view of one of the XML files. The complete file can be found here.
<book>
  <bookData>
    <bookId>bulletindelasoci14pari</bookId>
    <leafCount>482</leafCount>
  </bookData>
  <pageData>
    <page leafNum="0">
      <addToAccessFormats>false</addToAccessFormats>
      <cropBox>
        <w>1747</w>
        <h>2620</h>
      </cropBox>
    </page>
    <page leafNum="1">
      <addToAccessFormats>true</addToAccessFormats>
      <cropBox>
        <w>1669</w>
        <h>2776</h>
      </cropBox>
    </page>
  </pageData>
</book>
You can see that the <pageData> element contains multiple <page> elements, one for each page. Within each <page> element, the <w> and <h> elements (children of <cropBox>) contain the width and height of that page. One more thing to note is the <addToAccessFormats> element that appears within each <page> element. If its text value is false, the page is not displayed in the book viewer, and the dimensions of that page don’t need to be read.
Here’s a simple function that will parse the height and width metadata for each displayed page.
// Requires: using System.Linq; using System.Xml.Linq;
// The return type is assumed; only the name and filePath appear in the text.
void ReadPageDimensionsBAD(string filePath)
{
    // Load the XML document
    var xml = XDocument.Load(filePath);

    // Use a LINQ query to read the page elements from the document
    var pages = from page in xml.Element("book").Element("pageData").Descendants("page")
                where (string)page.Element("addToAccessFormats") == "true"
                select page;

    foreach (XElement page in pages)
    {
        // Reset to the defaults for every page, so that a missing value
        // does not inherit the previous page's dimensions
        string width = "1200";  // default width
        string height = "1600"; // default height

        // Make sure we have a cropBox element
        XElement cropBox = page.Element("cropBox");
        if (cropBox != null)
        {
            // Read the height and width
            XElement widthElement = cropBox.Element("w");
            XElement heightElement = cropBox.Element("h");
            width = (widthElement == null ? width : widthElement.Value);
            height = (heightElement == null ? height : heightElement.Value);

            // Now save the width and height for the current page (not shown)
        }
    }
}
Pretty simple stuff. Loading the XML document and navigating its elements work much like the familiar methods for parsing the XML DOM. The one major difference is the LINQ query itself, which selects all of the <page> elements whose <addToAccessFormats> element has a value of “true”.
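The “not shown” step has to turn the text of <w> and <h> into numbers before the viewer can use them. The following is only a sketch of one way to do that, not the code used at BHL: the PageDimensions class, the PageDimensionReader.Read and ParseDimension names, and the decision to round to whole pixels are all assumptions made for illustration. It parses with the invariant culture so that values written with a decimal point (some Internet Archive files store widths such as 2724.0) are read the same way on any server. It uses unqualified element names, so it only handles documents without a namespace.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Xml.Linq;

// Hypothetical container for one page's dimensions (not from the original post).
public sealed class PageDimensions
{
    public int LeafNum { get; set; }
    public int Width { get; set; }
    public int Height { get; set; }
}

public static class PageDimensionReader
{
    // Parses text such as "1747" or "2724.0". Returns the fallback when the
    // element is missing or its text is not a number.
    public static int ParseDimension(XElement element, int fallback)
    {
        decimal value;
        if (element != null &&
            decimal.TryParse(element.Value, NumberStyles.Number,
                             CultureInfo.InvariantCulture, out value))
        {
            return (int)Math.Round(value);
        }
        return fallback;
    }

    // Returns the dimensions of every displayed page that has a cropBox.
    public static List<PageDimensions> Read(XDocument xml)
    {
        var pages = from page in xml.Descendants("page")
                    where (string)page.Element("addToAccessFormats") == "true"
                    let cropBox = page.Element("cropBox")
                    where cropBox != null
                    select new PageDimensions
                    {
                        // -1 marks a page without a leafNum attribute
                        LeafNum = (int?)page.Attribute("leafNum") ?? -1,
                        Width = ParseDimension(cropBox.Element("w"), 1200),
                        Height = ParseDimension(cropBox.Element("h"), 1600)
                    };
        return pages.ToList();
    }
}
```

Calling PageDimensionReader.Read(XDocument.Load(filePath)) on the simplified file above would yield a single entry for leaf 1, since leaf 0 is excluded by its <addToAccessFormats> value.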
THE BETTER SOLUTION (OR, DON’T FORGET THE NAMESPACE)
Unfortunately, there’s a problem. The example just given will parse the example XML document just fine, but some of the XML documents at Internet Archive are slightly different. Consider the following example (full file found here):
<book xmlns="http://archive.org/scribe/xml">
  <bookData>
    <bookId>systematikundfau19textgies</bookId>
    <leafCount>858</leafCount>
  </bookData>
  <pageData>
    <page leafNum="1">
      <addToAccessFormats>true</addToAccessFormats>
      <cropBox>
        <w>2724.0</w>
        <h>3768.0</h>
      </cropBox>
    </page>
    <page leafNum="2">
      <addToAccessFormats>true</addToAccessFormats>
      <cropBox>
        <w>2608.0</w>
        <h>3494.0</h>
      </cropBox>
    </page>
  </pageData>
</book>
Looks pretty much the same, right? Not quite. Notice the very first line of the file: a default namespace (http://archive.org/scribe/xml) has been supplied. If you attempt to parse this XML document with the function above, the LINQ query will throw an exception. Because no namespace is used in the query, the elements “book”, “pageData”, and “page” are not found: xml.Element("book") returns null, and the chained call to Element("pageData") on that null reference throws a NullReferenceException.
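The failure is easy to reproduce in isolation. The sketch below is an illustration only: the NamespaceLookupDemo class and its method names are made up for this example, and the document is a cut-down stand-in for a real file with a default namespace on the root element.

```csharp
using System;
using System.Xml.Linq;

public static class NamespaceLookupDemo
{
    // Namespace URI used for the illustration.
    public const string ScribeNamespace = "http://archive.org/scribe/xml";

    // Builds a tiny document whose root declares a default namespace.
    public static XDocument BuildSample()
    {
        return XDocument.Parse(
            "<book xmlns=\"" + ScribeNamespace + "\"><pageData /></book>");
    }

    // True if the document's root element has exactly this (qualified) name.
    public static bool HasRoot(XDocument xml, XName name)
    {
        return xml.Element(name) != null;
    }

    // Mirrors the chained lookup from the first function and reports
    // whether it fails with a NullReferenceException.
    public static bool UnqualifiedChainThrows(XDocument xml)
    {
        try
        {
            // Element("book") returns null here, so the second call throws.
            xml.Element("book").Element("pageData");
            return false;
        }
        catch (NullReferenceException)
        {
            return true;
        }
    }
}
```

With the sample document, HasRoot(xml, "book") is false, because an unqualified name only matches elements that are in no namespace, while HasRoot(xml, ns + "book") is true when ns is an XNamespace holding the same URI.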
Here is a corrected version of the original function:
// Requires: using System.Linq; using System.Xml.Linq;
// The return type is assumed; only the name and filePath appear in the text.
void ReadPageDimensionsGOOD(string filePath)
{
    // Load the XML document
    var xml = XDocument.Load(filePath);
    XNamespace ns = string.Empty;

    // Try accessing the root without using a namespace
    if (xml.Element(ns + "book") == null)
    {
        // Add a namespace to the query
        XAttribute nsAttrib = xml.Root.Attribute("xmlns");
        if (nsAttrib != null)
        {
            // Use the default namespace specified on the root element, if one exists
            ns = nsAttrib.Value;
        }
        else
        {
            // Try this namespace if no default namespace found on the root
            ns = "http://archive.org/scribe/xml";
        }
    }

    // Use a LINQ query to read the page elements from the document
    var pages = from page in xml.Element(ns + "book").Element(ns + "pageData").Descendants(ns + "page")
                where (string)page.Element(ns + "addToAccessFormats") == "true"
                select page;

    foreach (XElement page in pages)
    {
        // Reset to the defaults for every page, so that a missing value
        // does not inherit the previous page's dimensions
        string width = "1200";  // default width
        string height = "1600"; // default height

        // Make sure we have a cropBox element
        XElement cropBox = page.Element(ns + "cropBox");
        if (cropBox != null)
        {
            // Read the height and width
            XElement widthElement = cropBox.Element(ns + "w");
            XElement heightElement = cropBox.Element(ns + "h");
            width = (widthElement == null ? width : widthElement.Value);
            height = (heightElement == null ? height : heightElement.Value);

            // Now save the width and height for the current page (not shown)
        }
    }
}
The difference between this function and the first is the use of a namespace when accessing the elements of the XML document. The original ReadPageDimensionsBAD function queries the XML document without consideration for any namespaces that might be defined in it. ReadPageDimensionsGOOD, on the other hand, first tries to access the root of the document without using a namespace. If that works, great; it continues without defining a namespace. If it cannot access the root that way, it defines a namespace for use in querying the document: it first checks for a default namespace declared on the root element, and falls back to a hardcoded value of “http://archive.org/scribe/xml” if none is found.
Once a namespace is defined, the ReadPageDimensionsGOOD function can use that namespace to access the elements of the XML document. Notice how the name of each element in the LINQ to XML query is qualified with the namespace (ns + "page", for example).
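As an aside, LINQ to XML can also report the namespace of the root element directly through xml.Root.Name.Namespace, which is XNamespace.None when the root has no namespace. The sketch below uses that property in place of the attribute lookup and hardcoded fallback; it is an alternative technique, not the approach taken in ReadPageDimensionsGOOD, and the RootNamespaceSketch class and DisplayedPages method are hypothetical names. It assumes the child elements are in the same namespace as the root, which holds when the namespace is declared as a default namespace on the root.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class RootNamespaceSketch
{
    // Returns the <page> elements flagged for display, using whatever
    // namespace the root element is in (possibly none).
    public static List<XElement> DisplayedPages(XDocument xml)
    {
        // XNamespace.None for a namespace-free document; otherwise the
        // namespace the root element belongs to.
        XNamespace ns = xml.Root.Name.Namespace;

        var pages = from page in xml.Root.Descendants(ns + "page")
                    where (string)page.Element(ns + "addToAccessFormats") == "true"
                    select page;
        return pages.ToList();
    }
}
```

Because the namespace is taken from the document itself, the same query runs against files with and without the default namespace, at the cost of no longer checking that the root is actually named “book”.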
The ReadPageDimensionsGOOD function will work for either of the XML documents that were presented here, and so it should be able to read any of the page metadata XML documents at Internet Archive.
Using methods similar to the ones shown here, I was able to successfully implement the book viewer at BHL. An example of a book containing pages of various sizes can be found here.