Using LINQ To XML… and Handling Namespaces

OVERVIEW
While working on the Biodiversity Heritage Library (BHL), I have incorporated the open source book viewer software from OpenLibrary.  This book viewer offers an interface for viewing a set of images of a book’s pages.

The basic requirements for using the BookViewer are 1) embedding the JavaScript library that implements the book viewer in a web page and 2) writing JavaScript functions that supply the book viewer with page metadata.  That is perhaps a bit of an oversimplification, but it gives the general idea of the effort required.

The basic page metadata that is required by the book viewer is the path to the page image and the dimensions of that image.  The majority of the books available in BHL have been scanned by Internet Archive, and one of the outputs of their scanning process is an XML document that contains the dimensions of each page image.

In BHL, the web page that contains the book viewer is a plain HTML page.  JavaScript and jQuery are used to call a web service implemented in C# to get the page metadata required by the viewer.

So, we have an XML document that contains a bunch of information about the pages in a book, including the height and width of each page.  And, we have a process (the web service) implemented in C# that needs to return the height and width metadata for each page.

Enter LINQ to XML.

Microsoft says “LINQ to XML provides an in-memory XML programming interface that leverages the .NET Language-Integrated Query (LINQ) Framework. LINQ to XML uses the latest .NET Framework language capabilities and is comparable to an updated, redesigned Document Object Model (DOM) XML programming interface.”  In other words, it’s the perfect tool for parsing the information that is needed from the XML files provided by Internet Archive.

AN INCOMPLETE SOLUTION
Let’s take a look at a simplified view of one of the XML files.  The complete file can be found here.

<book>
<bookData>
   <bookId>bulletindelasoci14pari</bookId>
   <leafCount>482</leafCount>
</bookData>
<pageData>
   <page leafNum="0">
     <addToAccessFormats>false</addToAccessFormats>
     <cropBox>
       <w>1747</w>
       <h>2620</h>
     </cropBox>
   </page>
   <page leafNum="1">
     <addToAccessFormats>true</addToAccessFormats>
     <cropBox>
       <w>1669</w>
       <h>2776</h>
     </cropBox>
   </page>
</pageData>
</book>

You can see that within the <pageData> element are multiple <page> elements, one for each page.  And, within each <page> element are <w> and <h> elements that contain the width and height data for that page.  One more thing to note is the <addToAccessFormats> element that appears within each <page> element.  If the text value of <addToAccessFormats> is false, then that page is not displayed in the book viewer, and the dimensions of that page don’t need to be read.

Here’s a simple function that will parse the height and width metadata for each displayed page.

public string ReadPageDimensionsBAD(string filePath)
{
   string width = "1200";   // default width
   string height = "1600";   // default height

   // Load the XML document
   var xml = XDocument.Load(filePath);

   // Use a LINQ query to read the page elements from the document
   var pages = from page in xml.Element("book").Element("pageData").Descendants("page")
               where (string)page.Element("addToAccessFormats") == "true"
               select page;

   foreach (XElement page in pages)
   {
       // Make sure we have a cropBox element
       XElement cropBox = page.Element("cropBox");
       if (cropBox != null)
       {
           // Read the height and width
           XElement widthElement = cropBox.Element("w");
           XElement heightElement = cropBox.Element("h");
           width = (widthElement == null ? width : widthElement.Value);
           height = (heightElement == null ? height : heightElement.Value); 

           // Now save the width and height for the current page (not shown)
       }
   }

   // Placeholder so the example compiles; the real return value is not shown
   return string.Empty;
}

Pretty simple stuff.  Loading the XML document and navigating its elements work much like the familiar methods for parsing the XML DOM.  The one major difference is the LINQ query itself, which selects all of the <page> elements whose <addToAccessFormats> element has a value of “true”.
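For comparison, the same selection can be written with LINQ method syntax instead of query syntax.  This is only a minimal sketch, not the code used at BHL: the class name PageQueryDemo, the method DisplayedLeafNumbers, and the inline sample XML are made up for illustration, and the sketch assumes a document with no namespace, like the one above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper class, for illustration only.
public static class PageQueryDemo
{
    // Inline sample with the same shape as the simplified file shown earlier.
    public const string SampleXml = @"<book>
  <pageData>
    <page leafNum=""0"">
      <addToAccessFormats>false</addToAccessFormats>
      <cropBox><w>1747</w><h>2620</h></cropBox>
    </page>
    <page leafNum=""1"">
      <addToAccessFormats>true</addToAccessFormats>
      <cropBox><w>1669</w><h>2776</h></cropBox>
    </page>
  </pageData>
</book>";

    // Returns the leafNum attribute of every page flagged for display.
    public static List<string> DisplayedLeafNumbers(string xmlText)
    {
        XDocument xml = XDocument.Parse(xmlText);

        // Where(...) plays the role of the "where" clause and
        // Select(...) the role of the "select" clause in the query above.
        return xml.Element("book")
                  .Element("pageData")
                  .Descendants("page")
                  .Where(page => (string)page.Element("addToAccessFormats") == "true")
                  .Select(page => (string)page.Attribute("leafNum"))
                  .ToList();
    }
}
```

Casting an XElement or XAttribute to string returns null when the element or attribute is missing, so a page without an <addToAccessFormats> element is simply filtered out rather than causing an error.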

THE BETTER SOLUTION (OR, DON’T FORGET THE NAMESPACE)
Unfortunately, there’s a problem.  The example just given will parse the example XML document just fine.  But some of the XML documents at Internet Archive are just slightly different.  Consider the following example (full file found here):

<book xmlns="http://archive.org/scribe/xml">
<bookData>
   <bookId>systematikundfau19textgies</bookId>
   <leafCount>858</leafCount>
</bookData>
<pageData>
   <page leafNum="1">
     <addToAccessFormats>true</addToAccessFormats>
     <cropBox>
       <w>2724.0</w>
       <h>3768.0</h>
     </cropBox>
   </page>
   <page leafNum="2">
     <addToAccessFormats>true</addToAccessFormats>
     <cropBox>
       <w>2608.0</w>
       <h>3494.0</h>
     </cropBox>
   </page>
</pageData>
</book>

Looks pretty much the same, right?  Not quite.  Notice the very first line of the file.  A default namespace (http://archive.org/scribe/xml) has been declared on the root element, so every element in the document belongs to that namespace.  If you attempt to parse this XML document with the function above, the LINQ query will throw a NullReferenceException.  Because no namespace has been used in the query, the elements “book”, “pageData”, and “page” will not be found: xml.Element("book") returns null, and the chained call to Element("pageData") then fails.
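The lookup failure is easy to reproduce in isolation.  The following is a minimal sketch under stated assumptions: the class NamespaceLookupDemo, its two methods, and the one-line sample document are hypothetical and exist only to show how an unqualified name and a namespace-qualified name behave against the same document.

```csharp
using System;
using System.Xml.Linq;

// Hypothetical helper class, for illustration only.
public static class NamespaceLookupDemo
{
    // Tiny document that declares the same default namespace as the file above.
    public const string NamespacedXml =
        @"<book xmlns=""http://archive.org/scribe/xml""><pageData /></book>";

    // True when the root can be found using an unqualified name.
    public static bool FoundWithoutNamespace(string xmlText)
    {
        XDocument xml = XDocument.Parse(xmlText);
        return xml.Element("book") != null;
    }

    // True when the root can be found using a namespace-qualified name.
    public static bool FoundWithNamespace(string xmlText, string namespaceUri)
    {
        XDocument xml = XDocument.Parse(xmlText);
        XNamespace ns = namespaceUri;          // implicit string-to-XNamespace conversion
        return xml.Element(ns + "book") != null;
    }
}
```

For NamespacedXml, the first method returns false and the second returns true when given "http://archive.org/scribe/xml".  The null returned by the unqualified lookup is what makes the chained call in ReadPageDimensionsBAD throw.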

Here is a corrected version of the original function:

public string ReadPageDimensionsGOOD(string filePath)
{
   string width = "1200";
   string height = "1600"; 

   // Load the XML document 
   var xml = XDocument.Load(filePath);

   XNamespace ns = string.Empty;

   // Try accessing the root without using a namespace 
   if (xml.Element(ns + "book") == null)
   {
       // Add a namespace to the query
       XAttribute nsAttrib = xml.Root.Attribute("xmlns");
       if (nsAttrib != null)
       {
           // Use the default namespace specified on the root element, if one exists
           ns = nsAttrib.Value;
       }
       else
       {
           // Try this namespace if no default namespace found on the root
            ns = "http://archive.org/scribe/xml";
       }
   }

   // Use a LINQ query to read the page elements from the document
   var pages = from page in xml.Element(ns + "book").Element(ns + "pageData").Descendants(ns + "page")
               where (string)page.Element(ns + "addToAccessFormats") == "true"
               select page;

   foreach (XElement page in pages)
   { 
       // Make sure we have a cropBox element 
       XElement cropBox = page.Element(ns + "cropBox");
       if (cropBox != null)
       { 
           // Read the height and width 
           XElement widthElement = cropBox.Element(ns + "w");
           XElement heightElement = cropBox.Element(ns + "h");
           width = (widthElement == null ? width : widthElement.Value);
           height = (heightElement == null ? height : heightElement.Value);

           // Now save the width and height for the current page  (not shown)
       }
   }

   // Placeholder so the example compiles; the real return value is not shown
   return string.Empty;
}

The difference between this function and the first is the use of a namespace when accessing the elements of the XML document.  The original ReadPageDimensionsBAD function queries the XML document without consideration for any namespaces that might be defined in the document.  ReadPageDimensionsGOOD, on the other hand, first tries to access only the root of the document without using a namespace.  If that works, great; it continues without defining any namespaces.  However, if it cannot access the root without a namespace, it defines a namespace for use in querying the document.  It does this by first checking for a default namespace defined on the root element, and falling back to a hardcoded value of “http://archive.org/scribe/xml”.

Once a namespace is defined, the ReadPageDimensionsGOOD function can use that namespace to access the elements of the XML document.  Notice how the name of each element is prefixed with the namespace definition in the LINQ to XML query.

The ReadPageDimensionsGOOD function will work for either of the XML documents that were presented here, and so it should be able to read any of the page metadata XML documents at Internet Archive.
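As an aside, LINQ to XML also exposes the namespace of an element directly through XElement.Name.Namespace, which gives a shorter way to handle both kinds of document.  The sketch below is an alternative, not the function used at BHL: the class PageDimensionReader and its Read method are hypothetical, it parses XML text rather than loading a file, it has no hardcoded fallback namespace, and it assumes that a <pageData> element is always present and that every element shares the root element's namespace.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper class, for illustration only.
public static class PageDimensionReader
{
    // Returns (leafNum, width, height) as strings for each displayed page
    // that has a cropBox element.
    public static List<Tuple<string, string, string>> Read(string xmlText)
    {
        XDocument xml = XDocument.Parse(xmlText);

        // Namespace of the root element: XNamespace.None for a document
        // without a namespace, or the declared default namespace otherwise.
        XNamespace ns = xml.Root.Name.Namespace;

        var pages = from page in xml.Root.Element(ns + "pageData").Descendants(ns + "page")
                    where (string)page.Element(ns + "addToAccessFormats") == "true"
                    select page;

        var result = new List<Tuple<string, string, string>>();
        foreach (XElement page in pages)
        {
            XElement cropBox = page.Element(ns + "cropBox");
            if (cropBox == null) continue;

            // Unprefixed attributes are never in a namespace, so no ns here.
            // The string cast yields null for a missing element, which
            // triggers the same default values used in the functions above.
            result.Add(Tuple.Create(
                (string)page.Attribute("leafNum"),
                (string)cropBox.Element(ns + "w") ?? "1200",
                (string)cropBox.Element(ns + "h") ?? "1600"));
        }
        return result;
    }
}
```

Because the namespace is taken from whatever the document itself declares, the same code path handles the file with no namespace and the file with the archive.org namespace.  Note that the values are still returned as text; the second sample file stores them as "2724.0" rather than "2724", so any numeric conversion would need to allow for a decimal point.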

Using methods similar to the ones shown here, I was able to successfully implement the book viewer at BHL.  An example of a book containing pages of various sizes can be found here.
