Using Internet Archive’s S3(ish) Interface

In my work for the Biodiversity Heritage Library, I transfer large amounts of data to and from the Internet Archive.  Last week (September 23, 2010) I gave a presentation to the BHL Global Tech Meeting in Woods Hole, Massachusetts about some of the methods I use to do these data transfers.  The meeting included BHL representatives from the United States, England, Germany, Egypt, Brazil, Costa Rica, and Australia. 

The slide deck is available for viewing and download at http://www.slideshare.net/mlichtenberg1/bhl-global-tech-meeting-internet-archive-data-transfer.

One topic that did not get much coverage in my presentation was programmatically uploading files to Internet Archive via their S3-like storage API.  (I say “S3-like” because it is meant to mirror the interface to Amazon.com’s S3 service.  It is similar, but not identical.  For more information, see http://www.archive.org/help/abouts3.txt.)  The topic is mentioned on the second-to-last slide of the presentation, but I did not discuss it in depth or include any source code to illustrate how it works.

To remedy that oversight, I present here a class that is used in a production application to upload files to Internet Archive, using their “S3-like” storage API.  Written in C#, it uses the WebClient class from the System.Net namespace in the .NET Framework to handle the data transfers.

Here is the full source of the class, followed by a discussion of its key elements.

using System;
using System.Collections.Generic;
using System.Text;
using System.Net;

namespace InternetArchive.Utilities
{
   public class S3
   {
       // PROPERTIES

       private string _accessKey = "YOUR_ACCESS_KEY";
       private string _secretKey = "YOUR_SECRET_KEY";
       private string _s3BaseDomain = "http://s3.us.archive.org";
       private string _bucketAddressFormat = "{0}/{1}";
       private string _objectAddressFormat = "{0}/{1}/{2}";

       // The WebClient class from the System.Net namespace is used for several Get and Put operations
       private WebClient _webClient = null;

       public WebClient WebClient
       {
           get
           {
               if (_webClient == null)
               {
                   // Set the Internet Archive authorization headers when the WebClient is instantiated
                   _webClient = new WebClient();
                   _webClient.Headers.Add("authorization", this.GetAuthHeaderValue());
               }
               return _webClient;
           }
       }

       // CONSTRUCTORS

       public S3()
       {
       }

       public S3(string accessKey, string secretKey)
       {
           _accessKey = accessKey;
           _secretKey = secretKey;
       }

       ~S3()
       {
           if (_webClient != null)
           {
               _webClient.Dispose();
               _webClient = null;
           }
       }

       // OBJECT OPERATIONS

       // Objects are files that are placed into buckets (folders).

       /// <summary>
       /// Upload a file into the specified bucket.
       /// </summary>
       /// <param name="fileName">The name of the file to be uploaded</param>
       /// <param name="bucketName">The Internet Archive identifier of the destination bucket </param>
       /// <param name="objectName">The name to give the file at Internet Archive</param>
       /// <param name="contentType">A valid MIME type for the file being uploaded</param>
       /// <param name="headers">A list of key-value pairs to be added as HTTP headers</param>
       /// <param name="preventDerive">True if Internet Archive should NOT initiate its derivation process</param>
       /// <param name="makeBucket">True if Internet Archive should create a new bucket</param>
       /// <returns>"Success" if the upload was successful, otherwise an error message.</returns>
       public string PutObject(string fileName, string bucketName, string objectName,
           string contentType, List<KeyValuePair<string, string>> headers,
           bool preventDerive, bool makeBucket)
       {
           string result = string.Empty;
           try
           {
               if (preventDerive)
               {
                   // Set a header to prevent IA from initiating a derive process on this item
                   if (headers == null) headers = new List<KeyValuePair<string, string>>();
                   headers.Add(new KeyValuePair<string, string>("x-archive-queue-derive", "0"));
               }
               if (makeBucket)
               {
                   // Set a header to allow IA to create a "bucket" in which to place this item
                   if (headers == null) headers = new List<KeyValuePair<string, string>>();
                   headers.Add(new KeyValuePair<string, string>("x-archive-auto-make-bucket", "1"));
               }

               string destination = String.Format(_objectAddressFormat,
                           _s3BaseDomain, bucketName, objectName);
               this.HttpRequest(destination, fileName, "PUT", contentType, headers);
               result = "Success";
           }
           catch (Exception ex)
           {
               result = "Error: " + ex.Message;
           }

           return result;
       }

       /// <summary>
       /// Download a file to the specified location.
       /// </summary>
       /// <param name="bucketName">The Internet Archive identifier of the bucket holding the file</param>
       /// <param name="objectName">The name of the file to be downloaded</param>
       /// <param name="fileName">The name of a local file to which to download the object</param>
       /// <returns>True if the download was successful, otherwise false</returns>
       public bool GetObject(string bucketName, string objectName, string fileName)
       {
           bool result = true;
           try
           {
               this.WebClient.DownloadFile(
                     String.Format(_objectAddressFormat, _s3BaseDomain, bucketName, objectName), fileName);
           }
           catch
           {
               result = false;
           }

           return result;
       }

       // BUCKET OPERATIONS

       // Buckets are folders.  They are named with a unique identifier.

       /// <summary>
       /// List all of the buckets owned by the authorized user.
       /// </summary>
       /// <returns>XML listing of buckets</returns>
       public string ListBuckets()
       {
           return this.WebClient.DownloadString(_s3BaseDomain);
       }

       /// <summary>
       /// List the contents of the specified bucket.
       /// </summary>
       /// <param name="bucketName"></param>
       /// <returns>XML listing of the files in the bucket</returns>
       public string GetBucket(string bucketName)
       {
           return this.WebClient.DownloadString(
                 String.Format(_bucketAddressFormat, _s3BaseDomain, bucketName));
       }

       // HELPER METHODS

       /// <summary>
       /// Get the Internet Archive authorization string to be passed in an HTTP header
       /// </summary>
       /// <returns></returns>
       private string GetAuthHeaderValue()
       {
           return String.Format("LOW {0}:{1}", _accessKey, _secretKey);
       }

       /// <summary>
       /// Submit an HTTP request to upload a file
       /// </summary>
       /// <param name="url">The Url to which to submit the file</param>
       /// <param name="fileName">A file to be uploaded</param>
       /// <param name="method">"PUT"</param>
       /// <param name="contentType">A valid MIME type for the file being uploaded</param>
       /// <param name="headers">A list of key-value pairs to be added as HTTP headers</param>
       private void HttpRequest(string url, string fileName, string method,
           string contentType, List<KeyValuePair<string, string>> headers)
       {
           System.IO.Stream stream = null;

           try
           {
               // Read file to be uploaded
               byte[] fileContents = System.IO.File.ReadAllBytes(fileName);

               // Prepare the web request
               HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
               req.Method = method;
               req.Timeout = 600000;    // 10 minutes
               req.ContentType = contentType;
               req.ContentLength = fileContents.Length;
               req.Headers.Add("authorization", this.GetAuthHeaderValue());

               // If additional header values have been specified, add them now
               if (headers != null)
               {
                   foreach (KeyValuePair<string, string> header in headers)
                   {
                       req.Headers.Add(header.Key, header.Value);
                   }
               }

               // Send the data
               stream = req.GetRequestStream();
               stream.Write(fileContents, 0, fileContents.Length);
               stream.Close();

               // Make sure we were successful
               HttpWebResponse response = (HttpWebResponse)req.GetResponse();
               if (response.StatusCode != HttpStatusCode.Created)
               {
                   throw new UnauthorizedAccessException("File not written to " + url + ".  HTTP status: " +
                         response.StatusCode.ToString());
               }
           }
           catch (WebException)
           {
               // Rethrow without resetting the original stack trace
               throw;
           }
           finally
           {
               if (stream != null)
               {
                   stream.Close();
                   stream.Dispose();
                   stream = null;
               }
           }
       }
   }
}

 
The first thing to note is in the section of the code labeled PROPERTIES.  If you use this code yourself, you’ll need to set the values of the _accessKey and _secretKey fields to your own Internet Archive API keys.  (Alternatively, you can pass the keys to the two-argument constructor.)

The CONSTRUCTORS section of the code is straightforward, and needs no further explanation.

The section labeled OBJECT OPERATIONS includes methods used to upload and download individual files.  The PutObject method is used to upload a file to Internet Archive.  It allows you to set extra HTTP headers (for passing metadata about the uploaded file to Internet Archive), toggle Internet Archive’s derivation process, and toggle creation of a new bucket in which to store the file.  The actual upload is handled by the HttpRequest method, found in the HELPER METHODS section of the code.  The GetObject method is used for downloading a file from Internet Archive.
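
As an illustration of the extra headers mentioned above, here is a minimal sketch of building the list that PutObject accepts in its headers argument.  The ArchiveHeaders class and its FromMetadata method are hypothetical names, not part of the S3 class.  The “x-archive-meta-” prefix follows the convention described in the abouts3.txt document linked earlier; check that document for how and when Internet Archive applies these values.

```csharp
using System;
using System.Collections.Generic;

public static class ArchiveHeaders
{
    // Hypothetical helper (not part of the S3 class above): turns metadata
    // name/value pairs into "x-archive-meta-<name>" header pairs that can be
    // passed as the "headers" argument of PutObject.
    public static List<KeyValuePair<string, string>> FromMetadata(
        IDictionary<string, string> metadata)
    {
        var headers = new List<KeyValuePair<string, string>>();
        foreach (KeyValuePair<string, string> item in metadata)
        {
            // Assumption: lower-case header names are used for consistency
            headers.Add(new KeyValuePair<string, string>(
                "x-archive-meta-" + item.Key.ToLowerInvariant(), item.Value));
        }
        return headers;
    }
}
```

The resulting list could then be passed to PutObject in place of null, typically together with makeBucket set to true when a new item is being created.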

BUCKET OPERATIONS are methods for sending simple requests to Internet Archive to return the list of buckets associated with the specified API keys (the ListBuckets method), as well as the list of files contained in a particular bucket (the GetBucket method).
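
Both methods return raw XML, so the caller has to parse it.  The following is a rough sketch of pulling the file names out of a GetBucket response.  It assumes the listing follows the Amazon S3 layout, in which each file appears as a Key element; the BucketListing class and GetObjectNames method are hypothetical names, and the actual response from Internet Archive should be inspected before relying on this.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class BucketListing
{
    // Hypothetical helper (not part of the S3 class above): returns the text
    // of every <Key> element in an S3-style listing.  Elements are matched by
    // local name so that any XML namespace on the document is ignored.
    public static List<string> GetObjectNames(string listingXml)
    {
        XDocument doc = XDocument.Parse(listingXml);
        return doc.Descendants()
                  .Where(e => e.Name.LocalName == "Key")
                  .Select(e => e.Value)
                  .ToList();
    }
}
```

A call such as BucketListing.GetObjectNames(s3.GetBucket(bucketName)) would then yield the names that can be passed to GetObject.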

Finally, the HELPER METHODS section of the code includes private methods that support the class’s functionality.  Pay close attention to the HttpRequest method, which handles the uploading of files to Internet Archive.  It uses the System.Net.HttpWebRequest class to perform the uploads; this is the only place where the WebClient instance is NOT used to perform an HTTP operation.  System.Net.HttpWebRequest provides more fine-grained control over the upload process, which is needed here, particularly for setting the HTTP headers.
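
One thing to be aware of is that HttpRequest reads the entire file into memory with File.ReadAllBytes before sending it.  For very large files, a possible alternative is to copy the file to the request stream in fixed-size chunks.  The sketch below shows only the copy loop; StreamCopy and CopyInChunks are hypothetical names, and this is a variation on the class above rather than the code used in production.

```csharp
using System;
using System.IO;

public static class StreamCopy
{
    // Hypothetical helper (not part of the S3 class above): copies "source"
    // to "destination" using a buffer of "bufferSize" bytes and returns the
    // total number of bytes copied.
    public static long CopyInChunks(Stream source, Stream destination, int bufferSize)
    {
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        int read;

        // Stream.Read returns 0 once the end of the source has been reached
        while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
        {
            destination.Write(buffer, 0, read);
            total += read;
        }
        return total;
    }
}
```

Inside HttpRequest, this would be called with a FileStream opened on fileName and the stream returned by req.GetRequestStream(), with req.ContentLength taken from new System.IO.FileInfo(fileName).Length.  Setting req.AllowWriteStreamBuffering to false may also be needed to keep the request body from being buffered in memory; treat that as something to verify in your own environment.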

Here is a short example of how the preceding class might be used.  This example uploads a file to an existing item at Internet Archive, without setting any additional metadata values.

/// <summary>
/// A simple function that uses the S3 class to upload an XML file to an
/// existing bucket at Internet Archive
/// </summary>
private void UploadXmlFile(string localFileName, string remoteFileName, string bucketName)
{
   S3 s3 = new S3();

   try
   {
       // Upload the file
       string putResult = s3.PutObject(localFileName, bucketName,
           remoteFileName, "application/xml", null, true, false);

       // Evaluate results
       if (putResult == "Success")
       {
           // File uploaded
       }
       else if (putResult.ToLower().Contains("403"))
       {
           // File skipped (forbidden): no permission to write to bucket
       }
       else
       {
           // Error uploading file
       }
   }
   catch (Exception ex)
   {
       //  Error uploading file
   }
   finally
   {
       if (s3 != null) s3 = null;
   }
}
