Data Access Framework Comparison

Introduction

For some time now I have been working on a project that utilizes a custom-built data access framework, rather than popular ORM frameworks such as Entity Framework or NHibernate.

While the custom framework has worked well for the project, I had questions about it.  For example, it uses stored procedures to implement basic CRUD operations, and I wondered if inline parameterized SQL statements might perform better.  Also, I wondered about the performance of the custom framework compared to the leading ORMs.

Besides my questions about the custom framework, I recognized the importance of having at least a basic understanding of how to use the other ORM frameworks.

In order to answer my questions about the custom framework and to gain some practical experience with the other ORMs, I created a simple web application that uses each of those frameworks to perform basic CRUD operations.  While executing the CRUD operations, the application times them and produces a summary report of the results.

The code for the test application can be found at https://github.com/mlichtenberg/ORMComparison.

NOTE: I assume that most readers are familiar with the basics of Entity Framework and NHibernate, so I will not provide an overview of them here.

Using the custom framework is similar to Entity Framework and NHibernate’s “database-first” approach.  Any project that uses the library references a single assembly containing the base functionality of the library.  A T4 template is used to generate additional classes based on tables in a SQL Server database.  Some of the classes are similar to EF’s Model classes and NHibernate’s Domain classes.  The others provide the basic CRUD functionality for the domain/model classes. 

For these tests I made a second copy of the custom framework classes that provide the basic CRUD functionality, and edited them to replace the CRUD stored procedures with parameterized SQL statements.

The custom framework includes much less overhead on top of ADO.NET than the popular ORMs, so I expected the tests to show that it was the best-performing framework.  The question was, how much better?

In the rest of this post, I will describe the results of my experiment, as well as some of the optimization tips I learned along the way.  Use the following links to jump directly to a topic.

Test Application Overview
“Out-of-the-Box” Performance
Entity Framework Performance After Code Optimization
     AutoDetectChangesEnabled and DetectChanges()
     Recycling the DbContext
NHibernate Performance After Configuration Optimization
     What’s Up with Update Performance in NHibernate?
Results Summary

Test Application Overview

    A SQL Express database was used for the tests.  The data model is borrowed from Microsoft’s Contoso University sample application.  Here is the ER diagram for the database:

[Image: ER diagram of the Contoso University database]

The database was pre-populated with sample data.  The number of rows added to each table was:

Department: 20
Course: 200
Person: 100000
Enrollment: 200000

This was done because SQL Server’s optimizer will behave differently with an empty database than it will with a database containing data, and I wanted the database to respond as it would in a “real-world” situation.  For the tests, all CRUD operations were performed against the Enrollment table.

Five different data access frameworks were tested:

  1. Custom framework with stored procedures
  2. Custom framework with parameterized SQL statements
  3. Entity Framework
  4. NHibernate
  5. Fluent NHibernate

The testing algorithm follows the same pattern for each of the frameworks:

01) Start timer
02) For a user-specified number of iterations 
03)      Submit an INSERT statement to the database
04)      Save the identifier of the new database record
05) End timer
06) Start timer
07) For each new database record identifier
08)      Submit a SELECT statement to the database
09) End timer
10) Start timer
11) For each new database record identifier
12)      Submit an UPDATE statement to the database
13) End timer
14) Start timer
15) For each new database record identifier
16)      Submit a DELETE statement to the database
17) End timer

Note that after the test algorithm completes, the database is in the same state as when the tests began.

To see the actual code, visit https://github.com/mlichtenberg/ORMComparison/blob/master/MVCTestHarness/Controllers/TestController.cs.

"Out-of-the-Box" Performance

I first created very basic tests for each framework. Essentially, these were the “Hello World” versions of the CRUD code for each framework.  No optimization was attempted.

Here is an example of the code that performs the INSERTs for the custom framework.  There is no difference between the version with stored procedures and the version without, other than the namespace from which EnrollmentDAL is instantiated.

    DA.EnrollmentDAL enrollmentDAL = new DA.EnrollmentDAL();

    for (int x = 0; x < Convert.ToInt32(iterations); x++)
    {
        DataObjects.Enrollment enrollment = enrollmentDAL.EnrollmentInsertAuto
            (null, null, 101, 1, null);
        ids.Add(enrollment.EnrollmentID);
    }

      And here is the equivalent code for Entity Framework:

    using (SchoolContext db = new SchoolContext())
    {
       for (int x = 0; x < Convert.ToInt32(iterations); x++)
        {
            Models.Enrollment enrollment = new Models.Enrollment {
                CourseID = 101, StudentID = 1, Grade = null };
            db.Enrollments.Add(enrollment);
            db.SaveChanges();
            ids.Add(enrollment.EnrollmentID);
        }

    }

    The code for NHibernate and Fluent NHibernate is almost identical.  Here is the NHibernate version:

using (var session = NH.NhibernateSession.OpenSession("SchoolContext"))
{
    var course = session.Get<NHDomain.Course>(101);
    var student = session.Get<NHDomain.Person>(1);

    for (int x = 0; x < Convert.ToInt32(iterations); x++)
    {
        var enrollment = new NHDomain.Enrollment { 
            Course = course, Person = student, Grade = null };
        session.SaveOrUpdate(enrollment);

        ids.Add(enrollment.Enrollmentid);
    }

}

The SELECT, UPDATE, and DELETE code for each framework followed similar patterns. 

    NOTE: A SQL Server Profiler trace proved that the actual interactions with the database were the same for each framework.  The same database connections were established, and equivalent CRUD statements were submitted by each framework.  Therefore, any measured differences in performance are due to the overhead of the frameworks themselves.

        Here are the results of the tests of the “out-of-the-box” code:

      Framework              Operation     Elapsed Time (seconds)
      Custom                 Insert        5.9526039
      Custom                 Select        1.9980745
      Custom                 Update        5.0850357
      Custom                 Delete        3.7785886

      Custom (no SPs)        Insert        5.2251725
      Custom (no SPs)        Select        2.0028176
      Custom (no SPs)        Update        4.5381994
      Custom (no SPs)        Delete        3.7064278

      Entity Framework       Insert        1029.5544975
      Entity Framework       Select        8.6153572
      Entity Framework       Update        2362.7183765
      Entity Framework       Delete        25.6118191

      NHibernate             Insert        9.9498188
      NHibernate             Select        7.3306331
      NHibernate             Update        274.7429862
      NHibernate             Delete        12.4241886

      Fluent NHibernate      Insert        11.796126
      Fluent NHibernate      Select        7.3961941
      Fluent NHibernate      Update        283.1575124
      Fluent NHibernate      Delete        10.791648

      NOTE: For all tests, each combination of Framework and Operation was executed 10000 times.  Looking at the first line of the preceding results, this means that the custom framework took roughly 6 seconds to perform 10000 INSERTs.

      As you can see, both instances of the custom framework outperformed Entity Framework and NHibernate.  In addition, the version of the custom framework that used parameterized SQL was very slightly faster than the version that used stored procedures.  Most interesting, however, was the performance of the INSERT and UPDATE operations.  Entity Framework and both versions of NHibernate were not just worse than the two custom framework versions, they were much, MUCH worse.  Clearly, some optimization and/or configuration changes were needed.

      Entity Framework Performance After Code Optimization

      AutoDetectChangesEnabled and DetectChanges()  

      Much of Entity Framework’s poor performance appears to have been due to the nature of the tests themselves.  Documentation on Microsoft’s MSDN website notes that if you are tracking many objects in your DbContext and call methods like Add() and SaveChanges() repeatedly in a loop, performance may suffer.  That scenario describes these tests almost perfectly.

      The solution is to turn off Entity Framework’s automatic detection of changes by setting AutoDetectChangesEnabled to false and explicitly calling DetectChanges().  This instructs Entity Framework to detect changes to entities only when told to do so.  Here is what the updated code for performing INSERTs with Entity Framework looks like (the changes are the AutoDetectChangesEnabled assignment and the explicit DetectChanges() call):

      using (SchoolContext db = new SchoolContext())
      {
          db.Configuration.AutoDetectChangesEnabled = false;

          for (int x = 0; x < Convert.ToInt32(iterations); x++)
          {
              Models.Enrollment enrollment = new Models.Enrollment {
                  CourseID = 101, StudentID = 1, Grade = null };
              db.Enrollments.Add(enrollment);
              db.ChangeTracker.DetectChanges();
              db.SaveChanges();
              ids.Add(enrollment.EnrollmentID);
          }
      }

      Here are the results of tests with AutoDetectChangesEnabled set to false:

      Framework           Operation    Elapsed Time (seconds)
      Entity Framework    Insert       606.5569332
      Entity Framework    Select       6.4425741
      Entity Framework    Update       605.6206616
      Entity Framework    Delete       21.0813293

      As you can see, INSERT and UPDATE performance improved significantly, and SELECT and DELETE performance also improved slightly.

      Note that turning off AutoDetectChangesEnabled and explicitly calling DetectChanges() will slightly improve the performance of Entity Framework in all cases.  However, a missed DetectChanges() call can also cause subtle bugs.  Therefore, it is best to use this optimization technique only in very specific scenarios, and to allow the default behavior otherwise.

      Recycling the DbContext

      While Entity Framework performance certainly improved by changing the AutoDetectChangesEnabled value, it was still relatively poor. 

      Another problem with the tests is that the same DbContext was used for every iteration of an operation (i.e. one DbContext object was used for all 10000 INSERT operations).  This is a problem because the context maintains a record of all entities added to it during its lifetime.  The effect of this was a gradual slowdown of the INSERT (and UPDATE) operations as more and more entities were added to the context.

      Here is what the Entity Framework INSERT code looks like after modifying it to periodically create a new DbContext (the change is the new using block, which recycles the context after every 100 operations):

      for (int x = 0; x < Convert.ToInt32(iterations); x++)
      {
          // Use a new context after every 100 Insert operations
          using (SchoolContext db = new SchoolContext())
          {
              db.Configuration.AutoDetectChangesEnabled = false;

              int count = 0;
              for (int y = x; y < Convert.ToInt32(iterations); y++)
              {
                  Models.Enrollment enrollment = new Models.Enrollment {
                      CourseID = 101, StudentID = 1, Grade = null };
                  db.Enrollments.Add(enrollment);
                  db.ChangeTracker.DetectChanges();
                  db.SaveChanges();
                  ids.Add(enrollment.EnrollmentID);

                  count++;
                  if (count >= 100) break;
                  x++;
              }
          }
      }

      And here are the results of the Entity Framework tests with the additional optimization added:

      Framework            Operation     Elapsed Time (seconds)
      Entity Framework     Insert        14.7847024
      Entity Framework     Select        5.5516514
      Entity Framework     Update        13.823694
      Entity Framework     Delete        10.0770142

      Much better!  The time to perform the SELECT operations was little changed, but the DELETE time was reduced by half, and the INSERT and UPDATE times decreased from a little more than 10 minutes to about 14 seconds.

      NHibernate Performance After Configuration Optimization

      For the NHibernate frameworks, the tests themselves were not the problem; NHibernate itself needed some tuning.

      An optimized solution was achieved by changing the configuration settings of the NHibernate Session object.  Here is the definition of the SessionFactory for NHibernate (the additions are the block of SetProperty calls):

      private static ISessionFactory SessionFactory
      {
          get
          {
              if (_sessionFactory == null)
              {
                  string connectionString = ConfigurationManager.ConnectionStrings
                      [_connectionKeyName].ToString();

                  var configuration = new NHConfig.Configuration();
                  configuration.Configure();

                  configuration.SetProperty(NHConfig.Environment.ConnectionString,
                      connectionString);

                  configuration.SetProperty(NHibernate.Cfg.Environment.FormatSql,
                      Boolean.FalseString);
                  configuration.SetProperty
                     (NHibernate.Cfg.Environment.GenerateStatistics,
                          Boolean.FalseString);
                  configuration.SetProperty
                     (NHibernate.Cfg.Environment.Hbm2ddlKeyWords,
                          NHConfig.Hbm2DDLKeyWords.None.ToString());
                  configuration.SetProperty(NHibernate.Cfg.Environment.PrepareSql,
                          Boolean.TrueString);
                  configuration.SetProperty
                      (NHibernate.Cfg.Environment.PropertyBytecodeProvider,
                          "lcg");
                  configuration.SetProperty
                      (NHibernate.Cfg.Environment.PropertyUseReflectionOptimizer,
                          Boolean.TrueString);
                  configuration.SetProperty
                      (NHibernate.Cfg.Environment.QueryStartupChecking,
                          Boolean.FalseString);
                  configuration.SetProperty(NHibernate.Cfg.Environment.ShowSql, 
                      Boolean.FalseString);
                  configuration.SetProperty
                      (NHibernate.Cfg.Environment.UseProxyValidator, 
                          Boolean.FalseString);
                  configuration.SetProperty
                      (NHibernate.Cfg.Environment.UseSecondLevelCache,
                          Boolean.FalseString);

                  configuration.AddAssembly(typeof(Enrollment).Assembly);
                  _sessionFactory = configuration.BuildSessionFactory();
              }
              return _sessionFactory;
          }
      }

      And here is the InitializeSessionFactory method for Fluent NHibernate, with the equivalent changes included:

      private static void InitializeSessionFactory()
      {
          string connectionString = ConfigurationManager.ConnectionStrings[_connectionKeyName]
              .ToString();

          _sessionFactory = Fluently.Configure()
              .Database(MsSqlConfiguration.MsSql2012.ConnectionString(connectionString).ShowSql())
              .Mappings(m => m.FluentMappings.AddFromAssemblyOf<Enrollment>())
              .BuildConfiguration().SetProperty
                  (NHibernate.Cfg.Environment.FormatSql, Boolean.FalseString)
              .SetProperty(NHibernate.Cfg.Environment.GenerateStatistics,
                  Boolean.FalseString)
              .SetProperty(NHibernate.Cfg.Environment.Hbm2ddlKeyWords,
                  NHibernate.Cfg.Hbm2DDLKeyWords.None.ToString())
              .SetProperty(NHibernate.Cfg.Environment.PrepareSql,
                  Boolean.TrueString)
              .SetProperty(NHibernate.Cfg.Environment.PropertyBytecodeProvider,
                  "lcg")
              .SetProperty
                  (NHibernate.Cfg.Environment.PropertyUseReflectionOptimizer,
                      Boolean.TrueString)
              .SetProperty(NHibernate.Cfg.Environment.QueryStartupChecking,
                  Boolean.FalseString)
              .SetProperty(NHibernate.Cfg.Environment.ShowSql, Boolean.FalseString)
              .SetProperty(NHibernate.Cfg.Environment.UseProxyValidator,
                  Boolean.FalseString)
              .SetProperty(NHibernate.Cfg.Environment.UseSecondLevelCache,
                  Boolean.FalseString)
              .BuildSessionFactory();
      }

      The following table gives a brief description of the purpose of these settings:

      Setting                   Purpose
      FormatSql                 Format the SQL before sending it to the database
      GenerateStatistics        Produce statistics on the operations performed
      Hbm2ddlKeyWords           Automatically quote all database object names
      PrepareSql                Compiles the SQL before executing it
      PropertyBytecodeProvider  What bytecode provider to use for the generation of code
      QueryStartupChecking      Check all named queries present in the startup configuration
      ShowSql                   Show the produced SQL
      UseProxyValidator         Validate that mapped entities can be used as proxies
      UseSecondLevelCache       Enable the second level cache

      Notice that several of these (FormatSql, GenerateStatistics, ShowSql) are most useful for debugging.  It is not clear why they are enabled by default in NHibernate; it seems to me that these should be opt-in settings, rather than opt-out.

      Here are the results of tests of the NHibernate frameworks with these changes in place:

      Framework                        Operation     Elapsed Time (seconds)
      NHibernate (Optimized)           Insert        5.0894047
      NHibernate (Optimized)           Select        5.2877312
      NHibernate (Optimized)           Update        133.9417387
      NHibernate (Optimized)           Delete        5.6669841

      Fluent NHibernate (Optimized)    Insert        5.0175024
      Fluent NHibernate (Optimized)    Select        5.2698945
      Fluent NHibernate (Optimized)    Update        128.3563561
      Fluent NHibernate (Optimized)    Delete        5.5299521

      These results are much improved, with the INSERT, SELECT, and DELETE operations nearly matching the results achieved by the custom framework.   The UPDATE performance, while improved, is still relatively poor.

      What’s Up with Update Performance in NHibernate?

      The poor update performance is a mystery to me.  I have researched NHibernate optimization techniques and configuration settings, and have searched for other people reporting problems with UPDATE operations.  Unfortunately, I have not been able to find a solution.

      This is disappointing, as I personally found NHibernate more comfortable to work with than Entity Framework, and because it beats or matches the performance of Entity Framework for SELECT, INSERT, and DELETE operations.

      If anyone out there knows of a solution, please leave a comment!

      Results Summary

      The following table summarizes the results of the tests using the optimal configuration for each framework.  These are the same results shown earlier in this post, combined here in a single table.

      Framework                        Operation     Elapsed Time (seconds)
      Custom                           Insert        5.9526039
      Custom                           Select        1.9980745
      Custom                           Update        5.0850357
      Custom                           Delete        3.7785886

      Custom (no SPs)                  Insert        5.2251725
      Custom (no SPs)                  Select        2.0028176
      Custom (no SPs)                  Update        4.5381994
      Custom (no SPs)                  Delete        3.7064278

      Entity Framework (Optimized)     Insert        14.7847024
      Entity Framework (Optimized)     Select        5.5516514
      Entity Framework (Optimized)     Update        13.823694
      Entity Framework (Optimized)     Delete        10.0770142

      NHibernate (Optimized)           Insert        5.0894047
      NHibernate (Optimized)           Select        5.2877312
      NHibernate (Optimized)           Update        133.9417387
      NHibernate (Optimized)           Delete        5.6669841

      Fluent NHibernate (Optimized)    Insert        5.0175024
      Fluent NHibernate (Optimized)    Select        5.2698945
      Fluent NHibernate (Optimized)    Update        128.3563561
      Fluent NHibernate (Optimized)    Delete        5.5299521

      And here is a graph showing the same information:

      [Image: graph of the results shown in the preceding table]

    Installing Docker on Windows

    I recently set up Docker on both Windows 10 and Windows 7.  In both cases, it was a slightly bumpy experience, so I am recording the steps I followed here.

    Windows 10

    Machine Specifications
    • 64-bit Windows 10
    • VT-X/AMD-v support enabled in the BIOS
    • Hyper-V installed
    • Latest version of VirtualBox (version 5.0.16 at the time of this writing) installed

    Process

    STEP 1

    Download and install Docker Toolbox.  This is straightforward, and detailed instructions for doing this are available online.

    STEP 2

    Docker on Windows requires VirtualBox in order to run a lightweight Linux virtual machine.  VirtualBox and the Windows Hyper-V technology are mutually exclusive; you cannot use both at the same time.  Therefore, if Hyper-V is installed and enabled on your machine, you should disable it by following the directions found here.

    STEP 3

    The Docker startup script checks for the presence of Hyper-V, and halts if it is found.  It does NOT check the enabled/disabled state of Hyper-V.  So, if Hyper-V is installed on your computer, add a parameter to the Docker startup script to avoid the Hyper-V check.

    1. Locate the file C:\Program Files\Docker Toolbox\start.sh, and make a backup copy.
    2. Open the file C:\Program Files\Docker Toolbox\start.sh for editing.
    3. Look for this line: "${DOCKER_MACHINE}" create -d virtualbox "${VM}"
    4. Add a parameter to the line: "${DOCKER_MACHINE}" create --virtualbox-no-vtx-check -d virtualbox "${VM}"
    5. Save the file
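
    For those comfortable with the Git Bash shell that Docker Toolbox installs, the edit described above can also be scripted.  The following is only a sketch, not an official Docker Toolbox procedure; it assumes GNU or BSD sed, and it demonstrates the substitution on a local copy rather than on the real file:

```shell
# Demonstrate the start.sh edit on a local copy; point START_SH at the real
# file ("/c/Program Files/Docker Toolbox/start.sh" in Git Bash) to apply it there.
START_SH="start.sh"

# The unmodified line, as it appears in Docker Toolbox's start.sh:
printf '%s\n' '"${DOCKER_MACHINE}" create -d virtualbox "${VM}"' > "$START_SH"

# Add the flag that skips the VT-X check; sed keeps a backup in start.sh.bak.
sed -i.bak 's|create -d virtualbox|create --virtualbox-no-vtx-check -d virtualbox|' "$START_SH"

cat "$START_SH"
# → "${DOCKER_MACHINE}" create --virtualbox-no-vtx-check -d virtualbox "${VM}"
```

    Either way, keep the backup copy from step 1 in case the patched script misbehaves.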

    STEP 4

    The Docker Toolbox installation adds an icon to the desktop labeled “Docker Quickstart Terminal”.  Click on this icon to run Docker for the first time. 

    At this point the script will create the Docker virtual machine and attempt to configure it.  If the configuration proceeds successfully, you will have a running Docker instance.

    In my case, configuration failed.  I received the following error, which seems to be not uncommon:

    Error creating machine: Error in driver during machine creation: Unable to start the VM: C:\Program Files\Oracle\VirtualBox\VBoxManage.exe startvm default --type headless failed:

    VBoxManage.exe: error: Failed to open/create the internal network ‘HostInterfaceNetworking-VirtualBox Host-Only Ethernet Adapter #5’ (VERR_INTNET_FLT_IF_NOT_FOUND).

    VBoxManage.exe: error: Failed to attach the network LUN (VERR_INTNET_FLT_IF_NOT_FOUND)

    VBoxManage.exe: error: Details: code E_FAIL (0x80004005), component ConsoleWrap, interface IConsole

    Details: 00:00:03.020271 Power up failed (vrc=VERR_INTNET_FLT_IF_NOT_FOUND, rc=E_FAIL (0X80004005))

    To fix this error, I followed the instructions at https://caffinc.github.io/2015/11/fix-vbox-network/ to change the properties of the VirtualBox network adapter.  The key instructions are repeated here:

    I opened my Control Panel > Network & Internet > Network Connections and located my VirtualBox Host-Only Network adapter, called VirtualBox Host-Only Network #6 on my machine. I right-clicked it, selected Properties, unchecked the checked VirtualBox NDIS6 Bridged Networking Driver, and started VirtualBox again.

    Then I used the VirtualBox Manager to delete the Docker virtual machine (named "default") and delete all associated files.

    Once that additional configuration was complete, I was able to re-run the “Docker Quickstart Terminal” successfully.

    Windows 7

    Machine Specifications
    • 64-bit Windows 7
    • VT-X/AMD-v support enabled in the BIOS.
    • Latest version of VirtualBox (version 5.0.16 at the time of this writing) installed

    Process

    STEP 1

    Download and install Docker Toolbox.

    STEP 2

    The Toolbox installation adds an icon to the desktop labeled “Docker Quickstart Terminal”.  Click on this icon to run Docker for the first time.  

    At this point the script will create the Docker virtual machine and attempt to configure it.  If the configuration proceeds successfully, you will have a running Docker instance.

    My configuration failed.  I received the following error:

    Running pre-create checks…
    Creating machine…
    (default) Copying C:\Users\mlichtenberg\.docker\machine\cache\boot2docker.iso to
    C:\Users\mlichtenberg\.docker\machine\machines\default\boot2docker.iso…
    (default) Creating VirtualBox VM…
    (default) Creating SSH key…
    (default) Starting the VM…
    (default) Check network to re-create if needed…
    (default) Windows might ask for the permission to create a network adapter. Some
    times, such confirmation window is minimized in the taskbar.
    (default) Creating a new host-only adapter produced an error: C:\Program Files\O
    racle\VirtualBox\VBoxManage.exe hostonlyif create failed:
    (default) 0%…
    (default) Progress state: E_INVALIDARG
    (default) VBoxManage.exe: error: Failed to create the host-only adapter
    (default) VBoxManage.exe: error: Assertion failed: [!aInterfaceName.isEmpty()] a
    t ‘F:\tinderbox\win-5.0\src\VBox\Main\src-server\HostNetworkInterfaceImpl.cpp’ (
    74) in long __cdecl HostNetworkInterface::init(class com::Bstr,class com::Bstr,c
    lass com::Guid,enum __MIDL___MIDL_itf_VirtualBox_0000_0000_0036).
    (default) VBoxManage.exe: error: Please contact the product vendor!
    (default) VBoxManage.exe: error: Details: code E_FAIL (0x80004005), component Ho
    stNetworkInterfaceWrap, interface IHostNetworkInterface
    (default) VBoxManage.exe: error: Context: "enum RTEXITCODE __cdecl handleCreate(
    struct HandlerArg *)" at line 71 of file VBoxManageHostonly.cpp
    (default)
    (default) This is a known VirtualBox bug. Let’s try to recover anyway…
    Error creating machine: Error in driver during machine creation: Error setting u
    p host only network on machine start: The host-only adapter we just created is n
    ot visible. This is a well known VirtualBox bug. You might want to uninstall it
    and reinstall at least version 5.0.12 that is is supposed to fix this issue

    To attempt to fix this error, I first followed the instructions at https://caffinc.github.io/2015/11/fix-vbox-network/ to change the properties of the VirtualBox network adapter.  The key instructions are repeated here:

    I opened my Control Panel > Network & Internet > Network Connections and located my VirtualBox Host-Only Network adapter, called VirtualBox Host-Only Network #6 on my machine. I right-clicked it, selected Properties, unchecked the checked VirtualBox NDIS6 Bridged Networking Driver, and started VirtualBox again.

    This did NOT work.  However, I found additional advice online that suggested doing the opposite (checking the “VirtualBox NDIS6 Bridged Networking Driver” instead of unchecking), and then disabling and enabling the network adapter.

    Disabling and re-enabling the network adapter may have been the key to resolving the error, but I cannot say for sure.  In any case, I had success after 1) putting the properties of the network adapter back to their original state, and 2) restarting the network adapter.

    Once the network adapter was reset, I used the VirtualBox Manager to delete the Docker virtual machine (named "default") and delete all associated files.

    Once the additional configuration was complete, I was able to re-run the “Docker Quickstart Terminal” successfully.

    Testing / Validation

    On both Windows 7 and Windows 10, once the Docker installation completes successfully you will be left with a terminal window opened to a Docker command line.

    An Internet connection is required for the following.

    To test a basic Docker image, run the following command:

    docker run hello-world

    You will receive output that looks like the following:

    Unable to find image ‘hello-world:latest’ locally
    511136ea3c5a: Pull complete
    31cbccb51277: Pull complete
    e45a5af57b00: Pull complete
    hello-world:latest: The image you are pulling has been verified.
    Important: image verification is a tech preview feature and should not be
    relied on to provide security.
    Status: Downloaded newer image for hello-world:latest
    Hello from Docker.
    This message shows that your installation appears to be working correctly.

    To generate this message, Docker took the following steps:
    1. The Docker client contacted the Docker daemon.
    2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
       (Assuming it was not already locally available.)
    3. The Docker daemon created a new container from that image which runs the
       executable that produces the output you are currently reading.
    4. The Docker daemon streamed that output to the Docker client, which sent it
       to your terminal.

     

    That is it.  For next steps with Docker, reading the User’s Guide is a good place to start. 

    If you want to skip the specifics of Docker itself and dive right into using a Docker container, pre-built Docker images for popular technologies such as nginx, Mongo, MySQL, Node, PostgreSQL, SOLR, ElasticSearch, and many others can be found at the Docker Hub.

    Tuning Tesseract OCR

    Background

    Tesseract is an open-source tool for generating OCR (Optical Character Recognition) output from digital images of text.  It has been around for a long time, and the project is currently "owned" by Google.  Tesseract is still in development, but its most recent official release is more than two years old.

    More information about Tesseract OCR can be found here and here.

    In general, Tesseract does a good job with clean, predictably-formatted pages of text.  More challenging are pages with unusual typefaces or formatting.  Tesseract does provide a huge number of parameters that can be used to tune the output and improve its accuracy.  Unfortunately, most (if not all) of those parameters are minimally documented.

    While attempting to tune Tesseract for a project involving scans of old (pre-1923) life sciences texts, I found a few specific problems:

    • In some OCR outputs, entire lines or parts of lines were located in the wrong place on the page.
    • In general, Tesseract is difficult to tune
    • The configurable settings are poorly documented and not named in a manner that eases use
    • Some tuning options are too aggressive
    • Some tuning options fix one problem while causing another

    Following are descriptions and examples of a few of the tuning parameters that are available, highlighting the pros and cons of each.  These parameters produced obvious differences in the OCR output.  A number of other parameters were also evaluated, but produced little or no difference.

    Notes/Disclaimers About the Tests

    In all of the following examples, "config" is the name of the file containing the configuration settings.  A Tesseract config file is just a plain text file containing space-delimited key/value pairs for Tesseract config variables, each on a separate line.  There are several sample config files in the tessdata/configs folder of a standard Tesseract installation.
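    Because a config file is nothing more than key/value pairs, it is easy to generate one programmatically.  The following Python sketch (the file name my.config and the settings chosen here are illustrative, not part of these tests) writes a config file and assembles the corresponding command line:

```python
# Sketch: build a Tesseract config file programmatically.  A config file is
# just space-delimited "variable value" pairs, one per line.  The file name
# "my.config" and the settings chosen here are illustrative.

def write_config(path, settings):
    """Write Tesseract config variables as 'key value' lines."""
    with open(path, "w") as f:
        for key, value in settings.items():
            f.write(f"{key} {value}\n")

settings = {
    "debug_file": "tesseract.log",    # capture debugging output
    "tessedit_write_images": "true",  # write the internal TIF to disk
}
write_config("my.config", settings)

# The config file name is passed as the final argument to tesseract:
command = ["tesseract", "image.jpg", "outputfilename", "my.config"]
```

    Running the assembled command is equivalent to typing "tesseract image.jpg outputfilename my.config" at the shell.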

    The image used in this evaluation can be downloaded from http://www.archive.org/download/mobot31753003515068/page/n48

    Tesseract 3.02 was used for all tests.  In all cases, the settings tested and analyzed here have worked as described for the small number of images in our test set.  There is no guarantee that they will work the same for all, or even a majority, of scans.  At best, the following information should be used to direct tests on your own images.

    Configuration Settings for Capturing Debugging Output

    debug_file tesseract.log

    Writes debugging information to the named log file.

    tessedit_write_images true

    Internally, Tesseract converts the image being processed to a TIF; this setting writes that TIF to disk.

    tessedit_create_hocr

    Writes the output, including coordinate information, to an HTML file instead of to the standard text file.  The coordinate information can be particularly helpful.  These coordinates are for lines and words, which can be easier to work with than some other "box" settings that produce coordinates by letter.
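    Since the coordinate information is often the reason for using this setting, here is a rough Python sketch of pulling per-word bounding boxes out of an hOCR file.  It assumes Tesseract-style markup (ocrx_word spans with a bbox entry in the title attribute); for hOCR produced by other engines, or for more robustness, a real HTML parser would be a better choice than a regular expression:

```python
# Sketch: extract word bounding boxes from Tesseract hOCR output.
# Tesseract marks each word as <span class='ocrx_word' ...
# title='bbox x1 y1 x2 y2; ...'>word</span>; verify against your own files.

import re

def parse_hocr_words(hocr_text):
    """Return a list of (word, (x1, y1, x2, y2)) tuples from hOCR markup."""
    pattern = re.compile(
        r"<span[^>]*class=['\"]ocrx?_word['\"][^>]*"
        r"title=['\"][^'\"]*bbox (\d+) (\d+) (\d+) (\d+)[^'\"]*['\"][^>]*>"
        r"([^<]*)</span>"
    )
    words = []
    for x1, y1, x2, y2, text in pattern.findall(hocr_text):
        words.append((text.strip(), (int(x1), int(y1), int(x2), int(y2))))
    return words
```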

    Example 1: Baseline Output

    Command

    tesseract image.jpg outputfilename

    Command Line Arguments

    None

    Config Settings

    None

    Notes
    • This is the baseline output of Tesseract.
    • The lines in the output that are highlighted in red should be a single line reading "provincial towns, and in America.  A single sale has contained as many as".  The second half of the line has been mis-located.
    • The words highlighted in blue include extra characters that are a result of "noise" (specks and imperfections in the image).
    • It is not clear where the words at the end of the page (highlighted in green) come from.
    Output

    -43 THE ORCHID REVIEW. [AUOUs’r, I921.

    occasions an Eastern grass arrived dead, but the following attempt met
    -with success. As many as 23 collectors were at one time employed in
    different parts of the world- Auction sales were held in London, the largest
    provincial towns, and in America.
    .20,ooo Orchids.

    Having completed the St. Albans’ nursery, a branch business was
    organised at Summit, New Jersey, U.S.A., and placed under the manage-

    -ment of Mr. Fostermann, a former collector, and later under Mr. A.

    A single sale has contained as many as

    Dimmoek; Eventually, when found to be too far-‘distant,=it wasacquired by ‘

    Messrs. Lager & Hurrell, who still maintain it as an Orchid nursery. In
    1886 the authoritative work, “Reichenbachz’a,” with life-size coloured
    illustrations, was published by Mr. Sander, many of the articles being
    ‘personally supervised by him.

    Still restless, Mr. Sander, in 1894, commenced building a nursery at St.
    André, Bruges, Belgium. In 1914 this had developed into an enormous
    concern with 250 houses, about 50 of which were devoted to Orchids, the
    culture of Vanda coerulea, Phalaenopses, Dendrobium superbiens, Laelia
    ~Gouldiana and Cymbidium Sanderi being especially successful, while great
    strides were made in the growing and breeding of Odontoglossums. The
    -Orchid section alone formed a comparatively large nursery, but the
    remaining houses covered more than four times the area, and contained huge
    quantities of palms, azaleas, dracaenas, and araucarias, while outside a
    stock of 30,000 trimmed bays, from miniatures to giants, rhododendrons,
    hardy azaleas, begonias, etc., were grown. His collectors penetrated all
    parts of the globe to which Orchids are indigenous, and the two nurseries
    acted as clearing houses to all countries.

    Mr. Sanders successes at the leading European and American Horti-
    -cultural Exhibitions gained for him a world-wide reputation. He was one of
    the original holders of the Victorian Medal of Horticulture, and held several
    foreign orders, including that of the Croen of Belgium. He was a baron of
    the Russian Empire, and as head of the firm he won the French President’s
    Prix d’Honneur in Paris, the Veitchian Cup in 1906, the Coronation
    Challenge Cup in 1913, 4f Gold Medals, 24 Silver Cups, as well as Medals
    and Diplomas by the hundred. In International Exhibitions at London,
    Edinburgh, Brussels, Antwerp, Ghent, Paris, Petrograd, Moscow, Florence,
    Milan, New York, Chicago and St. Louis he won Gold Medals and highest
    awards. At each Ghent quinquennial his new plants were amongst the
    leading attractions to horticulturists from all parts of the world. His
    personality was felt by all who came in contact with him. Genial to all,
    and enthusiastic where plants were concerned, it was a pleasure to speak
    with him on Orchids, and especially interesting and instructive when he
    could be induced to give personal reminiscences of the struggles to obtain

    .3,
    9.-
    4:?
    ‘ii

    Example 2: PageSegMode

    Command

    tesseract image.jpg outputfilename -psm 6

    Command Line Arguments

    -psm 6

    Config Settings

    None

    Notes
    • The command line argument -psm stands for PageSegMode (Page Segmentation Mode).  It directs the layout analysis that Tesseract performs on the page.  By default, Tesseract fully automates the page segmentation, but does not perform orientation and script detection.  A value of 6 directs Tesseract to assume a single uniform block of text.
    • The text in the output that is highlighted in red is now correctly contained on a single line.
    • The words highlighted in blue include extra characters that are a result of "noise" (specks and imperfections in the image).  None of these have been corrected; in fact, a few new ones appear.
    • There are no longer extra lines between paragraphs.  However, those lines do not actually appear on the source image either.
    • The garbage words at the end of the page no longer appear.
    • A small number of errors in individual words that appear in the original output were corrected, and a few other incorrect words changed (though they remained incorrect).
    • -psm 6 can produce very poor output.  For example, processing of the image found at http://www.archive.org/download/mobot31753002262522/page/n1 results in a huge amount of garbage output, instead of just a few lines of text.
    • Pages with images produce poor output when using PageSegMode 6.  Because this parameter instructs Tesseract to treat everything as a single block of text, images are not recognized as images, and are instead processed as text (resulting in lots of garbage in the OCR).
    • Another disadvantage of using PageSegMode 6 is that text on "rotated" pages is not recognized.  In its normal mode, Tesseract is able to automatically normalize the page orientation and detect words.
    • In addition, PageSegMode 6 results in very poor results from multi-column pages.  Again, in its normal mode, Tesseract does a decent job of processing columns of text.
    Output

    .48 THE ORCHID REVIEW. [Avcvs-.r. 1921- ’
    occasions an Eastern grass arrived dead, but the following attempt met
    -with success. As many as 23 collectors were at one time employed in
    different parts of the world- Auction sales were held in London, the largest
    provincial towns, and in America. A single sale has contained as many as
    .20,ooo Orchids.
    Having completed the St. Albans’ nursery, a branch business was
    organised at Summit, New Jersey, U.S.A., and placed under the manage-
    -ment of Mr. Fostermann, a former collector, and later under Mr. A.
    Dimmoek; Eventually, when found to be too far-‘distant,=it wasiacquired by ‘
    Messrs. Lager & Hurrell, who still maintain it as an Orchid nursery. In
    1886 the authoritative work, “Reichenbachz’a,” with life-size coloured
    illustrations, was published by Mr. Sander, many of the articles being
    ‘personally supervised by
    him. ‘
    Still restless, Mr. Sander, in 1894, commenced building a nursery at St.
    André, Bruges, Belgium. In 1914 this had developed into an enormous
    concern with 250 houses, about 50 of which were devoted to Orchids, the
    culture of Vanda coerulea, Phalaenopses, Dendrobium superbiens, Laelia
    ~Gouldiana and Cymbidium Sanderi being especially successful, while great
    strides were made in the growing and breeding of Odontoglossums. The
    -Orchid section alone formed a comparatively large nursery, but the
    remaining houses covered more than four times the area, and contained huge
    quantities of palms, azaleas, dracaenas, and araucarias, while outside a
    stock of 30,000 trimmed bays, from miniatures to giants, rhododendrons,
    hardy azaleas, begonias, etc., were grown. His collectors penetrated all
    parts of the globe to which Orchids are indigenous, and the two nurseries
    acted as clearing houses to all countries.
    Mr. Sanders successes at the leading European and American Horti-
    -cultural Exhibitions gained for him a world-wide reputation. He was one of
    the original holders of the Victorian Medal of Horticulture, and held several
    foreign orders, including that of the Croen of Belgium. He was a baron
    of .
    the Russian Empire, and as head of the firm he won the French President’s
    Prix d’Honneur in Paris, the Veitchian Cup in 1906, the Coronation
    Challenge Cup in 1913, 4f Gold Medals, 24 Silver Cups, as well as Medals
    and Diplomas by the hundred. In International Exhibitions at London,
    Edinburgh, Brussels, Antwerp, Ghent, Paris, Petrograd, Moscow, Florence,
    Milan, New York, Chicago and St. Louis he won Gold Medals and highest
    awards. At each Ghent quinquennial his new plants were amongst the
    leading attractions to horticulturists from all parts of the world. His
    personality was felt by all who came in contact with him. Genial to all,
    and enthusiastic where plants were concerned, it was a pleasure to speak
    with him on Orchids, and especially interesting and instructive when he
    could be induced to give personal reminiscences of the struggles to obtain

    Example 3: Line Size

    Command

    tesseract image.jpg outputfilename config

    Command Line Arguments

    None

    Config Settings

    textord_min_linesize 3.25

    Notes
    • textord_min_linesize seems to have an effect on the line heights detected by Tesseract when it performs the layout analysis on the image.  The default value for this setting is 1.25.
    • When set to 3.25, the "broken" line problem in the original baseline output is corrected.  Lower settings (for example, 3.0) do not correct the "broken" lines.  
    • This setting causes other character recognition errors.
    • The text in the output that is highlighted in red is again correctly contained on a single line.
    • The words highlighted in blue include extra characters that are a result of "noise" (specks and imperfections in the image).  None of these have been corrected, but no new ones have appeared.
    • Lines between "paragraphs" now appear in somewhat odd locations.  Again, there are NO lines between paragraphs on the source image.
    • The garbage words at the end of the page do not appear.
    • A small number of errors in individual words that appear in the original output were corrected, a few other incorrect words changed (though they remained incorrect), and a small number of previously correct words are now incorrect.  These have been highlighted in purple.
    Output

    .48 THE ORCHID REVIEW. [AUGUsT, 1921.

    occasions an Eastern grass arrived dead, but the following attempt met
    ‘with success. As many as 23 collectors were at one time employed in

    different parts of the world- Auction sales were held in London, the largest
    provincial towns, and in America. A single sale has contained as many as
    .20,00o Orchids.

    Having completed the St. Albans’ nursery, a branch business was
    organised at Summit, New Jersey, U.S.A., and placed under the manage-
    -ment of Mr. Fostermann, a former collector, and later under Mr. A.
    Dimmook; Eventually, when found to be too far-‘distant,=it wasiacqui d b ‘

    I

    Messrs. Lager & Hurrell, who still maintain it as an Orchid nurser: 1
    1886 the authoritative work, “Reichenbachz’a,” with life-size coloured
    illustrations, was published by Mr. Sander, many of the articles being
    ‘personally supervised by him.

    Still restless, Mr. Sander, in 1894, commenced building a nursery at St.
    André, Bruges, Belgium. In 1914 this had developed into an enormous i
    concern with 250 houses, about 50 of which were devoted to Orchids, the
    culture of Vanda coerulea, Phalaenopses, Dendrobium superbiens, Laelia
    ~Gouldiana and Cymbidium Sanderi being especially successful, while great
    strides were made in the growing and breeding of Odontogl sssss s. The
    -Orchid section alone formed a comparatively large nursery, but the
    remaining houses covered more than four times the area, and contained huge ;
    quantities of palms, azaleas, dracaenas, and araucarias, while outside a
    stock of 30,000 trimmed bays, from miniatures to giants, rhododendrons,
    hardy azaleas, begonias, etc., were grown. His collectors penetrated all
    parts of the globe to which Orchids are indigenous, and the two nurseries
    acted as clearing houses to all countries.

    Mr. Sanders successes at the leading European and American Horti-
    -cultural Exhibitions gained for him a world-wide reputation. He was one of
    the original holders of the Victorian Medal of Horticulture, and held several
    foreign orders, including that of the Croen of Belgium. He was a baron of
    the Russian Empire, and as head of the firm he won the French President’s 3
    Prix d’Honneur in Paris, the Veitchian Cup in 1906, the Coronation
    Challenge Cup in 1913, 4f Gold Medals, 24 Silver Cups, as well as Medals
    and Diplomas by the hundred. In International Exhibitions at London,

    Edinburgh, Brussels, Antwerp, Ghent, Paris, Petrograd, Moscow, Florence, fi
    Milan, New York, Chicago and St. Louis he won Gold Medals and highest
    awards. At each Ghent quinquennial his new plants were amongst the
    leading attractions to horticulturists from all parts of the world. His

    personality was felt by all who came in contact with him. Genial to all,
    and enthusiastic where plants were concerned, it was a pleasure to speak
    with him on Orchids, and especially interesting and instructive when he i
    could be induced to give personal reminiscences of the struggles to obtain

    Example 4: Noise Reduction

    Command

    tesseract image.jpg outputfilename -psm 6 config

    Command Line Arguments

    -psm 6

    Config Settings

    textord_heavy_nr 1

    Notes
    • Note that for this test, the PageSegMode command line parameter was used in conjunction with the configuration setting, and PageSegMode was responsible for the elimination of the “broken” lines in the output.
    • textord_heavy_nr instructs Tesseract to vigorously remove noise from the output.
    • This setting did a good job of removing noise from the results (highlighted in blue), BUT it also removed many valid punctuation and diacritic marks (highlighted in green), including ALL periods.
    • A small number of errors in individual words that appear in the original output were corrected, a few other incorrect words changed (though they remained incorrect), and a small number of previously correct words are now incorrect.  These have been highlighted in purple.
    • Introduced three extra blank lines.
    • In total, more errors were introduced than were corrected with this setting.
    Output

    48 THE ORCHID REVIEW [Aucvsr I921
    occasions an Eastern grass arrived dead, but the following attempt met
    with success As many as 23 collectors were at one time employed in i
    different parts of the world Auction sales were held in London, the largest
    provincial towns, and in America A single sale has contained as many as
    20,000 Orchids
    Having completed the St Albans’ nursery, a branch business was
    organised at Summit, New Jersey, U S A , and placed under the manage it
    ment of Mr Fostermann, a former collector, and later under Mr A
    Dimmock Eventually, when found to be too far distant, it was acquired by 4.
    Messrs Lager & Huriell, who still maintain it as an Orchid nursery In
    1886 the authoritative work, “Rezchenbachm,” with life size coloured
    illustrations, was published by Mr Sander, many of the articles being
    personally supervised by him
    Still restless, Mr Sander, in 1894, commenced building a nursery at St
    Andre
    , Bruges, Belgium In 1914 this had developed into an enormous
    concern with 250 houses, about 50 of which were devoted to Orchids, the
    culture of Vanda coerulea, Phalaenopses, Dendrobium superbiens, Laelia
    Gouldiana and C3 mbidium Sanderi being especially successful, while great
    strides were made in the growing and breeding of Odontoglossums The
    Orchid section alone formed a compir 1tlV€ly large nurser}, but the
    remaining hou es covered more than four times the area, and contained huge
    quantities of palms, azaleas dracaenas, and araucarias, while outside a
    stock of 30,000 trimmed bays, from miniatures to giants, rhododendrons,
    hardy azaleas begonias, etc , were grown His collectors penetrated all
    parts of the globe to which Orchids are indigenous, and the two nurseries
    acted as clearing houses to all countries
    Mr Sanders successes at the leading European and American Horti

    cultural Exhibitions gained for him a world wide reputation He was one of
    the original holders of the Victorian Medal of Horticulture, and held several
    foreign orders, including that of the Croen of Belgium He was a baron of

    the Russian hmpire, and as head of the firm he won the French President’s

    Pnx d’Honneur in Paris, the Veitchian Cup in 1906, the Coronation
    Challenge Cup in 1913, 41 Gold Medals, 24 Silver Cups, as well as Medals
    and Diplomas by the hundred In International Exhibitions at London,
    Edinburgh, Brussels, Antwerp, Ghent, Paris, Petrograd, Moscow, Florence, i
    Milan, New York, Chicago and St Louis he won Gold Medals and highest
    awards At each Ghent quinquennial his new plants were amongst the
    leading attractions to horticulturists from all parts of the world His ?
    personality was felt by all who came in contact with him Genial to all,
    and enthusiastic where plants were concerned, it was a pleasure to speak
    with him on Orchids, and especially interesting and instructive when he
    could be induced to give personal reminiscences of the struggles to obtain

    Other Parameters

    The following additional parameters were evaluated, but had little to no effect on the output.

    tessedit_word_for_word 1

    According to documentation within the source code, this setting "Make(s) output have exactly one word per WERD".

    textord_space_size_is_variable 1

    This setting makes Tesseract assume that spaces have variable width, even though characters have fixed pitch.

    textord_max_noise_size
    textord_noise_area_ratio
    speckle_large_max_size
    speckle_small_penalty
    speckle_large_penalty
    speckle_small_certainty

    Each of these settings modifies how Tesseract identifies and handles noise (non-character markings) on the page.  It was possible to get Tesseract to produce different… but not better… outputs by modifying these settings.  It is possible that I simply did not find the right combination of values to produce actual improvements in the output.

    Conclusion

    It seems clear from these tests that processing at least some documents with Tesseract OCR requires careful tuning, and that such tuning is not always a simple task.  In fact, a bit of trial-and-error may be needed to produce the desired results.

    Ultimately, to get the best results from Tesseract OCR when tuning is required, one of two things needs to be true.  Either the tuning needs to be done for each book (or page) to be processed, or the books/pages to be processed should be of similar quality/characteristics (so that tuning can be done once for the entire workload).

    Ligatures in Tesseract OCR Output

    Tesseract is an open source OCR engine.  It has been open source since 2005, and development on the engine has been sponsored by Google since 2006.  It is a command line tool, although there are separate projects that provide a GUI.  More information about Tesseract can be found here.

    The following advice is known to apply to Tesseract version 3.02, but likely also applies to later versions.

    While using Tesseract, one curiosity that I noticed is that it frequently outputs ligatures such as “fi” and “fl” rather than individual letters (“f” followed by “i”, and “f” followed by “l”).  To a human reading the OCR output, this is no problem, as there is little difference to the naked eye between the ligatures and “normal” characters.  However, any post-processing or machine validation of the output can be affected by the presence of the ligatures.

    There are a couple of ways to eliminate the ligatures from the output.  First, a directive can be added to the Tesseract configuration file.  The configuration file is just a plain text file containing space-delimited key/value pairs for Tesseract config variables, each pair on a separate line.  So, to direct Tesseract to “blacklist”, or not use, specific ligatures, add something like the following to the configuration file:

    tessedit_char_blacklist    fifl

    In the previous example, replace the fi and fl with the exact ligatures you want Tesseract to not use.  The list of common Latin ligatures shown here can be found at http://www.unicode.org/charts/PDF/UFB00.pdf:

    ff LATIN SMALL LIGATURE FF
    fi LATIN SMALL LIGATURE FI
    fl LATIN SMALL LIGATURE FL
    ffi LATIN SMALL LIGATURE FFI
    ffl LATIN SMALL LIGATURE FFL
    ſt LATIN SMALL LIGATURE LONG S T
    st LATIN SMALL LIGATURE ST

    Another way to remove ligatures from Tesseract output is simply to post-process the output (using whatever tool or programming language you prefer), replacing the ligatures with the appropriate characters.  This may actually be the better approach.  In my own tests, I obtained more accurate final outputs by post-processing the Tesseract output.  This blog post also suggests that post-processing is the better approach.
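    As a sketch of the post-processing approach, the following Python maps the common Latin ligatures listed above (U+FB00 through U+FB06) back to their component letters.  Note that unicodedata.normalize("NFKC", text) achieves much the same result for these characters, but an explicit table makes the substitutions easy to audit and extend:

```python
# Sketch: replace Unicode ligatures in OCR output with their component
# letters.  The table covers the common Latin ligatures U+FB00-U+FB06.

LIGATURES = {
    "\ufb00": "ff",   # LATIN SMALL LIGATURE FF
    "\ufb01": "fi",   # LATIN SMALL LIGATURE FI
    "\ufb02": "fl",   # LATIN SMALL LIGATURE FL
    "\ufb03": "ffi",  # LATIN SMALL LIGATURE FFI
    "\ufb04": "ffl",  # LATIN SMALL LIGATURE FFL
    "\ufb05": "st",   # LATIN SMALL LIGATURE LONG S T (long s maps to plain s)
    "\ufb06": "st",   # LATIN SMALL LIGATURE ST
}

def expand_ligatures(text):
    """Return text with each ligature replaced by its component letters."""
    for ligature, replacement in LIGATURES.items():
        text = text.replace(ligature, replacement)
    return text
```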

    Merging Git Repositories and Preserving History

    Recently I was faced with the need to merge two Git repositories and preserve the history behind the files in each.  Here is an overview of the situation:

    • A "Main" project repository with a remote in a public GitHub project.
    • A "Secondary" project repository with a remote in a private Visual Studio Online project.
    • Both repositories contained Visual Studio / .NET / C# solutions/projects.
    • I needed to move the Secondary (private) repository into the Main (public) repository.
    • I wanted to preserve the history of the files in the Secondary repository.

    Much advice about merging two Git repositories and preserving history can be found online.  Here is a sample of what can be found:

    http://saintgimp.org/2013/01/22/merging-two-git-repositories-into-one-repository-without-losing-file-history/
    http://jasonkarns.com/blog/merge-two-git-repositories-into-one/
    http://stackoverflow.com/questions/1425892/how-do-you-merge-two-git-repositories
    http://julipedia.meroh.net/2014/02/how-to-merge-multiple-git-repositories.html
    http://scottwb.com/blog/2012/07/14/merge-git-repositories-and-preseve-commit-history/

    If you scan through the content at those links, you will see that there seem to be multiple ways to approach this problem.

    Following is the sequence of Git commands that worked for me.  I must stress that this worked for me, and may not work equally well for your situation.  Proceed carefully, and be prepared to handle unexpected situations.

    1) Navigate to the master branch of the Main project repository.

    2) Add a remote that references the Secondary project repository.  In my case, this was a reference to the Visual Studio Online remote repository.

    git remote add secondaryrep <URL of secondary repository>

    3) Create a new branch in the Main repository.

    git branch mergebranch

    4) Navigate to the new branch.

    git checkout mergebranch

    5) Fetch the files and metadata from the Secondary repository.

    git fetch secondaryrep

    6) Merge the master branch of the Secondary repository into the working branch of the Main repository.

    git merge secondaryrep/master

    At this point I had to stop and resolve a handful of minor errors which were specific to my situation.  You may or may not encounter similar issues with your own repositories.

    Specifically, there was an untracked file that initially prevented the merge operation.  In this case, it was safe to simply remove that file and retry the merge.

    In addition, after merging, there were a few merge conflicts in configuration files related to NuGet packages (recall that these repositories contained Visual Studio / .NET/ C# projects).  It was a simple matter to edit the files indicated by Git and resolve the conflicts.

    7) Prepare the files for commit.

    git add .

    8) Open all projects/solutions in Visual Studio and confirm that they successfully build and pass tests.

    9) Commit the files from the Secondary repository.

    git commit -a -m "Added projects from secondary repository"

    10) Return to the master branch of the Main repository.

    git checkout master

    11) Merge the branch we created for the files from the Secondary repository into the master branch of the Main repository.

    git merge mergebranch

    12) Push the updated master branch to the Main repository’s remote GitHub repository.

    git push origin master

    13) Remove the branch that had been created for the Secondary repository’s files.

    git branch -d mergebranch

    14) Remove the reference to the Secondary repository’s remote Visual Studio Online repository.

    git remote remove secondaryrep
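    For reference, the same sequence can be expressed as a single script.  The sketch below (Python, shelling out to the same git commands; the remote and branch names match those used above, the repository URL is a placeholder, and it assumes you start on the master branch of the Main repository) stops at the first failing command so that issues such as merge conflicts can be resolved by hand before re-running:

```python
# Sketch of steps 2-14 above as one script.  Step 8 (building the solutions
# in Visual Studio) is a manual check and is omitted.  The secondary
# repository URL is a placeholder that must be supplied.

import subprocess

def merge_commands(secondary_url):
    """Return the git commands from the steps above, in order."""
    return [
        ["git", "remote", "add", "secondaryrep", secondary_url],
        ["git", "branch", "mergebranch"],
        ["git", "checkout", "mergebranch"],
        ["git", "fetch", "secondaryrep"],
        ["git", "merge", "secondaryrep/master"],
        ["git", "add", "."],
        ["git", "commit", "-a", "-m", "Added projects from secondary repository"],
        ["git", "checkout", "master"],
        ["git", "merge", "mergebranch"],
        ["git", "push", "origin", "master"],
        ["git", "branch", "-d", "mergebranch"],
        ["git", "remote", "remove", "secondaryrep"],
    ]

def run_merge(secondary_url):
    """Run each command, stopping at the first failure (for example, a
    merge conflict) so it can be resolved manually before re-running."""
    for command in merge_commands(secondary_url):
        subprocess.run(command, check=True)
```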

    hOCRImageMapper: A Tool For Visualizing hOCR Files

    Just uploaded to GitHub (https://github.com/mlichtenberg/hocrimagemapper), this simple application provides a way to visualize hOCR output.

    Per Wikipedia: "hOCR is an open standard of data representation for formatted text obtained from optical character recognition (OCR). The definition encodes text, style, layout information, recognition confidence metrics and other information using Extensible Markup Language (XML) in form of Hypertext Markup Language (HTML) or XHTML."

    hOCR is produced by the Tesseract, Cuneiform, and OCRopus OCR engines.  My motivation for creating this tool was a need to analyze hOCR output produced by Tesseract.

    This application has been implemented as a simple WinForms application (yeah, I know, but it was quick) written in C#.

    When using the application, the text contained in an hOCR file is loaded alongside the image that is the source of the OCR output.  Hovering over a word in the text highlights the word in the image. 

    image
    Hovering over the word “quantitative” in the left panel highlights the word in the source image on the right.

    Clicking a word in the text displays the coordinates for the bounding box used to highlight the word.  (This bounding box is extracted from the hOCR output.)  The coordinates are displayed as two pairs of X-Y coordinates that represent the upper left and lower right corners of the bounding box.

    image
    Clicking the word displays its coordinates.  In
    this case, the X-Y pairs are (513, 540) for the
    upper left and (846, 600) for the lower right.

    The source code can be downloaded from the Github repository, or the compiled executable can be downloaded directly.