That Conference 2016 – Session Resources

Last week I had the pleasure of attending the 2016 edition of That Conference.

It was an all-around excellent experience.  The venue, topics, speakers, sponsors, food, after-hours events, and swag all left little to complain about.  In addition, many technical conferences include areas/times for free-form open discussions led by conference attendees on topics of their choosing, and That Conference is no exception.  That Conference’s version of this was called Open Spaces, and by all accounts it was a success.  While I only took part in a single discussion, I observed that the area designated for those discussions was never not busy.

Conference experiences can be spoiled by inexperienced or ill-prepared session speakers.  At That Conference I was pleased by the quality of the speakers in all twelve sessions and three keynotes that I attended.  However, there were too many interesting sessions (a good thing!) and too little time (can’t be helped).  Therefore, since returning home I have been watching social media and the conference website in order to compile links to as many of the session materials as possible.

Here are the links to everything that I have been able to find.  (If you know of others, please post a comment with the links!)

Against Best Practices – Embracing the Avant Garde for a Weirder Web
Chelsea Maxwell

As Seen On TV: Developing Apps for Apple TV and TVOS
Matthew Soucoup

Back to the Future => C# 7
Mike Harris

Battle of the CLI: Gulp vs. Grunt
Abbey Gwayambadde

Be An Expert Xamarin Outdoorsman with the Ultimate Xamarin Toolchain
Vince Bullinger

Bear Proof Applications: Using Continuous Security to Mitigate Threats
Wendy Istanick

Boost Your Immune System with DevOps
Michelle Munstedt

Build and Deploy Your ASP.NET Core Applications… Automatically!
Brandon Martinez

Build Your Own Smart Home
Brandon Satrom

Building Mobile Games That Make Money
Scott Davis

C#: You Don’t Know Jack
George Heeres

Clean Architecture: Patterns, Practices, and Principles
Matthew Renze

Common T-SQL Mistakes
Kevin Boles

Computer Science: The Good Parts
Jeffery Cohen

Daring to Develop With Docker
Philip Nelson

Date and Time: Odds, Ends, and Oddities
Maggie Pint

Domain Driven Data
Bradley Holt

Enough Cryptography to be Dangerous
Steve Marx

The Experimentation Mindset
Doc Norton

Finding Your Way to the App Store
Matthew Ridley

From Inception to Production: A Continuous Delivery Story
Ian Randall

From Mobile First to Offline First
Bradley Holt

Full-Stack ASP.NET MVC Performance Tuning
Dustin Ewers

Happy Full-Stack Javascript Campers
Ryan Niemeyer

How I Learned To Love Dependency Injection
James Bender

Identity Management in ASP.NET Core
Ondrej Balas

An Internet Of Beers
Wade Wegner

Intro to Typescript
Jody Gustafson

Introduction to Angular 2.0
Jeremy Foster

Javascript Code Quality
Md Khan

Keynote: Family Keynote
Neely Drake and Emily Davis

Keynote: From 0 to 100,000: How Particle Failed, then Succeeded, then Scaled
Zach Supalla

Keynote: Stop Writing Code
Keith Casey

Keynote: You Have Too Much Time
Jeff Blankenburg

Mastering Voice UX Featuring Amazon’s Echo (AKA Alexa)
Chris Pauly

A Microservices Architecture That Emphasizes Rapid Development
Rob Hruska

Microsoft Bot Framework: Hiking Up the Trail of Automation
David Hauck

The Millennials R Coming
Heather Shapiro

Node.JS Crash Course
David Neal

Not Just Arts & Crafts: A Developer’s Guide to Incorporating Lean UX Practices into the Development Process
Rachel Krause

Out With the Old, In With the New: A Comparison of Angular 1 and 2
Tony Gemoll

Pavlov Yourself!
Alexandra Feldman

React Native to the Rescue
Josh Gretz

React vs. Angular – Dawn of Changes
John Ptacek

ReactJS For Beginners
Arthur Kay

Ruby on Rails from 0 to Deploy in 60 Minutes
Chris Johnson

Ruby Writing Ruby – Campfire Tales of Metaprogramming
Sara Gibbons

Service Bus Summer Camp
David Boike

So Many Analytics Tools, Which One Is Right For Me?
Jason Groom

Start Your Own Business, Dammit!
Terra Fletcher

A Tale of Two Redesigns
Jess Bertling

Tell SQL Server Profiler To Take A Hike
Jes Borland

Understanding Git, Part 2
Keith Dahlby

UX Beyond the UI – How the Rest of Software Development Affects User Experience
Joe Regan

Why Your Site Is Slow
Steve Persch

Working From Whereever
Aaron Douglas

Log Parser – Transforming Plain Text Files

This post describes how to solve a specific problem with Microsoft’s Log Parser tool.  For background on the tool (and lots of examples), start here.

The Problem

Given a file named MyLog.log that looks like this…

ip=0.0.0.0 date=20160620 time=06:00:00 device=A23456789 log=00013
ip=0.0.0.1 date=20160621 time=06:00:01 device=A13456789 log=00014
ip=0.0.0.2 date=20160622 time=06:00:02 device=A12456789 log=00015
ip=0.0.0.3 date=20160623 time=06:00:03 device=A12356789 log=00016
ip=0.0.0.4 date=20160624 time=06:00:04 device=A12346789 log=00017
ip=0.0.0.5 date=20160625 time=06:00:05 device=A12345789 log=00018
ip=0.0.0.6 date=20160626 time=06:00:06 device=A12345689 log=00019
ip=0.0.0.7 date=20160627 time=06:00:07 device=A12345679 log=00020
ip=0.0.0.8 date=20160628 time=06:00:08 device=A12345678 log=00021
ip=0.0.0.9 date=20160629 time=06:00:09 device=A123456789 log=00022

…transform it into a tab-separated file with a header row.  Each field should include only the field value (and not the field name).

Notice that the original file has no header, the fields are separated with spaces, and the field name is part of each field (e.g. "ip=").

The Solution

Step 1)

logparser -i:TSV -iSeparator:space -headerRow:OFF
     "select * into 'MyLogTemp.log' from 'MyLog.log'"
     -o:TSV -oSeparator:space -headers:ON

In this command, -i:TSV -iSeparator:space informs Log Parser that the input file is a space-separated text file, and -headerRow:OFF lets Log Parser know that the file has no headers.  Likewise, -o:TSV -oSeparator:space -headers:ON tells Log Parser to output a space-separated text file with headers.

This produces a file named MyLogTemp.log with the following content:

Filename RowNumber Field1 Field2 Field3 Field4 Field5
MyLog.log 1 ip=0.0.0.0 date=20160620 time=06:00:00 device=A23456789 log=00013
MyLog.log 2 ip=0.0.0.1 date=20160621 time=06:00:01 device=A13456789 log=00014
MyLog.log 3 ip=0.0.0.2 date=20160622 time=06:00:02 device=A12456789 log=00015
MyLog.log 4 ip=0.0.0.3 date=20160623 time=06:00:03 device=A12356789 log=00016
MyLog.log 5 ip=0.0.0.4 date=20160624 time=06:00:04 device=A12346789 log=00017
MyLog.log 6 ip=0.0.0.5 date=20160625 time=06:00:05 device=A12345789 log=00018
MyLog.log 7 ip=0.0.0.6 date=20160626 time=06:00:06 device=A12345689 log=00019
MyLog.log 8 ip=0.0.0.7 date=20160627 time=06:00:07 device=A12345679 log=00020
MyLog.log 9 ip=0.0.0.8 date=20160628 time=06:00:08 device=A12345678 log=00021
MyLog.log 10 ip=0.0.0.9 date=20160629 time=06:00:09 device=A123456789 log=00022

This hasn’t done much.  In fact, it has added some stuff that is not relevant (the Filename and RowNumber columns), while leaving the field names in each field and keeping the space field separator.  However, it HAS added headers (Field1, Field2, etc.), which are needed for the second step.

Step 2)

logparser -i:TSV -iSeparator:space -headerRow:ON
     "select REPLACE_STR(Field1, 'ip=', '') AS ip,
               REPLACE_STR(Field2, 'date=', '') AS date,
               REPLACE_STR(Field3, 'time=', '') AS time,
               REPLACE_STR(Field4, 'device=', '') AS device,
               REPLACE_STR(Field5, 'log=', '') AS log
     into 'MyLogTransformed.log'
     from 'MyLogTemp.log'"
     -o:TSV -oSeparator:tab -headers:ON

The input and output specifications in this command are similar to those in Step 1, except here the input file has headers (-headerRow:ON) and the output file is tab-separated (-oSeparator:tab) instead of space-separated.  The main difference is in the SELECT statement itself, where the REPLACE_STR function removes the field names from the field values and the AS clause assigns the desired header to each column of data.  Notice that the REPLACE_STR function uses the headers that were added in Step 1.

This produces the final result in a file named MyLogTransformed.log:

ip     date     time     device     log
0.0.0.0     20160620     06:00:00     A23456789     00013
0.0.0.1     20160621     06:00:01     A13456789     00014
0.0.0.2     20160622     06:00:02     A12456789     00015
0.0.0.3     20160623     06:00:03     A12356789     00016
0.0.0.4     20160624     06:00:04     A12346789     00017
0.0.0.5     20160625     06:00:05     A12345789     00018
0.0.0.6     20160626     06:00:06     A12345689     00019
0.0.0.7     20160627     06:00:07     A12345679     00020
0.0.0.8     20160628     06:00:08     A12345678     00021
0.0.0.9     20160629     06:00:09     A123456789     00022
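
For comparison, here is a rough Python sketch of the same transformation.  It is not part of the Log Parser solution; it only illustrates the logic of the two steps (split on spaces, strip the "name=" prefix, write tab-separated rows under a header).  The function name transform is my own, and the sketch assumes every field has the form name=value and that every line has the same fields in the same order.

```python
import csv
import io


def transform(log_text):
    """Convert space-separated name=value lines to tab-separated text with a header row."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    header_written = False
    for line in log_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split each "name=value" field on the first "=" only.
        pairs = [field.split("=", 1) for field in line.split()]
        if not header_written:
            # The field names of the first line become the header row.
            writer.writerow([name for name, _ in pairs])
            header_written = True
        writer.writerow([value for _, value in pairs])
    return out.getvalue()


sample = (
    "ip=0.0.0.0 date=20160620 time=06:00:00 device=A23456789 log=00013\n"
    "ip=0.0.0.1 date=20160621 time=06:00:01 device=A13456789 log=00014\n"
)
print(transform(sample), end="")
```

Unlike the Log Parser approach, this takes the header names from the data itself instead of listing them in the query, which is convenient but only safe when the field layout never changes.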

More Information

See Log Parser’s built-in help for additional explanations of the Log Parser features used in the solution.  In particular, look at the following:

logparser -h
logparser -h -i:TSV
logparser -h -o:TSV
logparser -h FUNCTIONS REPLACE_STR

Recommended Tool: Express Profiler for SQL Server Databases

NOTE:  As I was writing up this post I discovered the news that SQL Profiler is deprecated as of the release of SQL Server 2016.  If this also affects the underlying SQL Server tracing APIs, then this news may affect the long-term future of the Express Profiler.  For now, however, it is a tool that I recommend.

Express Profiler is a simple Open Source alternative to the SQL Profiler that ships with the full SQL Server Management Studio.  This is particularly useful when working with SQL Server Express databases, as the Express version of the Management Studio does NOT include the SQL Profiler.

Usage of the Express Profiler should be self-explanatory to anyone familiar with the SQL Profiler.

Here are some details about Express Profiler from the project page:

  • ExpressProfiler (aka SqlExpress Profiler) is a simple and fast replacement for SQL Server Profiler with basic GUI
  • Can be used with both Express and non-Express editions of SQL Server 2005/2008/2008r2/2012/2014 (including LocalDB)
  • Tracing of basic set of events (Batch/RPC/SP:Stmt Starting/Completed, Audit login/logout, User error messages, Blocked Process report) and columns (Event Class, Text Data, Login, CPU, Reads, Writes, Duration, SPID, Start/End time, Database/Object/Application name) – both selectable
  • Filters on most data columns
  • Copy all/selected event rows to clipboard in form of XML
  • Find in "Text data" column
  • Export data in Excel’s clipboard format

While I have found Express Profiler to be a good and useful tool, it is not as fully-featured as the SQL Profiler.  Here are some key "missing" features in Express Profiler:

  • No way to load a saved trace output, although that feature is on the roadmap for the tool.
  • No way to save trace output directly to a database table.
  • Fewer columns can be included in the trace output, and many fewer events can be traced.  In my experience, however, the columns and events that I find myself using in most cases are all available.
  • As there are fewer columns in the output, there are fewer columns on which to filter.  Again, the most common/useful columns and events are covered.
  • No way to create trace templates for use with future traces.

Despite these limitations, I recommend this tool for situations where the full SQL Profiler is not available.

Installing Docker on Windows

I recently set up Docker on both Windows 10 and Windows 7.  In both cases, it was a slightly bumpy experience, so I am recording the steps I followed here.

Windows 10

Machine Specifications
  • 64-bit Windows 10
  • VT-x/AMD-V support enabled in the BIOS
  • Hyper-V installed
  • Latest version of VirtualBox (version 5.0.16 at the time of this writing) installed
Process

STEP 1

Download and install Docker Toolbox.  This is straightforward, and detailed instructions are available online.

STEP 2

Docker on Windows requires VirtualBox in order to run a lightweight Linux virtual machine.  VirtualBox and the Windows Hyper-V technology are mutually exclusive technologies; you cannot use both at the same time.  Therefore, if Hyper-V is installed and enabled on your machine, you should disable it by following the directions found here.

STEP 3

The Docker startup script checks for the presence of Hyper-V, and halts if it is found.  It does NOT check the enabled/disabled state of Hyper-V.  So, if Hyper-V is installed on your computer, add a parameter to the Docker startup script to avoid the Hyper-V check.

  1. Locate the file C:\Program Files\Docker Toolbox\start.sh, and make a backup copy.
  2. Open the file C:\Program Files\Docker Toolbox\start.sh for editing.
  3. Look for this line: "${DOCKER_MACHINE}" create -d virtualbox "${VM}"
  4. Add a parameter to the line: "${DOCKER_MACHINE}" create --virtualbox-no-vtx-check -d virtualbox "${VM}"
  5. Save the file
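
If you would rather script the edit than make it by hand, the steps above can be sketched in Python along these lines.  This is only an illustration under assumptions: the helper names are mine, the script is assumed to contain the create command exactly as shown in the list, and writing under C:\Program Files normally requires an elevated prompt.

```python
import shutil
from pathlib import Path

FLAG = "--virtualbox-no-vtx-check"
OLD = '"${DOCKER_MACHINE}" create -d virtualbox'
NEW = '"${DOCKER_MACHINE}" create ' + FLAG + " -d virtualbox"


def add_no_vtx_check(script_text):
    """Return the script text with the flag added to the create command."""
    if FLAG in script_text:
        return script_text  # already patched, leave untouched
    return script_text.replace(OLD, NEW)


def patch_file(path):
    """Make a backup copy next to the script, then rewrite the script in place."""
    path = Path(path)
    shutil.copy2(str(path), str(path.with_name(path.name + ".bak")))
    # Work on bytes so the script's existing line endings are preserved.
    original = path.read_bytes().decode("utf-8")
    path.write_bytes(add_no_vtx_check(original).encode("utf-8"))
```

Calling patch_file(r"C:\Program Files\Docker Toolbox\start.sh") would apply the change; running it a second time is harmless because the flag is only added once.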

STEP 4

The Docker Toolbox installation adds an icon to the desktop labeled “Docker Quickstart Terminal”.  Click on this icon to run Docker for the first time. 

At this point the script will create the Docker virtual machine and attempt to configure it.  If the configuration proceeds successfully, you will have a running Docker instance.

In my case, configuration failed.  I received the following error, which appears to be fairly common:

Error creating machine: Error in driver during machine creation: Unable to start the VM: C:\Program Files\Oracle\VirtualBox\VBoxManage.exe startvm default --type headless failed:

VBoxManage.exe: error: Failed to open/create the internal network ‘HostInterfaceNetworking-VirtualBox Host-Only Ethernet Adapter #5’ (VERR_INTNET_FLT_IF_NOT_FOUND).

VBoxManage.exe: error: Failed to attach the network LUN (VERR_INTNET_FLT_IF_NOT_FOUND)

VBoxManage.exe: error: Details: code E_FAIL (0x80004005), component ConsoleWrap, interface IConsole

Details: 00:00:03.020271 Power up failed (vrc=VERR_INTNET_FLT_IF_NOT_FOUND, rc=E_FAIL (0X80004005))

To fix this error, I followed the instructions at https://caffinc.github.io/2015/11/fix-vbox-network/ to change the properties of the VirtualBox network adapter.  The key instructions are repeated here:

I opened my Control Panel > Network & Internet > Network Connections and located my VirtualBox Host-Only Network adapter, called VirtualBox Host-Only Network #6 on my machine. I right-clicked it and selected the Properties, and checked the un-checked VirtualBox NDIS6 Bridged Networking Driver and started VirtualBox again.

Then I used the VirtualBox Manager to delete the Docker virtual machine (named "default") and delete all associated files.

Once that additional configuration was complete, I was able to re-run the “Docker Quickstart Terminal” successfully.

Windows 7

Machine Specifications
  • 64-bit Windows 7
  • VT-x/AMD-V support enabled in the BIOS
  • Latest version of VirtualBox (version 5.0.16 at the time of this writing) installed
Process

STEP 1

Download and install Docker Toolbox.

STEP 2

The Toolbox installation adds an icon to the desktop labeled “Docker Quickstart Terminal”.  Click on this icon to run Docker for the first time.  

At this point the script will create the Docker virtual machine and attempt to configure it.  If the configuration proceeds successfully, you will have a running Docker instance.

My configuration failed.  I received the following error:

Running pre-create checks…
Creating machine…
(default) Copying C:\Users\mlichtenberg\.docker\machine\cache\boot2docker.iso to
C:\Users\mlichtenberg\.docker\machine\machines\default\boot2docker.iso…
(default) Creating VirtualBox VM…
(default) Creating SSH key…
(default) Starting the VM…
(default) Check network to re-create if needed…
(default) Windows might ask for the permission to create a network adapter. Some
times, such confirmation window is minimized in the taskbar.
(default) Creating a new host-only adapter produced an error: C:\Program Files\O
racle\VirtualBox\VBoxManage.exe hostonlyif create failed:
(default) 0%…
(default) Progress state: E_INVALIDARG
(default) VBoxManage.exe: error: Failed to create the host-only adapter
(default) VBoxManage.exe: error: Assertion failed: [!aInterfaceName.isEmpty()] a
t ‘F:\tinderbox\win-5.0\src\VBox\Main\src-server\HostNetworkInterfaceImpl.cpp’ (
74) in long __cdecl HostNetworkInterface::init(class com::Bstr,class com::Bstr,c
lass com::Guid,enum __MIDL___MIDL_itf_VirtualBox_0000_0000_0036).
(default) VBoxManage.exe: error: Please contact the product vendor!
(default) VBoxManage.exe: error: Details: code E_FAIL (0x80004005), component Ho
stNetworkInterfaceWrap, interface IHostNetworkInterface
(default) VBoxManage.exe: error: Context: "enum RTEXITCODE __cdecl handleCreate(
struct HandlerArg *)" at line 71 of file VBoxManageHostonly.cpp
(default)
(default) This is a known VirtualBox bug. Let’s try to recover anyway…
Error creating machine: Error in driver during machine creation: Error setting u
p host only network on machine start: The host-only adapter we just created is n
ot visible. This is a well known VirtualBox bug. You might want to uninstall it
and reinstall at least version 5.0.12 that is is supposed to fix this issue

To attempt to fix this error, I first followed the instructions at https://caffinc.github.io/2015/11/fix-vbox-network/ to change the properties of the VirtualBox network adapter.  The key instructions are repeated here:

I opened my Control Panel > Network & Internet > Network Connections and located my VirtualBox Host-Only Network adapter, called VirtualBox Host-Only Network #6 on my machine. I right-clicked it and selected the Properties, and checked the un-checked VirtualBox NDIS6 Bridged Networking Driver and started VirtualBox again.

This did NOT work.  However, I found additional advice online that suggested doing the opposite (unchecking the “VirtualBox NDIS6 Bridged Networking Driver” instead of checking it), and then disabling and re-enabling the network adapter.

Disabling and re-enabling the network adapter may have been the key to resolving the error, but I cannot say for sure.  In any case, I had success after 1) putting the properties of the network adapter back to their original state and 2) restarting the network adapter.

Once the network adapter was reset, I used the VirtualBox Manager to delete the Docker virtual machine (named "default") and delete all associated files.

Once the additional configuration was complete, I was able to re-run the “Docker Quickstart Terminal” successfully.

Testing / Validation

On both Windows 7 and Windows 10, once the Docker installation completes successfully you will be left with a terminal window opened to a Docker command line.

An Internet connection is required for the following.

To test a basic Docker image, run the following command:

docker run hello-world

You will receive output that looks like the following:

Unable to find image ‘hello-world:latest’ locally
511136ea3c5a: Pull complete
31cbccb51277: Pull complete
e45a5af57b00: Pull complete
hello-world:latest: The image you are pulling has been verified.
Important: image verification is a tech preview feature and should not be
relied on to provide security.
Status: Downloaded newer image for hello-world:latest
Hello from Docker.
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
1. The Docker client contacted the Docker daemon.
2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
   (Assuming it was not already locally available.)
3. The Docker daemon created a new container from that image which runs the
   executable that produces the output you are currently reading.
4. The Docker daemon streamed that output to the Docker client, which sent it
   to your terminal.
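
If you want to automate this check (for example, on several machines), a minimal Python sketch might look like the following.  The function names are my own, and the only thing it relies on is the "Hello from Docker" greeting shown in the output above; it assumes the docker command is on the PATH of the shell where it runs.

```python
import subprocess


def looks_successful(output):
    """True if the text contains the greeting printed by the hello-world image."""
    return "Hello from Docker" in output


def run_hello_world():
    """Run the test image and return its combined stdout/stderr as text."""
    result = subprocess.run(
        ["docker", "run", "hello-world"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        check=False,  # inspect the output instead of raising on a non-zero exit
    )
    return result.stdout
```

From the Docker Quickstart Terminal, looks_successful(run_hello_world()) should return True once the installation is working.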

That is it.  For next steps with Docker, reading the User’s Guide is a good place to start. 

If you want to skip the specifics of Docker itself and dive right in to using a Docker container, pre-built Docker images for popular technologies such as nginx, Mongo, MySQL, Node, PostgreSQL, SOLR, ElasticSearch, and many others can be found at the Docker Hub.

How To Restore a Database in a SQL Server AlwaysOn Availability Group

Background

There are two clustered servers running SQL Server 2014.  The servers host production databases as well as databases used for QA and testing.  One AlwaysOn Availability Group has been created for the production databases and one for the QA databases.  The same server serves as the primary for both AlwaysOn Availability Groups.

One of the production databases needs to be restored from backup.  Backups are taken from the secondary server, not the primary.  The backups should be restored to the same server from which they were taken.

The following tasks need to be completed in order to restore the database:

  • Make the secondary server from which the backups were taken the primary server
  • Remove the database to be restored from the AlwaysOn Availability Group
  • Restore the database
  • Add the database back into the AlwaysOn Availability Group

Following are detailed instructions for completing these tasks.

Task 1: Switch the Primary and Secondary Servers

1) Connect to both servers in the cluster in SQL Server Management Studio.

2) On the Secondary server, expand the "Availability Groups" folder under the "AlwaysOn High Availability" folder.

3) Right-click on the availability group containing the database to be restored and select "Failover…" from the context menu.  Click “Next >”.

4) Select the new primary server.  Click “Next >”.

5) Verify the choices and click “Finish”.

6) Repeat steps 3-5 for the remaining availability group.

Task 2: Remove the Database from the Availability Group

A restore operation cannot be performed on a database that is part of an availability group, so the next task is to remove the database from the group.

1) On the new primary server, expand the list of Availability Databases for the availability group.

2) Right-click the database to be restored and select "Remove Database from Availability Group…" from the context-menu. 

3) Click “OK” to remove the database from the availability group. 

Task 3: Restore the Database

1) In the “Databases” folder on the primary server, right-click on the database to be restored and select "Properties" to open the “Database Properties” dialog.  Select the “Options” page, scroll down to the “Restrict Access” option, and change the value from MULTI_USER to SINGLE_USER.  Click “OK”.

2) In the “Databases” folder on the primary server, right-click on the database to be restored and select Tasks->Restore->Database… from the context menu.

3) On the “General” page of the “Restore Database” dialog, select the last Full backup and all Transaction log backups.

4) Select the “Options” page of the “Restore Database” dialog and click the "Overwrite the existing database (WITH REPLACE)" option.  Click "OK".
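
The dialog steps in this task correspond roughly to a short T-SQL script.  As a sketch only, the Python below assembles such a script so the order of the statements is easy to see.  The database name and backup paths in the usage note are hypothetical, WITH ROLLBACK IMMEDIATE is my own addition (the dialog does not mention it), and the script that Management Studio generates may differ in its details.

```python
def quote_name(name):
    """Bracket-quote a SQL Server identifier, escaping any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def quote_path(path):
    """Quote a backup file path as a Unicode string literal."""
    return "N'" + path.replace("'", "''") + "'"


def build_restore_script(database, full_backup, log_backups):
    """Return T-SQL that restricts access, then restores the full and log backups."""
    db = quote_name(database)
    statements = [
        "USE [master];",
        # Step 1: restrict access (ROLLBACK IMMEDIATE is an assumption of this sketch).
        "ALTER DATABASE {} SET SINGLE_USER WITH ROLLBACK IMMEDIATE;".format(db),
        # Steps 3-4: restore the last full backup over the existing database.
        "RESTORE DATABASE {} FROM DISK = {} WITH REPLACE, NORECOVERY;".format(
            db, quote_path(full_backup)
        ),
    ]
    # Apply each transaction log backup in order, leaving the database restoring.
    for path in log_backups:
        statements.append(
            "RESTORE LOG {} FROM DISK = {} WITH NORECOVERY;".format(db, quote_path(path))
        )
    # Bring the database online once all backups have been applied.
    statements.append("RESTORE DATABASE {} WITH RECOVERY;".format(db))
    return "\n".join(statements)
```

For example, build_restore_script("Sales", r"D:\Backups\Sales_full.bak", [r"D:\Backups\Sales_1.trn"]) returns a five-statement script that could be reviewed before being run against the primary server.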

Task 4: Add the Database Back to the Availability Group

After the restore of the database to the new primary server is complete, it can be put back into the availability group.

1) In the “Databases” folder on the secondary server, right-click the database and select "Delete" from the context menu.  Click “OK”.

2) Right-click “Availability Databases” in the availability group on the primary server and select "Add Database…" from the context menu.  Click “Next >”.

3) Select the database to be added to the group and click "Next >".

4) Select "Full" as the data synchronization preference.  This will take a full backup of the database on the primary and restore it on the secondary server(s).  Specify a network location accessible to the primary and all secondary servers in which to place the backup files.  Click "Next >".

5) Use the “Connect…” button to establish a connection to the secondary server(s).  Click "Next >".

6) The "Add Database to Availability Group" wizard will validate all of the settings for the new availability group database.  When it completes, click "Next >".

7) Verify the choices and click "Finish" to add the database to the availability group.
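
The wizard steps for failing over, removing the database, and adding it back also have T-SQL counterparts.  The Python sketch below simply builds those statements; the helper names and the group and database names in the usage note are hypothetical.  It does not reproduce everything the wizards do (for example, the "Full" synchronization option also backs up the database and restores it on the secondary), and a planned failover assumes the secondary is synchronized.

```python
def quote_name(name):
    """Bracket-quote a SQL Server identifier, escaping any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def failover_statement(group):
    """Task 1: run on the secondary replica that should become the new primary."""
    return "ALTER AVAILABILITY GROUP {} FAILOVER;".format(quote_name(group))


def remove_database_statement(group, database):
    """Task 2: run on the primary replica."""
    return "ALTER AVAILABILITY GROUP {} REMOVE DATABASE {};".format(
        quote_name(group), quote_name(database)
    )


def add_database_statement(group, database):
    """Task 4: run on the primary replica after the restore is complete."""
    return "ALTER AVAILABILITY GROUP {} ADD DATABASE {};".format(
        quote_name(group), quote_name(database)
    )


def join_secondary_statement(group, database):
    """Task 4: run on each secondary once the database is restored there WITH NORECOVERY."""
    return "ALTER DATABASE {} SET HADR AVAILABILITY GROUP = {};".format(
        quote_name(database), quote_name(group)
    )
```

For instance, add_database_statement("ProdAG", "Sales") yields ALTER AVAILABILITY GROUP [ProdAG] ADD DATABASE [Sales]; which is run on the primary, followed by the join statement on each secondary.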

Final Tasks

The restore of a database in an AlwaysOn Availability Group is now complete.

At this point it is recommended to immediately perform a backup of the restored database on the secondary server.  This will establish a new backup chain for the database.

The backup files created during synchronization of the primary and secondary server(s) can be deleted.  The location of those files was specified in Step 4 of the “Add the Database Back to the Availability Group” task.

Note that the restored database should now be back in MULTI_USER mode.  Recall that it had been set to SINGLE_USER in Step 1 of the “Restore the Database” task.

Tuning Tesseract OCR

Background

Tesseract is an open-source tool for generating OCR (Optical Character Recognition) output from digital images of text.  It has been around for a long time, and the project is currently "owned" by Google.  Tesseract is still in development, but its last official release is more than two years old.

More information about Tesseract OCR can be found here and here.

In general, Tesseract does a good job with clean, predictably-formatted pages of text.  More challenging are pages with unusual type faces or formatting.  Tesseract does provide a huge number of parameters that can be used to tune the output and improve its accuracy.  Unfortunately, most (if not all) of those parameters are minimally documented.

While attempting to tune Tesseract for a project involving scans of old (pre-1923) life sciences texts, I found a few specific problems:

  • In some OCR outputs, entire lines or parts of lines were located in the wrong place on the page.
  • In general, Tesseract is difficult to tune
  • The configurable settings are poorly documented and not named in a manner that eases use
  • Some tuning options are too aggressive
  • Some tuning options fix one problem while causing another

Following are descriptions and examples of a few of the tuning parameters that are available, highlighting the pros and cons of each.  These parameters produced obvious differences in the OCR output.  A number of other parameters were evaluated, but produced no, or very minimal, differences.

Notes/Disclaimers About the Tests

In all of the following examples, "config" is the name of the file containing the configuration settings.  A Tesseract config file is just a plain text file containing space-delimited key/value pairs for Tesseract config variables, each on a separate line.  There are several standard config files in the tessdata/configs folder of a standard Tesseract installation.
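
When testing many combinations of settings, it can help to generate the config file and the command line from a script.  Here is a minimal Python sketch of that idea.  The function names are my own, the argument order follows the commands shown in this post (image, output name, optional -psm, then the config name), and where the config file needs to live for Tesseract to find it is not covered here.

```python
def format_config(settings):
    """Render settings as one space-delimited key/value pair per line."""
    return "".join("{} {}\n".format(key, value) for key, value in settings.items())


def build_command(image, output_base, psm=None, config=None):
    """Build a Tesseract command line as a list suitable for subprocess."""
    command = ["tesseract", image, output_base]
    if psm is not None:
        command += ["-psm", str(psm)]  # page segmentation mode
    if config is not None:
        command.append(config)  # name of the config file
    return command


print(format_config({"debug_file": "tesseract.log", "tessedit_write_images": "true"}), end="")
print(build_command("image.jpg", "outputfilename", psm=6, config="config"))
```

The text returned by format_config can be written to a file named "config", and the list returned by build_command can be passed to subprocess.run.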

The image used in this evaluation can be downloaded from http://www.archive.org/download/mobot31753003515068/page/n48

Tesseract 3.02 was used for all tests.  In all cases, the settings tested and analyzed here have worked as described for the small number of images in our test set.  There is no guarantee that they will work the same for all, or even a majority, of scans.  At best, the following information should be used to direct tests on your own images.

Configuration Settings for Capturing Debugging Output

debug_file tesseract.log

Writes debugging information to the named log file.

tessedit_write_images true

Internally, tesseract converts the image being processed to a TIF; this setting writes that TIF to disk.

tessedit_create_hocr

Writes the output, including coordinate information, to an HTML file instead of to the standard text file.  The coordinate information can be particularly helpful.  These coordinates are for lines and words, which can be easier to work with than some other "box" settings that produce coordinates by letter.
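
As an illustration of working with those coordinates, the following Python sketch pulls the bounding boxes out of hOCR markup.  It assumes the boxes appear in title attributes of the form "bbox x0 y0 x1 y1", which is the hOCR convention; the sample markup and its numbers are made up for the example, and real output from your Tesseract version may carry additional attributes.

```python
import re
from html.parser import HTMLParser

BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrBoxes(HTMLParser):
    """Collect (class, x0, y0, x1, y1) for every element whose title carries a bbox."""

    def __init__(self):
        super().__init__()
        self.boxes = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        match = BBOX.search(attributes.get("title") or "")
        if match:
            coords = tuple(int(n) for n in match.groups())
            self.boxes.append((attributes.get("class"),) + coords)


def extract_boxes(hocr_text):
    """Return the bounding boxes found in a string of hOCR markup."""
    parser = HocrBoxes()
    parser.feed(hocr_text)
    return parser.boxes


# Made-up sample: one line element containing one word element.
sample = (
    "<span class='ocr_line' id='line_1' title=\"bbox 10 20 300 40\">"
    "<span class='ocrx_word' id='word_1' title=\"bbox 10 20 80 40\">THE</span>"
    "</span>"
)
print(extract_boxes(sample))
```

Comparing the vertical coordinates of consecutive line boxes is one way to spot lines that Tesseract has placed out of order on the page.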

Example 1: Baseline Output

Command

tesseract image.jpg outputfilename

Command Line Arguments

None

Config Settings

None

Notes
  • This is the baseline output of Tesseract.
  • The lines in the output that are highlighted in red should be a single line reading "provincial towns, and in America.  A single sale has contained as many as".  The 2nd half of the line has been mis-located.
  • The words highlighted in blue include extra characters that are a result of "noise" (specks and imperfections in the image).
  • It is not clear where the words at the end of the page (highlighted in green) come from.
Output

-43 THE ORCHID REVIEW. [AUOUs’r, I921.

occasions an Eastern grass arrived dead, but the following attempt met
-with success. As many as 23 collectors were at one time employed in
different parts of the world- Auction sales were held in London, the largest
provincial towns, and in America.
.20,ooo Orchids.

Having completed the St. Albans’ nursery, a branch business was
organised at Summit, New Jersey, U.S.A., and placed under the manage-

-ment of Mr. Fostermann, a former collector, and later under Mr. A.

A single sale has contained as many as

Dimmoek; Eventually, when found to be too far-‘distant,=it wasacquired by ‘

Messrs. Lager & Hurrell, who still maintain it as an Orchid nursery. In
1886 the authoritative work, “Reichenbachz’a,” with life-size coloured
illustrations, was published by Mr. Sander, many of the articles being
‘personally supervised by him.

Still restless, Mr. Sander, in 1894, commenced building a nursery at St.
André, Bruges, Belgium. In 1914 this had developed into an enormous
concern with 250 houses, about 50 of which were devoted to Orchids, the
culture of Vanda coerulea, Phalaenopses, Dendrobium superbiens, Laelia
~Gouldiana and Cymbidium Sanderi being especially successful, while great
strides were made in the growing and breeding of Odontoglossums. The
-Orchid section alone formed a comparatively large nursery, but the
remaining houses covered more than four times the area, and contained huge
quantities of palms, azaleas, dracaenas, and araucarias, while outside a
stock of 30,000 trimmed bays, from miniatures to giants, rhododendrons,
hardy azaleas, begonias, etc., were grown. His collectors penetrated all
parts of the globe to which Orchids are indigenous, and the two nurseries
acted as clearing houses to all countries.

Mr. Sanders successes at the leading European and American Horti-
-cultural Exhibitions gained for him a world-wide reputation. He was one of
the original holders of the Victorian Medal of Horticulture, and held several
foreign orders, including that of the Croen of Belgium. He was a baron of
the Russian Empire, and as head of the firm he won the French President’s
Prix d’Honneur in Paris, the Veitchian Cup in 1906, the Coronation
Challenge Cup in 1913, 4f Gold Medals, 24 Silver Cups, as well as Medals
and Diplomas by the hundred. In International Exhibitions at London,
Edinburgh, Brussels, Antwerp, Ghent, Paris, Petrograd, Moscow, Florence,
Milan, New York, Chicago and St. Louis he won Gold Medals and highest
awards. At each Ghent quinquennial his new plants were amongst the
leading attractions to horticulturists from all parts of the world. His
personality was felt by all who came in contact with him. Genial to all,
and enthusiastic where plants were concerned, it was a pleasure to speak
with him on Orchids, and especially interesting and instructive when he
could be induced to give personal reminiscences of the struggles to obtain

.3,
9.-
4:?
‘ii

Example 2: PageSegMode

Command

tesseract image.jpg outputfilename -psm 6

Command Line Arguments

-psm 6

Config Settings

None

Notes
  • The command line argument -psm stands for PageSegMode (Page Segmentation Mode).  It directs the layout analysis that Tesseract performs on the page.  By default, Tesseract fully automates the page segmentation, but does not perform orientation and script detection.  A value of 6 directs Tesseract to assume a single uniform block of text.
  • The text in the output that is highlighted in red is now correctly contained on a single line.
  • The words highlighted in blue include extra characters that are a result of "noise" (specks and imperfections in the image).  None of these have been corrected; in fact, a few new ones appear.
  • There are no longer extra lines between paragraphs.  However, those lines do not actually appear on the source image either.
  • The garbage words at the end of the page no longer appear.
  • A small number of errors in individual words that appear in the original output were corrected, and a few other incorrect words changed (but were still incorrect).
  • -psm 6 can produce very poor output.  For example, processing of the image found at http://www.archive.org/download/mobot31753002262522/page/n1 results in a huge amount of garbage output, instead of just a few lines of text.
  • Pages with images produce poor output when using PageSegMode 6.  Because this parameter instructs Tesseract to treat everything as a single block of text, images are not recognized as images, and are instead processed as text (resulting in lots of garbage in the OCR).
  • Another disadvantage of using PageSegMode 6 is that text on "rotated" pages is not recognized.  In its normal mode, Tesseract is able to automatically normalize the page orientation and detect words.
  • In addition, PageSegMode 6 produces very poor results on multi-column pages.  Again, in its normal mode, Tesseract does a decent job of processing columns of text.
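
For scripting runs like this one, the command can be assembled in a few lines of Python.  This is only a sketch: build_command and its parameters are hypothetical names, nothing is executed, and the psm_flag default assumes the Tesseract version used in these tests (newer releases spell the option --psm).

```python
def build_command(image, output_base, psm=None, config=None, psm_flag="-psm"):
    """Return the tesseract argument list for one run (nothing is executed)."""
    cmd = ["tesseract", image, output_base]
    if psm is not None:
        # Assumption: "-psm" matches the version used here; newer releases use "--psm".
        cmd += [psm_flag, str(psm)]
    if config is not None:
        # The config file name follows the other arguments, as in the commands shown in this post.
        cmd.append(config)
    return cmd


print(" ".join(build_command("image.jpg", "outputfilename", psm=6)))
# prints: tesseract image.jpg outputfilename -psm 6
```

Passing the returned list to subprocess.run(cmd, check=True) would perform the actual OCR, assuming tesseract is installed and on the PATH.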
Output

.48 THE ORCHID REVIEW. [Avcvs-.r. 1921- ’
occasions an Eastern grass arrived dead, but the following attempt met
-with success. As many as 23 collectors were at one time employed in
different parts of the world- Auction sales were held in London, the largest
provincial towns, and in America. A single sale has contained as many as
.20,ooo Orchids.
Having completed the St. Albans’ nursery, a branch business was
organised at Summit, New Jersey, U.S.A., and placed under the manage-
-ment of Mr. Fostermann, a former collector, and later under Mr. A.
Dimmoek; Eventually, when found to be too far-‘distant,=it wasiacquired by ‘
Messrs. Lager & Hurrell, who still maintain it as an Orchid nursery. In
1886 the authoritative work, “Reichenbachz’a,” with life-size coloured
illustrations, was published by Mr. Sander, many of the articles being
‘personally supervised by
him. ‘
Still restless, Mr. Sander, in 1894, commenced building a nursery at St.
André, Bruges, Belgium. In 1914 this had developed into an enormous
concern with 250 houses, about 50 of which were devoted to Orchids, the
culture of Vanda coerulea, Phalaenopses, Dendrobium superbiens, Laelia
~Gouldiana and Cymbidium Sanderi being especially successful, while great
strides were made in the growing and breeding of Odontoglossums. The
-Orchid section alone formed a comparatively large nursery, but the
remaining houses covered more than four times the area, and contained huge
quantities of palms, azaleas, dracaenas, and araucarias, while outside a
stock of 30,000 trimmed bays, from miniatures to giants, rhododendrons,
hardy azaleas, begonias, etc., were grown. His collectors penetrated all
parts of the globe to which Orchids are indigenous, and the two nurseries
acted as clearing houses to all countries.
Mr. Sanders successes at the leading European and American Horti-
-cultural Exhibitions gained for him a world-wide reputation. He was one of
the original holders of the Victorian Medal of Horticulture, and held several
foreign orders, including that of the Croen of Belgium. He was a baron
of .
the Russian Empire, and as head of the firm he won the French President’s
Prix d’Honneur in Paris, the Veitchian Cup in 1906, the Coronation
Challenge Cup in 1913, 4f Gold Medals, 24 Silver Cups, as well as Medals
and Diplomas by the hundred. In International Exhibitions at London,
Edinburgh, Brussels, Antwerp, Ghent, Paris, Petrograd, Moscow, Florence,
Milan, New York, Chicago and St. Louis he won Gold Medals and highest
awards. At each Ghent quinquennial his new plants were amongst the
leading attractions to horticulturists from all parts of the world. His
personality was felt by all who came in contact with him. Genial to all,
and enthusiastic where plants were concerned, it was a pleasure to speak
with him on Orchids, and especially interesting and instructive when he
could be induced to give personal reminiscences of the struggles to obtain

 

Example 3: Line Size

Command

tesseract image.jpg outputfilename config

Command Line Arguments

None

Config Settings

textord_min_linesize 3.25

Notes
  • textord_min_linesize seems to have an effect on the line heights detected by Tesseract when it performs the layout analysis on the image.  The default value for this setting is 1.25.
  • When set to 3.25, the "broken" line problem in the original baseline output is corrected.  Lower settings (for example, 3.0) do not correct the "broken" lines.  
  • This setting causes other character recognition errors.
  • The text in the output that is highlighted in red is again correctly contained on a single line.
  • The words highlighted in blue include extra characters that are a result of "noise" (specks and imperfections in the image).  None of these have been corrected, but no new ones have appeared.
  • Lines between "paragraphs" now appear in somewhat odd locations.  Again, there are NO lines between paragraphs on the source image.
  • The garbage words at the end of the page do not appear.
  • A small number of errors in individual words that appear in the original output were corrected, a few other incorrect words changed (but were still incorrect), and a small number of correct words are now incorrect.  These have been highlighted in purple.
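
The config file used for this test holds one "name value" pair per line.  A minimal Python sketch of generating such a file and the matching command follows; write_config and linesize_command are hypothetical helper names, and the file is written to a temporary directory purely for illustration.

```python
import tempfile
from pathlib import Path


def write_config(path, settings):
    """Write one 'name value' pair per line, as in the Config Settings above."""
    lines = [f"{name} {value}" for name, value in settings.items()]
    path = Path(path)
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path


def linesize_command(image, output_base, config_path):
    """Argument list for a run that only supplies a config file."""
    return ["tesseract", str(image), str(output_base), str(config_path)]


with tempfile.TemporaryDirectory() as tmp:
    cfg = write_config(Path(tmp) / "config", {"textord_min_linesize": 3.25})
    print(cfg.read_text(encoding="utf-8"), end="")
    print(" ".join(linesize_command("image.jpg", "outputfilename", cfg)))
```

Because lower values such as 3.0 did not fix the broken lines, a helper like this makes it easy to regenerate the file while stepping through candidate values.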
Output

.48 THE ORCHID REVIEW. [AUGUsT, 1921.

occasions an Eastern grass arrived dead, but the following attempt met
‘with success. As many as 23 collectors were at one time employed in

different parts of the world- Auction sales were held in London, the largest
provincial towns, and in America. A single sale has contained as many as
.20,00o Orchids.

Having completed the St. Albans’ nursery, a branch business was
organised at Summit, New Jersey, U.S.A., and placed under the manage-
-ment of Mr. Fostermann, a former collector, and later under Mr. A.
Dimmook; Eventually, when found to be too far-‘distant,=it wasiacqui d b ‘

I

Messrs. Lager & Hurrell, who still maintain it as an Orchid nurser: 1
1886 the authoritative work, “Reichenbachz’a,” with life-size coloured
illustrations, was published by Mr. Sander, many of the articles being
‘personally supervised by him.

Still restless, Mr. Sander, in 1894, commenced building a nursery at St.
André, Bruges, Belgium. In 1914 this had developed into an enormous i
concern with 250 houses, about 50 of which were devoted to Orchids, the
culture of Vanda coerulea, Phalaenopses, Dendrobium superbiens, Laelia
~Gouldiana and Cymbidium Sanderi being especially successful, while great
strides were made in the growing and breeding of Odontogl sssss s. The
-Orchid section alone formed a comparatively large nursery, but the
remaining houses covered more than four times the area, and contained huge ;
quantities of palms, azaleas, dracaenas, and araucarias, while outside a
stock of 30,000 trimmed bays, from miniatures to giants, rhododendrons,
hardy azaleas, begonias, etc., were grown. His collectors penetrated all
parts of the globe to which Orchids are indigenous, and the two nurseries
acted as clearing houses to all countries.

Mr. Sanders successes at the leading European and American Horti-
-cultural Exhibitions gained for him a world-wide reputation. He was one of
the original holders of the Victorian Medal of Horticulture, and held several
foreign orders, including that of the Croen of Belgium. He was a baron of
the Russian Empire, and as head of the firm he won the French President’s 3
Prix d’Honneur in Paris, the Veitchian Cup in 1906, the Coronation
Challenge Cup in 1913, 4f Gold Medals, 24 Silver Cups, as well as Medals
and Diplomas by the hundred. In International Exhibitions at London,

Edinburgh, Brussels, Antwerp, Ghent, Paris, Petrograd, Moscow, Florence, fi
Milan, New York, Chicago and St. Louis he won Gold Medals and highest
awards. At each Ghent quinquennial his new plants were amongst the
leading attractions to horticulturists from all parts of the world. His

personality was felt by all who came in contact with him. Genial to all,
and enthusiastic where plants were concerned, it was a pleasure to speak
with him on Orchids, and especially interesting and instructive when he i
could be induced to give personal reminiscences of the struggles to obtain

 

Example 4: Noise Reduction

Command

tesseract image.jpg outputfilename -psm 6 config

Command Line Arguments

-psm 6

Config Settings

textord_heavy_nr 1

Notes
  • Note that for this test, the PageSegMode command line parameter was used in conjunction with the configuration setting, and PageSegMode was responsible for the elimination of the “broken” lines in the output.
  • textord_heavy_nr instructs Tesseract to vigorously remove noise from the output.
  • This setting did a good job of removing noise from the results (highlighted in blue), BUT it also removed many valid punctuation and diacritic marks (highlighted in green), including ALL periods.
  • A small number of errors in individual words that appear in the original output were corrected, a few other incorrect words changed (but were still incorrect), and a small number of correct words are now incorrect.  These have been highlighted in purple.
  • Introduced three extra blank lines.
  • In total, more errors were introduced than were corrected with this setting.
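
One way to put a number on the lost punctuation is to count punctuation characters before and after.  The Python sketch below is only an illustration: the helper names are hypothetical, it only counts ASCII punctuation (so curly quotes and diacritics are ignored), and it cannot tell noise from valid marks.

```python
import string
from collections import Counter


def punctuation_counts(text):
    """Count ASCII punctuation characters in a block of OCR output."""
    return Counter(ch for ch in text if ch in string.punctuation)


def lost_marks(before, after):
    """Return the marks that occur less often in 'after', with the size of the drop."""
    b, a = punctuation_counts(before), punctuation_counts(after)
    return {ch: b[ch] - a[ch] for ch in b if b[ch] > a[ch]}


# The same line taken from the Example 2 output and from the output below.
before = "-with success. As many as 23 collectors were at one time employed in"
after = "with success As many as 23 collectors were at one time employed in i"
print(lost_marks(before, after))
```

On this line the stray leading hyphen (noise) and the sentence-ending period (valid) both disappear, which is exactly the trade-off described in the notes.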
Output

48 THE ORCHID REVIEW [Aucvsr I921
occasions an Eastern grass arrived dead, but the following attempt met
with success As many as 23 collectors were at one time employed in i
different parts of the world Auction sales were held in London, the largest
provincial towns, and in America A single sale has contained as many as
20,000 Orchids
Having completed the St Albans’ nursery, a branch business was
organised at Summit, New Jersey, U S A , and placed under the manage it
ment of Mr Fostermann, a former collector, and later under Mr A
Dimmock Eventually, when found to be too far distant, it was acquired by 4.
Messrs Lager & Huriell, who still maintain it as an Orchid nursery In
1886 the authoritative work, “Rezchenbachm,” with life size coloured
illustrations, was published by Mr Sander, many of the articles being
personally supervised by him
Still restless, Mr Sander, in 1894, commenced building a nursery at St
Andre
, Bruges, Belgium In 1914 this had developed into an enormous
concern with 250 houses, about 50 of which were devoted to Orchids, the
culture of Vanda coerulea, Phalaenopses, Dendrobium superbiens, Laelia
Gouldiana and C3 mbidium Sanderi being especially successful, while great
strides were made in the growing and breeding of Odontoglossums The
Orchid section alone formed a compir 1tlV€ly large nurser}, but the
remaining hou es covered more than four times the area, and contained huge
quantities of palms, azaleas dracaenas, and araucarias, while outside a
stock of 30,000 trimmed bays, from miniatures to giants, rhododendrons,
hardy azaleas begonias, etc , were grown His collectors penetrated all
parts of the globe to which Orchids are indigenous, and the two nurseries
acted as clearing houses to all countries
Mr Sanders successes at the leading European and American Horti

cultural Exhibitions gained for him a world wide reputation He was one of
the original holders of the Victorian Medal of Horticulture, and held several
foreign orders, including that of the Croen of Belgium He was a baron of

the Russian hmpire, and as head of the firm he won the French President’s

Pnx d’Honneur in Paris, the Veitchian Cup in 1906, the Coronation
Challenge Cup in 1913, 41 Gold Medals, 24 Silver Cups, as well as Medals
and Diplomas by the hundred In International Exhibitions at London,
Edinburgh, Brussels, Antwerp, Ghent, Paris, Petrograd, Moscow, Florence, i
Milan, New York, Chicago and St Louis he won Gold Medals and highest
awards At each Ghent quinquennial his new plants were amongst the
leading attractions to horticulturists from all parts of the world His ?
personality was felt by all who came in contact with him Genial to all,
and enthusiastic where plants were concerned, it was a pleasure to speak
with him on Orchids, and especially interesting and instructive when he
could be induced to give personal reminiscences of the struggles to obtain

 

Other Parameters

The following additional parameters were evaluated, but had little to no effect on the output.

tessedit_word_for_word 1

According to documentation within the source code, this setting "Make(s) output have exactly one word per WERD".

textord_space_size_is_variable 1

This setting makes Tesseract assume that spaces have variable width, even though characters have fixed pitch.

textord_max_noise_size
textord_noise_area_ratio
speckle_large_max_size
speckle_small_penalty
speckle_large_penalty
speckle_small_certainty

Each of these settings modifies how Tesseract identifies and handles noise (non-character markings) on the page.  It was possible to get Tesseract to produce different… but not better… outputs by modifying these settings.  It is possible that I simply did not find the right combination of values to produce actual improvements in the output.
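
A more systematic search of the combinations could be sketched as a simple grid in Python.  The helper name config_grid is hypothetical, and the candidate values below are placeholders chosen only to show the mechanics; they are not recommended settings.

```python
from itertools import product


def config_grid(candidates):
    """Yield one settings dict for every combination of candidate values."""
    names = sorted(candidates)
    for values in product(*(candidates[name] for name in names)):
        yield dict(zip(names, values))


# Placeholder values for two of the settings listed above (assumptions, not advice).
candidates = {
    "textord_max_noise_size": [5, 7, 9],
    "textord_noise_area_ratio": [0.5, 0.7],
}

grid = list(config_grid(candidates))
print(len(grid), "combinations")
for settings in grid[:2]:
    # Each dict maps directly onto the 'name value' lines of a config file.
    print("\n".join(f"{name} {value}" for name, value in settings.items()))
    print("--")
```

Each combination would still need its own Tesseract run and a review of the output, so the grid grows expensive quickly as more settings are added.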

Conclusion

It seems clear from these tests that processing at least some documents with Tesseract OCR requires careful tuning, and that such tuning is not always a simple task.  In fact, a bit of trial-and-error may be needed to produce the desired results.

Ultimately, to get the best results from Tesseract OCR when tuning is required, one of two things needs to be true.  Either the tuning needs to be done for each book (or page) to be processed, or the books/pages to be processed should be of similar quality/characteristics (so that tuning can be done once for the entire workload).
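
The trial-and-error can be made a little less subjective by scoring each output against a hand-corrected transcription of the same page.  The Python sketch below uses the standard library's difflib for this; the function names are hypothetical, and the reference string is my assumption of the correct text for a single line.

```python
from difflib import SequenceMatcher


def similarity(reference, candidate):
    """Character-level similarity between 0.0 and 1.0 (1.0 means identical)."""
    return SequenceMatcher(None, reference, candidate, autojunk=False).ratio()


def best_run(reference, outputs):
    """Return the name of the run whose output is closest to the reference."""
    return max(outputs, key=lambda name: similarity(reference, outputs[name]))


# Assumed correct text for one line, plus the same line from two of the outputs above.
reference = "20,000 Orchids."
outputs = {
    "psm 6": ".20,ooo Orchids.",
    "psm 6 + textord_heavy_nr": "20,000 Orchids",
}
for name, text in outputs.items():
    print(f"{name}: {similarity(reference, text):.3f}")
print("closest:", best_run(reference, outputs))
```

A single line is not representative of a page, so a real comparison would score the full output, ideally across several sample pages from the book being tuned.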
