Ligatures in Tesseract OCR Output

Tesseract is an open-source OCR engine.  It has been open source since 2005, and its development has been sponsored by Google since 2006.  It is a command-line tool, although separate projects provide a GUI.  More information about Tesseract can be found here.

The following advice is known to apply to Tesseract version 3.0.2, but likely also applies to later versions.

While using Tesseract, one curiosity that I noticed is that it frequently outputs ligatures such as “ﬁ” and “ﬂ” rather than individual letters (“f” followed by “i” and “f” followed by “l”).  To a human reading the OCR output, this is no problem, as there is little difference to the naked eye between the ligatures and “normal” characters.  However, any post-processing or machine validation of the output can be affected by the presence of the ligatures: a search for the four-character string “find”, for example, will not match a word that begins with the single ligature character.

There are a couple of ways to eliminate the ligatures from the output.  First, a directive can be added to the Tesseract configuration file.  The configuration file is just a plain text file containing space-delimited key/value pairs for Tesseract config variables, each pair on a separate line.  So, to direct Tesseract to “blacklist”, or not use, specific ligatures, add something like the following to the configuration file:

tessedit_char_blacklist    ﬁﬂ

In the previous example, replace the ﬁ and ﬂ ligatures with the exact ligatures you want Tesseract to not use.  The common Latin ligatures listed below are taken from the Unicode chart at http://www.unicode.org/charts/PDF/UFB00.pdf:
 

ﬀ  U+FB00  LATIN SMALL LIGATURE FF
ﬁ  U+FB01  LATIN SMALL LIGATURE FI
ﬂ  U+FB02  LATIN SMALL LIGATURE FL
ﬃ  U+FB03  LATIN SMALL LIGATURE FFI
ﬄ  U+FB04  LATIN SMALL LIGATURE FFL
ﬅ  U+FB05  LATIN SMALL LIGATURE LONG S T
ﬆ  U+FB06  LATIN SMALL LIGATURE ST
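
Typing ligature characters into a text editor by hand is error-prone, so the configuration file can be generated instead.  The following Python is a minimal sketch, not a definitive recipe: the function names and file names are hypothetical, and it assumes that your Tesseract build accepts a config file as a trailing command-line argument (tesseract image outputbase configfile).  How the config name is resolved to a file on disk may vary between versions, so check that against your installation.

```python
from pathlib import Path

# The seven Latin ligatures U+FB00 to U+FB06, written as escapes so that
# this source file itself contains no hard-to-see ligature glyphs.
LIGATURES = "\ufb00\ufb01\ufb02\ufb03\ufb04\ufb05\ufb06"


def blacklist_config_line(chars: str = LIGATURES) -> str:
    """Return one space-delimited key/value line for a Tesseract config file."""
    return f"tessedit_char_blacklist {chars}\n"


def write_config(path: str, chars: str = LIGATURES) -> None:
    """Write the blacklist directive to `path` as UTF-8 text."""
    Path(path).write_text(blacklist_config_line(chars), encoding="utf-8")


def tesseract_command(image: str, output_base: str, config: str) -> list:
    """Build (but do not run) a command line that passes the config file.

    Assumption: the config file is given as a trailing argument after the
    output base name.
    """
    return ["tesseract", image, output_base, config]
```

For example, write_config("noligatures") followed by passing tesseract_command("page.png", "page", "noligatures") to subprocess.run would blacklist all seven ligatures, provided Tesseract can locate the config file.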

Another way to remove ligatures from Tesseract output is simply to post-process the output (using whatever tool or programming language you prefer), replacing the ligatures with the appropriate characters.  This may actually be the better approach.  In my own tests, I obtained more accurate final outputs by post-processing the Tesseract output.  This blog post also suggests that post-processing is the better approach.
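
As an illustration of the post-processing approach, here is a minimal Python sketch that replaces the seven ligatures listed earlier with plain letters.  The function name is hypothetical, and mapping the long s t ligature (U+FB05) to “st” is a choice made for this example; you may prefer to keep the long s.

```python
# Map each ligature (U+FB00 to U+FB06) to the plain letters it stands for.
LIGATURE_MAP = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long s + t; mapped to a modern "s" here (an assumption)
    "\ufb06": "st",
}

# str.maketrans accepts single-character keys mapped to replacement strings.
_TABLE = str.maketrans(LIGATURE_MAP)


def expand_ligatures(text: str) -> str:
    """Return `text` with the seven Latin ligatures replaced by plain letters."""
    return text.translate(_TABLE)
```

Run this over the text file that Tesseract produces, after reading it as UTF-8.  Python's unicodedata.normalize("NFKC", text) also expands these ligatures, but it rewrites many other compatibility characters as well, so an explicit table keeps the change limited to the ligatures.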
