Ligatures in Tesseract OCR Output
September 11, 2015
Tesseract is an open-source OCR engine. It has been open source since 2005, and development on the engine has been sponsored by Google since 2006. It is a command-line tool, although there are separate projects that provide a GUI. More information about Tesseract can be found in the project's documentation.
The following advice is known to apply to Tesseract version 3.0.2, but likely also applies to later versions.
While using Tesseract, one curiosity that I noticed is that it frequently outputs ligatures such as “ﬁ” and “ﬂ” rather than individual letters (“f” followed by “i” and “f” followed by “l”). To a human reading the OCR output, this is no problem, as there is little difference to the naked eye between the ligatures and “normal” characters. However, any post-processing or machine validation of the output can be affected by the presence of the ligatures, since a ligature is a single character that will not match a plain-text search for its component letters.
There are a couple of ways to eliminate the ligatures from the output. First, a directive can be added to the Tesseract configuration file. The configuration file is just a plain text file containing space-delimited key/value pairs for Tesseract config variables, one pair per line. So, to direct Tesseract to “blacklist”, or not use, specific ligatures, add a character blacklist directive (the tessedit_char_blacklist variable) to the configuration file covering the following ligatures:
ﬀ (U+FB00) LATIN SMALL LIGATURE FF
ﬁ (U+FB01) LATIN SMALL LIGATURE FI
ﬂ (U+FB02) LATIN SMALL LIGATURE FL
ﬃ (U+FB03) LATIN SMALL LIGATURE FFI
ﬄ (U+FB04) LATIN SMALL LIGATURE FFL
ﬅ (U+FB05) LATIN SMALL LIGATURE LONG S T
ﬆ (U+FB06) LATIN SMALL LIGATURE ST
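For illustration, a configuration line covering the ligatures above might look like the following. This is a sketch that assumes tessedit_char_blacklist is the variable being set; the value is simply the seven ligature characters written one after another, with no separators.

```
tessedit_char_blacklist ﬀﬁﬂﬃﬄﬅﬆ
```

If the line is saved in a file named, say, noligatures (a hypothetical name), the config file name is normally given as the last argument on the command line, along the lines of "tesseract input.png output noligatures". Where Tesseract looks for the file may depend on your installation.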
Another way to remove ligatures from Tesseract output is simply to post-process the output (using whatever tool or programming language you prefer), replacing the ligatures with the appropriate characters. This may actually be the better approach. In my own tests, I obtained more accurate final outputs by post-processing the Tesseract output. This blog post also suggests that post-processing is the better approach.
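As a rough sketch of the post-processing approach, here is one way it could be done in Python. The function name expand_ligatures and the choice to treat the long-s ligature (U+FB05) as a modern "st" are my own assumptions for illustration, not anything Tesseract prescribes.

```python
# Map each Latin ligature code point (U+FB00 to U+FB06) to plain letters.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # assumption: render the long-s + t ligature as modern "st"
    "\ufb06": "st",
}

# str.maketrans accepts single-character keys mapped to replacement strings.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Return text with the ligatures above replaced by individual letters."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    # Hypothetical OCR output containing the "fi" and "fl" ligatures.
    print(expand_ligatures("\ufb01nd the \ufb02oor"))
```

An explicit table is used here rather than blanket Unicode normalization (such as NFKC), because normalization would also rewrite other characters in the OCR output that you may want to keep untouched.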