Tuning Tesseract OCR

Background

Tesseract is an open-source tool for generating OCR (Optical Character Recognition) output from digital images of text.  It has been around for a long time, and the project is currently "owned" by Google.  Tesseract is still in development, but its most recent official release is more than 2 years old.

More information about Tesseract OCR can be found here and here.

In general, Tesseract does a good job with clean, predictably-formatted pages of text.  Pages with unusual typefaces or formatting are more challenging.  Tesseract does provide a huge number of parameters that can be used to tune the output and improve its accuracy.  Unfortunately, most (if not all) of those parameters are minimally documented.

While attempting to tune Tesseract for a project involving scans of old (pre-1923) life sciences texts, I found a few specific problems:

  • In some OCR outputs, entire lines or parts of lines were located in the wrong place on the page.
  • In general, Tesseract is difficult to tune.
  • The configurable settings are poorly documented and not named in a manner that eases use.
  • Some tuning options are too aggressive.
  • Some tuning options fix one problem while causing another.

Following are descriptions and examples of a few of the tuning parameters that are available, highlighting the pros and cons of each.  These parameters produced obvious differences in the OCR output.  A number of other parameters were evaluated, but produced no, or very minimal, differences.
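One way to make "differences in the OCR output" concrete is to compare two output files word by word.  The following is a minimal sketch using Python's standard difflib module; the function name is my own, and it only reports which words changed, not whether a change is an improvement.

```python
import difflib


def word_differences(baseline, tuned):
    """List (operation, baseline words, tuned words) for every changed span."""
    old_words = baseline.split()
    new_words = tuned.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Skip spans that are identical in both outputs.
        if tag != "equal":
            changes.append((tag, old_words[i1:i2], new_words[j1:j2]))
    return changes
```

An empty list means the two outputs contain the same words in the same order, which is what "no difference" looked like for many of the parameters tested.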

Notes/Disclaimers About the Tests

In all of the following examples, "config" is the name of the file containing the configuration settings.  A Tesseract config file is just a plain text file containing space-delimited key/value pairs for Tesseract config variables, each on a separate line.  There are several standard config files in the tessdata/configs folder of a standard Tesseract installation.
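As a minimal sketch of that file format, the following Python helpers render a dictionary of settings as one space-delimited key/value pair per line and write it to disk.  The function names are illustrative and not part of Tesseract; writing booleans as 1/0 is my assumption, chosen to match the numeric style used for settings later in this article.

```python
def format_config(settings):
    """Render config variables as 'key value' lines, one pair per line."""
    lines = []
    for key, value in settings.items():
        # Assumption: booleans are written as 1 or 0.
        if isinstance(value, bool):
            value = int(value)
        lines.append(f"{key} {value}")
    return "\n".join(lines) + "\n"


def write_config(path, settings):
    """Write the rendered settings to a plain text config file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(format_config(settings))
```

For example, write_config("config", {"textord_min_linesize": 3.25}) would produce a one-line file that can be named as the last argument on the tesseract command line.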

The image used in this evaluation can be downloaded from http://www.archive.org/download/mobot31753003515068/page/n48

Tesseract 3.02 was used for all tests.  In all cases, the settings tested and analyzed here have worked as described for the small number of images in our test set.  There is no guarantee that they will work the same for all, or even a majority, of scans.  At best, the following information should be used to direct tests on your own images.

Configuration Settings for Capturing Debugging Output

debug_file tesseract.log

Writes debugging information to the named log file.

tessedit_write_images true

Internally, Tesseract converts the image being processed to a TIF; this setting writes that TIF to disk.

tessedit_create_hocr 1

Writes the output, including coordinate information, to an HTML file instead of to the standard text file.  The coordinate information can be particularly helpful.  These coordinates are for lines and words, which can be easier to work with than some other "box" settings that produce coordinates by letter.
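To work with those coordinates, the HTML file can be parsed with a few lines of Python.  The sketch below assumes the hOCR convention of storing each box in a title attribute of the form "bbox x0 y0 x1 y1"; the sample fragment and its class names are written by hand for illustration and are not taken from real Tesseract output.

```python
import re
from html.parser import HTMLParser

BBOX_PATTERN = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class BBoxCollector(HTMLParser):
    """Collect (class name, bounding box) pairs from hOCR elements."""

    def __init__(self):
        super().__init__()
        self.boxes = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        match = BBOX_PATTERN.search(attributes.get("title") or "")
        if match:
            # Coordinates are left, top, right, bottom in image pixels.
            box = tuple(int(number) for number in match.groups())
            self.boxes.append((attributes.get("class"), box))


def extract_boxes(hocr_text):
    """Return every (class name, box) pair found in the hOCR text."""
    collector = BBoxCollector()
    collector.feed(hocr_text)
    return collector.boxes


# Hypothetical hOCR fragment, for illustration only.
SAMPLE = (
    "<span class='ocr_line' title='bbox 10 20 300 40'>"
    "<span class='ocr_word' title='bbox 10 20 90 40'>provincial</span>"
    "</span>"
)
```

Comparing the line boxes from two runs is a quick way to see whether a setting moved or split a line, which is the "broken line" problem examined in the examples that follow.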

Example 1: Baseline Output

Command

tesseract image.jpg outputfilename

Command Line Arguments

None

Config Settings

None

Notes
  • This is the baseline output of Tesseract.
  • The lines in the output that are highlighted in red should be a single line reading "provincial towns, and in America.  A single sale has contained as many as".  The 2nd half of the line has been mis-located.
  • The words highlighted in blue include extra characters that are a result of "noise" (specks and imperfections in the image).
  • It is not clear where the words at the end of the page (highlighted in green) come from.
Output

-43 THE ORCHID REVIEW. [AUOUs’r, I921.

occasions an Eastern grass arrived dead, but the following attempt met
-with success. As many as 23 collectors were at one time employed in
different parts of the world- Auction sales were held in London, the largest
provincial towns, and in America.
.20,ooo Orchids.

Having completed the St. Albans’ nursery, a branch business was
organised at Summit, New Jersey, U.S.A., and placed under the manage-

-ment of Mr. Fostermann, a former collector, and later under Mr. A.

A single sale has contained as many as

Dimmoek; Eventually, when found to be too far-‘distant,=it wasacquired by ‘

Messrs. Lager & Hurrell, who still maintain it as an Orchid nursery. In
1886 the authoritative work, “Reichenbachz’a,” with life-size coloured
illustrations, was published by Mr. Sander, many of the articles being
‘personally supervised by him.

Still restless, Mr. Sander, in 1894, commenced building a nursery at St.
André, Bruges, Belgium. In 1914 this had developed into an enormous
concern with 250 houses, about 50 of which were devoted to Orchids, the
culture of Vanda coerulea, Phalaenopses, Dendrobium superbiens, Laelia
~Gouldiana and Cymbidium Sanderi being especially successful, while great
strides were made in the growing and breeding of Odontoglossums. The
-Orchid section alone formed a comparatively large nursery, but the
remaining houses covered more than four times the area, and contained huge
quantities of palms, azaleas, dracaenas, and araucarias, while outside a
stock of 30,000 trimmed bays, from miniatures to giants, rhododendrons,
hardy azaleas, begonias, etc., were grown. His collectors penetrated all
parts of the globe to which Orchids are indigenous, and the two nurseries
acted as clearing houses to all countries.

Mr. Sanders successes at the leading European and American Horti-
-cultural Exhibitions gained for him a world-wide reputation. He was one of
the original holders of the Victorian Medal of Horticulture, and held several
foreign orders, including that of the Croen of Belgium. He was a baron of
the Russian Empire, and as head of the firm he won the French President’s
Prix d’Honneur in Paris, the Veitchian Cup in 1906, the Coronation
Challenge Cup in 1913, 4f Gold Medals, 24 Silver Cups, as well as Medals
and Diplomas by the hundred. In International Exhibitions at London,
Edinburgh, Brussels, Antwerp, Ghent, Paris, Petrograd, Moscow, Florence,
Milan, New York, Chicago and St. Louis he won Gold Medals and highest
awards. At each Ghent quinquennial his new plants were amongst the
leading attractions to horticulturists from all parts of the world. His
personality was felt by all who came in contact with him. Genial to all,
and enthusiastic where plants were concerned, it was a pleasure to speak
with him on Orchids, and especially interesting and instructive when he
could be induced to give personal reminiscences of the struggles to obtain

.3,
9.-
4:?
‘ii

 

Example 2: PageSegMode

Command

tesseract image.jpg outputfilename -psm 6

Command Line Arguments

-psm 6

Config Settings

None

Notes
  • The command line argument -psm stands for PageSegMode (Page Segmentation Mode).  It directs the layout analysis that Tesseract performs on the page.  By default, Tesseract fully automates the page segmentation, but does not perform orientation and script detection.  A value of 6 directs Tesseract to assume a single uniform block of text.
  • The text in the output that is highlighted in red is now correctly contained on a single line.
  • The words highlighted in blue include extra characters that are a result of "noise" (specks and imperfections in the image).  None of these have been corrected; in fact, a few new ones appear.
  • There are no longer extra lines between paragraphs.  However, those lines do not actually appear on the source image either.
  • The garbage words at the end of the page no longer appear.
  • A small number of errors in individual words that appear in the original output were corrected, and a few other incorrect words changed (but were still incorrect).
  • -psm 6 can produce very poor output.  For example, processing of the image found at http://www.archive.org/download/mobot31753002262522/page/n1 results in a huge amount of garbage output, instead of just a few lines of text.
  • Pages with images produce poor output when using PageSegMode 6.  Because this parameter instructs Tesseract to treat everything as a single block of text, images are not recognized as images, and are instead processed as text (resulting in lots of garbage in the OCR).
  • Another disadvantage of using PageSegMode 6 is that text on "rotated" pages is not recognized.  In its normal mode, Tesseract is able to automatically normalize the page orientation and detect words.
  • In addition, PageSegMode 6 results in very poor results from multi-column pages.  Again, in its normal mode, Tesseract does a decent job of processing columns of text.
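Because -psm 6 helps some pages and hurts others, it can be convenient to build the command per page from a script.  The following is a sketch only: the -psm flag and the trailing config file argument are taken from the commands in this article, while the function names are hypothetical and run_tesseract assumes the tesseract executable is on the PATH.

```python
import subprocess


def build_command(image, output_base, psm=None, config=None):
    """Build an argument list for the tesseract commands used in this article."""
    command = ["tesseract", image, output_base]
    if psm is not None:
        command += ["-psm", str(psm)]
    if config is not None:
        # The config file name goes last, as in the examples here.
        command.append(config)
    return command


def run_tesseract(image, output_base, psm=None, config=None):
    """Run tesseract and raise an error if it exits with a failure code."""
    subprocess.run(build_command(image, output_base, psm, config), check=True)
```

Leaving psm as None reproduces the baseline command, so the same script can process rotated, multi-column, or illustrated pages in the default mode and reserve -psm 6 for plain single-block pages.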
Output

.48 THE ORCHID REVIEW. [Avcvs-.r. 1921- ’
occasions an Eastern grass arrived dead, but the following attempt met
-with success. As many as 23 collectors were at one time employed in
different parts of the world- Auction sales were held in London, the largest
provincial towns, and in America. A single sale has contained as many as
.20,ooo Orchids.
Having completed the St. Albans’ nursery, a branch business was
organised at Summit, New Jersey, U.S.A., and placed under the manage-
-ment of Mr. Fostermann, a former collector, and later under Mr. A.
Dimmoek; Eventually, when found to be too far-‘distant,=it wasiacquired by ‘
Messrs. Lager & Hurrell, who still maintain it as an Orchid nursery. In
1886 the authoritative work, “Reichenbachz’a,” with life-size coloured
illustrations, was published by Mr. Sander, many of the articles being
‘personally supervised by
him. ‘
Still restless, Mr. Sander, in 1894, commenced building a nursery at St.
André, Bruges, Belgium. In 1914 this had developed into an enormous
concern with 250 houses, about 50 of which were devoted to Orchids, the
culture of Vanda coerulea, Phalaenopses, Dendrobium superbiens, Laelia
~Gouldiana and Cymbidium Sanderi being especially successful, while great
strides were made in the growing and breeding of Odontoglossums. The
-Orchid section alone formed a comparatively large nursery, but the
remaining houses covered more than four times the area, and contained huge
quantities of palms, azaleas, dracaenas, and araucarias, while outside a
stock of 30,000 trimmed bays, from miniatures to giants, rhododendrons,
hardy azaleas, begonias, etc., were grown. His collectors penetrated all
parts of the globe to which Orchids are indigenous, and the two nurseries
acted as clearing houses to all countries.
Mr. Sanders successes at the leading European and American Horti-
-cultural Exhibitions gained for him a world-wide reputation. He was one of
the original holders of the Victorian Medal of Horticulture, and held several
foreign orders, including that of the Croen of Belgium. He was a baron
of .
the Russian Empire, and as head of the firm he won the French President’s
Prix d’Honneur in Paris, the Veitchian Cup in 1906, the Coronation
Challenge Cup in 1913, 4f Gold Medals, 24 Silver Cups, as well as Medals
and Diplomas by the hundred. In International Exhibitions at London,
Edinburgh, Brussels, Antwerp, Ghent, Paris, Petrograd, Moscow, Florence,
Milan, New York, Chicago and St. Louis he won Gold Medals and highest
awards. At each Ghent quinquennial his new plants were amongst the
leading attractions to horticulturists from all parts of the world. His
personality was felt by all who came in contact with him. Genial to all,
and enthusiastic where plants were concerned, it was a pleasure to speak
with him on Orchids, and especially interesting and instructive when he
could be induced to give personal reminiscences of the struggles to obtain

 

Example 3: Line Size

Command

tesseract image.jpg outputfilename config

Command Line Arguments

None

Config Settings

textord_min_linesize 3.25

Notes
  • textord_min_linesize seems to have an effect on the line heights detected by Tesseract when it performs the layout analysis on the image.  The default value for this setting is 1.25.
  • When set to 3.25, the "broken" line problem in the original baseline output is corrected.  Lower settings (for example, 3.0) do not correct the "broken" lines.
  • This setting causes other character recognition errors.
  • The text in the output that is highlighted in red is again correctly contained on a single line.
  • The words highlighted in blue include extra characters that are a result of "noise" (specks and imperfections in the image).  None of these have been corrected, but no new ones have appeared.
  • Lines between "paragraphs" now appear in somewhat odd locations.  Again, there are NO lines between paragraphs on the source image.
  • The garbage words at the end of the page do not appear.
  • A small number of errors in individual words that appear in the original output were corrected, a few other incorrect words changed (but were still incorrect), and a small number of correct words are now incorrect.  These have been highlighted in purple.
Output

.48 THE ORCHID REVIEW. [AUGUsT, 1921.

occasions an Eastern grass arrived dead, but the following attempt met
‘with success. As many as 23 collectors were at one time employed in

different parts of the world- Auction sales were held in London, the largest
provincial towns, and in America. A single sale has contained as many as
.20,00o Orchids.

Having completed the St. Albans’ nursery, a branch business was
organised at Summit, New Jersey, U.S.A., and placed under the manage-
-ment of Mr. Fostermann, a former collector, and later under Mr. A.
Dimmook; Eventually, when found to be too far-‘distant,=it wasiacqui d b ‘

I

Messrs. Lager & Hurrell, who still maintain it as an Orchid nurser: 1
1886 the authoritative work, “Reichenbachz’a,” with life-size coloured
illustrations, was published by Mr. Sander, many of the articles being
‘personally supervised by him.

Still restless, Mr. Sander, in 1894, commenced building a nursery at St.
André, Bruges, Belgium. In 1914 this had developed into an enormous i
concern with 250 houses, about 50 of which were devoted to Orchids, the
culture of Vanda coerulea, Phalaenopses, Dendrobium superbiens, Laelia
~Gouldiana and Cymbidium Sanderi being especially successful, while great
strides were made in the growing and breeding of Odontogl sssss s. The
-Orchid section alone formed a comparatively large nursery, but the
remaining houses covered more than four times the area, and contained huge ;
quantities of palms, azaleas, dracaenas, and araucarias, while outside a
stock of 30,000 trimmed bays, from miniatures to giants, rhododendrons,
hardy azaleas, begonias, etc., were grown. His collectors penetrated all
parts of the globe to which Orchids are indigenous, and the two nurseries
acted as clearing houses to all countries.

Mr. Sanders successes at the leading European and American Horti-
-cultural Exhibitions gained for him a world-wide reputation. He was one of
the original holders of the Victorian Medal of Horticulture, and held several
foreign orders, including that of the Croen of Belgium. He was a baron of
the Russian Empire, and as head of the firm he won the French President’s 3
Prix d’Honneur in Paris, the Veitchian Cup in 1906, the Coronation
Challenge Cup in 1913, 4f Gold Medals, 24 Silver Cups, as well as Medals
and Diplomas by the hundred. In International Exhibitions at London,

Edinburgh, Brussels, Antwerp, Ghent, Paris, Petrograd, Moscow, Florence, fi
Milan, New York, Chicago and St. Louis he won Gold Medals and highest
awards. At each Ghent quinquennial his new plants were amongst the
leading attractions to horticulturists from all parts of the world. His

personality was felt by all who came in contact with him. Genial to all,
and enthusiastic where plants were concerned, it was a pleasure to speak
with him on Orchids, and especially interesting and instructive when he i
could be induced to give personal reminiscences of the struggles to obtain

 

Example 4: Noise Reduction

Command

tesseract image.jpg outputfilename -psm 6 config

Command Line Arguments

-psm 6

Config Settings

textord_heavy_nr 1

Notes
  • Note that for this test, the PageSegMode command line parameter was used in conjunction with the configuration setting, and PageSegMode was responsible for the elimination of the “broken” lines in the output.
  • textord_heavy_nr instructs Tesseract to vigorously remove noise from the output.
  • Did a good job of removing noise from the results (highlighted in blue), BUT it also removed many valid punctuation and diacritic marks (highlighted in green), including ALL periods.
  • A small number of errors in individual words that appear in the original output were corrected, a few other incorrect words changed (but were still incorrect), and a small number of correct words are now incorrect.  These have been highlighted in purple.
  • Introduced three extra blank lines.
  • In total, more errors were introduced than were corrected with this setting.
Output

48 THE ORCHID REVIEW [Aucvsr I921
occasions an Eastern grass arrived dead, but the following attempt met
with success As many as 23 collectors were at one time employed in i
different parts of the world Auction sales were held in London, the largest
provincial towns, and in America A single sale has contained as many as
20,000 Orchids
Having completed the St Albans’ nursery, a branch business was
organised at Summit, New Jersey, U S A , and placed under the manage it
ment of Mr Fostermann, a former collector, and later under Mr A
Dimmock Eventually, when found to be too far distant, it was acquired by 4.
Messrs Lager & Huriell, who still maintain it as an Orchid nursery In
1886 the authoritative work, “Rezchenbachm,” with life size coloured
illustrations, was published by Mr Sander, many of the articles being
personally supervised by him
Still restless, Mr Sander, in 1894, commenced building a nursery at St
Andre
, Bruges, Belgium In 1914 this had developed into an enormous
concern with 250 houses, about 50 of which were devoted to Orchids, the
culture of Vanda coerulea, Phalaenopses, Dendrobium superbiens, Laelia
Gouldiana and C3 mbidium Sanderi being especially successful, while great
strides were made in the growing and breeding of Odontoglossums The
Orchid section alone formed a compir 1tlV€ly large nurser}, but the
remaining hou es covered more than four times the area, and contained huge
quantities of palms, azaleas dracaenas, and araucarias, while outside a
stock of 30,000 trimmed bays, from miniatures to giants, rhododendrons,
hardy azaleas begonias, etc , were grown His collectors penetrated all
parts of the globe to which Orchids are indigenous, and the two nurseries
acted as clearing houses to all countries
Mr Sanders successes at the leading European and American Horti

cultural Exhibitions gained for him a world wide reputation He was one of
the original holders of the Victorian Medal of Horticulture, and held several
foreign orders, including that of the Croen of Belgium He was a baron of

the Russian hmpire, and as head of the firm he won the French President’s

Pnx d’Honneur in Paris, the Veitchian Cup in 1906, the Coronation
Challenge Cup in 1913, 41 Gold Medals, 24 Silver Cups, as well as Medals
and Diplomas by the hundred In International Exhibitions at London,
Edinburgh, Brussels, Antwerp, Ghent, Paris, Petrograd, Moscow, Florence, i
Milan, New York, Chicago and St Louis he won Gold Medals and highest
awards At each Ghent quinquennial his new plants were amongst the
leading attractions to horticulturists from all parts of the world His ?
personality was felt by all who came in contact with him Genial to all,
and enthusiastic where plants were concerned, it was a pleasure to speak
with him on Orchids, and especially interesting and instructive when he
could be induced to give personal reminiscences of the struggles to obtain

 

Other Parameters

The following additional parameters were evaluated, but had little to no effect on the output.

tessedit_word_for_word 1

According to documentation within the source code, this setting "Make(s) output have exactly one word per WERD".

textord_space_size_is_variable 1

This setting makes Tesseract assume that spaces have variable width, even though characters have fixed pitch.

textord_max_noise_size
textord_noise_area_ratio
speckle_large_max_size
speckle_small_penalty
speckle_large_penalty
speckle_small_certainty

Each of these settings modifies how Tesseract identifies and handles noise (non-character markings) on the page.  It was possible to get Tesseract to produce different… but not better… outputs by modifying these settings.  It is possible that I simply did not find the right combination of values to produce actual improvements in the output.

Conclusion

It seems clear from these tests that processing at least some documents with Tesseract OCR requires careful tuning, and that such tuning is not always a simple task.  In fact, a bit of trial-and-error may be needed to produce the desired results.

Ultimately, to get the best results from Tesseract OCR when tuning is required, one of two things needs to be true.  Either the tuning needs to be done for each book (or page) to be processed, or the books/pages to be processed should be of similar quality/characteristics (so that tuning can be done once for the entire workload).
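The trial-and-error can be made a little more systematic by generating one config file per candidate value and running Tesseract once with each.  The sketch below only produces the file names and contents; the naming scheme and the candidate values are my own choices, picked around the 3.0 and 3.25 settings discussed in Example 3.

```python
def sweep_configs(name, values):
    """Return {config file name: config file text} for each candidate value."""
    configs = {}
    for value in values:
        # Hypothetical naming scheme: config_<setting>_<value>
        file_name = f"config_{name}_{value}"
        configs[file_name] = f"{name} {value}\n"
    return configs


# Candidate values to try; these are illustrative, not recommendations.
CANDIDATES = [3.0, 3.25, 3.5]
```

Each entry can be written to disk and passed as the config argument on the command line, and the resulting outputs compared against the baseline for a handful of representative pages before committing to a setting for the whole workload.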

Ligatures in Tesseract OCR Output

Tesseract is an open source OCR engine.  It has been open source since 2005, and development on the engine has been sponsored by Google since 2006.  It is a command line tool, although there are separate projects that provide a GUI.  More information about Tesseract can be found here.

The following advice is known to apply to Tesseract version 3.02, but likely also applies to later versions.

While using Tesseract, one curiosity that I noticed is that it frequently outputs ligatures such as “ﬁ” and “ﬂ” rather than individual letters (“f” followed by “i” and “f” followed by “l”).  To a human reading the OCR output, this is no problem, as there is little difference to the naked eye between the ligatures and “normal” characters.  However, any post-processing or machine validation of the output can be affected by the presence of the ligatures.

There are a couple of ways to eliminate the ligatures from the output.  First, a directive can be added to the Tesseract configuration file.  The configuration file is just a plain text file containing space-delimited key/value pairs for Tesseract config variables, each pair on a separate line.  So, to direct Tesseract to “blacklist”, or not use, specific ligatures, add something like the following to the configuration file:

tessedit_char_blacklist    ﬁﬂ

In the previous example, replace the ﬁ and ﬂ with the exact ligatures you want Tesseract to not use.  The list of common Latin ligatures shown here can be found at http://www.unicode.org/charts/PDF/UFB00.pdf:
 

ﬀ (U+FB00) LATIN SMALL LIGATURE FF
ﬁ (U+FB01) LATIN SMALL LIGATURE FI
ﬂ (U+FB02) LATIN SMALL LIGATURE FL
ﬃ (U+FB03) LATIN SMALL LIGATURE FFI
ﬄ (U+FB04) LATIN SMALL LIGATURE FFL
ﬅ (U+FB05) LATIN SMALL LIGATURE LONG S T
ﬆ (U+FB06) LATIN SMALL LIGATURE ST
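Since the ligature characters are awkward to type, the blacklist line can be generated from their Unicode code points instead.  This is a small sketch: the function name is my own, the code points are the seven ligatures U+FB00 through U+FB06, and I am assuming the config file should be saved as UTF-8.

```python
# The seven Latin ligatures, U+FB00 through U+FB06.
LIGATURE_CODE_POINTS = range(0xFB00, 0xFB07)


def blacklist_line(code_points=LIGATURE_CODE_POINTS):
    """Build a tessedit_char_blacklist config line from code points."""
    characters = "".join(chr(point) for point in code_points)
    return f"tessedit_char_blacklist {characters}"
```

Calling blacklist_line([0xFB01, 0xFB02]) yields a line that blacklists only the two most common ligatures.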

Another way to remove ligatures from Tesseract output is simply to post-process the output (using whatever tool or programming language you prefer), replacing the ligatures with the appropriate characters.  This may actually be the better approach.  In my own tests, I obtained more accurate final outputs by post-processing the Tesseract output.  This blog post also suggests that post-processing is the better approach.
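As an illustration of the post-processing approach, here is a minimal Python sketch that replaces each ligature with its component letters.  The function name is hypothetical, and mapping the long s ligature (U+FB05) to a modern "st" is my own choice; use "ſt" instead if the long s should be preserved.

```python
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long s + t, modernized here (an assumption)
    "\ufb06": "st",
}

# Translation table mapping each ligature to its replacement string.
LIGATURE_TABLE = str.maketrans(LIGATURES)


def replace_ligatures(text):
    """Replace Latin ligature characters with their component letters."""
    return text.translate(LIGATURE_TABLE)
```

An explicit table is used rather than a general Unicode normalization so that only the ligatures change and any other characters in the OCR output are left exactly as Tesseract produced them.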