hOCRImageMapper: A Tool For Visualizing hOCR Files

Just uploaded to GitHub (https://github.com/mlichtenberg/hocrimagemapper), this simple application provides a way to visualize hOCR output.

Per Wikipedia: "hOCR is an open standard of data representation for formatted text obtained from optical character recognition (OCR). The definition encodes text, style, layout information, recognition confidence metrics and other information using Extensible Markup Language (XML) in form of Hypertext Markup Language (HTML) or XHTML."

hOCR is produced by the Tesseract, Cuneiform, and OCRopus OCR software.  My motivation for creating this tool was a need to analyze hOCR output produced by Tesseract.

The tool is a simple WinForms application written in C# (yeah, I know, but it was quick).

When using the application, the text contained in an hOCR file is loaded alongside the image that is the source of the OCR output.  Hovering over a word in the text highlights the word in the image. 

[Figure: Hovering over the word “quantitative” in the left panel highlights the word in the source image on the right.]
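
For anyone curious how a word gets tied to a region of the image, here is a minimal sketch in Python (not the tool's actual C# code). It assumes the hOCR comes from Tesseract, which marks each word as a `span` with class `ocrx_word` and puts `bbox x0 y0 x1 y1` in the `title` attribute. The sample markup, its values, and the names `WordBoxParser` and `extract_words` are made up for illustration.

```python
from html.parser import HTMLParser
import re

# Illustrative hOCR fragment in the shape Tesseract emits; the values are made up.
SAMPLE_HOCR = """
<span class='ocr_line' title='bbox 100 540 1200 600'>
  <span class='ocrx_word' title='bbox 513 540 846 600; x_wconf 95'>quantitative</span>
  <span class='ocrx_word' title='bbox 870 542 1010 600; x_wconf 93'>analysis</span>
</span>
"""

BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class WordBoxParser(HTMLParser):
    """Collects (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word span currently open, if any
        self._text = []     # text fragments seen inside the open word span
        self._depth = 0     # nesting depth of tags inside the open word span

    def handle_starttag(self, tag, attrs):
        if self._bbox is not None:
            # A nested tag (e.g. <strong>) inside a word: just track depth.
            self._depth += 1
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []
                self._depth = 0

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._bbox is None:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        # Closing tag of the word span itself: record the word and its box.
        self.words.append(("".join(self._text).strip(), self._bbox))
        self._bbox = None


def extract_words(hocr_text):
    """Return a list of (word, bbox) pairs found in an hOCR document."""
    parser = WordBoxParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


words = extract_words(SAMPLE_HOCR)
for text, bbox in words:
    print(text, bbox)
```

With a list like this in hand, a viewer only needs to look up the box for the word under the mouse and draw a rectangle at those pixel coordinates on the image.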

Clicking a word in the text displays the coordinates of the bounding box used to highlight the word. (This bounding box is extracted from the hOCR output.) The coordinates are displayed as two X-Y pairs that represent the upper left and lower right corners of the bounding box.

[Figure: Clicking the word displays its coordinates. In this case, the X-Y pairs are (513, 540) for the upper left and (846, 600) for the lower right.]
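
To make the coordinates concrete: hOCR stores a box as `bbox x0 y0 x1 y1`, in pixels measured from the top-left corner of the image, so y grows downward. Drawing code usually wants a position plus a width and height instead. The following is a rough Python sketch of that conversion (again, not the tool's C# code); `BoundingBox` and `parse_bbox` are hypothetical names, and the `x_wconf 95` in the sample string is made up.

```python
from typing import NamedTuple


class BoundingBox(NamedTuple):
    x0: int  # left edge
    y0: int  # top edge
    x1: int  # right edge
    y1: int  # bottom edge

    @property
    def width(self):
        return self.x1 - self.x0

    @property
    def height(self):
        return self.y1 - self.y0


def parse_bbox(title):
    """Pull the bbox property out of an hOCR title attribute value."""
    # Properties in a title attribute are separated by semicolons.
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return BoundingBox(*(int(p) for p in parts[1:]))
    raise ValueError("no bbox property in: %r" % title)


box = parse_bbox("bbox 513 540 846 600; x_wconf 95")
print((box.x0, box.y0), (box.x1, box.y1))  # (513, 540) (846, 600)
print(box.width, box.height)               # 333 60
```

So the box from the example above starts at (513, 540) and is 333 pixels wide and 60 pixels tall.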

The source code can be downloaded from the GitHub repository, or the compiled executable can be downloaded directly.
