hOCRImageMapper: A Tool For Visualizing hOCR Files
December 23, 2014 Leave a comment
Just uploaded to GitHub (https://github.com/mlichtenberg/hocrimagemapper), this simple application provides a way to visualize hOCR output.
Per Wikipedia: "hOCR is an open standard of data representation for formatted text obtained from optical character recognition (OCR). The definition encodes text, style, layout information, recognition confidence metrics and other information using Extensible Markup Language (XML) in form of Hypertext Markup Language (HTML) or XHTML."
hOCR is produced by the Tesseract, Cuneiform, and OCRopus OCR software. My motivation for creating this tool was a need to analyze hOCR output produced by Tesseract.
This application has been implemented as a simple WinForms application (yeah, I know, but it was quick) written in C#.
When using the application, the text contained in an hOCR file is loaded alongside the image that is the source of the OCR output. Hovering over a word in the text highlights the word in the image.
Clicking a word in the text displays the coordinates for the bounding box used to highlight the word. (This bounding box is extracted from the hOCR output). The coordinates are displayed as two pairs of X-Y coordinates that represent the upper right and lower left corners of the bounding box.