Tuning Tesseract OCR

Background

Tesseract is an open-source tool for generating OCR (Optical Character Recognition) output from digital images of text.  It has been around for a long time, and the project is currently "owned" by Google.  Tesseract is still in development, but its last official release is more than two years old.

More information about Tesseract OCR can be found here and here.

In general, Tesseract does a good job with clean, predictably-formatted pages of text.  Pages with unusual typefaces or formatting are more challenging.  Tesseract does provide a huge number of parameters that can be used to tune the output and improve its accuracy.  Unfortunately, most (if not all) of those parameters are minimally documented.

While attempting to tune Tesseract for a project involving scans of old (pre-1923) life sciences texts, I found a few specific problems:

  • In some OCR outputs, entire lines or parts of lines were located in the wrong place on the page.
  • In general, Tesseract is difficult to tune.
  • The configurable settings are poorly documented and not named in a manner that eases use.
  • Some tuning options are too aggressive.
  • Some tuning options fix one problem while causing another.

Following are descriptions and examples of a few of the tuning parameters that are available, highlighting the pros and cons of each.  These parameters produced obvious differences in the OCR output.  A number of other parameters were evaluated, but produced no, or very minimal, differences.

Notes/Disclaimers About the Tests

In all of the following examples, "config" is the name of the file containing the configuration settings.  A Tesseract config file is just a plain text file containing space-delimited key/value pairs for Tesseract config variables, each on a separate line.  There are several standard config files in the tessdata/configs folder of a standard Tesseract installation.
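As a minimal sketch of that file format, the following Python snippet renders a dictionary of settings as config lines and assembles the matching command line.  The helper names (render_config and build_command) are hypothetical and are not part of Tesseract; the file names and the -psm spelling are the ones used in the examples in this post.

```python
def render_config(settings):
    """Render settings as space-delimited key/value pairs, one per line."""
    return "".join(f"{key} {value}\n" for key, value in settings.items())


def build_command(image, output_base, config_path=None, psm=None):
    """Build a Tesseract 3.02-style argument list (hypothetical helper)."""
    command = ["tesseract", image, output_base]
    if psm is not None:
        command += ["-psm", str(psm)]  # the option spelling used in this post
    if config_path is not None:
        command.append(str(config_path))  # config file name goes last
    return command


if __name__ == "__main__":
    # Contents that would be saved as the plain text file named "config".
    print(render_config({"textord_min_linesize": 3.25}), end="")
    # Argument list that could be passed to subprocess.run().
    print(build_command("image.jpg", "outputfilename", "config"))
```

Keeping the settings in a dictionary makes it easy to generate one config file per experiment while testing different values.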

The image used in this evaluation can be downloaded from http://www.archive.org/download/mobot31753003515068/page/n48

Tesseract 3.02 was used for all tests.  In all cases, the settings tested and analyzed here have worked as described for the small number of images in our test set.  There is no guarantee that they will work the same for all, or even a majority, of scans.  At best, the following information should be used to direct tests on your own images.

Configuration Settings for Capturing Debugging Output

debug_file tesseract.log

Writes debugging information to the named log file.

tessedit_write_images true

Internally, Tesseract converts the image being processed to a TIF; this setting writes that TIF to disk.

tessedit_create_hocr 1

Writes the output, including coordinate information, to an HTML file instead of to the standard text file.  The coordinate information can be particularly helpful.  These coordinates are for lines and words, which can be easier to work with than some other "box" settings that produce coordinates by letter.
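To show how those coordinates can be used, here is a minimal sketch that pulls word-level bounding boxes out of hOCR markup using only the Python standard library.  It assumes each word is a span whose class is "ocrx_word" and whose title attribute contains "bbox x0 y0 x1 y1"; the exact class names and nesting can vary between Tesseract versions, so check them against your own output.  The SAMPLE markup is hand-written for illustration and is not real Tesseract output.

```python
import re
from html.parser import HTMLParser

BBOX_PATTERN = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordBoxParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for each word-level span."""

    def __init__(self, word_class="ocrx_word"):
        super().__init__()
        self.word_class = word_class  # assumed class name for word spans
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any
        self._depth = 0    # nesting depth of <span> tags inside that word
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._bbox is not None:
            self._depth += 1  # a span nested inside the current word
            return
        attributes = dict(attrs)
        match = BBOX_PATTERN.search(attributes.get("title") or "")
        classes = (attributes.get("class") or "").split()
        if match and self.word_class in classes:
            self._bbox = tuple(int(n) for n in match.groups())
            self._depth = 1
            self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != "span" or self._bbox is None:
            return
        self._depth -= 1
        if self._depth == 0:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


# Hand-written sample markup (an assumption, not actual Tesseract output).
SAMPLE = (
    "<span class='ocr_line' title='bbox 100 540 900 600'>"
    "<span class='ocrx_word' title='bbox 513 540 846 600'>quantitative</span>"
    "</span>"
)

parser = WordBoxParser()
parser.feed(SAMPLE)
print(parser.words)  # [('quantitative', (513, 540, 846, 600))]
```

Passing a different word_class to WordBoxParser adapts the sketch to hOCR files that label words differently.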

Example 1: Baseline Output

Command

tesseract image.jpg outputfilename

Command Line Arguments

None

Config Settings

None

Notes
  • This is the baseline output of Tesseract.
  • The lines in the output that are highlighted in red should be a single line reading "provincial towns, and in America.  A single sale has contained as many as".  The 2nd half of the line has been mis-located.
  • The words highlighted in blue include extra characters that are a result of "noise" (specks and imperfections in the image).
  • It is not clear where the words at the end of the page (highlighted in green) come from.
Output

-43 THE ORCHID REVIEW. [AUOUs’r, I921.

occasions an Eastern grass arrived dead, but the following attempt met
-with success. As many as 23 collectors were at one time employed in
different parts of the world- Auction sales were held in London, the largest
provincial towns, and in America.
.20,ooo Orchids.

Having completed the St. Albans’ nursery, a branch business was
organised at Summit, New Jersey, U.S.A., and placed under the manage-

-ment of Mr. Fostermann, a former collector, and later under Mr. A.

A single sale has contained as many as

Dimmoek; Eventually, when found to be too far-‘distant,=it wasacquired by ‘

Messrs. Lager & Hurrell, who still maintain it as an Orchid nursery. In
1886 the authoritative work, “Reichenbachz’a,” with life-size coloured
illustrations, was published by Mr. Sander, many of the articles being
‘personally supervised by him.

Still restless, Mr. Sander, in 1894, commenced building a nursery at St.
André, Bruges, Belgium. In 1914 this had developed into an enormous
concern with 250 houses, about 50 of which were devoted to Orchids, the
culture of Vanda coerulea, Phalaenopses, Dendrobium superbiens, Laelia
~Gouldiana and Cymbidium Sanderi being especially successful, while great
strides were made in the growing and breeding of Odontoglossums. The
-Orchid section alone formed a comparatively large nursery, but the
remaining houses covered more than four times the area, and contained huge
quantities of palms, azaleas, dracaenas, and araucarias, while outside a
stock of 30,000 trimmed bays, from miniatures to giants, rhododendrons,
hardy azaleas, begonias, etc., were grown. His collectors penetrated all
parts of the globe to which Orchids are indigenous, and the two nurseries
acted as clearing houses to all countries.

Mr. Sanders successes at the leading European and American Horti-
-cultural Exhibitions gained for him a world-wide reputation. He was one of
the original holders of the Victorian Medal of Horticulture, and held several
foreign orders, including that of the Croen of Belgium. He was a baron of
the Russian Empire, and as head of the firm he won the French President’s
Prix d’Honneur in Paris, the Veitchian Cup in 1906, the Coronation
Challenge Cup in 1913, 4f Gold Medals, 24 Silver Cups, as well as Medals
and Diplomas by the hundred. In International Exhibitions at London,
Edinburgh, Brussels, Antwerp, Ghent, Paris, Petrograd, Moscow, Florence,
Milan, New York, Chicago and St. Louis he won Gold Medals and highest
awards. At each Ghent quinquennial his new plants were amongst the
leading attractions to horticulturists from all parts of the world. His
personality was felt by all who came in contact with him. Genial to all,
and enthusiastic where plants were concerned, it was a pleasure to speak
with him on Orchids, and especially interesting and instructive when he
could be induced to give personal reminiscences of the struggles to obtain

.3,
9.-
4:?
‘ii

 

Example 2: PageSegMode

Command

tesseract image.jpg outputfilename -psm 6

Command Line Arguments

-psm 6

Config Settings

None

Notes
  • The command line argument -psm stands for PageSegMode (Page Segmentation Mode).  It directs the layout analysis that Tesseract performs on the page.  By default, Tesseract fully automates the page segmentation, but does not perform orientation and script detection.  A value of 6 directs Tesseract to assume a single uniform block of text.
  • The text in the output that is highlighted in red is now correctly contained on a single line.
  • The words highlighted in blue include extra characters that are a result of "noise" (specks and imperfections in the image).  None of these have been corrected; in fact, a few new ones appear.
  • There are no longer extra lines between paragraphs.  However, those lines do not actually appear on the source image either.
  • The garbage words at the end of the page no longer appear.
  • A small number of errors in individual words that appear in the original output were corrected, and a few other incorrect words changed (but were still incorrect).
  • -psm 6 can produce very poor output.  For example, processing of the image found at http://www.archive.org/download/mobot31753002262522/page/n1 results in a huge amount of garbage output, instead of just a few lines of text.
  • Pages with images produce poor output when using PageSegMode 6.  Because this parameter instructs Tesseract to treat everything as a single block of text, images are not recognized as images, and are instead processed as text (resulting in lots of garbage in the OCR).
  • Another disadvantage of using PageSegMode 6 is that text on "rotated" pages is not recognized.  In its normal mode, Tesseract is able to automatically normalize the page orientation and detect words.
  • In addition, PageSegMode 6 produces very poor results from multi-column pages.  Again, in its normal mode, Tesseract does a decent job of processing columns of text.
Output

.48 THE ORCHID REVIEW. [Avcvs-.r. 1921- ’
occasions an Eastern grass arrived dead, but the following attempt met
-with success. As many as 23 collectors were at one time employed in
different parts of the world- Auction sales were held in London, the largest
provincial towns, and in America. A single sale has contained as many as
.20,ooo Orchids.
Having completed the St. Albans’ nursery, a branch business was
organised at Summit, New Jersey, U.S.A., and placed under the manage-
-ment of Mr. Fostermann, a former collector, and later under Mr. A.
Dimmoek; Eventually, when found to be too far-‘distant,=it wasiacquired by ‘
Messrs. Lager & Hurrell, who still maintain it as an Orchid nursery. In
1886 the authoritative work, “Reichenbachz’a,” with life-size coloured
illustrations, was published by Mr. Sander, many of the articles being
‘personally supervised by
him. ‘
Still restless, Mr. Sander, in 1894, commenced building a nursery at St.
André, Bruges, Belgium. In 1914 this had developed into an enormous
concern with 250 houses, about 50 of which were devoted to Orchids, the
culture of Vanda coerulea, Phalaenopses, Dendrobium superbiens, Laelia
~Gouldiana and Cymbidium Sanderi being especially successful, while great
strides were made in the growing and breeding of Odontoglossums. The
-Orchid section alone formed a comparatively large nursery, but the
remaining houses covered more than four times the area, and contained huge
quantities of palms, azaleas, dracaenas, and araucarias, while outside a
stock of 30,000 trimmed bays, from miniatures to giants, rhododendrons,
hardy azaleas, begonias, etc., were grown. His collectors penetrated all
parts of the globe to which Orchids are indigenous, and the two nurseries
acted as clearing houses to all countries.
Mr. Sanders successes at the leading European and American Horti-
-cultural Exhibitions gained for him a world-wide reputation. He was one of
the original holders of the Victorian Medal of Horticulture, and held several
foreign orders, including that of the Croen of Belgium. He was a baron
of .
the Russian Empire, and as head of the firm he won the French President’s
Prix d’Honneur in Paris, the Veitchian Cup in 1906, the Coronation
Challenge Cup in 1913, 4f Gold Medals, 24 Silver Cups, as well as Medals
and Diplomas by the hundred. In International Exhibitions at London,
Edinburgh, Brussels, Antwerp, Ghent, Paris, Petrograd, Moscow, Florence,
Milan, New York, Chicago and St. Louis he won Gold Medals and highest
awards. At each Ghent quinquennial his new plants were amongst the
leading attractions to horticulturists from all parts of the world. His
personality was felt by all who came in contact with him. Genial to all,
and enthusiastic where plants were concerned, it was a pleasure to speak
with him on Orchids, and especially interesting and instructive when he
could be induced to give personal reminiscences of the struggles to obtain

 

Example 3: Line Size

Command

tesseract image.jpg outputfilename config

Command Line Arguments

None

Config Settings

textord_min_linesize 3.25

Notes
  • textord_min_linesize seems to have an effect on the line heights detected by Tesseract when it performs the layout analysis on the image.  The default value for this setting is 1.25.
  • When set to 3.25, the "broken" line problem in the original baseline output is corrected.  Lower settings (for example, 3.0) do not correct the "broken" lines.
  • This setting causes other character recognition errors.
  • The text in the output that is highlighted in red is again correctly contained on a single line.
  • The words highlighted in blue include extra characters that are a result of "noise" (specks and imperfections in the image).  None of these have been corrected, but no new ones have appeared.
  • Lines between "paragraphs" now appear in somewhat odd locations.  Again, there are NO lines between paragraphs on the source image.
  • The garbage words at the end of the page do not appear.
  • A small number of errors in individual words that appear in the original output were corrected, a few other incorrect words changed (but were still incorrect), and a small number of correct words are now incorrect.  These have been highlighted in purple.
Output

.48 THE ORCHID REVIEW. [AUGUsT, 1921.

occasions an Eastern grass arrived dead, but the following attempt met
‘with success. As many as 23 collectors were at one time employed in

different parts of the world- Auction sales were held in London, the largest
provincial towns, and in America. A single sale has contained as many as
.20,00o Orchids.

Having completed the St. Albans’ nursery, a branch business was
organised at Summit, New Jersey, U.S.A., and placed under the manage-
-ment of Mr. Fostermann, a former collector, and later under Mr. A.
Dimmook; Eventually, when found to be too far-‘distant,=it wasiacqui d b ‘

I

Messrs. Lager & Hurrell, who still maintain it as an Orchid nurser: 1
1886 the authoritative work, “Reichenbachz’a,” with life-size coloured
illustrations, was published by Mr. Sander, many of the articles being
‘personally supervised by him.

Still restless, Mr. Sander, in 1894, commenced building a nursery at St.
André, Bruges, Belgium. In 1914 this had developed into an enormous i
concern with 250 houses, about 50 of which were devoted to Orchids, the
culture of Vanda coerulea, Phalaenopses, Dendrobium superbiens, Laelia
~Gouldiana and Cymbidium Sanderi being especially successful, while great
strides were made in the growing and breeding of Odontogl sssss s. The
-Orchid section alone formed a comparatively large nursery, but the
remaining houses covered more than four times the area, and contained huge ;
quantities of palms, azaleas, dracaenas, and araucarias, while outside a
stock of 30,000 trimmed bays, from miniatures to giants, rhododendrons,
hardy azaleas, begonias, etc., were grown. His collectors penetrated all
parts of the globe to which Orchids are indigenous, and the two nurseries
acted as clearing houses to all countries.

Mr. Sanders successes at the leading European and American Horti-
-cultural Exhibitions gained for him a world-wide reputation. He was one of
the original holders of the Victorian Medal of Horticulture, and held several
foreign orders, including that of the Croen of Belgium. He was a baron of
the Russian Empire, and as head of the firm he won the French President’s 3
Prix d’Honneur in Paris, the Veitchian Cup in 1906, the Coronation
Challenge Cup in 1913, 4f Gold Medals, 24 Silver Cups, as well as Medals
and Diplomas by the hundred. In International Exhibitions at London,

Edinburgh, Brussels, Antwerp, Ghent, Paris, Petrograd, Moscow, Florence, fi
Milan, New York, Chicago and St. Louis he won Gold Medals and highest
awards. At each Ghent quinquennial his new plants were amongst the
leading attractions to horticulturists from all parts of the world. His

personality was felt by all who came in contact with him. Genial to all,
and enthusiastic where plants were concerned, it was a pleasure to speak
with him on Orchids, and especially interesting and instructive when he i
could be induced to give personal reminiscences of the struggles to obtain

 

Example 4: Noise Reduction

Command

tesseract image.jpg outputfilename -psm 6 config

Command Line Arguments

-psm 6

Config Settings

textord_heavy_nr 1

Notes
  • Note that for this test, the PageSegMode command line parameter was used in conjunction with the configuration setting, and PageSegMode was responsible for the elimination of the “broken” lines in the output.
  • textord_heavy_nr instructs Tesseract to vigorously remove noise from the output.
  • This setting did a good job of removing noise from the results (highlighted in blue), BUT it also removed many valid punctuation and diacritic marks (highlighted in green), including ALL periods.
  • A small number of errors in individual words that appear in the original output were corrected, a few other incorrect words changed (but were still incorrect), and a small number of correct words are now incorrect.  These have been highlighted in purple.
  • Introduced three extra blank lines.
  • In total, more errors were introduced than were corrected with this setting.
Output

48 THE ORCHID REVIEW [Aucvsr I921
occasions an Eastern grass arrived dead, but the following attempt met
with success As many as 23 collectors were at one time employed in i
different parts of the world Auction sales were held in London, the largest
provincial towns, and in America A single sale has contained as many as
20,000 Orchids
Having completed the St Albans’ nursery, a branch business was
organised at Summit, New Jersey, U S A , and placed under the manage it
ment of Mr Fostermann, a former collector, and later under Mr A
Dimmock Eventually, when found to be too far distant, it was acquired by 4.
Messrs Lager & Huriell, who still maintain it as an Orchid nursery In
1886 the authoritative work, “Rezchenbachm,” with life size coloured
illustrations, was published by Mr Sander, many of the articles being
personally supervised by him
Still restless, Mr Sander, in 1894, commenced building a nursery at St
Andre
, Bruges, Belgium In 1914 this had developed into an enormous
concern with 250 houses, about 50 of which were devoted to Orchids, the
culture of Vanda coerulea, Phalaenopses, Dendrobium superbiens, Laelia
Gouldiana and C3 mbidium Sanderi being especially successful, while great
strides were made in the growing and breeding of Odontoglossums The
Orchid section alone formed a compir 1tlV€ly large nurser}, but the
remaining hou es covered more than four times the area, and contained huge
quantities of palms, azaleas dracaenas, and araucarias, while outside a
stock of 30,000 trimmed bays, from miniatures to giants, rhododendrons,
hardy azaleas begonias, etc , were grown His collectors penetrated all
parts of the globe to which Orchids are indigenous, and the two nurseries
acted as clearing houses to all countries
Mr Sanders successes at the leading European and American Horti

cultural Exhibitions gained for him a world wide reputation He was one of
the original holders of the Victorian Medal of Horticulture, and held several
foreign orders, including that of the Croen of Belgium He was a baron of

the Russian hmpire, and as head of the firm he won the French President’s

Pnx d’Honneur in Paris, the Veitchian Cup in 1906, the Coronation
Challenge Cup in 1913, 41 Gold Medals, 24 Silver Cups, as well as Medals
and Diplomas by the hundred In International Exhibitions at London,
Edinburgh, Brussels, Antwerp, Ghent, Paris, Petrograd, Moscow, Florence, i
Milan, New York, Chicago and St Louis he won Gold Medals and highest
awards At each Ghent quinquennial his new plants were amongst the
leading attractions to horticulturists from all parts of the world His ?
personality was felt by all who came in contact with him Genial to all,
and enthusiastic where plants were concerned, it was a pleasure to speak
with him on Orchids, and especially interesting and instructive when he
could be induced to give personal reminiscences of the struggles to obtain

 

Other Parameters

The following additional parameters were evaluated, but had little to no effect on the output.

tessedit_word_for_word 1

According to documentation within the source code, this setting "Make(s) output have exactly one word per WERD".

textord_space_size_is_variable 1

This setting makes Tesseract assume that spaces have variable width, even though characters have fixed pitch.

textord_max_noise_size
textord_noise_area_ratio
speckle_large_max_size
speckle_small_penalty
speckle_large_penalty
speckle_small_certainty

Each of these settings modifies how Tesseract identifies and handles noise (non-character markings) on the page.  It was possible to get Tesseract to produce different… but not better… outputs by modifying these settings.  It is possible that I simply did not find the right combination of values to produce actual improvements in the output.

Conclusion

It seems clear from these tests that processing at least some documents with Tesseract OCR requires careful tuning, and that such tuning is not always a simple task.  In fact, a bit of trial-and-error may be needed to produce the desired results.

Ultimately, to get the best results from Tesseract OCR when tuning is required, one of two things needs to be true.  Either the tuning needs to be done for each book (or page) to be processed, or the books/pages to be processed should be of similar quality/characteristics (so that tuning can be done once for the entire workload).

Ligatures in Tesseract OCR Output

Tesseract is an open source OCR engine.  It has been open source since 2005, and development on the engine has been sponsored by Google since 2006.  It is a command line tool, although there are separate projects that provide a GUI.  More information about Tesseract can be found here.

The following advice is known to apply to Tesseract version 3.02, but likely also applies to later versions.

While using Tesseract, one curiosity that I noticed is that it frequently outputs ligatures such as “ﬁ” and “ﬂ” rather than individual letters (“f” followed by “i” and “f” followed by “l”).  To a human reading the OCR output, this is no problem, as there is little difference to the naked eye between the ligatures and “normal” characters.  However, any post-processing or machine validation of the output can be affected by the presence of the ligatures.

There are a couple of ways to eliminate the ligatures from the output.  First, a directive can be added to the Tesseract configuration file.  The configuration file is just a plain text file containing space-delimited key/value pairs for Tesseract config variables, each pair on a separate line.  So, to direct Tesseract to “blacklist”, or not use, specific ligatures, add something like the following to the configuration file:

tessedit_char_blacklist    ﬁﬂ

In the previous example, replace ﬁ and ﬂ with the exact ligatures you want Tesseract to not use.  Note that the value must contain the single-character ligatures themselves, not the separate letters.  The list of common Latin ligatures shown here can be found at http://www.unicode.org/charts/PDF/UFB00.pdf:
 

ﬀ  U+FB00  LATIN SMALL LIGATURE FF
ﬁ  U+FB01  LATIN SMALL LIGATURE FI
ﬂ  U+FB02  LATIN SMALL LIGATURE FL
ﬃ  U+FB03  LATIN SMALL LIGATURE FFI
ﬄ  U+FB04  LATIN SMALL LIGATURE FFL
ﬅ  U+FB05  LATIN SMALL LIGATURE LONG S T
ﬆ  U+FB06  LATIN SMALL LIGATURE ST

Another way to remove ligatures from Tesseract output is simply to post-process the output (using whatever tool or programming language you prefer), replacing the ligatures with the appropriate characters.  This may actually be the better approach.  In my own tests, I obtained more accurate final outputs by post-processing the Tesseract output.  This blog post also suggests that post-processing is the better approach.
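A minimal sketch of the post-processing approach, written in Python, might look like the following.  It is not the code I used for my tests; the function name expand_ligatures is made up for illustration, and mapping the long-s ligature (U+FB05) to a modern "st" is an assumption that you may want to change for historical texts.

```python
# Map each Latin ligature in the U+FB00 block to its individual letters.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # LONG S T, modernized here to "st" (an assumption)
    "\ufb06": "st",
}
_TRANSLATION_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text):
    """Replace ligature characters with their individual letters."""
    return text.translate(_TRANSLATION_TABLE)


if __name__ == "__main__":
    sample = "\ufb01nd the \ufb02ower in the o\ufb03ce"
    print(expand_ligatures(sample))  # find the flower in the office
```

An explicit table is used rather than a general Unicode normalization so that only the ligatures are touched and every other character in the OCR output is left exactly as Tesseract produced it.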

Merging Git Repositories and Preserving History

Recently I was faced with the need to merge two Git repositories and preserve the history behind the files in each.  Here is an overview of the situation:

  • A "Main" project repository with a remote in a public GitHub project.
  • A "Secondary" project repository with a remote in a private Visual Studio Online project.
  • Both repositories contained Visual Studio / .NET / C# solutions/projects.
  • I needed to move the Secondary (private) repository into the Main (public) repository.
  • I wanted to preserve the history of the files in the Secondary repository.

Much advice about merging two Git repositories and preserving history can be found online.  Here is a sample of what can be found:

http://saintgimp.org/2013/01/22/merging-two-git-repositories-into-one-repository-without-losing-file-history/
http://jasonkarns.com/blog/merge-two-git-repositories-into-one/
http://stackoverflow.com/questions/1425892/how-do-you-merge-two-git-repositories
http://julipedia.meroh.net/2014/02/how-to-merge-multiple-git-repositories.html
http://scottwb.com/blog/2012/07/14/merge-git-repositories-and-preseve-commit-history/

If you scan through the content at those links, you will see that there seem to be multiple ways to approach this problem.

Following is the sequence of Git commands that worked for me.  I must stress that this worked for me, and may not work equally well for your situation.  Proceed carefully, and be prepared to handle unexpected situations.

1) Navigate to the master branch of the Main project repository.

2) Add a remote that references the Secondary project repository.  In my case, this was a reference to the Visual Studio Online remote repository.

git remote add secondaryrep <URL of secondary repository>

3) Create a new branch in the Main repository.

git branch mergebranch

4) Navigate to the new branch.

git checkout mergebranch

5) Fetch the files and metadata from the Secondary repository.

git fetch secondaryrep

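6) Merge the master branch of the Secondary repository into the working branch of the Main repository.  (Git 2.9 and later refuse to merge histories that share no common ancestor unless the --allow-unrelated-histories option is added to the merge command.)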

git merge secondaryrep/master

At this point I had to stop and resolve a handful of minor errors which were specific to my situation.  You may or may not encounter similar issues with your own repositories.

Specifically, there was an untracked file that initially prevented the merge operation.  In this case, it was safe to simply remove that file and retry the merge.

In addition, after merging, there were a few merge conflicts in configuration files related to NuGet packages (recall that these repositories contained Visual Studio / .NET / C# projects).  It was a simple matter to edit the files indicated by Git and resolve the conflicts.

7) Prepare the files for commit.

git add .

8) Open all projects/solutions in Visual Studio and confirm that they build successfully and pass tests.

9) Commit the files from the Secondary repository.

git commit -a -m "Added projects from secondary repository"

10) Return to the master branch of the Main repository.

git checkout master

11) Merge the branch we created for the files from the Secondary repository into the master branch of the Main repository.

git merge mergebranch

12) Push the updated master branch to the Main repository’s remote GitHub repository.

git push origin master

13) Remove the branch that had been created for the Secondary repository’s files.

git branch -d mergebranch

14) Remove the reference to the Secondary repository’s remote Visual Studio Online repository.

git remote remove secondaryrep

hOCRImageMapper: A Tool For Visualizing hOCR Files

Just uploaded to GitHub (https://github.com/mlichtenberg/hocrimagemapper), this simple application provides a way to visualize hOCR output.

Per Wikipedia: "hOCR is an open standard of data representation for formatted text obtained from optical character recognition (OCR). The definition encodes text, style, layout information, recognition confidence metrics and other information using Extensible Markup Language (XML) in form of Hypertext Markup Language (HTML) or XHTML."

hOCR is produced by the Tesseract, Cuneiform, and OCRopus OCR software.  My motivation for creating this tool was a need to analyze hOCR output produced by Tesseract.

This application has been implemented as a simple WinForms application (yeah, I know, but it was quick) written in C#.

When using the application, the text contained in an hOCR file is loaded alongside the image that is the source of the OCR output.  Hovering over a word in the text highlights the word in the image. 

[Image: Hovering over the word “quantitative” in the left panel highlights the word in the source image on the right.]

Clicking a word in the text displays the coordinates for the bounding box used to highlight the word.  (This bounding box is extracted from the hOCR output.)  The coordinates are displayed as two pairs of X-Y coordinates that represent the upper left and lower right corners of the bounding box, measured from the top left of the image.

[Image: Clicking the word displays its coordinates.  In this case, the X-Y pairs are (513, 540) for the upper left and (846, 600) for the lower right.]

The source code can be downloaded from the GitHub repository, or the compiled executable can be downloaded directly.

“Count” and “Count Distinct” Queries in MongoDB

For the following examples, assume that you have a collection containing four records with the following fields and values:

{ "_id" : ObjectId("54936…dd0c"), "last_name" : "smith", "first_name" : "mike" }
{ "_id" : ObjectId("54936…dd0d"), "last_name" : "smith", "first_name" : "william" }
{ "_id" : ObjectId("54936…dd0e"), "last_name" : "smith", "first_name" : "william" }
{ "_id" : ObjectId("54936…dd0f"), "last_name" : "smith", "first_name" : "mark" }

Note that there are four records with a last_name value of “smith”.  The four records have three distinct values for the first_name field (“mike”, “william”, and “mark”).

To count the number of rows returned by a query, use "count()", as shown here:

> db.collection.find({"last_name":"smith"}).count();

4

To count the unique values, use "distinct()" rather than "find()", and "length" rather than "count()".  The first argument for "distinct" is the field for which to aggregate distinct values; the second is the conditional statement that specifies which rows to select.  Append "length" to the end of the query to count the number of values returned.  (The "count()" function does not work on the results of a "distinct" query, because "distinct()" returns a plain array rather than a cursor.)

Here is an example which counts the distinct number of first_name values for records with a last_name value of “smith”:

> db.collection.distinct("first_name", {"last_name":"smith"}).length;

3
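To make the difference between the two queries concrete without a running MongoDB server, here is a plain-Python sketch of the same logic over an in-memory list.  It only illustrates the semantics and is not how MongoDB executes the queries; the helper names are made up, and the records list mirrors the four sample records above (minus the _id field).

```python
records = [
    {"last_name": "smith", "first_name": "mike"},
    {"last_name": "smith", "first_name": "william"},
    {"last_name": "smith", "first_name": "william"},
    {"last_name": "smith", "first_name": "mark"},
]


def matches(row, criteria):
    """True when the row has every field/value pair in criteria."""
    return all(row.get(field) == value for field, value in criteria.items())


def count_matching(rows, criteria):
    """Analogous to find(criteria).count(): one count per matching row."""
    return sum(1 for row in rows if matches(row, criteria))


def distinct_values(rows, field, criteria):
    """Analogous to distinct(field, criteria): unique values, as a list."""
    values = []
    for row in rows:
        if matches(row, criteria) and field in row and row[field] not in values:
            values.append(row[field])
    return values


print(count_matching(records, {"last_name": "smith"}))                      # 4
print(len(distinct_values(records, "first_name", {"last_name": "smith"})))  # 3
```

Because the distinct values come back as an ordinary list, the count is taken with len(), which parallels the use of "length" on the array returned by "distinct()".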

St. Louis Days of .NET 2014

My notes from the 2014 edition of St. Louis Days of .NET.  I was only able to attend the first day of the conference this year.

Front-End Design Patterns: SOLID CSS + JS for Backend Developers

Presenter: https://twitter.com/anthony_vdh
Session Materials:  http://vimeo.com/97315940

Use namespaced, unambiguous classes.   For example, use “.product_list_item” instead of “.product_list li” , and “.h1” instead of “h1”.

No cascading

Limit overriding

CSS Specificity – Specificity is the means by which a browser decides which property values are the most relevant to an element and get to be applied.
    Each CSS rule is assigned a specificity value
    Plot specificity values on a graph where the x-axis represents the line number in the CSS
    Line should be relatively flat, and only trend toward high specificity towards the end of the CSS
    Specificity graph generator: http://jonassebastianohlsson.com/specificity-graph/
    Another opinion on what a graph should look like: http://snook.ca/archives/html_and_css/specificity-graphs

Important CSS patterns and concepts
    Namespaces
    Modules
    Prototype
    Revealing Module
    Revealing Prototype

Optimizing Your Website’s Performance (End-To-End Diagnostics)

Presenter: http://mitchelsellers.com/
Session Materials: http://mitchelsellers.com/blogs/2014/11/17/2014-st-louis-days-of-net-presentations.aspx

If your test environment is different from your production environment, look for linear differences in order to estimate the differences between the servers.  For example, if the production server is a quad-core server and the test server is a dual-core server, measure the performance of the test server twice: once with one core active and once with both cores.  The difference between running with one core vs. two cores should allow you to estimate the difference between the dual-core server and the quad-core server.  Obviously, this will not be perfect, but it does provide some baseline for estimating the differences between servers.
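As a rough worked example of that estimate (my own interpretation of the idea, not something shown in the session), the arithmetic can be sketched in Python.  The function name and the throughput numbers are hypothetical, and the calculation assumes that each additional core adds the same gain that was measured when going from one core to two.

```python
def extrapolate_throughput(one_core, two_cores, target_cores):
    """Linear estimate of throughput on a server with target_cores cores.

    one_core and two_cores are measurements taken on the test server
    (for example, requests per second) with one and two cores active.
    """
    gain_per_core = two_cores - one_core
    return one_core + gain_per_core * (target_cores - 1)


if __name__ == "__main__":
    # Hypothetical measurements: 100 req/s on one core, 180 req/s on two.
    print(extrapolate_throughput(100, 180, 4))  # 340
```

Real servers rarely scale perfectly linearly, so a number like this is best treated as a starting point for comparison rather than a prediction.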

Different browsers have different limits on how many simultaneous requests can be made to a single domain (varies from 4 to 10).

Simple stuff to look at when optimizing a web site:
    Large images
    Long-running JavaScript
    Large ViewState

Make sure cache-expiration is set correctly for static content.  This is done in the web.config file.

TOOLS

Google PageSpeed
    Provides mobile and desktop scores
    Used in Google search rankings!
    Not useful for internal sites
    Similar to YSlow
    Blocked by pages requiring a login

Google Analytics (or similar)
    Useful for investigating daily loads (determine why site is slow at certain times)
    Use to investigate traffic patterns

Loader.IO
    Reasonably priced and free options available
    Use to simulate traffic load on your site
    Only tests static HTML

LoadStorm
    More expensive
    Use to simulate traffic load
    Tests everything; not just static content

New Relic
    Internal server monitoring

Hadoop For The SQL Ninja

Presenter: https://twitter.com/mwinkle

Hive is a SQL-like query language for Hadoop.
    Originated at Facebook
    Compiles to Map/Reduce jobs
    Queries tables/catalogs defined on top of underlying data stores
    Data stores can be text files, MongoDB, etc.
    Data stores just need to provide rows and columns of data
    Custom data providers can be created to provide rows/columns of data

Hive is good for:
    Large scale queries
    A variety of formats
    UDF extensibility

Hive is NOT good for:
    Interactive querying
    Small tables
    OLTP

Hive connectivity
    ODBC/JDBC – responsive queries
    Oozie – job-based workflows
    PowerShell
    Azure Toolkit/API – now includes Visual Studio integration for viewing/executing queries

Angular for .NET Developers

Presenter: https://twitter.com/jamesbender
Session Materials: https://github.com/JamesBender/AngularDemos

AngularJS is a JavaScript MVC framework
    The model, view, and controller all live on the client
    Data is exchanged via AJAX calls to REST web services
    Makes use of dependency injection

Benefits of AngularJS
    Unobtrusive JavaScript
    Clean HTML
    Limits the need for third party libraries (like jQuery)
    Works well with ASP.NET MVC
    Easy Single-Page Applications (SPA)
    Testing is easy.  Jasmine is the test framework of choice.

HTML attributes provide AngularJS “hooks”.  For example, notice the attributes on the elements <html ng-app="AngularApp"> and <input ng-model="user.name" />

Data binding example:

    <input ng-model="user.name"/>
    <p>Hello {{user.name}}</p>

    In this example, data entered into the input text box is echoed in the paragraph below the input element.

Making Rich, Interactive, Multi-Platform Applications with SignalR

Presenter: http://mitchelsellers.com/
Session Materials: http://mitchelsellers.com/blogs/2014/11/17/2014-st-louis-days-of-net-presentations.aspx

Use cases for SignalR
    Any application that involves polling
    Chat applications
    Real-time score updates
    Voting results
    Real-time stock prices

The Smooth Transition to TypeScript

Presenter: https://twitter.com/pottereric

TypeScript provides compile-time errors in Visual Studio.

TypeScript has type-checking
    Optional types on variables and parameters
    Primitive types are number, string, boolean, and any
    The “any” type tells the compiler to treat the variable like JavaScript would
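
A short sketch of those points, using names of my own invention:

```typescript
// Explicit type annotations on a variable and on function parameters.
let attendees: number = 3;

function greet(name: string, excited: boolean): string {
  return excited ? "Hello, " + name + "!" : "Hello, " + name;
}

// Annotations are optional: the compiler infers string here.
let conference = "Days of .NET";

// The any type opts out of checking, so this reassignment compiles.
let loose: any = 42;
loose = "now a string";

// Uncommenting the next line produces a compile-time error, because a
// string cannot be assigned to a variable declared as number.
// attendees = "three";

const greeting = greet(conference, true); // "Hello, Days of .NET!"
```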

Intellisense for TypeScript is very good, and other typical Visual Studio tooling works as well.

TypeScript files compile to JavaScript (example.ts –> example.js), and the JavaScript is what gets referenced in your web applications.

TypeScript class definitions become JavaScript types.

The usual Visual Studio design and compile-time errors are available when working with classes.
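
For illustration, here is a small class of my own.  Depending on the compiler target, it is emitted as either a JavaScript constructor function or an ES2015 class.

```typescript
class Session {
  // Parameter properties: public and private declare and assign fields.
  // Note that private is enforced only at compile time.
  constructor(public title: string, private minutes: number) {}

  describe(): string {
    return this.title + " (" + this.minutes + " min)";
  }
}

const talk = new Session("The Smooth Transition to TypeScript", 60);
const summary = talk.describe(); // "The Smooth Transition to TypeScript (60 min)"

// Uncommenting the next line is a compile-time error: minutes is private.
// talk.minutes = 90;
```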

A NuGet package exists that provides “jQuery typing files” that enable working with jQuery in TypeScript.

TypeScript supports generics and lambdas.
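
A minimal sketch combining the two; the helper function is hypothetical:

```typescript
// Generic function: T is inferred from the array that is passed in.
// The predicate parameter is typed as a function from T to boolean.
function firstMatch<T>(
  items: T[],
  predicate: (item: T) => boolean
): T | undefined {
  for (let i = 0; i < items.length; i++) {
    if (predicate(items[i])) {
      return items[i];
    }
  }
  return undefined;
}

// Lambdas (arrow functions) supply the predicates; T is inferred as
// string in the first call and number in the second.
const longName = firstMatch(["css", "angular", "hive"], name => name.length > 4); // "angular"
const bigNumber = firstMatch([1, 2, 3], n => n > 5); // undefined
```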