Comparing Documents with Bayes Classification, Term Frequency–Inverse Document Frequency, and Levenshtein Distance Algorithms
September 9, 2013 2 Comments
I recently needed to find a way to quickly evaluate a string against a large dictionary of strings and find any exact or close matches. More precisely, the strings in question were literary citations, which made the matching process more difficult than simple one-to-one string comparisons. For example, I needed to be able to recognize that the citation “An Article About Birds, 1950, John Smith” was in fact a match for “Smith, John 1950 An Article About Birds”. What’s more, these comparisons needed to be done in real-time, against a dictionary of existing citations that numbered in the tens of thousands.
THE ALGORITHMS
A bit of research identified three suitable algorithms: Bayes Classification, Term Frequency-Inverse Document Frequency (TFIDF), and Levenshtein Distance. (For a detailed description of each, follow the links.)
Very simply put, each of these algorithms works by comparing a string (or “document”) to a dictionary of strings (“documents”) and assigning scores indicating how closely each dictionary entry matches the the string being evaluated.
Open source C# implementations of each algorithm can be found at the following locations:
Bayes: http://www.codeproject.com/Articles/14270/A-Naive-Bayesian-Classifier-in-C
TFIDF: http://www.codeproject.com/Articles/12098/Term-frequency-Inverse-document-frequency-implemen
Levenshtein: http://www.codeproject.com/Articles/13525/Fast-memory-efficient-Levenshtein-algorithm
Output
Each algorithm returns scores indicating how similar the document being resolved is to each document in the dictionary. The "best" score varies by the algorithm used for the document resolution.
In the Bayes algorithm, the documents in the dictionary that receive the highest scores are the closest matches to the document being resolved, though there is not a lot of variation in the scores. Many documents might receive the same score. In addition, there is no set value that always indicates the best match, so one cannot simply look for a score of 1 or 10 or 100 to identify an exact match.
Here is an example of the output of a Bayes algorithm evaluation of the string “Three new Queensland fishes 1922 Ogilby, J D” against a dictionary with 5000 entries:
-29.0326457653881 | Three undescribed Queensland fishes 1916 Ogilby, J D |
-35.0762263136351 | Descriptions of three new Australian lizards 1892 Ogilby, J Douglas |
-35.4987968896462 | Description of a new pelagic fish from New Zealand 1893 Ogilby, J Douglas |
-37.3313783533945 | Three new Brazilian Micro-Lepidoptera 1911 |
-37.3313783533945 | New Philippine Coleoptera 1922 Heller, H M |
-37.3313783533945 | Three new Nicaraguan Diplopods 1956 |
-37.3313783533945 | Three new Gomphines 1902 Needham, J G |
-37.3313783533945 | New Tasmanian fish 1902 Johnston, R M |
-37.3313783533945 | Twenty-five new Sphingidae 1922 Clark, B P |
-37.3313783533945 | A new anarrhichadoid fish 1905 Gill, T N |
In the Term Frequency – Inverse Document Frequency algorithm, documents in the dictionary are assigned a score ranging from 0 to 1. Scores closest to 1 (with exactly 1 being best) are the closest matches to the document being resolved.
Here is an example of TFIDF output from the evaluation of the string “Three new Queensland fishes 1922 Ogilby, J D” against a dictionary with 5000 entries:
0.782137747714455 | Three undescribed Queensland fishes 1916 Ogilby, J D |
0.396136491455677 | Descriptions of three new Australian lizards 1892 Ogilby, J Douglas |
0.338747682136121 | Description of a new pelagic fish from New Zealand 1893 Ogilby, J Douglas |
0.32172277819332 | On some undescribed reptiles and fishes from Australia 1892 Ogilby, J Douglas |
0.249869274706399 | Some notes on the ferns of north Queensland 1915 Watts, W W |
0.245028794556167 | Description of a new shark from the Tasmanian coast 1893 Ogilby, J Douglas |
0.204027948804197 | A new subspecies of Megacephala murchisona from north Queensland 1907 Horn, W |
0.193913337083555 | Review of the genus Schedophilus, Cocco, and its allies 1893 Ogilby, J Douglas |
0.188518472471232 | Two undescribed Pelecypoda from the Lower Cretaceous of Queensland in the collection of the Australian Museum 1902 Etheridge, R |
0.182215641862376 | New Philippine Coleoptera 1922 Heller, H M |
In the Levenshtein Distance algorithm, the documents in the dictionary that receive the lowest score (0 being "best") are the closest matches to the document being resolved.
Here is an example of Levenshtein output from the evaluation of the string “Three new Queensland fishes 1922 Ogilby, J D” against a dictionary with 5000 entries:
20 | Three undescribed Queensland fishes 1916 Ogilby, J D |
54 | Descriptions of three new Australian lizards 1892 Ogilby, J Douglas |
57 | On three new spiders of the genus Oxyopes (Araneina) 1929 Chamberlin, R V |
57 | Descriptions of three new species of land tortoises 1828 Bell, T |
59 | Three new plants from Yucatan 1930 Standley, P C |
59 | Three new Gomphines 1902 Needham, J G |
59 | New Tasmanian fish 1902 Johnston, R M |
59 | A new anarrhichadoid fish 1905 Gill, T N |
60 | Three new rodents from Colorado 1908 Merriam, C H |
60 | Three new Canadian Anthomyiidae (Diptera) 1919 Malloch, J R |
As you can see, the TFIDF and Levenshtein algorithms produce output that is suited to identifying a few "best" matches. The less-specific scores given by the Bayes output are better suited for identifying categories of documents.
Performance
NOTE: The original TFIDF implementation generates and stores weight values for each word in the entire dictionary. For large dictionaries, this matrix of values quickly uses excessive amounts of memory (in my tests, all of the memory on my laptop was exhausted). Therefore, I modified the algorithm to calculate the term weights on the fly, as they are needed. This solved the memory issue, but had negative repercussions for the performance of the algorithm.
Below are results of timed test runs of each algorithm with dictionaries of three different sizes. Notice that none of the dictionaries are particularly large, and yet the performance of the TFIDF and Levenshtein algorithms significantly degrades when evaluating the larger dictionaries. In fact, the TFIDF algorithm took nearly a minute to evaluate a single document against a dictionary of just 5000 documents!.
These tables show the times it took each algorithm to evaluate 100 documents.
Dictionary size = 250 documents | |
Bayes | 0.64 seconds |
Levenshtein | 8.42 seconds |
TFIDF | 32.47 seconds |
Dictionary size = 1000 documents | |
Bayes | 1.40 seconds |
Levenshtein | 30.53 seconds |
TFIDF | 298.50 seconds |
Dictionary size = 5000 documents | |
Bayes | 9.47 seconds |
Levenshtein | 139.75 seconds |
TFIDF | 5689.13 seconds |
A Problem
Due to the extreme performance degradation of the TFIDF and Levenshtein algorithms as the size of the dictionary increases, using either of those algorithms to perform real-time resolution of more than a small number of documents is not feasible. That leaves just the Bayes algorithm, which handles large dictionaries much more efficiently. Unfortunately, the Bayes output is much less specific about which document is the "best" match.
Because I needed to find the “best” matches, and because I had a dictionary with over 90000 entries, the ideal algorithm for my needs would be one that returned the TFIDF or Levenshtein output with the speed of Bayes.
And a Solution
The solution I came up with is to combine the algorithms. The first step is to use the Bayes algorithm to evaluate the document against the full dictionary. Then, the subset of results that contains the most likely matching documents is identified. Finally, either the TFIDF or Levenshtein algorithm is used to evaluate the document a second time, using only the likely matches identified by the Bayes algorithm as the dictionary for this secondary evaluation.
Because the Bayes algorithm is always relatively fast, and because the size of the dictionary used for the secondary evaluation is small, the total time to evaluate a document is greatly improved. In addition, the final output is the more-specific output of the TFIDF or Levenshtein algorithms.
NOTE: To enable this chaining together of algorithms, I made one key change to all of the algorithms. I reduced the total number of results returned. I did this by choosing a threshold value for each algorithm and only returning documents that score better than the threshold value. For the TFIDF and Levenshtein algorithms, limiting the results was a simple matter of comparing the score of each document with the chosen threshold value. For Bayes, it was necessary to calculate the standard deviation of all of the scores and select only documents that scored better than the standard deviation from the mean. Applying these thresholds reduced the documents returned to only the most relevant… the most likely matches.
Here are results of timed test runs of hybrid algorithms that combine the Bayes algorithm with the Levenshtein and TFIDF algorithms. Dictionaries of three different sizes were tested. These tables show the times it took each algorithm to evaluate 100 documents
Dictionary size = 250 documents | |
Bayes / Levenshtein | 0.69 seconds |
Bayes / TFIDF | 0.69 seconds |
Dictionary size = 1000 documents | |
Bayes / Levenshtein | 1.53 seconds |
Bayes / TFIDF | 1.56 seconds |
Dictionary size = 5000 documents | |
Bayes / Levenshtein | 10.03 seconds |
Bayes / TFIDF | 10.44 seconds |
Notice how the hybrid algorithms improve upon the performance of the original TFIDF and Levenshtein algorithms. Their performance compares very favorably to the Bayes algorithm by itself, while producing the more specific outputs of the TFIDF and Levenshtein algorithms.
THE DOCUMENTRESOLVER CLASS
Click here to download the source code for the DocumentResolver class.
The DocumentResolver class library wraps each of the three open-source algorithms (with all of the modifications and extensions that have been described) in a common interface. Also, it implements two additional hybrid document resolution algorithms. One of the hybrid algorithms chains together the Bayes and TFIDF algorithms, and the other chains the Bayes and Levenshtein algorithms.
NOTE: The source code for the original TFIDF algorithm includes a class that implements a Porter Stemmer. This class reduces all of the words in a string to their root forms (or stems). For example, the root form of “birds” is “bird”. In testing, I found that the algorithms produced better results when evaluating only the root forms of words. So, I adapted this class so that it can be used by any of the algorithms. Notice the “useWordStemmer” boolean in the examples below. The value of that flag determines whether or not the word stemmer is used.
Here is an example that shows how to use the Bayes algorithm:
BayesResolverEngine resolver = new BayesResolverEngine();
// Each key-value pair in the dictionary returned by LoadData
// pairs an identifier with a string value
Dictionary<string, string> dictionary = LoadData(dictionaryfile);
resolver.Dictionary = dictionary;
bool useWordStemmer = true;
string document = “string to be resolved”;
resolutionResults = resolver.Resolve(document, useWordStemmer);
foreach (ResolutionResult resolutionResult in resolutionResults)
{
// The score assigned by the algorithm
Console.WriteLine(resolutionResult.Score);
// Identifier of the matching string
Console.WriteLine(resolutionResult.Key);
// The matching string
Console.WriteLine(resolutionResult.Document);
}
The usage of the TFIDF and Levenshtein algorithms is identical, except for the first line. For those algorithms, substitute either
TFIDFResolverEngine resolver = new TFIDFResolverEngine();
or
LevenshteinResolverEngine resolver = new LevenshteinResolverEngine();
Usage of the hybrid algorithms is again very similar, with one additional step. Here is a complete example:
DocumentResolver resolver = new DocumentResolver();
// Each key-value pair in the dictionary returned by LoadData
// pairs an identifier with a string value
Dictionary<string, string> dictionary = LoadData(dictionaryfile);
resolver.SetDictionary(dictionary);
resolver.SetEngine(DocumentResolver.EngineType.BayesTFIDF);
bool useWordStemmer = true;
string document = “string to be resolved”;
resolutionResults = resolver.Resolve(document, useWordStemmer);
foreach (ResolutionResult resolutionResult in resolutionResults)
{
// The score assigned by the algorithm
Console.WriteLine(resolutionResult.Score);
// Identifier of the matching string
Console.WriteLine(resolutionResult.Key);
// The matching string
Console.WriteLine(resolutionResult.Document);
}
Notice that the resolver class instantiated on the first line is again different. The other difference is the inclusion of resolver.SetEngine(). This identifies which chained algorithm to use; in this case it is Bayes-TFIDF. To instead use Bayes-Levenshtein, change this line to
resolver.SetEngine(DocumentResolver.EngineType.BayesLevenshtein);
In addition to the DocumentResolver class itself, the source code includes a GUI application and a console application that can be used to exercise the algorithms. Sample data sets are also included. The code shown above is taken from those applications..
FUTURE IMPROVEMENTS
One obvious improvement that could be made to the DocumentResolver library would be to incorporate the use of simple parallel operations. Each of the algorithms evaluates a given string against the entire dictionary, one entry at a time. This is a prime candidate for parallel processing.
For example, each algorithm has a construct similar to the following (taken from the TFIDFResolverEngine class):
// Evaluate the string against all of the data entries
List<ResolutionResult> results = new List<ResolutionResult>();
TFIDFMeasure tf = new TFIDFMeasure(documents, useWordStemmer);
…
foreach(KeyValuePair<string, string> kvp in documents)
{
…
double similarity = tf.GetSimilarity(0, x);
…
}
}
Replacing the “foreach” loop with a “Parallel.ForEach()” method should improve the processing times. (Perhaps that will be a topic for another post.)
CONCLUSION
In summary, you can make use of any of the algorithms in the DocumentResolver library, provided you are aware of the characteristics and limitations of each.
The TFIDF and Levenshtein Distance algorithms give more definite answers to the question of “which documents match”, whereas the Bayes Classification algorithm clusters the output into broader groups. On the other hand, the Bayes algorithm is much faster, particularly with large dictionaries.
If you need more definite answers than what the Bayes algorithm provides, but have a large dictionary (more than 1000 documents) or otherwise need to minimize the processing time, consider using one of the hybrid algorithms (Bayes-TFIDF or Bayes-Levenshtein).
Pingback: A Simple Parallel.ForEach Example | LichtenBytes
Amazing article! (y) Just what I was looking for.