Finding possibly matching strings in a large dataset - hashmap

I'm in the middle of a project where I have to process text documents and enhance them with Wikipedia links. Preprocessing a document includes locating all the possible target articles, so I extract all ngrams and compare them against a database containing all the article names. The current algorithm is a simple caseless string comparison preceded by simple trimming. However, I'd like it to be more flexible and tolerant of errors or small text modifications such as prefixes. Besides, the database is pretty huge, and I have a feeling that string comparison in such a large database is not the best idea...
What I thought of is a hashing function which would assign a unique hash (I'd rather avoid collisions) to any article or ngram, so that I could compare hashes instead of strings. The difference between two hashes would tell me whether the words are similar, so that I could gather all the possible target articles.
Theoretically, I could use cosine similarity to calculate the similarity between words, but this doesn't seem right to me because comparing the characters multiple times sounds like a performance issue.
Is there any recommended way to do it? Is it a good idea at all? Maybe the string comparison with proper indexing isn't that bad and the hashing won't help me here?
I have looked at hashing functions and text similarity algorithms, but I haven't found a solution yet...
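For reference, a minimal sketch of the exact, caseless lookup I described above (article names and the normalisation rule are simplified, and an in-memory HashSet stands in for the database):

```java
// Toy sketch of the current approach: trim + lowercase, then exact lookup.
// Fast (O(1) per ngram) but with zero tolerance for typos or variants.
import java.util.*;

public class ExactTitleLookup {
    private final Set<String> titles = new HashSet<>();

    public ExactTitleLookup(Collection<String> articleNames) {
        for (String name : articleNames) titles.add(normalize(name));
    }

    private static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }

    /** Exact match only: a misspelled ngram will never be found. */
    public boolean isArticle(String ngram) {
        return titles.contains(normalize(ngram));
    }

    public static void main(String[] args) {
        ExactTitleLookup lookup = new ExactTitleLookup(Arrays.asList("Apple", "Apple Inc."));
        System.out.println(lookup.isArticle("  apple "));  // true
        System.out.println(lookup.isArticle("appel"));     // false: no fuzziness
    }
}
```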

Consider using the Apache Lucene API. It provides functionality for searching, stemming, tokenization, indexing, and document similarity scoring. It's an open-source implementation of basic best practices in Information Retrieval.
The functionality that seems most useful to you from Lucene is its MoreLikeThis feature, which extracts the most characteristic terms from a document and turns them into a query for locating similar documents.
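A rough MoreLikeThis sketch against a Lucene 5+/6+-style API might look like the following; the index path, field name ("body"), and document id are placeholders, and constructor details vary between Lucene versions:

```java
// Hedged sketch: find documents similar to one already in a Lucene index.
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queries.mlt.MoreLikeThis;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.FSDirectory;

public class MoreLikeThisDemo {
    public static void main(String[] args) throws Exception {
        try (IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);

            MoreLikeThis mlt = new MoreLikeThis(reader);
            mlt.setFieldNames(new String[] { "body" });   // field(s) to mine terms from
            mlt.setAnalyzer(new StandardAnalyzer());      // needed if fields lack term vectors
            mlt.setMinTermFreq(1);                        // loosen defaults for short documents
            mlt.setMinDocFreq(1);

            Query query = mlt.like(42);                   // 42 = internal doc id of the source doc
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println(hit.doc + "\t" + hit.score);
            }
        }
    }
}
```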

Related

inexact string search - short query string to huge database (blast?)

I have an OCR system that recognises a few short query strings (4-12 letters) in a given picture, and I would like to match these recognised words against a big database of known words. I've already built a confusion matrix over the alphabet in use, based on the most common mistakes, and I tried a full Gotoh alignment against all words in my database and found (not surprisingly) that this is too time-consuming.
So I am looking for a heuristic approach to match these words to the database (allowing mismatches). Does anyone know of an available library or algorithm that could help me out?
I've already thought about using BLAST or FASTA, but as I understand it both are limited to the standard amino-acid alphabet, and I would like to use all letters and digits.
Thank you for your help!
I'm not an expert, but I've done some reading on bioinformatics (which isn't your topic, but is related). You could use suffix trees or related data structures to search the database more quickly. Construction of the tree takes time linear in the length of the database, and querying takes time linear in the length of the query string, so if you have many relatively short query strings this sounds like the right kind of data structure for you. More reading can be found on the Wikipedia page for suffix trees.
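For illustration, here is a minimal sketch using a suffix array (a simpler relative of the suffix tree) for exact substring lookup over a dictionary; fuzzy matching with your confusion matrix would have to be layered on top, and the naive construction below is only meant to show the idea:

```java
// Toy suffix-array lookup: join all dictionary words, sort suffix start
// positions, then binary-search for a query. Exact matches only.
import java.util.*;

public class SuffixArraySearch {
    private final String text;   // all dictionary words joined by '\0' separators
    private final Integer[] sa;  // suffix start positions, sorted lexicographically

    public SuffixArraySearch(List<String> words) {
        this.text = String.join("\0", words);
        this.sa = new Integer[text.length()];
        for (int i = 0; i < sa.length; i++) sa[i] = i;
        // Naive construction via full-suffix comparisons; fine for a sketch,
        // far too slow for a huge database (use a real O(n) builder there).
        Arrays.sort(sa, (a, b) -> text.substring(a).compareTo(text.substring(b)));
    }

    /** Returns true if `query` occurs as a substring of some dictionary word. */
    public boolean contains(String query) {
        int lo = 0, hi = sa.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            String prefix = text.substring(sa[mid],
                    Math.min(text.length(), sa[mid] + query.length()));
            int cmp = prefix.compareTo(query);
            if (cmp == 0) return true;
            if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return false;
    }

    public static void main(String[] args) {
        SuffixArraySearch idx = new SuffixArraySearch(
                Arrays.asList("haematology", "kinase", "protein"));
        System.out.println(idx.contains("emat"));  // true
        System.out.println(idx.contains("xyz"));   // false
    }
}
```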

Mahout: Creating vectors from Text, How can we support foreign language?

http://mahout.apache.org/users/basics/creating-vectors-from-text.html
The page above explains how Mahout creates vectors from text using Lucene.
Is there a way to support languages other than English?
Thanks
It takes a few steps to turn a document into a vector. Since you mentioned Apache Lucene and Mahout, I will briefly explain how you can obtain vectors using them. It's a little tedious, but you need to see the big picture to understand what is required to create vectors from languages other than English.
Firstly, using Apache Lucene, you create index files from the text. In this step the text is passed through an Analyzer. The Analyzer breaks the text into pieces (technically, tokens) and does most of the important work, including removing stop words (the, but, a, an, ...), stemming, converting to lower case, and so on. So, to support a different language, all you need to do is build (or choose) a suitable Analyzer.
In Lucene, StandardAnalyzer is the most well-equipped analyzer you can use; it also copes with non-English text such as Chinese, Japanese, and Korean.
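As a quick illustration, tokenizing non-English text with one of Lucene's language-specific analyzers might look roughly like this; SmartChineseAnalyzer is assumed to come from the lucene-analyzers-smartcn module, and exact constructors vary by Lucene version:

```java
// Hedged sketch: run text through a language-aware Lucene Analyzer and
// print the resulting tokens.
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new SmartChineseAnalyzer();  // Chinese-aware analyzer
        try (TokenStream ts = analyzer.tokenStream("body", "我爱自然语言处理")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString());     // one token per line
            }
            ts.end();
        }
        analyzer.close();
    }
}
```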
Secondly, after you obtain the index files, the next step is to mine the text with Mahout. Whatever you intend to do with your text, you first have to convert the index files to SequenceFile format, since Mahout can only read input as SequenceFiles. You can use the SequenceFilesFromLuceneStorage class in Mahout to do this.
Thirdly, once you have the sequence file, you can convert it to vectors; for example, the SparseVectorsFromSequenceFiles class does exactly that.
Hope it helps.

Fuzzy String Matching

I have a requirement within my application to fuzzy match a string value inputted by the user, against a datastore.
I am basically attempting to find possible duplicates in the process in which data is added to the system.
I have looked at Metaphone, Double Metaphone, and Soundex, and the conclusion I have come to is that they are all well and good when dealing with a single-word input string; however, I am trying to match against an undefined number of words (they are actually place names).
I did consider splitting the string into words (removing any I define as noise words) and then implementing some logic to determine which place names in my data store best match, based on the keys produced by whichever algorithm I choose. The advantage I see is that I could selectively tighten or loosen the match criteria to suit the application; however, this does seem a little dirty to me.
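A rough sketch of what I mean, using Apache Commons Codec's DoubleMetaphone as an assumed dependency (the noise-word list and matching rule are just placeholders):

```java
// Split a place name into words, drop noise words, encode each word
// phonetically, and join the codes into a single comparable key.
import java.util.*;
import java.util.stream.Collectors;
import org.apache.commons.codec.language.DoubleMetaphone;

public class PlaceNameKey {
    private static final Set<String> NOISE = Set.of("the", "of", "on", "upon");
    private static final DoubleMetaphone DM = new DoubleMetaphone();

    /** Phonetic key for a multi-word place name. */
    public static String key(String placeName) {
        return Arrays.stream(placeName.toLowerCase().split("[^a-z]+"))
                .filter(w -> !w.isEmpty() && !NOISE.contains(w))
                .map(DM::doubleMetaphone)
                .sorted()                        // order-insensitive matching
                .collect(Collectors.joining("|"));
    }

    public static void main(String[] args) {
        System.out.println(key("Newcastle upon Tyne"));
        System.out.println(key("Newcastel on Tyne"));   // typically the same key despite the typo
        System.out.println(key("Stratford-upon-Avon"));
    }
}
```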
So my question(s) are:
1: Am I approaching this problem in the right way? Yes, I understand it will be quite expensive; however (without going too deeply into the implementation), this information will be coming from a memcache database.
2: Are there any algorithms out there that already specialise in phonetically matching multiple words? If so, could you please provide me with some information on them and, if possible, their strengths and limitations.
You may want to look into a locality-sensitive hash such as the Nilsimsa hash. I have used Nilsimsa to "hash" Craigslist posts across various cities to search for duplicates (note: I'm not a CL employee, this was just a personal project I was working on).
Most of these methods aren't as tunable as you may want (basically you get some loosely defined "edit distance" metric), and they're not phonetic, only character-based.
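To give a feel for the idea, here is a toy locality-sensitive signature based on MinHash over character trigrams. It is not Nilsimsa; it just illustrates that similar strings produce signatures that agree in many positions, while unrelated strings do not:

```java
// Toy MinHash over character trigrams: the fraction of matching signature
// positions approximates the Jaccard similarity of the trigram sets.
import java.util.*;

public class MinHashSketch {
    private static final int NUM_HASHES = 32;
    private final long[] seeds = new long[NUM_HASHES];

    public MinHashSketch(long masterSeed) {
        Random rnd = new Random(masterSeed);
        for (int i = 0; i < NUM_HASHES; i++) seeds[i] = rnd.nextLong();
    }

    /** One minimum hash value per seed, taken over the string's character trigrams. */
    public long[] signature(String s) {
        String norm = s.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim();
        long[] sig = new long[NUM_HASHES];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (int i = 0; i + 3 <= norm.length(); i++) {
            String gram = norm.substring(i, i + 3);
            for (int h = 0; h < NUM_HASHES; h++) {
                long v = (gram.hashCode() * 0x9E3779B97F4A7C15L) ^ seeds[h];
                v = Long.rotateLeft(v * 0xC2B2AE3D27D4EB4FL, 31);
                sig[h] = Math.min(sig[h], v);
            }
        }
        return sig;
    }

    /** Fraction of matching positions, an estimate of trigram-set Jaccard similarity. */
    public static double similarity(long[] a, long[] b) {
        int same = 0;
        for (int i = 0; i < a.length; i++) if (a[i] == b[i]) same++;
        return (double) same / a.length;
    }

    public static void main(String[] args) {
        MinHashSketch mh = new MinHashSketch(42L);
        long[] s1 = mh.signature("Newcastle upon Tyne");
        long[] s2 = mh.signature("Newcastle-upon-Tyne");
        System.out.println(similarity(s1, s2));                      // noticeably above zero
        System.out.println(similarity(s1, mh.signature("Sunderland"))); // near zero
    }
}
```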

Parsing bulk text with Hadoop: best practices for generating keys

I have a 'large' set of line-delimited full sentences that I'm processing with Hadoop. I've developed a mapper that applies some of my favorite NLP techniques to them. There are several different techniques that I'm mapping over the original set of sentences, and my goal during the reduce phase is to collect the results into groups such that all members of a group share the same original sentence.
I feel that using the entire sentence as a key is a bad idea. I also suspected that generating some hash value of the sentence might not work because of a limited number of keys (an unjustified belief).
Can anyone recommend the best idea/practice for generating unique keys for each sentence? Ideally, I would like to preserve order. However, this isn't a main requirement.
Goodbye,
Standard hashing should work fine. Most hash algorithms have a value space far greater than the number of sentences you're likely to be working with, and thus the likelihood of a collision will still be extremely low.
Despite the answer that I've already given you about what a proper hash function might be, I would really suggest you just use the sentences themselves as the keys unless you have a specific reason why this is problematic.
You might want to avoid simple hash functions (for example, any half-baked idea you could think up quickly), because they might not mix up the sentence data enough to avoid collisions in the first place; one of the standard cryptographic hash functions, such as MD5, SHA-1, or SHA-256, would be quite suitable.
You can use MD5 for this, even though collisions have been found and the algorithm is considered unsafe for security-sensitive purposes. This isn't a security-critical application, and the collisions that have been found arose from carefully constructed data and are very unlikely to arise randomly in your own NLP sentence data. (See, for example, Johannes Schindelin's explanation of why it's probably unnecessary to change git to use SHA-256 hashes, to appreciate the reasoning behind this.)
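A minimal sketch of deriving such a key with MD5 via java.security.MessageDigest; the resulting hex string can be used directly as a Hadoop Text key:

```java
// Fixed-length key per sentence. Collisions are astronomically unlikely for
// natural-language input, and this is not a security-sensitive use of MD5.
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class SentenceKey {
    public static String md5Hex(String sentence) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(sentence.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();   // 32 hex characters
    }

    public static void main(String[] args) throws Exception {
        System.out.println(md5Hex("The quick brown fox jumps over the lazy dog."));
    }
}
```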

How do you efficiently implement a document similarity search system?

How do you implement a "similar items" system for items described by a set of tags?
In my database, I have three tables: Article, ArticleTag, and Tag. Each Article is related to a number of Tags via a many-to-many relationship. For each Article, I want to find the five most similar articles, to implement an "if you like this article you will like these too" feature.
I am familiar with cosine similarity, and using that algorithm works very well. But it is way too slow: for each article, I need to iterate over all articles, calculate the cosine similarity for the article pair, and then select the five articles with the highest similarity rating.
With 200k articles and 30k tags, it takes me half a minute to calculate the similar articles for a single article. So I need another algorithm that produces roughly as good results as cosine similarity but that can be run in real time and does not require me to iterate over the whole document corpus each time.
Maybe someone can suggest an off-the-shelf solution for this? Most of the search engines I have looked at do not offer document similarity searching.
Some questions:
How is ArticleTag different from Tag? Or is that the M2M mapping table?
Can you sketch out how you've implemented the cosine matching algorithm?
Why don't you store your document tags in an in-memory data structure of some sort, using it only to retrieve document IDs? This way, you only hit the database at retrieval time.
Depending on the frequency of document additions, this structure can be designed for fast or slow updates.
Initial intuition towards an answer: I'd say an online clustering algorithm (perhaps a Principal Components Analysis on the co-occurrence matrix, which would approximate a k-means clustering?). This can be refined once you answer some of the questions above.
Cheers.
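To make the in-memory suggestion concrete, here is a hedged sketch (the data layout and IDs are assumptions): an inverted index from tag id to article ids means each article is scored only against articles that share at least one tag, instead of against the whole 200k corpus.

```java
// Candidate filtering via an inverted index, then cosine similarity on
// binary tag vectors: |A ∩ B| / sqrt(|A| * |B|).
import java.util.*;

public class SimilarArticles {
    private final Map<Integer, Set<Integer>> tagsByArticle = new HashMap<>();
    private final Map<Integer, List<Integer>> articlesByTag = new HashMap<>();

    public void addArticle(int articleId, Set<Integer> tagIds) {
        tagsByArticle.put(articleId, tagIds);
        for (int tag : tagIds) {
            articlesByTag.computeIfAbsent(tag, t -> new ArrayList<>()).add(articleId);
        }
    }

    private static double cosine(Set<Integer> a, Set<Integer> b) {
        int common = 0;
        for (int t : a) if (b.contains(t)) common++;
        return common / Math.sqrt((double) a.size() * b.size());
    }

    /** Top-k most similar articles, scoring only articles that share a tag. */
    public List<Integer> similar(int articleId, int k) {
        Set<Integer> tags = tagsByArticle.get(articleId);
        Map<Integer, Double> scores = new HashMap<>();
        for (int tag : tags) {
            for (int candidate : articlesByTag.get(tag)) {
                if (candidate != articleId && !scores.containsKey(candidate)) {
                    scores.put(candidate, cosine(tags, tagsByArticle.get(candidate)));
                }
            }
        }
        List<Integer> result = new ArrayList<>(scores.keySet());
        result.sort((x, y) -> Double.compare(scores.get(y), scores.get(x)));
        return result.subList(0, Math.min(k, result.size()));
    }
}
```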
You can do it with the Lemur toolkit. With its KeyfileIncIndex, you have to re-retrieve the document from its source; the IndriIndex supports retrieving the document from the index.
But anyway, you index your documents, and then you build a query from the document you want to find similar documents to. You can then do a search with that query, and it will score the other documents for similarity. It's pretty fast in my experience. It treats both source documents and basic queries as documents, so finding similarities is really what it does (unless you're using the Indri parser stuff - that's a bit different, and I'm not sure how it works).

Resources