inexact string search - matching short query strings against a huge database (BLAST?)

I have an OCR system that recognises a few short query strings (4-12 letters) in a given picture, and I would like to match these recognised words against a big database of known words. I've already built a confusion matrix over the alphabet in use from the most common mistakes, and I tried a full Gotoh alignment against all words in my database, but found (not surprisingly) that this is too time-consuming.
So I am looking for a heuristic approach to match these words to the database (allowing mismatches). Does anyone know of an available library or algorithm that could help me out?
I've already thought about using BLAST or FASTA, but as far as I understand it, both are limited to the standard amino acid alphabet, and I would like to use all letters and numbers.
Thank you for your help!
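For illustration, here is a minimal sketch of the kind of scoring described above: a plain weighted edit distance whose substitution costs come from an OCR confusion matrix. It is a simplification of Gotoh alignment (no affine gap penalties), and the confusion costs shown are made-up placeholders.

import java.util.HashMap;
import java.util.Map;

// Weighted edit distance with substitution costs taken from an OCR confusion
// matrix (simplified: linear gap costs, not Gotoh's affine gaps).
public class WeightedEditDistance {

    private final Map<String, Double> confusionCost = new HashMap<>();
    private static final double DEFAULT_SUB = 1.0; // cost of an unlisted substitution
    private static final double GAP = 1.0;         // cost of an insertion/deletion

    public WeightedEditDistance() {
        // Illustrative entries: characters the OCR often confuses are cheap to swap.
        confusionCost.put("0O", 0.2);
        confusionCost.put("1l", 0.2);
        confusionCost.put("5S", 0.3);
    }

    private double sub(char a, char b) {
        if (a == b) return 0.0;
        Double c = confusionCost.get("" + a + b);
        if (c == null) c = confusionCost.get("" + b + a);
        return c != null ? c : DEFAULT_SUB;
    }

    public double distance(String s, String t) {
        double[][] d = new double[s.length() + 1][t.length() + 1];
        for (int i = 0; i <= s.length(); i++) d[i][0] = i * GAP;
        for (int j = 0; j <= t.length(); j++) d[0][j] = j * GAP;
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                d[i][j] = Math.min(
                        d[i - 1][j - 1] + sub(s.charAt(i - 1), t.charAt(j - 1)),
                        Math.min(d[i - 1][j] + GAP, d[i][j - 1] + GAP));
            }
        }
        return d[s.length()][t.length()];
    }

    public static void main(String[] args) {
        WeightedEditDistance w = new WeightedEditDistance();
        System.out.println(w.distance("B00K", "BOOK")); // cheap: 0/O confusions
        System.out.println(w.distance("B00K", "BACK")); // expensive
    }
}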

I'm not an expert, but I've done some reading on bioinformatics (not the topic here, but related). You could use suffix trees or related data structures to search the database more quickly. Construction of the tree is linear in the database length, and querying is linear in the length of the query string, so if you have a lot of relatively short query strings this sounds like the perfect data structure for you. More reading can be found on the Wikipedia page for suffix trees.
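To make the idea concrete, here is a rough sketch of a substring index in Java. It uses a naive sorted suffix array over the concatenated dictionary instead of a real suffix tree (construction here is quadratic because of full suffix comparisons; a proper suffix tree or suffix array library builds in near-linear time), but the query side shows the point: a lookup costs roughly the length of the query plus a logarithmic factor, not a scan of the whole database.

import java.util.ArrayList;
import java.util.List;

// Illustrative substring index: a naive suffix array over the concatenated dictionary.
public class SuffixIndex {

    private final String text;
    private final Integer[] sa; // suffix start positions, sorted lexicographically

    public SuffixIndex(List<String> words) {
        // Separate words with a character that cannot appear in queries,
        // so no match can span a word boundary.
        this.text = String.join("\u0001", words);
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) positions.add(i);
        positions.sort((a, b) -> text.substring(a).compareTo(text.substring(b)));
        this.sa = positions.toArray(new Integer[0]);
    }

    // Returns true if 'query' occurs as a substring of any dictionary word.
    public boolean contains(String query) {
        int lo = 0, hi = sa.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) / 2;
            String suffix = text.substring(sa[mid]);
            if (suffix.startsWith(query)) return true;
            if (suffix.compareTo(query) < 0) lo = mid + 1; else hi = mid - 1;
        }
        return false;
    }

    public static void main(String[] args) {
        SuffixIndex idx = new SuffixIndex(List.of("banana", "apple", "cherry"));
        System.out.println(idx.contains("nan")); // true
        System.out.println(idx.contains("nap")); // false
    }
}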

Related

Searching substrings from a large set of strings

Is there a space-efficient data structure that can help answer the following question:
Assume I have a database of a large number of strings (in the millions). I need to be able to answer quickly whether a given string is a substring of one of these strings in the database.
Note that it's not even necessary in this case to tell which string it is a substring of, just that it's a substring of one.
As clarification, the ideal is to keep the data as small as possible, but query speed is really the most important issue. The minimum requirement is being able to hold the query data structure in RAM.
The right way to go about this is to avoid using your Java application to answer the question. If you solve the problem in Java, your app is guaranteed to read the entire table, and this is in addition to logic you will have to run on each record.
A better strategy would be to use your database to answer the question. Consider the following SQL query (assuming your database is some SQL flavor):
SELECT COUNT(*) FROM your_table WHERE column LIKE '%substring%'
This query will return the number of rows where 'column' contains some 'substring'. You can issue a JDBC call from your Java application. As a general rule, you should leave the heavy database lifting to your RDBMS; it was created for that.
I am giving a hat tip to this SO post which was the basis for my response: http://www.stackoverflow.com/questions/4122193/how-to-search-for-rows-containing-a-substring
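A minimal JDBC sketch of issuing that query from Java (the connection URL, table, and column names are placeholders; the LIKE pattern is bound as a parameter rather than concatenated into the SQL):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Counts rows whose column contains the given substring, letting the RDBMS do the scanning.
public class SubstringCount {
    public static long count(String jdbcUrl, String substring) throws Exception {
        String sql = "SELECT COUNT(*) FROM your_table WHERE your_column LIKE ?";
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, "%" + substring + "%"); // bind the pattern as a parameter
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                return rs.getLong(1);
            }
        }
    }
}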
Strings are highly compact structures, so for regular English text it is unlikely that you will find any other kind of structure that is more space-efficient than strings. You can play various tricks with bits to make each character occupy less space in memory (at the expense of supporting other languages), but the savings there will be linear.
However, if your strings have a very low degree of variation (a very high level of repetition), then you might be able to save space by constructing a tree in which each node corresponds to a letter. Each path of nodes in the tree then forms a possible word, as follows:
[c]---[a]-+-[t]
          |
          +-[r]
So, the above tree encodes the following words: cat, car. Of course this will only result in savings if you have a huge number of mostly similar strings.
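For illustration, a minimal sketch of that node-per-letter tree (a trie) in Java: inserting "cat" and "car" shares the c-a path, so six characters end up stored in only four nodes.

import java.util.HashMap;
import java.util.Map;

// Minimal trie showing how common prefixes share nodes.
public class PrefixTree {
    private static class Node {
        Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();
    private int nodeCount = 0;

    public void insert(String word) {
        Node cur = root;
        for (char c : word.toCharArray()) {
            cur = cur.children.computeIfAbsent(c, k -> {
                nodeCount++;
                return new Node();
            });
        }
        cur.isWord = true;
    }

    public static void main(String[] args) {
        PrefixTree t = new PrefixTree();
        t.insert("cat");
        t.insert("car");
        // 6 characters inserted, but only 4 nodes created (c, a, t, r).
        System.out.println(t.nodeCount);
    }
}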

Fuzzy String Matching

I have a requirement within my application to fuzzy-match a string value entered by the user against a datastore.
I am basically attempting to find possible duplicates in the process in which data is added to the system.
I have looked at Metaphone, Double Metaphone, and Soundex, and the conclusion I have come to is that they are all well and good when dealing with a single-word input string; however, I am trying to match against an undefined number of words (they are actually place names).
I did consider splitting each of the words out of the string (removing any I define as noise words) and then implementing some logic to determine which place names in my data store best matched (based on the keys from the algorithm I choose). The advantage I see in this is that I could selectively tighten or loosen the match criteria to suit the application; however, it does seem a little dirty to me.
So my question(s) are:
1: Am I approaching this problem in the right way? Yes, I understand it will be quite expensive; however (without going too deeply into the implementation) this information will be coming from a memcache database.
2: Are there any algorithms out there that already specialise in phonetically matching multiple words? If so, could you please provide some information on them and, if possible, their strengths and limitations.
You may want to look into a locality-sensitive hash such as the Nilsimsa hash. I have used Nilsimsa to "hash" Craigslist posts across various cities to search for duplicates (note: I'm not a CL employee, just a personal project I was working on).
Most of these methods aren't as tunable as you may want (basically you get some loosely defined "edit distance" metric), and they're not phonetic, only character-based.
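To illustrate the locality-sensitive-hashing idea, here is a generic MinHash-over-shingles sketch (this is not the Nilsimsa algorithm itself, just the same family of idea): similar texts produce similar fixed-size signatures, so candidate duplicates can be found by comparing short signatures instead of full strings.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Generic MinHash over character 4-grams (shingles).
public class MinHashSketch {
    private final int[] seedsA, seedsB;

    public MinHashSketch(int numHashes, long seed) {
        Random rnd = new Random(seed);
        seedsA = new int[numHashes];
        seedsB = new int[numHashes];
        for (int i = 0; i < numHashes; i++) {
            seedsA[i] = rnd.nextInt(Integer.MAX_VALUE - 1) + 1; // non-zero multiplier
            seedsB[i] = rnd.nextInt(Integer.MAX_VALUE);
        }
    }

    private static Set<Integer> shingles(String text, int k) {
        Set<Integer> out = new HashSet<>();
        for (int i = 0; i + k <= text.length(); i++)
            out.add(text.substring(i, i + k).hashCode());
        return out;
    }

    public int[] signature(String text) {
        int[] sig = new int[seedsA.length];
        Arrays.fill(sig, Integer.MAX_VALUE);
        for (int sh : shingles(text.toLowerCase(), 4))
            for (int i = 0; i < sig.length; i++) {
                int h = seedsA[i] * sh + seedsB[i]; // cheap per-slot rehash of the shingle
                if (h < sig[i]) sig[i] = h;
            }
        return sig;
    }

    // Fraction of matching slots approximates the Jaccard similarity of the shingle sets.
    public static double similarity(int[] a, int[] b) {
        int same = 0;
        for (int i = 0; i < a.length; i++) if (a[i] == b[i]) same++;
        return (double) same / a.length;
    }

    public static void main(String[] args) {
        MinHashSketch mh = new MinHashSketch(128, 42);
        int[] s1 = mh.signature("2 bedroom apartment near downtown, hardwood floors");
        int[] s2 = mh.signature("2 bedroom apt near downtown, hardwood floors!");
        System.out.println(similarity(s1, s2)); // close to 1.0 for near-duplicates
    }
}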

Finding possibly matching strings in a large dataset

I'm in the middle of a project where I have to process text documents and enhance them with Wikipedia links. Preprocessing a document includes locating all the possible target articles, so I extract all n-grams and compare them against a database containing all the article names. The current algorithm is a simple case-insensitive string comparison preceded by simple trimming. However, I'd like it to be more flexible and tolerant of errors or small text modifications like prefixes, etc. Besides, the database is pretty huge, and I have a feeling that string comparison in such a large database is not the best idea...
What I thought of is a hashing function which would assign a unique (I'd rather avoid collisions) hash to any article or n-gram, so that I could compare hashes instead of strings. The difference between two hashes would tell me whether the words are similar, so that I could gather all the possible target articles.
Theoretically, I could use cosine similarity to calculate the similarity between words, but this doesn't seem right to me, because comparing the characters multiple times sounds like a performance issue.
Is there any recommended way to do it? Is it a good idea at all? Maybe the string comparison with proper indexing isn't that bad and the hashing won't help me here?
I have looked around at hashing functions and text similarity algorithms, but I haven't found a solution yet...
Consider using the Apache Lucene API. It provides functionality for searching, stemming, tokenization, indexing, and document similarity scoring. It's an open-source implementation of basic best practices in Information Retrieval.
The functionality that seems most useful to you from Lucene is its MoreLikeThis feature, which uses term statistics (TF-IDF) to locate similar documents.
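Independently of Lucene, the cosine-similarity idea from the question can be sketched over character trigram counts; this tolerates small spelling differences and short prefixes without designing a custom hash. A minimal sketch:

import java.util.HashMap;
import java.util.Map;

// Cosine similarity between character trigram frequency vectors of two strings.
public class TrigramCosine {

    private static Map<String, Integer> trigrams(String s) {
        Map<String, Integer> counts = new HashMap<>();
        String padded = "  " + s.toLowerCase().trim() + "  "; // padding captures word edges
        for (int i = 0; i + 3 <= padded.length(); i++) {
            counts.merge(padded.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    public static double similarity(String a, String b) {
        Map<String, Integer> va = trigrams(a), vb = trigrams(b);
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : va.entrySet()) {
            dot += e.getValue() * vb.getOrDefault(e.getKey(), 0);
            na += e.getValue() * e.getValue();
        }
        for (int v : vb.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        System.out.println(similarity("Wikipedia link", "wikipedia links")); // close to 1
        System.out.println(similarity("Wikipedia link", "banana"));          // close to 0
    }
}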

Searching for strings within a string

I have a large number of strings (potentially 1,000,000+), and I want to search another string (a document) to see which of these search strings appears in the document.
Not all of the search strings are a single word, so it's not just a case of searching for each word in the document in the list of search strings.
What's the most efficient way of doing this?
I will be doing this for a large number of documents (coming from a feed), and need to do it fast enough that I can process the documents quicker than they're coming in (a second or two at most ideally).
I can potentially come up with a list of stop words that won't appear in the search strings (e.g. 'the', 'and').
Ideally the solution will be in Java, but that's not a requirement as I can always port the code into Java. If it makes any difference, the search strings are currently stored in a MongoDB.
Take a look at Radix trees and Suffix trees.
There is an example in the concurrent-trees project of how to scan previously unseen documents efficiently for large numbers of keywords stored in that project's inverted radix tree. Example code here.
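A rough sketch of the underlying idea (not the concurrent-trees API): store all search strings in a keyword trie, then walk down the trie from every position of the document; every terminal node reached is a hit. Aho-Corasick improves on this by adding failure links so the document is scanned in a single pass.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Finds which of many keywords (possibly multi-word) occur in a document by
// walking a keyword trie from every document position.
public class KeywordScanner {
    private static class Node {
        Map<Character, Node> next = new HashMap<>();
        String keyword; // non-null if a keyword ends here
    }

    private final Node root = new Node();

    public void addKeyword(String kw) {
        Node cur = root;
        for (char c : kw.toLowerCase().toCharArray()) {
            cur = cur.next.computeIfAbsent(c, k -> new Node());
        }
        cur.keyword = kw;
    }

    public Set<String> scan(String document) {
        Set<String> hits = new HashSet<>();
        String text = document.toLowerCase();
        for (int start = 0; start < text.length(); start++) {
            Node cur = root;
            for (int i = start; i < text.length(); i++) {
                cur = cur.next.get(text.charAt(i));
                if (cur == null) break;             // no keyword continues here
                if (cur.keyword != null) hits.add(cur.keyword);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        KeywordScanner scanner = new KeywordScanner();
        scanner.addKeyword("New York");
        scanner.addKeyword("stock market");
        scanner.addKeyword("rally");
        System.out.println(scanner.scan("The stock market rally continued in New York today."));
    }
}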
Check out High-performance pattern matching algorithms Java

Quick Filter List

Everyone is familiar with this functionality. If you open up the Outlook address book and start typing a name, the list below the search box instantly filters to contain only items that match your query. .NET Reflector has a similar feature when you're browsing types: you start typing, and regardless of how large the underlying assembly you're browsing is, it's near-instantaneous.
I've always kind of wondered what the secret sauce was here. How is it so fast? I imagine there are also different algorithms depending on whether the data is present in memory or needs to be fetched from some external source (i.e. a DB, searching some file, etc.).
I'm not sure if this would be relevant, but if there are resources out there, I'm particularly interested how one might do this with WinForms ... but if you know of general resources, I'm interested in those as well :-)
What is the most common use of the trie data structure?
A trie is basically a tree structure for storing a large list of similar strings; it provides fast lookup of strings (like a hashtable) and allows you to iterate over them in alphabetical order.
(Example trie from http://en.wikipedia.org/wiki/Trie.)
In this case, the Trie stores the strings:
i
in
inn
to
tea
ten
For any prefix that you enter (for example, 't' or 'te'), you can easily look up all of the words that start with that prefix. More importantly, lookups depend on the length of the query string, not on how many strings are stored in the trie. Read the Wikipedia article I referenced to learn more.
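A minimal sketch of that prefix lookup in Java: descend the trie along the typed prefix (cost proportional to the prefix length), then collect every word in the subtree below.

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Trie with prefix search; TreeMap keeps children (and thus results) in alphabetical order.
public class Trie {
    private static class Node {
        TreeMap<Character, Node> children = new TreeMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    public void add(String word) {
        Node cur = root;
        for (char c : word.toCharArray()) {
            cur = cur.children.computeIfAbsent(c, k -> new Node());
        }
        cur.isWord = true;
    }

    public List<String> wordsWithPrefix(String prefix) {
        List<String> results = new ArrayList<>();
        Node cur = root;
        for (char c : prefix.toCharArray()) {    // walk down: O(prefix length)
            cur = cur.children.get(c);
            if (cur == null) return results;     // no stored word has this prefix
        }
        collect(cur, new StringBuilder(prefix), results);
        return results;
    }

    private void collect(Node node, StringBuilder path, List<String> results) {
        if (node.isWord) results.add(path.toString());
        node.children.forEach((c, child) -> {
            path.append(c);
            collect(child, path, results);
            path.deleteCharAt(path.length() - 1);
        });
    }

    public static void main(String[] args) {
        Trie trie = new Trie();
        for (String w : new String[]{"i", "in", "inn", "to", "tea", "ten"}) trie.add(w);
        System.out.println(trie.wordsWithPrefix("te")); // [tea, ten]
        System.out.println(trie.wordsWithPrefix("i"));  // [i, in, inn]
    }
}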
The process is called full text indexing/search.
If you want to play with the algorithms and data structures for this, I would recommend reading Programming Collective Intelligence for a good introduction to the field; if you just want the functionality, I would recommend Lucene.

Resources