String matching algorithm : (multi token strings) - string

I have a dictionary which contains a big number of strings. Each string could have a range of 1 to 4 tokens (words). Example :
Dictionary :
The Shawshank Redemption
The Godfather \
Pulp Fiction
The Dark Knight
Fight Club
Now I have a paragraph and I need to figure out how many strings in the para are part of the dictionary.
Example, when the para below :
The Shawshank Redemption considered the greatest movie ever made according to the IMDB Top 250.For at least the year or two that I have occasionally been checking in on the IMDB Top 250 The Shawshank Redemption has been
battling The Godfather for the top spot.
is run against the dictionary, I should be getting the ones in bold as the ones that are part of the dictionary.
How can I do this with the least dictionary calls.
Thanks

You might be better off using a Trie. A Trie is better suited to finding partial matches (i.e. as you search through the text of a paragraph) that are potentially what you're looking for, as opposed to making a bunch of calls to a dictionary that will mostly fail.
The reason why I think a Trie (or some variation) is appropriate is because it's built to do exactly what you're trying to do:
If you use this (or some modification that has the tokenized words at each node instead of a letter), this would be the most efficient (at least that I know of) in terms of storage and retrieval; Storage because instead of storing the word "The" a couple thousand times in each Dict entry that has that word in the title (as is the case with movie titles), it would be stored once in one of the nodes right under the root. The next word, "Shawshank" would be in a child node, and then "redemption" would be in the next, with a total of 3 lookups; then you would move to the next phrase. If it fails, i.e. the phrase is only "The Shawshank Looper", then you fail after the same 3 lookups, and you move to the failed word, Looper (which as it happens, would also be a child node under the root, and you get a hit. This solution works assuming you're reading a paragraph without mashup movie names).
Using a hash table, you're going to have to split all the words, check the first word, and then while there's no match, keep appending words and checking if THAT phrase is in the dictionary, until you get a hit, or you reach the end of the paragraph. So if you hit a paragraph with no movie titles, you would have as many lookups as there are words in the paragraph.

This is not a complete answer, more like an extended-comment.
In literature it's called "multi-pattern matching problem". Since you mentioned that the set of patterns has millions of elements, Trie based solutions will most probably perform poorly.
As far as I know, in practice traditional string search is used with a lot of heuristics. DNA search, antivirus detection, etc. all of these fields need fast and reliable pattern matching, so there should be decent amount of research done.
I can imagine how Rabin-Karp with rolling-hash functions and some filters (Bloom filter) can be used in order to speed up the process. For example, instead of actually matching the substrings, you could first filter (e.g. with weak-hashes) and then actually verify, thus reducing number of verifications needed. Plus this should reduce the work done with the original dictionary itself, as you would store it's hashes, or other filters.

In Python:
import re
movies={1:'The Shawshank Redemption', 2:'The Godfather', 3:'Pretty Woman', 4:'Pulp Fiction'}
text = 'The Shawshank Redemption considered the greatest movie ever made according to the IMDB Top 250.For at least the year or two that I have occasionally been checking in on the IMDB Top 250 The Shawshank Redemption has been battling The Godfather for the top spot.'
repl_str ='(?P<title>' + '|'.join(['(?:%s)' %movie for movie in movies.values()]) + ')'
result = re.sub(repl_str, '<b>\g<title></b>',text)
Basically it consists of forming up a big substitution instruction string out of your dict values.
I don't know whether regex and sub have a limitation in the size of the substitution instructions you give them though. You might want to check.
lai

Related

Finding the most similar string among a set of millions of strings

Let's say I have a dictionary (word list) of millions upon millions of words. Given a query word, I want to find the word from that huge list that is most similar.
So let's say my query is elepant, then the result would most likely be elephant.
If my word is fentist, the result will probably be dentist.
Of course assuming both elephant and dentist are present in my initial word list.
What kind of index, data structure or algorithm can I use for this so that the query is fast? Hopefully complexity of O(log N).
What I have: The most naive thing to do is to create a "distance function" (which computes the "distance" between two words, in terms of how different they are) and then in O(n) compare the query with every word in the list, and return the one with the closest distance. But I wouldn't use this because it's slow.
The problem you're describing is a Nearest Neighbor Search (NNS). There are two main methods of solving NNS problems: exact and approximate.
If you need an exact solution, I would recommend a metric tree, such as the M-tree, the MVP-tree, and the BK-tree. These trees take advantage of the triangle inequality to speed up search.
If you're willing to accept an approximate solution, there are much faster algorithms. The current state of the art for approximate methods is Hierarchical Navigable Small World (hnsw). The Non-Metric Space Library (nmslib) provides an efficient implementation of hnsw as well as several other approximate NNS methods.
(You can compute the Levenshtein distance with Hirschberg's algorithm)
I made similar algorythm some time ago
Idea is to have an array char[255] with characters
and values is a list of words hashes (word ids) that contains this character
When you are searching 'dele....'
search(d) will return empty list
search(e) will find everything with character e, including elephant (two times, as it have two 'e')
search(l) will brings you new list, and you need to combine this list with results from previous step
...
at the end of input you will have a list
then you can try to do group by wordHash and order by desc by count
Also intresting thing, if your input is missing one or more characters, you will just receive empty list in the middle of the search and it will not affect this idea
My initial algorythm was without ordering, and i was storing for every character wordId and lineNumber and char position.
My main problem was that i want to search
with ee to find 'elephant'
with eleant to find 'elephant'
with antph to find 'elephant'
Every words was actually a line from file, so it's often was very long
And number of files and lines was big
I wanted quick search for directories with more than 1gb text files
So it was a problem even store them in memory, for this idea you need 3 parts
function to fill your cache
function to find by char from input
function to filter and maybe order results (i didn't use ordering, as i was trying to fill my cache in same order as i read the file, and i wanted to put lines that contains input in the same order upper )
I hope it make sense

Fuzzy search algorithm (approximate string matching algorithm)

I wish to create a fuzzy search algorithm.
However, upon hours of research I am really struggling.
I want to create an algorithm that performs a fuzzy search on a list of names of schools.
This is what I have looked at so far:
Most of my research keep pointing to "string metrics" on Google and Stackoverflow such as:
Levenshtein distance
Damerau-Levenshtein distance
Needleman–Wunsch algorithm
However this just gives a score of how similar 2 strings are. The only way I can think of implementing it as a search algorithm is to perform a linear search and executing the string metric algorithm for each string and returning the strings with scores above a certain threshold. (Originally I had my strings stored in a trie tree, but this obviously won't help me here!)
Although this is not such a bad idea for small lists, it would be problematic for lists with lets say a 100,000 names, and the user performed many queries.
Another algorithm I looked at is the Spell-checker method, where you just do a search for all potential misspellings. However this also is highly inefficient as it requires more than 75,000 words for a word of length 7 and error count of just 2.
What I need?
Can someone please suggest me a good efficient fuzzy search algorithm. with:
Name of the algorithm
How it works or a link to how it works
Pro's and cons and when it's best used (optional)
I understand that all algorithms will have their pros and cons and there is no best algorithm.
Considering that you're trying to do a fuzzy search on a list of school names, I don't think you want to go for traditional string similarity like Levenshtein distance. My assumption is that you're taking a user's input (either keyboard input or spoken over the phone), and you want to quickly find the matching school.
Distance metrics tell you how similar two strings are based on substitutions, deletions, and insertions. But those algorithms don't really tell you anything about how similar the strings are as words in a human language.
Consider, for example, the words "smith," "smythe," and "smote". I can go from "smythe" to "smith" in two steps:
smythe -> smithe -> smith
And from "smote" to "smith" in two steps:
smote -> smite -> smith
So the two have the same distance as strings, but as words, they're significantly different. If somebody told you (spoken language) that he was looking for "Symthe College," you'd almost certainly say, "Oh, I think you mean Smith." But if somebody said "Smote College," you wouldn't have any idea what he was talking about.
What you need is a phonetic algorithm like Soundex or Metaphone. Basically, those algorithms break a word down into phonemes and create a representation of how the word is pronounced in spoken language. You can then compare the result against a known list of words to find a match.
Such a system would be much faster than using a distance metric. Consider that with a distance metric, you need to compare the user's input with every word in your list to obtain the distance. That is computationally expensive and the results, as I demonstrated with "smith" and "smote" can be laughably bad.
Using a phonetic algorithm, you create the phoneme representation of each of your known words and place it in a dictionary (a hash map or possibly a trie). That's a one-time startup cost. Then, whenever the user inputs a search term, you create the phoneme representation of his input and look it up in your dictionary. That is a lot faster and produces much better results.
Consider also that when people misspell proper names, they almost always get the first letter right, and more often than not pronouncing the misspelling sounds like the actual word they were trying to spell. If that's the case, then the phonetic algorithms are definitely the way to go.
I wrote an article about how I implemented a fuzzy search:
https://medium.com/#Srekel/implementing-a-fuzzy-search-algorithm-for-the-debuginator-cacc349e6c55
The implementation is in Github and is in the public domain, so feel free to have a look.
https://github.com/Srekel/the-debuginator/blob/master/the_debuginator.h#L1856
The basics of it is: Split all strings you'll be searching for into parts. So if you have paths, then "C:\documents\lol.txt" is maybe "C", "documents", "lol", "txt".
Ensure you lowercase these strings to ensure that you it's case insensitive. (Maybe only do it if the search string is all-lowercase).
Then match your search string against this. In my case I want to match it regardless of order, so "loldoc" would still match the above path even though "lol" comes after "doc".
The matching needs to have some scoring to be good. The most important part I think is consecutive matching, so the more characters directly after one another that match, the better. So "doc" is better than "dcm".
Then you'll likely want to give extra score for a match that's at the start of a part. So you get more points for "doc" than "ocu".
In my case I also give more points for matching the end of a part.
And finally, you may want to consider giving extra points for matching the last part(s). This makes it so that matching the file name/ending scores higher than the folders leading up to it.
You're confusing fuzzy search algorithms with implementation: a fuzzy search of a word may return 400 results of all the words that have Levenshtein distance of, say, 2. But, to the user you have to display only the top 5-10.
Implementation-wise, you'll pre-process all the words in the dictionary and save the results into a DB. The popular words (and their fuzzy-likes) will be saved into cache-layer - so you won't have to hit the DB for every request.
You may add an AI layer that will add the most common spelling mistakes and add them to the DB. And etc.
A simple algorithm for "a kind of fuzzy search"
To be honest, in some cases, fuzzy search is mostly useless and I think that a simpler algorithm can improve the search result while providing the feeling that we are still performing a fuzzy search.
Here is my use case: Filtering down a list of countries using "Fuzzy search".
The list I was working with had two countries starting with Z: Zambia and Zimbabwe.
I was using Fusejs.
In this case, when entering the needle "zam", the result set was having 19 matches and the most relevant one for any human (Zambia) at the bottom of the list. And most of the other countries in the result did not even have the letter z in their name.
This was for a mobile app where you can pick a country from a list. It was supposed to be much like when you have to pick a contact from the phone's contacts. You can filter the contact list by entering some term in the search box.
IMHO, this kind of limited content to search from should not be treated in a way that will have people asking "what the heck?!?".
One might suggest to sort by most relevant match. But that's out of the question in this case because the user will then always have to visually find the "Item of Interest" in the reduced list. Keep in mind that this is supposed to be a filtering tool, not a search engine "à la Google". So the result should be sorted in a predictable way. And before filtering, the sorting was alphabetical. So the filtered list should just be an alphabetically sorted subset of the original list.
So I came up with the following algorithm ...
Grab the needle ... in this case: zam
Insert the .* pattern at the beginning and end of the needle
Insert the .* pattern between each letter of the needle
Perform a Regex search in the haystack using the new needle which is now .*z.*a.*m.*
In this case, the user will have a much expected result by finding everything that has somehow the letters z, a and m appearing in this order. All the letters in the needles will be present in the matches in the same order.
This will also match country names like Mozambique ... which is perfect.
I just think that sometimes, we should not try to kill a fly with a bazooka.
Fuzzy Sort is a javascript library is helpful to perform string matching from a large collection of data.
The following code will helpful to use fuzzy sort in react.js.
Install fuzzy sort through npm,
npm install fuzzysort
Full demo code in react.js
import React from 'react';
import './App.css';
import data from './testdata';
const fuzzysort = require('fuzzysort');
class App extends React.Component {
constructor(props){
super(props)
this.state = {
keyword: '',
results: [],
}
console.log("data: ", data["steam_games"]);
}
search(keyword, category) {
return fuzzysort.go(keyword, data[category]);
}
render(){
return (
<div className="App">
<input type="text" onChange={(e)=> this.setState({keyword: e.target.value})}
value={this.state.keyword}
/>
<button onClick={()=>this.setState({results: this.search(this.state.keyword, "steam_games")})}>Search</button>
{this.state.results !== null && this.state.results.length > 0 ?
<h3>Results:</h3> : null
}
<ul>
{this.state.results.map((item, index) =>{
return(
<li key={index}>{item.score} : {item.target}</li>
)
})
}
</ul>
</div>
);
}
}
export default App;
For more refer FuzzySort
The problem can be broken down into two parts:
1) Choosing the correct string metric.
2) Coming up with a fast implementation of the same.
Choosing the correct metric: This part is largely dependent on your use case. However, I would suggest using a combination of a distance-based score and a phonetic-based encoding for greater accuracy i.e. initially computing a score based on the Levenshtein distance and later using Metaphone or Double Metaphone to complement the results.
Again, you should base your decision on your use case. If you can do with using just the Metaphone or Double Metaphone algorithms, then you needn't worry much about the computational cost.
Implementation: One way to cap down the computational cost is to cluster your data into several small groups based on your use case and load them into a dictionary.
For example, If you can assume that your user enters the first letter of the name correctly, you can store the names based on this invariant in a dictionary.
So, if the user enters the name "National School" you need to compute the fuzzy matching score only for school names starting with the letter "N"

Algorithm (or pointer to literature) sought for string processing challenge

A group of amusing students write essays exclusively by plagiarising portions of the complete works of WIlliam Shakespere. At one end of the scale, an essay might exclusively consist a verbatim copy of a soliloquy... at the other, one might see work so novel that - while using a common alphabet - no two adjacent characters in the essay were used adjacently by Will.
Essays need to be graded. A score of 1 is assigned to any essay which can be found (character-by-character identical) in the plain-text of the complete works. A score of 2 is assigned to any work that can be successfully constructed from no fewer than two distinct (character-by-character identical) passages in the complete works, and so on... up to the limit - for an essay with N characters - which scores N if, and only if, no two adjacent characters in the essay were also placed adjacently in the complete works.
The challenge is to implement a program which can efficiently (and accurately) score essays. While any (practicable) data-structure to represent the complete works is acceptable - the essays are presented as ASCII strings.
Having considered this teasing question for a while, I came to the conclusion that it is much harder than it sounds. The naive solution, for an essay of length N, involves 2**(N-1) traversals of the complete works - which is far too inefficient to be practical.
While, obviously, I'm interested in suggested solutions - I'd also appreciate pointers to any literature that deals with this, or any similar, problem.
CLARIFICATIONS
Perhaps some examples (ranging over much shorter strings) will help clarify the 'score' for 'essays'?
Assume Shakespere's complete works are abridged to:
"The quick brown fox jumps over the lazy dog."
Essays scoring 1 include "own fox jump" and "The quick brow". The essay "jogging" scores 6 (despite being short) because it can't be represented in fewer than 6 segments of the complete works... It can be segmented into six strings that are all substrings of the complete works as follows: "[j][og][g][i][n][g]". N.B. Establishing scores for this short example is trivial compared to the original problem - because, in this example "complete works" - there is very little repetition.
Hopefully, this example segmentation helps clarify the 2*(N-1) substring searches in the complete works. If we consider the segmentation, the (N-1) gaps between the N characters in the essay may either be a gap between segments, or not... resulting in ~ 2*(N-1) substring searches of the complete works to test each segmentation hypothesis.
An (N)DFA would be a wonderful solution - if it were practical. I can see how to construct something that solved 'substring matching' in this way - but not scoring. The state space for scoring, on the surface, at least, seems wildly too large (for any substantial complete works of Shakespere.) I'd welcome any explanation that undermines my assumptions that the (N)DFA would be too large to be practical to compute/store.
A general approach for plagiarism detection is to append the student's text to the source text separated by a character not occurring in either and then to build either a suffix tree or suffix array. This will allow you to find in linear time large substrings of the student's text which also appear in the source text.
I find it difficult to be more specific because I do not understand your explanation of the score - the method above would be good for finding the longest stretch in the students work which is an exact quote, but I don't understand your N - is it the number of distinct sections of source text needed to construct the student's text?
If so, there may be a dynamic programming approach. At step k, we work out the least number of distinct sections of source text needed to construct first k characters of the student's text. Using a suffix array built just from the source text or otherwise, we find the longest match between the source text and characters x..k of the student's text, where x is of course as small as possible. Then the least number of sections of source text needed to construct the first k characters of student text is the least needed to construct 1..x-1 (which we have already worked out) plus 1. By running this process for k=1..the length of the student text we find the least number of sections of source text needed to reconstruct the whole of it.
(Or you could just search StackOverflow for the student's text, on the grounds that students never do anything these days except post their question on StackOverflow :-)).
I claim that repeatedly moving along the target string from left to right, using a suffix array or tree to find the longest match at any time, will find the smallest number of different strings from the source text that produces the target string. I originally found this by looking for a dynamic programming recursion but, as pointed out by Evgeny Kluev, this is actually a greedy algorithm, so let's try and prove this with a typical greedy algorithm proof.
Suppose not. Then there is a solution better than the one you get by going for the longest match every time you run off the end of the current match. Compare the two proposed solutions from left to right and look for the first time when the non-greedy solution differs from the greedy solution. If there are multiple non-greedy solutions that do better than the greedy solution I am going to demand that we consider the one that differs from the greedy solution at the last possible instant.
If the non-greedy solution is going to do better than the greedy solution, and there isn't a non-greedy solution that does better and differs later, then the non-greedy solution must find that, in return for breaking off its first match earlier than the greedy solution, it can carry on its next match for longer than the greedy solution. If it can't, it might somehow do better than the greedy solution, but not in this section, which means there is a better non-greedy solution which sticks with the greedy solution until the end of our non-greedy solution's second matching section, which is against our requirement that we want the non-greedy better solution that sticks with the greedy one as long as possible. So we have to assume that, in return for breaking off the first match early, the non-greedy solution gets to carry on its second match longer. But this doesn't work, because, when the greedy solution finally has to finish using its first match, it can jump on to the same section of matching text that the non-greedy solution is using, just entering that section later than the non-greedy solution did, but carrying on for at least as long as the non-greedy solution. So there is no non-greedy solution that does better than the greedy solution and the greedy solution is optimal.
Have you considered using N-Grams to solve this problem?
http://en.wikipedia.org/wiki/N-gram
First read the complete works of Shakespeare and build a trie. Then process the string left to right. We can greedily take the longest substring that matches one in the data because we want the minimum number of strings, so there is no factor of 2^N. The second part is dirt cheap O(N).
The depth of the trie is limited by the available space. With a gigabyte of ram you could reasonably expect to exhaustively cover Shakespearean English string of length at least 5 or 6. I would require that the leaf nodes are unique (which also gives a rule for constructing the trie) and keep a pointer to their place in the actual works, so you have access to the continuation.
This feels like a problem of partial matching a very large regular expression.
If so it can be solved by a very large non deterministic finite state automata or maybe more broadly put as a graph representing for every character in the works of Shakespeare, all the possible next characters.
If necessary for efficiency reasons the NDFA is guaranteed to be convertible to a DFA. But then this construction can give rise to 2^n states, maybe this is what you were alluding to?
This aspect of the complexity does not really worry me. The NDFA will have M + C states; one state for each character and C states where C = 26*2 + #punctuation to connect to each of the M states to allow the algorithm to (re)start when there are 0 matched characters. The question is would the corresponding DFA have O(2^M) states and if so is it necessary to make that DFA, theoretically it's not necessary. However, consider that in the construction, each state will have one and only one transition to exactly one other state (the next state corresponding to the next character in that work). We would expect that each one of the start states will be connected to on average M/C states, but in the worst case M meaning the NDFA will have to track at most M simultaneous states. That's a large number but not an impossibly large number for computers these days.
The score would be derived by initializing to 1 and then it would incremented every time a non-accepting state is reached.
It's true that one of the approaches to string searching is building a DFA. In fact, for the majority of the string search algorithms, it looks like a small modification on failure to match (increment counter) and success (keep going) can serve as a general strategy.

Algorithm for string processing

I am looking for a algorithm for string processing, I have searched for it but couldn't find a algorithm that meets my requirements. I will explain what the algorithm should do with an example.
There are two sets of word sets defined as shown below:
**Main_Words**: swimming, driving, playing
**Words_in_front**: I am, I enjoy, I love, I am going to go
The program will search through a huge set of words as soon it finds a word that is defined in Main_Words it will check the words in front of that Word to see if it has any matching words defined in Words_in_front.
i.e If the program encounters the word "Swimming" it has to check if the words in front of the word "Swimming" are one of these: I am, I enjoy, I love, I am going to go.
Are there any algorithms that can do this?
A straightforward way to do this would be to just do a linear scan through the text, always keeping track of the last N+1 words (or characters) you see, where N is the number of words (or characters) in the longest phrase contained in your words_in_front collection. When you have a "main word", you can just check whether the sequence of N words/characters before it ends with any of the prefixes you have.
This would be a bit faster if you transformed your words_in_front set into a nicer data structure, such as a hashmap (perhaps keyed by last letter in the phrase..) or a prefix/suffix tree of some sort, so you wouldn't have to do an .endsWith over every single member of the set of prefixes each time you have a matching "main word." As was stated in another answer, there is much room for optimization and a few other possible implementations, but there's a start.
Create a map/dictionary/hash/associative array (whatever is defined in your language) with key in Main_Words and Words_in_front are the linked list attached to the entry pointed by the key. Whenever you encounter a word matching a key, go to the table and see if in the attached list there are words that match what you have in front.
That's the basic idea, it can be optimized for both speed and space.
You should be able to build a regular expression along these lines:
I (am|enjoy|love|am going to go) (swimming|driving|playing)

How do I determine if a random string sounds like English?

I have an algorithm that generates strings based on a list of input words. How do I separate only the strings that sounds like English words? ie. discard RDLO while keeping LORD.
EDIT: To clarify, they do not need to be actual words in the dictionary. They just need to sound like English. For example KEAL would be accepted.
You can build a markov-chain of a huge english text.
Afterwards you can feed words into the markov chain and check how high the probability is that the word is english.
See here: http://en.wikipedia.org/wiki/Markov_chain
At the bottom of the page you can see the markov text generator. What you want is exactly the reverse of it.
In a nutshell: The markov-chain stores for each character the probabilities of which next character will follow. You can extend this idea to two or three characters if you have enough memory.
The easy way with Bayesian filters (Python example from http://sebsauvage.net/python/snyppets/#bayesian)
from reverend.thomas import Bayes
guesser = Bayes()
guesser.train('french','La souris est rentrée dans son trou.')
guesser.train('english','my tailor is rich.')
guesser.train('french','Je ne sais pas si je viendrai demain.')
guesser.train('english','I do not plan to update my website soon.')
>>> print guesser.guess('Jumping out of cliffs it not a good idea.')
[('english', 0.99990000000000001), ('french', 9.9999999999988987e-005)]
>>> print guesser.guess('Demain il fera très probablement chaud.')
[('french', 0.99990000000000001), ('english', 9.9999999999988987e-005)]
You could approach this by tokenizing a candidate string into bigrams—pairs of adjascent letters—and checking each bigram against a table of English bigram frequencies.
Simple: if any bigram is sufficiently low on the frequency table (or outright absent), reject the string as implausible. (String contains a "QZ" bigram? Reject!)
Less simple: calculate the overall plausibility of the whole string in terms of, say, a product of the frequencies of each bigram divided by the mean frequency of a valid English string of that length. This would allow you to both (a) accept a string with an odd low-frequency bigram among otherwise high-frequency bigrams, and (b) reject a string with several individual low-but-not-quite-below-the-threshold bigrams.
Either of those would require some tuning of the threshold(s), the second technique more so than the first.
Doing the same thing with trigrams would likely be more robust, though it'll also likely lead to a somewhat more strict set of "valid" strings. Whether that's a win or not depends on your application.
Bigram and trigram tables based on existing research corpora may be available for free or purchase (I didn't find any freely available but only did a cursory google so far), but you can calculate a bigram or trigram table from yourself from any good-sized corpus of English text. Just crank through each word as a token and tally up each bigram—you might handle this as a hash with a given bigram as the key and an incremented integer counter as the value.
English morphology and English phonetics are (famously!) less than isometric, so this technique might well generate strings that "look" English but present troublesome prounciations. This is another argument for trigrams rather than bigrams—the weirdness produced by analysis of sounds that use several letters in sequence to produce a given phoneme will be reduced if the n-gram spans the whole sound. (Think "plough" or "tsunami", for example.)
It's quite easy to generate English sounding words using a Markov chain. Going backwards is more of a challenge, however. What's the acceptable margin of error for the results? You could always have a list of common letter pairs, triples, etc, and grade them based on that.
You should research "pronounceable" password generators, since they're trying to accomplish the same task.
A Perl solution would be Crypt::PassGen, which you can train with a dictionary (so you could train it to various languages if you need to). It walks through the dictionary and collects statistics on 1, 2, and 3-letter sequences, then builds new "words" based on relative frequencies.
I'd be tempted to run the soundex algorithm over a dictionary of English words and cache the results, then soundex your candidate string and match against the cache.
Depending on performance requirements, you could work out a distance algorithm for soundex codes and accept strings within a certain tolerance.
Soundex is very easy to implement - see Wikipedia for a description of the algorithm.
An example implementation of what you want to do would be:
def soundex(name, len=4):
digits = '01230120022455012623010202'
sndx = ''
fc = ''
for c in name.upper():
if c.isalpha():
if not fc: fc = c
d = digits[ord(c)-ord('A')]
if not sndx or (d != sndx[-1]):
sndx += d
sndx = fc + sndx[1:]
sndx = sndx.replace('0','')
return (sndx + (len * '0'))[:len]
real_words = load_english_dictionary()
soundex_cache = [ soundex(word) for word in real_words ]
if soundex(candidate) in soundex_cache:
print "keep"
else:
print "discard"
Obviously you'll need to provide an implementation of read_english_dictionary.
EDIT: Your example of "KEAL" will be fine, since it has the same soundex code (K400) as "KEEL". You may need to log rejected words and manually verify them if you want to get an idea of failure rate.
Metaphone and Double Metaphone are similar to SOUNDEX, except they may be tuned more toward your goal than SOUNDEX. They're designed to "hash" words based on their phonetic "sound", and are good at doing this for the English language (but not so much other languages and proper names).
One thing to keep in mind with all three algorithms is that they're extremely sensitive to the first letter of your word. For example, if you're trying to figure out if KEAL is English-sounding, you won't find a match to REAL because the initial letters are different.
Do they have to be real English words, or just strings that look like they could be English words?
If they just need to look like possible English words you could do some statistical analysis on some real English texts and work out which combinations of letters occur frequently. Once you've done that you can throw out strings that are too improbable, although some of them may be real words.
Or you could just use a dictionary and reject words that aren't in it (with some allowances for plurals and other variations).
You could compare them to a dictionary (freely available on the internet), but that may be costly in terms of CPU usage. Other than that, I don't know of any other programmatic way to do it.
That sounds like quite an involved task! Off the top of my head, a consonant phoneme needs a vowel either before or after it. Determining what a phoneme is will be quite hard though! You'll probably need to manually write out a list of them. For example, "TR" is ok but not "TD", etc.
I would probably evaluate each word using a SOUNDEX algorithm against a database of english words. If you're doing this on a SQL-server it should be pretty easy to setup a database containing a list of most english words (using a freely available dictionary), and MSSQL server has SOUNDEX implemented as an available search-algorithm.
Obviously you can implement this yourself if you want, in any language - but it might be quite a task.
This way you'd get an evaluation of how much each word sounds like an existing english word, if any, and you could setup some limits for how low you'd want to accept results. You'd probably want to consider how to combine results for multiple words, and you would probably tweak the acceptance-limits based on testing.
I'd suggest looking at the phi test and index of coincidence. http://www.threaded.com/cryptography2.htm
I'd suggest a few simple rules and standard pairs and triplets would be good.
For example, english sounding words tend to follow the pattern of vowel-consonant-vowel, apart from some dipthongs and standard consonant pairs (e.g. th, ie and ei, oo, tr). With a system like that you should strip out almost all words that don't sound like they could be english. You'd find on closer inspection that you will probably strip out a lot of words that do sound like english as well, but you can then start adding rules that allow for a wider range of words and 'train' your algorithm manually.
You won't remove all false negatives (e.g. I don't think you could manage to come up with a rule to include 'rythm' without explicitly coding in that rythm is a word) but it will provide a method of filtering.
I'm also assuming that you want strings that could be english words (they sound reasonable when pronounced) rather than strings that are definitely words with an english meaning.

Resources