How do I determine if a random string sounds like English? - string

I have an algorithm that generates strings based on a list of input words. How do I separate only the strings that sounds like English words? ie. discard RDLO while keeping LORD.
EDIT: To clarify, they do not need to be actual words in the dictionary. They just need to sound like English. For example KEAL would be accepted.

You can build a markov-chain of a huge english text.
Afterwards you can feed words into the markov chain and check how high the probability is that the word is english.
See here: http://en.wikipedia.org/wiki/Markov_chain
At the bottom of the page you can see the markov text generator. What you want is exactly the reverse of it.
In a nutshell: The markov-chain stores for each character the probabilities of which next character will follow. You can extend this idea to two or three characters if you have enough memory.

The easy way with Bayesian filters (Python example from http://sebsauvage.net/python/snyppets/#bayesian)
from reverend.thomas import Bayes
guesser = Bayes()
guesser.train('french','La souris est rentrée dans son trou.')
guesser.train('english','my tailor is rich.')
guesser.train('french','Je ne sais pas si je viendrai demain.')
guesser.train('english','I do not plan to update my website soon.')
>>> print guesser.guess('Jumping out of cliffs it not a good idea.')
[('english', 0.99990000000000001), ('french', 9.9999999999988987e-005)]
>>> print guesser.guess('Demain il fera très probablement chaud.')
[('french', 0.99990000000000001), ('english', 9.9999999999988987e-005)]

You could approach this by tokenizing a candidate string into bigrams—pairs of adjascent letters—and checking each bigram against a table of English bigram frequencies.
Simple: if any bigram is sufficiently low on the frequency table (or outright absent), reject the string as implausible. (String contains a "QZ" bigram? Reject!)
Less simple: calculate the overall plausibility of the whole string in terms of, say, a product of the frequencies of each bigram divided by the mean frequency of a valid English string of that length. This would allow you to both (a) accept a string with an odd low-frequency bigram among otherwise high-frequency bigrams, and (b) reject a string with several individual low-but-not-quite-below-the-threshold bigrams.
Either of those would require some tuning of the threshold(s), the second technique more so than the first.
Doing the same thing with trigrams would likely be more robust, though it'll also likely lead to a somewhat more strict set of "valid" strings. Whether that's a win or not depends on your application.
Bigram and trigram tables based on existing research corpora may be available for free or purchase (I didn't find any freely available but only did a cursory google so far), but you can calculate a bigram or trigram table from yourself from any good-sized corpus of English text. Just crank through each word as a token and tally up each bigram—you might handle this as a hash with a given bigram as the key and an incremented integer counter as the value.
English morphology and English phonetics are (famously!) less than isometric, so this technique might well generate strings that "look" English but present troublesome prounciations. This is another argument for trigrams rather than bigrams—the weirdness produced by analysis of sounds that use several letters in sequence to produce a given phoneme will be reduced if the n-gram spans the whole sound. (Think "plough" or "tsunami", for example.)

It's quite easy to generate English sounding words using a Markov chain. Going backwards is more of a challenge, however. What's the acceptable margin of error for the results? You could always have a list of common letter pairs, triples, etc, and grade them based on that.

You should research "pronounceable" password generators, since they're trying to accomplish the same task.
A Perl solution would be Crypt::PassGen, which you can train with a dictionary (so you could train it to various languages if you need to). It walks through the dictionary and collects statistics on 1, 2, and 3-letter sequences, then builds new "words" based on relative frequencies.

I'd be tempted to run the soundex algorithm over a dictionary of English words and cache the results, then soundex your candidate string and match against the cache.
Depending on performance requirements, you could work out a distance algorithm for soundex codes and accept strings within a certain tolerance.
Soundex is very easy to implement - see Wikipedia for a description of the algorithm.
An example implementation of what you want to do would be:
def soundex(name, len=4):
digits = '01230120022455012623010202'
sndx = ''
fc = ''
for c in name.upper():
if c.isalpha():
if not fc: fc = c
d = digits[ord(c)-ord('A')]
if not sndx or (d != sndx[-1]):
sndx += d
sndx = fc + sndx[1:]
sndx = sndx.replace('0','')
return (sndx + (len * '0'))[:len]
real_words = load_english_dictionary()
soundex_cache = [ soundex(word) for word in real_words ]
if soundex(candidate) in soundex_cache:
print "keep"
else:
print "discard"
Obviously you'll need to provide an implementation of read_english_dictionary.
EDIT: Your example of "KEAL" will be fine, since it has the same soundex code (K400) as "KEEL". You may need to log rejected words and manually verify them if you want to get an idea of failure rate.

Metaphone and Double Metaphone are similar to SOUNDEX, except they may be tuned more toward your goal than SOUNDEX. They're designed to "hash" words based on their phonetic "sound", and are good at doing this for the English language (but not so much other languages and proper names).
One thing to keep in mind with all three algorithms is that they're extremely sensitive to the first letter of your word. For example, if you're trying to figure out if KEAL is English-sounding, you won't find a match to REAL because the initial letters are different.

Do they have to be real English words, or just strings that look like they could be English words?
If they just need to look like possible English words you could do some statistical analysis on some real English texts and work out which combinations of letters occur frequently. Once you've done that you can throw out strings that are too improbable, although some of them may be real words.
Or you could just use a dictionary and reject words that aren't in it (with some allowances for plurals and other variations).

You could compare them to a dictionary (freely available on the internet), but that may be costly in terms of CPU usage. Other than that, I don't know of any other programmatic way to do it.

That sounds like quite an involved task! Off the top of my head, a consonant phoneme needs a vowel either before or after it. Determining what a phoneme is will be quite hard though! You'll probably need to manually write out a list of them. For example, "TR" is ok but not "TD", etc.

I would probably evaluate each word using a SOUNDEX algorithm against a database of english words. If you're doing this on a SQL-server it should be pretty easy to setup a database containing a list of most english words (using a freely available dictionary), and MSSQL server has SOUNDEX implemented as an available search-algorithm.
Obviously you can implement this yourself if you want, in any language - but it might be quite a task.
This way you'd get an evaluation of how much each word sounds like an existing english word, if any, and you could setup some limits for how low you'd want to accept results. You'd probably want to consider how to combine results for multiple words, and you would probably tweak the acceptance-limits based on testing.

I'd suggest looking at the phi test and index of coincidence. http://www.threaded.com/cryptography2.htm

I'd suggest a few simple rules and standard pairs and triplets would be good.
For example, english sounding words tend to follow the pattern of vowel-consonant-vowel, apart from some dipthongs and standard consonant pairs (e.g. th, ie and ei, oo, tr). With a system like that you should strip out almost all words that don't sound like they could be english. You'd find on closer inspection that you will probably strip out a lot of words that do sound like english as well, but you can then start adding rules that allow for a wider range of words and 'train' your algorithm manually.
You won't remove all false negatives (e.g. I don't think you could manage to come up with a rule to include 'rythm' without explicitly coding in that rythm is a word) but it will provide a method of filtering.
I'm also assuming that you want strings that could be english words (they sound reasonable when pronounced) rather than strings that are definitely words with an english meaning.

Related

Fuzzy search algorithm (approximate string matching algorithm)

I wish to create a fuzzy search algorithm.
However, upon hours of research I am really struggling.
I want to create an algorithm that performs a fuzzy search on a list of names of schools.
This is what I have looked at so far:
Most of my research keep pointing to "string metrics" on Google and Stackoverflow such as:
Levenshtein distance
Damerau-Levenshtein distance
Needleman–Wunsch algorithm
However this just gives a score of how similar 2 strings are. The only way I can think of implementing it as a search algorithm is to perform a linear search and executing the string metric algorithm for each string and returning the strings with scores above a certain threshold. (Originally I had my strings stored in a trie tree, but this obviously won't help me here!)
Although this is not such a bad idea for small lists, it would be problematic for lists with lets say a 100,000 names, and the user performed many queries.
Another algorithm I looked at is the Spell-checker method, where you just do a search for all potential misspellings. However this also is highly inefficient as it requires more than 75,000 words for a word of length 7 and error count of just 2.
What I need?
Can someone please suggest me a good efficient fuzzy search algorithm. with:
Name of the algorithm
How it works or a link to how it works
Pro's and cons and when it's best used (optional)
I understand that all algorithms will have their pros and cons and there is no best algorithm.
Considering that you're trying to do a fuzzy search on a list of school names, I don't think you want to go for traditional string similarity like Levenshtein distance. My assumption is that you're taking a user's input (either keyboard input or spoken over the phone), and you want to quickly find the matching school.
Distance metrics tell you how similar two strings are based on substitutions, deletions, and insertions. But those algorithms don't really tell you anything about how similar the strings are as words in a human language.
Consider, for example, the words "smith," "smythe," and "smote". I can go from "smythe" to "smith" in two steps:
smythe -> smithe -> smith
And from "smote" to "smith" in two steps:
smote -> smite -> smith
So the two have the same distance as strings, but as words, they're significantly different. If somebody told you (spoken language) that he was looking for "Symthe College," you'd almost certainly say, "Oh, I think you mean Smith." But if somebody said "Smote College," you wouldn't have any idea what he was talking about.
What you need is a phonetic algorithm like Soundex or Metaphone. Basically, those algorithms break a word down into phonemes and create a representation of how the word is pronounced in spoken language. You can then compare the result against a known list of words to find a match.
Such a system would be much faster than using a distance metric. Consider that with a distance metric, you need to compare the user's input with every word in your list to obtain the distance. That is computationally expensive and the results, as I demonstrated with "smith" and "smote" can be laughably bad.
Using a phonetic algorithm, you create the phoneme representation of each of your known words and place it in a dictionary (a hash map or possibly a trie). That's a one-time startup cost. Then, whenever the user inputs a search term, you create the phoneme representation of his input and look it up in your dictionary. That is a lot faster and produces much better results.
Consider also that when people misspell proper names, they almost always get the first letter right, and more often than not pronouncing the misspelling sounds like the actual word they were trying to spell. If that's the case, then the phonetic algorithms are definitely the way to go.
I wrote an article about how I implemented a fuzzy search:
https://medium.com/#Srekel/implementing-a-fuzzy-search-algorithm-for-the-debuginator-cacc349e6c55
The implementation is in Github and is in the public domain, so feel free to have a look.
https://github.com/Srekel/the-debuginator/blob/master/the_debuginator.h#L1856
The basics of it is: Split all strings you'll be searching for into parts. So if you have paths, then "C:\documents\lol.txt" is maybe "C", "documents", "lol", "txt".
Ensure you lowercase these strings to ensure that you it's case insensitive. (Maybe only do it if the search string is all-lowercase).
Then match your search string against this. In my case I want to match it regardless of order, so "loldoc" would still match the above path even though "lol" comes after "doc".
The matching needs to have some scoring to be good. The most important part I think is consecutive matching, so the more characters directly after one another that match, the better. So "doc" is better than "dcm".
Then you'll likely want to give extra score for a match that's at the start of a part. So you get more points for "doc" than "ocu".
In my case I also give more points for matching the end of a part.
And finally, you may want to consider giving extra points for matching the last part(s). This makes it so that matching the file name/ending scores higher than the folders leading up to it.
You're confusing fuzzy search algorithms with implementation: a fuzzy search of a word may return 400 results of all the words that have Levenshtein distance of, say, 2. But, to the user you have to display only the top 5-10.
Implementation-wise, you'll pre-process all the words in the dictionary and save the results into a DB. The popular words (and their fuzzy-likes) will be saved into cache-layer - so you won't have to hit the DB for every request.
You may add an AI layer that will add the most common spelling mistakes and add them to the DB. And etc.
A simple algorithm for "a kind of fuzzy search"
To be honest, in some cases, fuzzy search is mostly useless and I think that a simpler algorithm can improve the search result while providing the feeling that we are still performing a fuzzy search.
Here is my use case: Filtering down a list of countries using "Fuzzy search".
The list I was working with had two countries starting with Z: Zambia and Zimbabwe.
I was using Fusejs.
In this case, when entering the needle "zam", the result set was having 19 matches and the most relevant one for any human (Zambia) at the bottom of the list. And most of the other countries in the result did not even have the letter z in their name.
This was for a mobile app where you can pick a country from a list. It was supposed to be much like when you have to pick a contact from the phone's contacts. You can filter the contact list by entering some term in the search box.
IMHO, this kind of limited content to search from should not be treated in a way that will have people asking "what the heck?!?".
One might suggest to sort by most relevant match. But that's out of the question in this case because the user will then always have to visually find the "Item of Interest" in the reduced list. Keep in mind that this is supposed to be a filtering tool, not a search engine "à la Google". So the result should be sorted in a predictable way. And before filtering, the sorting was alphabetical. So the filtered list should just be an alphabetically sorted subset of the original list.
So I came up with the following algorithm ...
Grab the needle ... in this case: zam
Insert the .* pattern at the beginning and end of the needle
Insert the .* pattern between each letter of the needle
Perform a Regex search in the haystack using the new needle which is now .*z.*a.*m.*
In this case, the user will have a much expected result by finding everything that has somehow the letters z, a and m appearing in this order. All the letters in the needles will be present in the matches in the same order.
This will also match country names like Mozambique ... which is perfect.
I just think that sometimes, we should not try to kill a fly with a bazooka.
Fuzzy Sort is a javascript library is helpful to perform string matching from a large collection of data.
The following code will helpful to use fuzzy sort in react.js.
Install fuzzy sort through npm,
npm install fuzzysort
Full demo code in react.js
import React from 'react';
import './App.css';
import data from './testdata';
const fuzzysort = require('fuzzysort');
class App extends React.Component {
constructor(props){
super(props)
this.state = {
keyword: '',
results: [],
}
console.log("data: ", data["steam_games"]);
}
search(keyword, category) {
return fuzzysort.go(keyword, data[category]);
}
render(){
return (
<div className="App">
<input type="text" onChange={(e)=> this.setState({keyword: e.target.value})}
value={this.state.keyword}
/>
<button onClick={()=>this.setState({results: this.search(this.state.keyword, "steam_games")})}>Search</button>
{this.state.results !== null && this.state.results.length > 0 ?
<h3>Results:</h3> : null
}
<ul>
{this.state.results.map((item, index) =>{
return(
<li key={index}>{item.score} : {item.target}</li>
)
})
}
</ul>
</div>
);
}
}
export default App;
For more refer FuzzySort
The problem can be broken down into two parts:
1) Choosing the correct string metric.
2) Coming up with a fast implementation of the same.
Choosing the correct metric: This part is largely dependent on your use case. However, I would suggest using a combination of a distance-based score and a phonetic-based encoding for greater accuracy i.e. initially computing a score based on the Levenshtein distance and later using Metaphone or Double Metaphone to complement the results.
Again, you should base your decision on your use case. If you can do with using just the Metaphone or Double Metaphone algorithms, then you needn't worry much about the computational cost.
Implementation: One way to cap down the computational cost is to cluster your data into several small groups based on your use case and load them into a dictionary.
For example, If you can assume that your user enters the first letter of the name correctly, you can store the names based on this invariant in a dictionary.
So, if the user enters the name "National School" you need to compute the fuzzy matching score only for school names starting with the letter "N"

Is there an alternative to SOUNDEX or metaphone that handles names with fewer collisions?

I'm trying to find near duplicates in a large list of names by computing the metaphone key for each string, and then, within each set of possible duplicates, use something like Levenshtein distance to get a more refined estimate of duplicate likelihood.1
However, I'm finding that metaphone is heavily determined by the first characters in the strings, and so if I feed it a long list of people's names, I get huge buckets where everyone's name is "Jennifer X" or "Richard Y", but otherwise haven't got much in common.
If I reverse the string before generating the key, the results are more sensible, in that they group by last name, but still I find that the first names aren't particularly similar.
So is there a similar algorithm that samples more of the input string to produce a sound key, perhaps by using a longer key string?
[1] Ideally, I'd compute the string distances directly, but if my list has 10,000 names, that would mean 100,000,000 computations, which is why I'm trying to divide and conquer by sound keying each name first and only checking for similarities within the buckets. But if there's a better way, I'd love to hear about it!
Try eudex.
It's described as "A blazingly fast phonetic reduction/hashing algorithm."
There are many easy ways to use it, as it encodes a word into a 64-bit integer with most discriminating features towards the MSB. The hamming difference between hashes is also useful as a difference metric between words and spellings.

String matching algorithm : (multi token strings)

I have a dictionary which contains a big number of strings. Each string could have a range of 1 to 4 tokens (words). Example :
Dictionary :
The Shawshank Redemption
The Godfather \
Pulp Fiction
The Dark Knight
Fight Club
Now I have a paragraph and I need to figure out how many strings in the para are part of the dictionary.
Example, when the para below :
The Shawshank Redemption considered the greatest movie ever made according to the IMDB Top 250.For at least the year or two that I have occasionally been checking in on the IMDB Top 250 The Shawshank Redemption has been
battling The Godfather for the top spot.
is run against the dictionary, I should be getting the ones in bold as the ones that are part of the dictionary.
How can I do this with the least dictionary calls.
Thanks
You might be better off using a Trie. A Trie is better suited to finding partial matches (i.e. as you search through the text of a paragraph) that are potentially what you're looking for, as opposed to making a bunch of calls to a dictionary that will mostly fail.
The reason why I think a Trie (or some variation) is appropriate is because it's built to do exactly what you're trying to do:
If you use this (or some modification that has the tokenized words at each node instead of a letter), this would be the most efficient (at least that I know of) in terms of storage and retrieval; Storage because instead of storing the word "The" a couple thousand times in each Dict entry that has that word in the title (as is the case with movie titles), it would be stored once in one of the nodes right under the root. The next word, "Shawshank" would be in a child node, and then "redemption" would be in the next, with a total of 3 lookups; then you would move to the next phrase. If it fails, i.e. the phrase is only "The Shawshank Looper", then you fail after the same 3 lookups, and you move to the failed word, Looper (which as it happens, would also be a child node under the root, and you get a hit. This solution works assuming you're reading a paragraph without mashup movie names).
Using a hash table, you're going to have to split all the words, check the first word, and then while there's no match, keep appending words and checking if THAT phrase is in the dictionary, until you get a hit, or you reach the end of the paragraph. So if you hit a paragraph with no movie titles, you would have as many lookups as there are words in the paragraph.
This is not a complete answer, more like an extended-comment.
In literature it's called "multi-pattern matching problem". Since you mentioned that the set of patterns has millions of elements, Trie based solutions will most probably perform poorly.
As far as I know, in practice traditional string search is used with a lot of heuristics. DNA search, antivirus detection, etc. all of these fields need fast and reliable pattern matching, so there should be decent amount of research done.
I can imagine how Rabin-Karp with rolling-hash functions and some filters (Bloom filter) can be used in order to speed up the process. For example, instead of actually matching the substrings, you could first filter (e.g. with weak-hashes) and then actually verify, thus reducing number of verifications needed. Plus this should reduce the work done with the original dictionary itself, as you would store it's hashes, or other filters.
In Python:
import re
movies={1:'The Shawshank Redemption', 2:'The Godfather', 3:'Pretty Woman', 4:'Pulp Fiction'}
text = 'The Shawshank Redemption considered the greatest movie ever made according to the IMDB Top 250.For at least the year or two that I have occasionally been checking in on the IMDB Top 250 The Shawshank Redemption has been battling The Godfather for the top spot.'
repl_str ='(?P<title>' + '|'.join(['(?:%s)' %movie for movie in movies.values()]) + ')'
result = re.sub(repl_str, '<b>\g<title></b>',text)
Basically it consists of forming up a big substitution instruction string out of your dict values.
I don't know whether regex and sub have a limitation in the size of the substitution instructions you give them though. You might want to check.
lai

What is the best way to classify following words in POS tagging?

I am doing POS tagging. Given the following tokens in the training set, is it better to consider each token as Word1/POStag and Word2/POStag or consider them as one word that is Word1/Word2/POStag ?
Examples: (the POSTag is not required to be included)
Bard/EMS
Interstate/Johnson
Polo/Ralph
IBC/Donoghue
ISC/Bunker
Bendix/King
mystery/comedy
Jeep/Eagle
B/T
Hawaiian/Japanese
IBM/PC
Princeton/Newport
editing/electronic
Heller/Breene
Davis/Zweig
Fleet/Norstar
a/k/a
1/2
Any suggestion is appreciated.
The examples don't seem to fall into one category with respect to the use of the slash -- a/k/a is a phrase acronym, 1/2 is a number, mystery/comedy indicates something in between the two words, etc.
I feel there is no treatment of the component words that would work for all the cases in question, and therefore the better option is to handle them as unique words. At decoding stage, when the tagger will probably be presented with more previously unseen examples of such words, the decision can often be made based on the context, rather than the word itself.

How to understand and add syllable break in this example?

I am new in machine learning and computing probabilities. This is an example from Lingpipe for adding syllabification in a word by training data.
Given a source model p(h) for hyphenated words, and a channel model p(w|h) defined so that p(w|h) = 1 if w is equal to h with the hyphens removed and 0 otherwise. We then seek to find the most likely source message h to have produced message w by:
ARGMAXh p(h|w) = ARGMAXh p(w|h) p(h) / p(w)
= ARGMAXh p(w|h) p(h)
= ARGMAXh s.t. strip(h)=w p(h)
where we use strip(h) = w to mean that w is equal to h with the hyphenations stripped out (in Java terms, h.replaceAll(" ","").equals(w)). Thus with a deterministic channel, we wind up looking for the most likely hyphenation h according to p(h), restricting our search to h that produce w when the hyphens are stripped out.
I do not understand how to use it to build a syllabification model.
If there is a training set containing:
a bid jan
a bide
a bie
a bil i ty
a bim e lech
How to have a model that will syllabify words? I mean what to be computed in order to find possible syllable breaks of a new word.
First compute what? then compute what? Can you please be specific with example?
Thanks a lot.
The method described in the article is based on a statistical law allowing to compute the correct value observing a noisy value. In other words, non-syllabified word is noisy or incorrect, like picnic, and the goal is finding a probably correct value, which is pic-nic.
Here is an excellent video lesson on very this topic (scroll to 1:25, but the whole set of lectures worth watching).
This method is specifically useful for word delimiting, but some use it for syllabification as well. Chinese language has space delimiters only for logical constructs, but most words follow each other with no delimiters. However, each character is a syllable, no exception.
There are other languages that have more complicated grammar. For instance, Thai has no spaces between the words, but each syllable may be constructed from several symbols, e.g. สวัสดี -> ส-วัส-ดี. Rule-based syllabification may be hard but possible.
As per English, I would not bother with Markov chains and N-grams and instead just use several simple rules that give pretty good match ratio (not perfect, however):
Two consonants between two vowels VCCV - split between them VC-CV as in cof-fee, pic-nic, except the "cluster consonant" that represents a single sound: meth-od, Ro-chester, hang-out
Three or more consonants between the vowels VCCCV - split keeping the blends together as in mon-ster or child-ren (this seems the most difficult as you cannot avoid a dictionary)
One consonant between two vowels VCV - split after the first vowel V-CV as in ba-con, a-rid
The rule above also has an exception based on blends: cour-age, play-time
Two vowels together VV - split between, except they represent a "cluster vowel": po-em, but glacier, earl-ier
I would start with the "main" rules first, and then cover them with "guard" rules preventing cluster vowels and consonants to be split. Also, there would be an obvious guard rule to prevent a single consonant to become a syllable. When done, I would have added another guard rule based on a dictionary.

Resources