How to check SOUNDEX in PHP when PHP's soundex() outputs only 3 digits

I am trying to compare company names using SOUNDEX, but PHP's soundex() only outputs a short code (a letter followed by three digits), so the comparisons aren't very accurate. Is there a way to get a better soundex output so that the results are more accurate?

Try using metaphone() instead.

Depending on what you are SOUNDEXing against, it might be cheaper to run SOUNDEX() at the database level:
$result = $db->query("
    SELECT
        company.id,
        company.name,
        SOUNDEX(company.name) AS soundex
    FROM
        company
    WHERE
        company.name SOUNDS LIKE '$companyName'
");

Related

Comparing big texts in Python

I'm not good at math, so I'm posting my question here. I hope it won't get tons of dislikes.
I have a lot of big texts, from 200,000 to 1,000,000 characters each, and I need to compare them to find duplicates. I decided to use a fingerprint (an MD5 hash) and then compare the fingerprints. But then I realised a different way of comparing: counting the characters in each text.
So which one will work faster, and which one will use less CPU power?
P.S. IMPORTANT: there CANNOT be 2 or more different texts with the same character count.
Taking the length of the string will be a lot faster and use less CPU power.
This is because it is only one operation, it is easy for Python, and it has the benefit of being a built-in function.
To compute an MD5 hash, however, calculations need to be done on each character to produce the overall hash, which will take a lot longer.
If the texts are exact duplicates, you can take the hashes, or even faster, the lengths of the texts, and sort them (coupled with the id of each text, or with a reference to the text itself), identifying repeated lengths (or hashes).
For the sorting you can use a fast sorting algorithm, for example quicksort.
In fact, there is even a special *nix command-line utility for sorting items with support for duplicate removal: sort -u.
If the texts are near duplicates rather than exact ones, things get harder: you need to use special duplication-aware hashing algorithms and sort the resulting hashes with a similarity metric, so that two items count as similar when the distance between them is less than some similarity threshold.
Then pass over the resulting sorted list again and collect the near duplicates.
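For illustration, a minimal sketch of the exact-duplicate case, assuming the texts are already loaded into a Python list of strings (it groups by MD5 fingerprint; grouping by length works the same way if the no-collision guarantee above really holds):

import hashlib
from collections import defaultdict

def find_exact_duplicates(texts):
    # Group text indices by MD5 fingerprint; groups of size > 1 are duplicates.
    groups = defaultdict(list)
    for i, text in enumerate(texts):
        fingerprint = hashlib.md5(text.encode("utf-8")).hexdigest()
        groups[fingerprint].append(i)
    return [ids for ids in groups.values() if len(ids) > 1]

texts = ["aaa" * 1000, "bbb" * 1000, "aaa" * 1000]
print(find_exact_duplicates(texts))  # [[0, 2]]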

How to detect homophones

I am fairly new to speech processing and am wondering how homophones are detected. I am looking for an API that gives the similarity between two words based on how they are pronounced.
For example: "to" and "two" are highly similar in terms of how they sound, compared with, say, "to" and "from".
You might want to try calculating the edit distance not on the original strings but on their pronunciations, such as those available in the CMU Pronouncing Dictionary at http://www.speech.cs.cmu.edu/cgi-bin/cmudict
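A small sketch of that idea, assuming NLTK is installed and its cmudict corpus has been downloaded (nltk.download("cmudict")); it compares phoneme sequences rather than spellings:

import difflib
from nltk.corpus import cmudict  # requires the cmudict corpus data

pron = cmudict.dict()  # maps a word to its list of phoneme sequences

def phonetic_similarity(w1, w2):
    # Similarity in [0, 1] between the first listed pronunciations of two words.
    p1, p2 = pron[w1.lower()][0], pron[w2.lower()][0]
    return difflib.SequenceMatcher(None, p1, p2).ratio()

print(phonetic_similarity("to", "two"))   # close to 1.0
print(phonetic_similarity("to", "from"))  # much lower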
Soundex and Metaphone are used for indexing words by their English pronunciation. You can use Python packages like Fuzzy, which implements several such indexing algorithms.
import fuzzy
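Building on that import, a rough sketch of how the package's phonetic encoders are typically used (assuming the Soundex/NYSIIS/DMetaphone interface described in Fuzzy's README):

soundex = fuzzy.Soundex(4)   # 4-character Soundex codes
dmeta = fuzzy.DMetaphone()   # Double Metaphone; may return up to two codes

for word in ("to", "two", "from"):
    print(word, soundex(word), fuzzy.nysiis(word), dmeta(word))

# "to" and "two" should map to the same Soundex code (T000), while "from" does not.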

Comparing strings for their similarities?

I want to count the number of times a certain college course occurs in a list of thousands of entries. The problem is that the course is not always spelled the same way. For example, Computer Engineering can be spelled Computers Engineering. What is a proper, elegant way to test whether two strings are very similar?
I would try to canonicalize the strings using stemming. The idea is to give each string a canonical form; two different strings that represent the same word are very likely to have the same canonical form (for example, Computer and Computers will get the same canonical form, and you will get a match).
The Porter stemming algorithm is often used for canonicalization.
An alternative is grading the strings by the distance between them; the suggested Levenshtein distance can help you with that, but personally I'd prefer canonicalization.
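A minimal sketch of the canonicalization idea, assuming NLTK's Porter stemmer; each course name is stemmed token by token and the canonical forms are compared directly:

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def canonical(name):
    # Lowercase, stem each token, and return the canonical form as a tuple.
    return tuple(stemmer.stem(token) for token in name.lower().split())

print(canonical("Computer Engineering") == canonical("Computers Engineering"))  # True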

How do I find strings and string patterns in a set of many files?

I have a collection of about two million text files, that total about 10GB uncompressed. I would like to find documents containing phrases in this collection, which look like "every time" or "bill clinton" (simple case-insensitive string matching). I would also like to find phrases with fuzzy contents; e.g. "for * weeks".
I've tried indexing with Lucene but it is no good at finding phrases containing stopwords, as these are removed at index time by default. xargs and grep are a slow solution. What's fast and appropriate for this amount of data?
You may want to check out the ugrep utility for fuzzy search, which is much faster than agrep:
ugrep -i -Z PATTERN ...
This runs multiple threads (typically 8 or more) to search files concurrently. Option -i is for case-insensitive search and -Z specifies fuzzy search. You can increase the fuzziness from 1 to 3 with -Z3 to allow up to 3 errors (max edit distance 3), or allow only up to 3 insertions (extra characters) with -Z+3, for example. Unicode regex matching is supported by default; for example, fur fuzzy-matches für (i.e. one substitution).
You could use a PostgreSQL database. It has a full-text search implementation, and by using dictionaries you can define your own stop words. I don't know if it helps much, but I would give it a try.

How do you implement a "Did you mean"? [duplicate]

Possible Duplicate:
How does the Google “Did you mean?” Algorithm work?
Suppose you have a search system already in your website. How can you implement the "Did you mean:<spell_checked_word>" like Google does in some search queries?
Actually, what Google does is very much non-trivial and also, at first, counter-intuitive. They don't do anything like checking against a dictionary; rather, they make use of statistics to identify "similar" queries that returned more results than your query. The exact algorithm is, of course, not known.
There are different sub-problems to solve here. As a fundamental basis for all the statistical Natural Language Processing involved, there is one must-have book: Foundations of Statistical Natural Language Processing.
Concretely, to solve the problem of word/query similarity I have had good results using edit distance, a mathematical measure of string similarity that works surprisingly well. I used to use Levenshtein, but the others may be worth looking into.
Soundex - in my experience - is crap.
Efficiently storing and searching a large dictionary of misspelled words with sub-second retrieval is again non-trivial. Your best bet is to use an existing full-text indexing and retrieval engine (i.e. not your database's), of which Lucene is currently one of the best and, coincidentally, has been ported to many platforms.
Google's Dr Norvig has outlined how it works; he even gives a 20ish line Python implementation:
http://googlesystem.blogspot.com/2007/04/simplified-version-of-googles-spell.html
http://www.norvig.com/spell-correct.html
Dr Norvig also discusses "did you mean" in this excellent talk. Dr Norvig is head of research at Google, so when asked how "did you mean" is implemented, his answer is authoritative.
So it's spell-checking, presumably with a dynamic dictionary built from other searches or even actual internet phrases and such. But that's still spell checking.
SOUNDEX and other guesses don't get a look in, people!
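For reference, a condensed sketch in the spirit of Norvig's corrector (not his exact code); it assumes a plain-text corpus file, here called big.txt, from which word frequencies are counted:

import re
from collections import Counter

# big.txt is an assumed corpus file; its word frequencies stand in for
# the query/phrase statistics discussed above.
WORDS = Counter(re.findall(r"[a-z]+", open("big.txt").read().lower()))

def edits1(word):
    # All strings one edit (delete, transpose, replace, insert) away from word.
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correction(word):
    # Pick the known candidate with the highest corpus frequency.
    candidates = ({word} & WORDS.keys()) or (edits1(word) & WORDS.keys()) or {word}
    return max(candidates, key=lambda w: WORDS[w])

print(correction("speling"))  # "spelling", if the corpus contains it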
Check this article on Wikipedia about the Levenshtein distance. Make sure you take a good look at the Possible improvements section.
I was pleasantly surprised that someone has asked how to create a state-of-the-art spelling suggestion system for search engines. I have been working on this subject for more than a year for a search engine company and I can point to information on the public domain on the subject.
As was mentioned in a previous post, Google (and Microsoft and Yahoo!) do not use any predefined dictionary nor do they employ hordes of linguists that ponder over the possible misspellings of queries. That would be impossible due to the scale of the problem but also because it is not clear that people could actually correctly identify when and if a query is misspelled.
Instead, there is a simple and rather effective principle that is also valid for all European languages: take all the unique queries in your search logs, calculate the edit distance between all pairs of queries, and assume that the reference query is the one with the highest count.
This simple algorithm will work great for many types of queries. If you want to take it to the next level, then I suggest you read the paper by Microsoft Research on that subject. You can find it here
The paper has a great introduction, but after that you will need to be familiar with concepts such as the Hidden Markov Model.
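A toy sketch of the query-log principle described above (the function names and counts are mine, and real systems do far more): given query counts collected from a log, a query is "corrected" to the most frequent logged query within a small edit distance:

def edit_distance(a, b):
    # Plain dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def did_you_mean(query, query_counts, max_dist=2):
    # Return the highest-count logged query within max_dist edits,
    # but only if it is more popular than the query itself.
    best = max(
        (q for q in query_counts if edit_distance(query, q) <= max_dist),
        key=lambda q: query_counts[q],
        default=query,
    )
    return best if query_counts.get(best, 0) > query_counts.get(query, 0) else None

query_counts = {"britney spears": 900, "brittany spears": 40, "brittney spears": 30}
print(did_you_mean("brittany spears", query_counts))  # britney spears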
I would suggest looking at SOUNDEX to find similar words in your database.
You can also access Google's own dictionary by using the Google API spelling suggestion request.
You may want to look at Peter Norvig's "How to Write a Spelling Corrector" article.
I believe Google logs all queries and identifies when someone makes a spelling correction. This correction may then be suggested when others supply the same first query. This will work for any language, in fact any string of any characters.
http://en.wikipedia.org/wiki/N-gram#Google_use_of_N-gram
I think this depends on how big your website is. On our local intranet, which is used by about 500 members of staff, I simply look at the search phrases that returned zero results and enter each such phrase, along with the new suggested search phrase, into a SQL table.
I then consult that table if no search results have been returned. However, this only works if the site is relatively small, and I only do it for the most common search phrases.
You might also want to look at my answer to a similar question:
"Similar Posts" like functionality using MS SQL Server?
If you have industry-specific terminology, you will likely need a thesaurus. For example, I worked in the jewelry industry and there were abbreviations in our descriptions such as kt - karat, rd - round, cwt - carat weight... Endeca (the search engine at that job) has a thesaurus that will translate from common misspellings, but it does require manual intervention.
I do it with Lucene's Spell Checker.
Soundex is good for phonetic matches, but works best with people's names (it was originally developed for census data).
Also check out Full-Text-Indexing, the syntax is different from Google logic, but it's very quick and can deal with similar language elements.
Soundex and "Porter stemming" (soundex is trivial, not sure about porter stemming).
There's something called aspell that might help:
http://blog.evanweaver.com/files/doc/fauna/raspell/classes/Aspell.html
There's a Ruby gem for it, but I don't know how to talk to it from Python.
http://blog.evanweaver.com/files/doc/fauna/raspell/files/README.html
Here's a quote from the Ruby implementation:
Usage
Aspell lets you check words and suggest corrections. For example:
string = "my haert wil go on"
string.gsub(/[\w\']+/) do |word|
if !speller.check(word)
# word is wrong
puts "Possible correction for #{word}:"
puts speller.suggest(word).first
end
end
This outputs:
Possible correction for haert:
heart
Possible correction for wil:
Will
Implementing spelling correction for search engines in an effective way is not trivial (you can't just compute the edit/levenshtein distance to every possible word). A solution based on k-gram indexes is described in Introduction to Information Retrieval (full text available online).
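As a toy illustration of the k-gram idea from that chapter (the function names and the "$" padding convention are my own): the dictionary is indexed by character bigrams, so candidate corrections can be fetched via overlapping bigrams instead of scanning the whole vocabulary:

from collections import defaultdict

def kgrams(word, k=2):
    padded = "$" + word + "$"  # boundary markers around the word
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}

def build_index(vocabulary, k=2):
    # Map each k-gram to the set of vocabulary words containing it.
    index = defaultdict(set)
    for word in vocabulary:
        for gram in kgrams(word, k):
            index[gram].add(word)
    return index

def candidates(query, index, k=2, min_overlap=3):
    # Words sharing at least min_overlap k-grams with the (possibly misspelled) query.
    counts = defaultdict(int)
    for gram in kgrams(query, k):
        for word in index[gram]:
            counts[word] += 1
    return sorted(w for w, c in counts.items() if c >= min_overlap)

index = build_index(["border", "boarder", "lord", "morbid"])
print(candidates("bord", index))  # ['boarder', 'border', 'lord']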
You could use n-grams for the comparison: http://en.wikipedia.org/wiki/N-gram
Using the Python ngram module: http://packages.python.org/ngram/index.html
import ngram

G2 = ngram.NGram(["iis7 configure ftp 7.5",
                  "ubunto configre 8.5",
                  "mac configure ftp"])

print "Similarity", "\t", "String"
for i in G2.search("iis7 configurftp 7.5", threshold=0.1):
    print i[1], "\t", i[0]
You get:
Similarity    String
0.76 "iis7 configure ftp 7.5"
0.24 "mac configure ftp"
0.19 "ubunto configre 8.5"
Why not use Google's "did you mean" in your code? For how, see here:
http://narenonit.blogspot.com/2012/08/trick-for-using-googles-did-you-mean.html
