String-to-string correction algorithm

I'm not sure I've titled this post correctly, but I'm wondering if there's a name for this type of algorithm:
What I'm trying to accomplish is to create a minimal set of instructions to go from one string to its permutation, so for example:
STACKOVERFLOW -> STAKCOVERFLOW
would require a minimum of one operation, which is to
shift K before C.
Are there any good online examples of:
1. finding the minimum instruction set (I believe this is also often called the edit distance), and
2. listing the instructions themselves?
Thanks!

There is something known as the Levenshtein distance, which tells you how many changes are needed to go from one string to another. There are many C# implementations, and implementations in many other languages too.
Here's the wiki:
http://en.wikipedia.org/wiki/Levenshtein_distance
Edit: As TheHorse has indicated, the Levenshtein distance doesn't treat a shift (a transposition of adjacent characters) as a single change, but there is an improved algorithm that does: the Damerau-Levenshtein distance.
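For illustration, here is a minimal Python sketch of the restricted Damerau-Levenshtein distance (the "optimal string alignment" variant), which counts an adjacent transposition like the K/C swap above as a single edit; the function and variable names are my own:

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance:
    insertions, deletions, substitutions and adjacent transpositions each
    count as one edit."""
    m, n = len(a), len(b)
    # d[i][j] = edit distance between a[:i] and b[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

print(osa_distance("STACKOVERFLOW", "STAKCOVERFLOW"))  # 1: one transposition
```

To list the instruction set itself (the second part of the question), you would backtrack through the matrix from d[m][n], recording which of the four moves produced each cell's minimum.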

Related

Hypothesis search tree

I have an object with many fields, and each field has a different range of values. I want to use Hypothesis to generate different instances of this object.
Is there a limit to the number of combinations of field values Hypothesis can handle? And what does the search tree Hypothesis creates look like? I don't need all the combinations, but I want to make sure that I get a fair number of combinations that test many different values for each field. I want to make sure Hypothesis is not doing a DFS until it hits the maximum number of examples to generate.
TLDR: don't worry, this is a common use-case and even a naive strategy works very well.
The actual search process used by Hypothesis is complicated (as in, "lead author's PhD topic"), but it's definitely not a depth-first search! Briefly, it's a uniform distribution layered on a pseudo-random number generator, with a coverage-guided fuzzer biasing that towards less-explored code paths, and strategy-specific heuristics on top of that.
In general, I trust this process to pick good examples far more than I trust my own judgement, or that of anyone without years of experience in QA or testing research!
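As a rough illustration (the Order class and its field ranges here are invented), a common way to express "many fields, each with its own range" is st.builds, which draws every field from its own strategy and leaves the exploration order to the engine described above:

```python
from dataclasses import dataclass
from hypothesis import given, strategies as st

@dataclass
class Order:          # stand-in for your many-field object
    quantity: int
    price: float
    region: str

orders = st.builds(
    Order,
    quantity=st.integers(min_value=1, max_value=1_000),
    price=st.floats(min_value=0.01, max_value=10_000, allow_nan=False),
    region=st.sampled_from(["NA", "EU", "APAC"]),
)

@given(orders)
def test_order_invariants(order):
    # Replace with a real property; Hypothesis picks the field
    # combinations, biased toward diverse and bug-prone values.
    assert 1 <= order.quantity <= 1_000
```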

Close-Enough TSP implementation

I'm looking for a solution to a Close-Enough Traveling Salesman Problem (CETSP), where I have a set of nodes and the tour only needs to pass within a certain distance of each of them. I've found a couple of sources describing approaches to this TSP variant, but I was unable to find a solver or an algorithm that I could easily use.
Do you have any suggestions for how I can go about getting a solution to my CETSP problem, whether by implementing an algorithm myself or by using an existing solver?
You can try using UFFLP. They have an example where you can find the correct coordinates the salesman is supposed to pass through, given a predetermined sequence. So you can generate thousands of sequences and choose the best one (just a simple heuristic).
Have a look at http://www.gapso.com.br/en/ufflp-en/
You will find useful information.
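To make the suggested heuristic concrete, here is a hedged Python sketch (not UFFLP itself; the instance data, radius, and greedy touch-point rule are all illustrative assumptions): sample random visiting sequences, approximate each sequence's cost by pulling every visit point toward the previous stop while staying within the allowed radius, and keep the best.

```python
import math, random

def clip_to_disk(point, center, radius):
    """Nearest location to `point` within `radius` of `center`."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    dist = math.hypot(dx, dy)
    if dist <= radius:
        return point
    return (center[0] + dx * radius / dist, center[1] + dy * radius / dist)

def tour_length(sequence, nodes, radius, depot=(0.0, 0.0)):
    """Approximate tour cost for one visiting order. Aiming each touch
    point at the previous stop is a crude stand-in for the exact
    touch-point optimization (a second-order cone program)."""
    pos, total = depot, 0.0
    for idx in sequence:
        touch = clip_to_disk(pos, nodes[idx], radius)
        total += math.hypot(touch[0] - pos[0], touch[1] - pos[1])
        pos = touch
    return total + math.hypot(depot[0] - pos[0], depot[1] - pos[1])

# Made-up instance: 8 nodes, each satisfied by passing within radius 1.0.
random.seed(42)
nodes = [(random.uniform(0, 20), random.uniform(0, 20)) for _ in range(8)]
best = min((random.sample(range(8), 8) for _ in range(10_000)),
           key=lambda seq: tour_length(seq, nodes, radius=1.0))
print(best, tour_length(best, nodes, radius=1.0))
```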

Is there a hashing function that can be used for finding similar (not necessarily equal) strings?

What I need is a hashing function that operates on fixed data sizes, obviously for non-security purposes. It needs to map similar strings to similar or equal hashes; in other words, small changes in a string should cause no change, or only a really small change, in its hash.
For example: "my name is John" and "my name is Jon" should have the same or really similar hashes.
"my name is John" and "your name is Liam" should result in somewhat similar hashes.
"my name is John" and "I live in USA" should give totally different hashes.
And so on!
Is there a hashing function for similar purposes?
There is no reliable way of achieving this. This is due to the pigeonhole principle: there are far fewer ways for two short hashes to be "close" than for two long strings to be.
However, there is the concept of fuzzy hashing, which might get you part of the way there.
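One well-known fuzzy-hashing idea is SimHash: each token votes on the bits of a fixed-size fingerprint, so texts that share most tokens end up with fingerprints differing in few bits. A minimal sketch (token-level; MD5 is used here only as a convenient bit source):

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Similar token sets tend to produce fingerprints with a small
    Hamming distance. No guarantees -- see the pigeonhole caveat above."""
    counts = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:bits // 8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

print(hamming(simhash("my name is John"), simhash("my name is Jon")))  # usually small
print(hamming(simhash("my name is John"), simhash("I live in USA")))   # usually large
```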
It sounds like you're looking for Levenshtein distance (see http://en.wikipedia.org/wiki/Levenshtein_distance).
There are plenty of implementations of this in various languages.
I think in this case the Jaccard index may be helpful. The Jaccard index is a simple measure of how similar two sets are: it's simply the ratio of the size of the intersection of the sets to the size of the union of the sets.
There is a blog post discussing the Jaccard similarity index for measuring document similarity, which I found closer to your needs.
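For reference, the Jaccard index over word sets is nearly a one-liner, and on the asker's own examples it behaves roughly as requested:

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard index over word sets: |intersection| / |union|."""
    s, t = set(a.lower().split()), set(b.lower().split())
    return len(s & t) / len(s | t) if s | t else 1.0

print(jaccard("my name is John", "my name is Jon"))  # 0.6
print(jaccard("my name is John", "I live in USA"))   # 0.0
```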

automatic keyword generation evaluation

I have a simple text analyzer which generates keywords for a given input text. Until now I have been evaluating it manually, i.e., selecting keywords of a text by hand and comparing them against the ones generated by the analyzer.
Is there any way in which I can automate this? I have googled a lot for free keyword generators which could help with this evaluation but have not found any so far. I would appreciate any suggestions on how to go about this.
Testing keyword generation is a difficult problem. In the past, I have used the following method to evaluate it.
1. Identify the popular association-rule generation measures, like confidence, Jaccard, lift, chi-squared, mutual information, etc. There are many papers that compare such measures.
2. Implement these measures; this is fairly simple, as they all involve some simple algebraic expression using one or more of term frequencies, document frequencies, and co-occurrence frequencies (see the sketch below).
3. Generate related keywords using all of these measures and compute their union. Call this set TOTAL.
4. Compute the intersection of the keywords generated by your algorithm with the TOTAL set. When viewed as a fraction (intersection/TOTAL), it is a rough indicator of how powerful your measure is.
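A hedged sketch of these steps (the input format, a plain list of tokenized documents, and the helper names are my own; the measures are the standard document-frequency formulas):

```python
from collections import Counter

def association_scores(docs, seed, top_k=10):
    """Build the TOTAL set: union of the top-k candidates under
    confidence, lift and Jaccard, all computed from document
    frequencies and co-occurrence with `seed`."""
    n = len(docs)
    df = Counter()   # document frequency of each term
    co = Counter()   # co-occurrence (same document) with the seed
    for tokens in docs:
        terms = set(tokens)
        df.update(terms)
        if seed in terms:
            co.update(terms - {seed})
    measures = (
        lambda t: co[t] / df[seed],                    # confidence
        lambda t: n * co[t] / (df[seed] * df[t]),      # lift
        lambda t: co[t] / (df[seed] + df[t] - co[t]),  # Jaccard
    )
    total = set()
    for measure in measures:
        total.update(sorted(co, key=measure, reverse=True)[:top_k])
    return total

def coverage(generated, total):
    """Step 4: fraction of the TOTAL reference set your generator finds."""
    return len(set(generated) & total) / len(total) if total else 0.0

docs = [t.split() for t in ["machine learning model",
                            "machine learning data",
                            "deep learning model"]]
print(coverage(["machine", "model"], association_scores(docs, "learning")))  # 0.5
```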
I found an automatic keyword generation evaluation tool, Text Mechanic's Keyword Suggestion Generator, which might help.
It says:
The Text Mechanic's "Keyword Suggestion Generator" will retrieve Google.com auto suggest results* for your entered seed text in an easy to investigate format. Seed text can be a letter, number, word, or phrase related to what you (and others) are querying to find in Google search results.
I believe it can be automated.

String Matching Algorithms

I have a Python app with a database of businesses and I want to be able to search for businesses by name (for autocomplete purposes).
For example, consider the names "best buy", "mcdonalds", "sony" and "apple".
I would like the queries "app", "appel", and "ple" to all return "apple".
"Mc'donalds" should return "mcdonalds".
"bst b" and "best-buy" should both return "best buy".
Which algorithm am I looking for, and does it have a Python implementation?
Thanks!
The Levenshtein distance should do.
Look around - there are implementations in many languages.
Levenshtein distance will do this.
Note: this is a distance, so you have to calculate it against every string in your database, which can be a big problem if you have a lot of entries.
If that becomes a problem, record all the typos users make (typo = no direct match) and build a correction database offline that contains all the typo->fix mappings. Some companies do this even more cleverly; e.g., Google watches how users correct their own typos and learns the mappings from that.
Soundex or Metaphone might work.
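For example, here is a simplified Soundex sketch (the full algorithm has extra rules around 'h' and 'w' that are omitted here): names that sound alike map to the same four-character code, which handles variant spellings rather than typos.

```python
def soundex(word: str) -> str:
    """Simplified Soundex: equal codes suggest similar pronunciation."""
    codes = {c: d for d, letters in
             enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1)
             for c in letters}
    word = word.lower()
    out, prev = word[0].upper(), codes.get(word[0])
    for c in word[1:]:
        code = codes.get(c)
        if code and code != prev:
            out += str(code)
        prev = code
    return (out + "000")[:4]

print(soundex("mcdonalds"), soundex("Mc'donalds".replace("'", "")))  # M235 M235
```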
I think what you are looking for falls under the huge field of data quality and data cleansing. I doubt you will find a ready-made Python implementation for this, as it would have to cleanse a considerable amount of data in the database, which could be of business value.
Levenshtein distance goes in the right direction, but only half the way. There are several tricks to get it to use partial matches as well.
One would be to use subsequence dynamic time warping (DTW is actually a generalization of Levenshtein distance). For this you relax the start and end cases when calculating the cost matrix. If you relax only one of the conditions, you get autocompletion with spell checking. I am not sure if there is a Python implementation available, but if you want to implement it yourself it should be no more than 10-20 LOC.
The other idea would be to use a trie for speed-up, which can run DTW/Levenshtein on multiple entries simultaneously (a huge speedup if your database is large). There is a paper on Levenshtein on tries at IEEE, so you can find the algorithm there. Again, for this you would need to relax the final boundary condition so that you get partial matches. However, since you step down in the trie, you just need to check when you have fully consumed the input and then return all leaves; a sketch of the relaxation follows below.
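Here is what the relaxed end condition looks like in plain Python, using the names from the question (a sketch of the 10-20 LOC idea, not the trie-accelerated version): standard Levenshtein DP, but the result is the minimum over the final row, so the query is scored against the best-matching prefix of each candidate.

```python
def prefix_edit_distance(query: str, candidate: str) -> int:
    """Levenshtein with a relaxed end boundary: match `query` against
    the best prefix of `candidate` (autocomplete with spell checking)."""
    n = len(candidate)
    prev = list(range(n + 1))
    for i, qc in enumerate(query, start=1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if qc == candidate[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        prev = cur
    return min(prev)  # relaxed end condition: any prefix may win

names = ["best buy", "mcdonalds", "sony", "apple"]
print(min(names, key=lambda s: prefix_edit_distance("bst b", s)))  # best buy
```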
Check out difflib: http://docs.python.org/library/difflib.html
It should help you.
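In particular, difflib.get_close_matches ranks candidates by SequenceMatcher similarity, which already covers the simpler cases from the question:

```python
import difflib

names = ["best buy", "mcdonalds", "sony", "apple"]

print(difflib.get_close_matches("appel", names, n=3, cutoff=0.6))
# ['apple']
print(difflib.get_close_matches("mcdonald", names, n=3, cutoff=0.6))
# ['mcdonalds']
```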
