mapping strings - string

I want to map some strings(word) with number. the similar the string, the nearer their value(mapped number) . also, while checking the positional combination of the letters should impact the mapping.the mapping function should be function of letters, positions (combination given position of letter thepriority such as pit and tip should be different), number of letters.
Well, I would give some examples : starter, stater , stapler, startler, tstarter are some words. These words are of format "(*optinal)sta(*opt)*er" where * denotes some sort of variable in our case it is either 't' or 'l' (i.e. in case of starter and staler). these all should be mapped INDIVIDUALLY, without context to other such that their value are not of much difference. and later on which creating groups I can put appropriate range of numbers for differentiating groups.
So while mapping the string their values should be similar. there are many words, so comparing each other would be complex. so mapping with some numeric value for each word independently and putting the similar string (as they have similar value) in a group and then later find these pattern by other means.
So, for now I need to look up for some existing methods of mapping such that similar strings (I guess I have clarify the term 'similar' for my context) have similar value and these value should be different to the dissimilar ones. please, again I emphasize that the number of string would be huge and comparing each with other is practically impossible(or computationally expensive and much slow).SO WHAT I THINK IS TO DEVISE AN ALGORITHM(taking help from existing ones) FOR MAPPING WORD(STRING) ON ITS OWN
Have I made you clear? Please give me some idea to start with. some terms to search and research.
I think I need some type of "bad" hash function to hash strings and then put them in bucket according to that hash value. at least some idea or algorithm names.

Seems like it would best to use a known algorithm like Levenshtein Distance

This search on StackOverflow
reveals this question about finding-groups-of-similar-strings-in-a-large-set-of-strings, which links to this article describing a SimHash which sounds exactly like what you want.

Related

Algorithm to un-concatenate words from string without spaces and punctuation

I've been given a problem in my data structures class to find the solution to this problem. It's similar to an interview question. If someone could explain the thinking process or solution to the problem. Pseudocode can be used. So far i've been thinking to use tries to hold the dictionary and look up words that way for efficiency.
This is the problem:
Oh, no! You have just completed a lengthy document when you have an unfortunate Find/Replace mishap. You have accidentally removed all spaces, punctuation, and capitalization in the document. A sentence like "I reset the computer. It still didn't boot!" would become "iresetthecomputeritstilldidntboot". You figure that you can add back in the punctation and capitalization later, once you get the individual words properly separated. Most of the words will be in a dictionary, but some strings, like proper names, will not.
Given a dictionary (a list of words), design an algorithm to find the optimal way of "unconcatenating" a sequence of words. In this case, "optimal" is defined to be the parsing which minimizes the number of unrecognized sequences of characters.
For example, the string "jesslookedjustliketimherbrother" would be optimally parsed as "JESS looked just like TIM her brother". This parsing has seven unrecognized characters, which we have capitalized for clarity.
For each index, n, into the string, compute the cost C(n) of the optimal solution (ie: the number of unrecognised characters in the optimal parsing) starting at that index.
Then, the solution to your problem is C(0).
There's a recurrence relation for C. At each n, either you match a word of i characters, or you skip over character n, incurring a cost of 1, and then parse the rest optimally. You just need to find which of those choices incurs the lowest cost.
Let N be the length of the string, and let W(n) be a set containing the lengths of all words starting at index n in your string. Then:
C(N) = 0
C(n) = min({C(n+1) + 1} union {C(n+i) for i in W(n)})
This can be implemented using dynamic programming by constructing a table of C(n) starting from the end backwards.
If the length of the longest word in your dictionary is L, then the algorithm runs in O(NL) time in the worst case and can be implemented to use O(L) memory if you're careful.
You could use rolling hashes of different lengths to speed up the search.
You can try a partial pattern matcher for example aho-corasick algorithm. Basically it's a special space optimized version of a suffix tree.

Is there an alternative to SOUNDEX or metaphone that handles names with fewer collisions?

I'm trying to find near duplicates in a large list of names by computing the metaphone key for each string, and then, within each set of possible duplicates, use something like Levenshtein distance to get a more refined estimate of duplicate likelihood.1
However, I'm finding that metaphone is heavily determined by the first characters in the strings, and so if I feed it a long list of people's names, I get huge buckets where everyone's name is "Jennifer X" or "Richard Y", but otherwise haven't got much in common.
If I reverse the string before generating the key, the results are more sensible, in that they group by last name, but still I find that the first names aren't particularly similar.
So is there a similar algorithm that samples more of the input string to produce a sound key, perhaps by using a longer key string?
[1] Ideally, I'd compute the string distances directly, but if my list has 10,000 names, that would mean 100,000,000 computations, which is why I'm trying to divide and conquer by sound keying each name first and only checking for similarities within the buckets. But if there's a better way, I'd love to hear about it!
Try eudex.
It's described as "A blazingly fast phonetic reduction/hashing algorithm."
There are many easy ways to use it, as it encodes a word into a 64-bit integer with most discriminating features towards the MSB. The hamming difference between hashes is also useful as a difference metric between words and spellings.

Algorithm for string processing

I am looking for a algorithm for string processing, I have searched for it but couldn't find a algorithm that meets my requirements. I will explain what the algorithm should do with an example.
There are two sets of word sets defined as shown below:
**Main_Words**: swimming, driving, playing
**Words_in_front**: I am, I enjoy, I love, I am going to go
The program will search through a huge set of words as soon it finds a word that is defined in Main_Words it will check the words in front of that Word to see if it has any matching words defined in Words_in_front.
i.e If the program encounters the word "Swimming" it has to check if the words in front of the word "Swimming" are one of these: I am, I enjoy, I love, I am going to go.
Are there any algorithms that can do this?
A straightforward way to do this would be to just do a linear scan through the text, always keeping track of the last N+1 words (or characters) you see, where N is the number of words (or characters) in the longest phrase contained in your words_in_front collection. When you have a "main word", you can just check whether the sequence of N words/characters before it ends with any of the prefixes you have.
This would be a bit faster if you transformed your words_in_front set into a nicer data structure, such as a hashmap (perhaps keyed by last letter in the phrase..) or a prefix/suffix tree of some sort, so you wouldn't have to do an .endsWith over every single member of the set of prefixes each time you have a matching "main word." As was stated in another answer, there is much room for optimization and a few other possible implementations, but there's a start.
Create a map/dictionary/hash/associative array (whatever is defined in your language) with key in Main_Words and Words_in_front are the linked list attached to the entry pointed by the key. Whenever you encounter a word matching a key, go to the table and see if in the attached list there are words that match what you have in front.
That's the basic idea, it can be optimized for both speed and space.
You should be able to build a regular expression along these lines:
I (am|enjoy|love|am going to go) (swimming|driving|playing)

Is there a formal definition of character difference across a string and if so how is it calculated?

Overview
I'm looking to analyse the difference between two characters as part of a password strength checking process.
I'll explain what I'm trying to achieve and why and would like to know if what I'm looking to do is formally defined and whether there are any recommended algorithms for achieving this.
What I'm looking to do
Across a whole string, I'm looking to compare the current character with the previous character and determine how different they are.
As this relates to password strength checking, the difference between one character and it's predecessor in a string might be defined as being how predictable character N is from knowing character N - 1. There might be a formal definition for this of which I'm not aware.
Example
A password of abc123 could be arguably less secure than azu590. Both contain three letters followed by three numbers, however in the case of the former the sequence is more predictable.
I'm assuming that a password guesser might try some obvious sequences such that abc123 would be tried much before azu590.
Considering the decimal ASCII values for the characters in these strings, and given that b is 1 different from a and c is 1 different again from b, we could derive a simplistic difference calculation.
Ignoring cases where two consecutive characters are not in the same character class, we could say that abc123 has an overall character to character difference of 4 whereas azu590 has a similar difference of 25 + 5 + 4 + 9 = 43.
Does this exist?
This notion of character to character difference across a string might be defined, similar to the Levenshtein distance between two strings. I don't know if this concept is defined or what it might be called. Is it defined and if so what is it called?
My example approach to calculating the character to character difference across a string is a simple and obvious approach. It may be flawed, it may be ineffective. Are there any known algorithms for calculating this character to character difference effectively?
It sounds like you want a Markov Chain model for passwords. A Markov Chain has a number of states and a probability of transitioning between the states. In your case the states are the characters in the allowed character set and the probability of a transition is proportional to the frequency that those two letters appear consecutively. You can construct the Markov Chain by looking at the frequency of the transitions in an existing text, for example a freely available word list or password database.
It is also possible to use variations on this technique (Markov chain of order m) where you for example consider the previous two characters instead of just one.
Once you have created the model you can use the probability of generating the password from the model as a measure of its strength. This is the product of the probabilities of each state transition.
For general signals/time-series data, this is known as Autocorrelation.
You could try adapting the Durbin–Watson statistic and test for positive auto-correlation between the characters. A naïve way may be to use the unicode code-points of each character, but I'm sure that will not be good enough.

Getting fuzzy string matches from database very fast

I have a database of ~150'000 words and a pattern (any single word) and I want to get all words from the database which has Damerau-Levenshtein distance between it and the pattern less than given number. I need to do it extremely fast. What algorithm could you suggest? If there's no good algorithm for Damerau-Levenshtein distance, just Levenshtin distance will be welcome as well.
Thank you for your help.
P.S. I'm not going to use SOUNDEX.
I would start with a SQL function to calculate the Levenshtein distance (in T-SQl or .Net) (yes, I'm a MS person...) with a maximum distance parameter that would cause an early exit.
This function could then be used to compare your input with each string to check the distanve and move on to the next if it breaks the threshold.
I was also thinking you could, for example, set the maximum distance to be 2, then filter all words where the length is more than 1 different whilst the first letter is different. With an index this may be slightly quicker.
You could also shortcut to bring back all strings that are perfect matches (indexing will speed this up) as these will actually take longer to calculate the Levenshtein distance of 0.
Just some thoughts....
I do not think you can calculate this kind of function without actually enumerating all rows.
So the solutions are:
Make it a very fast enumeration (but this doesn't really scale)
Filter initial variants somehow (index by a letter, at least x common letters)
Use alternative (indexable) algorithm, such as N-Grams (however I do not have details on result quality of ngrams versus D-L distance).
A solution off the top of my head might be to store the database in a sorted set (e.g., std::set in C++), as it seems to me that strings sorted lexicographically would compare well. To approximate the position of the given string in the set, use std::upper_bound on the string, then iterate over the set outward from the found position in both directions, computing the distance as you go, and stop when it falls below a certain threshold. I have a feeling that this solution would probably only match strings with the same start character, but if you're using the algorithm for spell-checking, then that restriction is common, or at least unsurprising.
Edit: If you're looking for an optimisation of the algorithm itself, however, this answer is irrelevant.
I have used KNIME for string fuzzy matching and has got very fast results. It is also very easy to make visual workflows in it. Just install KNIME free edition from https://www.knime.org/ then use "String Distance" and "Similarity Search" nodes to get your results. I have attached a small fuzzy matching smaple workflow in here (the input data come from top and the patterns to search for come from the bottom in this case):
I would recommend looking into Ankiro.
I'm not certain that it meets your requirements for precision, but it is fast.

Resources