Is this already a string similarity algorithm?

I'm unfamiliar with string similarity algorithms other than Levenshtein distance, which is what I'm using, and it has turned out to be less than ideal.
So I've got a rough idea of a recursive algorithm I'd like to implement, but I want to know if it already exists so I can leverage others' expertise.
Here's the algorithm by example:
string 1: "Paul Johnson"
string 2: "John Paulson"
Step 1: find all longest matches
Match 1: "Paul"
Match 2: "John"
Match 3: "son"
Match 4: " "
Step 2: Calculate a score for each match with this formula: (match.len / string.len) * match.len, i.e. match.len^2 / string.len. This weights longer matches more heavily, at a rate balanced against the length of the full string.
Match 1: (4/12)*4 = 1.333...
Match 2: 1.333...
Match 3: .75
Match 4: .083
Step 3: Do steps 1 and 2 on larger scales (matches of matches). I don't have this figured out exactly, but my thinking is that if "son" comes after "Paul John" in one string and after "John Paul" in the other, that should count for something.
Step 4: sum all the scores that have been calculated.
Scores: 1.333... + 1.333... + 0.75 + 0.083... = 3.4999... (plus whatever scores step 3 produces)
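Here is a rough sketch of steps 1, 2 and 4 in Python (a greedy variant that removes each match before searching again; the helper names are mine, and note that after removing "Paul" and "John" the leftover " " and "son" become adjacent, so this variant scores the example as roughly 4.0 rather than the 3.5 computed by hand):

def longest_common_substring(s1, s2):
    # Classic O(len(s1) * len(s2)) dynamic programming for the longest
    # common contiguous substring; returns the first best one found.
    best_len, best_end = 0, 0
    prev = [0] * (len(s2) + 1)
    for i in range(1, len(s1) + 1):
        cur = [0] * (len(s2) + 1)
        for j in range(1, len(s2) + 1):
            if s1[i - 1] == s2[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return s1[best_end - best_len:best_end]

def similarity(s1, s2, min_len=1):
    # Step 2's weight is computed against the first string's length
    # (both strings are 12 characters long in the example anyway).
    total_len = len(s1)
    score = 0.0
    while True:
        m = longest_common_substring(s1, s2)
        if len(m) < min_len:
            break
        score += (len(m) / total_len) * len(m)  # step 2 formula
        s1 = s1.replace(m, "", 1)               # remove the match and repeat
        s2 = s2.replace(m, "", 1)
    return score

print(similarity("Paul Johnson", "John Paulson"))  # ~4.0 (see note above)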
Does this look familiar to anyone? I hope someone else has gone to the trouble of actually making an algorithm along these lines so I don't have to figure it out myself.

What you describe somewhat resembles what the following paper calls Longest Common Substring (LCS). For a brief description and a comparison to other algorithms, see:
A Comparison of Personal Name Matching
This algorithm [11] repeatedly finds and removes the longest common sub-string in the two strings compared, up to a minimum length (normally set to 2 or 3).
...
A similarity measure can be calculated by dividing the total length of the common sub-strings by the minimum, maximum or average lengths of the two original strings (similar to Smith-Waterman).
...
this algorithm is suitable for compound names that have words (like given- and surname) swapped.
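Reading that description literally, the paper's similarity measure could be sketched like this in Python (reusing longest_common_substring from the sketch above; which of the three normalizations to use is your choice):

def lcs_similarity(s1, s2, min_len=2):
    # Total length of all removed common substrings, divided by the
    # average length of the two original strings (one of the three
    # normalizations the paper mentions).
    orig_avg = (len(s1) + len(s2)) / 2
    matched = 0
    while True:
        m = longest_common_substring(s1, s2)
        if len(m) < min_len:
            break
        matched += len(m)
        s1 = s1.replace(m, "", 1)
        s2 = s2.replace(m, "", 1)
    return matched / orig_avg

print(lcs_similarity("Paul Johnson", "John Paulson"))  # 1.0: every character matches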

Related

Efficient way to check if string A is contained in string B with at most k errors

Given a string A and a string B (A shorter or the same length as B), I would like to check whether B contains a substring A' such that the Hamming distance between A and A' is at most k.
Does anyone know of an efficient algorithm for this? Obviously I could just run a sliding window, but that is not feasible for the amount of data I'm working with. The Knuth-Morris-Pratt algorithm (https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm) would work for k=0, but I don't know whether it can be modified to account for k>0.
Thanks!
Edit: I apparently forgot to clarify: I am looking for a consecutive substring, so for example the substring from position 3 to position 7, without skipping characters. So Levenshtein distance is not applicable.
This is what you are looking for: https://en.wikipedia.org/wiki/Levenshtein_distance
If you use the Levenshtein distance and k=1, then you can use the fact that if the length of A is 2n+1 or 2n+2, then either the first or the last n characters of A must be in B.
So you can use strstr to find all places in B where the first or last n characters match exactly and then check the Levenshtein distance.
Special case A = 1 character: it matches everywhere with at most one error. Special case A = 2 characters ab: call strchr(a); if that fails, call strchr(b).
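A sketch of that idea in Python, checking the Hamming distance from the question's edit rather than Levenshtein (str.find stands in for strstr; it assumes len(A) >= 2, with shorter patterns handled by the special cases above):

def hamming(a, b):
    # Number of mismatching positions; assumes equal lengths.
    return sum(x != y for x, y in zip(a, b))

def find_with_one_mismatch(A, B):
    # Pigeonhole for k = 1: split A into two halves; an occurrence with
    # at most one mismatch must match at least one half exactly. Find
    # every exact hit of each half in B, then verify the full alignment.
    n = len(A) // 2
    hits = set()
    for half, offset in ((A[:n], 0), (A[n:], n)):
        start = B.find(half)
        while start != -1:
            pos = start - offset  # where A would start within B
            if 0 <= pos <= len(B) - len(A) and hamming(A, B[pos:pos + len(A)]) <= 1:
                hits.add(pos)
            start = B.find(half, start + 1)
    return sorted(hits)

print(find_with_one_mismatch("paul", "the pail and the pump"))  # [4] ("pail")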

Find if two strings are anagrams

Faced this question in an interview, which basically stated
Find if the given two strings are anagrams of each other in O(N) time without any extra space
I tried the usual solutions:
Using a character frequency count (O(N) time, O(26) space) (as a variation, iterating 26 times to calculate frequency of each character as well)
Sorting both strings and comparing (O(NlogN) time, constant space)
But the interviewer wanted a "better" approach. At the very end of the interview, he hinted at "XOR" for the question. I'm not sure how that works, since "aa" XOR "bb" would also be zero without the strings being anagrams.
Long story short, are the given constraints possible? If so, what would be the algorithm?
Given word_a and word_b of the same length, I would try the following:
1. Define a variable counter and initialise its value to 0.
2. For each letter c in the alphabet do the following:
2.1. for each index j in length(word_a):
2.1.1. if word_a[j] == c, increase the counter by 1: counter += 1
2.1.2. if word_b[j] == c, decrease the counter by 1: counter -= 1
2.2. if, after passing over all the characters of the words, counter is different from 0, you have a different number of c characters in each word and in particular they are not anagrams; break out of the loop and return False
3. Return True
Explanation
If the words are anagrams, each character occurs the same number of times in both words, so a histogram would settle it, but histograms require space. Using this method instead, you run over the n characters of the words exactly 26 times for the English alphabet, or c times for any other alphabet with a constant number c of letters. Therefore the runtime is O(c*n) = O(n), since c is a constant, and you use no space beyond the single counter variable.
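A direct Python transcription of this approach (a sketch, assuming lowercase English words):

import string

def are_anagrams(word_a, word_b):
    # O(1) extra space: a single counter, reused for each alphabet letter.
    if len(word_a) != len(word_b):
        return False
    for c in string.ascii_lowercase:
        counter = 0
        for j in range(len(word_a)):
            if word_a[j] == c:
                counter += 1
            if word_b[j] == c:
                counter -= 1
        if counter != 0:
            return False  # different counts of the letter c
    return True

print(are_anagrams("listen", "silent"))  # True
print(are_anagrams("aa", "bb"))          # False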
I haven't proven to myself that this is infallible yet, but it's a possible solution.
Go through both strings and calculate three values: the sum, the accumulated XOR, and the count. If all three are equal, then the strings should be anagrams.

Time complexity of my backtracking to find the optimal solution for the maximum sum of non-adjacent elements

I'm backtracking through the dynamic-programming memoization of the maximum sum of non-adjacent elements, in order to reconstruct the optimal solution that yields the max sum.
Background:
Say if input list is [1,2,3,4,5]
The memoization should be [1,2,4,6,9]
And my maximum sum is 9, right?
My solution:
I find the first occurrence of the max sum in the memo (as we may not have chosen the last item) [this is O(N)]
Then I find the previous item chosen by using this formula:
max_sum -= a_list[index]
As in this example, 9 - 5 = 4, and 4 is at index 2, so we can say that the previous item chosen is 3, which is also at index 2 in the input list.
I find the first occurrence of 4, which is at index 2 (I look for the first occurrence for the same reason as in step 1: we may not have chosen that item in cases where several equal sums appear together) [also O(N), but...]
The issue:
The third step of my solution is done in a while loop. Say the non-adjacency constraint is 1: when the length of the list is 5, the maximum number of times we have to backtrack is 3, approximately N//2 times.
But the 3rd step uses Python's index function on the memo to find the first occurrence of the previous sum [which is O(N)]: memo.index(that_previous_sum)
So the total time complexity is about O(N//2 * N),
which is O(N^2)!!!
Am I correct on the time complexity? Or am I wrong? Is there a more efficient way to backtrack the memoization list?
P.S. Sorry for the formatting if I did it wrong, thanks!
Solved:
I looped from the back, checking whether the element just before is the same or not.
If it's the same, it's not the first occurrence; if it differs, it is the first occurrence.
Ta-da! No Python index function searching from the front; we now find it from the back.
So the total time complexity, which used to be about O(N//2 * N), is now O(N//2 + 1), which is O(N).
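For what it's worth, the reconstruction can also be done in one O(N) pass with no value searching at all, assuming the usual recurrence memo[i] = max(memo[i-1], memo[i-2] + a[i]) and positive inputs (a sketch, not the poster's exact code):

def reconstruct(a, memo):
    # Walk the memo from the back: if memo[i] == memo[i-1], item i was
    # skipped; otherwise it was taken, and its neighbour must be skipped.
    chosen = []
    i = len(a) - 1
    while i >= 0:
        if i == 0 or memo[i] != memo[i - 1]:
            chosen.append(a[i])  # item i is part of the optimal sum
            i -= 2               # the adjacent item cannot be chosen
        else:
            i -= 1               # item i contributed nothing; step back
    return chosen[::-1]

print(reconstruct([1, 2, 3, 4, 5], [1, 2, 4, 6, 9]))  # [1, 3, 5], sum 9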

Find maximum-scoring characters with non-overlapping occurrences in a string

I have a problem related to Episode Mining, in which I need to find the maximal utility of an episode given its occurrences in an event sequence. But I am presenting the question in a different form so that it's easier to explain.
There is a long string S where each character has some positive score. Given another string T, find a match of T in S, i.e. a set of occurrences of the character sequence of T, such that:
The occurrences are non-overlapping.
The sequence of characters in S must be the same as in T, but it may be discontinuous.
Each occurrence must lie within a given window.
The total score of a match is found by simply adding the scores of the characters of each occurrence. The problem is to find the match with the maximum score among all possible matches.
Example: String S = a(2) b(3) e(1) d(10) d(7) c(1) a(5) d(8) b(5) d(6)
String T = a b d
Window size = 5
Two matches of string T are:
[1,2,4], [7,9,10]. Score: [2+3+10] + [5+5+6] = 31
[1,2,5], [7,9,10]. Score: [2+3+7] + [5+5+6] = 28. The score is maximal for match 1, so it is the required answer.
We didn't consider the occurrences [1,2,8] or [1,2,10], as they do not fit in the given window: (8-1) > 5.
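To make the rules concrete, here is a small Python checker for the example above (illustrative only, with 1-based positions as in the example; it verifies and scores one candidate match but does not solve the optimization itself):

def match_score(S, scores, T, window, occurrences):
    # occurrences: a list of 1-based position lists, one per occurrence.
    used = set()
    total = 0
    for occ in occurrences:
        assert len(occ) == len(T)
        assert all(S[p - 1] == c for p, c in zip(occ, T))  # spells out T
        assert occ == sorted(occ)                          # left to right
        assert occ[-1] - occ[0] <= window                  # fits the window
        assert not used & set(occ)                         # non-overlapping
        used |= set(occ)
        total += sum(scores[p - 1] for p in occ)
    return total

S = "abeddcadbd"
scores = [2, 3, 1, 10, 7, 1, 5, 8, 5, 6]
print(match_score(S, scores, "abd", 5, [[1, 2, 4], [7, 9, 10]]))  # 31
print(match_score(S, scores, "abd", 5, [[1, 2, 5], [7, 9, 10]]))  # 28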
So, I would like to know whether there is an efficient way to find the set of occurrences, i.e. the match, that gives the maximum score.

Binary search - worst/avg case

I'm finding it difficult to understand why/how the worst and average case for searching for a key in an array/list using binary search is O(log(n)).
log(1,000,000) is only 6, and log(1,000,000,000) is only 9 - I get that, but I don't understand the explanation. Without actually testing it, how do we know that the average/worst case really is log(n)?
I hope you guys understand what I'm trying to say. If not, please let me know and I'll try to explain it differently.
Worst case
Every time the binary search code makes a decision, it eliminates half of the remaining elements from consideration. So you're dividing the number of elements by 2 with each decision.
How many times can you divide by 2 before you are down to only a single element? If n is the starting number of elements and x is the number of times you divide by 2, we can write this as:
n / (2 * 2 * 2 * ... * 2) = 1 [the '2' is repeated x times]
or, equivalently,
n / 2^x = 1
or, equivalently,
n = 2^x
So log base 2 of n gives you x, which is the number of decisions being made.
Finally, you might ask, if I used log base 2, why is it also OK to write it as log base 10, as you have done? The base does not matter because the difference is only a constant factor which is "ignored" by Big O notation.
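If you would rather not take the algebra on faith, a few lines of Python make the same point by simply counting halvings:

def halvings(n):
    # How many times can n be halved before one element remains?
    count = 0
    while n > 1:
        n //= 2
        count += 1
    return count

print(halvings(1_000_000))      # 19, and log2(1,000,000) is about 19.93
print(halvings(1_000_000_000))  # 29, and log2(1,000,000,000) is about 29.90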
Average case
I see that you also asked about the average case. Consider:
There is only one element in the array that can be found on the first try.
There are only two elements that can be found on the second try. (Because after the first try, we chose either the right half or the left half.)
There are only four elements that can be found on the third try.
You can see the pattern: 1, 2, 4, 8, ... , n/2. To express the same pattern going in the other direction:
Half the elements take the maximum number of decisions to find.
A quarter of the elements take one fewer decision to find.
etc.
Since half of the elements take the maximum amount of time, it doesn't matter how much less time the other elements take. We could assume that all elements take the maximum amount of time, and even if half of them actually take 0 time, our assumption would not be more than double whatever the true average is. We can ignore "double" since it is a constant factor. So the average case is the same as the worst case, as far as Big O notation is concerned.
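The same level-by-level argument can be checked numerically (a quick sketch that sums the decision counts over all elements):

import math

def average_comparisons(n):
    # 1 element is found on try 1, 2 on try 2, 4 on try 3, and so on.
    total, found, tries = 0, 0, 1
    while found < n:
        at_level = min(2 ** (tries - 1), n - found)
        total += at_level * tries
        found += at_level
        tries += 1
    return total / n

for n in (15, 1_000_000):
    print(n, round(average_comparisons(n), 2), round(math.log2(n), 2))
# n = 15: average 3.27 comparisons vs log2(15) = 3.91
# n = 1,000,000: average 18.95 comparisons vs log2(1,000,000) = 19.93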
For binary search, the array should be arranged in ascending or descending order.
In each step, the algorithm compares the search key value with the key value of the middle element of the array.
If the keys match, then a matching element has been found and its index, or position, is returned.
Otherwise, if the search key is less than the middle element's key, then the algorithm repeats its action on the sub-array to the left of the middle element.
Or, if the search key is greater, then the algorithm repeats its action on the sub-array to the right.
If the remaining array to be searched is empty, then the key cannot be found in the array and a special "not found" indication is returned.
So, binary search is a dichotomic divide-and-conquer search algorithm. It takes logarithmic time because the number of remaining elements is halved in each iteration.
For sorted lists, on which we can do a binary search, each "decision" compares your key to the middle element: if it is greater, the search takes the right half of the list; if it is less, it takes the left half; if it is a match, the element at that position is returned. You effectively halve the list with every decision, yielding O(log n).
Binary search, however, only works on sorted lists. For unsorted lists you can do a straight linear search starting at the first element, with complexity O(n).
O(log n) < O(n)
Which approach is best, though, depends entirely on how many searches you'll be doing, your inputs, and so on.
For binary search, the prerequisite is a sorted array as input.
• As the list is sorted, we certainly don't have to check every word in the dictionary to look up a word.
• The basic strategy is to repeatedly halve our search range until we find the value.
• For example, look for 5 in the list of 9 numbers below: v = 1 1 3 5 8 10 18 33 42
• We would first look in the middle: 8
• Since 5 < 8, we know we can look at just the first half: 1 1 3 5
• Looking at the middle number again, we narrow down to: 3 5
• We stop when we're down to one number: 5
How many comparisons are needed: at most floor(log2(9)) + 1 = 4, which is O(log2 n).
#include <vector>
using std::vector;

// Classic iterative binary search: returns the index of val in the
// sorted vector v, or -1 if val is not present.
int binary_search (const vector<int>& v, int val) {
    int from = 0;
    int to = static_cast<int>(v.size()) - 1;
    while (from <= to) {
        int mid = from + (to - from) / 2;  // avoids overflow of from + to
        if (val == v[mid])
            return mid;                    // found: return its position
        else if (val > v[mid])
            from = mid + 1;                // search the right half
        else
            to = mid - 1;                  // search the left half
    }
    return -1;                             // not found
}
