Explanation of normalized edit distance formula - string

Based on this paper:
IEEE TRANSACTIONS ON PATTERN ANALYSIS: Computation of Normalized Edit Distance and Applications. In this paper, the normalized edit distance is defined as follows:
Given two strings X and Y over a finite alphabet, the normalized edit distance between X and Y, d(X, Y), is defined as the minimum of W(P) / L(P), where P is an editing path between X and Y, W(P) is the sum of the weights of the elementary edit operations of P, and L(P) is the number of these operations (the length of P).
Can I safely translate the normalized edit distance algorithm explained above as this:
normalized edit distance =
levenshtein(query 1, query 2)/max(length(query 1), length(query 2))

You are probably misunderstanding the metric. There are two issues:
The normalization step divides W(P), the weight of the edit path, by L(P), the length of the edit path, not by the maximum length of the strings as you did;
Also, the paper showed (Example 3.1) that the normalized edit distance cannot simply be computed from the Levenshtein distance. You probably need to implement their algorithm.
An explanation of Example 3.1 (c):
From aaab to abbb, the paper used the following transformations:
match a with a;
skip a in the first string;
skip a in the first string;
skip b in the second string;
skip b in the second string;
match the final bs.
These are 6 operations, which is why L(P) is 6; from the matrix in (a), matching has cost 0 and skipping has cost 2, so the total cost is 0 + 2 + 2 + 2 + 2 + 0 = 8, which is exactly W(P), and W(P) / L(P) ≈ 1.33. Similar results can be obtained for (b), which I'll leave to you as an exercise :-)

The 3 in figure 2(a) refers to the cost of changing "a" to "b" or the cost of changing "b" to "a". The columns with lambdas in figure 2(a) mean that it costs 2 in order to insert or delete either an "a" or a "b".
In figure 2(b), W(P) = 6 because the algorithm does the following steps:
keep first a (cost 0)
convert first b to a (cost 3)
convert second b to a (cost 3)
keep last b (cost 0)
The sum of the costs of the steps is W(P) = 6. The number of steps is 4, which is L(P), so the normalized cost of this path is 6/4 = 1.5.
In figure 2(c), the steps are different:
keep first a (cost 0)
delete first b (cost 2)
delete second b (cost 2)
insert a (cost 2)
insert a (cost 2)
keep last b (cost 0)
In this path there are six steps so the L(P) is 6. The sum of the costs of the steps is 8 so W(P) is 8. Therefore the normalized edit distance is 8/6 = 4/3 which is about 1.33.
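To make the arithmetic concrete, here is a small Python sketch (not the paper's minimization algorithm, just the bookkeeping for the two paths described above) that evaluates W(P), L(P), and W(P)/L(P) for the paths of figures 2(b) and 2(c), using the weights from figure 2(a): matching costs 0, substituting a for b (or b for a) costs 3, and inserting or deleting a letter costs 2.

# Weights from figure 2(a): a match costs 0, an a<->b substitution costs 3,
# and inserting or deleting a single letter costs 2.
MATCH, SUBST, INDEL = 0, 3, 2

def normalized_cost(path):
    # path is a list of per-operation weights; returns (W(P), L(P), W(P)/L(P)).
    w, l = sum(path), len(path)
    return w, l, w / l

# Path of figure 2(b): keep a, b->a, b->a, keep b  (4 operations)
path_b = [MATCH, SUBST, SUBST, MATCH]
# Path of figure 2(c): keep a, delete b, delete b, insert a, insert a, keep b  (6 operations)
path_c = [MATCH, INDEL, INDEL, INDEL, INDEL, MATCH]

print(normalized_cost(path_b))  # (6, 4, 1.5)
print(normalized_cost(path_c))  # (8, 6, 1.333...)

Even though the path in (c) has the larger total weight (8 versus 6), it has the smaller normalized cost (about 1.33 versus 1.5). This is exactly why the minimum of W(P)/L(P) cannot be obtained by first minimizing W(P) (a plain weighted Levenshtein distance) and then dividing by some fixed length.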

Related

Binary Search Complexity

What is the time complexity of binary search when it takes an array of n elements as input from the user?
The time complexity of binary search is O(log n),
whereas the time complexity of taking the array as input from the user is O(n),
so for the whole program the answer is: O(log n) + O(n) = O(n).
Why is that? Because reading the n numbers into the array and running binary search are independent of each other, so their costs simply add, and O(n) dominates O(log n).
Now we need to understand time complexity in general:
In the simplest terms, for a problem where the input size is n:
Best case = fastest time to complete, with optimal inputs chosen.
For example, the best case for a sorting algorithm would be data that's already sorted.
Worst case = slowest time to complete, with pessimal inputs chosen.
For example, the worst case for a sorting algorithm might be data that are sorted in reverse order (but it depends on the particular algorithm).
Average case = arithmetic mean. Run the algorithm many times, using many different inputs of size n that come from some distribution that generates these inputs (in the simplest case, all the possible inputs are equally likely), and compute the total running time (by adding the individual times), and divide by the number of trials. You may also need to normalize the results based on the size of the input sets.
Example (Binary search):
Suppose we have the following binary search function:
binarySearch(arr, x, low, high)
    repeat while low <= high
        mid = (low + high) / 2
        if (x == arr[mid])
            return mid
        else if (x > arr[mid]) // x is in the right half
            low = mid + 1
        else // x is in the left half
            high = mid - 1
    return not found
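For reference, here is a minimal runnable translation of that pseudocode into Python (the function name and the -1 "not found" convention are my own choices, not part of the original):

def binary_search(arr, x):
    # Return the index of x in the sorted list arr, or -1 if x is absent.
    low, high = 0, len(arr) - 1
    while low <= high:
        mid = (low + high) // 2
        if arr[mid] == x:
            return mid
        elif x > arr[mid]:   # x is in the right half
            low = mid + 1
        else:                # x is in the left half
            high = mid - 1
    return -1

print(binary_search([1, 3, 5, 7, 9, 11], 7))   # 3
print(binary_search([1, 3, 5, 7, 9, 11], 4))   # -1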
Now, let's analyze its time complexity.
Best Case Time Complexity of Binary Search
The best case of Binary Search occurs when:
The element to be searched is in the middle of the list
In this case, the element is found in the first step itself and this involves 1 comparison.
Therefore, Best Case Time Complexity of Binary Search is O(1).
Average Case Time Complexity of Binary Search
Let the input be N distinct numbers: a1, a2, ..., a(N-1), aN
We need to find element P.
There are two cases:
Case 1: The element P can be in N distinct indexes from 0 to N-1.
Case 2: There will be a case when the element P is not present in the list.
There are N possibilities under Case 1 and one under Case 2, so there are N+1 distinct cases to consider in total.
The number of comparisons needed to find P depends on where P lies relative to the successive midpoints probed by Binary Search:
The element at index N/2 can be found in 1 comparison, as Binary Search starts from the middle.
Similarly, in the 2nd comparison, the elements at index N/4 and 3N/4 are compared, based on the result of the 1st comparison.
Along the same lines, in the 3rd comparison, the elements at index N/8, 3N/8, 5N/8, and 7N/8 are compared, based on the result of the 2nd comparison.
Based on this, we know that:
Elements requiring 1 comparison: 1
Elements requiring 2 comparisons: 2
Elements requiring 3 comparisons: 4
Therefore, Elements requiring I comparisons: 2^(I-1)
The maximum number of comparisons = Number of times N is divided by 2 so that result is 1 = Comparisons to reach 1st element = logN comparisons
I can vary from 1 to logN
Total number of comparisons = 1 * (Elements requiring 1 comparison) + 2 * (Elements requiring 2 comparisons) + ... + logN * (Elements requiring logN comparisons)
Total number of comparisons = 1 * (1) + 2 * (2) + 3 * (4) + ... + logN * (2^(logN-1))
Total number of comparisons = 1 + 4 + 12 + 32 + ... = 2^logN * (logN - 1) + 1
Total number of comparisons = N * (logN - 1) + 1
Total number of cases = N+1
Therefore, average number of comparisons = ( N * (logN - 1) + 1 ) / (N+1)
Average number of comparisons = N * logN / (N+1) - N/(N+1) + 1/(N+1)
The dominant term is N * logN / (N+1), which is approximately logN. Therefore, the Average Case Time Complexity of Binary Search is O(logN).
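If you want to sanity-check that figure empirically, a short sketch like the following (my own instrumentation, not part of the original analysis) counts the comparisons for every possible target in a sorted array of N distinct keys and compares the average to log2(N):

import math

def comparisons_to_find(arr, x):
    # Count how many times x is compared against arr[mid] before the search ends.
    low, high, count = 0, len(arr) - 1, 0
    while low <= high:
        mid = (low + high) // 2
        count += 1
        if arr[mid] == x:
            return count
        elif x > arr[mid]:
            low = mid + 1
        else:
            high = mid - 1
    return count  # x not present: every probe failed

N = 1024
arr = list(range(N))
avg = sum(comparisons_to_find(arr, x) for x in arr) / N
print(avg, math.log2(N))  # roughly 9.0 versus 10.0 -- both Theta(log N)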
Worst Case Time Complexity of Binary Search
The worst case of Binary Search occurs when:
The element to be searched is at the first or last index, or is not present in the list at all.
In this case, the number of comparisons required is logN.
Therefore, the Worst Case Time Complexity of Binary Search is O(logN).

Sum of arrays with repeated indices

How can I add an array of numbers to another array by indices, especially with repeated indices? Like this:
x
1 2 3 4
idx
0 1 0
y
5 6 7
] x add idx;y NB. (1 + 5 + 7) , (2 + 6) , 3 , 4
13 8 3 4
All nouns (x, idx, y) can have millions of items, and I need a fast 'add' verb.
UPDATE
Solution (thanks to Dan Bron):
cumIdx =: 1 : 0
:
'i z' =. y
n =. ~. i
x n}~ (n{x) + i u//. z
)
(1 2 3 4) + cumIdx (0 1 0);(5 6 7)
13 8 3 4
For now, a short answer in the "get it done" mode:
data =. 1 2 3 4
idx =. 0 1 0
updat =. 5 6 7
cumIdx =: adverb define
:
n =. ~. m
y n}~ (n{y) + m +//. x
)
updat idx cumIdx data NB. 13 8 3 4
In brief:
Start by grouping the update array (in your post, y¹) where your index array has the same value, and taking the sum of each group
Accomplish this using the adverb key (/.) with sum (+/) as its verbal argument, deriving a dyadic verb whose arguments are idx on the left and the update array (your y, my updat) on the right.
Get the nub (~.) of your index array
Select these (unique) indices from your value array (your x, my data)
This will, by definition, have the same length as the cumulative sums we calculated in (1.)
Add these to the cumulative sum
Now you have your final updates to the data; updat and idx have the same length, so you just merge them into your value array using }, as you did in your code
Since we kept the update array small (never greater than its original length), this should have decent performance on larger inputs, though I haven't run any tests. The only performance drawback is the double computation of the nub of idx (once explicitly with ~. and once implicitly with /.), though since your values are integers, this should be relatively cheap; it's one of J's stronger areas, performance-wise.
¹ I realize renaming your arrays makes this answer more verbose than it needs to be. However, since you named your primary data x rather than y (which is the convention), if I had just kept your naming convention, then when I invoked cumIdx, the names of the nouns inside the definition would have the opposite meanings to the ones outside the definition, which I thought would cause greater confusion. For this reason, it's best to keep "primary data" on the right (y) and "control data" on the left (x). You might also consider constraining your use of the special names x, y, u, v, m, and n to where they're already implicitly defined by invoking an explicit definition; definitely never change their nameclasses.
This approach also uses key (/.) but is a bit more simplistic in its approach.
It is likely to use more space especially for big updates than Dan Bron's.
addByIdx=: {{ (m , i.#y) +//. x,y }}
updat idx addByIdx data
13 8 3 4
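For comparison, the same "accumulate at possibly repeated indices" operation written as a Python/NumPy sketch (just to illustrate the semantics, not a J solution): np.add.at applies the additions unbuffered, so updates aimed at the same index accumulate instead of overwriting each other.

import numpy as np

x = np.array([1, 2, 3, 4])
idx = np.array([0, 1, 0])
y = np.array([5, 6, 7])

# Unbuffered x[idx] += y: both updates aimed at index 0 (5 and 7) land.
np.add.at(x, idx, y)
print(x)  # [13  8  3  4]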

Index values in Excel - Highest value is X, lowest is Y and all in between is divided in between X and Y

I need to index prices in Excel, so the highest price is 5 and the lowest is 1.
All the prices in between need to be automatically sorted and given a value, depending on the number of entries.
eg.
8$ -> 5.0 (Highest price = index score 5)
3$ -> 1.0 (Lowest price = index score 1)
7$ -> 2.5 (Price between the highest and lowest, gets a weighted score, depending on amount of entries in the list)
Your sentence "Price between the highest and lowest, gets a weighted score, depending on amount of entries in the list" is very unspecific. How is the weighted score calculated? How does the number of entries in the list impact it? Can you give a specific example so we understand why the value for 7$ is 2.5? Does the weight depend on the position of 7$ in the sorted list of values?
On the other hand, if all you're looking for is an increasing affine function that maps values in a given range [a,b] to another range [c,d], then there is a simple formula. In your example, a=3$, b=8$, c=1.0, d=5.0.
For a new value x in the range [a,b], the corresponding value y in the range [c,d] is given by:
y = ( (d - c) * x + (bc - ad) ) / (b - a)
With your example, (d-c)/(b-a) = (5.0-1.0)/(8$-3$) = 0.8/$ and (bc-ad)/(b-a) = (8$*1.0-3$*5.0)/(8$-3$) = -1.4.
Therefore y = (0.8/$) * x - 1.4.
For instance, if x = 7$, then y = (0.8/$) * 7$ - 1.4 = 4.2.
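As a quick sketch outside Excel (plain Python, just to check the arithmetic of the affine map):

def scale(x, a, b, c, d):
    # Map x linearly from the range [a, b] onto the range [c, d].
    return c + (x - a) * (d - c) / (b - a)

print(scale(3, 3, 8, 1.0, 5.0))  # 1.0 (lowest price)
print(scale(8, 3, 8, 1.0, 5.0))  # 5.0 (highest price)
print(scale(7, 3, 8, 1.0, 5.0))  # 4.2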

String manipulation with dynamic programming

I have a problem where I have a string of length N, where (1 ≤ N ≤ 10^5). This string will only have lower case letters.
We have to rewrite the string so that it has a series of "streaks", where the same letter is included at least K (1 ≤ K ≤ N) times in a row.
It costs a_ij to change a single specific letter in the string from i to j. There are M different possible letters you can change each letter to.
Example: "abcde" is the input string. N = 5 (length of "abcde"), M = 5 (letters are A, B, C, D, E), and K = 2 (each letter must be repeated at least 2 times) Then we are given a M×M matrix of values a_ij, where a_ij is an integer in the range 0…1000 and a_ii = 0 for all i.
0 1 4 4 4
2 0 4 4 4
6 5 0 3 2
5 5 5 0 4
3 7 0 5 0
Here, it costs 0 to change from A to A, 1 to change from A to B, 4 to change from A to C, and so on. It costs 2 to change from B to A.
The optimal solution in this example is to change the a into b, change the d into e, and then change both e’s into c’s. This will take 1 + 4 + 0 + 0 = 5 moves, and the final combo string will be "bbccc".
It becomes complicated as it might take less time to switch from using button i to an intermediate button k and then from button k to button j rather than from i to j directly (or more generally, there may be a path of changes starting with i and ending with j that gives the best overall cost for switching from button i ultimately to button j).
To solve for this issue, I am treating the matrix as a graph, and then performing Floyd Warshall to find the fastest time to switch letters. This will take O(M^3) which is only 26^3.
My next step is to perform dynamic programming on each additional letter to find the answer. If someone could give me advice on how to do this, I would be thankful!
Here are some untested ideas. I'm not sure if this is efficient enough (or completely worked out) but it looks like 26 * 3 * 10^5. The recurrence could be converted to a table, although with higher Ks, memoisation might be more efficient because of reduced state possibilities.
Assume we've recorded 26 prefix arrays for conversion of the entire list to each of the characters using the best conversion schedule, using a path-finding method. This lets us calculate the cost of a conversion of a range in the string in O(1) time, using a function, cost.
A letter in the result can be one of three things: either it's the kth instance of character c, or it's before the kth, or it's after the kth. This leads to a general recurrence:
f(i, is_kth, c) ->
cost(i - k + 1, i, c) + A
where
A = min(
f(i - k, is_kth, c'),
f(i - k, is_after_kth, c')
) forall c'
A takes constant time since the alphabet is constant, assuming earlier calls to f have been tabled.
f(i, is_before_kth, c) ->
cost(i, i, c) + A
where
A = min(
f(i - 1, is_before_kth, c),
f(i - 1, is_kth, c'),
f(i - 1, is_after_kth, c')
) forall c'
Again A is constant time since the alphabet is constant.
f(i, is_after_kth, c) ->
cost(i, i, c) + A
where
A = min(
f(i - 1, is_after_kth, c),
f(i - 1, is_kth, c)
)
A is constant time in the latter. We would seek the best result of the recurrence applied to each character at the end of the string with either state is_kth or state is_after_kth.
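Here is a Python sketch of one concrete way to put these pieces together. It is not the three-state recurrence above but an equivalent block-partition formulation: after Floyd-Warshall on the conversion costs, prefix sums give the cost of converting any range of the string to any letter in O(1), and a DP over "the prefix of length i ends with a completed block (length >= K) of letter c", maintained with a running minimum, makes the whole thing O(N*M) after the O(M^3) preprocessing.

import math

def min_streak_cost(s, K, cost, alphabet):
    # Minimum total cost to rewrite s so every maximal run of equal letters
    # has length at least K. cost[i][j] is the direct cost of changing the
    # i-th letter of `alphabet` into the j-th.
    M = len(alphabet)
    idx = {ch: i for i, ch in enumerate(alphabet)}

    # Floyd-Warshall: cheapest way to turn letter i into letter j,
    # possibly through intermediate letters.
    dist = [row[:] for row in cost]
    for k in range(M):
        for i in range(M):
            for j in range(M):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]

    N = len(s)
    # pref[c][i] = cost of converting s[0:i] entirely to letter c.
    pref = [[0] * (N + 1) for _ in range(M)]
    for c in range(M):
        for i, ch in enumerate(s):
            pref[c][i + 1] = pref[c][i] + dist[idx[ch]][c]

    INF = math.inf
    # best[i] = min cost to rewrite s[0:i] as blocks, each a single letter
    # repeated at least K times. A block of letter c covering s[j:i] costs
    # pref[c][i] - pref[c][j], so best[i] = min over c of
    # pref[c][i] + min over j <= i-K of (best[j] - pref[c][j]).
    best = [INF] * (N + 1)
    best[0] = 0
    run_min = [INF] * M   # run_min[c] = min over admissible j of best[j] - pref[c][j]
    for i in range(1, N + 1):
        j = i - K          # position j = i-K becomes admissible at step i
        if j >= 0:
            for c in range(M):
                run_min[c] = min(run_min[c], best[j] - pref[c][j])
        for c in range(M):
            if run_min[c] < INF:
                best[i] = min(best[i], run_min[c] + pref[c][i])
    return best[N]

cost = [[0, 1, 4, 4, 4],
        [2, 0, 4, 4, 4],
        [6, 5, 0, 3, 2],
        [5, 5, 5, 0, 4],
        [3, 7, 0, 5, 0]]
print(min_streak_cost("abcde", 2, cost, "abcde"))  # 5, matching the "bbccc" example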

Calculating contrast values on Excel

I am currently studying experimental designs in statistics and I am calculating values pertaining to 2^3 factorial designs.
The question that I have is particularly with the calculations of the "contrasts".
My goal with this question is to learn how to use the "Coded Factors" and "Total" tables in order to get the "Contrast" values, using the IF function in Excel.
For example, Contrast A is calculated as: x - y, where
x = the sum of the values in Total where Coded Factor A is +,
and y = the sum of the values in Total where Coded Factor A is -.
This would be rather simple, but for the interactions it is a bit more complex.
For example, contrast AC is obtained as: x - y, where
x = the sum of the values in Total where the product of Coded Factor A and Coded Factor C is +,
and y = the sum of the values in Total where the product of Coded Factor A and Coded Factor C is -.
I would really appreciate your help.
Edited:
Considering the way IF statements work, I thought it might be a good idea to convert + into 1 and - into -1 to make the calculation straightforward.
Convert all +/- to 1/-1. Use some cells as helpers.
Put in these formulas :
J2 --> =LEFT(J1)
K2 --> =MID(J1,2,1)
L2 --> =MID(J1,3,1)
Put
J3 --> =IF(J$2="",1,INDEX($B3:$D3,MATCH(J$2,$B$2:$D$2,0)))
and drag to L10. Then
M3 --> =J3*K3*L3*G3
and drag to M10. Lastly,
M1 --> =SUM(M3:M10)
How to use : Input the Factor comb in cell J1 and the result will be in M1.
Idea : separate the factor text > load the multiplier > multiply Total values with multiplier > get sum.
Hope it helps.
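If it helps to see the same arithmetic outside Excel, here is a small Python sketch; the coded factors are the standard ±1 pattern for a 2^3 design, and the run totals are made-up illustrative numbers only:

from itertools import product
from math import prod

# The eight runs of a 2^3 design, with coded factors A, B, C as -1/+1.
runs = [dict(zip("ABC", signs)) for signs in product((-1, 1), repeat=3)]
totals = [10, 14, 9, 13, 12, 20, 11, 16]   # illustrative totals only

def contrast(factors):
    # Sum over runs of total * (product of the coded signs of the named factors).
    return sum(t * prod(run[f] for f in factors) for run, t in zip(runs, totals))

print(contrast("A"))    # main-effect contrast A
print(contrast("AC"))   # interaction contrast AC (sign = product of A's and C's codes)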
