I found this filled-in college worksheet. It states that the minimum (Hamming) distance of the ISBN code is 2 (Exercise 5). I know how to prove this and why that is. But then Exercise 8 states that ISBN cannot self-repair if the n-th digit is corrupted and n is not known.
To show why, it references Ex. 5 and writes:
H(x,y) = 2 < 2(1) + 1.
How does this show that ISBN cannot correct a single corrupted digit in general? What kind of formula is this?
OK, I probably already found the answer myself. If you want to add something, please feel free, for the benefit of other readers.
A code C is said to be k-errors correcting if, for every word w in the underlying Hamming space H, there exists at most one codeword c (from C) such that the Hamming distance between w and c is at most k. In other words, a code is k-errors correcting if, and only if, the minimum Hamming distance between any two of its codewords is at least 2k+1
(Wikipedia, citing Robinson, Derek J. S. (2003). An Introduction to Abstract Algebra. Walter de Gruyter. pp. 255–257. ISBN 978-3-11-019816-4.)
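With d_min = 2 and k = 1 this condition fails, since 2 < 2·1 + 1 = 3, which is exactly the inequality the worksheet quotes: the two nearest codewords are only 2 apart, so a word with one corrupted digit can sit at distance 1 from both of them. Here is a minimal sketch of that (Python, assuming the standard ISBN-10 weighted checksum; the two digit strings are just examples chosen to satisfy the check):

def isbn10_valid(digits):
    # Standard ISBN-10 check: the weighted sum 10*d1 + 9*d2 + ... + 1*d10 must be 0 mod 11.
    return sum(w * d for w, d in zip(range(10, 0, -1), digits)) % 11 == 0

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

# Two valid ISBN-10 codewords differing in exactly two positions (first digit and check digit).
c1 = [0, 3, 0, 6, 4, 0, 6, 1, 5, 2]
c2 = [1, 3, 0, 6, 4, 0, 6, 1, 5, 3]
assert isbn10_valid(c1) and isbn10_valid(c2)
assert hamming(c1, c2) == 2

# Corrupt a single digit of c1: the result is at distance 1 from BOTH codewords,
# so a decoder that only knows "one digit is wrong" cannot decide which codeword was sent.
w = [1, 3, 0, 6, 4, 0, 6, 1, 5, 2]
print(hamming(w, c1), hamming(w, c2))  # 1 1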
Given a string A and a string B (A shorter or the same length as B), I would like to check whether B contains a substring A' such that the Hamming distance between A and A' is at most k.
Does anyone know of an efficient algorithm to do this? Obviously I can just run a sliding window, but this is not feasible for the amount of data I'm working with. The Knuth-Morris-Pratt algorithm (https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm) would work when k=0, but I don't know whether it's modifiable to account for k>0.
Thanks!
Edit: I apparently forgot to clarify that I am looking for a contiguous (consecutive) substring, for example the substring from position 3 to position 7, without skipping characters. So Levenshtein distance is not applicable.
This is what you are looking for: https://en.wikipedia.org/wiki/Levenshtein_distance
If you use the Levenshtein distance and k=1, then you can use the fact that if the length of A is 2n+1 or 2n+2, then either the first or the last n characters of A must appear unchanged in B.
So you can use strstr to find all places in B where the first or last n characters match exactly, and then check the Levenshtein distance at those candidate positions.
Special case where A is 1 character: it matches everywhere with one error. Special case where A is 2 characters ab: call strchr(a), and if that fails call strchr(b).
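For the Hamming-distance version the question's edit asks about, the same pigeonhole idea is even simpler, because every window has the same length as A: with at most one mismatch, either the first or the second half of A occurs in B unchanged. A minimal Python sketch assuming k = 1 (the function name is my own):

def find_within_hamming_one(a, b):
    # Return start indices i such that the Hamming distance between a and b[i:i+len(a)] is <= 1.
    m, n = len(a), len(b)
    results = set()
    if m == 0 or m > n:
        return []
    half = m // 2
    first, second = a[:half], a[half:]

    def at_most_one_mismatch(start):
        mismatches = 0
        for j in range(m):
            if a[j] != b[start + j]:
                mismatches += 1
                if mismatches > 1:
                    return False
        return True

    # Exact matches of the first half anchor candidate windows starting at the same offset.
    pos = b.find(first)
    while pos != -1:
        if pos + m <= n and at_most_one_mismatch(pos):
            results.add(pos)
        pos = b.find(first, pos + 1)

    # Exact matches of the second half anchor windows starting `half` characters earlier.
    pos = b.find(second)
    while pos != -1:
        start = pos - half
        if start >= 0 and start + m <= n and at_most_one_mismatch(start):
            results.add(start)
        pos = b.find(second, pos + 1)

    return sorted(results)

print(find_within_hamming_one("abcd", "xxabzdxx"))  # [2]

The same idea generalizes to larger k by splitting A into k+1 pieces: with at most k mismatches, at least one piece must match exactly, so exact matches of the pieces give the candidate positions to verify.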
Suppose we are trying to measure similarity between two very similar documents.
Document A: "a b c d"
Document B: "a b c e"
This corresponds to the term-frequency matrix
   a  b  c  d  e
A  1  1  1  1  0
B  1  1  1  0  1
where the cosine similarity on the raw vectors is the dot product of the two vectors A and B, divided by the product of their magnitudes:
3/4 = (1*1 + 1*1 + 1*1 + 1*0 + 1*0) / (sqrt(4) * sqrt(4)).
But when we apply an inverse document frequency transformation by multiplying each term in the matrix by log(N / df_i), where N is the number of documents in the matrix (here 2) and df_i is the number of documents in which the term is present, we get the tf-idf matrix
   a  b  c  d     e
A  0  0  0  log2  0
B  0  0  0  0     log2
Since "a" appears in both documents, it has an inverse-document-frequency value of 0. This is the same for "b" and "c". Meanwhile, "d" is in document A, but not in document B, so it is multiplied by log(2/1). "e" is in document B, but not in document A, so it is also multiplied by log(2/1).
The cosine similarity between these two vectors is 0, suggesting the two are totally different documents. Obviously, this is incorrect. For these two documents to be considered similar to each other using tf-idf weightings, we would need a third document C in the matrix which is vastly different from documents A and B.
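Here is a small plain-Python sketch (no IR library assumed) that reproduces both numbers above: 0.75 for the raw counts and 0 for the standard tf-idf weighting.

import math

docs = {"A": "a b c d".split(), "B": "a b c e".split()}
vocab = sorted({t for d in docs.values() for t in d})
N = len(docs)

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

tf_a = [docs["A"].count(t) for t in vocab]
tf_b = [docs["B"].count(t) for t in vocab]
print(cosine(tf_a, tf_b))  # 0.75, the raw-count similarity computed above

# idf(t) = log(N / df_t): terms that appear in every document get weight 0
df = {t: sum(t in d for d in docs.values()) for t in vocab}
idf = [math.log(N / df[t]) for t in vocab]
tfidf_a = [x * w for x, w in zip(tf_a, idf)]
tfidf_b = [x * w for x, w in zip(tf_b, idf)]
print(cosine(tfidf_a, tfidf_b))  # 0.0: the only nonzero weights ("d" and "e") never overlap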
Thus, I am wondering whether and/or why we would use tf-idf weightings in combination with a cosine similarity metric to compare highly similar documents. None of the tutorials or StackOverflow questions I've read have been able to answer this question.
This post discusses similar failings with tf-idf weights using cosine similarities, but offers no guidance on what to do about them.
EDIT: As it turns out, the guidance I was looking for was in the comments of that blog post. It recommends using the formula
1 + log(N / n_i + 1)
as the inverse document frequency transformation instead. This keeps the weights of terms which appear in every document close to their original weights, while inflating the weights of terms which appear in few documents by a greater degree. Interesting that this formula is not mentioned more prominently in posts about tf-idf.
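For comparison, a small self-contained sketch of that smoothed variant applied to the same two documents (read here as 1 + log(N/n_i + 1), with the +1 added to the ratio before taking the log; the comment's intended grouping may differ), showing the similarity is no longer forced to 0:

import math

# Term-frequency vectors for documents A and B over the vocabulary [a, b, c, d, e],
# with document frequencies for N = 2 documents.
tf_a = [1, 1, 1, 1, 0]
tf_b = [1, 1, 1, 0, 1]
df = [2, 2, 2, 1, 1]
N = 2

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

idf_smoothed = [1 + math.log(N / n + 1) for n in df]
wa = [x * w for x, w in zip(tf_a, idf_smoothed)]
wb = [x * w for x, w in zip(tf_b, idf_smoothed)]
print(round(cosine(wa, wb), 3))  # ~0.661: shared terms still contribute, rarer terms weigh more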
Since "a" appears in both documents, it has an inverse-document-frequency value of 0
This is where you have made an error in using inverse document frequency (idf). Idf is meant to be computed over a large collection of documents (not just across two documents), the purpose being to be able to predict the importance of term overlaps in document pairs.
You would expect common terms, such as 'the', 'a', etc., to overlap across all document pairs. Should that contribute anything to your similarity score? No.
That is precisely the reason why the vector components are multiplied by the idf factor - just to dampen or boost a particular term overlap (a component of the form a_i*b_i being added to the numerator in the cosine-sim sum).
Now consider you have a collection on computer science journals. Do you believe that an overlap of terms such as 'computer' and 'science' across a document pair is considered to be important? - No.
And this will indeed happen because the idf of these terms would be considerably low in this collection.
What do you think will happen if you extend the collection to scientific articles of any discipline? In that collection, the idf value of the word 'computer' will no longer be low. And that makes sense because in this general collection, you would like to think that two documents are similar enough if they are on the same topic - computer science.
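A toy illustration of this point (the document counts below are made up purely for the example):

import math

# In a computer-science-only collection, almost every document contains "computer":
N_cs, df_cs = 10_000, 9_500
# In a general scientific collection, far fewer documents do:
N_general, df_general = 10_000, 1_200

print(round(math.log(N_cs / df_cs), 3))            # ~0.051: an overlap on "computer" counts for almost nothing
print(round(math.log(N_general / df_general), 3))  # ~2.12: the same overlap now carries real weight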
My formula is giving me unexpected results.
=IF(I5+H5=0,"Paid","Due")
See below:
H         I        J      K
-£34.40   £34.40   £0.00  Due
Cell H is calculated with this:
=(SUM(F5+G5))*-1
See the correct output with the exact same formula on the same worksheet:
=IF(I3+H3=0,"Paid","Due")
H          I         J      K
-£205.44   £205.44   £0.00  Paid
Cell H is calculated the same way:
=(SUM(F3+G3))*-1
Any ideas why the top calculation is not correct but the bottom one is?
This is most likely the floating point issue. You should not compare floating point numbers directly with = because computers cannot store most decimal fractions exactly. Just as dividing 1 dollar by 3 gives you $0.3333333333333, adding three of those does not necessarily get you back to 1 dollar, but slightly less, due to the "lost" 3333s at the end. The proper way to compare is to use a delta threshold, meaning "how close" it needs to be.
So instead of
IF(a+b=c, "Paid", "Due")
you would do
IF(ABS(c-(a+b))<0.01, "Paid", "Due")
In that case 0.01 is the delta, or "how close" it has to be: within 1 cent (or 1 penny here). The formula literally means "if the absolute value of the difference between c and (a+b) is less than 0.01, return Paid, else return Due." (Of course, this will also say Due if they overpaid, so keep that in mind.)
You should always do this when comparing calculated currency values.
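The underlying effect is not Excel-specific. A quick Python sketch of the same idea, using 0.1 + 0.2 versus 0.3 as a stand-in for the currency amounts:

owed = 0.3
payments = 0.1 + 0.2
print(payments)                     # 0.30000000000000004 on a typical machine
print(owed - payments == 0)         # False: direct equality comparison fails
print(abs(owed - payments) < 0.01)  # True: compare against a small tolerance (the delta) instead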
I tried to solve the problem of two-dimensional search using a combination of Aho-Corasick and one-dimensional KMP; however, I still need something faster.
To elaborate, I have a matrix A of characters of size n1*n2 and I wish to find all occurrences of a smaller matrix B of size m1*m2, and I want that to run in O(n1*n2 + m1*m2) if possible.
For example:
A = a b c b c b
    b c a c a c
    d a b a b a
    q a s d q a
and
B = b c b
    c a c
    a b a
The algorithm should return the indexes of, say, the upper-left corner of each match, which in this case are (0,1) and (0,3). Notice that the occurrences may overlap.
There is an algorithm called the Baker-Bird algorithm that I just recently encountered that appears to be a partial generalization of KMP to two dimensions. It uses two algorithms as subroutines - the Aho-Corasick algorithm (which itself is a generalization of KMP), and the KMP algorithm - to efficiently search a two-dimensional grid for a pattern.
I'm not sure if this is what you're looking for, but hopefully it helps!
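To make the structure of Baker-Bird concrete, here is a sketch in Python (the function name is my own, and naive scans stand in for the Aho-Corasick and KMP subroutines, so it has the algorithm's shape but not its O(n1*n2 + m1*m2) running time): each distinct pattern row gets an id, every text position that starts a pattern row is labelled with that id, and the 2-D search then reduces to a one-dimensional match on each column of labels.

def find_2d(pattern, text):
    # pattern and text are lists of equal-length strings (rows).
    m1, m2 = len(pattern), len(pattern[0])
    n1, n2 = len(text), len(text[0])

    # Give each distinct pattern row an id; the 2-D pattern becomes a 1-D sequence of ids.
    row_id = {}
    for row in pattern:
        row_id.setdefault(row, len(row_id))
    p_ids = [row_id[row] for row in pattern]

    # labels[i][j] = id of the pattern row starting at text[i][j:j+m2], or -1 if none.
    # (Baker-Bird computes these labels with Aho-Corasick in linear time.)
    labels = [[row_id.get(text[i][j:j + m2], -1) for j in range(n2 - m2 + 1)]
              for i in range(n1)]

    # Match the id sequence down each column; Baker-Bird uses KMP for this step.
    matches = []
    for j in range(n2 - m2 + 1):
        column = [labels[i][j] for i in range(n1)]
        for i in range(n1 - m1 + 1):
            if column[i:i + m1] == p_ids:
                matches.append((i, j))
    return matches

A = ["abcbcb", "bcacac", "dababa", "qasdqa"]
B = ["bcb", "cac", "aba"]
print(find_2d(B, A))  # [(0, 1), (0, 3)], the two overlapping occurrences from the question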
Would it be reasonable to systematically try all possible placements in a word search?
Grids commonly have dimensions of 15*15 (15 cells wide, 15 cells tall) and contain about 15 words to be placed, each of which can be placed in 8 possible directions. So in general it seems like you can calculate all possible placements by the following:
width * height * 8 directions to place a word * number of words
So for such a grid it seems like we only need to try 15*15*8*15 = 27,000 placements, which doesn't seem that bad at all. I was expecting some huge number, so either the grid size and number of words are really small, or there is something fishy with my math.
Formally speaking, assuming that x is the number of rows and y is the number of columns, you should sum the number of possibilities over every possible direction for every word.
Inputs are: x, y, l (average length of a word), n (total words)
So you have:
Horizontally, a word can start from 0 to x-l going right, or from l to x going left, for each row: 2x(x-l).
The same approach is used for vertical words: they can go from 0 to y-l going down, or from l to y going up, so it's 2y(y-l).
For diagonal words you should consider all possible start positions x*y and subtract l^2, since a rectangular region of the field can't be used. As before, you multiply by 4 since there are 4 possible directions: 4*(x*y - l^2).
Then you multiply the whole result by the number of words included:
total = n*(2*x*(x-l) + 2*y*(y-l) + 4*(x*y - l^2))
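As a sanity check, here is a small sketch plugging the question's numbers into this formula; l = 5 is an assumed average word length, since the question does not give one:

def total_placements(x, y, l, n):
    # Formula from above: horizontal + vertical + diagonal placements, times the number of words.
    return n * (2 * x * (x - l) + 2 * y * (y - l) + 4 * (x * y - l ** 2))

x = y = 15  # 15*15 grid from the question
n = 15      # about 15 words
l = 5       # assumed average word length

print(total_placements(x, y, l, n))  # 21000
print(15 * 15 * 8 * 15)              # 27000, the question's rougher upper bound

So the question's estimate really is the right order of magnitude: the count stays small simply because both the grid and the word list are small.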