KMP algorithm for multiple occurrences - string

Is it still possible to find all occurrences of a pattern in O(n) time with the Knuth–Morris–Pratt algorithm?

Suppose we have a string S[0,...,N]. Recall that the ith entry in the prefix array stores the length of the longest proper prefix of S[0,...,i] that is also a suffix of S[0,...,i].
We can calculate the prefix array P for pattern$subject (assuming that $ occurs in neither pattern nor subject). It remains to find the indices i such that P[i]==length(pattern); each such index marks the end of one occurrence. Both steps take linear time.
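The idea above can be sketched in Python as follows. This is a minimal illustration, not a tuned implementation: the names `prefix_function` and `find_all` are made up for this example, and the separator defaults to the NUL character on the assumption that it appears in neither string.

```python
def prefix_function(s):
    """p[i] = length of the longest proper prefix of s[:i+1] that is also its suffix."""
    p = [0] * len(s)
    for i in range(1, len(s)):
        k = p[i - 1]
        # Fall back to shorter borders until s[i] extends one of them.
        while k > 0 and s[i] != s[k]:
            k = p[k - 1]
        if s[i] == s[k]:
            k += 1
        p[i] = k
    return p


def find_all(pattern, subject, sep="\x00"):
    """Return the start index of every (possibly overlapping) occurrence.

    Assumption: sep occurs in neither pattern nor subject.
    """
    m = len(pattern)
    if m == 0:
        return []
    p = prefix_function(pattern + sep + subject)
    # An entry equal to m at combined index i means an occurrence ends there;
    # its start inside subject is i - 2*m.
    return [i - 2 * m for i in range(m + 1, len(p)) if p[i] == m]


print(find_all("aa", "aaaa"))    # overlapping occurrences are all reported
print(find_all("ab", "abxxab"))
```

Because the separator never matches, no prefix-array entry can exceed length(pattern), so the whole scan stays linear in the combined length.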

Related

Given k words, determine words equality in constant time

I encountered this question while studying for an algorithms test:
Given a set of k words (strings) with a total character count of n (meaning the sum of all word lengths is n), perform some sort of preprocessing on the words in O(n) time, such that whenever 2 words are compared, the answer (whether they are identical or not) is returned in O(1) time.
It's an interesting question but I could not find any direction to deal with it...
Construct a trie of all of the words, and for each word store the index (in the array of trie nodes) of the node reached by its last character. This is an O(n) operation.
Given two words, they are the same if and only if the stored index of their last node is the same.
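A minimal sketch of that trie idea, assuming the trie is stored as a list of per-node dictionaries (so the O(n) bound is expected time, relying on hashing). The function name `assign_word_ids` is hypothetical.

```python
def assign_word_ids(words):
    """Insert every word into a trie and return, per word, the id of its end node."""
    children = [{}]  # children[node] maps a character to a child node id; node 0 is the root
    ids = []
    for word in words:
        node = 0
        for ch in word:
            nxt = children[node].get(ch)
            if nxt is None:
                # Create a new node for an unseen continuation.
                nxt = len(children)
                children.append({})
                children[node][ch] = nxt
            node = nxt
        ids.append(node)
    return ids


words = ["cat", "car", "cat", "ca"]
ids = assign_word_ids(words)
# Comparing two words is now a single integer comparison.
print(ids[0] == ids[2])  # True: both are "cat"
print(ids[0] == ids[1])  # False: "cat" vs "car"
```

After the preprocessing pass, only the list of ids is needed to answer equality queries.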

String length using pattern matching technique

What is the maximum number of comparisons required to search a string of length L in a text whose length is T using first pattern matching technique?
Knuth–Morris–Pratt algorithm gives you the time complexity of O(L+T).

Comparing two strings fast

Is it possible to compare two substrings of the same length from the same big string faster than O(n), where n is the length of the substrings?
I mean queries like "compare the substring between positions x1 and y1 with the one between positions x2 and y2".
You could compute a suffix array for the big string.
This array tells you the order of the suffix starting at x1 relative to the suffix starting at x2.
You will also need to check whether the two suffixes diverge before the end of the substrings (if they do not, the substrings are equal). You could do this using a rolling hash, or by using a longest common prefix array.
There is a good tutorial on suffix array HERE
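The rolling-hash option can be sketched as below. This only answers equality (not ordering), and it is probabilistic: equal hashes are taken to mean equal substrings, which can in principle fail on a collision. The class name, the base and the modulus are illustrative choices, not taken from the answer.

```python
class SubstringHasher:
    """Polynomial prefix hashes: O(n) preprocessing, O(1) substring hash queries."""

    def __init__(self, s, base=911382323, mod=(1 << 61) - 1):
        self.mod = mod
        self.prefix = [0] * (len(s) + 1)  # prefix[i] = hash of s[:i]
        self.power = [1] * (len(s) + 1)   # power[i] = base**i % mod
        for i, ch in enumerate(s):
            self.prefix[i + 1] = (self.prefix[i] * base + ord(ch)) % mod
            self.power[i + 1] = (self.power[i] * base) % mod

    def hash(self, start, end):
        """Hash of s[start:end]."""
        return (self.prefix[end] - self.prefix[start] * self.power[end - start]) % self.mod

    def probably_equal(self, x1, x2, length):
        """True if s[x1:x1+length] and s[x2:x2+length] have the same hash."""
        return self.hash(x1, x1 + length) == self.hash(x2, x2 + length)


h = SubstringHasher("abracadabra")
print(h.probably_equal(0, 7, 4))  # "abra" vs "abra"
print(h.probably_equal(0, 1, 3))  # "abr" vs "bra"
```

If exact answers are required, a suffix array with an LCP array and range-minimum queries avoids the collision risk at the cost of a more involved preprocessing step.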
In principle, there are faster string searching algorithms than checking every character, but they are all proportional to the amount of data being searched.
e.g. extbndm, Boyer-Moore
There are multi-string searching algorithms (e.g. Aho-Corasick), but at the end of the day, they are all O(n). You can manage the constant part, but you need to search the whole string.

longest common substring for 2/3 strings : suffix array vs dynamic programming approach

If I want to find the longest common substring for 2 strings, which approach will be more efficient in terms of time/space complexity: using suffix arrays or DP?
DP will incur O(m*n) space with O(m*n) time complexity, what will be the time complexity of the suffix array approach?
1) Calculate the suffixes O(m) + O(n)
2) Sort them O((m+n) log2(m+n))
3) Finding longest common prefix for m+n-1 strings? [I'm not sure how to calculate #of comparisons]
Suffix arrays allow us to do many more things with the substrings (like searching for a substring), but since the rest of those functions are not needed in this case, will DP be considered an easier/cleaner approach? Which one should be used in the case where we are comparing 2 strings?
Also, what if we have more than 2 strings?
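For reference, the DP baseline mentioned in the question can be sketched as below. It keeps only two rows of the table, so time stays O(m*n) but space drops to O(n); the function name is made up for this example.

```python
def longest_common_substring_dp(a, b):
    """Longest common substring of a and b via the classic DP, two rows at a time."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)  # prev[j]: longest common suffix of a[:i-1] and b[:j]
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


print(longest_common_substring_dp("xabcdy", "zzabcdw"))  # abcd
```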
A suffix array would be better. The LCS (longest common substring for n strings) problem can be solved as below:
Concatenate S1, S2, ..., Sn as follows:
S = S1$1S2$2...Sn$n. Here the $i are special symbols (sentinels) that are pairwise different and lexicographically less than all symbols of the initial alphabet.
Compute the suffix array. Suffix arrays are commonly built in O(n*log n), but there is an important algorithm called DC3 which computes them in O(n), where n is the total length of the N strings. You can google this algorithm.
Compute the LCP of all adjacent suffixes.
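For the two-string case, the steps above can be sketched as follows. This is a deliberately naive illustration: the suffix array is built by sorting suffix slices (far slower than DC3), the LCP of each adjacent pair is computed by direct comparison, and a single NUL separator is assumed not to occur in either input. The function name is hypothetical.

```python
def longest_common_substring_sa(s1, s2, sep="\x00"):
    """Longest common substring of two strings via a (naively built) suffix array.

    Assumption: sep occurs in neither s1 nor s2.
    """
    s = s1 + sep + s2
    n1 = len(s1)
    # Naive suffix array: sort start positions by the suffix they begin.
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    best = ""
    for a, b in zip(sa, sa[1:]):
        # Only adjacent suffixes that start in different input strings matter.
        if (a < n1) == (b < n1):
            continue
        # Direct LCP computation; the separator stops matches from crossing strings.
        k = 0
        while a + k < len(s) and b + k < len(s) and s[a + k] == s[b + k]:
            k += 1
        if k > len(best):
            best = s[a:a + k]
    return best


print(longest_common_substring_sa("xabcdy", "zzabcdw"))  # abcd
```

Checking only adjacent pairs is enough for two strings, because the LCP of any two suffixes is the minimum of the adjacent LCP values between them in sorted order. For more than two strings, a sliding window over the LCP array that covers suffixes from every input is needed instead.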

Shortest uncommon substring: shortest substring of one string, that is not a substring of another string

We need to find the shortest uncommon substring between two strings,
i.e. if we have two strings a and b, we need to find the length of the shortest substring of a that is not a substring of b.
How can this problem be solved using a suffix array?
It should be solved with complexity of not more than n*lg(n).
This may be solved in O(N) time with a generalized suffix tree.
After constructing the generalized suffix tree in O(N) time, you need to perform breadth-first search and find the first node not belonging to both strings. The path from the root to this node gives the shortest uncommon substring.
The same thing may be done using the generalized suffix array for two input strings, also in O(N) time.
Construct the generalized suffix array along with LCP array (or construct the LCP array later from the suffix array). Add a single zero element as a prefix of the LCP array; add another zero element as a suffix. Find a pair of minimal LCP entries in such a way that there are suffixes of only one string delimited by these entries. This means you need to perform a linear scan of the LCP array, extracting two minimal values, but reset both minimal values to infinity every time you see a suffix of a different string or if you see a suffix belonging to both strings. The larger element of the best of these pairs (having the least value for the larger element in the pair) gives the length of the shortest uncommon substring. This works because this pair of minimal values delimits all descendants of the first node (closest to the root), not belonging to both strings in the corresponding suffix tree.
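As a cross-check for the scan described above, here is a simplified variant rather than the linear-time algorithm itself. It relies on this observation: for each start position i in a, if the longest prefix of a[i:] that occurs in b has length L and a[i:] is longer than L, then a[i:i+L+1] is the shortest uncommon substring starting at i. The match length is found from the two neighbours of a[i:] among the sorted suffixes of b. The sorting and slicing are naive, and the function names are invented for this sketch.

```python
from bisect import bisect_left


def common_prefix_len(s, t):
    """Length of the longest common prefix of s and t."""
    n = 0
    for x, y in zip(s, t):
        if x != y:
            break
        n += 1
    return n


def shortest_uncommon_substring(a, b):
    """Shortest substring of a that is not a substring of b, or None if there is none."""
    suffixes = sorted(b[i:] for i in range(len(b)))  # naive suffix array of b
    best = None
    for i in range(len(a)):
        tail = a[i:]
        pos = bisect_left(suffixes, tail)
        # The best match with any suffix of b is attained by a sorted neighbour.
        match = 0
        for j in (pos - 1, pos):
            if 0 <= j < len(suffixes):
                match = max(match, common_prefix_len(tail, suffixes[j]))
        if match < len(tail):
            candidate = tail[:match + 1]
            if best is None or len(candidate) < len(best):
                best = candidate
    return best


print(shortest_uncommon_substring("abcab", "abc"))  # ca
```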
