When is it good to use the KMP algorithm?

I understand that the KMP algorithm depends on a helper array that records where prefixes of the pattern are also suffixes.
It won't be efficient when that condition is not fulfilled, i.e. when the helper array contains all zeroes.
Would the runtime be O(m + n)?
If I am right, what is a better substring algorithm in this case?

To understand when KMP is a good algorithm to use, it's often helpful to ask the question "what's the alternative?"
KMP has the nice advantage that it is guaranteed worst-case efficient. The preprocessing time is always O(n), and the searching time is always O(m). There are no worst-case inputs, no probability of getting unlucky, etc. In cases where you are searching for very long strings (large n) inside of really huge strings (large m), this may be highly desirable compared to other algorithms like the naive one (which can take time Θ(mn) in bad cases), Rabin-Karp (pathological inputs can take time Θ(mn)), or Boyer-Moore (worst-case can be Θ(mn)). You're right that KMP might not be all that necessary in the case where there aren't many overlapping parts of the string, but the fact that you never need to worry about whether there's a bad case is definitely a nice thing to have!
KMP also has the nice property that the preprocessing only has to be done once. If you know you're going to search for the same substring lots and lots of times, you can do the O(n) preprocessing work once and then search in any length-m string you'd like in time O(m).
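To make the "preprocess once, search many times" point concrete, here is a minimal sketch in Python. The function names build_failure_table and kmp_search are my own, not from any library, and the variable naming follows the convention above (pattern length n, text length m).

```python
def build_failure_table(pattern: str) -> list[int]:
    """fail[i] = length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it. Runs in O(n) for a pattern of length n."""
    fail = [0] * len(pattern)
    k = 0  # length of the prefix matched so far
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]          # fall back to a shorter border
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text: str, pattern: str, fail: list[int]) -> list[int]:
    """Return the start index of every occurrence of pattern in text.
    The text index never moves backwards, so this is O(m)."""
    if not pattern:
        return []
    matches = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]          # keep going to find overlapping matches
    return matches


pattern = "abab"
fail = build_failure_table(pattern)   # O(n) preprocessing, done once
for text in ("abababab", "xxabab", "none here"):
    print(text, kmp_search(text, pattern, fail))
```

Note that a pattern such as "abcd" produces an all-zero table, and the search is still linear in that case: every mismatch simply restarts the pattern from its first character without re-reading any text.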

Related

Are there multiple KMP algorithmic approaches with different space complexities? What is the difference?

I am reading about the KMP substring search algorithm, and the examples I find online use a one-dimensional table to hold the prefix information.
I also read Sedgewick's explanation, which uses a 2-D array to build the table and explicitly states that the space complexity of KMP is O(RM), where R is the alphabet size and M is the pattern size. Everywhere else it is stated that the space complexity is just O(M + N), i.e. the size of the text to process plus the size of the pattern itself.
So I am confused about the difference. Are there multiple KMP approaches? Do they have different scopes? Or what am I missing?
Why is the 2-D table needed if the 1-D table can solve the substring problem too?
I guess Sedgewick wanted to demonstrate a variant of KMP that constructs a deterministic finite automaton in the standard sense of that term. It's a weird choice that (as you observe) bloats the table, and with it the preprocessing time, to O(RM), but maybe there was a compelling pedagogical reason that I don't appreciate (then again my PhD was on algorithms, so...). I'd find another description that follows the original more closely.
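For comparison, here is a rough sketch of the DFA-style variant in Python. It follows the textbook construction as I understand it; the names build_dfa and dfa_search are hypothetical, and the code assumes a non-empty pattern whose characters all belong to the given alphabet. The table has one entry per (state, character) pair, which is where the O(RM) space comes from, whereas the usual failure table has only M entries.

```python
def build_dfa(pattern: str, alphabet: str) -> list[dict[str, int]]:
    """dfa[state][c] = next state after reading c in `state`.
    Uses O(R * M) space for alphabet size R and pattern length M."""
    m = len(pattern)
    dfa = [{c: 0 for c in alphabet} for _ in range(m)]
    dfa[0][pattern[0]] = 1
    x = 0  # restart state: where we would be after dropping the first char
    for j in range(1, m):
        for c in alphabet:
            dfa[j][c] = dfa[x][c]       # mismatch: behave like the restart state
        dfa[j][pattern[j]] = j + 1      # match: advance
        x = dfa[x][pattern[j]]          # update the restart state
    return dfa


def dfa_search(text: str, pattern: str, dfa: list[dict[str, int]]) -> int:
    """Return the index of the first occurrence of pattern, or -1."""
    state, m = 0, len(pattern)
    for i, ch in enumerate(text):
        state = dfa[state].get(ch, 0)   # characters outside the alphabet reset
        if state == m:
            return i - m + 1
    return -1


dfa = build_dfa("ABABAC", "ABC")
print(sum(len(row) for row in dfa))                  # 18 entries = R * M = 3 * 6
print(dfa_search("AABACAABABACAA", "ABABAC", dfa))   # 6
```

The upside of the automaton is that the search loop does exactly one table lookup per text character, with no inner fallback loop. The 1-D table gives the same O(N) search bound in amortized terms while using far less memory on large alphabets.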

Is there a way to optimize KMP algorithm to involve the character we are comparing?

I noticed that the KMP algorithm simply consults the failure function when it finds a mismatch between Haystack[a] and Needle[b], but it never looks at what Needle[b] is. Is it possible to make KMP faster by taking Needle[b] into account?
You could in principle update KMP so that the failure table looks at the input character that caused the mismatch. However, there's a question of how much it would cost you to do this and how much benefit you'd get by doing so.
One of the advantages of KMP is that the preprocessing step takes time O(n), where n is the length of the pattern string. There's no dependence on the size of the alphabet the string is drawn from (e.g. Unicode vs. ASCII). Building a table that associates each possible character with what to do in each circumstance would make the preprocessing time depend on the alphabet size as well.
That being said, there are other algorithms that do look at the character that caused the mismatch. The Boyer-Moore algorithm, for example, decides where to search next based on which haystack character caused the mismatch (its bad-character rule), and it can run much faster than KMP as a result. You may want to look into that algorithm if you're curious how this works in practice.
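As a rough illustration of using the mismatched character, here is a sketch of a search that applies only Boyer-Moore's bad-character rule. This is a simplification of the full algorithm, not the full algorithm itself, and the name bad_character_search is my own. The table is a dictionary keyed by the characters that actually occur in the needle, so its size does not depend on the alphabet.

```python
def bad_character_search(haystack: str, needle: str) -> int:
    """Return the index of the first occurrence of needle, or -1.
    Uses only the bad-character rule, so the worst case is still O(n * m)."""
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    # Rightmost index of each character in the needle (later indices overwrite).
    last = {c: i for i, c in enumerate(needle)}
    pos = 0
    while pos <= n - m:
        j = m - 1
        # Compare right to left, as Boyer-Moore does.
        while j >= 0 and haystack[pos + j] == needle[j]:
            j -= 1
        if j < 0:
            return pos
        bad = haystack[pos + j]  # the text character that caused the mismatch
        # Align the rightmost occurrence of `bad` in the needle with it,
        # or jump past it entirely if it never occurs; always move at least 1.
        pos += max(1, j - last.get(bad, -1))
    return -1


print(bad_character_search("here is a simple example", "example"))
```

When the mismatched text character does not occur in the needle at all, the window jumps past it in one step, which is why this family of algorithms can skip large parts of the haystack without reading them.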

Fast substring in scala

According to "Time complexity of Java's substring()", Java's substring takes linear time.
Is there a faster way (maybe in some cases)?
I could suggest an iterator, but I suspect that it also takes O(n).
val s1: String = s.iterator.drop(5).mkString
But several operations on an iterator would be faster than the same operations on a string, right?
If you need to edit a very long string, consider using a data structure called a Rope.
The Scalaz library has a Cord class, which is an implementation of a modified version of Rope:
"A Cord is a purely functional data structure for efficiently storing and manipulating Strings that are potentially very long. Very similar to Rope[Char], but with better constant factors and a simpler interface since it's specialized for Strings."
As Strings are - according to the linked question - always backed by a unique character array, substring can't be faster than O(n). You need to copy the character data.
As for alternatives: there will at least be one operation which is O(n). In your example, that's mkString which collects the characters in the iterator and builds a string from them.
However, I wouldn't worry about that too much. The fact that you're using a high level language means (should mean) that developer time is more valuable than CPU time for your particular task. substring is also the canonical way to ... take a substring, so using it makes your program more readable.
EDIT: I also like this sentence (from this answer) a lot: "O(n) is O(1) if n does not grow large." What I take away from this is: you shouldn't write inefficient code, but asymptotic efficiency is not the same as real-world efficiency.

When to use Rabin-Karp or KMP algorithms?

I have generated a string using the alphabet {A,C,G,T}. My string contains more than 10000 characters, and I'm searching for the following patterns in it.
ATGGA
TGGAC
CCGT
I have been asked to use a string matching algorithm which has O(m+n) running time.
m = pattern length
n = text length
Both KMP and Rabin-Karp algorithms have this running time. What is the most suitable algorithm (between Rabin-Karp and KMP) in this situation?
When you want to search for multiple patterns, the usual choice is Aho-Corasick, which is essentially a generalization of KMP to a set of patterns. In your case you are only searching for 3 patterns, so running KMP once per pattern may not be that much slower (at most three times), but Aho-Corasick is the general approach.
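As a minimal sketch of the Aho-Corasick idea (a trie of the patterns plus KMP-style failure links), assuming non-empty patterns, something like the following would find all three patterns in a single pass. The names build_automaton and search_all are my own.

```python
from collections import deque


def build_automaton(patterns):
    """Build the trie (goto), failure links (fail) and outputs (out)."""
    goto, fail, out = [{}], [0], [[]]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(p)
    # Breadth-first traversal so shallower states are finished first.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]  # inherit matches ending here
    return goto, fail, out


def search_all(text, patterns):
    """Return (start_index, pattern) for every occurrence of every pattern."""
    goto, fail, out = build_automaton(patterns)
    hits, s = [], 0
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for p in out[s]:
            hits.append((i - len(p) + 1, p))
    return hits


print(search_all("CCGTATGGACC", ["ATGGA", "TGGAC", "CCGT"]))
```

The text is scanned once regardless of how many patterns there are, which is the main advantage over running a single-pattern search once per pattern.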
Rabin-Karp is easier to implement if we assume that a hash collision will never happen, but if your problem is a typical string search, KMP will be more predictable no matter what input you have. However, Rabin-Karp has many other applications where KMP is not an option.
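For reference, here is a small Rabin-Karp sketch that does not assume collisions never happen: it verifies every hash hit with a direct comparison. The base and modulus are arbitrary choices of mine, and the function name rabin_karp is hypothetical. The naming follows the question (m = pattern length, n = text length).

```python
def rabin_karp(text: str, pattern: str, base: int = 256,
               mod: int = 1_000_000_007) -> list[int]:
    """Return the start index of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)  # weight of the window's leading character
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    matches = []
    for i in range(n - m + 1):
        # Verify on a hash hit, so a collision can cost time but not correctness.
        if t_hash == p_hash and text[i:i + m] == pattern:
            matches.append(i)
        if i < n - m:
            # Roll the window: drop text[i], append text[i + m].
            t_hash = ((t_hash - ord(text[i]) * high) * base
                      + ord(text[i + m])) % mod
    return matches


print(rabin_karp("CCGTATGGACC", "TGGAC"))
print(rabin_karp("CCGTATGGACC", "TGGAC", mod=7))  # tiny modulus, many collisions
```

With verification the result is always correct, but an adversarial input that collides at every position makes each window cost O(m), which is the Θ(mn) bad case. Skipping the verification keeps the running time linear at the price of possible false positives.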

O(n^2) (or O(n^2 lg n)?) algorithm to calculate the longest common subsequence (LCS) of two 'ring' strings

This problem appeared in today's Pacific NW Region Programming Contest, and no one solved it during the contest. It is problem B, and the complete problem set is here: http://www.acmicpc-pacnw.org/icpc-statements-2011.zip. There is a well-known O(n^2) algorithm for the LCS of two strings using dynamic programming. But when these strings are extended to rings, I have no idea...
P.S. Note that it is a subsequence rather than a substring, so the elements do not need to be adjacent to each other.
P.S. It might not be O(n^2) but O(n^2 lg n), or something else that can give the result in 5 seconds on a common computer.
Searching the web, this appears to be covered by section 4.3 of the paper "Incremental String Comparison", by Landau, Myers, and Schmidt at cost O(ne) < O(n^2), where I think e is the edit distance. This paper also references a previous paper by Maes giving cost O(mn log m) with more general edit costs - "On a cyclic string to string correcting problem". Expecting a contestant to reproduce either of these papers seems pretty demanding to me - but as far as I can see the question does ask for the longest common subsequence on cyclic strings.
You can double the first and second string and then use the ordinary method, and later wrap the positions around.
It is a good idea to "double" the strings and apply the standard dynamic programming algorithm. The problem is that, to get the optimal cyclic LCS, one then has to start the algorithm from multiple initial conditions. Just one initial condition (e.g. setting all Lij variables to 0 at the boundaries) will not do in general. In practice it turns out that O(N) initial states are needed (they span a diagonal), so one gets back to an O(N^3) algorithm.
However, the approach does have some virtue, as it can be used to design efficient O(N^2) heuristics (not exact but nearly exact) for CLCS.
I do not know if a true O(N^2) algorithm exists, and would be very interested if someone knows one.
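The exact O(N^3) baseline described above can be sketched as follows. It doubles one string and runs the ordinary LCS dynamic program once per starting offset. It relies on the observation that it is enough to rotate only one of the two rings (any cyclic alignment can be re-cut at position 0 of the other ring), and the function names are my own.

```python
def lcs_length(a: str, b: str) -> int:
    """Standard O(len(a) * len(b)) LCS length, keeping one row at a time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def cyclic_lcs_length(a: str, b: str) -> int:
    """LCS length of two ring strings: try every rotation of a against b.
    N rotations times an O(N^2) dynamic program gives O(N^3) overall."""
    if not a or not b:
        return 0
    doubled = a + a  # every rotation of a is a length-len(a) window of this
    return max(lcs_length(doubled[i:i + len(a)], b) for i in range(len(a)))


print(lcs_length("ABC", "CAB"), cyclic_lcs_length("ABC", "CAB"))
```

This is far too slow for large inputs, but it is handy as a reference implementation for checking a faster solution on small random cases.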
The CLCS problem has quite interesting "periodicity" properties: the length of a CLCS of p-times repeated strings is p times the CLCS length of the strings. This can be proved by adopting a geometric view of the problem.
Also, the problem has some additional benefits: it can be shown that if Lc(N) denotes the average CLCS length of two random strings of length N, then |Lc(N) - CN| is O(sqrt(N)), where C is the Chvatal-Sankoff constant. For the average length L(N) of the standard LCS, the only rate result of which I know says that |L(N) - CN| is O(sqrt(N log N)). There could be a nice way to compare Lc(N) with L(N), but I don't know it.
Another question: it is clear that the CLCS length is not superadditive contrary to the LCS length. By this I mean it is not true that CLCS(X1X2,Y1Y2) is always greater than CLCS(X1,Y1)+CLCS(X2,Y2) (it is very easy to find counter examples with a computer).
But it seems possible that the averaged length Lc(N) is superadditive (Lc(N1+N2) greater than Lc(N1)+Lc(N2)) - though if there is a proof I don't know it.
One modest interest in this question is that the values Lc(N)/N for the first few values of N would then provide good bounds to the Chvatal-Sankoff constant (much better than L(N)/N).
As a followup to mcdowella's answer, I'd like to point out that the O(n^2 lg n) solution presented in Maes' paper is the intended solution to the contest problem (check http://www.acmicpc-pacnw.org/ProblemSet/2011/solutions.zip). The O(ne) solution in Landau et al's paper does NOT apply to this problem, as that paper is targeted at edit distance, not LCS. In particular, the solution to cyclic edit distance only applies if the edit operations (add, delete, replace) all have unit (1, 1, 1) cost. LCS, on the other hand, is equivalent to edit distances with (add, delete, replace) costs (1, 1, 2). These are not equivalent to each other; for example, consider the input strings "ABC" and "CXY" (for the acyclic case; you can construct cyclic counterexamples similarly). The LCS of the two strings is "C", but the minimum unit-cost edit is to replace each character in turn.
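The "ABC" / "CXY" example can be checked with a short sketch. The function names are my own, and edit_distance is the ordinary dynamic program with configurable costs. With costs (1, 1, 2) the distance equals len(a) + len(b) - 2 * LCS(a, b), while the unit-cost distance can be smaller.

```python
def edit_distance(a: str, b: str, ins: int = 1, dele: int = 1, sub: int = 1) -> int:
    """Weighted edit distance from a to b (add, delete, replace costs)."""
    prev = [j * ins for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        cur = [i * dele]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + dele,                       # delete x
                           cur[j - 1] + ins,                     # insert y
                           prev[j - 1] + (0 if x == y else sub)))  # keep/replace
        prev = cur
    return prev[-1]


def lcs_length(a: str, b: str) -> int:
    """Standard LCS length via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


a, b = "ABC", "CXY"
print(lcs_length(a, b))               # 1: the LCS is "C"
print(edit_distance(a, b))            # 3: replace each character in turn
print(edit_distance(a, b, sub=2))     # 4 = 3 + 3 - 2 * 1
```

The unit-cost optimum uses three replacements and keeps no characters in common, so it says nothing about the LCS. The (1, 1, 2) optimum keeps the shared "C" and recovers the LCS length.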
At 110 lines but no complex data structures, Maes' solution falls towards the upper end of what is reasonable to implement in a contest setting. Even if Landau et al's solution could be adapted to handle cyclic LCS, the complexity of the data structure makes it infeasible in a contest setting.
Last but not least, I'd like to point out that an O(n^2) solution DOES exist for CLCS, described here: http://arxiv.org/abs/1208.0396 At 60 lines, no complex data structures, and only 2 arrays, this solution is quite reasonable to implement in a contest setting. Arriving at the solution might be a different matter, though.

Resources