When to use Rabin-Karp or KMP algorithms?

I have generated a string over the alphabet {A, C, G, T}, and it contains more than 10000 characters. I'm searching for the following patterns in it:
ATGGA
TGGAC
CCGT
I have been asked to use a string matching algorithm with O(m + n) running time, where:
m = pattern length
n = text length
Both the KMP and Rabin-Karp algorithms have this running time. Which of the two (Rabin-Karp or KMP) is the most suitable in this situation?

When you want to search for multiple patterns, the usual choice is Aho-Corasick, which is in a sense a generalization of KMP. In your case you are only searching for 3 patterns, so KMP may not be that much slower (at most three times as slow), but Aho-Corasick is the general approach.
Rabin-Karp is easier to implement if you assume a hash collision will never happen, but for a typical string-search problem like this one, KMP is more stable no matter what input you have. Rabin-Karp does have many other applications (for example, fingerprinting many substrings with a rolling hash) where KMP is not an option.
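For illustration, here is a minimal Aho-Corasick sketch in Python (the function names build_automaton and search are my own, not from any library) that finds all occurrences of the three patterns in one left-to-right pass over the text:

    from collections import deque

    def build_automaton(patterns):
        """Build a small Aho-Corasick automaton: a trie plus failure links."""
        goto = [{}]   # goto[state][char] -> next state
        fail = [0]    # failure link of each state
        out = [[]]    # patterns that end at each state

        # Insert every pattern into the trie.
        for pat in patterns:
            state = 0
            for ch in pat:
                if ch not in goto[state]:
                    goto.append({})
                    fail.append(0)
                    out.append([])
                    goto[state][ch] = len(goto) - 1
                state = goto[state][ch]
            out[state].append(pat)

        # Breadth-first pass to fill in failure links and merge outputs.
        queue = deque(goto[0].values())
        while queue:
            state = queue.popleft()
            for ch, nxt in goto[state].items():
                queue.append(nxt)
                f = fail[state]
                while f and ch not in goto[f]:
                    f = fail[f]
                fail[nxt] = goto[f].get(ch, 0)
                out[nxt] += out[fail[nxt]]
        return goto, fail, out

    def search(text, patterns):
        """Report (start_index, pattern) for every match of every pattern."""
        goto, fail, out = build_automaton(patterns)
        state, hits = 0, []
        for i, ch in enumerate(text):
            while state and ch not in goto[state]:
                state = fail[state]
            state = goto[state].get(ch, 0)
            for pat in out[state]:
                hits.append((i - len(pat) + 1, pat))
        return hits

    # Tiny stand-in for the 10000+ character DNA string.
    print(search("ATGGACCGTATGGA", ["ATGGA", "TGGAC", "CCGT"]))
    # -> [(0, 'ATGGA'), (1, 'TGGAC'), (5, 'CCGT'), (9, 'ATGGA')]

With only three short patterns, running a plain KMP scan per pattern would also be fine, as noted above; Aho-Corasick mainly pays off as the number of patterns grows.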

Related

Are there multiple KMP algorithmic approaches with different space complexities? What is the difference?

I am reading about the KMP substring search algorithm, and the examples I find online use a one-dimensional table to hold the prefix (failure) information.
I also read Sedgewick's explanation: he uses a 2-D array to build the table and explicitly states that the space complexity of KMP is O(RM), where R is the alphabet size and M the pattern length, while everywhere else the space complexity is stated as just O(M + N), i.e. the pattern-sized table plus the text to process.
So I am confused about the difference. Are there multiple KMP algorithmic approaches? Do they have different scope? Or what am I missing?
Why is the 2-D table needed if a 1-D table can solve the substring problem too?
I guess Sedgewick wanted to demonstrate a variant of KMP that constructs a deterministic finite automaton in the standard sense of that term. It's a weird choice that (as you observe) bloats the space, and the preprocessing time, to O(RM), but maybe there was a compelling pedagogical reason that I don't appreciate (then again, my PhD was on algorithms, so...). I'd find another description that follows the original more closely.
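To make the difference concrete, here is a short sketch (in Python, my own function names, assuming the DNA alphabet from the question above) of the DFA-style construction Sedgewick describes. The dfa table has one entry per (alphabet symbol, pattern position) pair, which is exactly where the O(RM) space comes from; the classic failure-function version stores only M integers.

    def build_kmp_dfa(pattern, alphabet="ACGT"):
        """Sedgewick-style KMP: dfa[c][j] is the next state after reading
        character c in state j. The table has R * M entries."""
        m = len(pattern)
        dfa = {c: [0] * m for c in alphabet}
        dfa[pattern[0]][0] = 1
        x = 0                                # restart state; stands in for the failure chain
        for j in range(1, m):
            for c in alphabet:
                dfa[c][j] = dfa[c][x]        # copy mismatch transitions from the restart state
            dfa[pattern[j]][j] = j + 1       # match transition advances one state
            x = dfa[pattern[j]][x]           # advance the restart state
        return dfa

    def dfa_search(text, pattern, alphabet="ACGT"):
        """Return the index of the first occurrence of pattern in text, or -1."""
        dfa = build_kmp_dfa(pattern, alphabet)
        j = 0
        for i, c in enumerate(text):
            j = dfa[c][j]
            if j == len(pattern):
                return i - len(pattern) + 1
        return -1

    print(dfa_search("ATGGACCGT", "TGGAC"))   # -> 1

The DFA version does exactly one table lookup per text character, while the 1-D version may follow a few failure links per character; both are O(N) matching overall, so the extra space only buys a constant factor.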

Is there a way to optimize the KMP algorithm to involve the character we are comparing?

I noticed that the KMP algorithm simply consults the failure function when it finds a mismatch at Haystack[a] and Needle[b], but it never looks at what Needle[b] is. Is it possible to make KMP faster if we considered what Needle[b] is?
You could in principle update KMP so that the failure table looks at the input character that caused the mismatch. However, there's a question of how much it would cost you to do this and how much benefit you'd get by doing so.
One of the advantages of KMP is that the preprocessing step takes time O(n), where n is the length of the pattern string. There's no dependence on the size of the alphabet the strings are drawn from (Unicode, ASCII, etc.). Building a table that associates each possible character with what to do at each mismatch position would add an alphabet-size factor to the cost (and size) of that preprocessing step.
That being said, there are other algorithms that do look at the mismatched character in the haystack. The Boyer-Moore algorithm, for example, decides how far to shift based on which character it found there, and it can run much faster than KMP as a result. You may want to look into that algorithm if you're curious how this works in practice.
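As a concrete illustration, here is a sketch of the Horspool simplification of Boyer-Moore in Python (my own code and names, not from a library): the shift is chosen purely from the text character aligned with the last pattern position, which is the kind of character-driven decision described above.

    def horspool_search(text, pattern):
        """Boyer-Moore-Horspool: the shift distance depends on the text character
        aligned with the end of the pattern. Returns the first match index or -1."""
        m, n = len(pattern), len(text)
        if m == 0 or m > n:
            return -1
        # Bad-character table: shift distance when text[i + m - 1] is this character.
        shift = {pattern[j]: m - 1 - j for j in range(m - 1)}
        i = 0
        while i <= n - m:
            if text[i:i + m] == pattern:
                return i
            i += shift.get(text[i + m - 1], m)   # unseen characters allow a full-length shift
        return -1

    print(horspool_search("ATGGACCGT", "CCGT"))   # -> 5

The price is that Horspool's worst case degrades to O(mn); it is the average-case behaviour on ordinary text that makes it fast.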

When is it good to use the KMP algorithm?

I understand that the KMP algorithm depends on a helper array that records which prefixes of the pattern also appear as suffixes.
It won't be efficient when that condition is not fulfilled, i.e. when the helper array contains all zeroes.
Would the runtime still be O(m + n)?
If I am right, what is a better substring algorithm in this case?
To understand when KMP is a good algorithm to use, it's often helpful to ask the question "what's the alternative?"
KMP has the nice advantage that it is guaranteed worst-case efficient. The preprocessing time is always O(n), and the searching time is always O(m). There are no worst-case inputs, no probability of getting unlucky, etc. In cases where you are searching for very long strings (large n) inside really huge strings (large m), this may be highly desirable compared to other algorithms like the naive one (which can take time Θ(mn) in bad cases), Rabin-Karp (pathological inputs can take time Θ(mn)), or Boyer-Moore (whose worst case can be Θ(mn)). You're right that the helper array doesn't buy much when the pattern has few self-overlaps, but even then the runtime stays O(m + n); the fact that you never need to worry about a bad case is definitely a nice thing to have!
KMP also has the nice property that the processing can be done a single time. If you know you're going to search for the same substring lots and lots of times, you can do the O(n) preprocessing work once and then have the ability to search in any length-m string you'd like in time O(m).
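A minimal sketch of that "preprocess once, search many texts" pattern in Python (my own function names, not the asker's code). Note that even when the prefix table is all zeroes, each scan is still O(m), because the text pointer only moves forward:

    def prefix_table(pattern):
        """Classic KMP helper array: fail[i] is the length of the longest proper
        prefix of pattern[:i + 1] that is also a suffix of it. Built once in O(n)."""
        fail = [0] * len(pattern)
        k = 0
        for i in range(1, len(pattern)):
            while k > 0 and pattern[i] != pattern[k]:
                k = fail[k - 1]
            if pattern[i] == pattern[k]:
                k += 1
            fail[i] = k
        return fail

    def kmp_search(text, pattern, fail):
        """Scan one text in O(m) using a previously built prefix table."""
        k = 0
        for i, ch in enumerate(text):
            while k > 0 and ch != pattern[k]:
                k = fail[k - 1]
            if ch == pattern[k]:
                k += 1
            if k == len(pattern):
                return i - len(pattern) + 1   # index of the first match
        return -1

    pattern = "ATGGA"
    fail = prefix_table(pattern)              # one-time O(n) preprocessing
    for text in ("CCATGGAC", "TTTT", "GATGGA"):
        print(kmp_search(text, pattern, fail))   # -> 2, -1, 1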

Boyer-Moore string search algorithm run time complexity

In the Boyer-Moore string search algorithm wiki link, it is stated that the worst-case complexity of Boyer-Moore is
O(m+n) if pattern does not appear in the text
O(mn) if pattern does appear in the text
But in the String Search Algorithm wiki, it is stated that the worst-case complexity of Boyer-Moore is O(n). Why this disparity?
Here, too, it is stated to be O(mn) in the worst case.
So what is the correct run-time complexity of the Boyer-Moore algorithm?
The difference comes from different definitions. The general string-search page splits the complexity into preprocessing and matching, whereas the page for the algorithm itself doesn't make that distinction.
The preprocessing is Θ(m + k), where k is the alphabet size, plus O(n) for matching.

Is it possible to use the KMP algorithm to find a longest substring?

Suppose I have a pattern P and some text T, and I want to find the largest prefix of P that matches a substring of T. Is it possible to modify the KMP algorithm to do such an operation? (If I remember correctly, the KMP algorithm does partial matches, but I am interested in the longest possible match).
As KMP scans the text, its state is the length of the longest prefix of the pattern that matches the text ending at the current position. So you can record the maximum length seen and the position in the text at which it was seen, and use that to find a longest matching prefix of P.
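A sketch of that modification in Python (hypothetical names; the prefix table is the standard one from the earlier sketch). The variable k below is exactly the KMP state, i.e. the length of the longest pattern prefix matching the text ending at position i:

    def prefix_table(pattern):
        # Standard KMP prefix/failure table, same as in the earlier sketch.
        fail = [0] * len(pattern)
        k = 0
        for i in range(1, len(pattern)):
            while k > 0 and pattern[i] != pattern[k]:
                k = fail[k - 1]
            if pattern[i] == pattern[k]:
                k += 1
            fail[i] = k
        return fail

    def longest_prefix_match(text, pattern):
        """Return (length, end_index): the longest prefix of pattern occurring
        in text, found by tracking the maximum KMP state seen during the scan."""
        fail = prefix_table(pattern)
        k = best_len = best_end = 0
        for i, ch in enumerate(text):
            while k > 0 and ch != pattern[k]:
                k = fail[k - 1]
            if ch == pattern[k]:
                k += 1
            if k > best_len:
                best_len, best_end = k, i
            if k == len(pattern):
                break                     # the whole pattern matched; can't do better
        return best_len, best_end

    print(longest_prefix_match("CCATGGTT", "ATGGA"))   # -> (4, 5): "ATGG" ends at index 5

This keeps the O(m + n) behaviour of plain KMP; only the bookkeeping of the maximum changes.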
Another way of doing this would be to put all prefixes of P into Aho-Corasick. The run-time behaviour would be very similar, although it would consume a little more memory. It would let you use an existing Aho-Corasick library, if you have one, instead of modifying a KMP implementation.
Actually, this is a typical use case for the so-called "extended KMP".
See the sample code here and here.

Resources