Boyer-Moore string search algorithm run time complexity

The Wikipedia article on the Boyer-Moore string search algorithm states that the worst-case complexity of Boyer-Moore is
O(m+n) if the pattern does not appear in the text
O(mn) if the pattern does appear in the text
But the general String Search Algorithm wiki page states that the worst-case complexity of Boyer-Moore is O(n). Why is there this disparity?
Here, too, it is stated to be O(mn) in the worst case.
So what is the correct run time complexity of the Boyer-Moore algorithm?

The difference comes from different definitions. The general string searching page splits each algorithm's complexity into preprocessing and matching, whereas the page for the algorithm itself doesn't make that distinction.
The preprocessing takes Θ(m + k), where m is the pattern length and k is the alphabet size, plus O(n) for matching, where n is the text length.
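
To make the Θ(m + k) bound concrete, here is a minimal Python sketch of a bad-character shift table (in the simplified Horspool form, not the full Boyer-Moore preprocessing, which also builds a good-suffix table):

```python
def bad_character_table(pattern, alphabet):
    """Bad-character shift table in the simplified (Horspool-style)
    form: for each symbol, how far the pattern may shift after a
    mismatch. Building it is Theta(m + k): one pass over the k
    alphabet symbols, one pass over the m pattern characters."""
    m = len(pattern)
    table = {c: m for c in alphabet}      # symbols absent from the
                                          # pattern shift by m: O(k)
    for i, c in enumerate(pattern[:-1]):  # align last occurrence: O(m)
        table[c] = m - 1 - i
    return table

# Preprocessing only; the O(n) matching phase consults this table
# after each mismatch.
print(bad_character_table("ABCAB", "ABCD"))
# {'A': 1, 'B': 3, 'C': 2, 'D': 5}
```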

Related

Are there multiple KMP algorithmic approaches with different space complexities? What is the difference?

I am reading about the KMP substring search algorithm, and the examples I find online use a one-dimensional array for the prefix (failure) table.
I also read Sedgewick's explanation: he uses a 2-D array to build the table and explicitly states that the space complexity of KMP is O(RM), where R is the alphabet size and M the pattern length, while everywhere else it is stated that the space complexity is just O(M + N), i.e. the pattern plus the text to process.
So I am confused about the difference. Are there multiple KMP algorithmic approaches? Do they have different scope? Or what am I missing?
Why is the 2-D table needed if a 1-D one can solve the substring problem too?
I guess Sedgewick wanted to demonstrate a variant of KMP that constructs a deterministic finite automaton in the standard sense of that term. It's a weird choice that (as you observe) bloats the space and preprocessing cost, but maybe there was a compelling pedagogical reason that I don't appreciate (then again, my PhD was on algorithms, so...). I'd find another description that follows the original more closely.
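To make the contrast concrete, here is a sketch of both constructions (my illustration, not Sedgewick's code). The 1-D failure table costs O(M) time and space regardless of the alphabet; the DFA costs O(RM) because it precomputes a transition for every alphabet character in every state:

```python
def failure_table(pattern):
    """Classic 1-D KMP prefix table: O(M) time and space,
    independent of the alphabet size R."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_dfa(pattern, alphabet):
    """2-D DFA table in the style Sedgewick describes: dfa[c][j] is
    the state reached on character c from state j. O(R*M) time and
    space. Assumes every pattern character occurs in alphabet."""
    m = len(pattern)
    dfa = {c: [0] * m for c in alphabet}
    dfa[pattern[0]][0] = 1
    x = 0                              # restart state
    for j in range(1, m):
        for c in alphabet:
            dfa[c][j] = dfa[c][x]      # copy mismatch transitions
        dfa[pattern[j]][j] = j + 1     # set the match transition
        x = dfa[pattern[j]][x]         # advance the restart state
    return dfa


print(failure_table("ABABAC"))         # [0, 0, 1, 2, 3, 0]
```

With the DFA, searching is a single table lookup per text character; the 1-D table achieves the same O(N) search bound with a small fallback loop instead, which is why it is the version usually presented.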

Is there a way to optimize the KMP algorithm to involve the character we are comparing?

I noticed that the KMP algorithm simply consults the failure function when it finds a mismatch between Haystack[a] and Needle[b], but it never looks at what Needle[b] is. Is it possible to make KMP faster if we considered what Needle[b] is?
You could in principle modify KMP so that the failure table takes into account the input character that caused the mismatch. However, there's a question of how much it would cost you to do this and how much benefit you'd get by doing so.
One of the advantages of KMP is that the preprocessing step takes time O(n), where n is the length of the pattern string. There's no dependence on the size of the alphabet the strings are drawn from (Unicode, ASCII, etc.). Building a table associating each possible character with what to do in each circumstance would negatively impact the runtime of that preprocessing step.
That being said, there are other algorithms that do look at the mismatched character. The Boyer-Moore algorithm, for example, decides where to search next based on which mismatched character it finds, and it can run much faster than KMP as a result. You may want to look into that algorithm if you're curious how this works in practice.
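To see how Boyer-Moore uses the mismatched text character, here is a sketch of the bad-character rule in isolation (this is essentially the simplified Boyer-Moore-Horspool variant; the full algorithm adds a good-suffix rule):

```python
def bm_bad_char_search(text, pattern):
    """Bad-character rule in isolation (closer to Boyer-Moore-
    Horspool than to full Boyer-Moore). The shift after each failed
    alignment depends on the text character under the pattern's
    last position."""
    n, m = len(text), len(pattern)
    shift = {}                         # last-occurrence shift table
    for i, c in enumerate(pattern[:-1]):
        shift[c] = m - 1 - i
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            return pos
        # The shift is chosen by looking at what the text character
        # actually is -- exactly what KMP's failure table never does.
        pos += shift.get(text[pos + m - 1], m)
    return -1

print(bm_bad_char_search("HERE IS A SIMPLE EXAMPLE", "EXAMPLE"))  # 17
```

This variant is still Θ(mn) in the worst case, but on typical text the character-driven shifts skip large chunks of the input, which is the speedup referred to above.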

When is it good to use the KMP algorithm?

I understand that the KMP algorithm depends on a helper array that records where prefixes of the pattern also appear as suffixes.
It won't be efficient when that condition is not fulfilled, i.e. when the helper array contains all zeroes.
Would the runtime still be O(m + n)?
If I am right, what is a better substring algorithm in this case?
To understand when KMP is a good algorithm to use, it's often helpful to ask the question "what's the alternative?"
KMP has the nice advantage that it is guaranteed worst-case efficient. The preprocessing time is always O(n), and the searching time is always O(m). There are no worst-case inputs, no probability of getting unlucky, etc. In cases where you are searching for very long strings (large n) inside of really huge strings (large m), this may be highly desirable compared to other algorithms like the naive one (which can take time Θ(mn) in bad cases), Rabin-Karp (pathological inputs can take time Θ(mn)), or Boyer-Moore (worst-case can be Θ(mn)). You're right that KMP might not be all that necessary in the case where there aren't many overlapping parts of the string, but the fact that you never need to worry about whether there's a bad case is definitely a nice thing to have!
KMP also has the nice property that the processing can be done a single time. If you know you're going to search for the same substring lots and lots of times, you can do the O(n) preprocessing work once and then have the ability to search in any length-m string you'd like in time O(m).
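A minimal sketch of that preprocess-once, search-many pattern (the class name and API are made up for illustration):

```python
class KMPMatcher:
    """Preprocess a pattern once in O(n), then search any number of
    texts, each in O(m). The class name and API are made up for
    this sketch."""

    def __init__(self, pattern):
        self.pattern = pattern
        fail = [0] * len(pattern)      # standard prefix/failure table
        k = 0
        for i in range(1, len(pattern)):
            while k > 0 and pattern[i] != pattern[k]:
                k = fail[k - 1]
            if pattern[i] == pattern[k]:
                k += 1
            fail[i] = k
        self.fail = fail

    def search(self, text):
        """Return the index of the first occurrence, or -1."""
        k = 0
        for i, c in enumerate(text):
            while k > 0 and c != self.pattern[k]:
                k = self.fail[k - 1]
            if c == self.pattern[k]:
                k += 1
            if k == len(self.pattern):
                return i - k + 1
        return -1

# One O(n) preprocessing pass, then many O(m) searches:
matcher = KMPMatcher("needle")
for text in ["no match here", "a needle in a haystack"]:
    print(matcher.search(text))        # -1, then 2
```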

When to use Rabin-Karp or KMP algorithms?

I have generated a string over the alphabet {A,C,G,T}, and my string contains more than 10000 characters. I'm searching for the following patterns in it:
ATGGA
TGGAC
CCGT
I have been asked to use a string matching algorithm which has O(m+n) running time, where
m = pattern length
n = text length
Both the KMP and Rabin-Karp algorithms have this running time (for Rabin-Karp, in expectation). What is the most suitable algorithm (between Rabin-Karp and KMP) in this situation?
When you want to search for multiple patterns, typically the correct choice is Aho-Corasick, which is in some sense a generalization of KMP. In your case you are only searching for 3 patterns, so it may be that KMP is not much slower (at most three times), but Aho-Corasick is the general approach.
Rabin-Karp is easier to implement if we assume a collision will never happen, but if your problem is typical string searching, KMP will be more stable no matter what input you have. However, Rabin-Karp has many other applications where KMP is not an option.
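For reference, a sketch of Rabin-Karp that handles the collision concern by verifying every hash hit (the base and modulus are illustrative choices):

```python
def rabin_karp(text, pattern, base=256, mod=1_000_003):
    """Rabin-Karp with a rolling hash. Hash hits are verified by a
    direct comparison, so collisions can cost time but never cause
    a false match. base and mod are illustrative choices."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return -1
    high = pow(base, m - 1, mod)       # weight of the outgoing char
    ph = th = 0
    for i in range(m):                 # initial hashes: O(m)
        ph = (ph * base + ord(pattern[i])) % mod
        th = (th * base + ord(text[i])) % mod
    for i in range(n - m + 1):
        if th == ph and text[i:i + m] == pattern:
            return i
        if i < n - m:                  # roll the window: O(1)
            th = ((th - ord(text[i]) * high) * base
                  + ord(text[i + m])) % mod
    return -1

# Each pattern searched independently, O(m + n) expected per pattern:
for p in ["ATGGA", "TGGAC", "CCGT"]:
    print(p, rabin_karp("ACCGTATGGACCA", p))  # 5, 6, 1
```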

Scan text and check whether it contains a word from a specified list

Requirement
Currently we have a list containing more than ten thousand keywords or sentences (call the number N).
The input is a long character string of length L.
Question: check whether the character string contains any keyword or sentence from the given list.
The question can be described as word filtering, as on Wikipedia, but I didn't find any algorithm on that page. The simplest way to solve this problem is to iterate over all the keywords or sentences and, for each one, check whether the long text contains it as a substring. Given the number of keywords and the length of the text, the performance is very bad: it takes O(NL) time (counting each containment check as O(L)).
It seems a better solution should finish in O(L). Could anyone give some suggestions about this?
There are several approaches to this problem with time complexity O(M + L), where L is the length of the string and M is the combined length of all patterns:
Aho–Corasick string matching algorithm (sketched below).
Construct a suffix tree for the string using Ukkonen's algorithm, then match each pattern against this suffix tree.
Construct a generalized suffix tree for the set of patterns, then match the input string against this suffix tree.
Construct a suffix array for the string, and use it to search for each pattern. This approach has time complexity O(M + L + N log L), where N is the number of patterns.
Commentz-Walter algorithm.
You can find details of all these algorithms (except the Commentz-Walter algorithm) in the book Algorithms on Strings, Trees and Sequences by Dan Gusfield.
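
Here is a compact sketch of the Aho–Corasick approach from the first item in the list (a dictionary-based trie with BFS-computed failure links; illustrative rather than production code):

```python
from collections import deque

def aho_corasick(text, patterns):
    """Build a trie of all patterns, add failure links by BFS, then
    scan the text once: O(M + L + z), where z is the number of
    matches reported."""
    goto, fail, out = [{}], [0], [[]]  # one entry per trie node
    for p in patterns:                 # build the trie: O(M)
        node = 0
        for c in p:
            if c not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][c] = len(goto) - 1
            node = goto[node][c]
        out[node].append(p)
    queue = deque(goto[0].values())    # BFS to set failure links
    while queue:
        node = queue.popleft()
        for c, child in goto[node].items():
            queue.append(child)
            f = fail[node]
            while f and c not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(c, 0)
            out[child] += out[fail[child]]
    matches, node = [], 0              # scan the text once: O(L)
    for i, c in enumerate(text):
        while node and c not in goto[node]:
            node = fail[node]
        node = goto[node].get(c, 0)
        for p in out[node]:
            matches.append((i - len(p) + 1, p))
    return matches

print(aho_corasick("ushers", ["he", "she", "his", "hers"]))
# [(1, 'she'), (2, 'he'), (2, 'hers')]
```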
Several different (simpler) approaches may be used if you can unambiguously extract separate words/sentences from the input string.
Prepare a Bloom filter with a size large enough to guarantee a low probability of false positives for N patterns. Add to the Bloom filter the bits determined by hash functions of the keywords/sentences. Then scan the string, extracting consecutive words/sentences, and check whether each of them can be found in the Bloom filter. Only if a word/sentence is present in the Bloom filter, search for it in the list of patterns. This algorithm has expected time complexity O(M + L) and is space-efficient.
Insert all patterns into a hash set. Then scan the string, extracting consecutive words/sentences, and check if any of them is in the hash set (see the sketch below). This has the same expected time complexity O(M + L) and is simpler than the Bloom filter approach, but is not space-efficient.
Insert all patterns into a radix tree (trie). Then use it to search for words/sentences from the string. This is not very different from the generalized suffix tree approach, but is simpler and performs better. It has worst-case time complexity O(M + L), is probably somewhat slower than the Bloom filter or hash set approaches, and may be made very space-efficient if necessary.
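
A minimal sketch of the hash set approach, assuming words can be extracted unambiguously by splitting on whitespace:

```python
def contains_keyword(text, keywords):
    """Expected O(M + L): build the set once (O(M)), then test each
    extracted word in expected O(1). Handles single-word keywords
    only; multi-word sentences would need a windowed check."""
    keyword_set = set(keywords)
    return any(word in keyword_set for word in text.split())

print(contains_keyword("the quick brown fox", ["wolf", "fox"]))  # True
```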
