How to find top 10 frequent substring from string databases [duplicate]

Is there any algorithm that can be used to find the most common phrases (or substrings) in a string? For example, the following string would have "hello world" as its most common two-word phrase:
"hello world this is hello world. hello world repeats three times in this string!"
In the string above, the most common substring (after the empty string, which can be said to occur infinitely often) would be the space character.
Is there any way to generate a list of common substrings in this string, from most common to least common?

This is a task similar to the Nussinov algorithm, and actually even simpler, as we do not allow any gaps, insertions or mismatches in the alignment.
For a string A of length N, define a table F[-1 .. N, -1 .. N] and fill it in using the following rules:
for i = 0 to N
    for j = 0 to N
        if i != j
        {
            if A[i] == A[j]
                F[i,j] = F[i-1,j-1] + 1;
            else
                F[i,j] = 0;
        }
For instance, for B A O B A B:
This runs in O(n^2) time. The largest values in the table now point to the end positions of the longest self-matching substrings (i is the end of one occurrence, j the end of another). The array is assumed to be zero-initialized at the start. I have added a condition to exclude the diagonal, which is the longest but probably not an interesting self-match.
Thinking about it more, this table is symmetric about the diagonal, so it is enough to compute only half of it. Also, the array is zero-initialized, so assigning zero is redundant. What remains is
for i = 0 to N
    for j = i + 1 to N
        if A[i] == A[j]
            F[i,j] = F[i-1,j-1] + 1;
Shorter, but potentially more difficult to understand. The computed table contains all matches, short and long. You can add further filtering as you need.
In the next step, you need to recover the strings by following the diagonal up and to the left from each non-zero cell. During this step it is also trivial to use a hashmap to count the number of self-matches for the same string. With a normal string and a reasonable minimum length, only a small number of table cells will go through this map.
I think that using a hashmap directly actually makes this O(n^3), as the key strings must be compared for equality on each access, and that comparison is probably O(n).
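The table fill and the recovery step described above can be sketched in Python. This is a minimal sketch rather than the answer's exact code: it assumes a 1-based table with an extra zero row and column standing in for index -1, and the names self_match_table and longest_repeat are illustrative.

```python
def self_match_table(a):
    """Build the upper half of the self-match table described above.

    f[i][j] (1-based, i < j) is the length of the longest common run
    ending at a[i-1] and a[j-1]; row/column 0 plays the role of index -1.
    """
    n = len(a)
    f = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):  # j > i skips the diagonal
            if a[i - 1] == a[j - 1]:
                f[i][j] = f[i - 1][j - 1] + 1
    return f


def longest_repeat(a):
    """Recover the longest substring that occurs at least twice."""
    f = self_match_table(a)
    best_len, best_end = 0, 0
    for i, row in enumerate(f):
        for value in row:
            if value > best_len:
                best_len, best_end = value, i
    # Walk back along the diagonal: the match ends at position best_end.
    return a[best_end - best_len:best_end]


print(longest_repeat("BAOBAB"))
```

Occurrences are allowed to overlap in this sketch, so "aaaa" reports "aaa".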

Python. This is somewhat quick and dirty, with the data structures doing most of the lifting.
from collections import Counter
accumulator = Counter()
text = 'hello world this is hello world.'
for length in range(1, len(text) + 1):
    # "+ 1" so that the substring ending at the last character is included
    for start in range(len(text) - length + 1):
        accumulator[text[start:start+length]] += 1
print(accumulator.most_common(10))  # the ten most frequent substrings
The Counter structure is a hash-backed dictionary designed for counting how many times you've seen something. Adding to a nonexistent key will create it, while retrieving a nonexistent key will give you zero instead of an error. So all you have to do is iterate over all the substrings.

This is just pseudocode, and maybe not the most beautiful solution, but I would solve it like this:
function separateWords(String incomingString) returns StringArray{
    //Code
}
function findMax(Map map) returns String{
    //Code
}
function mainAlgorithm(String incomingString) returns String{
    StringArray sArr = separateWords(incomingString);
    Map<String, Integer> map; //init with no content
    for(word: sArr){
        Integer count = map.get(word);
        if(count == null){
            map.put(word,1);
        } else {
            //remove if necessary
            map.put(word,count + 1); //count++ would store the old value
        }
    }
    return findMax(map);
}
Here map holds key/value pairs, like a Java HashMap.
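The word-counting outline above can be extended to multi-word phrases. The following is a minimal Python sketch under a few assumptions: words are runs of letters or apostrophes, case is ignored, and the name top_phrases and its parameters are illustrative.

```python
import re
from collections import Counter


def top_phrases(text, n_words=2, k=10):
    """Return the k most common phrases made of n_words consecutive words."""
    # Assumption: a "word" is a run of letters/apostrophes, compared in lower case.
    words = re.findall(r"[a-z']+", text.lower())
    phrases = (
        " ".join(words[i:i + n_words])
        for i in range(len(words) - n_words + 1)
    )
    return Counter(phrases).most_common(k)


sample = ("hello world this is hello world. "
          "hello world repeats three times in this string!")
print(top_phrases(sample, n_words=2, k=3))
```

Note that phrases here may span sentence boundaries, since punctuation is simply dropped.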

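Every occurrence of a substring of length >= 2 contains an occurrence of its first two characters, so no longer substring can be more frequent than the most frequent substring of length 2. Leaving single characters aside, we therefore only need to investigate substrings of length 2.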
val s = "hello world this is hello world. hello world repeats three times in this string!"
val li = s.sliding (2, 1).toList
// li: List[String] = List(he, el, ll, lo, "o ", " w", wo, or, rl, ld, "d ", " t", th, hi, is, "s ", " i", is, "s ", " h", he, el, ll, lo, "o ", " w", wo, or, rl, ld, d., ". ", " h", he, el, ll, lo, "o ", " w", wo, or, rl, ld, "d ", " r", re, ep, pe, ea, at, ts, "s ", " t", th, hr, re, ee, "e ", " t", ti, im, me, es, "s ", " i", in, "n ", " t", th, hi, is, "s ", " s", st, tr, ri, in, ng, g!)
val uniques = li.toSet
uniques.toList.map (u => li.count (_ == u))
// res18: List[Int] = List(1, 2, 1, 1, 3, 1, 5, 1, 1, 3, 1, 1, 3, 2, 1, 3, 1, 3, 2, 3, 1, 1, 1, 1, 1, 3, 1, 3, 3, 1, 3, 1, 1, 1, 3, 3, 2, 4, 1, 2, 2, 1)
uniques.toList(6) // position 6 holds the highest count (5) in the list above
// res19: String = "s "
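For comparison, the same sliding-window count can be sketched in Python with collections.Counter, which avoids looking up the winning index by hand. This is an illustrative sketch; the name top_windows is hypothetical.

```python
from collections import Counter


def top_windows(s, width=2, k=10):
    """Count every window of `width` characters and return the k most common."""
    windows = (s[i:i + width] for i in range(len(s) - width + 1))
    return Counter(windows).most_common(k)


s = ("hello world this is hello world. "
     "hello world repeats three times in this string!")
print(top_windows(s, width=2, k=10))
```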

Perl, O(n²) solution
my $str = "hello world this is hello world. hello world repeats three times in this string!";
my @words = split(/[^a-z]+/i, $str);
my ($display,$ix,$i,%ocur) = 10;
# calculate: count every run of consecutive words
for ($ix=0 ; $ix<=$#words ; $ix++) {
    for ($i=$ix ; $i<=$#words ; $i++) {
        $ocur{ join(':', @words[$ix .. $i]) }++;
    }
}
# display: highest count first, longer chains first on ties
foreach (sort { my $c = $ocur{$b} <=> $ocur{$a} ; $c ? $c : split(/:/,$b)-split(/:/,$a); } keys %ocur) {
    print "$_: $ocur{$_}\n";
    last if !--$display;
}
This displays the 10 highest-scoring word sequences (in case of a tie, the longest chain of words comes first). Change $display to 1 to get only the top result. There are n(n+1)/2 iterations.

Related

Find the maximum value of K such that sub-sequences A and B exist and should satisfy the mentioned conditions

Given a string S of length n. Choose an integer K and two non-empty sub-sequences A and B of length K such that it satisfies the following conditions:
A = B i.e. for each i the ith character in A is same as the ith character in B.
Let's denote the indices used to construct A as a1,a2,a3,...,an where ai belongs to S and B as b1,b2,b3,...,bn where bi belongs to S. If we denote the number of common indices in A and B by M then M + 1 <= K.
Find the maximum value of K such that it is possible to find the sub-sequences A and B which satisfies the above conditions.
Constraints:
0 < N <= 10^5
Things which I observed are:
K = 0 if all the characters in the given string are distinct, e.g. S = abcd.
K = length of S - 1 if all the characters in the string are the same, e.g. S = aaaa.
The value of M cannot be equal to K, because then M + 1 <= K would not hold, i.e. you cannot have sub-sequences A and B that satisfy A = B and a1 = b1, a2 = b2, a3 = b3, ..., an = bn.
If the string S is a palindrome then K = (total number of occurrences of the characters that appear more than once) - 1, e.g. for S = tenet, t is repeated 2 times and e is repeated 2 times, so the total is 4 and K = 4 - 1 = 3.
I am having trouble designing the algorithm to solve the above problem.
Let me know in the comments if you need more clarification.

(Update: see O(n) answer.)
We can modify the classic longest common subsequence recurrence to take an extra parameter.
JavaScript code (not memoised) that I hope is self explanatory:
function f(s, i, j, haveUncommon){
  if (i < 0 || j < 0)
    return haveUncommon ? 0 : -Infinity
  if (s[i] == s[j]){
    if (haveUncommon){
      return 1 + f(s, i-1, j-1, true)
    } else if (i == j){
      return Math.max(
        1 + f(s, i-1, j-1, false),
        f(s, i-1, j, false),
        f(s, i, j-1, false)
      )
    } else {
      return 1 + f(s, i-1, j-1, true)
    }
  }
  return Math.max(
    f(s, i-1, j, haveUncommon),
    f(s, i, j-1, haveUncommon)
  )
}
var s = "aabcde"
console.log(f(s, s.length-1, s.length-1, false))

I believe we are just looking for the closest equal pair of characters since the only characters excluded from A and B would be one of the characters in the pair and any characters in between.
Here's O(n) in JavaScript:
function f(s){
  let map = {}
  let best = -1
  for (let i=0; i<s.length; i++){
    if (!map.hasOwnProperty(s[i])){
      map[s[i]] = i
      continue
    }
    best = Math.max(best, s.length - i + map[s[i]])
    map[s[i]] = i
  }
  return best
}
var strs = [
  "aabcde", // 5
  "aaababcd", // 7
  "aebgaseb", // 4
  "aefttfea",
  // aeft fea
  "abcddbca",
  // abcd bca,
  "a" // -1
]
for (let s of strs)
  console.log(`${ s }: ${ f(s) }`)
O(n) solution in Python3:
def compute_maximum_k(word):
    last_occurences = {}
    max_k = -1
    for i in range(len(word)):
        if word[i] not in last_occurences:
            last_occurences[word[i]] = i
            continue
        max_k = max(max_k, (len(word) - i) + last_occurences[word[i]])
        last_occurences[word[i]] = i
    return max_k

def main():
    words = ["aabcde","aaababcd","aebgaseb","aefttfea","abcddbca","a","acbdaadbca"]
    for word in words:
        print(compute_maximum_k(word))

if __name__ == "__main__":
    main()

A solution for the maximum-length repeated substring would be the following:
After building a suffix array you can derive the LCP array. The maximum value in the LCP array corresponds to the K you are looking for. With suitable algorithms, both constructions take O(n) overall.
A suffix array sorts all suffixes of your string S in ascending order. The longest common prefix (LCP) array then holds the lengths of the longest common prefixes between all pairs of consecutive suffixes in the sorted suffix array. Thus the maximum value in this array is the length of the longest substring that occurs at least twice in S.
For a nice example using the word "banana", check out the Wikipedia page on the LCP array.
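The idea can be sketched in Python as follows. Note the assumption: this sketch builds the suffix array naively by sorting suffix slices and compares neighbours character by character, so it is far from the O(n) constructions mentioned above; it only illustrates how the maximum LCP yields the longest repeated substring. The function name is illustrative.

```python
def longest_repeated_via_suffix_array(s):
    """Longest substring occurring at least twice, via a naive suffix array."""
    # Naive suffix array: start indices sorted by the suffix they begin.
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    best = ""
    # The LCP of neighbouring suffixes in sorted order; keep the maximum.
    for a, b in zip(suffix_array, suffix_array[1:]):
        lcp = 0
        while a + lcp < len(s) and b + lcp < len(s) and s[a + lcp] == s[b + lcp]:
            lcp += 1
        if lcp > len(best):
            best = s[a:a + lcp]
    return best


print(longest_repeated_via_suffix_array("banana"))
```

For "banana" the sorted suffixes are a, ana, anana, banana, na, nana, and the largest neighbouring LCP is 3 (between ana and anana).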

I deleted my previous answer as I don't think we need an LCS-like solution (LCS=longest Common Subsequence).
It is sufficient to find the couple of subsequences (A, B) that differ in one character and share all the others.
The code below finds the solution in O(N) time.
def function(word):
    dp = [0]*len(word)
    lastOccurences = {}
    for i in range(len(dp)-1, -1, -1):
        if i == len(dp)-1:
            dp[i] = 0
        else:
            # case 1: word[i] is shared by A and B
            if dp[i+1] > 0:
                dp[i] = 1 + dp[i+1]
            # case 2: A uses word[i], B uses its next occurrence; take the maximum
            if word[i] in lastOccurences:
                dp[i] = max(dp[i], len(word)-lastOccurences[word[i]])
        lastOccurences[word[i]] = i
    return dp[0]
dp[i] is equal to 0 when all characters from i to the end of the string are different.
I will explain my code with an example.
For "abcack", there are two cases:
Either the first 'a' will be shared by the two subsequences A and B, in which case the solution will be 1 + function("bcack").
Or 'a' will not be shared between A and B. In this case the result will be 1 + len("ck"). Why 1 + len("ck")? Because we have already satisfied M+1<=K, so we just add all the remaining characters. In terms of indices, the subsequences are [0, 4, 5] and [3, 4, 5].
We take the maximum of these two cases.
The reason I'm scanning right to left is to avoid an O(N) search for the current character in the rest of the string; I maintain the index of the most recently visited occurrence of each character in the dict lastOccurences.

number of occurrences of list of words in a string with O(n)

I have already seen this answer to a similar question:
https://stackoverflow.com/a/44311921/5881884
There, the Aho-Corasick algorithm is used to check in O(n) whether each word in a list occurs in a string. But I want to get the frequency of each word of the list in the string.
For example if
my_string = "some text yes text text some"
my_list = ["some", "text", "yes", "not"]
I would want the result:
[2, 3, 1, 0]
I did not find an exact example for this in the documentation, any idea how to accomplish this?
Other O(n) solutions than using ahocorasick would also be appreciated.

Implementation:
Here's an Aho-Corasick frequency counter:
import ahocorasick
def ac_frequency(needles, haystack):
    frequencies = [0] * len(needles)
    # Make a searcher
    searcher = ahocorasick.Automaton()
    for i, needle in enumerate(needles):
        searcher.add_word(needle, i)
    searcher.make_automaton()
    # Add up all frequencies
    for _, i in searcher.iter(haystack):
        frequencies[i] += 1
    return frequencies
(For your example, you'd call ac_frequency(my_list, my_string) to get the list of counts)
For medium-to-large inputs this will be substantially faster than other methods.
Notes:
For real data, this method will potentially yield different results than the other solutions posted, because Aho-Corasick looks for all occurrences of the target words, including substrings.
If you want to find full-words only, you can call searcher.add_word with space/punctuation-padded versions of the original string:
...
padding_start = [" ", "\n", "\t"]
padding_end = [" ", ".", ";", ",", "-", "–", "—", "?", "!", "\n"]
for i, needle in enumerate(needles):
    for s, e in [(s,e) for s in padding_start for e in padding_end]:
        searcher.add_word(s + needle + e, i)
searcher.make_automaton()
# Add up all frequencies
for _, i in searcher.iter(" " + haystack + " "):
    ...

The Counter in the collections module may be of use to you:
from collections import Counter
my_string = "some text yes text text some"
my_list = ["some", "text", "yes", "not"]
counter = Counter(my_string.split(' '))
[counter.get(item, 0) for item in my_list]
# out: [2, 3, 1, 0]

You can use a list comprehension to count how many times each word of my_list occurs in my_string:
[my_string.split().count(i) for i in my_list]
# out: [2, 3, 1, 0]

You can use a dictionary to count the occurrences of the words you care about:
counts = dict.fromkeys(my_list, 0) # initialize the counting dict with all counts at zero
for word in my_string.split():
    if word in counts: # this test filters out any unwanted words
        counts[word] += 1 # increment the count
The counts dict will hold the count of each word. If you really do need a list of counts in the same order as the original list of keywords (and the dict won't do), you can add a final step after the loop has finished:
results = [counts[word] for word in my_list]

Find and print vowels from a string using a while loop

Study assignment (using python 3):
For a study assignment I need to write a program that prints the indices of all vowels in a string, preferably using a 'while-loop'.
So far I have managed to design a 'for-loop' to get the job done, but I could surely need some help on the 'while-loop'
for-loop solution:
string = input( "Typ in a string: " )
vowels = "a", "e", "i", "o", "u"
indices = ""
for i in string:
    if i in vowels:
        indices += i
print( indices )
while-loop solution:
string = input( "Typ in a string: " )
vowels = "a", "e", "i", "o", "u"
indices = ""
while i < len( string ):
    <code>
    i += 1
print( indices )
Would using 'index()' or 'find()' work here?

Try this:
string = input( "Typ in a string: " )
vowels = ["a", "e", "i", "o", "u"]
higher_bound=1
lower_bound=0
while lower_bound<higher_bound:
    convert_str=list(string)
    find_vowel=list(set(vowels).intersection(convert_str))
    print("Vowels in {} are {}".format(string,"".join(find_vowel)))
    lower_bound+=1
You can also set higher_bound to len(string); then it will print the result as many times as the length of the string.
Since this is a study assignment, you should study and practise it yourself rather than copy and paste. Here is some additional background for the solution:
In mathematics, the intersection A ∩ B of two sets A and B is the set
that contains all elements of A that also belong to B (or
equivalently, all elements of B that also belong to A), but no other
elements.
In python :
The syntax of intersection() in Python is:
A.intersection(*other_sets)
A = {2, 3, 5, 4}
B = {2, 5, 100}
C = {2, 3, 8, 9, 10}
print(B.intersection(A))
print(B.intersection(C))
print(A.intersection(C))
print(C.intersection(A, B))

You can get the character at index x of a string by doing string[x]!
i = 0 # initialise i to 0 here first!
while i < len( string ):
    if string[i] in vowels:
        indices += str(i)
    i += 1
print( indices )
However, is making indices a str really suitable? I don't think so, since there are no separators between the indices. Does the string "12" mean that there are two vowels, at indices 1 and 2, or one vowel at index 12? You can use a list to store the indices instead:
indices = []
And you can add i to it by doing:
indices.append(i)
BTW, your for-loop solution will print the vowel characters, not the indices.
If you don't want to use lists, you can also add a space after each index:
indices += str(i) + " "
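Putting the pieces above together, a complete while-loop version might look like the following. This is a minimal sketch: wrapping the loop in a function named vowel_indices is an illustrative choice (the assignment reads from input()), and like the code above it only treats lowercase a, e, i, o, u as vowels.

```python
def vowel_indices(string):
    """Return the indices of all (lowercase) vowels in string, using a while loop."""
    vowels = "aeiou"
    indices = []  # a list avoids the ambiguity of a string such as "12"
    i = 0         # initialise the counter before the loop
    while i < len(string):
        if string[i] in vowels:
            indices.append(i)
        i += 1
    return indices


print(vowel_indices("hello world"))  # [1, 4, 7]
```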

find minimum steps required to change one binary string to another

Given two string str1 and str2 which contain only 0 or 1, there
are some steps to change str1 to str2,
step1: find a substring of str1 of length 2 and reverse the substring, and str1 becomes str1' (str1' != str1)
step2: find a substring of str1' of length 3, and reverse the substring, and str1' becomes str1'' (str1'' != str1')
the following steps are similar.
the string length is in the range [2, 30]
Requirement: each step must be performed once and we can not skip
previous steps and perform the next step.
If it is possible to change str1 to str2, output the minimum steps required, otherwise, output -1
Example 1
str1 = "1010", str2 = "0011", the minimum step required is 2
first, choose substring in range [2, 3], "1010" --> "1001",
then choose substring in the range [0, 2], "1001" --> "0011"
Example 2
str1 = "1001", str2 = "0110", it is impossible to change str1 to str2,
because in step 1, str1 can be changed to "0101" or "1010", but in step 2 no reversal of a length-3 substring produces a different string. So the output is -1.
Example 3
str1 = "10101010", str2 = "00101011", output is 7
I cannot figure out example 3, because there are too many possibilities. Can anyone give some hints on how to solve this problem? What type of
problem is this? Is it dynamic programming?

This is in fact a dynamic programming problem. To solve it, we are going to try all possible sequences of reversals, but memoize the results along the way. It could seem that there are way too many options, since there are 2^30 different binary strings of length 30, but keep in mind that reversing a substring doesn't change the number of zeroes and ones we have, so the upper bound is in fact 30 choose 15 = 155117520, reached for a string of 15 zeroes and 15 ones. Around 150 million possible results is not too bad.
So, starting with our start string, we are going to derive all possible strings from each string we have derived so far, until we generate the end string. We are also going to track predecessors so that we can reconstruct the sequence. Here's my code:
start = '10101010'
end = '00101011'
dp = [{} for _ in range(31)]
dp[1][start] = '' # Originally only start string is reachable
for i in range(2, len(start) + 1):
    for s in dp[i - 1].keys():
        # Try all possible reversals for each string in dp[i - 1]
        for j in range(len(start) - i + 1):
            newstr = s
            newstr = newstr[:j] + newstr[j:j+i][::-1] + newstr[j+i:]
            dp[i][newstr] = s
    if end in dp[i]:
        # Walk the predecessor links back to the start string
        ans = []
        cur = end
        for j in range(i, 0, -1):
            ans.append(cur)
            cur = dp[j][cur]
        print(ans[::-1])
        exit(0)
print('Impossible!')
And for your third example, this gives us sequence ['10101010', '10101001', '10101100', '10100011', '00101011'] - from your str1 to str2. If you check differences between the strings, you'll see which transitions were made. So this transformation can be done in 4 steps rather than 7 like you suggested.
Lastly, this will be a bit slow for 30 in python, but if you rewrite it into C++, it's going to be a couple of seconds tops.
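A compact variant of the same level-by-level idea is sketched below. It differs from the code above in two ways that are assumptions of this sketch: it only returns the step count (no predecessor tracking), and it enforces the question's requirement that each step must actually change the string, which the code above does not check. The name min_steps is illustrative.

```python
def min_steps(src, dst):
    """Minimum number of steps (reversals of length 2, 3, ...) turning src into dst, or -1."""
    if src == dst:
        return 0
    # Reversals never change how many zeroes and ones there are.
    if len(src) != len(dst) or sorted(src) != sorted(dst):
        return -1
    frontier = {src}  # all strings reachable after the steps taken so far
    for length in range(2, len(src) + 1):
        reachable = set()
        for s in frontier:
            for j in range(len(s) - length + 1):
                t = s[:j] + s[j:j + length][::-1] + s[j + length:]
                if t != s:  # a step must produce a different string
                    reachable.add(t)
        if dst in reachable:
            return length - 1  # the reversal of this length is step number length - 1
        if not reachable:
            return -1  # no legal move exists, so later steps cannot be reached
        frontier = reachable
    return -1


print(min_steps("1010", "0011"))
print(min_steps("1001", "0110"))
print(min_steps("10101010", "00101011"))
```

Using a set per level keeps each distinct string only once per step, which is what bounds the work by the number of strings with the same count of ones.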

This question can be solved using backtracking. Here is my C++ code, which runs smoothly on my test cases. This question came up in an online assessment (OA) at Persistent Systems and I was a bit confused about the steps, but it is simple backtracking. I would welcome suggestions on whether DP can optimize my solution.
//prabaljainn
#include <bits/stdc++.h>
using namespace std;
string s1,s2;
int ans=1e9; int n;
// level is the length of the substring to reverse in the next step
void rec(string s1,int level){
    if(s1==s2){
        ans = min(ans,level-2);
        return;
    }
    for(int i=0; i<= n-level; i++){
        reverse(s1.begin()+i, s1.begin()+i+level);
        rec(s1,level+1);
        reverse(s1.begin()+i, s1.begin()+i+level); // undo the reversal
    }
}
int main(){
    cin>>s1>>s2;
    n = s1.size();
    rec(s1,2);
    if(ans==1e9)
        cout<<"-1"<<endl;
    else
        cout<<ans<<endl;
}
Happy coding

This problem can be solved using breadth-first search. The following solution uses a queue which stores pairs with the current string as the first member and the current operation length (initially 2) as the second member. A set stores the already visited (string, operation length) states to prevent entering redundant states; the operation length is part of the key because the same string reached at a different step is a different state. For the current string, we reverse every substring of length k, where k is the current operation length, and add the result to the queue if that state hasn't been seen before. If the current string equals the desired string, then the answer is 'current operation length - 2'. If the queue becomes empty, there is no answer.
#include <bits/stdc++.h>
using namespace std;
int main()
{
    string str1,str2;
    cin>>str1>>str2;
    queue<pair<string, int>> q;
    set<pair<string, int>> s;
    q.push({str1,2});
    s.insert({str1,2});
    while(!q.empty())
    {
        auto p=q.front();
        q.pop();
        if(p.first==str2)
        {
            cout<<p.second-2;
            return 0;
        }
        if(p.second<=(int)p.first.size())
        {
            for(int i=0;i+p.second<=(int)p.first.size();i++)
            {
                string x=p.first;
                reverse(x.begin()+i,x.begin()+i+p.second);
                if(s.find({x,p.second+1})==s.end())
                {
                    q.push({x,p.second+1});
                    s.insert({x,p.second+1});
                }
            }
        }
    }
    cout<<-1;
}

Use str1 as the start of a BFS. At each step, reverse every substring of the current step's length (2, then 3, and so on) and check whether the resulting strings have been seen before. If not, push them onto the queue, and keep track of the step count. If the string at the front of the queue is ever str2, that step count is the answer.

Understanding Knuth-Morris-Pratt Algorithm

Can someone explain this to me? I've been reading about it and it still is hard to follow.
text : ababdbaababa
pattern: ababa
table for ababa is -1 0 0 1 2.
I think I understand how the table is constructed, but I don't understand how to shift once a mismatch has occurred. It seems like we don't even use the table when shifting?
When do we use the table?

Here I briefly describe how to shift through the text using the prefix table.
For further information: Knuth–Morris–Pratt string search algorithm
Shifting through the text:
Text: ABC ABCDAB ABCDABCDABDE
Pattern: ABCDABD
Scenario 1 - Some characters of the pattern match the text before the mismatch.
e.g. 1: Here 3 characters match.
Get the table value for those 3 characters (index 2, ABC), i.e. 0.
Therefore shift = 3 - 0, i.e. 3.
e.g. 2: Here 6 characters match.
Get the table value for those 6 characters (index 5, ABCDAB), i.e. 2.
Therefore shift = 6 - 2, i.e. 4.
Scenario 2 - If no characters match, shift by one.
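The shift rule in the two scenarios can be sketched in Python. This is a minimal sketch, assuming the table stores for each prefix of the pattern the length of its longest proper prefix that is also a suffix (which is what the values 0 and 2 above correspond to); the names prefix_table and shift_after_mismatch are illustrative.

```python
def prefix_table(pattern):
    """table[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    table = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]  # fall back to a shorter border
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def shift_after_mismatch(pattern, matched):
    """How far to move the pattern after `matched` characters matched."""
    if matched == 0:
        return 1  # Scenario 2: nothing matched, shift by one
    return matched - prefix_table(pattern)[matched - 1]  # Scenario 1


print(prefix_table("ABCDABD"))             # [0, 0, 0, 0, 1, 2, 0]
print(shift_after_mismatch("ABCDABD", 3))  # 3
print(shift_after_mismatch("ABCDABD", 6))  # 4
```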

The table is used when a mismatch occurs. Let's apply the pattern to your text:
You start matching the text against the pattern and test whether the pattern could occur in the text starting at the first position. You compare text[1] with pattern[1] and that turns out to be a match. You do the same for text[2], text[3] and text[4].
When you want to match text[5] with pattern[5] you don't have a match (d != a). You then know that your pattern does not start at the first position. You could then start the matching all over again at position 2, but that is not efficient. You can use the table now.
The error occurred at pattern[5], so you go to table[5], which is 2. That tells you that you can start matching at the current position again with 2 characters already matched. Instead of having to start matching at position 2, you can start at your previous position (1) + table[5] (2) = 3. Indeed, if we look at text[3] and text[4], we see that they are equal to pattern[1] and pattern[2], respectively.
The numbers in the table tell you how many positions are already matched when an error occurs. In this case 2 characters of the next pattern alignment were already matched. You can then immediately start matching at position 3 and skip position 2 (as the pattern cannot be found starting at position 2).

Well this is an old topic but hopefully someone who searches for this in the future will see it. Answer given above is good but I worked through an example myself to see what's going on exactly.
First part of the exposition is taken from wiki, the part I really wanted to elaborate on is how this backtracking array is constructed.
Here goes:
we work through a (relatively artificial) run of the algorithm, where
W = "ABCDABD" and
S = "ABC ABCDAB ABCDABCDABDE".
At any given time, the algorithm is in a state determined by two integers:
m which denotes the position within S which is the beginning of a prospective match for W
i the index in W denoting the character currently under consideration.
In each step we compare S[m+i] with W[i] and advance if they are equal. This is depicted, at the start of the run, like
             1         2
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W: ABCDABD
i: 0123456
We proceed by comparing successive characters of W to "parallel" characters of S, moving from one to the next if they match. However, in the fourth step,
we get S[3] is a space and W[3] = 'D', a mismatch. Rather than beginning to search again at S[1], we note that no 'A' occurs between positions 0 and 3 in S
except at 0; hence, having checked all those characters previously, we know there is no chance of finding the beginning of a match if we check them again.
Therefore we move on to the next character, setting m = 4 and i = 0.
             1         2
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W:     ABCDABD
i:     0123456
We quickly obtain a nearly complete match "ABCDAB" when, at W[6] (S[10]), we again have a discrepancy. However, just prior to the end of the current partial
match, we passed an "AB" which could be the beginning of a new match, so we must take this into consideration. As we already know that these characters match
the two characters prior to the current position, we need not check them again; we simply reset m = 8, i = 2 and continue matching the current character. Thus,
not only do we omit previously matched characters of S, but also previously matched characters of W.
             1         2
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W:         ABCDABD
i:         0123456
This search fails immediately, however, as the pattern still does not contain a space, so as in the first trial, we return to the beginning of W and begin
searching at the next character of S: m = 11, reset i = 0.
             1         2
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W:            ABCDABD
i:            0123456
Once again we immediately hit upon a match "ABCDAB" but the next character, 'C', does not match the final character 'D' of the word W. Reasoning as before,
we set m = 15, to start at the two-character string "AB" leading up to the current position, set i = 2, and continue matching from the current position.
             1         2
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W:                ABCDABD
i:                0123456
This time we are able to complete the match, whose first character is S[15].
The above example contains all the elements of the algorithm. For the moment, we assume the existence of a "partial match" table T, described below, which
indicates where we need to look for the start of a new match in the event that the current one ends in a mismatch. The entries of T are constructed so that
if we have a match starting at S[m] that fails when comparing S[m + i] to W[i], then the next possible match will start at index m + i - T[i] in S (that is,
T[i] is the amount of "backtracking" we need to do after a mismatch). This has two implications: first, T[0] = -1, which indicates that if W[0] is a mismatch,
we cannot backtrack and must simply check the next character; and second, although the next possible match will begin at index m + i - T[i], as in the example
above, we need not actually check any of the T[i] characters after that, so that we continue searching from W[T[i]].
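The run described above can be reproduced with a short Python sketch. It assumes the partial match table is the simple (non-optimised) one, where T[0] = -1 and T[i] for i >= 1 is the length of the longest proper prefix of W[:i] that is also a suffix of it; the names build_table and kmp_search are illustrative.

```python
def build_table(w):
    """Partial match table T with T[0] = -1 (simple, non-optimised variant)."""
    lps = [0] * len(w)
    k = 0
    for i in range(1, len(w)):
        while k > 0 and w[i] != w[k]:
            k = lps[k - 1]
        if w[i] == w[k]:
            k += 1
        lps[i] = k
    # T is the same information shifted right by one, with -1 in front.
    return [-1] + lps[:-1]


def kmp_search(s, w):
    """Return the index in s where the first match of w starts, or -1."""
    if not w:
        return 0
    t = build_table(w)
    m = 0  # start of the prospective match in s
    i = 0  # index of the current character in w
    while m + i < len(s):
        if s[m + i] == w[i]:
            i += 1
            if i == len(w):
                return m
        elif t[i] > -1:
            # Next possible match starts at m + i - T[i]; T[i] characters
            # of it are already known to match.
            m, i = m + i - t[i], t[i]
        else:
            # Mismatch on W[0]: move on to the next character of s.
            m, i = m + 1, 0
    return -1


print(build_table("ABCDABD"))                            # [-1, 0, 0, 0, 0, 1, 2]
print(kmp_search("ABC ABCDAB ABCDABCDABDE", "ABCDABD"))  # 15
```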
BACKTRACKING ARRAY CONSTRUCTION:
This backtracking array T[] is usually stored as lps[] (as defined here, lps is T shifted by one position: T[i] = lps[i-1] for i >= 1). Let's see how we calculate it:
lps[i] = the length of the longest proper prefix of pat[0..i]
which is also a suffix of pat[0..i].
Examples:
For the pattern “AABAACAABAA”,
lps[] is [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]
//so just going through this real quick
lps[0] is just 0 by default
lps[1] is 1 because it's looking at AA and A is both a prefix and a suffix
lps[2] is 0 because it's looking at AAB; the suffix is B, but no proper prefix ends in B (the whole string AAB does not count, since the prefix must be proper)
lps[3] is 1 because it's looking at AABA and the first A matches the last A
lps[4] is 2 because it's looking at AABAA and the first 2 A's match the last 2 A's
lps[5] is 0 because it's looking at AABAAC and nothing matches C
...
For the pattern “ABCDE”, lps[] is [0, 0, 0, 0, 0]
For the pattern “AAAAA”, lps[] is [0, 1, 2, 3, 4]
For the pattern “AAABAAA”, lps[] is [0, 1, 2, 0, 1, 2, 3]
For the pattern “AAACAAAAAC”, lps[] is [0, 1, 2, 0, 1, 2, 3, 3, 3, 4]
And this totally makes sense if you think about it: if you mismatch, you want to reuse as much of what you have already matched as you can. How far back you go (the suffix
portion) is essentially the prefix, since by definition you must start matching from the first character of the pattern again. So if your string looks like
aaaaaaaaaaaaaaa..b..aaaaaaaaaaaaaaac and you mismatch on the last char c, then you want to reuse aaaaaaaaaaaaaaa as your new head. Just think it through.

A complete solution using Java:
package src.com.recursion;

/*
 * Searches for a pattern in a text with KMP in O(n + m) time.
 */
public class FindPatternInText {

    // Returns the index where the first match starts, or -1 if there is none.
    public int checkIfExists(char[] text, char[] pattern) {
        int[] lps = new int[pattern.length];
        createPrefixSuffixArray(pattern, lps);
        int i = 0; // position in text
        int j = 0; // position in pattern
        int textLength = text.length;
        while (i < textLength) {
            if (pattern[j] == text[i]) {
                j++;
                i++;
            }
            if (j == pattern.length)
                return i - j;
            else if (i < textLength && pattern[j] != text[i]) {
                if (j != 0) {
                    j = lps[j - 1];
                } else {
                    i++;
                }
            }
        }
        return -1;
    }

    // lps[i] = length of the longest proper prefix of pattern[0..i] that is also its suffix.
    private void createPrefixSuffixArray(char[] pattern, int[] lps) {
        lps[0] = 0;
        int index = 0;
        int i = 1;
        while (i < pattern.length) {
            if (pattern[i] == pattern[index]) {
                // extend the current border by one before recording its length
                index++;
                lps[i] = index;
                i++;
            } else {
                if (index != 0) {
                    index = lps[index - 1];
                } else {
                    lps[i] = 0;
                    i++;
                }
            }
        }
    }

    public static void main(String args[]) {
        String text = "ABABDABACDABABCABAB";
        String pattern = "ABABCABAB";
        System.out.println("Point where the pattern match starts is "
                + new FindPatternInText().checkIfExists(text.toCharArray(), pattern.toCharArray()));
    }
}
