Find subsequence occurrence in multiple strings

Given a string S of length m and a set R of other strings, all of length at least m, find the strings in R that have S as a subsequence.
So, if S is blr and the set of strings is:
bangalore
booleer
bamboo
It should return the first two strings.
I'm aware that I can check whether a string S of length m is a subsequence of another string T of length n in O(n+m) time. So I know I could just run this algorithm for each element of the set, but that would take O(k*(n+m)) time, where k is the size of the set (assuming all the strings have the same length n). This makes me wonder whether some kind of preprocessing could help solve the problem for multiple strings.
So, is there any preprocessing or structure I can use to solve this problem?
What's the best time complexity I can achieve?
Are there any other approaches to solve this problem?

For two strings ch and s, checking whether s is a subsequence of ch (that is, whether ch belongs in the result set) takes O(n) time, where n is the length of ch:
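public bool function(string ch, string s)
{
    // An empty s is a subsequence of every string.
    if (s.Length == 0)
        return true;
    // s cannot be a subsequence of a shorter string.
    if (ch.Length < s.Length)
        return false;
    int j = 0; // index of the next character of s to match
    for (int i = 0; i < ch.Length; i++)
    {
        if (ch[i] == s[j])
        {
            j++;
            if (s.Length == j)
            {
                return true; // every character of s was matched in order
            }
        }
    }
    return false;
}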
After that, you have to apply it to every string in R.
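Applying the check to the whole set could look like the following Java sketch. It assumes R is given as a list; the names SubsequenceFilter, isSubsequence and filter are made up for illustration and are not from the answer above.

```java
import java.util.ArrayList;
import java.util.List;

final class SubsequenceFilter {
    // Two-pointer check: true when s can be read off t from left to right,
    // skipping characters of t as needed.
    static boolean isSubsequence(String s, String t) {
        int j = 0; // next character of s still to be matched
        for (int i = 0; i < t.length() && j < s.length(); i++) {
            if (t.charAt(i) == s.charAt(j)) {
                j++;
            }
        }
        return j == s.length();
    }

    // Keeps every string of r that has s as a subsequence: O(total length of r).
    static List<String> filter(String s, List<String> r) {
        List<String> hits = new ArrayList<>();
        for (String t : r) {
            if (isSubsequence(s, t)) {
                hits.add(t);
            }
        }
        return hits;
    }
}
```

With S = blr and the three strings from the question, filter keeps bangalore and booleer and drops bamboo.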

I haven't got a code implementation, but I did manage to find a 1984 paper "Computing a longest common subsequence for a set of strings", by W. J. Hsu and M. W. Du.
Their conclusion is that by doing O(L) preprocessing time (where L is the aggregate length of all the strings in the set), it is possible to perform each search in O(P), where P is the number of times the needle appears in the haystacks.

Related

Using a trie for string segmentation - time complexity?

Problem to be solved:
Given a non-empty string s and a string array wordArr containing a list
of non-empty words, determine if s can be segmented into a
space-separated sequence of one or more dictionary words. You may
assume the dictionary does not contain duplicate words.
For example, given s = "leetcode", wordArr = ["leet", "code"].
Return true because "leetcode" can be segmented as "leet code".
In the above problem, would it work to build a trie that contains each string in wordArr? Then, for each char in the given string s, work down the trie. If a trie branch terminates, then this substring is complete, so pass the remaining string up to the root and do the exact same thing recursively.
This should be O(N) time and O(N) space, correct? I ask because the problem I'm working on says this will be O(N^2) time in the most optimal way, and I'm not sure what's wrong with my approach.
For example, if s = "hello" and wordArr = ["he", "ll", "ee", "zz", "o"], then "he" will be completed in the first branch of the trie, "llo" will be passed up to the root recursively. Then, "ll" will be completed, so "o" gets passed up to root of trie. Then "o" is completed, which is the end of s, so return true. If the end of s isn't completed, return false.
Is this correct?
Your example would indeed suggest a linear time complexity, but look at this example:
s = "hello"
wordArr = ["hell", "he", "e", "ll", "lo", "l", "h"]
Now, first "hell" is tried, but in the next recursion cycle no solution is found (there is no "o"), so the algorithm needs to backtrack and assume "hell" is not suitable (pun not intended). So you try "he", and at the next level you find "ll", but then again it fails, as there is no "o". Again backtracking is needed: you use "l" instead of "ll", and the solution is now available: "he l lo".
So no, this does not have O(n) time complexity.
I suspect off-hand that the issue is backtracking. What if the word is not segmentable based on a particular dictionary, or what if there are multiple possible substrings with a common prefix? E.g., suppose the dictionary contains he, llenic, and llo. Failure down one branch of the trie would require backtracking, with some corresponding increase in time complexity.
This is similar to a regex-match problem: the example you give is like testing an input word against
^(he|ll|ee|zz|o)+$
(any number of dictionary members, in any order, and nothing else). I don't know the time complexity of regex matchers offhand, but I know backtracking can get you into serious time trouble.
I did find this answer which says:
Running a DFA-compiled regular expression against a string is indeed O(n), but can require up to O(2^m) construction time/space (where m = regular expression size).
So maybe it is O(n^2) with reduced construction effort.
Let's start by converting the trie to an NFA. We make the root an accepting node and add an empty-character (epsilon) edge from every word end in the trie back to the root.
Time complexity: at each step in the trie we can move along only one edge for the current char of the input string or, at a word end, back to the root. In the worst case that is two choices per character:
T(n) = 2×T (n-1)+c
That gives us O(2^n)
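Unrolling the recurrence makes the bound explicit (a short derivation, assuming T(0) is a constant):

```latex
\begin{aligned}
T(n) &= 2\,T(n-1) + c \\
     &= 2^{2}\,T(n-2) + c\,(2 + 1) \\
     &= 2^{k}\,T(n-k) + c\,(2^{k} - 1) \\
     &= 2^{n}\,T(0) + c\,(2^{n} - 1) = O(2^{n})
\end{aligned}
```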
Indeed not O(n), but you can do better using dynamic programming.
We will use a top-down approach.
Before solving for any string, check whether we have already solved it.
We can use a HashMap to store the results for strings that were already solved.
Whenever a recursive call fails, store that string in the HashMap as well.
The idea is to compute the answer for every suffix of the word only once. There are only n suffixes, so this ends up at O(n^2).
Code from algorithms.tutorialhorizon.com:
Map<String, String> memoized = new HashMap<>();
Set<String> dict;

String SegmentString(String input) {
    if (dict.contains(input)) return input;
    if (memoized.containsKey(input)) {
        return memoized.get(input); // null means a known dead end
    }
    int len = input.length();
    for (int i = 1; i < len; i++) {
        String prefix = input.substring(0, i);
        if (dict.contains(prefix)) {
            String suffix = input.substring(i, len);
            String segSuffix = SegmentString(suffix);
            if (segSuffix != null) {
                memoized.put(input, prefix + " " + segSuffix);
                return prefix + " " + segSuffix;
            }
        }
    }
    memoized.put(input, null); // remember the failure so it is not recomputed
    return null;
}
And you can do better!
Map<String, String> memoized = new HashMap<>();
// Trie is not a JDK class; GetAll(input) is assumed to return every
// dictionary word that is a prefix of input.
Trie<String> dict;

String SegmentString(String input)
{
    if (dict.contains(input))
        return input;
    if (memoized.containsKey(input))
        return memoized.get(input); // null means a known dead end
    int len = input.length();
    for (String word : dict.GetAll(input))
    {
        String suffix = input.substring(word.length(), len);
        String segSuffix = SegmentString(suffix);
        if (segSuffix != null)
        {
            memoized.put(input, word + " " + segSuffix);
            return word + " " + segSuffix;
        }
    }
    memoized.put(input, null); // remember the failure
    return null;
}
Using the trie so that recursive calls are made only when the trie reaches a word end, you get O(z×n), where z is the depth of the trie (the length of the longest dictionary word).
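As a self-contained illustration of the trie idea, here is a bottom-up Java sketch that only answers whether a segmentation exists. The names TrieWordBreak, Node and canSegment are hypothetical, and the table ok[] plays the role of the memo map: ok[i] records whether the suffix starting at index i can be segmented.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TrieWordBreak {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean wordEnd;
    }

    static boolean canSegment(String s, List<String> words) {
        // Build the trie of dictionary words.
        Node root = new Node();
        for (String w : words) {
            Node cur = root;
            for (char c : w.toCharArray()) {
                cur = cur.next.computeIfAbsent(c, k -> new Node());
            }
            cur.wordEnd = true;
        }
        int n = s.length();
        boolean[] ok = new boolean[n + 1];
        ok[n] = true; // the empty suffix is trivially segmentable
        for (int i = n - 1; i >= 0; i--) {
            Node cur = root;
            // Walk the trie along s[i..]; the walk stops after at most z steps,
            // where z is the length of the longest dictionary word.
            for (int j = i; j < n; j++) {
                cur = cur.next.get(s.charAt(j));
                if (cur == null) {
                    break;
                }
                if (cur.wordEnd && ok[j + 1]) {
                    ok[i] = true;
                    break;
                }
            }
        }
        return ok[0];
    }
}
```

Each start index is processed once and walks at most z trie edges, which is where the O(z×n) bound comes from.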

Number of distinct rotated strings

We have a string S and we want to calculate the number of distinct strings that can be formed by rotating the string.
For example:
S = "aaaa": here it would be 1 string, {"aaaa"}
S = "abab": here it would be 2 strings, {"abab", "baba"}
So, is there an algorithm to solve this in O(|S|) time, where |S| is the length of the string?
Suffix trees, baby!
If the string is S, construct the suffix tree for SS (S concatenated with itself).
Then count the unique substrings of length |S|. The uniqueness you get automatically. For length |S| you might have to change the suffix tree algorithm a little (to maintain depth information), but it is doable.
(Note that the other answer by johnsoe is actually quadratic, or worse, depending on the implementation of Set).
You can solve this with rolling hash functions used in the Rabin-Karp algorithm.
You can use the rolling hash to update the hash table for all substrings of size |S| (obtained by sliding a |S| window across SS) in constant time (so, O(|S|) in total).
Assuming your string comes from an alphabet of constant size, you can inspect the hash table in constant time to obtain the required metric.
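A minimal Java sketch of the rolling-hash idea follows. The class name RotationCount and the constants BASE and MOD are arbitrary choices for illustration. Because only hashes are stored, the count is probabilistic: two different rotations could in principle collide on the same hash value.

```java
import java.util.HashSet;
import java.util.Set;

final class RotationCount {
    private static final long MOD = 1_000_000_007L; // assumed modulus
    private static final long BASE = 131L;          // assumed base

    // Hashes every length-n window of s + s and counts the distinct hash values.
    static int distinctRotations(String s) {
        int n = s.length();
        if (n == 0) {
            return 1; // the empty string is its own only rotation
        }
        String ss = s + s;
        long high = 1; // BASE^(n-1) mod MOD, the weight of the outgoing character
        for (int i = 1; i < n; i++) {
            high = high * BASE % MOD;
        }
        long hash = 0;
        for (int i = 0; i < n; i++) {
            hash = (hash * BASE + ss.charAt(i)) % MOD;
        }
        Set<Long> seen = new HashSet<>();
        seen.add(hash);
        for (int start = 1; start < n; start++) {
            long out = ss.charAt(start - 1) * high % MOD;
            hash = (hash - out + MOD) % MOD;                       // drop the leftmost character
            hash = (hash * BASE + ss.charAt(start + n - 1)) % MOD; // append the next one
            seen.add(hash);
        }
        return seen.size();
    }
}
```

Each window update is constant time, so the whole scan is O(|S|) expected time with a hash set.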
Something like this should do the trick.
public static int uniqueRotations(String phrase){
    Set<String> rotations = new HashSet<String>();
    rotations.add(phrase);
    for(int i = 0; i < phrase.length() - 1; i++){
        // move the last character to the front
        phrase = phrase.charAt(phrase.length() - 1) + phrase.substring(0, phrase.length() - 1);
        rotations.add(phrase);
    }
    return rotations.size();
}

Find the smallest period of input string in O(n)?

Given the following problem :
Definition :
Let S be a string over alphabet Σ. S' is the smallest period of S
if S' is the smallest string such that:
S = (S')^k (S''),
where S'' is a prefix of S. If no such S' exists, then S is
not periodic.
Example : S = abcabcabcabca. Then abcabc is a period since S =
abcabc abcabc a, but the smallest period is abc since S = abc abc
abc abc a.
Give an algorithm to find the smallest period of input string S or
declare that S is not periodic.
Hint : You can do that in O(n) ...
My solution: we use KMP, which runs in O(n).
By the definition of the problem, S = (S')^k (S''), so I think that if we create
an automaton for the shortest period, and find a way to find that shortest period, then I'm done.
The problem is where to put the FAIL arrow of the automaton...
Any ideas would be greatly appreciated.
Regards
Alright so this problem can definitely be solved in O(n), we just have to cleverly use KMP as you suggested.
Solving the longest proper prefix which is also a suffix problem is a vital part of KMP that we will make use of.
The longest proper prefix which is also a suffix problem is a mouthful so let's just call it the prefix suffix problem for now.
The prefix suffix problem can be pretty hard to understand so I'll include some examples.
The prefix suffix solution for "abcabc" is
"abc" since that is the longest string which is both a proper prefix
and a proper suffix (proper prefixes and suffixes cannot be the entire
string).
The prefix suffix solution for "abcabca" is "abca".
Hmmmmmmmmm, wait a minute: if we just chop "abca" off the end of "abcabca" we are left with "abc", which is exactly the smallest period. Hmmmmmmmmm. Very interesting. (This is pretty much the solution, but I will talk about why this works.)
Alright let's try to formalize this intuition a bit more and see if we can arrive at a solution.
I will use one key assumption in my argument:
The smallest period of our pattern is a valid period of every larger period in our pattern
Let us store the prefix suffix solution for the prefix of our pattern that ends at index i in lps[i]. This lps array can be calculated in O(n), and it is used in the KMP algorithm. You can read more about how to calculate it in O(n) here: https://www.geeksforgeeks.org/kmp-algorithm-for-pattern-searching/
Just so we are clear I will list some examples of some lps arrays
Pattern:"aaaaa"
lps: [0, 1, 2, 3, 4]
Pattern:"aabbcc"
lps: [0, 1, 0, 0, 0, 0]
Pattern:"abcabcabc"
lps: [0, 0, 0, 1, 2, 3, 4, 5, 6]
Alright, now let's define some variables to help us see why this lps array is useful.
Let l be the length of our pattern, and let k be the last value in our lps array (k = lps[l-1]).
The value k tells us that the first k characters of our string are the same as the last k characters of our string. And we can use this fact to find a period!
Using this information we can now show that the prefix consisting of the first l-k characters of our string forms a valid period. The first k characters (our prefix) are the same as the last k characters (our suffix), so the string matches itself shifted by l-k positions: every character equals the one l-k positions before it, which is exactly what repeating the first l-k characters produces.
In practice a single step is enough: lps[l-1] is the longest prefix that is also a suffix, so l-k is the smallest period length. (Repeatedly shifting back through the lps array is not safe: for "abaab" it would end at "ab", which is not a period.)
public static void main(String[] args){
    String pattern="abcabcabcabca";
    int[] lps= calculateLPS(pattern);
    //k = length of the longest proper prefix that is also a suffix
    int k=lps[lps.length-1];
    //the first l-k characters form the smallest period;
    //when k is 0 this is the whole string, i.e. it is not periodic
    int periodLength=lps.length-k;
    System.out.println(pattern.substring(0,periodLength));
}
And since calculating lps happens in O(n), and the last step is a constant-time lookup, the time complexity for the whole procedure is simply O(n).
I borrowed heavily from the GeeksforGeeks implementation of KMP in my calculateLPS() method. If you would like to see my exact code, it is below, but I recommend that you also look at their explanation: https://www.geeksforgeeks.org/kmp-algorithm-for-pattern-searching/
// needs: import java.util.Arrays;
static int[] calculateLPS(String pat) {
    int[] lps = new int[pat.length()];
    int len = 0; // length of the current longest prefix that is also a suffix
    int i = 1;
    lps[0] = 0;
    while (i < pat.length()) {
        if (pat.charAt(i) == pat.charAt(len)) {
            len++;
            lps[i] = len;
            i++;
        }
        else {
            if (len != 0) {
                // fall back to the next shorter candidate without advancing i
                len = lps[len - 1];
            }
            else {
                lps[i] = len;
                i++;
            }
        }
    }
    System.out.println(Arrays.toString(lps));
    return lps;
}
Last but not least, thanks for posting such an interesting problem it was pretty fun to figure out! Also I am new to this so please let me know if any part of my explanation doesn't make sense.
I'm not sure that I understand your attempted solution. KMP is a useful subroutine, though -- the smallest period is how far KMP moves the needle string (i.e., S) after a complete match.
This problem can be solved using the Z function; this tutorial can help you.
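For reference, a Z-function version might look like the Java sketch below (the names ZPeriod, zFunction and smallestPeriod are made up here). It uses the fact that p is a period of s exactly when the suffix starting at p matches a prefix all the way to the end of the string, that is, when p + z[p] equals the length of s.

```java
final class ZPeriod {
    // z[i] = length of the longest common prefix of s and s.substring(i);
    // z[0] is left at 0 by convention.
    static int[] zFunction(String s) {
        int n = s.length();
        int[] z = new int[n];
        int l = 0, r = 0; // [l, r) is the rightmost segment known to match a prefix
        for (int i = 1; i < n; i++) {
            if (i < r) {
                z[i] = Math.min(r - i, z[i - l]);
            }
            while (i + z[i] < n && s.charAt(z[i]) == s.charAt(i + z[i])) {
                z[i]++;
            }
            if (i + z[i] > r) {
                l = i;
                r = i + z[i];
            }
        }
        return z;
    }

    // Length of the smallest period of s; returns s.length() when no shorter
    // period exists (the "not periodic" case).
    static int smallestPeriod(String s) {
        int n = s.length();
        int[] z = zFunction(s);
        for (int p = 1; p < n; p++) {
            if (p + z[p] == n) {
                return p;
            }
        }
        return n;
    }
}
```

For abcabcabcabca this gives 3 (abc), and for a string such as abcd it gives the full length 4.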
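This problem can easily be solved with KMP.
Concatenate the string with itself and compute the KMP failure array of the result.
Let n be the length of the original string.
Search for the first value >= n in the KMP array. That value must be at a position k >= n (0-based).
Then k - n + 1 is the length of the shortest period of the string. Note that this finds the shortest S' for which S is a whole number of copies of S' (no partial copy at the end), so for abcabcabcabca it returns 13 rather than 3.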
Example:
Original string = abaaba
n = 6
New string = abaabaabaaba
KMP values for this new string: 0 0 1 1 2 3 4 5 6 7 8 9
The first value >= n is 6 which is at position 8. 8 - 6 + 1 = 3 is the length of the shortest period of the string (aba).
See if this solution works for O(n). I used rotation of strings.
public static int stringPeriod(String s){
    String s1= s;
    String s2= s1;
    // each rotation and each comparison costs O(n), so the loop is O(n^2) overall
    for (int i=1; i <s1.length();i++){
        s2=rotate(s2);
        if(s1.equals(s2)){
            return i;
        }
    }
    return -1;
}
public static String rotate(String s1){
    // move the first character to the end
    String rotS= s1;
    rotS = s1.substring(1)+s1.substring(0,1);
    return rotS;
}
The complete program is available in this github repository

brute force string pattern matching average analysis

I have a brute-force string pattern searching algorithm as below:
public static int brute(String text,String pattern) {
    int n = text.length(); // n is length of text.
    int m = pattern.length(); // m is length of pattern
    int j;
    for(int i=0; i <= (n-m); i++) {
        j = 0;
        while ((j < m) && (text.charAt(i+j) == pattern.charAt(j)) ) {
            j++;
        }
        if (j == m)
            return i; // match at i
    }
    return -1; // no match
} // end of brute()
While analysing the above algorithm, the author mentions the worst case and the average case.
I understood the worst-case performance, but how did the author come up with O(m+n) for the average case? I need help here.
Brute force pattern matching runs in time O(mn) in the worst case.
Average for most searches of ordinary text take O(m+n), which is very quick.
Example of a more average case:
T: "a string searching example is standard"
P: "store"
Thanks for your time and help
What he's referring to with the O(m+n) is the partial matches that would happen in the normal case.
For example, with your normal case you will get:
T: "a string searching example is standard"
P: "store"
iterations:
O(38 + 5) == 43
a - no match (1)
space - no match (2)
s - match (3)
    t - match (4)
    r - no match (5)
t - no match (6)
r - no match (7)
i - no match (8)
n - no match (9)
g - no match (10)
space - no match (11)
etc...
I indented the inner loop to make it easier to understand.
Eventually you've checked all of the text, which is O(n) in the question's notation (n is the text length, m the pattern length). The partial matches mean that you have also either checked all of the pattern, which is O(m) (found a complete match), or at least spent a comparable number of extra character comparisons (partial matches only).
Overall this leads to O(m+n) time on average.
Best case would be O(m), if the match is at the very beginning of the text.
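To see the effect on the example, one can count character comparisons directly. The Java sketch below is the question's brute() with a counter added; the class and method names (BruteForceCount, bruteWithCount) are made up for illustration.

```java
final class BruteForceCount {
    // Same scan as brute(), but returns {matchIndex, characterComparisons}.
    static int[] bruteWithCount(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        int comparisons = 0;
        for (int i = 0; i <= n - m; i++) {
            int j = 0;
            while (j < m) {
                comparisons++; // one character comparison
                if (text.charAt(i + j) != pattern.charAt(j)) {
                    break;
                }
                j++;
            }
            if (j == m) {
                return new int[] {i, comparisons}; // match at i
            }
        }
        return new int[] {-1, comparisons}; // no match
    }
}
```

On the question's example (n = 38, m = 5) the count stays below n + m, because only the four windows that start with 's' cost more than one comparison.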
Brute force pattern matching runs in time O(mn) in the worst case.
Average for most searches of ordinary text take O(m+n), which is very
quick.
Note that you can't have 2 Big-O for the same algorithm.
It seems you are applying a brute-force window-shift algorithm:
Time = (n-m+1)m
worst case is when you have m=1, O(nm)
Best case is when you have m=n, Ω(m)

How do I find the largest sequence in a string that is repeated at least once?

Trying to solve the following problem:
Given a string of arbitrary length, find the longest substring that occurs more than one time within the string, with no overlaps.
For example, if the input string was ABCABCAB, the correct output would be ABC. You couldn't say ABCAB, because that only occurs twice where the two substrings overlap, which is not allowed.
Is there any way to solve this reasonably quickly for strings containing a few thousand characters?
(And before anyone asks, this is not homework. I'm looking at ways to optimize the rendering of Lindenmayer fractals, because they tend to take excessive amounts of time to draw at high iteration levels with a naive turtle graphics system.)
Here's an example for a string of length 11, which you can generalize
Set chunk length to floor(11/2) = 5
Scan the string in chunks of 5 characters, left to right, looking for repeats. There will be 3 comparisons:
Left offset    Right offset
0              5
0              6
1              6
If you found a duplicate you're done. Otherwise reduce the chunk length to 4 and repeat until chunk length goes to zero.
Here's some (obviously untested) pseudocode:
String s
int n = s.length
for int i = floor(n/2); i > 0; i--
    for j = 0; j <= n - (2*i); j++
        for k = j + i; k <= n - i; k++
            if s.substr(j, j+i) == s.substr(k, k+i)
                return s.substr(j, j+i)
return null
There may be an off-by-one error in there, but the approach should be sound (and minimal).
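A runnable Java rendering of the same brute-force idea might look like this (the names LongestRepeat and longestNonOverlappingRepeat are hypothetical). It tries chunk lengths from half the string length downwards and only compares chunks that do not overlap.

```java
final class LongestRepeat {
    // Longest substring that occurs at least twice without overlapping,
    // or null if no character repeats at all.
    static String longestNonOverlappingRepeat(String s) {
        int n = s.length();
        for (int len = n / 2; len > 0; len--) {
            for (int j = 0; j + 2 * len <= n; j++) {
                // the second chunk starts at or after the end of the first one
                for (int k = j + len; k + len <= n; k++) {
                    if (s.regionMatches(j, s, k, len)) {
                        return s.substring(j, j + len);
                    }
                }
            }
        }
        return null;
    }
}
```

For ABCABCAB this returns ABC. It is a plain brute force with three nested loops plus the comparison itself, so it is a baseline to measure against rather than a fast solution.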
It looks like a suffix tree problem. Create the suffix tree, then find the biggest compressed branch with more than one child (it occurs more than once in the original string). The number of letters in that compressed branch should be the size of the biggest repeated substring.
I found something similar here: http://www.coderanch.com/t/370396/java/java/Algorithm-wanted-longest-repeating-substring
Looks like it can be done in O(n).
First we need to define the start position of our substring and its length. Iterate over all possible start positions, then figure out the length by binary search (if you can find a repeated substring of length a, you may find one of a longer length; the function looks monotonic, so binary search should be fine). Checking for an equal substring is then O(N), using KMP or Rabin-Karp (any linear algorithm is fine). Total: O(N*N*log(N)). Is that too much complexity?
The code is something like:
for(int i=0;i<input.length();++i)
{
    int l = i;
    int r = input.length() - 1; // last valid end index
    while(l <= r)
    {
        int middle = l + ((r - l) >> 1);
        // Check if string [i;middle] can be found in the initial string. Should be done in O(n).
        // You need to check the parts of the initial string [0,i-1] and [middle+1;length()-1].
        if (found)
            l = middle + 1;
        else
            r = middle - 1;
    }
}
Make sense?
This type of analysis is often done on genome sequences. Have a look at this paper. It has an efficient implementation (C++) for finding repeats: http://www.complex-systems.com/pdf/17-4-4.pdf
It might be what you are looking for.

Resources