Leetcode Problem First Unique Character In a string - string

I am struggling to understand how the LeetCode solution for the above problem works. Any help on how the post-increment operator works on the value in the array would be great.
class Solution {
    public int firstUniqChar(String s) {
        int[] charArr = new int[26];
        for (int i = 0; i < s.length(); i++) {
            charArr[s.charAt(i) - 'a']++;
        }
        for (int i = 0; i < s.length(); i++) {
            if (charArr[s.charAt(i) - 'a'] == 1) return i;
        }
        return -1;
    }
}
The problem link is here: https://leetcode.com/problems/first-unique-character-in-a-string/submissions/

First, you need to understand there are 26 letters in the English alphabet. So the code creates an array of 26 integers that will hold the count of each letter in the string.
int [] charArr = new int[26];
The count of all a's will be at index 0, the count of all b's at index 1, etc. The default value for int is 0, so this gives an array of 26 zeros to start with.
Each letter has two character codes; one for upper case and one for lower case. The function String.charAt() returns a char but char is an integral type so you can do math on it. When doing math on a char, it uses the char code. So for example:
char c = 'B';
c -= 'A';
System.out.println((int)c); // Will print 1 since char code of 'A' = 65, 'B' = 66
So this line:
charArr[s.charAt(i)-'a']++;
Takes the char at i and subtracts 'a' from it. The range of lower case codes is 97-122. Subtracting 'a' shifts those values to 0-25, which gives the indexes into the array. (Note that this code only handles lower case letters.)
After converting the character to an index, it increments the value at that index. So each item in the array represents the character count of the corresponding letter.
For example, the string "aabbee" will give the array {2, 2, 0, 0, 2, 0, 0....0}
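To make the counting step concrete, here is a minimal Java sketch of the same idea. The class name LetterCountDemo and the helper countLetters are hypothetical names chosen for illustration, and the input is assumed to contain only the lowercase letters 'a' to 'z'.

```java
import java.util.Arrays;

final class LetterCountDemo {
    // Counts how often each lowercase letter 'a'..'z' occurs in s.
    // Assumption: s contains only lowercase English letters.
    static int[] countLetters(String s) {
        int[] counts = new int[26];            // every entry starts at 0
        for (int i = 0; i < s.length(); i++) {
            int index = s.charAt(i) - 'a';     // 'a' -> 0, 'b' -> 1, ..., 'z' -> 25
            counts[index]++;                   // same effect as counts[index] = counts[index] + 1
        }
        return counts;
    }

    // Index of the first character that occurs exactly once, or -1 if there is none.
    static int firstUniqChar(String s) {
        int[] counts = countLetters(s);
        for (int i = 0; i < s.length(); i++) {
            if (counts[s.charAt(i) - 'a'] == 1) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(countLetters("aabbee")));
        System.out.println(firstUniqChar("leetcode")); // prints 0
    }
}
```

The post-increment on an array element simply adds one to the value stored at that index; the old value that the expression evaluates to is discarded here.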

Related

Counting efficiently pairs of strings, which together contain all the vowels

I have solved the following question but still need to improve the performance:
I have N words with me. For each valid i, word i is described by a
string Di containing only lowercase vowels, i.e. characters 'a', 'e',
'i', 'o', 'u'.
What is the total number of (unordered) pairs of words such that when
concatenated, they contain all the vowels?
My C++ code works using a bit array to represent the vowels present in a word. Then I check all pairs of strings by combining their bit array looking if all the vowel bits are set. I do this by comparing the combination to a bit array 'complete' which has all bits set corresponding to the vowels:
#include<bits/stdc++.h>
using namespace std;
int main()
{
int t;
cin>>t; // number of different test cases
while(t--)
{
int n; // number of strings
cin>>n;
int con_s[n];
for(int i=0;i<n;i++)
{
string s;
cin>>s;
con_s[i]=0; // converting vowels present in string into a bit array
for(int j=0;j<s.length();j++)
{
con_s[i] = con_s[i] | (1<<(s[j]-'a'));
}
}
int complete = 0; // bit array corresponding to all possible vowels
complete = (1<<('a'-'a')) | (1<<('e'-'a')) | (1<<('i'-'a')) | (1<<('o'-'a')) | (1<<('u'-'a'));
// cout<<complete;
int count = 0;
for(int i=0;i<n-1;i++) // check the pairs
{
for(int j=i+1;j<n;j++)
{
if((con_s[i] | con_s[j])==complete)count++;
}
}
cout<<count<<"\n";
}
return 0;
}
Although the bit array uses very fast bit manipulation, it appears that my algorithm is not efficient enough for larger sets of input strings. Can anyone suggest an efficient solution?
Additional information
Test Case:
Input:
1
3
aaooaoaooa
uiieieiieieuuu
aeioooeeiiaiei
Result: 2
Explanation:
The 2 pairs (1 and 2) and (2 and 3) contain all 5 vowels when concatenated, while pair (1 and 3) do not match the criteria since the concatenation does not contain 'u'. Hence the result is 2.
Constraints:
1≤T≤1,000
1≤N≤10^5
1≤|Di|≤1,000 for each valid i
the sum of all |Di| over all test cases does not exceed 3⋅10^7
Suppose you have a very large number of strings. Your algorithm will compare every string with every other string, and that's a terrible number of iterations.
Radical improvement:
New approach
Build a map that associates a string of ordered unique vowels (e.g. "ae") to a list of all the string found that contains exactly these unique vowels, whatever the number of repetition and the order. For example:
ao -> aaooaoaooa, aoa, aaooaaooooooo (3 words)
uie -> uiieieiieieuuu, uuuuiiiieeee (2 words)
aeio -> aeioooeeiiaiei (1 word)
Of course, that's a lot of strings, so in your code, you would use your bitmap rather than the string of ordered unique vowels. Also note that you don't want to produce the list of combined strings, but only their count. So you don't need the list of all string occurrences, but just the count of strings matching each bitmap:
16385 -> 3
1048848 -> 2
16657 -> 1
Then look at the winning combinations between the mapping existing indexes, like you did it. For a large list of strings, you would have a much smaller list of mapping indexes, so that would be a significant improvement.
For each winning combination, multiply the size of the first list of strings by the size of the second list of strings and add the product to your count.
16385 1048848 is complete -> 3 x 2 = 6 combinations
1048848 16657 is complete -> 2 x 1 = 2 combinations
---
8 potential combinations
What's the improvement ?
These combinations were found by analysing 3x2 bitmaps, rather than looking at 6x5 bitmaps corresponding to the unique strings. A gain by a significant order of magnitude if you have larger number of strings.
More generally, since there are 5 vowels and each word contains at least one of them, there can be at most 2^5 - 1 = 31 different bitmaps, and therefore at most C(31,2) = 465 pairs of bitmaps to check, whether you have 100 strings or 10 million strings as input. Would I be correct to state that it's roughly O(n)?
Possible implementation:
map<int, long long> mymap;   // bitmap of vowels -> number of words with exactly that bitmap
int n;
cin>>n;
for(int i=0;i<n;i++) {
    string s;
    cin>>s;
    int bm=0;
    for(size_t j=0;j<s.length();j++)
        bm |= (1<<(s[j]-'a'));
    mymap[bm]++;
}
int complete = (1<<('a'-'a')) | (1<<('e'-'a')) | (1<<('i'-'a')) | (1<<('o'-'a')) | (1<<('u'-'a'));
long long count = 0;         // the number of pairs can exceed the range of a 32-bit int for N = 10^5
int comparisons = 0;
for (auto i=mymap.begin(); i!=mymap.end(); i++) {
    auto j=i;
    for(++j;j!=mymap.end();j++) {
        comparisons++;
        if((i->first | j->first)==complete) {
            count += i->second * j->second;
            cout << i->first <<" "<<j->first<<" :"<<i->second<<" "<<j->second<<endl;
        }
    }
}
auto special = mymap.find(complete);  // special case: all strings having all letters
if (special!=mymap.end()) {           // can be combined with themselves as well
    count += special->second * (special->second -1) / 2;
}
cout<<"Result: "<<count<<" (found in "<<comparisons<<" comparisons)\n";
Online demo with 4 examples (only strings with all letters, your initial example, my example above, and an example with couple of more strings)
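A more compact variant of the same counting idea can be sketched in Java. This is only an illustration under stated assumptions: the class VowelPairCounter and its methods are hypothetical names, every word is assumed to contain only the letters a, e, i, o, u, and the vowels are packed into 5 bits (so the complete mask is 31) instead of the 1<<(c-'a') layout used in the C++ code. With that packing, a plain array of 32 counters can replace the map.

```java
final class VowelPairCounter {
    private static final String VOWELS = "aeiou";

    // 5-bit mask: bit p is set when VOWELS.charAt(p) occurs in the word.
    // Assumption: the word contains only the characters a, e, i, o, u.
    static int maskOf(String word) {
        int mask = 0;
        for (int i = 0; i < word.length(); i++) {
            mask |= 1 << VOWELS.indexOf(word.charAt(i));
        }
        return mask;
    }

    // Number of unordered pairs of words whose concatenation contains all five vowels.
    static long countPairs(String[] words) {
        long[] freq = new long[32];            // freq[mask] = number of words with that mask
        for (String w : words) freq[maskOf(w)]++;
        final int complete = 31;               // binary 11111: all five vowels present
        long pairs = 0;                        // long, because the count can exceed int range
        for (int a = 1; a < 32; a++) {
            for (int b = a + 1; b < 32; b++) {
                if ((a | b) == complete) pairs += freq[a] * freq[b];
            }
        }
        // Words that already contain every vowel can also be paired with each other.
        pairs += freq[complete] * (freq[complete] - 1) / 2;
        return pairs;
    }

    public static void main(String[] args) {
        String[] words = {"aaooaoaooa", "uiieieiieieuuu", "aeioooeeiiaiei"};
        System.out.println(countPairs(words)); // prints 2
    }
}
```

After the single pass over the input, the pair loop always runs over the same fixed set of masks, regardless of how many words were read.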

How to determine string S can be made from string T by deleting some characters, but at most K successive characters

Sorry for the long title :)
In this problem, we have string S of length n, and string T of length m. We can check whether S is a subsequence of string T in time complexity O(n+m). It's really simple.
I am curious about: what if we can delete at most K successive characters? For example, if K = 2, we can make "ab" from "accb", but not from "abcccb". I want to check if it's possible very fast.
I could only find obvious O(nm): check if it's possible for every suffix pairs in string S and string T. I thought maybe greedy algorithm could be possible, but if K = 2, the case S = "abc" and T = "ababbc" is a counterexample.
Is there any fast solution to solve this problem?
(Update: I've rewritten the opening of this answer to include a discussion of complexity and to discuss some alternative methods and potential risks.)
(Short answer, the only real improvement above the O(nm) approach that I can think of is to observe that we don't usually need to compute all n times m entries in the table. We can calculate only those cells we need. But in practice it might be very good, depending on the dataset.)
Clarify the problem: We have a string S of length n, and a string T of length m. The maximum allowed gap is k - this gap is to be enforced at the beginning and end of the string also. The gap is the number of unmatched characters between two matched characters - i.e. if the letters are adjacent, that is a gap of 0, not 1.
Imagine a table with n+1 rows and m+1 columns.
0 1 2 3 4 ... m
--------------------
0 | ? ? ? ? ? ?
1 | ? ? ? ? ? ?
2 | ? ? ? ? ? ?
3 | ? ? ? ? ? ?
... |
n | ? ? ? ? ? ?
At first, we could define that the entry in row r and column c is a binary flag that tells us whether the first r characters of S are a valid k-subsequence of the first c characters of T. (Don't worry yet how to compute these values, or even whether these values are useful, we just need to define them clearly first.)
However, this binary-flag table isn't very useful. It's not possible to easily calculate one cell as a function of nearby cells. Instead, we need each cell to store slightly more information. As well as recording whether the relevant strings are a valid subsequence, we need to record the number of consecutive unmatched characters at the end of our substring of T (the substring with c characters). For example, if the first r=2 characters of S are "ab" and the first c=3 characters of T are "abb", then there are two possible matches here: The first characters obviously match with each other, but the b can match with either of the latter b. Therefore, we have a choice of leaving one or zero unmatched bs at the end. Which one do we record in the table?
The answer is that, if a cell has multiple valid values, then we take the smallest one. It's logical that we want to make life as easy as possible for ourselves while matching the remainder of the string, and therefore that the smaller the gap at the end, the better. Be wary of other incorrect optimizations: we do not want to match as many characters as possible, or as few characters as possible. That can backfire. But it is logical, for a given pair of strings S,T, to find the match (if there are any valid matches) that minimizes the gap at the end.
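One other observation is that if the string S is much shorter than T, then it cannot match. This depends on k also, obviously. The first r characters of S can cover at most r + (r+1)k characters of T (the r matched characters plus up to k unmatched characters before, between, and after them); if this is less than c, then we can easily mark (r,c) as -1.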
(Any other optimization statements that can be made?)
We do not need to compute all the values in this table. The number of different possible states is k+3. They start off in an 'undefined' state (?). If a matching is not possible for the pair of (sub)strings, the state is -. If a matching is possible, then the score in the cell will be a number between 0 and k inclusive, recording the smallest possible number of unmatched consecutive characters at the end. This gives us a total of k+3 states.
We are interested only in the entry in the bottom right of the table. If f(r,c) is the function that computes a particular cell, then we are interested only in f(n,m). The value for a particular cell can be computed as a function of the values nearby. We can build a recursive algorithm that takes r and c as input and performs the relevant calculations and lookups in term of the nearby values. If this function looks up f(r,c) and finds a ?, it will go ahead and compute it and then store the answer.
It is important to store the answer as the algorithm may query the same cell many times. But also, some cells will never be computed. We just start off attempting to calculate one cell (the bottom right) and just lookup-and-calculate-and-store as necessary.
This is the "obvious" O(nm) approach. The only optimization here is the observation that we don't need to calculate all the cells, therefore this should bring the complexity below O(nm). Of course, with really nasty datasets, you may end up calculating almost all of the cells! Therefore, it's difficult to put an official complexity estimate on this.
Finally, I should say how to compute a particular cell f(r,c):
If r==0 and c <= k, then f(r,c) = 0. An empty string can match any string with up to k characters in it.
If r==0 and c > k, then f(r,c) = -1. Too long for a match.
There are only two other ways a cell can have a successful state. We first try:
If S[r]==T[c] and f(r-1,c-1) != -1, then f(r,c) = 0. This is the best case - a match with no trailing gap.
If that didn't work, we try the next best thing. If f(r,c-1) != -1 and f(r,c-1) < k, then f(r,c) = f(r,c-1)+1.
If neither of those work, then f(r,c) = -1.
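The rules above can be sketched as a table-filling routine in Java. This is a minimal illustration, not the answer's own code: the class and method names are hypothetical, the table is filled row by row (so every cell is computed, unlike the lazy variant described above), and only two rows are kept in memory. Row 0 follows the rule exactly as stated, and the method only reports whether f(n,m) differs from -1.

```java
final class KGapSubsequence {
    // Returns true when s can be obtained from t by deleting characters such that
    // no run of deleted characters (leading, inner, or trailing) is longer than k.
    static boolean isKSubsequence(String s, String t, int k) {
        int n = s.length(), m = t.length();
        int[] prev = new int[m + 1];
        // Row 0: an empty prefix of s matches any prefix of t with at most k characters.
        for (int c = 0; c <= m; c++) prev[c] = (c <= k) ? 0 : -1;
        for (int r = 1; r <= n; r++) {
            int[] cur = new int[m + 1];
            cur[0] = -1;                       // a non-empty prefix cannot match an empty one
            for (int c = 1; c <= m; c++) {
                if (s.charAt(r - 1) == t.charAt(c - 1) && prev[c - 1] != -1) {
                    cur[c] = 0;                // best case: match with no trailing gap
                } else if (cur[c - 1] != -1 && cur[c - 1] < k) {
                    cur[c] = cur[c - 1] + 1;   // extend the trailing gap by one character
                } else {
                    cur[c] = -1;               // no valid matching for this cell
                }
            }
            prev = cur;
        }
        return prev[m] != -1;
    }

    public static void main(String[] args) {
        System.out.println(isKSubsequence("ab", "accb", 2));    // prints true
        System.out.println(isKSubsequence("ab", "abcccb", 2));  // prints false
        System.out.println(isKSubsequence("abc", "ababbc", 2)); // prints true
    }
}
```

This version always costs O(nm) time and O(m) extra space; the memoized, compute-on-demand variant described above trades that predictability for the chance to skip cells.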
The rest of this answer is my initial, Haskell-based approach. One advantage of it is that it 'understands' that it needn't compute every cell, only computing cells where necessary. But it suffers from the inefficiency of calculating one cell many times.
*Also note that the Haskell approach effectively approaches the problem in mirror image: it tries to build matches from the end substrings of S and T, with a minimal leading bunch of unmatched characters. I don't have the time to rewrite it in its 'mirror image' form!
A recursive approach should work. We want a function that will take three arguments, int K, String S, and String T. However, we don't just want a boolean answer as to whether S is a valid k-subsequence of T.
For this recursive approach, if S is a valid k-subsequence, we also want to know about the best subsequence possible by returning how few characters from the start of T can be dropped. We want to find the 'best' subsequence. If a k-subsequence is not possible for S and T, then we return -1, but if it is possible then we want to return the smallest number of characters we can pull from T while retaining the k-subsequence property.
helloworld
  l    r d
This is a valid 4-subsequence, since the biggest gap has four characters (lowo). This is the best subsequence because it leaves a gap of just two characters at the start (he). Alternatively, here is another valid k-subsequence with the same strings, but it's not as good because it leaves a gap of three at the start:
helloworld
   l   r d
This is written in Haskell, but it should be easy enough to rewrite in any other language. I'll break it down in more detail below.
best :: Int -> String -> String -> Int
-- K S T return
-- where len(S) <= len(T)
best k [] t_string -- empty S is a subsequence of anything!
| length(t_string) <= k = length(t_string)
| length(t_string) > k = -1
best k sss@(s:ss) [] = (-1) -- if T is empty, and S is non-empty, then no subsequence is possible
best k sss@(s:ss) tts@(t:ts) -- both are non-empty. Various possibilities:
| s == t && best k ss ts /= -1 = 0 -- if s==t, and if best k ss ts != -1, then we have the best outcome
| best k sss ts /= -1
&& best k sss ts < k = 1+ (best k sss ts) -- this is the only other possibility for a valid k-subsequence
| otherwise = -1 -- no more options left, return -1 for failure.
A line-by-line analysis:
(A comment in Haskell starts with --)
best :: Int -> String -> String -> Int
A function that takes an Int, and two Strings, and that returns an Int. The return value is to be -1 if a k-subsequence is not possible. Otherwise it will return an integer between 0 and K (inclusive) telling us the smallest possible gap at the start of T.
We simply deal with the cases in order.
best k [] t -- empty S is a subsequence of anything!
| length(t) <= k = length(t)
| length(t) > k = -1
Above, we handle the case where S is empty ([]). This is simple, as an empty string is always a valid subsequence. But to test if it is a valid k-subsequence, we must calculate the length of T.
best k sss@(s:ss) [] = (-1)
-- if T is empty, and S is non-empty, then no subsequence is possible
That comment explains it. This leaves us with the situations where both strings are non-empty:
best k sss@(s:ss) tts@(t:ts) -- both are non-empty. Various possibilities:
| s == t && best k ss ts /= -1 = 0 -- if s==t, and if best k ss ts != -1, then we have the best outcome
| best k sss ts /= -1
&& best k sss ts < k = 1+ (best k sss ts) -- this is the only other possibility for a valid k-subsequence
| otherwise = -1 -- no more options left, return -1 for failure.
tts@(t:ts) matches a non-empty string. The name of the string is tts. But there is also a convenient trick in Haskell to allow you to give names to the first letter in the string (t) and the remainder of the string (ts). Here ts should be read aloud as the plural of t - the s suffix here means 'plural'. We say we have a t and some ts, and together they make the full (non-empty) string.
That last block of code deals with the case where both strings are non-empty. The two strings are called sss and tts. But to save us the hassle of writing head sss and tail sss to access the first letter, and the string remainder, of the string, we simply use @(s:ss) to tell the compiler to store those quantities into variables s and ss. If this was C++ for example, you'd get the same effect with char s = sss[0]; as the first line of your function.
The best situation is that the first characters match (s==t) and the remainders of the strings form a valid k-subsequence (best k ss ts /= -1). This allows us to return 0.
The only other possibility for success is if the current complete string (sss) is a valid k-subsequence of the remainder of the other string (ts). We add 1 to this and return, but make an exception if the gap would grow too big.
It's very important not to change the order of those last five lines. They are ordered in decreasing order of how 'good' the score is. We want to test for, and return, the very best possibilities first.
Naive recursive solution. Bonus := return value is the number of ways that the string can be matched.
#include <stdio.h>
#include <string.h>
unsigned skipneedle(char *haystack, char *needle, unsigned skipmax)
{
unsigned found,skipped;
// fprintf(stderr, "skipneedle(%s,%s,%u)\n", haystack, needle, skipmax);
if ( !*needle) return strlen(haystack) <= skipmax ? 1 : 0 ;
found = 0;
for (skipped=0; skipped <= skipmax ; haystack++,skipped++ ) {
if ( !*haystack ) break;
if ( *haystack == *needle) {
found += skipneedle(haystack+1, needle+1, skipmax);
}
}
return found;
}
int main(void)
{
char *ab = "ab";
char *test[] = {"ab" , "accb" , "abcccb" , "abcb", NULL}
, **cpp;
for (cpp = test; *cpp; cpp++ ) {
printf( "[%s,%s,%u]=%u \n"
, *cpp, ab, 2
, skipneedle(*cpp, ab, 2) );
}
return 0;
}
An O(p*n) solution where p = number of subsequences possible of S in T.
Scan the string T and maintain a list of possible subsequences of S that would have
1. Index of last character found and
2. Number of characters to be deleted found
Continue to update this list at each character of T.
Not sure if this is what you're asking for, but you could create a list of characters from each String, and search for instances of the one list in the other, then if(list2.length-K > list1.length) return false.
Following is a proposed algorithm : - O(|T|*k) average case
1> scan T and store character indices in Hash Table :-
eg. S = "abc" T = "ababbc"
Symbol table entries : -
a = 1 3
b = 2 4 5
c = 6
2.> as we know isValidSub(S,T) = isValidSub(S(0,j),T) && (isValidSub(S(j+1,N),T)||....isValidSub(S(j+K,T),T))
a.> we will use the bottom up approach to solve above problem
b.> we will maintain an valid array Valid(len(S)) where each record points to a Hash Table (Explained as we go along solving further)
c.> Start from the last element of S, Look up for the indices stored corresponding to the character in Symbol Table
eg. in above example S[last] = "c"
in Symbol Table c = 6
Now we put records like (5,6) , (4,6) ,.... (6-k-1,6) into Hash table at Valid(last)
Explanation : - as s(6,len(S)) is valid subsequence hence s(0,6-i) ++ s(6,len(S)) (where i is in range(1,k+1)) is also valid subsequence provided s(0,6-i) is valid subsequence.
3.> start filling up Valid Array from last to 0 element : -
a.> take a indice from hash table entry corresponding to S[j] where j is current indice of Valid Array we are analysing.
b.> Check whether indice is in Valid(j+1) if less then add (indice-i,indice) where i in range(1,k+1) into Valid(j) Hash Table
example:-
S = "abc" T = "ababbc"
iteration 1 :
j = len(S) = 3
S[3] = 'c'
Symbol Table : c = 6
add (5,6),(4,6),(3,6) as K = 2 in Valid(j)
Valid(3) = {(5,6),(4,6),(3,6)}
j = 2
iteration 2 :
S[j] = 'b'
Symbol table: b = 2 4 5
Look up 2 in Valid(3) => not found => skip
Look up 4 in Valid(3) => found => add Valid(2) = {(3,4),(2,4),(1,4)}
Look up 5 in Valid(3) => found => add Valid(2) = {(3,4),(2,4),(1,4),(4,5)}
j = 1
iteration 3:
S[j] = "a"
Symbol Table : a = 1 3
Look up 1 in Valid(2) => not found
Look up 3 in Valid(2) => found => stop as it is last iteration
END
As 3 is found in Valid(2), there exists a valid subsequence starting at index 3 in T
Start = 3
4.> Reconstruct the solution moving downwards in Valid Array :-
example :
Start = 3
Look up 3 in Valid(2) => found (3,4)
Look up 4 in Valid(3) => found (4,6)
END
reconstructed solution (3,4,6) which is indeed valid subsequence
Remember (3,5,6) can also be a solution if we had added (3,5) instead of (3,4) in that iteration
Analysis of Time complexity & Space complexity : -
Time Complexity :
Step 1 : Scan T = O(|T|)
Step 2 : fill up all Valid entries O(|T|*k) using HashTable lookup is aprox O(1)
Step 3 : Reconstruct solution O(|S|)
Overall average case Time : O(|T|*k)
Space Complexity:
Symbol table = O(|T|+|S|)
Valid table = O(|T|*k) can be improved with optimizations
Overall space = O(|T|*k)
Java Implementation: -
import java.util.ArrayList;
import java.util.HashMap;

public class Subsequence {
private ArrayList[] SymbolTable = null;
private HashMap[] Valid = null;
private String S;
private String T;
public ArrayList<Integer> getSubsequence(String S,String T,int K) {
this.S = S;
this.T = T;
if(S.length()>T.length())
return(null);
S = S.toLowerCase();
T = T.toLowerCase();
SymbolTable = new ArrayList[26];
for(int i=0;i<26;i++)
SymbolTable[i] = new ArrayList<Integer>();
char[] s1 = T.toCharArray();
char[] s2 = S.toCharArray();
//Calculate Symbol table
for(int i=0;i<T.length();i++) {
SymbolTable[s1[i]-'a'].add(i);
}
/* for(int j=0;j<26;j++) {
System.out.println(SymbolTable[j]);
}
*/
Valid = new HashMap[S.length()];
for(int i=0;i<S.length();i++)
Valid[i] = new HashMap<Integer,Integer >();
int Start = -1;
for(int j = S.length()-1;j>=0;j--) {
int index = s2[j] - 'a';
//System.out.println(index);
for(int m = 0;m<SymbolTable[index].size();m++) {
if(j==S.length()-1||Valid[j+1].containsKey(SymbolTable[index].get(m))) {
int value = (Integer)SymbolTable[index].get(m);
if(j==0) {
Start = value;
break;
}
for(int t=1;t<=K+1;t++) {
Valid[j].put(value-t, value);
}
}
}
}
/* for(int j=0;j<S.length();j++) {
System.out.println(Valid[j]);
}
*/
if(Start != -1) { //Solution exists
ArrayList subseq = new ArrayList<Integer>();
subseq.add(Start);
int prev = Start;
int next;
// Reconstruct solution
for(int i=1;i<S.length();i++) {
next = (Integer)Valid[i].get(prev);
subseq.add(next);
prev = next;
}
return(subseq);
}
return(null);
}
public static void main(String[] args) {
Subsequence sq = new Subsequence();
System.out.println(sq.getSubsequence("abc","ababbc", 2));
}
}
Consider a recursive approach: let int f(int i, int j) denote the minimum possible gap at the beginning for S[i...n] matching T[j...m]. f returns -1 if such matching does not exist. Here's the implementation of f:
int f(int i, int j){
if(j == m){
if(i == n)
return 0;
else
return -1;
}
if(i == n){
return m - j;
}
if(S[i] == T[j]){
int tmp = f(i + 1, j + 1);
if(tmp >= 0 && tmp <= k)
return 0;
}
int rest = f(i, j + 1);
return rest < 0 ? -1 : rest + 1; // propagate failure instead of turning -1 into 0
}
If we convert this recursive approach to a dynamic programming approach, then we can have a time complexity of O(nm).
Here's an implementation that usually* runs in O(N) and takes O(m) space, where m is length(S).
It uses the idea of a surveyor's chain:
Imagine a series of poles linked by chains of length k.
Anchor the first pole at the beginning of the string.
Now carry the next pole forward until you find a character match.
Place that pole. If there is slack, move on to the next character;
else the previous pole has been dragged forward, and you need to go back
and move it to the next nearest match.
Repeat until you reach the end or run out of slack.
#include <stdlib.h> /* calloc, free */
#include <string.h> /* strlen */
typedef struct chain_t{
int slack;
int pole;
} chainlink;
int subsequence_k_impl(char* t, char* s, int k, chainlink* link, int len)
{
char* match=s;
int extra = k; //total slack in the chain
//for all chars to match, including final null
while (match<=s+len){
//advance until we find spot for this post or run out of chain
while (t[link->pole] && t[link->pole]!=*match ){
link->pole++; link->slack--;
if (--extra<0) return 0; //no more slack, can't do it.
}
//if we ran out of ground, it's no good
if (t[link->pole] != *match) return 0;
//if this link has slack, go to next pole
if (link->slack>=0) {
link++; match++;
//if next pole was already placed,
while (link[-1].pole < link->pole) {
//recalc slack and advance again
extra += link->slack = k-(link->pole-link[-1].pole-1);
link++; match++;
}
//if not done
if (match<=s+len){
//currrent pole is out of order (or unplaced), move it next to prev one
link->pole = link[-1].pole+1;
extra+= link->slack = k;
}
}
//else drag the previous pole forward to the limit of the chain.
else if (match>=s) {
int drag = (link->pole - link[-1].pole -1)- k;
link--;match--;
link->pole+=drag;
link->slack-=drag;
}
}
//all poles planted. good match
return 1;
}
int subsequence_k(char* t, char* s, int k)
{
int l = strlen(s);
if (strlen(t)>(l+1)*(k+1))
return -1; //easy exit
else {
chainlink* chain = calloc(sizeof(chainlink),l+2);
chain[0].pole=-1; //first pole is anchored before the string
chain[0].slack=0;
chain[1].pole=0; //start searching at first char
chain[1].slack=k;
l = subsequence_k_impl(t,s,k,chain+1,l);
l=l?chain[1].pole:-1; //pos of first match or -1;
free(chain);
}
return l;
}
* I'm not sure of the big-O. I initially thought it was something like O(km+N). In testing, it averages less than 2N for good matches and less than N for failed matches.
...but.. there is a strange degenerate case. For random strings selected from an alphabet of size A, it gets much slower when k = 2A+1. Even this case it's better than O(Nm), and the performance returns to O(N) when k is increased or decreased slightly. Gist Here if anyone is curious.

Pattern matching a string in linear time

Given two strings S and T, where the T is the pattern string. Find if any scrambled form of pattern string exists as SubString in the string S and if present return the start index.
Example:
String S: abcdef
String T: efd
String S has "def", a combination of search string T: "efd".
I have found a solution with a run time of O(m*n). I am working on a linear time solution where I used two HashMaps (a static one, maintained for String T, and a dynamic copy of it used for checking the current substring of S). I'd start checking at the next character where it fails. But this runs in O(m*n) in the worst case.
I'd like to get some pointers to make it work in O(m+n) time. Any help would be appreciated.
First of all, I would like to know boundaries for string S length (m) and pattern T length (n).
There exist one general idea but complexity of the solution based on it depends on the pattern length. Complexity varies from O(m) to O(m*n^2) for short patterns with length<=100 and O(n) for long patterns.
Fundamental theorem of arithmetic states that every integer number can be uniquely represented as a product of prime numbers.
Idea: I assume your alphabet is the English letters, so the alphabet size is 26. Let's replace the first letter with the first prime, the second letter with the second prime, and so on. I mean the following replacement: a -> 2, b -> 3, c -> 5, d -> 7, e -> 11, and so on.
Let's denote the product of the primes corresponding to the letters of some string as primeProduct(string). For example, primeProduct(z) will be 101, as 101 is the 26th prime number; primeProduct(abc) will be 2*3*5=30, and primeProduct(cba) will also be 5*3*2=30.
Why do we choose prime numbers? If we replaced a -> 2, b -> 3, c -> 4, we wouldn't be able to decipher, for example, 4: is it "c" or "aa"?
Solution for the short patterns case:
For the string S, we should calculate in linear time prime product for all prefixes. I mean we have to create array A such that A[0] = primeProduct(S[0]), A[1] = primeProduct(S[0]S[1]), A[N] = primeProduct(S). Sample implementation:
A[0] = getPrime(S[0]);
for(int i=1;i<S.length;i++)
A[i]=A[i-1]*getPrime(S[i]);
Searching for pattern T: calculate primeProduct(T). For every 'window' in S that has the same length as the pattern, compare its primeProduct with primeProduct(pattern). If currentWindow is equal to the pattern, or currentWindow is a scrambled form (anagram) of the pattern, the primeProducts will be the same.
Important note! We have prepared array A for fast computing primeProduct for any substring of S. primeProduct of(S[i],S[i+1],...S[j]) = getPrime(S[i])*...*getPrime(S[j]) = A[j]/A[i-1];
Complexity: if the pattern length is <= 9, even 'zzzzzzzzz' gives 101^9 <= MAX_LONG_INT. All calculations fit in the standard long type and the complexity is O(N)+O(M), where N is for calculating the primeProduct of the pattern and M is for iterating over all windows in S. If length <= 100, you have to add the complexity of multiplying/dividing long numbers, which is why the complexity becomes O(m*n^2): the length of 101^length is O(N), and mul/div of such long numbers is O(N^2).
For the long patterns with length>=1000 it's better to store some hash map(prime,degree). Array of prefixes will become array of hash maps and A[j]/A[i-1] trick will become differenceBetween(A[j] and A[i-1] hashmaps's key sets).
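For the short-pattern case, the prime-product idea can be sketched in Java as follows. Note that this sketch swaps the prefix array A for a rolling window product, because prefix products over all of S can overflow a 64-bit long once S is longer than a handful of characters, whereas a window of at most 9 letters always fits. The class and method names are hypothetical, and both strings are assumed to contain only lowercase letters 'a' to 'z'.

```java
import java.util.ArrayList;
import java.util.List;

final class PrimeProductSearch {
    // The first 26 primes: 'a' -> 2, 'b' -> 3, ..., 'z' -> 101.
    private static final int[] PRIMES = {
        2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
        43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101
    };

    static long prime(char ch) {
        return PRIMES[ch - 'a'];
    }

    static long primeProduct(String text) {
        long product = 1;
        for (int i = 0; i < text.length(); i++) product *= prime(text.charAt(i));
        return product;
    }

    // Start indices of all windows of s that are scrambled forms of pattern.
    // Assumption: pattern has at most 9 letters, so every product fits in a long.
    static List<Integer> findScrambled(String s, String pattern) {
        int n = pattern.length();
        if (n > 9) throw new IllegalArgumentException("pattern too long for long arithmetic");
        List<Integer> starts = new ArrayList<>();
        if (n == 0 || n > s.length()) return starts;
        long target = primeProduct(pattern);
        long window = primeProduct(s.substring(0, n));
        if (window == target) starts.add(0);
        for (int i = n; i < s.length(); i++) {
            // Divide out the letter that leaves first (exact division), then
            // multiply in the letter that enters, so the value never exceeds 101^9.
            window = window / prime(s.charAt(i - n)) * prime(s.charAt(i));
            if (window == target) starts.add(i - n + 1);
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(findScrambled("abcdef", "efd")); // prints [3]
    }
}
```

Because of unique factorization, two windows have the same product exactly when they contain the same letters with the same multiplicities, so no further verification step is needed.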
Would this JavaScript example be linear time?
<script>
function matchT(t,s){
var tMap = [], answer = []
//map the character count in t
for (var i=0; i<t.length; i++){
var chr = t.charCodeAt(i)
if (tMap[chr]) tMap[chr]++
else tMap[chr] = 1
}
//traverse string
for (var i=0; i<s.length; i++){
if (tMap[s.charCodeAt(i)]){
var start = i, j = i + 1, tmp = []
tmp[s.charCodeAt(i)] = 1
while (tMap[s.charCodeAt(j)]){
var chr = s.charCodeAt(j++)
if (tmp[chr]){
if (tMap[chr] > tmp[chr]) tmp[chr]++
else break
}
else tmp[chr] = 1
}
if (areEqual (tmp,tMap)){
answer.push(start)
i = j - 1
}
}
}
return answer
}
//function to compare arrays
function areEqual(arr1,arr2){
if (arr1.length != arr2.length) return false
for (var i in arr1)
if (arr1[i] != arr2[i]) return false
return true
}
</script>
Output:
console.log(matchT("edf","ghjfedabcddef"))
[3, 10]
If the alphabet is not too large (say, ASCII), then there is no need to use a hash to take care of strings.
Just use a big array which is of the same size as the alphabet, and the existence checking becomes O(1). Thus the whole algorithm becomes O(m+n).
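One way to realize this array-based idea is a fixed-size sliding window over S, sketched here in Java. The names are hypothetical and the characters are assumed to be ASCII (codes below 128); this is a common formulation of the technique, not necessarily the exact procedure the answer had in mind.

```java
import java.util.ArrayList;
import java.util.List;

final class AnagramWindowSearch {
    // Start indices of all substrings of s that are scrambled forms of t.
    // Assumption: both strings contain only ASCII characters.
    static List<Integer> findAnagramStarts(String s, String t) {
        List<Integer> starts = new ArrayList<>();
        int m = s.length(), n = t.length();
        if (n == 0 || n > m) return starts;
        int[] need = new int[128];             // need[c] > 0: the window still lacks c
        for (int i = 0; i < n; i++) need[t.charAt(i)]++;
        int missing = n;                       // pattern characters not yet covered by the window
        for (int right = 0; right < m; right++) {
            // The character at 'right' enters the window.
            if (need[s.charAt(right)]-- > 0) missing--;
            // Once the window is longer than n, the oldest character leaves it.
            if (right >= n && need[s.charAt(right - n)]++ >= 0) missing++;
            if (missing == 0) starts.add(right - n + 1);
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(findAnagramStarts("abcdef", "efd")); // prints [3]
    }
}
```

Each character of S enters and leaves the window once, and each step touches a constant number of array cells, which gives the O(m+n) bound.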
Let us consider for the given example,
String S: abcdef
String T: efd
Create a HashSet which consists of the characters present in the Substring T. So, the set consists of {e, f, d}.
Generate a label for the Substring T: 1e1f1d. (number of occurrences of each character + the character itself, can be done using a technique similar to counting sort)
Now we have to generate labels for the input of the sub-string's length.
Let us start from the first position, which has character a. Since it is not present we do not create any sub-string and move to the next character b. Similarly, to character c and then stop at d.
Since d is present in the HashSet start generating labels(of the sub-string length) for each time the character appears. We can do this in different function to avoid clearing the count array(doing this reduces the complexity from O(m*n) to O(m+n)). If at any point the input string does not consists of the Substring T we can start the label generation from the next position(since the position till the break occurred cannot be a part of the anagram).
So, by generating the labels we can solve the problem in linear O(m+n) time complexity.
m: length of the input string,
n: length of the sub string.
I used the code below for the pattern-searching question on GFG; it was accepted on all test cases and runs in linear time.
// { Driver Code Starts
import java.util.*;
class Implement_strstr
{
public static void main(String args[])
{
Scanner sc = new Scanner(System.in);
int t = sc.nextInt();
sc.nextLine();
while(t>0)
{
String line = sc.nextLine();
String a = line.split(" ")[0];
String b = line.split(" ")[1];
GfG g = new GfG();
System.out.println(g.strstr(a,b));
t--;
}
}
}// } Driver Code Ends
class GfG
{
//Function to locate the first occurrence of the string d in the string a.
//Uses a Knuth-Morris-Pratt failure table so that a mismatch falls back inside
//the pattern without skipping any character of the text.
int strstr(String a, String d)
{
if(d.isEmpty()) return 0;
//lps[k] = length of the longest proper prefix of d[0..k] that is also its suffix
int[] lps=new int[d.length()];
int len=0;
for(int i=1;i<d.length();)
{
if(d.charAt(i)==d.charAt(len))
{
len++;
lps[i]=len;
i++;
}
else if(len>0) len=lps[len-1];
else i++;
}
//scan a once; t is the number of characters of d matched so far
int t=0;
for(int i=0;i<a.length();)
{
if(a.charAt(i)==d.charAt(t))
{
i++;
t++;
if(t==d.length()) return i-t;
}
else if(t>0) t=lps[t-1];
else i++;
}
return -1;
}
}
Here is the link to the question https://practice.geeksforgeeks.org/problems/implement-strstr/1

Given a palindromic string, in how many ways can we convert it to a non-palindrome by removing one or more characters from it? [closed]

Given a palindromic string, in how many ways can we convert it to a non-palindrome by removing one or more characters from it?
For example if the string is "b99b". Then we can do it in 6 ways,
i) Remove 1st character : "99b"
ii) Remove 1st, 2nd characters : "9b"
iii) Remove 1st, 3rd characters : "9b"
iv) Remove 2nd, 4th characters : "b9"
v) Remove 3rd, 4th characters : "b9"
vi) Remove 4th character : "b99"
How to approach this one?
PS:Two ways are considered different if there exists an i such that character at index i is removed in one way and not removed in another.
There's an $O(n^2)$ dynamic programming algorithm for counting the number of palindromic subsequences of a string; you can use that to count the number of non-palindromic subsequences by subtracting the number of palindromic subsequences from the number of subsequences (which is simply $2^n$).
This algorithm counts subsequences by the criterion in the OP; two subsequences are considered different if there is a difference in the list of indices used to select the elements, even if the resulting subsequences have the same elements.
To count palindromic subsequences, we build up the count based on intervals of the sequence. Specifically, we define:
$S_{i,j}$ = the substring of $S$ starting at index $i$ and ending at index $j$ (inclusive)
$P_{i,j}$ = the number of palindromic subsequences of $S_{i,j}$
Now, every one-element interval is a palindrome, so:
$P_{i,i} = 1$ for all $i < n$
If a substring does not begin and end with the same element (i.e., $S_i \neq S_j$) then the palindromic subsequences consist of:
Those which contain $S_i$ but do not contain $S_j$
Those which contain $S_j$ but do not contain $S_i$
Those which contain neither $S_i$ nor $S_j$
Now, note that $P_{i,j-1}$ includes both the first and the third set of subsequences, while $P_{i+1,j}$ includes both the second and the third set; $P_{i+1,j-1}$ is precisely the third set. Consequently:
$P_{i,j} = P_{i+1,j} + P_{i,j-1} - P_{i+1,j-1}$ if $S_i \neq S_j$
But what if $S_i = S_j$? In that case, we have to add the palindromes consisting of $S_i$ followed by a subsequence palindrome from $S_{i+1,j-1}$ followed by $S_j$, as well as the palindromic subsequence consisting of just the start and end characters. (Technically, an empty sequence is a palindrome, but we don't count those here.) The number of subsequences we add is $P_{i+1,j-1} + 1$, which cancels out the subtracted double count in the above equation. So:
$P_{i,j} = P_{i+1,j} + P_{i,j-1} + 1$ if $S_i = S_j$.
In order to save space, we can actually compute $P_{i,i+k}$ for $0 \le i < |S|-k$ for increasing values of $k$; we only need to retain two of these vectors in order to generate the final result $P_{0,|S|-1}$.
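As a rough illustration, the recurrence can be written down directly with a full two-dimensional table, without the space saving just mentioned. The function names are chosen here for illustration only. An empty interval (i > j) is taken to contribute 0, and the empty subsequence is subtracted separately at the end, since removing every character leaves an (empty) palindrome.

```python
def count_palindromic_subsequences(s):
    """Number of non-empty palindromic subsequences of s, counted by index sets."""
    n = len(s)
    if n == 0:
        return 0
    # p[i][j] holds P(i, j) for the interval s[i..j]
    p = [[0] * n for _ in range(n)]
    for i in range(n):
        p[i][i] = 1  # every single character is a palindrome
    for k in range(1, n):            # k = j - i, the interval width minus one
        for i in range(n - k):
            j = i + k
            inner = p[i + 1][j - 1] if i + 1 <= j - 1 else 0
            if s[i] == s[j]:
                p[i][j] = p[i + 1][j] + p[i][j - 1] + 1
            else:
                p[i][j] = p[i + 1][j] + p[i][j - 1] - inner
    return p[0][n - 1]


def count_non_palindromic_subsequences(s):
    # 2**n index subsets in total, minus the non-empty palindromes, minus the empty one
    return 2 ** len(s) - count_palindromic_subsequences(s) - 1


print(count_non_palindromic_subsequences("b99b"))  # 6
```

For a palindromic input such as "b99b" this is the number of ways to remove one or more characters and be left with a non-palindrome, matching the six cases listed in the question.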
EDIT:
Here's a little Python program; the first function computes the number of palindromic subsequences, as above, and the driver computes the number of non-palindromic subsequences (i.e. the number of ways to remove zero or more elements and produce a non-palindrome; if the original sequence is a palindrome, then it's the number of ways to remove one or more elements).
# Count the palindromic subsequences of s
def pcount(s):
    length = len(s)
    p0 = [0] * (length + 1)
    p1 = [1] * length
    for l in range(1, length):
        for i in range(length - l):
            p0[i] = p1[i]
            if s[i] == s[i + l]:
                p1[i] += p1[i+1] + 1
            else:
                p1[i] += p1[i+1] - p0[i+1]
    # The "+ 1" is to account for the empty sequence, which is a palindrome.
    return p1[0] + 1

# Count the non-palindromic subsequences of s
def npcount(s):
    return 2**len(s) - pcount(s)
This is not a complete answer, just a suggestion.
I would count the number of ways you can remove one or more characters and keep the string a palindrome, then subtract that from the total number of ways you can modify the string.
The most obvious way to modify a palindrome and keep it a palindrome is to remove the i'th and the (n-i)'th characters (n being the length of the string). There are 2^(n/2) ways you can do that.
The problem with this approach is that it assumes only a symmetric modification can keep the string a palindrome; you need to find a way to handle cases such as "aaaa", where any sort of modification will still result in a palindrome.
Brute force with memoization is pretty straightforward:
numWays(str): return 0 if str is empty
return memo[str] if it exists
memo[str] = numWays(str - firstChar) +
numWays(str - secondChar) +
... +
1 if str is not a palindrome
return memo[str]
Basically, you remove every character in turn and save the answer for the resulting string. The more identical characters you have in the string, the faster this is.
I'm not sure how to do it more efficiently, I will update this if I figure it out.
For a string with N elements, there are 2^N possible subsequences (including the whole string and the empty subsequence). Thus we can encode every subsequence by a number with a '1' bit at the bit position of every omitted (or present) character, and a '0' bit otherwise (assuming the length of the string is smaller than the number of bits in an integer, size_t here; otherwise you would need another representation for the bit string):
#include <stdio.h>
#include <string.h>
char string[] = "AbbA";
int is_palindrome (char *str, size_t len, size_t mask);
int main(void)
{
size_t len,mask, count;
len = strlen(string);
count =0;
for (mask = 1; mask < (1ul <<len) -1; mask++) {
if ( is_palindrome (string, len, mask)) continue;
count++;
}
fprintf(stderr, "Len:=%u, Count=%u \n"
, (unsigned) len , (unsigned) count );
return 0;
}
int is_palindrome (char *str, size_t len, size_t mask)
{
size_t l,r;
/* bits set in mask mark omitted characters; compare the remaining ones from both ends */
for (l=0, r = len -1; l < r; ) {
if ( mask & (1ul <<l)) { l++; continue; }
if ( mask & (1ul <<r)) { r--; continue; }
if ( str[l] != str[r] ) return 0;
l++,r--;
}
return 1;
}
Here's a Haskell version:
import Data.List

listNonPalindromes string =
  filter (isNotPalindrome) (subsequences string)
  where isNotPalindrome str
          | fst substr == snd substr = False
          | otherwise = True
          where substr = let a = splitAt (div (length str) 2) str
                         in (reverse (fst a), if even (length str)
                                                 then snd a
                                                 else drop 1 (snd a))

howManyNonPalindromes string = length $ listNonPalindromes string
*Main> listNonPalindromes "b99b"
["b9","b9","b99","9b","9b","99b"]
*Main> howManyNonPalindromes "b99b"
6

Understanding Knuth-Morris-Pratt Algorithm

Can someone explain this to me? I've been reading about it and it still is hard to follow.
text : ababdbaababa
pattern: ababa
table for ababa is -1 0 0 1 2.
I think I understand how the table is constructed, but I don't understand how to shift once a mismatch has occurred. It seems like we don't even use the table when shifting.
When do we use the table?
Here I have briefly described computing the prefix function and shifting through the text.
For further information: Knuth–Morris–Pratt string search algorithm
Shifting through the text :
Text: ABC ABCDAB ABCDABCDABDE
Pattern : ABCDABD
Scenario 1 - Some characters of the pattern match the text before a mismatch.
e.g. 1: Here 3 characters match.
Get the table value for 3 characters (index 2, ABC), i.e. 0.
Therefore shift = 3 - 0, i.e. 3.
e.g. 2: Here 6 characters match.
Get the table value for 6 characters (index 5, ABCDAB), i.e. 2.
Therefore shift = 6 - 2, i.e. 4.
Scenario 2 - If no characters match, shift by one.
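The two scenarios can be sketched as follows. This is an illustrative helper, not the answerer's code: `border_length` computes the table value by brute force straight from its definition (longest proper prefix that is also a suffix), which is slow but easy to check by hand.

```python
def border_length(s):
    """Length of the longest proper prefix of s that is also a suffix of s."""
    return max(k for k in range(len(s)) if s[:k] == s[len(s) - k:])


def shift_after_mismatch(pattern, matched):
    """How far to slide the pattern after `matched` characters matched before a mismatch."""
    if matched == 0:
        return 1  # Scenario 2: nothing matched, shift by one
    # Scenario 1: shift = matched characters - table value for the matched prefix
    return matched - border_length(pattern[:matched])


print(shift_after_mismatch("ABCDABD", 3))  # 3
print(shift_after_mismatch("ABCDABD", 6))  # 4
```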
The table is used when a mismatch occurs. Let's apply the pattern to your text:
You start matching the text against the pattern to test whether the pattern occurs in the text starting at the first position. You compare text[1] with pattern[1], and that turns out to be a match. You do the same for text[2], text[3] and text[4].
When you try to match text[5] with pattern[5], you don't have a match ('d' != 'a'). You then know that the pattern does not start at the first position. You could start matching all over again from position 2, but that is not efficient. You can use the table now.
The error occurred at pattern[5], so you go to table[5], which is 2. That tells you that you can resume matching at the current position with 2 characters already matched. Instead of restarting at position 2, you move the start of the pattern forward by the number of characters matched (4) minus table[5] (2), that is, from position 1 to position 3. Indeed, if we look at text[3] and text[4], we see that they are equal to pattern[1] and pattern[2], respectively.
The numbers in the table tell you how many characters are already matched when an error occurs. In this case 2 characters of the next alignment were already matched. You can then immediately continue matching from position 3 and skip position 2 (as the pattern cannot be found starting at position 2).
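To make the walk-through concrete, here is a small sketch that runs the search on the text and pattern from the question and records every alignment (start position of the pattern) that gets considered. It uses 0-based indices, unlike the 1-based description above, and takes the question's table `-1 0 0 1 2` as given; the function name `kmp_alignments` is made up for this example, and a non-empty pattern is assumed.

```python
def kmp_alignments(text, pattern, table):
    """Return (match_index, starts); starts lists each alignment considered, in order."""
    i = 0  # current position in text
    j = 0  # number of pattern characters already matched
    starts = [0]
    while i < len(text):
        if text[i] == pattern[j]:
            i += 1
            j += 1
            if j == len(pattern):
                return i - j, starts
        elif j > 0:
            j = table[j]          # keep table[j] characters as already matched
            starts.append(i - j)  # the pattern now starts here
        else:
            i += 1                # nothing matched: shift by one
            starts.append(i)
    return -1, starts


print(kmp_alignments("ababdbaababa", "ababa", [-1, 0, 0, 1, 2]))
# (7, [0, 2, 4, 5, 6, 7])
```

The second alignment is 2 (position 3 in 1-based terms), so the alignment at index 1 is never tried, which is the skip the answer describes.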
Well this is an old topic but hopefully someone who searches for this in the future will see it. Answer given above is good but I worked through an example myself to see what's going on exactly.
First part of the exposition is taken from wiki, the part I really wanted to elaborate on is how this backtracking array is constructed.
Here goes:
we work through a (relatively artificial) run of the algorithm, where
W = "ABCDABD" and
S = "ABC ABCDAB ABCDABCDABDE".
At any given time, the algorithm is in a state determined by two integers:
m which denotes the position within S which is the beginning of a prospective match for W
i the index in W denoting the character currently under consideration.
In each step we compare S[m+i] with W[i] and advance if they are equal. This is depicted, at the start of the run, like
             1         2
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W: ABCDABD
i: 0123456
We proceed by comparing successive characters of W to "parallel" characters of S, moving from one to the next if they match. However, in the fourth step,
we get S[3] is a space and W[3] = 'D', a mismatch. Rather than beginning to search again at S[1], we note that no 'A' occurs between positions 0 and 3 in S
except at 0; hence, having checked all those characters previously, we know there is no chance of finding the beginning of a match if we check them again.
Therefore we move on to the next character, setting m = 4 and i = 0.
             1         2
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W:     ABCDABD
i:     0123456
We quickly obtain a nearly complete match "ABCDAB" when, at W[6] (S[10]), we again have a discrepancy. However, just prior to the end of the current partial
match, we passed an "AB" which could be the beginning of a new match, so we must take this into consideration. As we already know that these characters match
the two characters prior to the current position, we need not check them again; we simply reset m = 8, i = 2 and continue matching the current character. Thus,
not only do we omit previously matched characters of S, but also previously matched characters of W.
             1         2
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W:         ABCDABD
i:         0123456
This search fails immediately, however, as the pattern still does not contain a space, so as in the first trial, we return to the beginning of W and begin
searching at the next character of S: m = 11, reset i = 0.
             1         2
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W:            ABCDABD
i:            0123456
Once again we immediately hit upon a match "ABCDAB" but the next character, 'C', does not match the final character 'D' of the word W. Reasoning as before,
we set m = 15, to start at the two-character string "AB" leading up to the current position, set i = 2, and continue matching from the current position.
             1         2
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W:                ABCDABD
i:                0123456
This time we are able to complete the match, whose first character is S[15].
The above example contains all the elements of the algorithm. For the moment, we assume the existence of a "partial match" table T, described below, which
indicates where we need to look for the start of a new match in the event that the current one ends in a mismatch. The entries of T are constructed so that
if we have a match starting at S[m] that fails when comparing S[m + i] to W[i], then the next possible match will start at index m + i - T[i] in S (that is,
T[i] is the amount of "backtracking" we need to do after a mismatch). This has two implications: first, T[0] = -1, which indicates that if W[0] is a mismatch,
we cannot backtrack and must simply check the next character; and second, although the next possible match will begin at index m + i - T[i], as in the example
above, we need not actually check any of the T[i] characters after that, so that we continue searching from W[T[i]].
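The description above, with the state (m, i) and a table T where T[0] = -1, can be sketched like this. The helper names `partial_match_table` and `kmp_search` are illustrative, and the table-building loop is one common way to fill T, not necessarily the one the quoted text has in mind. A non-empty W is assumed.

```python
def partial_match_table(w):
    """T[i] = how many characters of w stay matched after a mismatch at w[i]; T[0] = -1."""
    t = [0] * len(w)
    t[0] = -1
    pos, cnd = 2, 0  # cnd = length of the current candidate prefix
    while pos < len(w):
        if w[pos - 1] == w[cnd]:
            cnd += 1
            t[pos] = cnd
            pos += 1
        elif cnd > 0:
            cnd = t[cnd]  # fall back to a shorter candidate prefix
        else:
            t[pos] = 0
            pos += 1
    return t


def kmp_search(s, w):
    """Return the index in s where w first occurs, or -1."""
    t = partial_match_table(w)
    m = 0  # start of the prospective match in s
    i = 0  # index of the current character in w
    while m + i < len(s):
        if s[m + i] == w[i]:
            i += 1
            if i == len(w):
                return m
        elif t[i] > -1:
            m = m + i - t[i]  # next possible match starts here
            i = t[i]          # and these characters need not be rechecked
        else:
            m += 1            # w[0] mismatched: just move to the next character
            i = 0
    return -1


print(partial_match_table("ABCDABD"))                     # [-1, 0, 0, 0, 0, 1, 2]
print(kmp_search("ABC ABCDAB ABCDABCDABDE", "ABCDABD"))   # 15
```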
BACKTRACKING ARRAY CONSTRUCTION:
We will call this backtracking array lps[] instead of T[]. Let's see how it is calculated:
lps[i] = the length of the longest proper prefix of pat[0..i]
which is also a suffix of pat[0..i].
Examples:
For the pattern “AABAACAABAA”,
lps[] is [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]
//so just going through this real quick
lps[0] is just 0 by default
lps[1] is 1 because it's looking at AA, and A is both a prefix and a suffix
lps[2] is 0 because it's looking at AAB: every suffix ends in B, but no proper prefix does (the whole string AAB does not count, since the prefix must be proper)
lps[3] is 1 because it's looking at AABA, and the first A matches the last A
lps[4] is 2 because it's looking at AABAA, and the first 2 characters (AA) match the last 2
lps[5] is 0 because it's looking at AABAAC, and no prefix ends in C
...
For the pattern “ABCDE”, lps[] is [0, 0, 0, 0, 0]
For the pattern “AAAAA”, lps[] is [0, 1, 2, 3, 4]
For the pattern “AAABAAA”, lps[] is [0, 1, 2, 0, 1, 2, 3]
For the pattern “AAACAAAAAC”, lps[] is [0, 1, 2, 0, 1, 2, 3, 3, 3, 4]
And this makes sense if you think about it: if you mismatch, you want to reuse as much of what you have already matched as you can. How far back you can go (the suffix
portion) is essentially the prefix, since by definition you must start matching from the first character of the pattern again. So if your string looks like
aaaaaaaaaaaaaaa..b..aaaaaaaaaaaaaaac and you mismatch on the last char c, then you want to reuse aaaaaaaaaaaaaaa as your new head.
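A short sketch of a linear-time way to fill lps[], checked against the example patterns listed above. The function name `compute_lps` is chosen here for illustration.

```python
def compute_lps(pat):
    """lps[i] = length of the longest proper prefix of pat[0..i] that is also its suffix."""
    lps = [0] * len(pat)
    length = 0  # length of the prefix/suffix match that ends at position i - 1
    i = 1
    while i < len(pat):
        if pat[i] == pat[length]:
            length += 1
            lps[i] = length
            i += 1
        elif length > 0:
            # try the next shorter prefix that is also a suffix; i stays put
            length = lps[length - 1]
        else:
            lps[i] = 0
            i += 1
    return lps


print(compute_lps("AABAACAABAA"))  # [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]
print(compute_lps("AAACAAAAAC"))   # [0, 1, 2, 0, 1, 2, 3, 3, 3, 4]
```

The fallback `length = lps[length - 1]` is the same "reuse the prefix as the new head" idea, applied to the pattern matched against itself.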
A Complete Solution using Java:
package src.com.recursion;
/*
* This explains searching for a pattern in a text in O(n + m) time
*/
public class FindPatternInText {
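// Returns the index in text where pattern first occurs, or -1 if it does not occur.
public int checkIfExists(char[] text, char[] pattern) {
if (pattern.length == 0)
return 0;
int[] lps = new int[pattern.length];
createPrefixSuffixArray(pattern, lps);
int i = 0; // position in text
int j = 0; // position in pattern
int textLength = text.length;
while (i < textLength) {
if (pattern[j] == text[i]) {
j++;
i++;
}
if (j == pattern.length)
return i - j;
else if (i < textLength && pattern[j] != text[i]) {
// Mismatch: fall back in the pattern, or advance in the text if nothing matched.
if (j != 0) {
j = lps[j - 1];
} else {
i++;
}
}
}
return -1;
}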
private void createPrefixSuffixArray(char[] pattern, int[] lps) {
lps[0] = 0;
int index = 0;
int i = 1;
while (i < pattern.length) {
if (pattern[i] == pattern[index]) {
index++;
lps[i] = index; // length of the matched prefix, including pattern[i]
i++;
} else {
if (index != 0) {
index = lps[index - 1];
} else {
lps[i] = 0;
i++;
}
}
}
}
public static void main(String args[]) {
String text = "ABABDABACDABABCABAB";
String pattern = "ABABCABAB";
System.out.println("Point where the pattern match starts is "
+ new FindPatternInText().checkIfExists(text.toCharArray(), pattern.toCharArray()));
}
}

Resources