Linear-space data structure supporting subsequence queries on a static string

Build a data structure from a given string S of length n that supports fast queries checking whether an input string J of length m is a subsequence of S.
S is a static string and pre-processing time of the data-structure can be ignored.
Requirements:
The space consumption should be linear O(n)
The runtime of subsequence(J) should depend on m - not necessarily O(m) but, the faster the better.
What is subsequence?
A is a subsequence of B if A can be constructed by removing zero or more characters from B. For example, ABA is a subsequence of ADBDBAC.
What I tried
A data-structure which supports the Subsequence(J) query stores pointers from each letter in S to the next occurrence in S of every letter in the alphabet.
Let A be an array of length n + 1. A contains hash-tables hashed over the alphabet, σ. Each key-value pair (k,v) in a hash-table has some letter k as key and the index of its next occurrence as value v.
The hash-table A_0 contains the first occurrence of every letter in the alphabet.
The hash-table A_1 contains the index of second occurrence for the letter at S_0 along with the first occurrence of the other letters.
The hash-table A_2 contains the index of the second occurrence of the letters at S_0 and S_1, assuming they are different letters (otherwise A_2 contains the index of the third occurrence of that letter), along with the first occurrence of the other letters, and so on.
Example: If T is B C A D B, ¥ represents the hash-table A_0, and Ø represents a null pointer, the data structure would look like:
 |0 1 2 3 4 5
 |¥ B C A D B
A|3 3 3 Ø Ø Ø
B|1 5 5 5 5 Ø
C|2 2 Ø Ø Ø Ø
D|4 4 4 4 Ø Ø
The alphabet σ is built from the letters in T and is static. Therefore, perfect hashing (FKS) can be used.
Running the query
To perform the Subsequence(J) query with the string J, we look up the index of the first occurrence of J_0 in S using A_0.
In the example we could query Subsequence("BAB") to test if BAB is a subsequence:
* look-up B in column 0 which returns index 1
* look-up A in column 1 which returns index 3
* look-up B in column 3 which returns index 5
As long as we never hit a null pointer, the string is a subsequence. Each hash lookup takes constant time and we perform at most |J| of them, so the runtime is O(|J|).
The space consumption is O(|σ|·|S|), which exceeds the required O(n) unless the alphabet has constant size.
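For reference, the table described above can be sketched in Python roughly as follows. This is only an illustration: plain dicts stand in for the perfect hash tables, the function names build_next_table and is_subsequence are made up here, and positions are 1-based so that column 0 plays the role of A_0, as in the table. A missing key plays the role of the Ø null pointer.

```python
def build_next_table(s):
    """table[p][ch] = smallest 1-based position > p at which ch occurs in s.

    table[0] corresponds to A_0 (first occurrence of every letter).
    Uses one dict per position, so space is O(alphabet * n), not O(n).
    """
    n = len(s)
    table = [None] * (n + 1)
    table[n] = {}                      # nothing occurs after the last letter
    for i in range(n - 1, -1, -1):
        column = dict(table[i + 1])    # copy the column to the right
        column[s[i]] = i + 1           # s[i] sits at 1-based position i + 1
        table[i] = column
    return table


def is_subsequence(table, j):
    """Follow one pointer per character of j; a missing key is the null pointer."""
    pos = 0
    for ch in j:
        pos = table[pos].get(ch)
        if pos is None:
            return False
    return True


table = build_next_table("BCADB")
print(is_subsequence(table, "BAB"))    # follows columns 0 -> 1 -> 3 -> 5
```

The per-column copy is what makes the space blow up: every column repeats up to |σ| entries.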

The simple and slow way to check whether or not J is a subsequence of S is:
1. Start at the beginning of S.
2. For each character c in J, in order, move forward in S to the next occurrence of c.
3. Iff you make it to the end and find a match for every character, then J is a subsequence of S.
You can accelerate these searches by building a map from each character that occurs in S to a sorted array of the positions at which that character occurs.
Then, to find the next occurrence of a character in step (2), you can lookup the position array for that character and do a binary search in the array for the next occurrence after the current position.
Total worst-case complexity to do a subsequence check would be O(m log n).
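A minimal Python sketch of this idea, assuming a dict of sorted position lists and the standard bisect module (the names build_positions and is_subsequence are illustrative, not from the answer). The position lists hold each index of S exactly once, so the structure takes O(n) space.

```python
from bisect import bisect_left
from collections import defaultdict


def build_positions(s):
    """Map each character of s to the sorted list of indices where it occurs."""
    positions = defaultdict(list)
    for i, ch in enumerate(s):
        positions[ch].append(i)    # indices are appended in increasing order
    return positions


def is_subsequence(positions, j):
    """Check whether j is a subsequence, using one binary search per character."""
    cur = 0                        # smallest index of S that may still be used
    for ch in j:
        plist = positions.get(ch)
        if not plist:
            return False           # character never occurs in S
        k = bisect_left(plist, cur)
        if k == len(plist):
            return False           # no occurrence at or after cur
        cur = plist[k] + 1         # continue strictly after the match
    return True


positions = build_positions("ADBDBAC")
print(is_subsequence(positions, "ABA"))
```

Each query character costs one binary search in a list of length at most n, which is where the O(m log n) bound comes from.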

Related

What will be the DP and transitions in this problem?

Vasya has a string s of length n consisting only of digits 0 and 1. Also he has an array a of length n.
Vasya performs the following operation until the string becomes empty: choose some consecutive substring of equal characters, erase it from the string and glue together the remaining parts (any of them can be empty). For example, if he erases substring 111 from string 111110 he will get the string 110. Vasya gets a_x points for erasing a substring of length x.
Vasya wants to maximize his total points, so help him with this!
https://codeforces.com/problemset/problem/1107/E
I was trying to get my head around the editorial, but couldn't understand it. Can anyone explain an easy way to do it?
input:
7
1101001
3 4 9 100 1 2 3
output:
109
Explanation
the optimal sequence of erasings is: 1101001 → 111001 → 11101 → 1111 → ∅.
Here, we consider removing prefixes instead of substrings. Why?
We try to remove a consecutive prefix of a particular state, which is actually a substring of the main string. So our DP states will be start index, end index, and prefix length.
Let's consider an example: str = "1010110". Initially start=0, end=7, and prefix=1 (the first '1' is the only prefix now). We iterate over all the indices in the current state except the starting index and check whether str[i]==str[start]. Here, for example, str[4]==str[0]. Now we divide the string into "010" with prefix=1 and "110" with prefix=2 (the leading 1 at index 0 joins the 1 at index 4). These are now two independent subproblems. When a string of length 1 remains, we return a_prefix.
Here is my code.
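The code itself is not included in this copy. As a stand-in, here is a minimal Python sketch of the DP described above; it is not the original poster's code. The names max_points and solve are made up for illustration, end indices are inclusive here, and a is passed as a 0-based list, so a[k - 1] is the score for erasing k equal characters.

```python
from functools import lru_cache


def max_points(s, a):
    """Maximum score for erasing s, where a[x - 1] is the score for length x."""

    @lru_cache(maxsize=None)
    def solve(l, r, k):
        # State: substring s[l..r] (inclusive), where k counts s[l] itself plus
        # the equal characters already glued in front of it (the "prefix").
        if l > r:
            return 0
        # Option 1: erase the prefix of length k now, then solve the rest.
        best = a[k - 1] + solve(l + 1, r, 1)
        # Option 2: clear s[l+1..i-1] first so the prefix merges with s[i].
        for i in range(l + 1, r + 1):
            if s[i] == s[l]:
                best = max(best, solve(l + 1, i - 1, 1) + solve(i, r, k + 1))
        return best

    return solve(0, len(s) - 1, 1)


print(max_points("1101001", [3, 4, 9, 100, 1, 2, 3]))  # sample from the question: 109
```

There are O(n^3) states with O(n) transitions each, which is fine for n up to 100 as in the linked problem.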

Minimum number of operations required to create string A by appending subsequences of string B to an empty string C

You are given two strings A and B, and an empty string C. In one operation you can take any subsequence of B (that is, remove any number of characters from anywhere in B) and append it to C. Find the minimum number of operations required to turn C into A.
e.g if
A is "ABCDE" and
B is "ABDEC" then
In 1st operation you will choose subsequence ABC from B and in 2nd operation DE.
So two operations are required.
if
A is "ABCDE"
B is "EDCBA" then
operations required 5.
Linear complexity expected O(n)
Just use a greedy algorithm.
1 - Let i = 0
2 - Let j = 0
3 - Search for the first A[i] in B after j
4 - If it exists, let j be its index in B, remove it from B, append it to C, increment i, and repeat from 3
5 - If it doesn't exist, repeat from 2
Each time you get to 5 corresponds to one operation.
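A rough Python sketch of this greedy, under the assumption that each operation starts again from the full string B (so nothing is physically removed from B). The sketch counts one operation per pass over B, including the first pass. The function name min_operations and the None return for impossible inputs are choices made here, not part of the answer.

```python
def min_operations(a, b):
    """Greedy count of subsequences of b needed to spell a, or None if impossible."""
    if not a:
        return 0
    if set(a) - set(b):
        return None              # some character of a never occurs in b
    ops = 1                      # the first pass over b is the first operation
    j = 0                        # next usable index in b during the current pass
    for ch in a:
        pos = b.find(ch, j)
        if pos == -1:            # cannot continue this pass: start a new one
            ops += 1
            pos = b.find(ch)     # guaranteed to exist by the check above
        j = pos + 1
    return ops


print(min_operations("ABCDE", "ABDEC"))
print(min_operations("ABCDE", "EDCBA"))
```

Because str.find scans linearly, this version is not O(n) in the worst case; it only illustrates the pass structure.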
Assuming all the characters of A (and B) are different, then here is a solution with linear complexity. You need a hashmap or something similar, as well as an array of indices, Y, of equal length to A and B.
1 - Put each character of A in the hashmap as key, with its index as value.
2 - Look up each character of B in the hashmap to get the value i, and put its index into Y at the position i.
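3 - Go through Y counting the number of times that Y[i] < Y[i-1]. That count plus one is your number of operations (for ABCDE and ABDEC, Y is 0,1,4,2,3, which has one descent, giving two operations).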
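The three steps can be sketched in Python as follows, assuming, as the answer does, that all characters are distinct and that A and B contain the same characters. The function name min_operations_distinct is made up for this example. Every descent in Y forces a new pass over B, and the first pass counts as one operation, so the sketch returns the number of descents plus one.

```python
def min_operations_distinct(a, b):
    """Linear-time count, assuming a and b are permutations of distinct characters."""
    if not a:
        return 0
    index_in_a = {ch: i for i, ch in enumerate(a)}    # step 1
    y = [0] * len(a)
    for j, ch in enumerate(b):                        # step 2
        y[index_in_a[ch]] = j                         # position in b of a[i]
    descents = sum(1 for i in range(1, len(y)) if y[i] < y[i - 1])  # step 3
    return descents + 1


print(min_operations_distinct("ABCDE", "ABDEC"))
print(min_operations_distinct("ABCDE", "EDCBA"))
```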

Number of substrings with given constraints

I am given a sorted string and I wish to count the number of substrings (not necessarily contiguous) that are possible with the following constraints:
All the letters in the substring should be in sorted order.
The substring must contain only 1 vowel.
The length of the substring should be greater than or equal to 3.
For example:
for "aabbc",
we have 3 substrings, "abc", "abb" and "abbc", that match the above constraints, so the answer is 3.
How do I go about for a general string?
I have tried this for 2-3 hours, but couldn't find a proper way. I was asked this question in a programming coding round today and I fear the same question would be asked in the interview tomorrow. Even hints or approach would be appreciated.
Suppose we have k distinct vowels, and an array A holding the histogram of the non-vowels (i.e. A[0] is the number of occurrences of the first non-vowel, A[1] is the number of occurrences of the second non-vowel).
Then (ignoring the length constraint) we have k choices for the vowel, and (A[0]+1)*(A[1]+1)*(A[2]+1)*... choices for the remaining letters (for each non-vowel we can have 0,1,2,...,A[i] choices).
This overcounts by k (for the single letter cases) and by k*len(A) for the double letter cases, so simply subtract these from the total.
Example Python code:
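    from collections import Counter

    s = 'aabbc'
    vowels = 'aeiou'
    C = Counter(s)     # histogram of every letter in s
    t = 1              # product of (count + 1) over all consonants
    vowel_count = 0    # number of distinct vowels (k)
    cons_count = 0     # number of distinct consonants (len(A))
    for letter, count in C.items():
        if letter in vowels:
            vowel_count += 1
        else:
            cons_count += 1
            t *= count + 1
    # Subtract the one-letter and two-letter cases for each vowel.
    print(vowel_count * (t - cons_count - 1))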

Is it possible to count the number of distinct substrings in a string in O(n)?

Given a string s of length n, is it possible to count the number of distinct substrings in s in O(n)?
Example
Input: abb
Output: 5 ('abb', 'ab', 'bb', 'a', 'b')
I have done some research but I can't seem to find an algorithm that solves this problem in such an efficient way. I know an O(n^2) approach is possible, but is there a more efficient algorithm?
I don't need to obtain each of the substrings, just the total number of distinct ones (in case it makes a difference).
You can use Ukkonen's algorithm to build a suffix tree in linear time:
https://en.wikipedia.org/wiki/Ukkonen%27s_algorithm
The number of distinct substrings of s is then the number of prefixes of strings in the trie, which you can calculate in linear time. It's just the total number of characters in all nodes.
For instance, your example produces a suffix tree like:
   /\
  b  a
  |  b
  b  b
5 characters in the tree, so 5 substrings. Each unique string is a path from the root ending after a different letter: abb, ab, a, bb, b. So the number of strings is the number of letters in the tree.
More precisely:
Every substring is the prefix of some suffix of the string;
All the suffixes are in the trie;
So there is a 1-1 correspondence between substrings and paths through the trie (by the definition of trie); and
There is a 1-1 correspondence between letters in the tree and non-empty paths, because:
each distinct non-empty path ends at a distinct position after its last letter; and
the path to the position following each letter is unique
NOTE for people who are wondering how it could be possible to build a tree that contains O(N^2) characters in O(N) time:
There's a trick to the representation of a suffix tree. Instead of storing the actual strings in the nodes of the tree, you just store pointers into the original string, so the node that contains "abb" doesn't have "abb", it has (0,3): 2 integers per node, regardless of how long the string in each node is, and the suffix tree has O(N) nodes.
Construct the LCP array and subtract its sum from the number of substrings (n(n+1)/2).
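To make the LCP formula concrete, here is a small Python sketch. It builds the suffix array by plain sorting and computes each LCP value by direct comparison, so it is not O(n); it only demonstrates the counting identity. A linear-time version would swap in a linear suffix array construction and Kasai's LCP algorithm. The function name count_distinct_substrings is made up for this example.

```python
def count_distinct_substrings(s):
    """n(n+1)/2 minus the sum of LCPs of adjacent suffixes in sorted order."""
    n = len(s)
    # Naive suffix array: sort suffix start indices by the suffix text.
    suffix_array = sorted(range(n), key=lambda i: s[i:])
    total = n * (n + 1) // 2           # substrings counted with multiplicity
    lcp_sum = 0
    for x, y in zip(suffix_array, suffix_array[1:]):
        k = 0                          # longest common prefix of s[x:] and s[y:]
        while x + k < n and y + k < n and s[x + k] == s[y + k]:
            k += 1
        lcp_sum += k                   # these k prefixes were already counted
    return total - lcp_sum


print(count_distinct_substrings("abb"))  # example from the question: 5
```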

Algorithm to form a given pattern using some strings

Given are 6 strings of any length. The words are to be arranged in the pattern shown below. They can be arranged either vertically or horizontally.
--------
|      |
|      |
|      |
---------------
       |      |
       |      |
       |      |
       --------
The pattern need not be symmetric, and there need to be two empty areas as shown.
For example:
Given strings
PQF
DCC
ACTF
CKTYCA
PGYVQP
DWTP
The pattern can be
DCC...
W.K...
T.T...
PGYVQP
..C..Q
..ACTF
where dots represent empty areas.
The other example is
RVE
LAPAHFUIK
BIRRE
KZGLPFQR
LLHU
UUZZSQHILWB
Pattern is
LLHU....
A..U....
P..Z....
A..Z....
H..S....
F..Q....
U..H....
I..I....
KZGLPFQR
...W...V
...BIRRE
If multiple patterns are possible then pattern with lexicographically smallest first line, then second line and so on is to be formed. What algorithm can be used to solve this?
Find strings that satisfy these constraints:
strlen(a) + strlen(b) - 1 = strlen(c)
strlen(d) + strlen(e) - 1 = strlen(f)
After that, try every possible arrangement and check whether it is valid. For example:
aaa.....
d.f.....
d.f.....
d.f.....
cccccccc
..f....e
..f....e
..bbbbbb
There will be 2*2*2 = 8 different arrangements.
There are a number of heuristics that you can apply, but before that, let's go over some properties of the puzzle.
+aa+
c  f
+ee+eee+
   f   d
   +bbb+
Let each letter in the diagram above also denote the length of the string drawn with that letter. We have:
a + b - 1 = e
c + d - 1 = f
I will refer to the 2 strings for the cross in the middle as middle strings.
We also know that no string can be shorter than 2. Therefore, we can infer:
e > a, e > b
f > c, f > d
From this, we know that the 2 shortest strings cannot be middle strings, due to the inequality above.
The 3 longest strings cannot all have equal length either: after choosing any one of them as a middle string, we would be left with two equally long strings of the same maximal length, which is impossible according to the inequality above.
The puzzle is only tricky when the lengths are regular. When the lengths are irregular, you can do direct mapping from length to position.
If the 2 longest strings have equal length, then by the inequality above they are the 2 middle strings. The worst case for this one is a "regular" puzzle, where the lengths a, b, c, d are equal.
If the 2 longest strings have unequal lengths, the longest string's position can be determined immediately (since its length is unique in the puzzle): it is one of the middle strings. In the worst case, there can be 3 candidates for the other middle string, so just brute force and check all of them.
Algorithm:
Try to map each string of unique length to its position.
Brute force the 2 strings in the middle (taking into consideration what I mentioned above), and brute force to fill in the rest.
Even with naive brute force, there are only 6! = 720 cases if the strings can only go from left to right and top to bottom (no reversal). There will be 720 * 2^6 = 46080 cases if each string is allowed to go in either direction.
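As an illustration of the plain 6! brute force, here is a Python sketch. It assumes words read only left to right and top to bottom, that the four outer words need at least 3 letters so that both empty areas exist, and that '.' marks an empty cell when rows are compared. The function name solve_crossword and the role names h1, h2, h3 (top, middle, bottom horizontal) and v1, v2, v3 (left, middle, right vertical) are invented for this sketch.

```python
from itertools import permutations


def solve_crossword(words):
    """Return the lexicographically smallest valid grid as a list of rows, or None."""
    best = None
    for h1, h2, h3, v1, v2, v3 in permutations(words):
        # Length constraints from the shape of the pattern.
        if len(h1) + len(h3) - 1 != len(h2):
            continue
        if len(v1) + len(v3) - 1 != len(v2):
            continue
        if min(len(h1), len(h3), len(v1), len(v3)) < 3:
            continue                         # assumption: empty areas must exist
        r, c = len(v1) - 1, len(h1) - 1      # row of h2, column of v2
        # The seven cells where two words cross must agree.
        if not (h1[0] == v1[0] and h1[-1] == v2[0] and v1[-1] == h2[0]
                and h2[c] == v2[r] and h2[-1] == v3[0]
                and v2[-1] == h3[0] and h3[-1] == v3[-1]):
            continue
        grid = [['.'] * len(h2) for _ in range(len(v2))]
        for i, ch in enumerate(h1):
            grid[0][i] = ch
        for i, ch in enumerate(h2):
            grid[r][i] = ch
        for i, ch in enumerate(h3):
            grid[-1][c + i] = ch
        for i, ch in enumerate(v1):
            grid[i][0] = ch
        for i, ch in enumerate(v2):
            grid[i][c] = ch
        for i, ch in enumerate(v3):
            grid[r + i][-1] = ch
        rows = [''.join(row) for row in grid]
        if best is None or rows < best:      # compares first row, then second, ...
            best = rows
    return best


for row in solve_crossword(["PQF", "DCC", "ACTF", "CKTYCA", "PGYVQP", "DWTP"]):
    print(row)
```

Comparing lists of row strings gives the "first line, then second line" ordering directly; the length-based pruning described above would only reduce the number of permutations tried.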
