Longest Substring Pair Sequence: is it Longest Common Subsequence or what?

I have a pair of strings, for example abcabcabc and abcxxxabc, and a List of Common Substring Pairs (LCSP). In this case the LCSP has 6 pairs, because the three abc in the first string map to the two abc in the second string. Now I need to find the longest valid (incrementing) sequence of pairs. In this case there are three equally long solutions: 0:0,3:6; 0:0,6:6; 3:0,6:6 (the numbers are the starting positions of each pair in the original strings; the substring length is 3, the length of "abc"). I would call it the Longest Substring Pair Sequence, or LSPQ (Q so as not to confuse String and Sequence).
Here is the LCSP for this example:
LCSP('abcabcabc', 'abcxxxabc') =
[ [ 6, 6, 3 ],
[ 6, 0, 3 ],
[ 3, 6, 3 ],
[ 0, 6, 3 ],
[ 3, 0, 3 ],
[ 0, 0, 3 ] ]
LSPQ(LCSP('abcabcabc', 'abcxxxabc'), 0, 0, 0) =
[ { a: 0, b: 0, size: 3 }, { a: 3, b: 6, size: 3 } ]
Currently I find it by brute force, recursively trying all combinations, so I am limited to about 25 pairs; beyond that it is impractical. Size = [10, 15, 20, 25, 26, 30], Time (ms) = [0, 15, 300, 1000, 2000, 19000].
Is there a way to do this in linear time, or at least in better than quadratic time, so that a longer input LCSP (List of Common Substring Pairs) could be used?
This problem is similar to the "Longest Common Subsequence" problem, but not identical, because the input is not two strings but a list of common substrings sorted by their length. So I do not know where to look for existing solutions, or even whether they exist.
Here is my particular code (JavaScript):
function getChainSize(T) {
  var R = 0
  for (var i = 0; i < T.length; i++) R += T[i].size
  return R
}
function LSPQ(T, X, Y, id) {
  // X, Y are the first unused characters in str1, str2
  // id is the current pair
  function findNextPossible() {
    var x = id
    while (x < T.length) {
      if (T[x][0] >= X && T[x][1] >= Y) return x
      x++
    }
    return -1
  }
  var id = findNextPossible()
  if (id < 0) return []
  var C = [{a:T[id][0], b:T[id][1], size:T[id][2] }]
  // with current
  var o = T[id]
  var A = C.concat(LSPQ(T, o[0]+o[2], o[1]+o[2], id+1))
  // without current
  var B = LSPQ(T, X, Y, id+1)
  if (getChainSize(A) < getChainSize(B)) return B
  return A
}
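For comparison with the exponential recursion above, here is a minimal Python sketch of a quadratic dynamic program over the pairs. It is not from the original post and does not reach the linear time asked for. It assumes each pair is an (a, b, size) tuple with size of at least 1, that a pair may follow another only if it starts at or after the end of the previous one in both strings, and that the goal is the largest total size, as getChainSize measures it. The name longest_pair_chain is hypothetical.

```python
def longest_pair_chain(pairs):
    """Return the chain of (a, b, size) pairs with the largest total size."""
    order = sorted(pairs)          # sort by start in the first string, then the second
    n = len(order)
    if n == 0:
        return []
    best = [0] * n                 # best[i]: largest total size of a chain ending at order[i]
    prev = [-1] * n                # prev[i]: index of the previous pair in that chain
    for i, (a, b, size) in enumerate(order):
        best[i] = size
        for j in range(i):         # any valid predecessor starts earlier, so it sorts earlier
            pa, pb, ps = order[j]
            if pa + ps <= a and pb + ps <= b and best[j] + size > best[i]:
                best[i] = best[j] + size
                prev[i] = j
    i = max(range(n), key=best.__getitem__)
    chain = []
    while i != -1:                 # walk the predecessor links back to the start
        chain.append(order[i])
        i = prev[i]
    return chain[::-1]


pairs = [(6, 6, 3), (6, 0, 3), (3, 6, 3), (0, 6, 3), (3, 0, 3), (0, 0, 3)]
print(longest_pair_chain(pairs))
```

With n pairs this does about n*n/2 comparisons, so a few thousand pairs should be manageable where the brute force stalls at about 25.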

Related

Smallest window (substring) that has both uppercase and corresponding lowercase characters

I was asked the following question in an onsite interview:
A string is considered "balanced" when every letter in the string appears both in uppercase and lowercase. For example, CATattac is balanced (a, c, t occur in both cases), while Madam is not (a, d only appear in lowercase). Write a function that, given a string, returns the shortest balanced substring of that string. For example:
“azABaabza” should return “ABaab”
“TacoCat” should return -1 (not balanced)
“AcZCbaBz” should return the entire string
Doing it with the brute force approach is trivial - calculating all the pairs of substrings and then checking if they are balanced, while keeping track of the size and starting index of the smallest one.
How do I optimize? I have a strong feeling it can be done with a sliding-window/two-pointer approach, but I am not sure how. When to update the pointers of the sliding window?
Edit: Removing the sliding-window tag since this is not a sliding-window problem (as discussed in the comments).
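As a reference point for the answers that follow, the brute-force approach described in the question could look like this in Python. This is only a sketch; it assumes the input contains letters only, and the function names are hypothetical.

```python
def is_balanced(sub):
    """True if every letter in sub occurs in both uppercase and lowercase."""
    chars = set(sub)
    return bool(chars) and all(c.swapcase() in chars for c in chars)


def shortest_balanced_bruteforce(s):
    """Try every substring, shortest first; return the first balanced one or -1."""
    n = len(s)
    for length in range(2, n + 1):            # a balanced substring needs at least 2 letters
        for start in range(n - length + 1):
            candidate = s[start:start + length]
            if is_balanced(candidate):
                return candidate
    return -1


print(shortest_balanced_bruteforce("azABaabza"))
print(shortest_balanced_bruteforce("TacoCat"))
print(shortest_balanced_bruteforce("AcZCbaBz"))
```

It checks O(n^2) substrings at O(n) each, so it is only useful for checking faster solutions on small inputs.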
We can exploit a special property of the input: there are only 26 uppercase and 26 lowercase letters.
For each of the 26 letters j and each starting position i, let len[i][j] be the minimum length of a substring starting at position i that contains both the uppercase and the lowercase form of letter j.
Demo C++ code:
string s = "CATattac";
// if len[i] >= s.size() + 1, it denotes there is no matching
vector<vector<int>> len(s.size(), vector<int>(26, 0));
for (int i = 0; i < 26; ++i) {
int upperPos = s.size() * 2;
int lowerPos = s.size() * 2;
for (int j = s.size() - 1; j >= 0; --j) {
if (s[j] == 'A' + i) {
upperPos = j;
} else if (s[j] == 'a' + i) {
lowerPos = j;
}
len[j][i] = max(lowerPos - j + 1, upperPos - j + 1);
}
}
We also keep track of the count of characters.
// cnt[i][j] denotes the number of characters j in substring s[0..i-1]
// cnt[0][j] is always 0
vector<vector<int>> cnt(s.size() + 1, vector<int>(26, 0));
for (int i = 0; i < s.size(); ++i) {
for (int j = 0; j < 26; ++j) {
cnt[i + 1][j] = cnt[i][j];
if (s[i] == 'A' + j || s[i] == 'a' + j) {
++cnt[i + 1][j];
}
}
}
Then we can loop over s.
int m = s.size() + 1;
for (int i = 0; i < s.size(); ++i) {
bool done = false;
int minLen = 1;
while (!done && i + minLen <= s.size()) {
// execute at most 26 times, a new character must be added to change minLen
int prevMinLen = minLen;
done = true;
for (int j = 0; j < 26 && i + minLen <= s.size(); ++j) {
if (cnt[i + minLen][j] - cnt[i][j] > 0) {
// character j exists in the substring, have to find pair of it
minLen = max(minLen, len[i][j]);
}
}
if (prevMinLen != minLen) done = false;
}
// find overall minLen
if (i + minLen <= s.size())
m = min(m, minLen);
cout << minLen << '\n';
}
Output (a value is valid if i + minLen <= s.size(); otherwise no balanced substring starts at that position).
The invalid values differ from one another only because of how the array len is generated.
8
4
15
14
13
12
11
10
I'm not sure whether there is a simpler solution, but it is the best I could think of right now.
Time complexity: O(N) with a constant of 26 * 26
Edit: I previously had O(n log n) due to an unnecessary binary search.
I thought of a solution, which is technically O(n), where n is the length of the string, but the constant is pretty large.
For simplicity's sake, let's consider an analogous situation with only two letters, A and B (and their lowercase counterparts), and let l be the size of the alphabet for future reference. I worked on an example string ABabBaaA.
We start by computing the prefix counts of the number of occurrences of each letter. In this case, we get
i: 0, 1, 2, 3, 4, 5, 6, 7, 8
----------------------------
A: 0, 1, 1, 1, 1, 1, 1, 1, 2
a: 0, 0, 0, 1, 1, 1, 2, 3, 3
B: 0, 0, 1, 1, 1, 2, 2, 2, 2
b: 0, 0, 0, 0, 1, 1, 1, 1, 1
This way, assuming we are indexing the string starting from 1 (for implementation's sake you can add an extra character to the beginning, like a dollar sign $), we can get the number of occurrences of each letter on any substring in constant time (or rather -- in O(l), but in my case l is set to 2 and in your case l = 26 so technically this is constant time).
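The prefix-count table above can be reproduced with a short Python sketch. It uses 1-based positions as in the text, and the helper names prefix_counts and occurrences are hypothetical.

```python
def prefix_counts(s):
    """counts[c][i] = number of occurrences of letter c among the first i characters."""
    counts = {c: [0] for c in sorted(set(s))}
    for ch in s:
        for c, row in counts.items():
            row.append(row[-1] + (c == ch))   # copy the previous column, add 1 on a match
    return counts


def occurrences(counts, c, left, right):
    """Occurrences of c in the 1-based, inclusive range [left, right]."""
    return counts[c][right] - counts[c][left - 1]


counts = prefix_counts("ABabBaaA")
for letter in "AaBb":
    print(letter, counts[letter])
```

Each range query is a single subtraction per letter, which is what makes the per-substring check cost O(l).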
OK now we prepare arrays / vectors / queues of character indices, so if the character A appears on indices 1 and 8, the structure will consist of 1 and 8. We get
A: 1, 8
a: 3, 6, 7
B: 2, 5
b: 4
Importantly, in arrays and vectors we can look up the lowest element greater than a given index in amortized constant time, by discarding, one by one, the indices that are too small.
Now, the algorithm. Starting at each (left) index greater than 0, we will find the earliest right index for which the substring bound by [left_index, right_index] is balanced. We do that as follows:
0. Start with left_index = right_index = i for i = 1, ..., n.
1. Read the array of prefix counts for right_index and subtract the prefix counts for left_index - 1, receiving the counts for the substring [left_index, right_index]. Find any letter which fails the "balance" check. If there is none, you found the shortest balanced substring starting at left_index.
2. Find the first occurrence of the "missing" letter greater than left_index. Set right_index to the index of that occurrence. Go to step 1, keeping the modified right_index.
For example: starting with left_index = right_index = 1 we see that the number of occurrences of each letter in the substring is 1, 0, 0, 0, so a fails the check. The earliest occurrence of a is 3, so we set right_index = 3. We go back to step 1 receiving a new array of occurrences: 1, 1, 1, 0. Now b fails the check, and its earliest occurrence greater than 1 is 4, so we set right_index to 4. We go to step 1 receiving an array of occurrences 1, 1, 1, 1, which passes the balance check.
Another example: starting with left_index = right_index = 2 we get in step 1 an array of occurrences 0, 0, 1, 0. Now b fails the check. The earliest occurrence of b greater than left_index is 4, so we set right_index to 4. Now we get an array of occurrences 0, 1, 1, 1, so A fails the check. The earliest occurrence of A greater than left_index is 8, so we set right_index to that. Now, the array of occurrences is 2-1, 3-0, 2-0, 1-0, which is 1, 3, 2, 1 and it passes the balance check.
Ultimately we will find the shortest balanced substring to be bB with left_index = 4.
The complexity of this algorithm is O(nl^2) because: we start at n different indices and we perform a maximum of l lookups (for l different letters which can fail the check) in O(1). For each lookup, we have to calculate l differences of prefix sums. But as l is constant (albeit it may be large, like 26), this simplifies to O(n).
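The whole procedure can be sketched in Python as follows. This is an illustration under stated assumptions, not the answerer's code: the input is assumed to contain letters only, the function name is hypothetical, and the "first occurrence at or after left_index" lookup uses bisect (a binary search) instead of the amortized pointer discarding described above, which adds a log factor.

```python
from bisect import bisect_left


def shortest_balanced(s):
    n = len(s)
    letters = sorted(set(s))
    prefix = {c: [0] * (n + 1) for c in letters}   # prefix[c][i]: count of c in s[:i]
    positions = {c: [] for c in letters}           # 1-based indices where c occurs
    for i, ch in enumerate(s, start=1):
        for c in letters:
            prefix[c][i] = prefix[c][i - 1] + (c == ch)
        positions[ch].append(i)

    def count(c, left, right):
        return prefix[c][right] - prefix[c][left - 1] if c in prefix else 0

    best = None
    for left in range(1, n + 1):
        right = left
        while True:
            # find a letter present in [left, right] whose other case is absent
            missing = next((c.swapcase() for c in letters
                            if count(c, left, right) > 0
                            and count(c.swapcase(), left, right) == 0), None)
            if missing is None:                    # balanced: shortest one starting at left
                if best is None or right - left + 1 < len(best):
                    best = s[left - 1:right]
                break
            occ = positions.get(missing, [])
            k = bisect_left(occ, left)
            if k == len(occ):                      # the missing letter never appears again
                break
            right = occ[k]                         # extend the window to include it
    return best if best is not None else -1


print(shortest_balanced("ABabBaaA"))
```

For ABabBaaA it follows the same steps as the two worked examples and ends with the substring starting at left_index = 4.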
I'm using a recursive approach to this; I'm not sure what its time complexity is, though.
The idea is to check which characters in the string are present in both their lowercase and uppercase forms. Any character that doesn't appear in both forms is replaced with a space ' '. We then split the remaining string on ' ' into a list.
In the first case, if we have only one string left after the split, we return its length.
In the second case, if we have no characters left, we return -1.
In the third case, if we have more than one string left, we re-evaluate each of the strings recursively and return the largest of the resulting lengths.
from collections import Counter

def findMutual(s):
    lower = dict(Counter( [x for x in s if x.lower() == x] ))
    upper = dict(Counter( [x for x in s if x.upper() == x] ))
    mutual = {}
    for charr in lower:
        if charr.upper() in upper:
            mutual[charr] = upper[charr.upper()] + lower[charr]
    matching_charrs = ''.join([x if x.lower() in mutual else ' ' for x in s ]).split()
    print(s)
    print(matching_charrs)
    return matching_charrs

def smallestSubstring(s):
    matching_charrs = findMutual(s)
    if len(matching_charrs) == 1:
        return(len(matching_charrs[0]))
    elif len(matching_charrs) == 0:
        return(-1)
    else:
        list_lens = []
        for i in matching_charrs:
            list_lens.append(smallestSubstring(i))
        return max(list_lens)

print(smallestSubstring('azABaabza'))
print(smallestSubstring('dAcZCbaBz'))
print(smallestSubstring('TacoCat'))
print(smallestSubstring('Tt'))
print(smallestSubstring('T'))
print(smallestSubstring('TaCc'))

Sequence Of Zero

Consider the sequence of numbers from 1 to 𝑁. For example, for 𝑁 = 9,
we have 1, 2, 3, 4, 5, 6, 7, 8, 9.
Now, place among the numbers one of the three following operators:
"+" sum
"-" subtraction
"#" Paste Operator --> paste the previous and the next operands.
For example, 1#2 = 12
How can I calculate the number of possible sequences that yield zero ?
Example for N = 7:
1+2-3+4-5-6+7
1+2-3-4+5+6-7
1-2#3+4+5+6+7
1-2#3-4#5+6#7
1-2+3+4-5+6-7
1-2-3-4-5+6+7
See the fourth sequence, it is same as 1-23-45+67 and the result is 0.
All of the above sequences evaluate to zero.
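To make the counting concrete, here is a small brute-force sketch in Python that tries all 3^(N-1) operator placements and evaluates each expression from left to right. It is only meant for checking small N, and the function name is hypothetical.

```python
from itertools import product


def count_zero_sequences(n):
    """Count the ways to put '+', '-' or '#' between 1..n so the result is 0."""
    count = 0
    for ops in product("+-#", repeat=n - 1):
        total, sign, current = 0, 1, 1          # current: the operand being built
        for k, op in zip(range(2, n + 1), ops):
            if op == "#":
                current = int(str(current) + str(k))   # paste: 2#3 -> 23
            else:
                total += sign * current         # close the current operand
                sign = 1 if op == "+" else -1
                current = k
        total += sign * current                 # close the last operand
        if total == 0:
            count += 1
    return count


print(count_zero_sequences(7))
```

For N = 7 it finds the six sequences listed above.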
Here is my recursion-based solution, meant to build your intuition so that you can improve it with dynamic programming on your own (implemented in C++):
#include <cmath>
#include <iostream>
using namespace std;
// N is the input
// index_count is the index count in the given sequence
// sum is the total sum of a given sequence
int isEvaluteToZero(int N, int index_count, int sum){
// if N==1, then the sequence only contains 1 which is not 0, so return 0
if(N==1){
return 0;
}
// Base case
// if index_count is equal to N and total sum is 0, return 1, else 0
if(index_count==N){
if(sum==0){
return 1;
}
return 0;
}
// recursively call by considering '+' between index_count and index_count+1
// increase index_count by 1
int placeAdd = isEvaluteToZero(N, index_count+1, sum+index_count+1);
// recursively call by considering '-' between index_count and index_count+1
// increase index_count by 1
int placeMinus = isEvaluteToZero(N, index_count+1, sum-index_count-1);
// place '#'
int placePaste;
if(index_count+2<=N){
// paste the previous and the next operands
// For e.g., (8#9) = 8*(10^1)+9 = 89
// (9#10) = 9*(10^2)+10 = 910
// (99#100) = 99*(10^3)+100 = 99100
// (999#1000) = 999*(10^4)+1000 = 9991000
int num1 = index_count+1;
int num2 = index_count+2;
// shift num1 left by the number of decimal digits in num2
int shift = 1;
for (int t = num2; t > 0; t /= 10) shift *= 10;
int concat_num = num1*shift + num2;
placePaste = isEvaluteToZero(N, index_count+2, sum+concat_num) + isEvaluteToZero(N, index_count+2, sum-concat_num);
}else{
// in case index_count+2>N
placePaste = 0;
}
return (placeAdd+placeMinus+placePaste);
}
int main(){
int N, res=1, index_count=1;
cout<<"Enter N:";
cin>>N;
cout<<isEvaluteToZero(N, index_count, res)<<endl;
return 0;
}
output:
N=1 output=0
N=2 output=0
N=3 output=1
N=4 output=1
N=7 output=6

Get all possible sums from a list of numbers

Let's say I have a list of numbers: 2, 2, 5, 7
Now the result of the algorithm should contain all possible sums.
In this case: 2+2, 2+5, 5+7, 2+2+5, 2+2+5+7, 2+5+7, 5+7
I'd like to achieve this by using Dynamic Programming. I tried using a matrix but so far I have not found a way to get all the possibilities.
Based on the question, I think that the answer posted by AT-2016 is correct, and there is no solution that can exploit the concept of dynamic programming to reduce the complexity.
Here is how you can exploit dynamic programming to solve a similar question that asks to return the sum of all possible subsequence sums.
Consider the array {2, 2, 5, 7}: The different possible subsequences are:
{2},{2},{5},{7},{2,5},{2,5},{5,7},{2,5,7},{2,5,7},{2,2,5,7},{2,2},{2,7},{2,7},{2,2,7},{2,2,5}
So, the question is to find the sum of all these elements from all these subsequences. Dynamic Programming comes to the rescue!!
Arrange the subsequences based on the ending element of each subsequence:
subsequences ending with the first element: {2}
subsequences ending with the second element: {2}, {2,2}
subsequences ending with the third element: {5},{2,5},{2,5},{2,2,5}
subsequences ending with the fourth element: {7},{5,7},{2,7},{2,7},{2,2,7},{2,5,7},{2,5,7},{2,2,5,7}.
Here is the code snippet:
The array 's[]' calculates the sums for 1,2,3,4 individually, that is, s[2] calculates the sum of all subsequences ending with third element. The array 'dp[]' calculates the overall sum till now.
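s[0] = array[0];
dp[0] = s[0];
k = 2;
for (int i = 1; i < n; i++)
{
    // a subsequence ending at element i is array[i] alone or array[i] appended to an
    // earlier subsequence: 2^i subsequences in total, whose prefixes sum to dp[i-1]
    s[i] = dp[i-1] + k*array[i];
    dp[i] = dp[i-1] + s[i];
    k = k * 2;
}
return dp[n-1];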
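The recurrence can be checked against direct enumeration with a short Python sketch. Note that this sketch adds the running total of all earlier subsequence sums (dp[i-1]) when extending to element i, because every subsequence ending at element i is that element alone or that element appended to an earlier subsequence. The function names are hypothetical, and the array is assumed to be non-empty.

```python
from itertools import combinations


def total_of_subsequence_sums(array):
    ending_here = array[0]     # sum over subsequences ending at the current element
    total = array[0]           # sum over all subsequences seen so far
    k = 2                      # 2^i: number of subsequences ending at element i
    for x in array[1:]:
        ending_here = total + k * x
        total += ending_here
        k *= 2
    return total


def brute_force_total(array):
    """Sum the sums of all non-empty subsequences directly (exponential)."""
    return sum(sum(c)
               for r in range(1, len(array) + 1)
               for c in combinations(array, r))


print(total_of_subsequence_sums([2, 2, 5, 7]), brute_force_total([2, 2, 5, 7]))
```

For {2, 2, 5, 7} both give 128, which also equals 2^(n-1) times the sum of the elements, since each element appears in half of all subsets.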
This C# program, which I used earlier, finds the possible sums by enumerating an array of bitmasks:
static void Main(string[] args)
{
//Set up array of integers
int[] items = { 2, 2, 5, 7 };
//Figure out how many bitmasks are needed
//4 bits have a maximum value of 15, so we need 15 masks.
//Calculated as: (2 ^ ItemCount) - 1
int len = items.Length;
int calcs = (int)Math.Pow(2, len) - 1;
//Create array of bitmasks. Each item in the array represents a unique combination from our items array
string[] masks = Enumerable.Range(1, calcs).Select(i => Convert.ToString(i, 2).PadLeft(len, '0')).ToArray();
//Spit out the corresponding calculation for each bitmask
foreach (string m in masks)
{
//Get the items from array that correspond to the on bits in the mask
int[] incl = items.Where((c, i) => m[i] == '1').ToArray();
//Write out the mask, calculation and resulting sum
Console.WriteLine(
"[{0}] {1} = {2}",
m,
String.Join("+", incl.Select(c => c.ToString()).ToArray()),
incl.Sum()
);
}
Console.ReadKey();
}
Possible outputs:
[0001] 7 = 7
[0010] 5 = 5
[0011] 5 + 7 = 12
[0100] 2 = 2
This is not an answer to the question because it does not demonstrate the application of dynamic programming. Rather it notes that this problem involves multisets, for which facilities are available in Sympy.
>>> from sympy.utilities.iterables import multiset_combinations
>>> numbers = [2,2,5,7]
>>> sums = [ ]
>>> for n in range(2,1+len(numbers)):
...     for item in multiset_combinations([2,2,5,7],n):
...         item
...         added = sum(item)
...         if not added in sums:
...             sums.append(added)
...
[2, 2]
[2, 5]
[2, 7]
[5, 7]
[2, 2, 5]
[2, 2, 7]
[2, 5, 7]
[2, 2, 5, 7]
>>> sums.sort()
>>> sums
[4, 7, 9, 11, 12, 14, 16]
I have a solution that prints a list of all possible subset sums.
It's not dynamic programming (DP), but this solution is faster than the DP approach.
void solve(){
long long i, n; // the original used an undefined alias "ll" and an unused j
cin>>n;
vector<int> arr(n);
const int maxPossibleSum=1000000;
for(i=0;i<n;i++){
cin>>arr[i];
}
bitset<maxPossibleSum> b;
b[0]=1;
for(i=0;i<n;i++){
b|=b<<arr[i];
}
for(i=0;i<maxPossibleSum;i++){
if(b[i])
cout<<i<<endl;
}
}
Input:
First line has the number of elements N in the array.
The next line contains N space-separated array elements.
4
2 2 5 7
----------
Output:
0
2
4
5
7
9
11
12
14
16
The time complexity of this solution is O(N * maxPossibleSum/32)
The space complexity of this solution is O(maxPossibleSum/8)
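The same shift-and-OR idea can be sketched in Python, where an arbitrary-precision integer plays the role of the bitset, so no maxPossibleSum bound is needed. This assumes non-negative integers; the function name is hypothetical.

```python
def subset_sums(numbers):
    """Return every sum reachable by a subset of numbers (including the empty one)."""
    bits = 1                        # bit i set means "sum i is reachable"; 0 always is
    for x in numbers:
        bits |= bits << x           # every reachable sum s also makes s + x reachable
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


print(subset_sums([2, 2, 5, 7]))
```

The leading 0 corresponds to the empty subset, as in the output above.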

Generate string permutations recursively; each character appears n times

I'm trying to write an algorithm that will generate all strings of length nm with exactly n of each number 1, 2, ..., m.
For instance, all strings of length 6 with exactly two 1's, two 2's and two 3's, e.g. 112233, 121233.
I managed to do this with just 1's and 2's using a recursive method, but can't seem to get something that works when I introduce 3's.
When m = 2, the algorithm I have is:
generateAllStrings(int len, int K, String str)
{
if(len == 0)
{
output(str);
}
if(K > 0)
{
generateAllStrings(len - 1, K - 1, str + '2');
}
if(len > K)
{
generateAllStrings(len - 1, K, str + '1');
}
}
I've tried inserting similar conditions for the third number but the algorithm doesn't give a correct output. After that I wouldn't even know how to generalise for 4 numbers and above.
Is recursion the right thing to do? Any help would be appreciated.
One option would be to list all distinct permutations of the string 11...122...2...mm...m (n copies of each digit from 1 to m). There are nice algorithms for enumerating all distinct permutations of a string in time proportional to the length of the string per permutation, and they'd probably be a good way to go about solving this problem.
To use a simple recursive algorithm, give each recursion the permutation so far (variable perm) and the number of occurrences of each digit that are still available (array count).
The following snippet generates all unique permutations for n=2 and m=4 (set: 11223344).
function permutations(n, m) {
  var perm = "", count = [];                      // start with empty permutation
  for (var i = 0; i < m; i++) count[i] = n;       // set available number for each digit = n
  permute(perm, count);                           // start recursion with "" and [n,n,n...]

  function permute(perm, count) {
    var done = true;
    for (var i = 0; i < count.length; i++) {      // iterate over all digits
      if (count[i] > 0) {                         // more instances of digit i available
        var c = count.slice();                    // create hard copy of count array
        --c[i];                                   // decrement count of digit i
        permute(perm + (i + 1), c);               // add digit to permutation and recurse
        done = false;                             // digits left over: not the last step
      }
    }
    if (done) document.write(perm + "<BR>");      // no digits left: complete permutation
  }
}
permutations(2, 4);
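For readers who prefer to run this outside a browser, the same recursion can be sketched in Python. It collects the strings in a list instead of writing them to the page, decrements and restores a shared count list rather than copying it, and assumes m is at most 9 so that every digit is a single character.

```python
def permutations(n, m):
    """All strings containing exactly n copies of each digit 1..m."""
    results = []

    def permute(perm, count):
        if len(perm) == n * m:              # no digits left: complete permutation
            results.append(perm)
            return
        for digit in range(m):
            if count[digit] > 0:            # more instances of this digit available
                count[digit] -= 1
                permute(perm + str(digit + 1), count)
                count[digit] += 1           # restore before trying the next digit

    permute("", [n] * m)
    return results


print(permutations(2, 2))
print(len(permutations(2, 3)))
```

The number of results is (nm)! / (n!)^m, for example 6!/(2!*2!*2!) = 90 for n=2, m=3.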
You can also do this with DFS (or, alternatively, BFS). Define a graph in which each node holds one string, and a node is connected to every node whose string differs from it by swapping one pair of digits. This graph is connected, so traversing it from any node yields the set of all nodes, which contains all the strings we are looking for:
set generated_strings
list nodes
nodes.add(generateInitialString(N, M))
generated_strings.add(generateInitialString(N, M))
while (!nodes.empty())
    string tmp = nodes.remove(0)
    for (int i in [0, N * M))
        for (int j in distinct([0, N * M), i))
            string new = swap(tmp, i, j)
            if (!generated_strings.contains(new))
                nodes.add(new)
                generated_strings.add(new)
// generated_strings now contains all strings that can possibly be generated.
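A runnable Python sketch of this graph traversal (BFS here) might look as follows. The function name is hypothetical, and m is assumed to be at most 9 so that each digit is one character.

```python
from collections import deque


def strings_by_swaps(n, m):
    """Collect every string reachable from 11..122..2.. by swapping two positions."""
    start = "".join(str(d) * n for d in range(1, m + 1))
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for i in range(len(current)):
            for j in range(i + 1, len(current)):
                if current[i] == current[j]:
                    continue                      # swapping equal digits changes nothing
                chars = list(current)
                chars[i], chars[j] = chars[j], chars[i]
                neighbour = "".join(chars)
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
    return seen


print(sorted(strings_by_swaps(2, 2)))
```

This keeps every generated string in memory and tries O((nm)^2) swaps per string, so the direct recursion is usually the cheaper option.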

Remove numbers which are repeated several times in a row

I have a collection
def list = [4,1,1,1,3,5,1,1]
and I need to remove numbers which are repeated three times in a row. As a result I should get [4,3,5,1,1]. How can I do this in Groovy?
This can be done by copying the list while ensuring the two previous elements are not the same as the one to be copied. If they are, drop the two previous elements, otherwise copy as normal.
This can be implemented with inject like this:
def list = [4,1,1,1,3,5,1,1]
def result = list.drop(2).inject(list.take(2)) { result, element ->
def prefixSize = result.size() - 2
if ([element] * 2 == result.drop(prefixSize)) {
result.take(prefixSize)
} else {
result + element
}
}
assert result == [4,3,5,1,1]
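The same copy-and-check idea, sketched in Python for comparison (the function name is hypothetical). Like the inject version, it drops the previous two elements whenever the incoming element would complete a run of three.

```python
def remove_triples(items):
    result = []
    for x in items:
        if len(result) >= 2 and result[-1] == x and result[-2] == x:
            del result[-2:]          # x completes a triple: drop the two copies already kept
        else:
            result.append(x)
    return result


print(remove_triples([4, 1, 1, 1, 3, 5, 1, 1]))
```

Note that with this approach a run of four equal values leaves one element behind, and removing a triple can bring equal neighbours together so that a later element completes a new triple.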
You can count the distinct values among the next three elements and drop those elements when the count is 1:
def list = [4,1,1,1,3,5,1,1]
assert removeTriplets(list) == [4,3,5,1,1]
def removeTriplets(list) {
listCopy = [] + list
(list.size()-3).times { index ->
uniques = list[index..(index+2)].unique false
if (uniques.size() == 1)
listCopy = listCopy[0..(index-1)] + listCopy[(index+3)..-1]
}
listCopy
}
Another option is to use run-length encoding.
First, let's define a class which will hold our object and the number of times it occurs in a row:
class RleUnit {
def object
int runLength
RleUnit( object ) {
this( object, 1 )
}
RleUnit( object, int runLength ) {
this.object = object
this.runLength = runLength
}
RleUnit inc() {
new RleUnit( object, runLength + 1 )
}
String toString() { "$object($runLength)" }
}
We can then define a method which will encode a List into a List of RleUnit objects:
List<RleUnit> rleEncode( List list ) {
list.inject( [] ) { r, v ->
if( r && r[ -1 ].object == v ) {
r.take( r.size() - 1 ) << r[ -1 ].inc()
}
else {
r << new RleUnit( v )
}
}
}
And a method that takes a List of RleUnit objects, and unpacks it back to the original list:
List rleDecode( List<RleUnit> rle ) {
rle.inject( [] ) { r, v ->
r.addAll( [ v.object ] * v.runLength )
r
}
}
We can then encode the original list:
def list = [ 4, 1, 1, 1, 3, 5, 1, 1 ]
rle = rleEncode( list )
And filter this RleUnit list with the Groovy findAll method:
// remove all elements with a runLength of 3
noThrees = rle.findAll { it.runLength != 3 }
unpackNoThrees = rleDecode( noThrees )
assert unpackNoThrees == [ 4, 3, 5, 1, 1 ]
// remove all elements with a runLength of less than 3
threeOrMore = rle.findAll { it.runLength >= 3 }
unpackThreeOrMore = rleDecode( threeOrMore )
assert unpackThreeOrMore == [ 1, 1, 1 ]
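For comparison, the run-length approach can be sketched in Python with itertools.groupby, representing each run as a (value, run_length) tuple instead of an RleUnit object. The function names are hypothetical.

```python
from itertools import groupby


def rle_encode(items):
    """Collapse consecutive equal values into (value, run_length) tuples."""
    return [(value, len(list(run))) for value, run in groupby(items)]


def rle_decode(pairs):
    """Expand (value, run_length) tuples back into a flat list."""
    return [value for value, length in pairs for _ in range(length)]


rle = rle_encode([4, 1, 1, 1, 3, 5, 1, 1])
no_threes = rle_decode([(v, n) for v, n in rle if n != 3])
print(rle)
print(no_threes)
```

Filtering on the run length only removes runs of exactly three, so a run of four or more survives intact unless the condition is changed.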
