Modified longest common substring - string

Given two strings, what is an efficient algorithm to find the length and number of the longest common substrings, where two substrings are called common if:
1) at least x% of their characters are the same and in the same position;
2) the substrings have the same start and end indexes in both strings.
Example:
String 1 -> abedefkhj
String 2 -> kbfdfjhlo
Suppose x is 40; then the answer is
5 1
where 5 is the longest length and 1 is the number of substrings in each string satisfying the given property. The substring is "abede" in string 1 and "kbfdf" in string 2.

You can use something like the Levenshtein distance, but without deletions and insertions.
Build a table in which every element [i, j] is the error (the number of mismatched positions) for the substring from position i to position j.
foo(string a, string b, int x):    // x is used here as the maximum allowed percentage of mismatches
    n = min(a.length, b.length)
    for (end: [0 -> n-1]):
        mismatch = 0 if a[end] == b[end] else 1
        error[end][end] = mismatch
        for (start: [end-1 -> 0]):
            error[start][end] = error[start][end - 1] + mismatch
    best_len = 0
    best_pos = 0
    for (end: [0 -> n-1]):
        for (start: [end -> 0]):
            len = end - start + 1
            error_percent = 100 * error[start][end] / len
            if (error_percent <= x and len > best_len):
                best_len = len
                best_pos = start
    return (best_len, best_pos)
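A rough Python sketch of the same idea, in case it helps: prefix sums of matching positions can stand in for the full error table. The function name `longest_aligned_common` is made up for illustration, x is read here as the minimum percentage of matching characters (as in the question), and counting every qualifying start position is an assumption about what "number of substrings" means.

```python
def longest_aligned_common(a, b, x):
    """Return (length, count) of the longest aligned windows in which
    at least x percent of the positions hold equal characters."""
    n = min(len(a), len(b))
    # prefix[i] = number of positions p < i with a[p] == b[p]
    prefix = [0] * (n + 1)
    for i in range(n):
        prefix[i + 1] = prefix[i] + (a[i] == b[i])
    # Try window lengths from longest to shortest; stop at the first that works.
    for length in range(n, 0, -1):
        count = 0
        for start in range(n - length + 1):
            matches = prefix[start + length] - prefix[start]
            if 100 * matches >= x * length:
                count += 1
        if count:
            return length, count
    return 0, 0
```

This is still O(n^2) in the worst case, but it needs only O(n) extra memory instead of an n-by-n table.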

Related

Given a string s, find the length of the longest substring without repeating characters? (I need to find the bug in code I wrote)

Please help, as this is getting on my nerves: I can't figure out what I'm doing wrong, and I have tried tracing the code.
Link to problem: https://leetcode.com/problems/longest-substring-without-repeating-characters/
I created a solution using a sliding window. It works on most test cases, but fails for a few (such as "ad"). I can't figure out where the bug is. I basically keep track, in a dictionary, of the characters I've seen and the last index at which I saw each of them, which gets updated in a loop. I use two indices i and j; i gets updated when I find a repeated character. I return the max of the current max and the length of the current substring, which is j - i. Here is my code:
class Solution:
    def lengthOfLongestSubstring(self, s: str) -> int:
        if len(s) < 2:
            return len(s)
        m = 1
        i = 0
        j = 1
        d = {}
        d[s[0]] = 0
        while j < len(s):
            if s[j] in d and d[s[j]] >= i:
                m = max(m, j - i)
                i = j
            d[s[j]] = j
            j += 1
        return max(m, j - i - 1)
Why does this fail for some cases? Example:
"au"
Output
1
Expected
2
The last line should be return max(m, j - i). i is the index where the current window starts, and the last window runs from i to the end of the string, so its length is len(s) - i. Since the while loop ends when j = len(s), the length of the last substring is j - i, not j - i - 1.
We are also updating i incorrectly. Let's say s = "abcadf". In the while loop, when we see the second "a" (so j = 3), we should update i to 1, not 3, because in this case the longest substring will start with "b". So we should update i as i = d[s[j]] + 1. The final result:
class Solution:
    def lengthOfLongestSubstring(self, s: str) -> int:
        if len(s) < 2:
            return len(s)
        m = 1
        i = 0
        j = 1
        d = {}
        d[s[0]] = 0
        while j < len(s):
            if s[j] in d and d[s[j]] >= i:
                m = max(m, j - i)
                i = d[s[j]] + 1
            d[s[j]] = j
            j += 1
        return max(m, j - i)
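For comparison, here is a more compact sketch of the same sliding-window idea. It is only one possible way to write it; the standalone function name `length_of_longest_substring` and the variable names are chosen for illustration. Updating the best length on every step removes the need for a special case at the end of the string.

```python
def length_of_longest_substring(s):
    last_seen = {}   # character -> index of its most recent occurrence
    start = 0        # left edge of the current window
    best = 0
    for end, ch in enumerate(s):
        # A repeat only matters if the earlier copy is inside the window.
        if ch in last_seen and last_seen[ch] >= start:
            start = last_seen[ch] + 1
        last_seen[ch] = end
        best = max(best, end - start + 1)
    return best
```

Because the window length is checked after every character, inputs such as "au" or the empty string need no separate handling.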

How to find the lexicographically smallest string by reversing a substring?

I have a string S which consists of a's and b's. Perform the below operation once. Objective is to obtain the lexicographically smallest string.
Operation: Reverse exactly one substring of S
e.g.
if S = abab then Output = aabb (reverse ba of string S)
if S = abba then Output = aabb (reverse bba of string S)
My approach
Case 1: If all characters of the input string are same then output will be the string itself.
Case 2: if S is of the form aaaaaaa....bbbbbb.... then answer will be S itself.
otherwise: Find the first occurrence of b in S; say its position is i. String S will look like
aa...bbb...aaaa...bbbb....aaaa....bbbb....aaaaa...
     |
     i
In order to obtain the lexicographically smallest string, the substring that will be reversed starts at index i. See below for the possible endings j.
aa...bbb...aaaa...bbbb....aaaa....bbbb....aaaaa...
     |           |               |               |
     i           j               j               j
Reverse substring S[i:j] for every j and find the smallest string.
The complexity of the algorithm will be O(|S|*|S|) where |S| is the length of the string.
Is there a better way to solve this problem? Probably O(|S|) solution.
What I am thinking if we can pick the correct j in linear time then we are done. We will pick that j where number of a's is maximum. If there is one maximum then we solved the problem but what if it's not the case? I have tried a lot. Please help.
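When experimenting with faster ideas, a brute-force reference can be handy for checking results on short inputs. The sketch below simply tries every substring; the function name `smallest_after_one_reversal` is made up for illustration, and the running time is roughly cubic, so it is only meant as a baseline.

```python
def smallest_after_one_reversal(s):
    """Try every substring s[i..j], reverse it, and keep the smallest result."""
    best = s  # reversing a one-character substring leaves s unchanged
    n = len(s)
    for i in range(n):
        for j in range(i + 1, n):
            candidate = s[:i] + s[i:j + 1][::-1] + s[j + 1:]
            if candidate < best:
                best = candidate
    return best


print(smallest_after_one_reversal("abab"))  # aabb
print(smallest_after_one_reversal("abba"))  # aabb
```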
So, I came up with an algorithm that seems to be more efficient than O(|S|^2), but I'm not quite sure of its complexity. Here's a rough outline:
1. Strip off the leading a's, storing them in the variable start.
2. Group the rest of the string into letter chunks.
3. Find the indices of the groups with the longest sequences of a's.
4. If only one index remains, proceed to 10.
5. Filter these indices so that the length of the [first] group of b's after reversal is at a minimum.
6. If only one index remains, proceed to 10.
7. Filter these indices so that the length of the [first] group of a's (not including the leading a's) after reversal is at a maximum.
8. If only one index remains, proceed to 10.
9. Go back to 5, except inspect the [second/third/...] groups of a's and b's this time.
10. Return start, plus the reversed groups up to index, plus the remaining groups.
Since any substring that is being reversed begins with a b and ends in an a, no two hypothesized reversals are palindromes, and thus two reversals will not result in the same output, guaranteeing that there is a unique optimal solution and that the algorithm will terminate.
My intuition says this approach is probably O(log(|S|)*|S|), but I'm not too sure. An example implementation in Python (albeit not a very good one) is provided below.
from itertools import groupby
def get_next_bs(i, groups, off):
    d = 1 + 2*off
    before_bs = len(groups[i-d]) if i >= d else 0
    after_bs = len(groups[i+d]) if i <= d and len(groups) > i + d else 0
    return before_bs + after_bs
def get_next_as(i, groups, off):
    d = 2*(off + 1)
    return len(groups[d+1]) if i < d else len(groups[i-d])
def maximal_reversal(s):
    # example input: 'aabaababbaababbaabbbaa'
    first_b = s.find('b')
    start, rest = s[:first_b], s[first_b:]
    # 'aa', 'baababbaababbaabbbaa'
    groups = [''.join(g) for _, g in groupby(rest)]
    # ['b', 'aa', 'b', 'a', 'bb', 'aa', 'b', 'a', 'bb', 'aa', 'bbb', 'aa']
    try:
        max_length = max(len(g) for g in groups if g[0] == 'a')
    except ValueError:
        return s  # no a's after the start, no reversal needed
    indices = [i for i, g in enumerate(groups) if g[0] == 'a' and len(g) == max_length]
    # [1, 5, 9, 11]
    off = 0
    while len(indices) > 1:
        min_bs = min(get_next_bs(i, groups, off) for i in indices)
        indices = [i for i in indices if get_next_bs(i, groups, off) == min_bs]
        # off 0: [1, 5, 9], off 1: [5, 9], off 2: [9]
        if len(indices) == 1:
            break
        max_as = max(get_next_as(i, groups, off) for i in indices)
        indices = [i for i in indices if get_next_as(i, groups, off) == max_as]
        # off 0: [1, 5, 9], off 1: [5, 9]
        off += 1
    i = indices[0]
    groups[:i+1] = groups[:i+1][::-1]
    return start + ''.join(groups)
    # 'aaaabbabaabbabaabbbbaa'
TL;DR: Here's an algorithm that only iterates over the string once (with O(|S|)-ish complexity for limited string lengths). The example with which I explain it below is a bit long-winded, but the algorithm is really quite simple:
Iterate over the string, and update its value interpreted as a reverse (lsb-to-msb) binary number.
If you find the last zero of a sequence of zeros that is longer than the current maximum, store the current position, and the current reverse value. From then on, also update this value, interpreting the rest of the string as a forward (msb-to-lsb) binary number.
If you find the last zero of a sequence of zeros that is as long as the current maximum, compare the current reverse value with the current value of the stored end-point; if it is smaller, replace the end-point with the current position.
So you're basically comparing the value of the string if it were reversed up to the current point, with the value of the string if it were only reversed up to a (so-far) optimal point, and updating this optimal point on-the-fly.
Here's a quick code example; it could undoubtedly be coded more elegantly:
function reverseSubsequence(str) {
    var reverse = 0, max = 0, first, last, value, len = 0, unit = 1;
    for (var pos = 0; pos < str.length; pos++) {
        var digit = str.charCodeAt(pos) - 97;                   // read next digit
        if (digit == 0) {
            if (first == undefined) continue;                   // skip leading zeros
            if (++len > max || len == max && reverse < value) { // better endpoint found
                max = len;
                last = pos;
                value = reverse;
            }
        } else {
            if (first == undefined) first = pos;                // end of leading zeros
            len = 0;
        }
        reverse += unit * digit;                                // update reverse value
        unit <<= 1;
        value = value * 2 + digit;                              // update endpoint value
    }
    return {from: first || 0, to: last || 0};
}
var result = reverseSubsequence("aaabbaabaaabbabaaabaaab");
document.write(result.from + "→" + result.to);
(The code could be simplified by comparing reverse and value whenever a zero is found, and not just when the end of a maximally long sequence of zeros is encountered.)
You can create an algorithm that only iterates over the input once, and can process an incoming stream of unknown length, by keeping track of two values: the value of the whole string interpreted as a reverse (lsb-to-msb) binary number, and the value of the string with one part reversed. Whenever the reverse value goes below the value of the stored best end-point, a better end-point has been found.
Consider this string as an example:
aaabbaabaaabbabaaabaaab
or, written with zeros and ones for simplicity:
00011001000110100010001
We iterate over the leading zeros until we find the first one:
0001
^
This is the start of the sequence we'll want to reverse. We will start interpreting the stream of zeros and ones as a reversed (lsb-to-msb) binary number and update this number after every step:
reverse = 1, unit = 1
Then at every step, we double the unit and update the reverse number:
0001 reverse = 1
00011 unit = 2; reverse = 1 + 1 * 2 = 3
000110 unit = 4; reverse = 3 + 0 * 4 = 3
0001100 unit = 8; reverse = 3 + 0 * 8 = 3
At this point we find a one, and the sequence of zeros comes to an end. It contains 2 zeros, which is currently the maximum, so we store the current position as a possible end-point, and also store the current reverse value:
endpoint = {position = 6, value = 3}
Then we go on iterating over the string, but at every step, we update the value of the possible endpoint, but now as a normal (msb-to-lsb) binary number:
00011001 unit = 16; reverse = 3 + 1 * 16 = 19
    endpoint.value = 3 * 2 + 1 = 7
000110010 unit = 32; reverse = 19 + 0 * 32 = 19
    endpoint.value = 7 * 2 + 0 = 14
0001100100 unit = 64; reverse = 19 + 0 * 64 = 19
    endpoint.value = 14 * 2 + 0 = 28
00011001000 unit = 128; reverse = 19 + 0 * 128 = 19
    endpoint.value = 28 * 2 + 0 = 56
At this point we find that we have a sequence of 3 zeros, which is longer than the current maximum of 2, so we throw away the end-point we had so far and replace it with the current position and reverse value:
endpoint = {position = 10, value = 19}
And then we go on iterating over the string:
000110010001 unit = 256; reverse = 19 + 1 * 256 = 275
    endpoint.value = 19 * 2 + 1 = 39
0001100100011 unit = 512; reverse = 275 + 1 * 512 = 787
    endpoint.value = 39 * 2 + 1 = 79
00011001000110 unit = 1024; reverse = 787 + 0 * 1024 = 787
    endpoint.value = 79 * 2 + 0 = 158
000110010001101 unit = 2048; reverse = 787 + 1 * 2048 = 2835
    endpoint.value = 158 * 2 + 1 = 317
0001100100011010 unit = 4096; reverse = 2835 + 0 * 4096 = 2835
    endpoint.value = 317 * 2 + 0 = 634
00011001000110100 unit = 8192; reverse = 2835 + 0 * 8192 = 2835
    endpoint.value = 634 * 2 + 0 = 1268
000110010001101000 unit = 16384; reverse = 2835 + 0 * 16384 = 2835
    endpoint.value = 1268 * 2 + 0 = 2536
Here we find that we have another sequence with 3 zeros, so we compare the current reverse value with the end-point's value, and find that the stored endpoint has a lower value:
endpoint.value = 2536 < reverse = 2835
so we keep the end-point set to position 10 and we go on iterating over the string:
0001100100011010001 unit = 32768; reverse = 2835 + 1 * 32768 = 35603
    endpoint.value = 2536 * 2 + 1 = 5073
00011001000110100010 unit = 65536; reverse = 35603 + 0 * 65536 = 35603
    endpoint.value = 5073 * 2 + 0 = 10146
000110010001101000100 unit = 131072; reverse = 35603 + 0 * 131072 = 35603
    endpoint.value = 10146 * 2 + 0 = 20292
0001100100011010001000 unit = 262144; reverse = 35603 + 0 * 262144 = 35603
    endpoint.value = 20292 * 2 + 0 = 40584
And we find another sequence of 3 zeros, so we compare this position to the stored end-point:
endpoint.value = 40584 > reverse = 35603
and we find it has a smaller value, so we replace the possible end-point with the current position:
endpoint = {position = 21, value = 35603}
And then we iterate over the final digit:
00011001000110100010001 unit = 524288; reverse = 35603 + 1 * 524288 = 559891
    endpoint.value = 35603 * 2 + 1 = 71207
So at the end we find that position 21 gives us the lowest value, so it is the optimal solution:
00011001000110100010001 -> 00000010001011000100111
   ^                 ^
   start = 3         end = 21
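The comparison made at position 21 can be double-checked with Python's arbitrary-precision integers. This is only a verification sketch for the example string above; the helper name `value_if_reversed_to` is invented here and is not part of the algorithm.

```python
s = "00011001000110100010001"
first = 3  # index of the first 1, where every candidate reversal starts


def value_if_reversed_to(end, upto):
    """Binary value of positions first..upto after reversing s[first..end]."""
    return int(s[first:end + 1][::-1] + s[end + 1:upto + 1], 2)


# The two candidates that are compared when position 21 is reached.
keep_10 = value_if_reversed_to(10, 21)     # keep the end-point at position 10
move_to_21 = value_if_reversed_to(21, 21)  # move the end-point to position 21
print(move_to_21 < keep_10)  # True means position 21 is the better end-point

# The whole string after reversing positions 3..21.
result = s[:first] + s[first:22][::-1] + s[22:]
print(result)
```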
Here's a C++ version that uses a vector of bool instead of integers. It can parse strings longer than 64 characters, but the complexity is probably quadratic.
#include <string>
#include <vector>
struct range {unsigned int first; unsigned int last;};
range lexiLeastRev(std::string const &str) {
    unsigned int len = str.length(), first = 0, last = 0, run = 0, max_run = 0;
    std::vector<bool> forward(0), reverse(0);
    bool leading_zeros = true;
    for (unsigned int pos = 0; pos < len; pos++) {
        bool digit = str[pos] - 'a';
        if (!digit) {
            if (leading_zeros) continue;
            // vector<bool> comparison is lexicographic, i.e. msb-first
            if (++run > max_run || run == max_run && reverse < forward) {
                max_run = run;
                last = pos;
                forward = reverse;
            }
        }
        else {
            if (leading_zeros) {
                leading_zeros = false;
                first = pos;
            }
            run = 0;
        }
        forward.push_back(digit);
        reverse.insert(reverse.begin(), digit);  // O(n) insert at the front
    }
    return range {first, last};
}

total substrings with k ones

Given a binary string s, we need to find the number of its substrings, containing exactly k characters that are '1'.
For example: s = "1010" and k = 1, answer = 6.
Now, I solved it using binary search technique over the cumulative sum array.
I also used another approach to solve it. The approach is as follows:
For each position i, find the total number of substrings that end at i and contain exactly k characters that are '1'.
The substrings that end at i and contain exactly k '1's correspond to the set of indices j such that the substring from j to i contains exactly k '1's; the answer for position i is the size of that set. To find all such j for a given position i, we can rephrase the problem as finding all j such that
(number of ones from 1 to j - 1) = (number of ones from 1 to i) - (number of ones from j to i), where the number of ones from j to i is k,
i.e. number of ones from 1 to j - 1 = C[i] - k,
which is equal to
C[j - 1] = C[i] - k,
where C is the cumulative sum array, where
C[i] = sum of characters of string from 1 to i.
Now the problem is easy, because we can find all the possible values of j from this equation by counting all the prefixes that sum to C[i] - k.
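One way this prefix-counting idea might look in code is sketched below. The function name `count_by_prefix_lookup` and the dictionary `seen` are illustrative choices rather than anything from the original solution; `seen[c]` holds how many prefixes so far contain exactly c ones.

```python
def count_by_prefix_lookup(s, k):
    seen = {0: 1}   # the empty prefix contains zero ones
    ones = 0        # C[i]: number of ones in the prefix ending at i
    total = 0
    for ch in s:
        ones += ch == '1'
        # Every earlier prefix with C[j - 1] == C[i] - k gives one valid j.
        total += seen.get(ones - k, 0)
        seen[ones] = seen.get(ones, 0) + 1
    return total


print(count_by_prefix_lookup("1010", 1))  # 6
```

The lookup happens before the current prefix is recorded, so the case k = 0 does not count empty substrings.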
But I found this solution,
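#include <iostream>
#include <string>
#include <vector>
using namespace std;
int main() {
    int k;
    string S;
    cin >> k >> S;
    // C[c] = number of prefixes (including the empty one) that contain exactly c ones
    vector<long long> C(S.size() + 1, 0);
    long long s = 0, a = 0;
    C[0] = 1;
    for (int i = 0; S[i]; ++i) {
        s += S[i] == '1';
        ++C[s];
    }
    for (int i = k; i <= s; ++i) {
        if (k == 0) {
            a += (C[i] - 1) * C[i] / 2;
        } else {
            a += C[i] * C[i - k];
        }
    }
    cout << a << endl;
    return 0;
}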
In the code, S is the given string, k is as described above, C is the cumulative sum array and a is the answer.
I don't understand what exactly the code is doing with the multiplication.
Could anybody explain the algorithm?
If you look at the way C[i] is calculated, C[i] represents the number of characters from the i-th 1 up to, but not including, the (i+1)-st 1.
If you take an example S = 1001000
C[0] = 1
C[1] = 3 // length of 100
C[2] = 4 // length of 1000
So, coming to your question: why the multiplication?
Say k = 1; then you want to find the substrings which have only one 1. You know that after the first 1 there are two zeros, since C[1] = 3. So the number of substrings will be 3, because you have to include this 1.
{1,10,100}
But when you come to the second part: C[2] =4
now if you see 1000 and you know that you can make 4 substrings (which is equal to C[2])
{1,10,100,1000}
and also you should notice that there are C[1]-1 zeroes before this 1.
So by including those zeroes you can make more substring, in this case by including 0 once
0{1,10,100,1000}
=> {01,010,0100,01000}
and 00 once
00{1,10,100,1000}
=> {001,0010,00100,001000}
So essentially you are making C[i] substrings starting with this 1, and you can prepend m zeroes before it, where m varies from 1 to C[i-k]-1 (-1 because we do not want to include the previous 1), which makes another (C[i-k]-1) * C[i] substrings. In total:
((C[i-k]-1)* C[i]) +C[i]
=> C[i-k]*C[i]
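A small Python sketch of the same counting, for anyone who wants to try it on examples. The name `count_substrings_with_k_ones` and the list `counts` are illustrative; `counts[c]` plays the role of C[c] in the C++ code.

```python
def count_substrings_with_k_ones(s, k):
    # counts[c] = number of prefixes (including the empty one) with exactly c ones
    counts = [0] * (len(s) + 1)
    counts[0] = 1
    ones = 0
    for ch in s:
        ones += ch == '1'
        counts[ones] += 1
    total = 0
    for c in range(k, ones + 1):
        if k == 0:
            # choose two different prefixes with the same number of ones
            total += counts[c] * (counts[c] - 1) // 2
        else:
            # pair a prefix with c ones with an earlier prefix with c - k ones
            total += counts[c] * counts[c - k]
    return total


print(count_substrings_with_k_ones("1001000", 1))  # 15 = 3 * 1 + 4 * 3
```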

Find non-unique characters in a given string in O(n) time with constant space i.e with no extra auxiliary array

Given a string s containing only lower case alphabets (a - z), find (i.e print) the characters that are repeated.
For example, if string s = "aabcacdddec"
Output: a c d
Three approaches to this problem exist:
1. [brute force] Check every char of the string (i.e. compare s[i] with every other char and print it if both are the same)
   Time complexity: O(n^2)
   Space complexity: O(1)
2. [sort and then compare adjacent elements] After sorting (in O(n log n) time), traverse the string and check whether s[i] and s[i + 1] are equal
   Time complexity: O(n log n) + O(n) = O(n log n)
   Space complexity: O(1)
3. [store the character count in an array] Create an array of size 26 (to keep track of a - z) and, for every s[i], increment the value stored at index s[i] - 'a' in the array. Finally traverse the array and print all elements (i.e. 'a' + i) with a value greater than 1
   Time complexity: O(n)
   Space complexity: O(1), but we have a separate array for storing the frequency of each element.
Is there an O(n) approach that DOES NOT use any array/hash table/map (etc.)?
HINT: Use BIT Vectors
This is the element distinctness problem, so generally speaking - no there is no way to solve it in O(n) without extra space.
However, if you regard the alphabet as constant size (a-z characters only is pretty constant) you can either create a bitset of these characters, in O(1) space [ it is constant!] or check for each character in O(n) if it repeats more than once, it will be O(constant*n), which is still in O(n).
Pseudo code for 1st solution:
bit seen[] = new bit[SIZE_OF_ALPHABET] //constant!
bit printed[] = new bit[SIZE_OF_ALPHABET] //so is this!
for each i in seen.length: //init:
    seen[i] = 0
    printed[i] = 0
for each character c in string: //traverse the string:
    i = intValue(c)
    //already seen it and didn't print it? print it now!
    if seen[i] == 1 and printed[i] == 0:
        print c
        printed[i] = 1
    else:
        seen[i] = 1
Pseudo code for 2nd solution:
for each character c from a-z: //constant number of repeats is O(1)
    count = 0
    for each character x in the string: //O(n)
        if x==c:
            count += 1
    if count > 1:
        print c
Implementation in Java
public static void findDuplicate(String str) {
    int checker = 0;
    char c = 'a';
    for (int i = 0; i < str.length(); ++i) {
        int val = str.charAt(i) - c;
        if ((checker & (1 << val)) > 0) {
            System.out.println((char)(c+val));
        } else {
            checker |= (1 << val);
        }
    }
}
It uses an int as storage and bitwise operators to find the duplicates.
It is O(n); an explanation follows.
Input: "abddc"
i==0
STEP #1 : val = 97 - 97 (0); str.charAt(0) is 'a', and converting the char to int gives 97 (ASCII of 'a')
STEP #2 : 1 << val equals (1 << 0), which equals 1; finally 0 & 1 is 0
STEP #3 : checker = 0 | (1 << 0) equals 0 | 1 equals 1; checker is 1
i==1
STEP #1 : val = 98 - 97 (1); str.charAt(1) is 'b', and converting the char to int gives 98 (ASCII of 'b')
STEP #2 : 1 << val equals (1 << 1), which equals 2; finally 1 & 2 is 0
STEP #3 : checker = 1 | (1 << 1) equals 1 | 2 equals 3; checker is 3
i==2
STEP #1 : val = 100 - 97 (3); str.charAt(2) is 'd', and converting the char to int gives 100 (ASCII of 'd')
STEP #2 : 1 << val equals (1 << 3), which equals 8; finally 3 & 8 is 0
STEP #3 : checker = 3 | (1 << 3) equals 3 | 8 equals 11; checker is 11
i==3
STEP #1 : val = 100 - 97 (3); str.charAt(3) is 'd', and converting the char to int gives 100 (ASCII of 'd')
STEP #2 : 1 << val equals (1 << 3), which equals 8; finally 11 & 8 is 8
Now print 'd' since the value > 0
You can also use a bit vector; depending on the language, it would be space efficient. In Java I would prefer to use an int for this fixed (just 26 letters) case.
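A possible Python variant of the bitmask idea is sketched below. It uses a second integer so that each repeated letter is reported only once; the function name `find_duplicates` and the `reported` mask are additions made for illustration, and the input is assumed to contain only the letters a-z.

```python
def find_duplicates(s):
    seen = 0       # bit i is set once chr(ord('a') + i) has been seen
    reported = 0   # bit i is set once that letter has been reported
    result = []    # output only, at most 26 entries
    for ch in s:
        bit = 1 << (ord(ch) - ord('a'))
        if (seen & bit) and not (reported & bit):
            result.append(ch)
            reported |= bit
        seen |= bit
    return result


print(find_duplicates("aabcacdddec"))  # ['a', 'c', 'd']
```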
The size of the character set is a constant, so you could scan the input 26 times. All you need is a counter to store the number of times you've seen the character corresponding to the current iteration. At the end of each iteration, print that character if your counter is greater than 1.
It's O(n) in runtime and O(1) in auxiliary space.
Implementation in C# (recursive solution)
static void getNonUniqueElements(string s, string nonUnique)
{
    if (s.Length > 0)
    {
        char ch = s[0];
        s = s.Substring(1);
        // >= 0: the repeat may be the very first remaining character
        if (s.LastIndexOf(ch) >= 0)
        {
            if (nonUnique.LastIndexOf(ch) < 0)
                nonUnique += ch;
        }
        getNonUniqueElements(s, nonUnique);
    }
    else
    {
        Console.WriteLine(nonUnique);
        return;
    }
}
static void Main(string[] args)
{
    getNonUniqueElements("aabcacdddec", "");
    Console.ReadKey();
}

Counting substring that begin with character 'A' and ends with character 'X'

PYTHON QN:
Using just one loop, how do I devise an algorithm that counts the number of substrings that begin with the character A and end with the character X? For example, given the input string CAXAAYXZA, there are four substrings that begin with A and end with X, namely: AX, AXAAYX, AAYX, and AYX.
For example:
>>>count_substring('CAXAAYXZA')
4
Since you didn't specify a language, I'm doing it in C++-ish code:
int count_substring(string s)
{
    int inc = 0;              // number of 'A's seen so far
    int substring_count = 0;
    for(int i = 0;i < s.length();i++)
    {
        if(s[i] == 'A') inc++;
        if(s[i] == 'X') substring_count += inc;  // every earlier 'A' pairs with this 'X'
    }
    return substring_count;
}
and in Python
def count_substring(s):
    inc = 0
    substring_count = 0
    for c in s:
        if(c == 'A'): inc = inc + 1
        if(c == 'X'): substring_count = substring_count + inc
    return substring_count
First count number of "A" in the string
Then count "X" in the string
using
Public Function CountCharacter(ByVal value As String, ByVal ch As Char) As Integer
    Dim cnt As Integer = 0
    For Each c As Char In value
        If c = ch Then cnt += 1
    Next
    Return cnt
End Function
then take each "A" as a start position and "X" as an end position and get the substring. Do this for each "X" and then start with second "A" and run that for "X" count times. Repeat this and you will get all the substrings starting with "A" and ending with "X".
Just another solution in Python (note that it uses two loops and returns the substrings themselves rather than their count):
def count_substring(s):
    length = len(s) + 1
    found = []
    for i in range(0, length):
        for j in range(i+1, length):
            if s[i] == 'A' and s[j-1] == 'X':
                found.append(s[i:j])
    return found
string = 'CAXAAYXZA'
print(count_substring(string))
Output:
['AX', 'AXAAYX', 'AAYX', 'AYX']

Resources