Longest Common Substring non-DP solution with O(m*n)

The definition of the problem is:
Given two strings, find the longest common substring.
Return the length of it.
I was solving this problem and I think I solved it with O(m*n) time complexity. However, when I look up the solution, everything I find describes dynamic programming as the optimal solution, and I don't understand why - http://www.geeksforgeeks.org/longest-common-substring/
Here's my solution, you can test it here: http://www.lintcode.com/en/problem/longest-common-substring/
int longestCommonSubstring(string &A, string &B) {
    int ans = 0;
    for (int i=0; i<A.length(); i++) {
        int counter = 0;
        int k = i;
        for (int j=0; j<B.length() && k <A.length(); j++) {
            if (A[k]!=B[j]) {
                counter = 0;
                k = i;
            } else {
                k++;
                counter++;
                ans = max(ans, counter);
            }
        }
    }
    return ans;
}
My idea is simple, start from the first position of string A and see what's the longest substring I can match with string B, then start from the second position of string A and see what's the longest substring I can match....
Is there something wrong with my solution? Or is it not O(m*n) complexity?

Good news: your algorithm is O(mn). Bad news: it doesn't work correctly.
Your inner loop is wrong: it's intended to find the longest initial substring of A[i:] in B, but it works like this:
j = 0
While j < len(B)
    Match as much of A[i:] against B[j:]. Call it s.
    Remember s if it's the longest so far found.
    j += len(s)
This fails to find the longest match. For example, when A = "XXY" and B = "XXXY" and i=0 it'll find "XX" as the longest match instead of the complete match "XXY".
Here's a runnable version of your code (lightly transcribed into C) that shows the faulty result:
#include <string.h>
#include <stdio.h>
int lcs(const char* A, const char* B) {
    int al = strlen(A);
    int bl = strlen(B);
    int ans = 0;
    for (int i=0; i<al; i++) {
        int counter = 0;
        int k = i;
        for (int j=0; j<bl && k<al; j++) {
            if (A[k]!=B[j]) {
                counter = 0;
                k = i;
            } else {
                k++;
                counter++;
                if (counter >= ans) ans = counter;
            }
        }
    }
    return ans;
}
int main(int argc, char**argv) {
    printf("%d\n", lcs("XXY", "XXXY"));
    return 0;
}
Running this program outputs "2".
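For comparison, here is a minimal sketch of the usual O(m*n) dynamic programming approach that the question links to. The function name longestCommonSubstringDP is only illustrative, and this version keeps just two rows of the table rather than the full matrix:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Length of the longest common substring of A and B.
// prev[j] is the length of the longest common suffix of the first i-1
// characters of A and the first j characters of B; cur[j] is the same
// quantity for the first i characters of A.
int longestCommonSubstringDP(const std::string& A, const std::string& B) {
    std::vector<int> prev(B.size() + 1, 0), cur(B.size() + 1, 0);
    int ans = 0;
    for (std::size_t i = 1; i <= A.size(); ++i) {
        for (std::size_t j = 1; j <= B.size(); ++j) {
            if (A[i - 1] == B[j - 1]) {
                cur[j] = prev[j - 1] + 1;  // extend the match ending at (i-1, j-1)
                ans = std::max(ans, cur[j]);
            } else {
                cur[j] = 0;                // no common suffix ends here
            }
        }
        std::swap(prev, cur);
    }
    return ans;
}
```

With A = "XXY" and B = "XXXY" this returns 3, because every alignment of the two strings is considered rather than only the ones reached by a single left-to-right scan of B.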

Your solution is O(nm) complexity, and if you compare its structure to the provided algorithm it's exactly the same; however, yours does not memoize.
One advantage of the dynamic programming algorithm in the link is that, within the same complexity class, it can recall different substring lengths in O(1); otherwise, it looks good to me.
This kind of thing will happen from time to time, because storing subproblem solutions does not always result in a better run time (on the first call) and can result in the same complexity class instead (e.g. try to compute the nth Fibonacci number with a dynamic solution and compare that to a tail-recursive solution). Note that in this case, like your case, after the array is filled the first time it's faster to return an answer on each successive call.

Related

CS50 Plurality – having trouble understanding why candidate_count is used

I'm trying to understand why candidate_count is used instead of voter count in CS50's Plurality (week 3). Below is my code.
If we imagine we have three candidates (Alice, Bob, Charlie) and every time we iterate through the bool function or the print_winner function, would we not miss out on counting votes if we had something like 10 voters? According to my understanding, 'i' would only ever iterate 3 times. I'm having a conceptual issue in understanding why we wouldn't use voter_count instead in the print winner function at the bottom.
I'm still trying to refine my code a bit, so parts may still be buggy. I'm just looking for some help in clarifying the logic in this problem.
#include <cs50.h>
#include <stdio.h>
#include <string.h>
// Max number of candidates
#define MAX 9
// Candidates have name and vote count
typedef struct
{
string name;
int votes;
}
candidate;
// Array of candidates
candidate candidates[MAX];
// Number of candidates
int candidate_count;
// Function prototypes
bool vote(string name);
void print_winner(void);
int main(int argc, string argv[])
{
// Check for invalid usage
if (argc < 2)
{
printf("Usage: plurality [candidate ...]\n");
return 1;
}
// Populate array of candidates (number of arguments - 1 because the first arg is going to be plurality)
candidate_count = argc - 1;
if (candidate_count > MAX)
{
printf("Maximum number of candidates is %i\n", MAX);
return 2;
}
for (int i = 0; i < candidate_count; i++)
{
candidates[i].name = argv[i + 1]; //
candidates[i].votes = 0;
}
int voter_count = get_int("Number of voters: ");
// Loop over all voters
for (int i = 0; i < voter_count; i++)
{
string name = get_string("Vote: ");
// Check for invalid vote
if (!vote(name))
{
printf("Invalid vote.\n");
}
}
// Display winner of election
print_winner();
}
// Update vote totals given a new vote
bool vote(string name)
{
for (int i = 0; i < candidate_count; i++)
{
if (strcmp(candidates[i].name, name) == 0)
{
candidates[i].votes++;
return true;
}
}
return false;
}
// Print the winner (or winners) of the election
void print_winner(void)
{
int maxvotes = 0;
for (int i = 0; i < candidate_count; i++)
{
if (candidates[i].votes > maxvotes)
{
maxvotes = candidates[i].votes;
}
}
for (int i = 0; i < candidate_count; i++)
{
printf("the winner is %s\n!", candidates[i].name);
}
return;
}

I think you misunderstand what we're looping through. candidate_count is used to loop through candidates which is an array with candidate_count number of elements.
Consider this simple array-
int arr[] = {1, 2, 3, 4, 5};
We would loop through this using-
for (int i = 0; i < 5; i++)
{
printf("value at index %d is %d\n", i, arr[i]);
}
Notice the part i < 5. Why 5? Because that's the length of the array arr; anything more and we'd be reading out of bounds, anything less and we would not be reading the entire arr.
Now replace arr with candidates, which is also an array and replace 5 with candidate_count, which is obviously the length of said array.
The loops in vote or print_winner functions do not "count votes", they iterate through the array of candidates. That's the purpose of those loops. Hence, to iterate through an array, we must use index < length_of_array. That's exactly what it does.
Just to address the concern of "where does it count votes then?". Let's look at vote real quick-
bool vote(string name)
{
for (int i = 0; i < candidate_count; i++)
{
if (strcmp(candidates[i].name, name) == 0)
{
candidates[i].votes++;
return true;
}
}
return false;
}
This function is triggered after user input, so let's say the user is asked for a name - they pick some name, assume this name matches with a candidate. So vote is called, it iterates through the array of candidates. Why does it iterate? It needs to find the candidate who has the same name as the name just provided by the user.
So the purpose of this loop, is to iterate through the array candidates, and the purpose of the code within this loop is to compare the names of each candidate to the name provided by the user. If it matches, it increments the votes of that candidate.
This line- candidates[i].votes++;
That is where the votes are counted. It's as simple as incrementing a counter.
Imagine a real life scenario, you're the program. The candidates are all standing in a line (an array). Each candidate starts with 0 cards (votes). Assume none of them have the same name.
A user tells you that they wanna vote for candidate foo (just a name).
You then go to that line of candidates. You start by the first candidate and ask them "what is your name?".
The candidate tells you their name.
If their name matches with the name the user gave you, this is it, you give them a card representing the vote. Now they have 1 more card than they had before
If their name does not match, you move to the next candidate and repeat.
But where do you stop? Simple, you stop once you've asked the last candidate. How many iterations could this take at most? candidate_count, i.e. the number of candidates standing in that line. So, worst case scenario, the person you're looking for is at the very end of the line, so you have to ask candidate_count number of people before finally finding the one you're looking for. Traversing an array.
In the end, you just count how many cards each candidate has to realize who's the winner. (.votes)
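The card analogy above can be sketched in C++ as follows. The Candidate struct and the tally helper are hypothetical stand-ins for the CS50 globals, not part of the problem set; the point is only that the search loop runs over candidates while the outer loop runs once per ballot:

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the CS50 candidate struct.
struct Candidate {
    std::string name;
    int votes;
};

// Mirrors vote() from the question: walk the line of candidates looking
// for a matching name, and hand that candidate one more card.
bool vote(std::vector<Candidate>& candidates, const std::string& name) {
    for (Candidate& c : candidates) {
        if (c.name == name) {
            ++c.votes;
            return true;
        }
    }
    return false;
}

// Calls vote() once per ballot (this is the voter_count loop in main)
// and returns how many ballots named a real candidate.
int tally(std::vector<Candidate>& candidates,
          const std::vector<std::string>& ballots) {
    int valid = 0;
    for (const std::string& ballot : ballots) {
        if (vote(candidates, ballot)) ++valid;
    }
    return valid;
}
```

With three candidates and ten ballots, tally calls vote ten times, and each call looks at no more than three candidates, so no ballot is missed.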
The print_winner function should do the same thing. Though your code seems a bit crooked.
for (int i = 0; i < candidate_count; i++)
{
if (candidates[i].votes > maxvotes)
{
maxvotes = candidates[i].votes;
}
}
This loop is the equivalent of going through the whole line of candidates, asking each one how many cards (votes) they have and finding out who has the maximum number of cards (votes).
But you also need to ask the name of the person with the maxvotes! That's what you're looking for.
But you completely ditch the maxvotes you just calculated later, you never use it.
for (int i = 0; i < candidate_count; i++)
{
printf("the winner is %s\n!", candidates[i].name);
}
That loop will print the name of every single candidate. It's the equivalent of going through the whole line of candidates (array), asking each one "what is your name?" and shouting out "The winner is [their name]!". But that's not true!
You should store the name of the candidate, as well as maxvotes and then print it after the first loop-
string winner_name;
...
for (int i = 0; i < candidate_count; i++)
{
    if (candidates[i].votes > maxvotes)
    {
        maxvotes = candidates[i].votes;
        winner_name = candidates[i].name;
    }
}
// Print the stored name; i is no longer in scope after the loop
printf("the winner is %s!\n", winner_name);
Of course, this does not take care of multiple winners, what if 2 candidates have the same number of votes? That's something you'll have to try yourself.
But I hope this answers your question about the confusion.

Optimal algorithm for this string decompression

I have been working on an exercise from Google's dev tech guide called Compression and Decompression. You can check the following link for the description of the problem: Challenge Description.
Here is my code for the solution:
public static String decompressV2 (String string, int start, int times) {
String result = "";
for (int i = 0; i < times; i++) {
inner:
{
for (int j = start; j < string.length(); j++) {
if (isNumeric(string.substring(j, j + 1))) {
String num = string.substring(j, j + 1);
int times2 = Integer.parseInt(num);
String temp = decompressV2(string, j + 2, times2);
result = result + temp;
int next_j = find_next(string, j + 2);
j = next_j;
continue;
}
if (string.substring(j, j + 1).equals("]")) { // If it is a closing bracket
break inner;
}
result = result + string.substring(j,j+1);
}
}
}
return result;
}
public static int find_next(String string, int start) {
int count = 0;
for (int i = start; i < string.length(); i++) {
if (string.substring(i, i+1).equals("[")) {
count= count + 1;
}
if (string.substring(i, i +1).equals("]") && count> 0) {
count = count- 1;
continue;
}
if (string.substring(i, i +1).equals("]") && count== 0) {
return i;
}
}
return -111111;
}
I will explain a little bit about the inner workings of my approach. It is a basic solution that uses simple recursion and loops.
So, let's start from the beginning with a simple decompression:
DevTech.decompressV2("2[3[a]b]", 0, 1);
As you can see, the 0 indicates that it has to iterate over the string starting at index 0, and the 1 indicates that the string has to be evaluated only once: 1[ 2[3[a]b] ]
The core idea is that every time you encounter a number you call the algorithm again (recursively) and then continue from where the string inside its brackets ends; that's what the find_next function is for.
When it finds a closing bracket, the inner loop breaks; that's how I chose to signal the stop.
I think that is the main idea behind the algorithm; if you read the code closely you'll get the full picture.
So here are some of my concerns about the way I've written the solution:
I could not find a cleaner way to tell the algorithm where to go next when it finds a number, so I kind of hardcoded it with the find_next function. Is there a cleaner way to do this inside the decompress function?
About performance: it wastes a lot of time doing the same work again when there is a number bigger than 1 at the beginning of a bracket.
I am relatively new to programming, so maybe this code also needs improvement not in the idea but in the way it's written. I would be very grateful for some suggestions.
This is the approach I figured out, but I am sure there are others. I could not think of any, so it would be great if you could share your ideas.
The description lists some things you should be aware of when developing a solution: handling non-repeated strings, handling repetitions inside, not doing the same job twice, and not copying too much. Are these covered by my approach?
The last point is about test cases. I know that confidence is very important when developing solutions, and the best way to gain confidence in an algorithm is test cases. I tried a few and they all worked as expected, but what techniques do you recommend for developing test cases? Is there any software for this?
That's all. I am new to the community, so I am open to suggestions on how to improve the quality of the question. Cheers!

Your solution involves a lot of string copying that really slows it down. Instead of returning strings that you concatenate, you should pass a StringBuilder into every call and append substrings onto that.
That means you can use your return value to indicate the position to continue scanning from.
You're also parsing repeated parts of the source string more than once.
My solution looks like this:
public static String decompress(String src)
{
StringBuilder dest = new StringBuilder();
_decomp2(dest, src, 0);
return dest.toString();
}
private static int _decomp2(StringBuilder dest, String src, int pos)
{
int num=0;
while(pos < src.length()) {
char c = src.charAt(pos++);
if (c == ']') {
break;
}
if (c>='0' && c<='9') {
num = num*10 + (c-'0');
} else if (c=='[') {
int startlen = dest.length();
pos = _decomp2(dest, src, pos);
if (num<1) {
// 0 repetitions -- delete it
dest.setLength(startlen);
} else {
// copy output num-1 times
int copyEnd = startlen + (num-1) * (dest.length()-startlen);
for (int i=startlen; i<copyEnd; ++i) {
dest.append(dest.charAt(i));
}
}
num=0;
} else {
// regular char
dest.append(c);
num=0;
}
}
return pos;
}
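The same idea (append into one shared buffer and return the position where scanning should continue) can be sketched in C++ as well. This is a port under the same assumptions as the Java version, with the hypothetical names decompress and decompressInto; it assumes well-formed input such as "2[3[a]b]":

```cpp
#include <string>

// Appends the expansion of src, starting at pos, to dest. Returns the index
// just past the ']' that closed this level (or src.size() at the top level).
static std::size_t decompressInto(std::string& dest, const std::string& src,
                                  std::size_t pos) {
    std::size_t num = 0;  // repeat count collected so far
    while (pos < src.size()) {
        char c = src[pos++];
        if (c == ']') break;
        if (c >= '0' && c <= '9') {
            num = num * 10 + static_cast<std::size_t>(c - '0');
        } else if (c == '[') {
            std::size_t start = dest.size();
            pos = decompressInto(dest, src, pos);  // expand the block once
            if (num == 0) {
                dest.resize(start);                // zero repetitions: drop it
            } else {
                // Copy the already expanded block num-1 more times.
                std::size_t blockLen = dest.size() - start;
                std::size_t copyEnd = start + (num - 1) * blockLen;
                for (std::size_t i = start; i < copyEnd; ++i) {
                    dest.push_back(dest[i]);
                }
            }
            num = 0;
        } else {
            dest.push_back(c);                     // regular character
            num = 0;
        }
    }
    return pos;
}

std::string decompress(const std::string& src) {
    std::string dest;
    decompressInto(dest, src, 0);
    return dest;
}
```

On the question about test cases: a small table of inputs and expected outputs covering plain text, nesting, multi-digit counts, and a zero count exercises each branch of the loop at least once, for example "abc", "2[3[a]b]", "10[a]", and "0[x]y".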

I would try to return a tuple that also contains the next index where decompression should continue from. Then we can have a recursion that concatenates the current part with the rest of the block in the current recursion depth.
Here's JavaScript code. It takes some thought to encapsulate the order of operations that reflects the rules.
function f(s, i=0){
if (i == s.length)
return ['', i];
// We might start with a multiplier
let m = '';
while (!isNaN(s[i]))
m = m + s[i++];
// If we have a multiplier, we'll
// also have a nested expression
if (s[i] == '['){
let result = '';
const [word, nextIdx] = f(s, i + 1);
for (let j=0; j<Number(m); j++)
result = result + word;
const [rest, end] = f(s, nextIdx);
return [result + rest, end]
}
// Otherwise, we may have a word,
let word = '';
while (isNaN(s[i]) && s[i] != ']' && i < s.length)
word = word + s[i++];
// followed by either the end of an expression
// or another multiplier
const [rest, end] = s[i] == ']' ? ['', i + 1] : f(s, i);
return [word + rest, end];
}
var strs = [
'2[3[a]b]',
'10[a]',
'3[abc]4[ab]c',
'2[2[a]g2[r]]'
];
for (const s of strs){
console.log(s);
console.log(JSON.stringify(f(s)));
console.log('');
}

Leetcode--3 find the longest substring without repeating character

The target is simple: find the longest substring without repeating characters. Here is the code:
class Solution {
public:
int lengthOfLongestSubstring(string s) {
int ans = 0;
int dic[256];
memset(dic, -1, sizeof(dic));
int len = s.size();
int idx = -1;
for (int i = 0;i < len;i++) {
char c = s[i];
if (dic[c] > idx)
idx = dic[c];
ans = max(ans, i - idx);
dic[c] = i;
}
return ans;
}
};
Given how concise it is, I think this is a high-performance method, and its time complexity is just O(n). But I'm confused about how it works, even after coming up with some examples. Can anyone give me some tips or ideas?

What it is doing is recording the position where each character was last seen.
As you step through, it takes each newly encountered character: the non-repeating run ending at the current index can reach back at most to just after that character's last-seen position, and for future indices it can't go back any further either, as we have now seen a duplicate.
So we maintain in idx the position of the most recent (highest-index) earlier copy of any duplicated character; the candidate for the longest non-duplicating sequence starts just after it.
I'm certain that the ans = max() code could be optimised slightly, as after encountering a new duplicate, you have to go forward at least ans chars from the start of that duplicate before ans can be improved again. You still need to do the rest of the work maintaining dic and idx, but you could avoid that particular test for ans for a few iterations. You would have to do a lot of unrolling to benefit, though.
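To make the bookkeeping easier to follow, here is the same algorithm as a standalone sketch with more descriptive names. The names lastSeen and windowStart are mine (they correspond to dic and idx in the question), and the cast to unsigned char is an added precaution so that characters above 127 cannot produce a negative array index:

```cpp
#include <algorithm>
#include <array>
#include <string>

int lengthOfLongestSubstring(const std::string& s) {
    std::array<int, 256> lastSeen;  // last index at which each byte occurred
    lastSeen.fill(-1);
    int windowStart = -1;           // index just before the current window
    int best = 0;
    for (int i = 0; i < static_cast<int>(s.size()); ++i) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        // If c already occurs inside the window, the window must now
        // begin just after that earlier occurrence.
        windowStart = std::max(windowStart, lastSeen[c]);
        // The window covers the indices windowStart+1 .. i.
        best = std::max(best, i - windowStart);
        lastSeen[c] = i;
    }
    return best;
}
```

For "abba", the window start moves to index 1 when the second 'b' is read; when the final 'a' arrives, its earlier occurrence at index 0 is already outside the window, so the start stays at 1 and the answer is 2.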

Remove occurrences of substring recursively

Here's a problem:
Given string A and a substring B, remove the first occurrence of substring B in string A for as long as it is possible to do so. Note that removing a substring can create a new occurrence of the same substring. Ex. removing 'hell' from 'hehelllloworld' once would yield 'helloworld', which after removing once more would become 'oworld', the desired string.
Write a program for the above for input constraints of length 10^6 for A, and length 100 for B.
This question was asked to me in an interview. I gave them a simple algorithm that does exactly what the statement says, removing iteratively (to decrease the overhead of calls). I later came to know there's a better solution that's much faster. What would it be? I've thought of a few optimizations, but it's still not as fast as the fastest solution for the problem (according to the company), so can anyone tell me a faster way to solve the problem?
P.S. I know the Stack Overflow rules and that having code is better, but for this problem I don't think that having code would be in any way beneficial...

Your approach has a pretty bad complexity. In a very bad case the string a will be aaaaaaaaabbbbbbbbb and the string b will be ab, in which case you will need O(|a|) searches, each taking O(|a| + |b|) (assuming a sophisticated search algorithm), for a total complexity of O(|a|^2 + |a| * |b|). With their constraints that is on the order of 10^12 operations, which is far too slow.
For their constraints a good complexity to aim for would be O(|a| * |b|), which is around 100 million operations and will finish in under a second. Here's one way to approach it. For each position i in the string a let's compute the largest length n_i such that a[i - n_i : i] = b[0 : n_i] (in other words, the longest suffix of a at that position which is a prefix of b). We can compute it in O(|a| + |b|) by using the Knuth-Morris-Pratt algorithm.
After we have n_i computed, finding the first occurrence of b in a is just a matter of finding the first n_i that is equal to |b|. This will be the right end of one of the occurrences of b in a.
Finally, we will need to modify Knuth-Morris-Pratt slightly. We will be logically removing occurrences of b as soon as we compute an n_i that is equal to |b|. To account for the fact that some letters were removed from a we will rely on the fact that Knuth-Morris-Pratt only relies on the last value of n_i (and those computed for b), and the current letter of a, so we just need a fast way of retrieving the last value of n_i after we logically remove an occurrence of b. That can be done with a deque, that stores all the valid values of n_i. Each value will be pushed into the deque once, and popped from it once, so that complexity of maintaining it is O(|a|), while the complexity of the Knuth-Morris-Pratt is O(|a| + |b|), resulting in O(|a| + |b|) total complexity.
Here's a C++ implementation. It could have some off-by-one errors, but it works on your sample, and it flies for the worst case that I described at the beginning.
#include <deque>
#include <string>
#include <iostream>
#include <vector>
#include <algorithm>
using namespace std;
int main() {
string a, b;
cin >> a >> b;
size_t blen = b.size();
// make a = b$a
a = b + "$" + a;
vector<size_t> n(a.size()); // array for knuth-morris-pratt
vector<bool> removals(a.size()); // positions of right ends at which we remove `b`s
deque<size_t> lastN;
n[0] = 0;
// For the first blen + 1 iterations just do vanilla knuth-morris-pratt
for (size_t i = 1; i < blen + 1; ++ i) {
size_t z = n[i - 1];
while (z && a[i] != a[z]) {
z = n[z - 1];
}
if (a[i] != a[z]) n[i] = 0;
else n[i] = z + 1;
lastN.push_back(n[i]);
}
// For the remaining iterations some characters could have been logically
// removed from `a`, so use lastN to get the last value of n instead
// of actually getting it from `n[i - 1]`
for (size_t i = blen + 1; i < a.size(); ++ i) {
size_t z = lastN.back();
while (z && a[i] != a[z]) {
z = n[z - 1];
}
if (a[i] != a[z]) n[i] = 0;
else n[i] = z + 1;
if (n[i] == blen) // found a match
{
removals[i] = true;
// kill last |b| - 1 `n_i`s
for (size_t j = 0; j < blen - 1; ++ j) {
lastN.pop_back();
}
}
else {
lastN.push_back(n[i]);
}
}
string ret;
size_t toRemove = 0;
for (size_t pos = a.size() - 1; a[pos] != '$'; -- pos) {
if (removals[pos]) toRemove += blen;
if (toRemove) -- toRemove;
else ret.push_back(a[pos]);
}
reverse(ret.begin(), ret.end());
cout << ret << endl;
return 0;
}
[in] hehelllloworld
[in] hell
[out] oworld
[in] abababc
[in] ababc
[out] ab
[in] caaaaa ... aaaaaabbbbbb ... bbbbc
[in] ab
[out] cc
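The same technique can also be packaged as a function, which makes it easier to test. The sketch below is a variation rather than a transcription of the program above: it keeps the surviving characters and their Knuth-Morris-Pratt states on two parallel stacks instead of prepending b and a "$" separator to a. The name removeAll is hypothetical.

```cpp
#include <string>
#include <vector>

// Repeatedly removes the first occurrence of b from a until none remains.
std::string removeAll(const std::string& a, const std::string& b) {
    if (b.empty()) return a;
    // Prefix function of b (the standard Knuth-Morris-Pratt table).
    std::vector<std::size_t> pi(b.size(), 0);
    for (std::size_t i = 1; i < b.size(); ++i) {
        std::size_t k = pi[i - 1];
        while (k > 0 && b[i] != b[k]) k = pi[k - 1];
        if (b[i] == b[k]) ++k;
        pi[i] = k;
    }
    std::string out;                 // surviving characters, used as a stack
    std::vector<std::size_t> state;  // state[j]: prefix of b matched after out[j]
    for (char c : a) {
        // Resume from the state of the last surviving character.
        std::size_t k = state.empty() ? 0 : state.back();
        while (k > 0 && c != b[k]) k = pi[k - 1];
        if (c == b[k]) ++k;
        out.push_back(c);
        state.push_back(k);
        if (k == b.size()) {         // out now ends with b: pop the whole match
            out.resize(out.size() - b.size());
            state.resize(state.size() - b.size());
        }
    }
    return out;
}
```

Each character of a is pushed once and popped at most once, so the total work stays O(|a| + |b|).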

O(n) algorithm for constructing suffix table in boyer-moore string matching algorithm

I want to implement the Boyer-Moore algorithm, but I'm stuck on constructing the good suffix table, which I think should have O(n) complexity; I only found an O(n^2) algorithm.
So do you guys have a clue for me?
Please don't give me code snippets, I can google it if I want, but I prefer to solve it in my way, I just need a clue.

There is a fast algorithm that uses the prefix function.
The prefix function of a string s is an array p, where p[i] is the length of the longest proper prefix of the substring s[0..i] (0-indexed) that is also a suffix of it.
It can be calculated with O(n) complexity using KMP, which relies on 2 facts:
p[i+1]<=p[i]+1.
For each i, if s[p[i]]==s[i+1], then p[i+1] = p[i] + 1. Otherwise, we should try a shorter string of length j for which s[0...j-1]==s[i-j+1...i]. Since we want the longest such string, we just jump to j = p[j-1] (starting from j = p[i]) and compare again.
The algorithm (c++):
vector<int> prefix (string s)
{
    int n=s.length();
    vector<int> pi(n);
    pi[0]=0;
    for (int i=1; i<n; ++i)
    {
        int j = pi[i-1];
        while (j>0 && s[i]!=s[j])
            j=pi[j-1];
        if (s[i]==s[j]) ++j;
        pi[i]=j;
    }
    return pi;
}
Now we can construct the suffix table:
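// m is the pattern length; pi is 0-indexed, so pi[i-1] is the value
// for the prefix of length i
int m = pattern.length();
vector<int> suffshift(m + 1);
vector<int> pi = prefix(pattern);
vector<int> pi1 = prefix(string(pattern.rbegin(), pattern.rend()));
for (int j=0; j<=m; ++j)
{
    suffshift[j] = m - pi[m-1];
}
for (int i=1; i<=m; ++i)
{
    int j = m - pi1[i-1];
    suffshift[j]=min(suffshift[j], i-pi1[i-1]);
}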
suffshift[m] stands for the empty suffix, suffshift[0] for the whole pattern. The complexity is O(n).
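As a self-contained sketch of this construction, assuming 0-indexed arrays throughout and a table with m + 1 entries for a pattern of length m: the names prefixFunction and goodSuffixTable are illustrative, and the table implements the weak form of the good-suffix rule (it does not check the character preceding the matched suffix).

```cpp
#include <algorithm>
#include <string>
#include <vector>

std::vector<int> prefixFunction(const std::string& s) {
    int n = static_cast<int>(s.size());
    std::vector<int> pi(n, 0);
    for (int i = 1; i < n; ++i) {
        int j = pi[i - 1];
        while (j > 0 && s[i] != s[j]) j = pi[j - 1];
        if (s[i] == s[j]) ++j;
        pi[i] = j;
    }
    return pi;
}

// suffshift[j] is the shift to apply once pattern[j..m-1] has matched;
// suffshift[m] is the empty suffix, suffshift[0] the whole pattern.
std::vector<int> goodSuffixTable(const std::string& pattern) {
    int m = static_cast<int>(pattern.size());
    if (m == 0) return std::vector<int>(1, 1);  // assumption: shift by one
    std::string reversed(pattern.rbegin(), pattern.rend());
    std::vector<int> pi = prefixFunction(pattern);
    std::vector<int> pi1 = prefixFunction(reversed);
    // Default: align the longest border of the whole pattern.
    std::vector<int> suffshift(m + 1, m - pi[m - 1]);
    for (int i = 0; i < m; ++i) {
        int len = i + 1;        // length of the pattern suffix considered
        int border = pi1[i];    // longest border of that suffix
        int j = m - border;     // a matched suffix of this length starts at j
        suffshift[j] = std::min(suffshift[j], len - border);
    }
    return suffshift;
}
```

For the pattern "abcab" this yields 3 for every entry except the last, which is 1: once "ab" or "b" has matched, the nearest earlier copy is three positions to the left, while an empty match only permits a shift of one.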
