Finding the longest word in a string using dynamic programming?

I have started working on some algorithm problems and came across one that asks for the longest word in a string (the string has no spaces, just characters). After thinking about it for some time, I wanted to confirm whether I can use dynamic programming here, similar to the maximum contiguous sum problem. After reading each character I can call an isWord method (already implemented): if the characters read so far form a word, I move on to the next character and increase the word length; if not, I reset the counter to zero and start looking for a word from that index. Please let me know if this is a good approach; otherwise please suggest a better one.
Thanks for your help guys.
-Vik

This algorithm will not work correctly. Consider the following string:
BENDOCRINE
If you start from the start of the string and scan forward while you still have a word, you will find the word "BEND," then reset the string after that point and pick up from the O. The correct answer here is instead to pick the word "ENDOCRINE," which is much longer.
If you have a static dictionary and want to find the longest word from that dictionary that is contained within a text string, you might want to look at the Aho-Corasick algorithm, which will find every single match of a set of strings inside a text string, and does so extremely efficiently. You could easily modify the algorithm so that it tracks the longest word it has outputted at any time so that it does not output shorter strings than the longest one found so far, in which case the runtime will be O(n + m), where n is the length of your text string to search and m is the total number of characters in all legal English words. Moreover, if you do O(m) preprocessing in advance, from that point forward you can find the longest word in a given string in time O(n), where n is the number of characters in the string.
(As for why it runs in time O(n + m): normally the runtime is O(n + m + z), where z is the number of matches. If you restrict the number of matches outputted so that you never output a shorter word than the longest so far, there can be at most n words outputted. Thus the runtime is O(n + m + n) = O(n + m)).
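The idea above can be sketched in Python as follows. This is a minimal illustration, not a tuned implementation: the function names `build_automaton` and `longest_word_in` and the sample word list are made up for this example, and each automaton node simply stores the length of the longest dictionary word that ends there, so shorter matches are never reported.

```python
from collections import deque

def build_automaton(words):
    """Build a trie with failure links (Aho-Corasick) over the given words.

    goto[v] maps a character to a child node, fail[v] is the failure link,
    and best[v] is the length of the longest word that is a suffix of the
    string spelled by node v (0 if there is none).
    """
    goto, fail, best = [{}], [0], [0]
    for word in words:
        node = 0
        for ch in word:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                best.append(0)
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        best[node] = max(best[node], len(word))

    # Breadth-first pass: depth-1 nodes keep fail = 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit the longest word reachable through the failure link.
            best[child] = max(best[child], best[fail[child]])
            queue.append(child)
    return goto, fail, best

def longest_word_in(text, words):
    """Return the longest dictionary word occurring in text ('' if none)."""
    goto, fail, best = build_automaton(words)
    node, longest = 0, ""
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        if best[node] > len(longest):
            longest = text[i - best[node] + 1 : i + 1]
    return longest

# Hypothetical dictionary for the example string from the answer.
print(longest_word_in("BENDOCRINE", {"BE", "BEND", "END", "ENDOCRINE", "RINE"}))
# prints ENDOCRINE
```

Building the automaton once corresponds to the O(m) preprocessing; each later call to the scanning loop is then linear in the length of the text.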
Hope this helps!

Dynamic programming will not work for your problem:
let seq1 and seq2 be 2 character sequences
isWord(Concatenation(seq1, seq2)) cannot be inferred from the values of isWord(seq1) and isWord(seq2)

Related

Splitting a string into words with dynamic programming

In this problem we have to split a string into meaningful words. We're given a dictionary to check whether a word exists or not.
I've seen some other approaches here at How to split a string into words. Ex: "stringintowords" -> "String Into Words"?.
I thought of a different approach and was wondering If it would work or not.
Example- itlookslikeasentence
Algorithm
Each letter of the string corresponds to a node in a DAG.
Initialize a bool array to False.
At each node we have a choice- If the addition of the present letter to the previous subarray still produces a valid word then add it, if it does not then we will begin a new word from that letter and set bool[previous_node]=True indicating that a word ended there. In the above example bool[1] would be set to true.
This is something similar to the maximum subarray sum problem.
Would this algorithm work?
No, it wouldn't. Your solution takes the longest possible word at every step, which doesn't always work.
Here is a counterexample:
Let's assume that the given string is aturtle. Your algorithm will take a. Then it will take t, as at is a valid word. atu is not a word, so it'll split the input: at + urtle. However, there is no way to split urtle into a sequence of valid English words. The right answer would be a + turtle.
One of the possible correct solutions uses dynamic programming. We can define a function f such that f(i) = true iff it's possible to split the first i characters of the input into a valid sequence of words. Initially, f(0) = true and the rest of the values are false. For every pair l < r, f(r) becomes true whenever f(l) is true and s[l + 1, r] (1-indexed, inclusive) is a valid word. The whole string can be split iff f(n) is true, where n is the length of the input.
P.S. Other types of greedy algorithms would not work here either. For instance, if you take the shortest word instead of the longest one, it fails to work on, for instance, the input atnight: there is no way to split tnight after the a is stripped off, but at + night is clearly a valid answer.
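The dynamic programming solution described above could look roughly like this in Python. It is a sketch under a few assumptions: the dictionary is a plain `set` of strings, the name `split_into_words` is invented for the example, and a `prev` array (not mentioned in the answer) is added only so that one valid split can be reconstructed.

```python
def split_into_words(s, words):
    """Return one split of s into dictionary words, or None if impossible.

    f[i] is True iff the first i characters of s can be split into words.
    prev[r] remembers an l with f[l] true and s[l:r] in words.
    """
    n = len(s)
    f = [False] * (n + 1)
    prev = [None] * (n + 1)
    f[0] = True  # the empty prefix is trivially splittable
    for r in range(1, n + 1):
        for l in range(r):
            if f[l] and s[l:r] in words:
                f[r] = True
                prev[r] = l
                break
    if not f[n]:
        return None
    # Walk the back-pointers from the end to recover the words.
    pieces = []
    r = n
    while r > 0:
        l = prev[r]
        pieces.append(s[l:r])
        r = l
    return pieces[::-1]

print(split_into_words("aturtle", {"a", "at", "turtle"}))  # ['a', 'turtle']
print(split_into_words("atnight", {"a", "at", "night"}))   # ['at', 'night']
```

Both greedy counterexamples from the answer are handled, because every split point l is considered rather than only the longest or shortest word at each step.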

Algorithm to un-concatenate words from string without spaces and punctuation

I've been given a problem in my data structures class and need to find a solution to it. It's similar to an interview question. Could someone explain the thinking process or the solution? Pseudocode can be used. So far I've been thinking of using a trie to hold the dictionary and look up words that way for efficiency.
This is the problem:
Oh, no! You have just completed a lengthy document when you have an unfortunate Find/Replace mishap. You have accidentally removed all spaces, punctuation, and capitalization in the document. A sentence like "I reset the computer. It still didn't boot!" would become "iresetthecomputeritstilldidntboot". You figure that you can add back in the punctuation and capitalization later, once you get the individual words properly separated. Most of the words will be in a dictionary, but some strings, like proper names, will not.
Given a dictionary (a list of words), design an algorithm to find the optimal way of "unconcatenating" a sequence of words. In this case, "optimal" is defined to be the parsing which minimizes the number of unrecognized sequences of characters.
For example, the string "jesslookedjustliketimherbrother" would be optimally parsed as "JESS looked just like TIM her brother". This parsing has seven unrecognized characters, which we have capitalized for clarity.
For each index, n, into the string, compute the cost C(n) of the optimal solution (ie: the number of unrecognised characters in the optimal parsing) starting at that index.
Then, the solution to your problem is C(0).
There's a recurrence relation for C. At each n, either you match a word of i characters, or you skip over character n, incurring a cost of 1, and then parse the rest optimally. You just need to find which of those choices incurs the lowest cost.
Let N be the length of the string, and let W(n) be a set containing the lengths of all words starting at index n in your string. Then:
C(N) = 0
C(n) = min({C(n+1) + 1} union {C(n+i) for i in W(n)})
This can be implemented using dynamic programming by constructing a table of C(n) starting from the end backwards.
If the length of the longest word in your dictionary is L, then the algorithm runs in O(NL) time in the worst case and can be implemented to use O(L) memory if you're careful.
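The recurrence for C(n) can be sketched as a bottom-up table in Python. The function name `min_unrecognized` and the small dictionary are assumptions made for illustration; the sketch keeps the full table (O(N) memory) rather than the O(L) rolling window mentioned above, and it looks words up by slicing into a `set` instead of walking a trie.

```python
def min_unrecognized(s, dictionary):
    """Return C(0): the minimum number of unrecognized characters in s.

    cost[n] is the cost of the optimal parsing of s[n:], filled from the
    end of the string backwards, with cost[N] = 0.
    """
    n_total = len(s)
    max_len = max((len(w) for w in dictionary), default=0)
    cost = [0] * (n_total + 1)
    for n in range(n_total - 1, -1, -1):
        # Option 1: skip character n and pay 1 for it.
        best = cost[n + 1] + 1
        # Option 2: match a dictionary word of length i starting at n.
        for i in range(1, min(max_len, n_total - n) + 1):
            if s[n:n + i] in dictionary:
                best = min(best, cost[n + i])
        cost[n] = best
    return cost[0]

# Hypothetical dictionary covering the example sentence.
words = {"looked", "just", "like", "her", "brother"}
print(min_unrecognized("jesslookedjustliketimherbrother", words))  # 7
```

The seven unrecognized characters are those of "jess" and "tim", matching the example in the problem statement.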
You could use rolling hashes of different lengths to speed up the search.
You can try a multi-pattern matcher, for example the Aho-Corasick algorithm. Basically it builds a trie of the dictionary words and adds failure links to it, so the text is scanned only once.

Find the most common substring in a given string? Overlapping is allowed

I already searched for posts on this question. But none of them have clear answers.
Find the occurrence of most common substring with length n in given string.
For example, "deded", we set the length of substring to be 3. "ded" will be the most common substring and its occurrence is 2.
A few posts suggest using a suffix tree, with O(n lg n) time complexity and O(n) space complexity.
First, I'm not familiar with suffix trees. My idea is to use a hashmap to store the number of occurrences of each substring of length 3. The time is O(n) while the space is also O(n). Is this better than a suffix tree? Should I take hashmap collisions into account?
Extra: if above problem is addressed, how can we solve the problem that length of substring doesn't matter. Just find the most common substring in given string.
If the length of the most common substring doesn't matter (but, say, you want it to be greater than 1), then the best solution is to look for the most common substring of length 2, since any longer substring occurs no more often than its own two-character prefix. You can do this with a suffix tree in linear time; if you look up suffix trees it will be clear how to do this.
If you want the length M of the most common substring to be an input parameter, then you can hash all substrings of length M in linear time using multiply-and-add hashing: remove the contribution of the character that leaves the window, multiply the previous hash value by a constant, add the value of the next character in the string, and reduce the result modulo a prime P. If you pick P to be a randomly chosen prime small enough that you can afford O(P) memory, then this will do the trick in linear time, assuming that your hashing has no collisions.
If your hashing might have a lot of collisions, the substring has length M, and the total string length is N, then the running time would be O(MN) because you have to check all collisions, which in the worst case could mean checking all substrings of length M, for example if your string consists of a single repeated character. Suffix trees are better in the worst case. Let me know if you want some details (but not all of them, because suffix trees are complicated) and I can explain at a high level how to get a faster solution with suffix trees.
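For the fixed-length case, the hashmap idea from the question can be sketched with Python's `collections.Counter`. The function name `most_common_substring` is made up for the example. Note that this sketch hashes each substring by slicing it out, which costs O(M) per position; it does not implement the rolling multiply-and-add hash described above.

```python
from collections import Counter

def most_common_substring(s, m):
    """Return (substring, count) for the most frequent substring of length m.

    Overlapping occurrences are counted. Returns None when no substring
    of length m exists. Ties are broken by first occurrence in s.
    """
    if m <= 0 or m > len(s):
        return None
    counts = Counter(s[i:i + m] for i in range(len(s) - m + 1))
    return counts.most_common(1)[0]

print(most_common_substring("deded", 3))  # ('ded', 2)
```

Python's dictionary handles collisions internally by comparing keys, so the result is always exact; the price is the extra factor of M for hashing and comparing the slices.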

How to find the period of a string

I take an input from the user, and it's a string with a certain substring which repeats itself all through the string. I need to output the substring or its length, AKA the period.
Say
S1 = AAAA // substring is A
S2 = ABAB // Substring is AB
S3 = ABCAB // Substring is ABC
S4 = EFIEFI // Substring is EFI
I could start with a single character and check whether it is the same as the next character; if it is not, I could try two characters, then three, and so on. This would be an O(N^2) algorithm. I was wondering if there is a more elegant solution to this.
You can do this in linear time and constant additional space by inductively computing the period of each prefix of the string. I can't recall the details (there are several things to get right), but you can find them in Section 13.6 of "Text algorithms" by Crochemore and Rytter under function Per(x).
Let me assume that the length n of the string is at least twice the period p.
Algorithm
1. Let m = 1, and let S be the whole string.
2. Set m = m*2.
3. Find the next occurrence of the substring S[:m].
4. Let k be the start of the next occurrence.
5. Check if S[:k] is the period.
6. If not, go to 2.
Example
Suppose we have a string
CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC
For each m we find repetitions of the first 2^m characters. Then we extend this sequence to its second occurrence. Let's start with 2^1, so CD.
CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC
CDCD CDCD CDCD CDCD CD
We don't extend CD since the next occurrence is just after that. However CD is not the substring we are looking for so let's take the next power: 2^2 = 4 and substring CDCD.
CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC
CDCD CDCD
Now let's extend our string to the first repetition. We get
CDCDFBF
we check if this is periodic. It is not so we go further. We try 2^3 = 8, so CDCDFBFC
CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC
CDCDFBFC CDCDFBFC
we try to extend and we get
CDCDFBFCDCDFDF
and this indeed is our period.
I expect this to work in O(n log n) with some KMP-like algorithm for checking where a given string appears. Note that some edge cases still should be worked out here.
Intuitively this should work, but my intuition failed once on this problem already so please correct me if I'm wrong. I will try to figure out a proof.
A very nice problem though.
You can build a suffix tree for the entire string in linear time (suffix trees are easy to look up online), and then recursively compute and store the number of suffix tree leaves (occurrences of the suffix prefix) N(v) below each internal node v of the suffix tree. Also recursively compute and store the length of each suffix prefix L(v) at each node of the tree. Then, at an internal node v in the tree, the suffix prefix encoded at v is a repeating subsequence that generates your string if N(v) equals the total length of the string divided by L(v).
We can actually optimise the time complexity by creating a Z array, which can be built in O(n) time and O(n) space. Say we have the string
S1 = abababab
For this the Z array would look like
z[]={8,0,6,0,4,0,2,0};
To calculate the period, iterate over the Z array and find the smallest i > 0 that satisfies i + z[i] = S1.length. That i is the period (here i = 2). If no such i exists, the period is the length of the whole string.
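A minimal Python sketch of the Z-array approach follows. The names `z_array` and `period` are chosen for this example; `z[0]` is set to the string length to match the listing above, and the function falls back to the full length when no index satisfies the condition.

```python
def z_array(s):
    """z[i] = length of the longest common prefix of s and s[i:]."""
    n = len(s)
    z = [0] * n
    if n:
        z[0] = n  # by convention, the whole string matches itself
    left = right = 0  # [left, right) is the rightmost match window found so far
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z

def period(s):
    """Smallest i > 0 with i + z[i] == len(s); len(s) if there is none."""
    n = len(s)
    z = z_array(s)
    for i in range(1, n):
        if i + z[i] == n:
            return i
    return n

print(z_array("abababab"))  # [8, 0, 6, 0, 4, 0, 2, 0]
print(period("ABCAB"))      # 3
```

With this definition the last repetition may be cut off, which is why ABCAB from the question has period 3.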
Well, if every character in the input string is part of the repeating substring, then all you have to do is store the first character and compare it with the rest of the string's characters one by one. When you find a match, the string up to the matched character is your repeating string.
I too have been looking for the time-space-optimal solution to this problem. The accepted answer by tmyklebu essentially seems to be it, but I would like to offer some explanation of what it's actually about and some further findings.
First, this question by me proposes a seemingly promising but incorrect solution, with notes on why it's incorrect: Is this algorithm correct for finding period of a string?
In general, the problem "find the period" is equivalent to "find the pattern within itself" (in some sense, "strstr(x+1,x)"), but with no constraints on matching past its end. This means that you can find the period by taking any left-to-right string matching algorithm and applying it to itself, considering a partial match that hits the end of the haystack/text as a match, and the time and space requirements are the same as those of whatever string matching algorithm you use.
The approach cited in tmyklebu's answer is essentially applying this principle to String Matching on Ordered Alphabets, also explained here. Another time-space-optimal solution should be possible using the GS algorithm.
The fairly well-known and simple Two Way algorithm (also explained here) unfortunately is not a solution because it's not left-to-right. In particular, the advancement after a mismatch in the left factor depends on the right factor having been a match, and the impossibility of another match misaligned with the right factor modulo the right factor's period. When searching for the pattern within itself and disregarding anything past the end, we can't conclude anything about how soon the next right-factor match could occur (part or all of the right factor may have shifted past the end of the pattern), and therefore a shift that preserves linear time cannot be made.
Of course, if working space is available, a number of other algorithms may be used. KMP is linear-time with O(n) space, and it may be possible to adapt it to something still reasonably efficient with only logarithmic space.
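When O(n) working space is acceptable, one common way to apply KMP here is through its failure (prefix) function: the smallest period equals the string length minus the length of the longest proper border. The sketch below assumes that formulation; the name `smallest_period` is invented for the example, and it is not the constant-space method discussed above.

```python
def smallest_period(s):
    """Smallest p such that s[i] == s[i + p] for all valid i.

    fail[i] is the length of the longest proper prefix of s[:i + 1]
    that is also a suffix of it (the KMP failure function).
    """
    n = len(s)
    if n == 0:
        return 0
    fail = [0] * n
    k = 0  # length of the border currently being extended
    for i in range(1, n):
        while k and s[i] != s[k]:
            k = fail[k - 1]
        if s[i] == s[k]:
            k += 1
        fail[i] = k
    # The longest border of the whole string determines the period.
    return n - fail[-1]

for example in ("AAAA", "ABAB", "ABCAB", "EFIEFI"):
    print(example, smallest_period(example))
# AAAA 1, ABAB 2, ABCAB 3, EFIEFI 3
```

This is the "pattern matched against itself" idea in its simplest form: a border that reaches the end of the string is exactly a partial match that hits the end of the text.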

String pattern matching with one or zero mismatch

Given a string and a pattern to be matched, how efficiently can the matches be found having zero or one mismatch.
e.g)
S = abbbaaabbbabab
P = abab
Matches are abbb(index 0),aaab(index 4),abbb(index 6),abab(index 10)
I tried to modify KMP algorithm but I'm not sure about the approach.
Please give me idea to proceed with the problem.
Thanks.
Ok I found it! I found the best algorithm!
This might sound a bit bold, but since the algorithm I am going to propose has both running time O(m + n) and memory consumption O(m + n), and the input data itself has the same size, the algorithm can only be improved by a constant factor.
Algorithms used
I am going to use a mix of the KMP and Rabin Karp algorithms for my solution. Rabin Karp uses rolling hashes for comparing substrings of the initial strings. It requires a linear-time precomputation that uses linear additional memory, but from then on the comparison between substrings of the two strings takes constant time O(1) (this is amortized if you handle collisions properly).
What my solution will not do
My solution will not find all the occurrences in the first string that match the second string with at most 1 difference. However, the algorithm can be modified so that for every starting index in the first string if there is such matching at least one of them will be found (this is left to the reader).
Observations
Let m be the length of the second string and n the length of the first string. I am going to split the task in two parts: if I am aiming to find a matching with at most one difference, I want to find two substrings of the first string: PREF is going to be the substring before the single difference and SUFF the substring after the difference. I want len(PREF) + len(SUFF) + 1 = m, where PREF or SUFF will be artificially shortened if required (when the strings match without difference).
I am going to base my solution on one very important observation: suppose there is a substring of the first string starting at index i with length m that matches the second string with at most one difference. Then if we take PREF as long as possible there will still be solution for SUFF. This is obvious: I am just pushing the difference as much to the end as possible.
The algorithm
And now follows the algorithm itself. Start off with the usual KMP. Every time the extension of the prefix fails and the fail links are to be followed, first check whether, if you skip the next letter, the remaining suffix matches the remainder of the second string. If so, the sought match with at most one character difference is found. If not, we go on with the ordinary KMP, making the Rabin Karp check every time a fail link is to be followed.
Let me clarify the Rabin Karp check further with an example. Suppose we are at a certain step of the KMP and we have found that first.substring[i, i + k - 1] matches the first k letters of the second string. Suppose also that the letter first[i + k] is different from second[k]. Then you check whether first.substring[i + k + 1, i + m - 1] matches exactly second.substring[k + 1, m - 1] using Rabin Karp. This is exactly the case in which you have extended the starting prefix from index i as much as possible and you now try whether there is a match with at most one difference.
Rabin Karp will be used only when a fail link is followed, which moves the starting index of the prefix by at least one. This means that at most O(n) Rabin Karp calls are made, each with complexity O(1), for a total of linear complexity.
This is known as the approximate string matching problem. In your particular case, you want a maximum edit distance of 1.
The bitap algorithm is a fairly fast way of solving it.
To find all submatches including one mismatch you need two Z-functions (one for the original P, and another for the reversed P).
After that, build the arrays of longest prefix submatches for the original and the reversed string S.
Then reverse the second array.
In the end everything is easy: run through the first array and check if the length of the longest prefix is equal to the length of P. If it is, then it is a match without mistakes.
If it is shorter, then check the second array at position (i + length(P) - 1). If the sum of the two values is equal to length(P) - 1, then it is a submatch with one mistake.
Complexity is O(len(P) + len(S))
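The two-Z-function approach above can be sketched in Python as follows. The names `z_function` and `matches_with_one_mismatch` are invented for the example, and the sketch assumes that the separator character (`"\x00"` by default) occurs in neither the text nor the pattern.

```python
def z_function(s):
    """z[i] = length of the longest common prefix of s and s[i:] (z[0] = 0)."""
    n = len(s)
    z = [0] * n
    left = right = 0
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z

def matches_with_one_mismatch(text, pattern, sep="\x00"):
    """Start indices where pattern matches text with zero or one mismatch.

    Assumes sep appears in neither text nor pattern.
    """
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # pref[i]: longest common prefix of pattern and text[i:].
    pref = z_function(pattern + sep + text)[m + 1:]
    # suf[j]: longest common suffix of pattern and text[:j + 1],
    # obtained from the reversed strings and then reversed back.
    suf = z_function(pattern[::-1] + sep + text[::-1])[m + 1:][::-1]
    result = []
    for i in range(n - m + 1):
        exact = pref[i] == m
        one_off = pref[i] + suf[i + m - 1] == m - 1
        if exact or one_off:
            result.append(i)
    return result

print(matches_with_one_mismatch("abbbaaabbbabab", "abab"))
```

On the strings from the question this reports every index listed there and, in addition, index 8 ("bbab"), which also differs from the pattern in exactly one position.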
A comprehensive overview of the various algorithms and how they compare to each other is given by Gonzalo Navarro in his A guided tour to approximate string matching. Pages 80, 81 and 82 show complexity results, including worst and average cases, and space requirements for the various algorithms.
(In the notation used there, n refers to the length of the text you search, m to the length of the pattern, σ to the size of the alphabet, and k to the maximum edit distance, which is 1 in your case.)

Resources