Using a trie for string segmentation - time complexity? - string

Problem to be solved:
Given a non-empty string s and a string array wordArr containing a list
of non-empty words, determine if s can be segmented into a
space-separated sequence of one or more dictionary words. You may
assume the dictionary does not contain duplicate words.
For example, given s = "leetcode", wordArr = ["leet", "code"].
Return true because "leetcode" can be segmented as "leet code".
In the above problem, would it work to build a trie that has each string in wordArr. Then, for each char in given string s, work down the trie. If a trie branch terminates, then this substring is complete so pass the remaining string up to the root and do the exact same thing recursively.
This should be O(N) time and O(N) space correct? I ask because the problem I'm working on says this will be O(N^2) time in the most optimal way and I'm not sure what's wrong with my approach.
For example, if s = "hello" and wordArr = ["he", "ll", "ee", "zz", "o"], then "he" will be completed in the first branch of the trie, "llo" will be passed up to the root recursively. Then, "ll" will be completed, so "o" gets passed up to root of trie. Then "o" is completed, which is the end of s, so return true. If the end of s isn't completed, return false.
Is this correct?

Your example would indeed suggest a linear time complexity, but look at this example:
s = "hello"
wordArr = ["hell", "he", "e", "ll", "lo", "l", "h"]
Now, first "hell" is tried, but in the next recursion cycle, no solution is found (there is no "o"), so the algorithm needs to backtrack and assume "hell" is not suitable (pun not intended), so you try "he", and in the next level you find "ll", but then again it fails, as there is no "o". Again backtracking is needed. Now start with "h", then "e" and then again a failure is coming: you try "ll" without success, so backtracking to use "l" instead: the solution is now available: "h e l lo".
So no, this does not have O(n) time complexity.
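The backtracking described above can be sketched in Java roughly as follows. This is only an illustration under my own assumptions: the class name `TrieBacktracking`, the `Node` type and the `calls` counter are hypothetical, and this version tries shorter words before longer ones, which is the opposite order from the walk-through but explores the same search space.

```java
import java.util.HashMap;
import java.util.Map;

class TrieBacktracking {
    // Minimal trie node: children keyed by character, plus an end-of-word flag.
    static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean isWord;
    }

    static Node build(String[] words) {
        Node root = new Node();
        for (String w : words) {
            Node cur = root;
            for (char c : w.toCharArray()) {
                cur = cur.next.computeIfAbsent(c, k -> new Node());
            }
            cur.isWord = true;
        }
        return root;
    }

    // Counts how many times the search restarts from the root of the trie.
    static int calls;

    // Walk the trie from position 'start'. At every word end, try to segment
    // the rest of the string from the root; if that fails, keep walking.
    static boolean canSegment(String s, int start, Node root) {
        calls++;
        if (start == s.length()) return true;
        Node cur = root;
        for (int i = start; i < s.length(); i++) {
            cur = cur.next.get(s.charAt(i));
            if (cur == null) return false; // dead end, the caller backtracks
            if (cur.isWord && canSegment(s, i + 1, root)) return true;
        }
        return false;
    }
}
```

Calling `canSegment(s, 0, build(wordArr))` answers the question, but the same start position can be reached many times. For an input such as a long run of `a` followed by `b`, with the words `a` and `aa`, the `calls` counter grows very quickly.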

I suspect off-hand that the issue is backtracking. What if the word is not segmentable based on a particular dictionary, or what if there are multiple possible substrings with a common prefix? E.g., suppose the dictionary contains he, llenic, and llo. Failure down one branch of the trie would require backtracking, with some corresponding increase in time complexity.
This is similar to a regex-match problem: the example you give is like testing an input word against
^(he|ll|ee|zz|o)+$
(any number of dictionary members, in any order, and nothing else). I don't know the time complexity of regex matchers offhand, but I know backtracking can get you into serious time trouble.
I did find this answer which says:
Running a DFA-compiled regular expression against a string is indeed O(n), but can require up to O(2^m) construction time/space (where m = regular expression size).
So maybe it is O(n^2) with reduced construction effort.
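To make the regex analogy concrete, here is a small Java sketch that builds the pattern from the dictionary. The class and method names are hypothetical. Note that `java.util.regex` is a backtracking matcher rather than a DFA, so this inherits the backtracking cost discussed above instead of the O(n) bound quoted for DFA-compiled expressions.

```java
import java.util.regex.Pattern;

class RegexSegmentation {
    // Build ^(?:w1|w2|...)+$ from the dictionary. Each word is quoted so that
    // regex metacharacters inside a word are matched literally.
    static Pattern toPattern(String[] words) {
        StringBuilder sb = new StringBuilder("^(?:");
        for (int i = 0; i < words.length; i++) {
            if (i > 0) sb.append('|');
            sb.append(Pattern.quote(words[i]));
        }
        return Pattern.compile(sb.append(")+$").toString());
    }

    static boolean canSegment(String s, String[] words) {
        return toPattern(words).matcher(s).matches();
    }
}
```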

Let's start by converting the trie to an NFA. We make the root an accept node and, from every word end in the trie, add an empty-character (epsilon) edge back to the root.
Time complexity: at each step we can follow at most two edges: the trie edge for the current character of the input string and, at a word end, the edge back to the root. That gives
T(n) = 2×T(n-1)+c
which is O(2^n).
So it is indeed not O(n), but you can do better using dynamic programming.
We will use a top-down approach.
Before solving for any string, check whether we have already solved it.
We can use a HashMap to store the results for strings that have already been solved.
Whenever a recursive call fails, store that string in the HashMap as well.
The idea is to process every suffix of the word only once. There are only n suffixes, so we end up with O(n^2).
Code from algorithms.tutorialhorizon.com:
Map<String, String> memoized = new HashMap<>();
Set<String> dict;

String SegmentString(String input) {
    if (dict.contains(input)) return input;
    // containsKey also returns true for strings stored with a null result
    if (memoized.containsKey(input)) {
        return memoized.get(input);
    }
    int len = input.length();
    for (int i = 1; i < len; i++) {
        String prefix = input.substring(0, i);
        if (dict.contains(prefix)) {
            String suffix = input.substring(i, len);
            String segSuffix = SegmentString(suffix);
            if (segSuffix != null) {
                memoized.put(input, prefix + " " + segSuffix);
                return prefix + " " + segSuffix;
            }
        }
    }
    // no segmentation exists: remember the failure so it is not recomputed
    memoized.put(input, null);
    return null;
}
And you can do better!
Map<String, String> memoized = new HashMap<>();
// Trie is not a standard Java class; GetAll(input) is assumed to return
// every dictionary word that is a prefix of input
Trie<String> dict;

String SegmentString(String input)
{
    if (dict.contains(input))
        return input;
    if (memoized.containsKey(input))
        return memoized.get(input);
    int len = input.length();
    for (String word : dict.GetAll(input))
    {
        String suffix = input.substring(word.length(), len);
        String segSuffix = SegmentString(suffix);
        if (segSuffix != null)
        {
            memoized.put(input, word + " " + segSuffix);
            return word + " " + segSuffix;
        }
    }
    // remember the failure so this suffix is not tried again
    memoized.put(input, null);
    return null;
}
By using the trie so that a recursive call is made only when the walk reaches a word end, you get O(z×n), where z is the length (depth) of the trie.
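As a self-contained illustration of the trie plus memoization idea, here is a minimal Java sketch that memoizes on the start index of each suffix instead of on the suffix string. The names `TrieMemoWordBreak`, `Node` and `canSegment` are my own and not from any library, and it returns only true/false rather than the segmented string.

```java
import java.util.HashMap;
import java.util.Map;

class TrieMemoWordBreak {
    static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean isWord;
    }

    static Node build(Iterable<String> words) {
        Node root = new Node();
        for (String w : words) {
            Node cur = root;
            for (char c : w.toCharArray()) {
                cur = cur.next.computeIfAbsent(c, k -> new Node());
            }
            cur.isWord = true;
        }
        return root;
    }

    // memo[i] is null while s.substring(i) is unsolved; otherwise it records
    // whether that suffix can be segmented.
    static boolean canSegment(String s, int start, Node root, Boolean[] memo) {
        if (start == s.length()) return true;
        if (memo[start] != null) return memo[start];
        boolean ok = false;
        Node cur = root;
        // Walk at most (trie depth) characters; recurse only at word ends.
        for (int i = start; i < s.length() && !ok; i++) {
            cur = cur.next.get(s.charAt(i));
            if (cur == null) break;
            if (cur.isWord) ok = canSegment(s, i + 1, root, memo);
        }
        memo[start] = ok;
        return ok;
    }

    static boolean canSegment(String s, Iterable<String> words) {
        return canSegment(s, 0, build(words), new Boolean[s.length()]);
    }
}
```

Each of the n start positions is solved once, and solving one position walks at most the depth of the trie, which is where the O(z×n) figure comes from.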

Related

Unique Substrings in wrap around strings

I have been given an infinite wrap around of the string str="abcdefghijklmnopqrstuvwxyz" so it looks like
"..zabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcd...." and another string p.
I need to find out how many unique non-empty substrings of p are present in the infinite wraparound string str?
For example: "zab"
There are 6 substrings "z", "a", "b", "za", "ab", "zab" of string "zab" in str.
I tried finding all suffixes of p in a particular concatenation of the string str say for example: "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz"
and as soon as i get a suffix which is a part of the above i add all its substrings to my result, as:
for (int i=0;i<length;i++) {
String suffix = p.substring(i,length);
if(isPresent(suffix)) {
sum += (suffix.length()*(suffix.length()+1))/2;
break;
} else {
sum++;
}
}
And my isPresent function is:
private boolean isPresent(String s) {
if(s.length()==1) {
return true;
}
String main = "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz";
return main.contains(s);
}
If the length of p is greater than the concatenated string I assumed in the isPresent function, my algorithm fails!
So how should I find the substrings irrespective of the length of the wraparound string str? Is there a better approach for this problem?
Some ideas/suggestions (not a full algo)
You don't need to consider an infinite repetition of the wraparound string, only len(p)/len(repeating-fragment) + 1 (integer division) repetitions. Let's denote this string by S **
If a substring sp of p is a substring of S, then every substring of sp is also a substring of S.
So the problem seems to reduce to:
find sp (substring of both p and S) with the maximal length. This is called longest common substring and admits a dynamic programming solution with the complexity of O(n*m) (lengths of the two strings). The cited has a pseudo-code algo.
repeat the above recursively with the 'remnants' of p after eliminating the longest common substring.
Now, you have a sequence of "longest common substrings". How many do you need to retain? I feel that the "longest common substring" may be used to trim down the need of brute-forcing every substring of any and all the above, but I'd need more time than I have available now.
I hope the sketch above helps.
** I might be wrong on the number of repetitions which need to be considered. If I am, then in any case there will be a maximal number of repetitions to be considered and there will be an S of minimal length that is sufficient for the purpose.
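For comparison, here is a different technique from the longest-common-substring sketch above, and not something the answer proposes: a commonly used linear-time approach that avoids materialising S at all. For each letter it records the longest run of consecutive (wrapping) letters in p that ends at that letter, and the answer is the sum of those lengths. The class name is hypothetical, and the code assumes p contains only lowercase 'a'..'z'.

```java
class WraparoundSubstrings {
    static int countUnique(String p) {
        // longestEndingAt[c] = length of the longest valid substring of p
        // that ends with letter c; every shorter one ending at c is included.
        int[] longestEndingAt = new int[26];
        int run = 0;
        for (int i = 0; i < p.length(); i++) {
            // The run continues if this letter follows the previous one,
            // with 'z' followed by 'a'.
            if (i > 0 && (p.charAt(i) - p.charAt(i - 1) + 26) % 26 == 1) {
                run++;
            } else {
                run = 1;
            }
            int c = p.charAt(i) - 'a';
            longestEndingAt[c] = Math.max(longestEndingAt[c], run);
        }
        int total = 0;
        for (int len : longestEndingAt) total += len;
        return total;
    }
}
```

For the example from the question, `countUnique("zab")` gives 6, and the result does not depend on how long p is.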

Efficient algorithm for phrase anagrams

What is an efficient way to produce phrase anagrams given a string?
The problem I am trying to solve
Assume you have a word list with n words. Given an input string, say, "peanutbutter", produce all phrase anagrams. Some contenders are: pea nut butter, A But Ten Erupt, etc.
My solution
I have a trie that contains all words in the given word list. Given an input string, I calculate all permutations of it. For each permutation, I have a recursive solution (something like this) to determine if that specific permuted string can be broken in to words. For example, if one of the permutations of peanutbutter was "abuttenerupt", I used this method to break it into "a but ten erupt". I use the trie to determine if a string is a valid word.
What sucks
My problem is that because I calculate all permutations, my solution runs very slow for phrases that are longer than 10 characters, which is a big let down. I want to know if there is a way to do this in a different way.
Websites like https://wordsmith.org/anagram/ can do the job in less than a second and I am curious to know how they do it.
Your problem can be decomposed to 2 sub-problems:
Find combination of words that use up all characters of the input string
Find all permutations of the words found in the first sub-problem
Subproblem #2 is a basic algorithm and you can find existing standard implementation in most programming language. Let's focus on subproblem #1
First convert the input string to a "character pool". We can implement the character pool as an array oc, where oc[c] = number of occurrence of character c.
Then we use backtracking algorithm to find words that fit in the charpool as in this pseudo-code:
result = empty;
function findAnagram(pool) {
    if (pool empty) then print result;
    for (word in dictionary) {
        if (word fit in pool) {
            result = result + word;
            update pool to exclude characters in word;
            findAnagram(pool);
            // as with any backtracking algorithm, we have to restore global states
            restore pool;
            restore result;
        }
    }
}
Note: If we pass the charpool by value then we don't have to restore it. But as it is quite big, I prefer passing it by reference.
Now we remove redundant results and apply some optimizations:
Assume A comes before B in the dictionary. If we choose B as the first word, we don't have to consider A in the following steps, because any result containing A is already covered by the case where A is chosen as the first word.
If the character set is small enough (< 64 characters is best), we can use a bitmask to quickly filter out words that cannot fit in the pool. A bitmask records which characters occur in a word, regardless of how many times each occurs.
Update the pseudo-code to reflect those optimizations:
function findAnagram(charpool, minDictionaryIndex) {
    pool_bitmask <- bitmask(charpool);
    if (charpool empty) then print result;
    for (word in dictionary AND word's index >= minDictionaryIndex) {
        // the bitmask of every word in the dictionary should be pre-calculated
        word_bitmask <- bitmask(word)
        if (word_bitmask contains bit(s) that are not in pool_bitmask)
            then skip this iteration
        if (word fit in charpool) {
            result = result + word;
            update charpool to exclude characters in word;
            findAnagram(charpool, word's index);
            // as with any backtracking algorithm, we have to restore global states
            restore charpool;
            restore result;
        }
    }
}
My C++ implementation of subproblem #1 where the character set contains only lowercase 'a'..'z': http://ideone.com/vf7Rpl .
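For readers who prefer Java, the optimized pseudo-code could be sketched as below. This is not a port of the linked C++ code. The class and method names are hypothetical, and it assumes every dictionary word is non-empty and uses only lowercase 'a'..'z'. It returns the word combinations (subproblem #1) without permuting them.

```java
import java.util.ArrayList;
import java.util.List;

class PhraseAnagrams {
    // oc[c] = number of occurrences of letter c
    static int[] counts(String s) {
        int[] oc = new int[26];
        for (char c : s.toCharArray()) oc[c - 'a']++;
        return oc;
    }

    // Bit i is set when letter ('a' + i) occurs at least once.
    static int bitmask(int[] oc) {
        int mask = 0;
        for (int i = 0; i < 26; i++) if (oc[i] > 0) mask |= 1 << i;
        return mask;
    }

    static boolean fits(int[] word, int[] pool) {
        for (int i = 0; i < 26; i++) if (word[i] > pool[i]) return false;
        return true;
    }

    static List<List<String>> find(String input, List<String> dictionary) {
        int n = dictionary.size();
        int[][] wordCounts = new int[n][];
        int[] wordMasks = new int[n];
        for (int i = 0; i < n; i++) { // pre-calculate per-word data once
            wordCounts[i] = counts(dictionary.get(i));
            wordMasks[i] = bitmask(wordCounts[i]);
        }
        List<List<String>> out = new ArrayList<>();
        search(counts(input), input.length(), 0, dictionary,
               wordCounts, wordMasks, new ArrayList<>(), out);
        return out;
    }

    private static void search(int[] pool, int remaining, int minIndex,
                               List<String> dict, int[][] wc, int[] wm,
                               List<String> current, List<List<String>> out) {
        if (remaining == 0) { out.add(new ArrayList<>(current)); return; }
        int poolMask = bitmask(pool);
        for (int w = minIndex; w < dict.size(); w++) {
            if ((wm[w] & ~poolMask) != 0) continue; // uses a letter not in the pool
            if (!fits(wc[w], pool)) continue;       // exact count check
            for (int i = 0; i < 26; i++) pool[i] -= wc[w][i];
            current.add(dict.get(w));
            search(pool, remaining - dict.get(w).length(), w, dict, wc, wm, current, out);
            current.remove(current.size() - 1);     // restore state
            for (int i = 0; i < 26; i++) pool[i] += wc[w][i];
        }
    }
}
```

Passing the current word's index as `minIndex` is what keeps each combination from being reported in more than one order, provided the dictionary itself has no duplicate words.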
Instead of a two stage solution where you generate permutations and then try and break them into words, you could speed it up by checking for valid words as you recursively generate the permutations. If at any point your current partially-complete permutation does not correspond to any valid words, stop there and do not recurse any further. This means you don't waste time generating useless permutations. For example, if you generate "tt", there is no need to permute "peanubuter" and append all the permutations to "tt" because there are no English words beginning with tt.
Suppose you are doing basic recursive permutation generation, keep track of the current partial word you have generated. If at any point it is a valid word, you can output a space and start a new word, and recursively permute the remaining character. You can also try adding each of the remaining characters to the current partial word, and only recurse if doing so results in a valid partial word (i.e. a word exists starting with those characters).
Something like this (pseudo-code):
void generateAnagrams(String partialAnagram, String currentWord, String remainingChars)
{
// at each point, you can either output a space, or each of the remaining chars:
// if the current word is a complete valid word, you can output a space
if(isValidWord(currentWord))
{
// if there are no more remaining chars, output the anagram:
if(remainingChars.length == 0)
{
outputAnagram(partialAnagram);
}
else
{
// output a space and start a new word
generateAnagrams(partialAnagram + " ", "", remainingChars);
}
}
// for each of the chars in remainingChars, check if it can be
// added to currentWord, to produce a valid partial word (i.e.
// there is at least 1 word starting with these characters)
for(i = 0 to remainingChars.length - 1)
{
char c = remainingChars[i];
if(isValidPartialWord(currentWord + c))
{
generateAnagrams(partialAnagram + c, currentWord + c,
remainingChars.remove(i));
}
}
}
You could call it like this
generateAnagrams("", "", "peanutbutter");
You could optimize this algorithm further by passing the node in the trie corresponding to the current partially completed word, as well as passing currentWord as a string. This would make your isValidPartialWord check even faster.
You can enforce uniqueness by changing your isValidWord check to only return true if the word is in ascending (greater or equal) alphabetic order compared to the previous word output. You might also need another check for dupes at the end, to catch cases where two of the same word can be output.

Word Break time complexity

I came across the word break problem which goes something like this:
Given an input string and a dictionary of words,segment the input
string into a space-separated sequence of dictionary words if
possible.
For example, if the input string is "applepie" and dictionary contains a standard set of English words,then we would return the string "apple pie" as output
I came up with a quadratic time solution myself, and I have come across various other quadratic time solutions using DP.
However, a user on Quora posted a linear time solution to this problem.
I can't figure out how it comes out to be linear. Is there some mistake in the time complexity calculation? What is the best possible worst-case time complexity for this problem? I am posting the most common DP solution here:
String SegmentString(String input, Set<String> dict) {
int len = input.length();
for (int i = 1; i < len; i++) {
String prefix = input.substring(0, i);
if (dict.contains(prefix)) {
String suffix = input.substring(i, len);
if (dict.contains(suffix)) {
return prefix + " " + suffix;
}
}
}
return null;
}
The 'linear' time algorithm that you linked here works as follows:
If the string is sharperneedle and dictionary is sharp, sharper, needle,
It pushes sharp in the string.
Then it sees that er is not in dictionary, but if we combine it with the last word added, then sharper exists. Hence it pops out the last element and pushes this in.
IMO the above logic fails for string eaterror and dictionary eat, eater, error.
Here er shall pop out eat from the list, and push in eater. The remaining string ror shall not be recognized and discarded.
As regards the code you posted, as mentioned in the comments, this works for only two words with one partition place.
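For reference, a bottom-up DP that handles any number of words, including the eaterror case, might look like the sketch below. The class name and the `prev` array are my own. It performs O(n^2) dictionary lookups, and each lookup also pays for building and hashing a substring.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.Set;

class WordBreakDp {
    // prev[end] = start index of the last word in some valid segmentation of
    // input.substring(0, end), or -1 if that prefix cannot be segmented.
    static String segment(String input, Set<String> dict) {
        int n = input.length();
        int[] prev = new int[n + 1];
        Arrays.fill(prev, -1);
        prev[0] = 0; // the empty prefix is trivially segmentable
        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                if (prev[start] != -1 && dict.contains(input.substring(start, end))) {
                    prev[end] = start;
                    break;
                }
            }
        }
        if (prev[n] == -1) return null;
        // Walk the back-pointers from the end to recover the words in order.
        Deque<String> words = new ArrayDeque<>();
        for (int end = n; end > 0; end = prev[end]) {
            words.addFirst(input.substring(prev[end], end));
        }
        return String.join(" ", words);
    }
}
```

With the dictionary eat, eater, error, `segment("eaterror", dict)` returns "eat error", because the table keeps both "eat" and "eater" as reachable prefixes instead of committing to one of them.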

Number of distinct rotated strings

We have a string S and we want to calculate the number of distinct strings that can be formed by rotating the string.
For example :-
S = "aaaa" , here it would be 1 string {"aaaa"}
S = "abab" , here it would be 2 strings {"abab" , "baba"}
So ,is there an algorithm to solve this in O(|S|) complexity where |S| is the length of string.
Suffix trees, baby!
If string is S. Construct the Suffix Tree for SS (S concatenated to S).
Find number of unique substrings of length |S|. The uniqueness you get automatically. For length |S| you might have to change the suffix tree algo a little (to maintain depth info), but is doable.
(Note that the other answer by johnsoe is actually quadratic, or worse, depending on the implementation of Set).
You can solve this with rolling hash functions used in the Rabin-Karp algorithm.
You can use the rolling hash to update the hash table for all substrings of size |S| (obtained by sliding a |S| window across SS) in constant time (so, O(|S|) in total).
Assuming your string comes from an alphabet of constant size, you can inspect the hash table in constant time to obtain the required metric.
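A minimal sketch of the rolling-hash idea in Java is shown below. The base and modulus are arbitrary choices of mine, and the class name is hypothetical. Because two different rotations could in principle hash to the same value, the count is only correct with high probability. A careful version would compare the actual strings on a collision or use two independent hashes.

```java
import java.util.HashSet;
import java.util.Set;

class RotationCount {
    static final long MOD = 1_000_000_007L; // arbitrary large prime (assumption)
    static final long BASE = 131L;          // arbitrary base (assumption)

    static int distinctRotations(String s) {
        int n = s.length();
        if (n == 0) return 0;
        String ss = s + s;

        // high = BASE^(n-1) mod MOD, the weight of the window's first character
        long high = 1;
        for (int i = 1; i < n; i++) high = high * BASE % MOD;

        // Hash of the first window ss[0..n)
        long hash = 0;
        for (int i = 0; i < n; i++) hash = (hash * BASE + ss.charAt(i)) % MOD;

        Set<Long> seen = new HashSet<>();
        seen.add(hash);
        // Slide the window: drop ss[i-1], append ss[i+n-1], in O(1) per step.
        for (int i = 1; i < n; i++) {
            long out = ss.charAt(i - 1) * high % MOD;
            hash = ((hash - out + MOD) % MOD * BASE + ss.charAt(i + n - 1)) % MOD;
            seen.add(hash);
        }
        return seen.size();
    }
}
```

For the examples in the question, `distinctRotations("aaaa")` is 1 and `distinctRotations("abab")` is 2.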
Something like this should do the trick.
public static int uniqueRotations(String phrase){
Set<String> rotations = new HashSet<String>();
rotations.add(phrase);
for(int i = 0; i < phrase.length() - 1; i++){
phrase = phrase.charAt(phrase.length() - 1) + phrase.substring(0, phrase.length() - 1);
rotations.add(phrase);
}
return rotations.size();
}

Find the smallest period of input string in O(n)?

Given the following problem :
Definition :
Let S be a string over alphabet Σ .S' is the smallest period of S
if S' is the smallest string such that :
S = (S')^k (S'') ,
where S'' is a prefix of S. If no such S' exists , then S is
not periodic .
Example : S = abcabcabcabca. Then abcabc is a period since S =
abcabc abcabc a, but the smallest period is abc since S = abc abc
abc abc a.
Give an algorithm to find the smallest period of input string S or
declare that S is not periodic.
Hint : You can do that in O(n) ...
My solution : We use KMP , which runs in O(n) .
By the definition of the problem , S = (S')^k (S'') , then I think that if we create
an automata for the shortest period , and find a way to find that shortest period , then I'm done.
The problem is where to put the FAIL arrow of the automata ...
Any ideas would be greatly appreciated ,
Regards
Alright so this problem can definitely be solved in O(n), we just have to cleverly use KMP as you suggested.
Solving the longest proper prefix which is also a suffix problem is a vital part of KMP that we will make use of.
The longest proper prefix which is also a suffix problem is a mouthful so let's just call it the prefix suffix problem for now.
The prefix suffix problem can be pretty hard to understand so I'll include some examples.
The prefix suffix solution for "abcabc" is
"abc" since that is the longest string which is both a proper prefix
and a proper suffix (proper prefixes and suffixes cannot be the entire
string).
The prefix suffix solution for "abcabca" is "abca"
Hmmmmmmmmm wait a minute, if we just chop off "abca" from the end of "abcabca" we are left with "abc", which is exactly the smallest period. Hmmmmmmmmm. Very interesting. (This is pretty much the solution, but I will talk about why this works.)
Alright let's try to formalize this intuition a bit more and see if we can arrive at a solution.
I will use one key assumption in my argument:
The smallest period of our pattern is a valid period of every larger period in our pattern
Let us store the prefix suffix solution for the first i+1 characters of our pattern (the prefix ending at index i) in lps[i]. This lps array can be calculated in O(n) and it is used in the KMP algorithm. You can read more about how to calculate it in O(n) here: https://www.geeksforgeeks.org/kmp-algorithm-for-pattern-searching/
Just so we are clear I will list some examples of some lps arrays
Pattern:"aaaaa"
lps: [0, 1, 2, 3, 4]
Pattern:"aabbcc"
lps: [0, 1, 0, 0, 0, 0]
Pattern:"abcabcabc"
lps: [0, 0, 0, 1, 2, 3, 4, 5, 6]
Alright, now let's define some variables to help us find out why this lps array is useful.
Let l be the length of our pattern, and let k be the last value in our lps array (k=lps[l-1]).
The value k tells us that the first k characters of our string are the same as the last k characters of our string. And we can use this fact to find a period!
Using this information we can now show that the prefix consisting of the first l-k characters of our string forms a valid period. This is clear because shifting the string by l-k positions lines it up with itself: the first k characters, which form our prefix, must be the same as the last k characters, which form our suffix, because of how we defined our lps array.
In practice you do not need a loop: the smallest period is simply the prefix of length l-k. It is the smallest because any shorter period p would make the first l-p characters equal to the last l-p characters, giving a prefix suffix solution longer than k. If k is 0, the only candidate is the whole string, which you can treat as not periodic.
public static void main(String[] args){
    String pattern="abcabcabcabca";
    int[] lps= calculateLPS(pattern);
    // k = length of the longest proper prefix which is also a suffix
    int k=lps[lps.length-1];
    // the smallest period is the prefix of length l-k
    int periodLength=pattern.length()-k;
    System.out.println(pattern.substring(0,periodLength));
}
And since calculating lps happens in O(n), and the rest is constant-time arithmetic plus one substring, the time complexity for the whole procedure is simply O(n).
I borrowed heavily from the GeeksforGeeks implementation of KMP in my calculateLPS() method. If you would like to see my exact code it is below, but I recommend that you also look at their explanation: https://www.geeksforgeeks.org/kmp-algorithm-for-pattern-searching/
static int[] calculateLPS(String pat) {
int[] lps = new int[pat.length()];
int len = 0;
int i = 1;
lps[0] = 0;
while (i < pat.length()) {
if (pat.charAt(i) == pat.charAt(len)) {
len++;
lps[i] = len;
i++;
}
else {
if (len != 0) {
len = lps[len - 1];
}
else {
lps[i] = len;
i++;
}
}
}
System.out.println(Arrays.toString(lps));
return lps;
}
Last but not least, thanks for posting such an interesting problem it was pretty fun to figure out! Also I am new to this so please let me know if any part of my explanation doesn't make sense.
I'm not sure that I understand your attempted solution. KMP is a useful subroutine, though -- the smallest period is how far KMP moves the needle string (i.e., S) after a complete match.
This problem can be solved using the Z function; this tutorial can help you.
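A rough sketch of the Z-function approach in Java follows. The class and method names are hypothetical, and returning the full length n to mean "not periodic" is my own convention. Here z[i] is the length of the longest common prefix of s and the suffix of s starting at i, and p is a period exactly when that suffix matches all the way to the end, i.e. p + z[p] == n.

```java
class ZPeriod {
    // Standard O(n) Z-function; [l, r) is the rightmost match window found so far.
    static int[] zFunction(String s) {
        int n = s.length();
        int[] z = new int[n];
        int l = 0, r = 0;
        for (int i = 1; i < n; i++) {
            if (i < r) z[i] = Math.min(r - i, z[i - l]);
            while (i + z[i] < n && s.charAt(z[i]) == s.charAt(i + z[i])) z[i]++;
            if (i + z[i] > r) { l = i; r = i + z[i]; }
        }
        return z;
    }

    // Length of the smallest period, or s.length() if no shorter period exists.
    static int smallestPeriod(String s) {
        int n = s.length();
        int[] z = zFunction(s);
        for (int p = 1; p < n; p++) {
            if (p + z[p] == n) return p; // suffix at p matches the prefix to the end
        }
        return n;
    }
}
```

For the string from the question, `smallestPeriod("abcabcabcabca")` is 3, so the smallest period is "abc".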
This problem can easily be solved by KMP
Concatenate the string to itself and run KMP on it.
Let n be the length of the original string.
Search for the first value >= n in the KMP array. That value must be at a position k >= n (0-based).
Then k - n + 1 is the length of the shortest period of the string.
Example:
Original string = abaaba
n = 6
New string = abaabaabaaba
KMP values for this new string: 0 0 1 1 2 3 4 5 6 7 8 9
The first value >= n is 6 which is at position 8. 8 - 6 + 1 = 3 is the length of the shortest period of the string (aba).
See if this solution works for O(n). I used rotation of strings.
public static int stringPeriod(String s){
String s1= s;
String s2= s1;
for (int i=1; i <s1.length();i++){
s2=rotate(s2);
if(s1.equals(s2)){
return i;
}
}
return -1;
}
public static String rotate(String s1){
String rotS= s1;
rotS = s1.substring(1)+s1.substring(0,1);
return rotS;
}
The complete program is available in this github repository

Resources