string processing - string

Given the string 'hello', how do I list all the 'words' (substrings) and the count of each one?
A normal suffix tree algorithm returns only the suffixes, which means a middle substring like 'll' will not appear. Can anyone help me solve this step by step?

Initialize a hash table.
Use a double loop (for within for). One loop index will represent the beginning of the substring, the other the end. Make sure the end index is strictly bigger than the beginning index, and that both are in the string boundaries.
For each substring encountered, check if it is in the hash. If it isn't - add it as a key, with the value 1. If it is - increment the current saved value.
Edit: as per your comment:
for (b = 0; b < len; ++b) {
    for (e = b + 1; e <= len; ++e) {
        // process the substring from index b (inclusive) to index e (exclusive)
    }
}
This will traverse the string "abcd" in this order:
a ab abc abcd b bc bcd c cd d
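Putting those steps together, here is a minimal runnable Java sketch of the approach (the class name, the use of HashMap.merge, and the final print-out are my own illustrative choices):

import java.util.HashMap;
import java.util.Map;

public class SubstringCounter {
    public static void main(String[] args) {
        String s = "hello";
        Map<String, Integer> counts = new HashMap<>();
        int len = s.length();
        // b is the inclusive start index, e the exclusive end index
        for (int b = 0; b < len; b++) {
            for (int e = b + 1; e <= len; e++) {
                counts.merge(s.substring(b, e), 1, Integer::sum);
            }
        }
        // e.g. "l" -> 2, "ll" -> 1, "he" -> 1, ...
        counts.forEach((sub, count) -> System.out.println(sub + " -> " + count));
    }
}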

Use a prefix tree instead of a suffix tree. Then run through this tree and at any node output the string encountered so far plus the number of sub trees that are still available.
EDIT:
Actually it's too early and I messed up some nomenclature:
A Prefix tree is a tree that stores common prefixes only once. A Suffix tree stores all suffixes in a prefix tree. So I meant a suffix tree here (which also happens to be a special kind of prefix tree).
So you do the following:
Build up your suffix tree
Do a search on the prefix tree, starting at root
function search( node ) {
    c = node.symbol;
    if not node.children.empty then
        other_results = []; sub_trees = 0;
        for each child in node.children do
            sub_search = search( child )
            other_results.append( sub_search.results );
            sub_trees += sub_search.num_trees
        done
        for each result in other_results do
            prepend c to result
        done
        return { results = c :: other_results; num_trees = sub_trees }
    else
        return { results = [c]; num_trees = 1 }
    fi
}
If I did not make any mistakes, this should do the trick. The suffix part of the suffix tree takes care of covering all suffixes, and the prefix (trie) part takes care of merging common prefixes. Because both are stored in reduced form, you also get the strings in between (which may already have been stored together). Note that this does not include any compression of the trie, which is usually not needed unless your strings get very long.
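To make this concrete, here is a small Java sketch of the uncompressed variant: insert every suffix into a plain trie, let every node count how many suffixes pass through it, and then walk the trie printing each substring with its count (class and field names are mine, and no compression is applied):

import java.util.HashMap;
import java.util.Map;

public class SuffixTrieCounts {
    static class Node {
        Map<Character, Node> children = new HashMap<>();
        int passes = 0;   // number of suffixes sharing this prefix = occurrences of the substring
    }

    public static void main(String[] args) {
        String s = "hello";
        Node root = new Node();
        // insert every suffix of s into the trie
        for (int i = 0; i < s.length(); i++) {
            Node cur = root;
            for (int j = i; j < s.length(); j++) {
                cur = cur.children.computeIfAbsent(s.charAt(j), c -> new Node());
                cur.passes++;
            }
        }
        print(root, "");
    }

    // depth-first walk: every node corresponds to one distinct substring
    static void print(Node node, String prefix) {
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            String sub = prefix + e.getKey();
            System.out.println(sub + " -> " + e.getValue().passes);
            print(e.getValue(), sub);
        }
    }
}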

Related

Using a trie for string segmentation - time complexity?

Problem to be solved:
Given a non-empty string s and a string array wordArr containing a list
of non-empty words, determine if s can be segmented into a
space-separated sequence of one or more dictionary words. You may
assume the dictionary does not contain duplicate words.
For example, given s = "leetcode", wordArr = ["leet", "code"].
Return true because "leetcode" can be segmented as "leet code".
In the above problem, would it work to build a trie that contains each string in wordArr? Then, for each char in the given string s, work down the trie. If a trie branch terminates, then this substring is complete, so pass the remaining string back to the root and do the exact same thing recursively.
This should be O(N) time and O(N) space, correct? I ask because the problem I'm working on says the optimal solution is O(N^2) time, and I'm not sure what's wrong with my approach.
For example, if s = "hello" and wordArr = ["he", "ll", "ee", "zz", "o"], then "he" will be completed in the first branch of the trie, "llo" will be passed up to the root recursively. Then, "ll" will be completed, so "o" gets passed up to root of trie. Then "o" is completed, which is the end of s, so return true. If the end of s isn't completed, return false.
Is this correct?
Your example would indeed suggest a linear time complexity, but look at this example:
s = "hello"
wordArr = ["hell", "he", "e", "ll", "lo", "l", "h"]
Now, first "hell" is tried, but in the next recursion cycle, no solution is found (there is no "o"), so the algorithm needs to backtrack and assume "hell" is not suitable (pun not intended), so you try "he", and in the next level you find "ll", but then again it fails, as there is no "o". Again backtracking is needed. Now start with "h", then "e" and then again a failure is coming: you try "ll" without success, so backtracking to use "l" instead: the solution is now available: "h e l lo".
So, no this does not have O(n) time complexity.
I suspect off-hand that the issue is backtracking. What if the word is not segmentable based on a particular dictionary, or what if there are multiple possible substrings with a common prefix? E.g., suppose the dictionary contains he, llenic, and llo. Failure down one branch of the trie would require backtracking, with some corresponding increase in time complexity.
This is similar to a regex-match problem: the example you give is like testing an input word against
^(he|ll|ee|zz|o)+$
(any number of dictionary members, in any order, and nothing else). I don't know the time complexity of regex matchers offhand, but I know backtracking can get you into serious time trouble.
I did find this answer which says:
Running a DFA-compiled regular expression against a string is indeed O(n), but can require up to O(2^m) construction time/space (where m = regular expression size).
So maybe it is O(n^2) with reduced construction effort.
Let's start by converting the trie to an NFA. We make the root an accepting node and, for every node of the trie that marks the end of a dictionary word, we add an epsilon edge back to the root.
Time complexity: at each step in the trie we can either follow the edge for the current character of the input string or jump back to the root, so
T(n) = 2·T(n−1) + c
That gives us O(2^n)
Indeed not O(n), but you can do better using dynamic programming.
We will use a top-down approach.
Before we solve the problem for any string, check whether we have already solved it.
We can use another HashMap to store the results of already-solved strings.
Whenever a recursive call returns false, store that string in the HashMap as well.
The idea is to compute each suffix of the word only once. There are only n suffixes, so it ends up O(n^2).
Code from algorithms.tutorialhorizon.com:
Map<String, String> memoized;
Set<String> dict;

String SegmentString(String input) {
    if (dict.contains(input)) return input;
    if (memoized.containsKey(input)) {
        return memoized.get(input);          // may be null if this suffix is already known to fail
    }
    int len = input.length();
    for (int i = 1; i < len; i++) {
        String prefix = input.substring(0, i);
        if (dict.contains(prefix)) {
            String suffix = input.substring(i, len);
            String segSuffix = SegmentString(suffix);
            if (segSuffix != null) {
                memoized.put(input, prefix + " " + segSuffix);
                return prefix + " " + segSuffix;
            }
        }
    }
    memoized.put(input, null);               // remember failures so each suffix is solved only once
    return null;
}
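For instance, with the example from the problem statement above, a hypothetical driver (assuming the usual java.util imports and that the two fields are initialized first) would be:

dict = new HashSet<>(Arrays.asList("leet", "code"));
memoized = new HashMap<>();
System.out.println(SegmentString("leetcode"));   // prints "leet code"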
And you can do better!
Map<String, String> memoized;
Trie<String> dict;

String SegmentString(String input) {
    if (dict.contains(input))
        return input;
    if (memoized.containsKey(input))
        return memoized.get(input);
    int len = input.length();
    // dict.GetAll(input) walks the trie along input and yields every
    // dictionary word that is a prefix of input
    for (String word : dict.GetAll(input)) {
        String prefix = input.substring(0, word.length());
        String suffix = input.substring(word.length(), len);
        String segSuffix = SegmentString(suffix);
        if (segSuffix != null) {
            memoized.put(input, prefix + " " + segSuffix);
            return prefix + " " + segSuffix;
        }
    }
    memoized.put(input, null);
    return null;
}
By using the trie so that a recursive call is made only when the walk down the trie reaches a word end, you get O(z·n), where z is the depth of the trie (the length of the longest dictionary word).

Finding the longest double suffix in linear time

Given a string s, find the longest double suffix in time complexity O(|s|).
Example: for string banana, the LDS is na. For abaabaa it's baa.
Obviously I thought about using a suffix tree, but I'm having trouble finding the double suffix in it.
Reverse the string and build a sparse array P[i][j], where i goes from 0 to log(n) and j from 0 to n-1, with n the length of the string. P[i][j] is the rank of the block starting at position j with length 2^i. So if P[i][j] = P[i][k], the first 2^i chars of the suffixes at indexes j and k are equal.
Now your problem reduces to computing, for position 0 (the start of the reversed string) and another suffix at index i, the Longest Common Prefix, and checking whether LCP >= i.
The LCP can be computed from the P array in log(n) time, by comparing the first 2^x chars of the two suffixes and gradually reducing x.
The total complexity is O(n·log(n)·log(n)).
Here is the working C++ source code: https://ideone.com/aJCAYG
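For reference, here is a compact Java sketch of the same rank-doubling idea (the method names and the key packing are my own choices, and it assumes the string is shorter than about two million characters so the ranks fit in the packed keys):

import java.util.Arrays;

public class LongestDoubleSuffix {
    public static void main(String[] args) {
        System.out.println(find("banana"));    // na
        System.out.println(find("abaabaa"));   // baa
    }

    static String find(String s) {
        String r = new StringBuilder(s).reverse().toString();
        int n = r.length();
        int logs = 1;
        while ((1 << logs) < n) logs++;
        // P[k][j] = rank of the block of r starting at j with length 2^k
        int[][] P = new int[logs + 1][n];
        for (int j = 0; j < n; j++) P[0][j] = r.charAt(j);
        for (int k = 1; k <= logs; k++) {
            int half = 1 << (k - 1);
            long[] key = new long[n];
            for (int j = 0; j < n; j++) {
                long second = (j + half < n) ? P[k - 1][j + half] + 1 : 0;
                key[j] = ((long) P[k - 1][j] << 21) | second;   // pack (first-half rank, second-half rank)
            }
            Integer[] idx = new Integer[n];
            for (int j = 0; j < n; j++) idx[j] = j;
            Arrays.sort(idx, (a, b) -> Long.compare(key[a], key[b]));
            int rank = 0;
            P[k][idx[0]] = 0;
            for (int t = 1; t < n; t++) {
                if (key[idx[t]] != key[idx[t - 1]]) rank++;
                P[k][idx[t]] = rank;
            }
        }
        // the answer is the largest i with LCP(suffix 0, suffix i) >= i in the reversed string
        int best = 0;
        for (int i = 1; 2 * i <= n; i++)
            if (lcp(P, n, 0, i) >= i) best = i;
        return s.substring(s.length() - best);
    }

    // longest common prefix of the suffixes of r starting at a and b, via the rank table
    static int lcp(int[][] P, int n, int a, int b) {
        int res = 0;
        for (int k = P.length - 1; k >= 0; k--) {
            int len = 1 << k;
            if (a + len <= n && b + len <= n && P[k][a] == P[k][b]) {
                res += len;
                a += len;
                b += len;
            }
        }
        return res;
    }
}

The sort-based construction is what gives the n·log(n)·log(n) bound mentioned above.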
I think Gene's solution is the simpler one to implement, and since it relies on arrays rather than an arborescent structure, it is likely more hardware-friendly as well.
But since you mentioned suffix trees, let's look into a solution based on suffix trees! I will assume that you use an end token to mark the end of the string(s) you insert in the tree. To illustrate this, here is a representation of the suffix tree built for your abaabaa example:
$ - ##
b a a - $ - ##           // Longest double suffix: P is the first dash, N the second
        b a a $ - ##     // N' is the dash
a - $ - ##
    a - $ - ##
        b a a $ - ##
    b a a - $ - ##
            b a a $ - ##
When N is a node in a suffix tree, we will denote |N| the length of the substring represented by N.
How can you characterize a "double suffix" in a suffix tree? Well it is a terminal node N with a parent that has a specific property: let P be the parent node of a double suffix, then:
P has a transition to the suffix node N that only contains the end token ($ above) of the string.
Let suffix be the substring represented by the node P with an appended end token (baa$ in your example). If we walk down the tree from P using suffix, we end up in another suffix node N' (actually walking down the tree won't be needed).
The substring represented by the node P is the double suffix (baa in our case).
We have the equalities |N'| = 2.|P| + 1 and |N| = |P| + 1
Given that, you only have to iterate over suffix nodes and test this condition. You can be greedy if you iterate suffixes in decreasing order of length: the first match is necessarily the longest double suffix.
Note that we can stop our search after having inspected the suffix of length |S|/2 and only iterate over suffixes of odd length (do not forget we add an end token to the string)
Complexity analysis
Building the suffix tree is O(|S|).
Let N' be a suffix node and N be the suffix node for the suffix of length (|N'|-1)/2 + 1. Assuming proper construction of the tree:
The suffixes can be stored in an array/vector in increasing order because the creation of the tree adds them in increasing order of length (at least with the Ukkonen's algorithm).
Thus accessing the suffix of length k is O(1)
Accessing the substring represented by a node of the tree is O(1), in particular, this applies to P the parent node of N and N'
Finding out if the transition from P to N only contains the end token ($) is O(1)
Checking if |N'| = 2.|P| + 1 is indeed O(1)
Since we are iterating over the suffixes in decreasing order of length, we necessarily focus on the N' suffixes (the doubled suffix, i.e. baabaa$ in your example), so we just have to:
Get N the suffix node such that |N'| = 2.|N| - 1: O(1)
Get P the parent of the suffix node N: O(1)
Check that the transition from P to N contains only the end token $: O(1)
Proof: (We ignore the end token in the following proof)
The 3 steps above, if leading to a true evaluation, prove the existence of a suffix of length 2.|P| that starts with the substring represented by P, which is also a suffix. Since this substring is a suffix, the suffix of length 2.|P| necessarily ends with it and therefore is made of two occurrences of that substring QED.
Since we will do this step for at most (|S|/2 + 1)/2 suffixes, the identification step is therefore O(|S|) in the worst case.
The overall complexity is thus O(|S|).

Efficient algorithm for phrase anagrams

What is an efficient way to produce phrase anagrams given a string?
The problem I am trying to solve
Assume you have a word list with n words. Given an input string, say, "peanutbutter", produce all phrase anagrams. Some contenders are: pea nut butter, A But Ten Erupt, etc.
My solution
I have a trie that contains all the words in the given word list. Given an input string, I calculate all permutations of it. For each permutation, I have a recursive solution (something like this) to determine whether that specific permuted string can be broken into words. For example, if one of the permutations of peanutbutter was "abuttenerupt", I used this method to break it into "a but ten erupt". I use the trie to determine if a string is a valid word.
What sucks
My problem is that because I calculate all permutations, my solution runs very slowly for phrases longer than 10 characters, which is a big letdown. I want to know if there is a way to do this differently.
Websites like https://wordsmith.org/anagram/ can do the job in less than a second and I am curious to know how they do it.
Your problem can be decomposed to 2 sub-problems:
Find combination of words that use up all characters of the input string
Find all permutations of the words found in the first sub-problem
Subproblem #2 is a basic algorithm and you can find existing standard implementations in most programming languages. Let's focus on subproblem #1.
First convert the input string to a "character pool". We can implement the character pool as an array oc, where oc[c] = the number of occurrences of character c.
Then we use a backtracking algorithm to find the words that fit in the pool, as in this pseudo-code:
result = empty;
function findAnagram(pool) {
    if (pool is empty) then print result;
    for (word in dictionary) {
        if (word fits in pool) {
            result = result + word;
            update pool to exclude the characters in word;
            findAnagram(pool);
            // as with any backtracking algorithm, we have to restore global state
            restore pool;
            restore result;
        }
    }
}
Note: If we pass the charpool by value then we don't have to restore it. But as it is quite big, I prefer passing it by reference.
Now we remove redundant results and apply some optimizations:
Assuming A comes before B in the dictionary: if we choose B as the first word, then we don't have to consider word A in the following steps, because any result containing A is already covered by the case where A was chosen as the first word.
If the character set is small enough (< 64 characters is best), we can use a bitmask to quickly filter out words that cannot fit in the pool. The bitmask marks which characters appear in a word, no matter how many times each occurs.
Update the pseudo-code to reflect those optimizations:
function findAnagram(charpool, minDictionaryIndex) {
    pool_bitmask <- bitmask(charpool);
    if (charpool is empty) then print result;
    for (word in dictionary AND word's index >= minDictionaryIndex) {
        // the bitmask of every word in the dictionary should be pre-calculated
        word_bitmask <- bitmask(word)
        if (word_bitmask contains bit(s) that are not in pool_bitmask)
            then skip this iteration
        if (word fits in charpool) {
            result = result + word;
            update charpool to exclude the characters in word;
            findAnagram(charpool, word's index);
            // as with any backtracking algorithm, we have to restore global state
            restore charpool;
            restore result;
        }
    }
}
My C++ implementation of subproblem #1 where the character set contains only lowercase 'a'..'z': http://ideone.com/vf7Rpl .
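To illustrate the two checks ("cannot possibly fit" via the bitmask, then the exact multiplicity test), a small Java sketch for the lowercase 'a'..'z' case could look like this (the helper names are mine):

// pool[c] = remaining occurrences of the character ('a' + c)
static int bitmask(int[] pool) {
    int mask = 0;
    for (int c = 0; c < 26; c++)
        if (pool[c] > 0) mask |= 1 << c;
    return mask;
}

static boolean fits(String word, int[] pool, int poolMask) {
    int wordMask = 0;
    for (int i = 0; i < word.length(); i++)
        wordMask |= 1 << (word.charAt(i) - 'a');
    if ((wordMask & ~poolMask) != 0)
        return false;                 // cheap reject: word uses a letter the pool does not contain at all
    int[] need = new int[26];
    for (int i = 0; i < word.length(); i++) {
        int c = word.charAt(i) - 'a';
        if (++need[c] > pool[c])
            return false;             // exact multiplicity check against the pool
    }
    return true;
}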
Instead of a two-stage solution where you generate permutations and then try to break them into words, you could speed it up by checking for valid words as you recursively generate the permutations. If at any point your current partially complete permutation does not correspond to any valid word, stop there and do not recurse any further. This means you don't waste time generating useless permutations. For example, if you generate "tt", there is no need to permute "peanubuter" and append all the permutations to "tt", because there are no English words beginning with tt.
Suppose you are doing basic recursive permutation generation; keep track of the current partial word you have generated. If at any point it is a valid word, you can output a space and start a new word, and recursively permute the remaining characters. You can also try adding each of the remaining characters to the current partial word, and only recurse if doing so results in a valid partial word (i.e. a word exists starting with those characters).
Something like this (pseudo-code):
void generateAnagrams(String partialAnagram, String currentWord, String remainingChars)
{
    // at each point, you can either output a space, or each of the remaining chars:
    // if the current word is a complete valid word, you can output a space
    if (isValidWord(currentWord))
    {
        // if there are no more remaining chars, output the anagram:
        if (remainingChars.length == 0)
        {
            outputAnagram(partialAnagram);
        }
        else
        {
            // output a space and start a new word
            generateAnagrams(partialAnagram + " ", "", remainingChars);
        }
    }
    // for each of the chars in remainingChars, check if it can be
    // added to currentWord, to produce a valid partial word (i.e.
    // there is at least 1 word starting with these characters)
    for (i = 0 to remainingChars.length - 1)
    {
        char c = remainingChars[i];
        if (isValidPartialWord(currentWord + c))
        {
            generateAnagrams(partialAnagram + c, currentWord + c,
                             remainingChars.remove(i));
        }
    }
}
You could call it like this
generateAnagrams("", "", "peanutbutter");
You could optimize this algorithm further by passing the node in the trie corresponding to the current partially completed word, as well as passing currentWord as a string. This would make your isValidPartialWord check even faster.
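A tiny sketch of that optimization (hypothetical node layout, lowercase letters only): instead of re-validating currentWord + c from scratch, carry along the trie node reached so far and test a single child.

static class TrieNode {
    TrieNode[] children = new TrieNode[26];
    boolean isWord;       // true if the path from the root to this node spells a dictionary word
}

// isValidPartialWord(currentWord + c) collapses to one array lookup:
static TrieNode step(TrieNode node, char c) {
    return node == null ? null : node.children[c - 'a'];   // null means no word starts with this prefix
}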
You can enforce uniqueness by changing your isValidWord check to only return true if the word is in ascending (greater or equal) alphabetic order compared to the previous word output. You might also need another check for dupes at the end, to catch cases where two of the same word can be output.

Fast way to find strings in set of strings containing substring

Task
I have a set S of n = 10,000,000 strings s and need to find the set Sp containing the strings s of S that contain the substring p.
Simple solution
As I'm using C# this is quite a simple task using LINQ:
string[] S = new string[] { "Hello", "world" };
string p = "ll";
IEnumerable<string> S_p = S.Where(s => s.Contains(p));
Problem
If S contains many strings (like the mentioned 10,000,000 strings) this gets horribly slow.
Idea
Build some kind of index to retrieve Sp faster.
Question
What is the best way to index S for this task and do you have any implementation in C#?
Here is one way to do it:
1. Create a string T = S[0] + sep_0 + S[1] + sep_1 + ... + S[n-1] + sep_{n-1} (where sep_i is a unique character that never appears in S[j] for any j; it can actually be an integer if the set of characters is not big enough).
2. Build a suffix tree for T (it can be done in linear time).
3. For each query string Q, traverse the suffix tree (it takes O(length(Q)) time). Then all possible answers will be located in the leaves of some subtree, so you can just traverse all these leaves. If Q is rather long, then the number of leaves in this subtree is likely to be much smaller than n.
4. If Q is really short, then the number of leaves in a subtree can be pretty large. That's why you can use another strategy for short query strings: precompute all short substrings of S[0] ... S[n - 1] and for each of them store a set of indices where it has occurred. Then you can just print these indices for a given Q. It is difficult to say what 'short' exactly means here, but it can be found out experimentally.
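A minimal Java sketch of that last idea (the cutoff K and all names are illustrative, not part of the original answer): precompute every substring of length at most K together with the indices of the strings containing it, then answer short queries with a single lookup.

import java.util.*;

public class ShortSubstringIndex {
    static final int K = 3;                                 // what counts as "short"; tune experimentally
    final Map<String, Set<Integer>> index = new HashMap<>();

    ShortSubstringIndex(String[] S) {
        for (int i = 0; i < S.length; i++) {
            String s = S[i];
            for (int b = 0; b < s.length(); b++)
                for (int e = b + 1; e <= Math.min(s.length(), b + K); e++)
                    index.computeIfAbsent(s.substring(b, e), x -> new HashSet<>()).add(i);
        }
    }

    Set<Integer> query(String p) {                          // only valid for p.length() <= K
        return index.getOrDefault(p, Collections.emptySet());
    }

    public static void main(String[] args) {
        String[] S = { "Hello", "world" };
        System.out.println(new ShortSubstringIndex(S).query("ll"));   // [0]
    }
}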

How do I find the largest sequence in a string that is repeated at least once?

Trying to solve the following problem:
Given a string of arbitrary length, find the longest substring that occurs more than one time within the string, with no overlaps.
For example, if the input string was ABCABCAB, the correct output would be ABC. You couldn't say ABCAB, because that only occurs twice where the two substrings overlap, which is not allowed.
Is there any way to solve this reasonably quickly for strings containing a few thousand characters?
(And before anyone asks, this is not homework. I'm looking at ways to optimize the rendering of Lindenmayer fractals, because they tend to take excessive amounts of time to draw at high iteration levels with a naive turtle graphics system.)
Here's an example for a string of length 11, which you can generalize.
Set the chunk length to floor(11/2) = 5.
Scan the string in chunks of 5 characters, left to right, looking for repeats. There will be 3 comparisons:
Left Offset    Right Offset
0              5
0              6
1              5
If you found a duplicate you're done. Otherwise reduce the chunk length to 4 and repeat until chunk length goes to zero.
Here's some (obviously untested) pseudocode:
String s
int n = s.length
for int i = floor(n/2); i > 0; i--          // candidate length, longest first
    for j = 0; j <= n - 2*i; j++            // start of the first occurrence
        for k = j + i; k <= n - i; k++      // start of the second, non-overlapping occurrence
            if s.substr(j, j+i) == s.substr(k, k+i)    // substr(start, end), end exclusive
                return s.substr(j, j+i)
return null
There may be an off-by-one error in there, but the approach should be sound (and minimal).
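A direct Java rendering of the same idea (brute force, but acceptable for strings of a few thousand characters as in the question; the method name is mine):

static String longestRepeatedNonOverlapping(String s) {
    int n = s.length();
    for (int i = n / 2; i > 0; i--) {                // candidate length, longest first
        for (int j = 0; j + 2 * i <= n; j++) {       // start of the first occurrence
            String cand = s.substring(j, j + i);
            for (int k = j + i; k + i <= n; k++) {   // start of a second, non-overlapping occurrence
                if (s.regionMatches(k, cand, 0, i))
                    return cand;                     // lengths are tried longest first, so the first hit wins
            }
        }
    }
    return null;
}

// longestRepeatedNonOverlapping("ABCABCAB") returns "ABC"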
It looks like a suffix tree problem. Create the suffix tree, then find the biggest compressed branch with more than one child (i.e. one that occurs more than once in the original string). The number of letters along that compressed branch is the length of the longest repeated substring.
I found something similar here: http://www.coderanch.com/t/370396/java/java/Algorithm-wanted-longest-repeating-substring
Looks like it can be done in O(n).
First we need to define the start position of our substring and its length. Iterate over all possible start positions, then find the length with a binary search on it (if you can find a repeated substring of length a, you can also find one of any shorter length, so the predicate is monotonic and binary search is fine). Finding an equal substring is then O(N) using KMP or Rabin-Karp; any linear algorithm is fine. Total: N·N·log(N). Is that too much complexity?
The code is something like:
for (int i = 0; i < input.length(); ++i)
{
    int l = i;
    int r = input.length();
    while (l <= r)
    {
        int middle = l + ((r - l) >> 1);
        // Check whether the substring [i, middle] can be found elsewhere in the
        // initial string. This should be done in O(n): search the parts
        // [0, i-1] and [middle+1, length()-1] of the initial string.
        if (found)
            l = middle + 1;
        else
            r = middle - 1;
    }
}
Make sense?
This type of analysis is often done on genome sequences. Have a look at this paper; it has an efficient implementation (C++) for finding repeats: http://www.complex-systems.com/pdf/17-4-4.pdf
It might be what you are looking for.

Resources