data structure for shift strings - string

We're interested in a data structure for binary strings. Let S = s_1 s_2 ... s_m be a binary string of size m. Shift(S,i) is a cyclic shift of string S i spaces to the left. That is, Shift(S,i) = s_i s_{i+1} s_{i+2} ... s_m s_1 ... s_{i-1}. Suggest an efficient data structure that supports:
Init() of an empty DS in O(1)
Insert(s) inserts a binary string into the DS in O(|s|^2)
Search_cyclic(s) checks whether Shift(s,i) is in the DS for ANY i, in O(|s|).
Space Complexity: O(|S1|+|S2|+.....+|Sm|) where Si is one of the m strings we've inserted thus far.
If I had to find Search_cyclic(s,i) for some given i, this would be quite simple using a suffix tree and just traversing it in O(|s|). But here in Search_cyclic(s) we don't have a given i, so I don't know what to do within the given complexity. OTOH, Insert(s) generally takes O(|s|) to insert into a suffix tree and here we are given O(|s|^2).

So here is a solution I can propose to you. The complexities are even lower than the ones they asked of you, but it may seem a bit complicated.
The data structure in which you keep all the strings will be a Trie or even a Patricia tree. In this tree, for each string, you want to insert the minimum cyclic shift (i.e. the lexicographically smallest of all its possible cyclic shifts). You can calculate the minimum cyclic shift of a string in linear time and I will give one possible solution to that a bit later. For the moment let's assume you can do it. Here is how the required operations will be implemented:
Init() - init of both trie and patricia tree are constant - no problem here
Insert(s) - you compute the minimum cyclic shift s' of s in O(|s|) and then you insert it in either of the data structures in O(|s'|) = O(|s|). This is even better than the required complexity
Search_cyclic(s) - again you compute the minimum cyclic shift of s in O(|s|) and then you check in the Patricia or Trie if the string is present, which again is done in O(|s|)
Also the memory complexity is as required and may be even lower if you construct a Patricia.
So all that is left is to explain how to find the minimum cyclic shift. Since you mention suffix trees, I hope you know how to construct one in linear time. The trick is this: you append your string s to itself (i.e. double it) and then you construct a suffix tree for the doubled string. This is still linear with respect to |s|, so no problem there. After that, all you have to do is find the minimum substring of length |s| in this tree. This is not hard at all, I believe: start from the root and always follow the link from the current node that has the minimal string written on it until you accumulate length at least |s|, then keep only the first |s| characters. Because of the doubling of the string, you will always be able to follow minimal string links until you accumulate length at least |s|.
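As a rough illustration of the whole scheme, here is a minimal Python sketch. Note that it swaps in a different linear-time method for the minimum cyclic shift (a standard two-pointer comparison of candidate rotations) instead of the suffix tree described above, and it uses a plain dict-based trie. The names `least_rotation` and `CyclicStore` and the `"$"` end marker are assumptions made for this example only.

```python
def least_rotation(s: str) -> str:
    """Lexicographically smallest cyclic shift of s, in O(|s|) time.

    Two-pointer method: i and j are candidate start positions and k is
    the length of the prefix on which the two candidates agree so far.
    """
    n = len(s)
    if n == 0:
        return s
    i, j, k = 0, 1, 0
    while i < n and j < n and k < n:
        a = s[(i + k) % n]
        b = s[(j + k) % n]
        if a == b:
            k += 1
            continue
        # The candidate with the larger character loses, together with
        # the next k start positions after it.
        if a > b:
            i += k + 1
        else:
            j += k + 1
        if i == j:
            j += 1
        k = 0
    start = min(i, j)
    return s[start:] + s[:start]


class CyclicStore:
    """Trie keyed by the minimum cyclic shift of every inserted string."""

    END = "$"  # end-of-string marker; assumes inputs contain only '0' and '1'

    def __init__(self):
        self.root = {}

    def insert(self, s: str) -> None:
        node = self.root
        for ch in least_rotation(s):
            node = node.setdefault(ch, {})
        node[self.END] = True

    def search_cyclic(self, s: str) -> bool:
        node = self.root
        for ch in least_rotation(s):
            node = node.get(ch)
            if node is None:
                return False
        return self.END in node


store = CyclicStore()
store.insert("1011")
print(store.search_cyclic("1110"))  # a rotation of 1011
print(store.search_cyclic("1111"))  # not a rotation of 1011
```

Both operations canonicalize in O(|s|) and then walk the trie once, so each costs O(|s|); the trie stores one path per inserted string, which keeps the space within the sum of the string lengths.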
Hope this answer helps.

Related

What is the advantage of Suffix tree over suffix array?

I have been studying tries, suffix arrays and suffix trees. I know these data structures can be used for fast lookup and for many more applications.
Now my question is,
If a suffix array is more space efficient and easier to implement, then what are the scenarios where a suffix tree should be preferred over a suffix array?
Can you please list the advantages of each over the other?
Thanks in advance.
Here is the abstract from Suffix arrays:A new method for on-line string searches written by Udi Manber and Gene Myers.
link to the article.
It provides a list of advantages of the suffix array in comparison to the suffix tree structure in general ca
A new and conceptually simple data structure, called a suffix array,
for on-line string searches is introduced in this paper. Constructing
and querying suffix arrays is reduced to a sort and search paradigm
that employs novel algorithms. The main advantage of suffix arrays
over suffix trees is that, in practice, they use three to five times
less space. From a complexity standpoint, suffix arrays permit on-line
string searches of the type, ‘‘Is W a substring of A?’’ to be answered
in time O(P + log N), where P is the length of W and N is the length
of A, which is competitive with (and in some cases slightly better
than) suffix trees. The only drawback is that in those instances where
the underlying alphabet is finite and small, suffix trees can be
constructed in O(N) time in the worst case, versus O(N log N) time for
suffix arrays. However, we give an augmented algorithm that,
regardless of the alphabet size, constructs suffix arrays in O(N)
expected time, albeit with lesser space efficiency. We believe that
suffix arrays will prove to be better in practice than suffix trees
for many applications
To make it brief, let's say that the suffix array has a significantly lower space complexity and better space locality than the suffix tree; the trade-off being that the suffix tree runs faster in terms of time complexity (O(n) versus O(n log n)). Both give the suffixes of a string online (you can receive the string one char at a time, you don't need the whole string to run the algorithm).
Another advantage of the suffix array is its adaptability, for a substring search for instance; the structure allows for easier use of the data. It is also easier to implement.
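To make the "sort and search" idea concrete, here is a small Python sketch of a suffix array with a substring query. It is only an illustration under simplifying assumptions: the construction sorts the suffixes naively (far slower than the algorithms discussed in the paper), and the query is a plain binary search costing O(P log N) rather than the O(P + log N) obtained with the extra bookkeeping from the paper. The function names are made up for this example.

```python
def build_suffix_array(text: str) -> list:
    """Start positions of all suffixes of text, in lexicographic order.

    Naive construction: sorts the suffixes directly, so it is only
    suitable for short strings.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def contains(text: str, sa: list, pattern: str) -> bool:
    """Binary search for the first suffix whose prefix is >= pattern."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    # pattern occurs iff that suffix actually starts with it
    return lo < len(sa) and text.startswith(pattern, sa[lo])


text = "banana"
sa = build_suffix_array(text)
print(sa)                         # [5, 3, 1, 0, 4, 2]
print(contains(text, sa, "nan"))  # True
print(contains(text, sa, "nab"))  # False
```

The array is just a list of integers, which is where the space advantage over a pointer-based suffix tree comes from.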

Find most common substring in given string? Overlapping is allowed

I already searched for posts on this question. But none of them have clear answers.
Find the occurrence of most common substring with length n in given string.
For example, "deded", we set the length of substring to be 3. "ded" will be the most common substring and its occurrence is 2.
A few posts suggest using a suffix tree, with time complexity O(n lg n) and space complexity O(n).
First, I'm not familiar with suffix trees. My idea is to use a hashmap to store the number of occurrences of each substring of length 3. The time is O(n) while the space is also O(n). Is this better than a suffix tree? Should I take hashmap collisions into account?
Extra: if above problem is addressed, how can we solve the problem that length of substring doesn't matter. Just find the most common substring in given string.
If the length of the most common substring doesn't matter (but say, you want it to be greater than 1) then the best solution is to look for the most common substring of length 2. You can do this with a suffix tree in linear time; if you look up suffix trees then it will be clear how to do this.
If you want the length M of the most common substring to be an input parameter, then you can hash all substrings of length M in linear time using hashing with multiply-and-add, where you multiply the previous string hash value by a constant and then add the value for the next least significant value in the string, and take the result modulo a prime P. If you pick your modulus P for the computed string integers to be a randomly chosen prime P such that you can store O(P) memory, then this will do the trick, in linear time if you assume that your hashing has no collisions.
If you assume that your hashing might have a lot of collisions, and the substring is of length M and the total string length is N, then the running time would be O(MN) because you have to check all collisions, which in the worst case could mean checking all substrings of length M, for example if your string is a string of all one character. Suffix trees are better in the worst case; let me know if you want some details (but not completely, because suffix trees are complicated) and I can explain at a high level how to get a faster solution with suffix trees.
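The multiply-and-add hashing described above could look roughly like the following Python sketch. The function name, the base 257 and the fixed modulus 2^61 - 1 are assumptions chosen for illustration (the answer suggests a randomly chosen prime instead). Every hash hit is verified by comparing the actual characters, so the result is always correct, but that verification is what brings the cost to O(MN) in the worst case mentioned above.

```python
from collections import defaultdict


def most_common_substring(text: str, m: int,
                          base: int = 257, mod: int = (1 << 61) - 1):
    """Return (substring, count) for the most frequent length-m substring.

    Overlapping occurrences are counted. Ties go to the substring that
    reaches the top count first.
    """
    n = len(text)
    if m <= 0 or m > n:
        return None, 0

    high = pow(base, m - 1, mod)  # weight of the character leaving the window
    h = 0
    for ch in text[:m]:
        h = (h * base + ord(ch)) % mod

    # hash value -> list of [start index of a representative, count]
    buckets = defaultdict(list)
    best_start, best_count = 0, 0
    for start in range(n - m + 1):
        if start > 0:
            # Slide the window: drop text[start-1], append text[start+m-1].
            h = ((h - ord(text[start - 1]) * high) * base
                 + ord(text[start + m - 1])) % mod
        for entry in buckets[h]:
            # Verify the characters to rule out a hash collision.
            if text[entry[0]:entry[0] + m] == text[start:start + m]:
                entry[1] += 1
                break
        else:
            entry = [start, 1]
            buckets[h].append(entry)
        if entry[1] > best_count:
            best_start, best_count = entry[0], entry[1]
    return text[best_start:best_start + m], best_count


print(most_common_substring("deded", 3))  # ('ded', 2)
```

Dropping the character comparison would give the linear-time variant, at the price of possibly wrong counts when two different substrings share a hash value.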

How to find the period of a string

I take an input from the user, and it's a string with a certain substring which repeats itself all through the string. I need to output the substring or its length, AKA the period.
Say
S1 = AAAA // substring is A
S2 = ABAB // Substring is AB
S3 = ABCAB // Substring is ABC
S4 = EFIEFI // Substring is EFI
I could start with a single char and check if it is the same as its next character; if it is not, I could do it with two characters, then with three, and so on. This would be an O(N^2) algorithm. I was wondering if there is a more elegant solution to this.
You can do this in linear time and constant additional space by inductively computing the period of each prefix of the string. I can't recall the details (there are several things to get right), but you can find them in Section 13.6 of "Text algorithms" by Crochemore and Rytter under function Per(x).
Let me assume that the length n of the string is at least twice the period p.
Algorithm
1. Let m = 1, and S the whole string
2. Take m = m*2
3. Find the next occurrence of the substring S[:m]
4. Let k be the start of the next occurrence
5. Check if S[:k] is the period
6. If not, go to 2.
Example
Suppose we have a string
CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC
For each power of 2, 2^m, we find repetitions of the first 2^m characters. Then we extend this sequence to its second occurrence. Let's start with 2^1, so CD.
CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC
CDCD CDCD CDCD CDCD CD
We don't extend CD since the next occurrence is just after that. However CD is not the substring we are looking for so let's take the next power: 2^2 = 4 and substring CDCD.
CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC
CDCD CDCD
Now let's extend our string to the first repetition. We get
CDCDFBF
we check if this is periodic. It is not so we go further. We try 2^3 = 8, so CDCDFBFC
CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC
CDCDFBFC CDCDFBFC
we try to extend and we get
CDCDFBFCDCDFDF
and this indeed is our period.
I expect this to work in O(n log n) with some KMP-like algorithm for checking where a given string appears. Note that some edge cases still should be worked out here.
Intuitively this should work, but my intuition failed once on this problem already so please correct me if I'm wrong. I will try to figure out a proof.
A very nice problem though.
You can build a suffix tree for the entire string in linear time (suffix tree is easy to look up online), and then recursively compute and store the number of suffix tree leaves (occurrences of the suffix prefix) N(v) below each internal node v of the suffix tree. Also recursively compute and store the length of each suffix prefix L(v) at each node of the tree. Then, at an internal node v in the tree, the suffix prefix encoded at v is a repeating substring that generates your string if N(v) equals the total length of the string divided by L(v).
We can actually optimise the time complexity by creating a Z array. We can create the Z array in O(n) time and O(n) space. Now, let's say we have the string
S1 = abababab
For this the Z array would look like
z[]={8,0,6,0,4,0,2,0};
In order to calculate the period we can iterate over the Z array and
look for the smallest i > 0 that satisfies the condition i+z[i]=S1.length. That i is the period (here i = 2); if no such i exists, the period is the whole length.
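A minimal Python sketch of this Z-array approach follows. The function names are chosen for the example, and z[0] is set to the full length to match the array shown above (some implementations leave it at 0).

```python
def z_array(s: str) -> list:
    """z[i] = length of the longest common prefix of s and s[i:]."""
    n = len(s)
    z = [0] * n
    if n == 0:
        return z
    z[0] = n
    left = right = 0  # [left, right) is the rightmost matched window
    for i in range(1, n):
        if i < right:
            # Reuse what is already known from inside the window.
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def period(s: str) -> int:
    """Smallest i > 0 with i + z[i] == len(s), or len(s) if none exists."""
    z = z_array(s)
    n = len(s)
    for i in range(1, n):
        if i + z[i] == n:
            return i
    return n


print(z_array("abababab"))  # [8, 0, 6, 0, 4, 0, 2, 0]
print(period("abababab"))   # 2
print(period("ABCAB"))      # 3
```

Note that this definition of the period also accepts a truncated last repetition, as in the ABCAB example from the question.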
Well, if every character in the input string is part of the repeating substring, then all you have to do is store the first character and compare it with the rest of the string's characters one by one. If you find a match, the string up to the matched position is your repeating string.
I too have been looking for the time-space-optimal solution to this problem. The accepted answer by tmyklebu essentially seems to be it, but I would like to offer some explanation of what it's actually about and some further findings.
First, this question by me proposes a seemingly promising but incorrect solution, with notes on why it's incorrect: Is this algorithm correct for finding period of a string?
In general, the problem "find the period" is equivalent to "find the pattern within itself" (in some sense, "strstr(x+1,x)"), but with no constraints on matching past its end. This means that you can find the period by taking any left-to-right string matching algorithm and applying it to itself, considering a partial match that hits the end of the haystack/text as a match, and the time and space requirements are the same as those of whatever string matching algorithm you use.
The approach cited in tmyklebu's answer is essentially applying this principle to String Matching on Ordered Alphabets, also explained here. Another time-space-optimal solution should be possible using the GS algorithm.
The fairly well-known and simple Two Way algorithm (also explained here) unfortunately is not a solution because it's not left-to-right. In particular, the advancement after a mismatch in the left factor depends on the right factor having been a match, and the impossibility of another match misaligned with the right factor modulo the right factor's period. When searching for the pattern within itself and disregarding anything past the end, we can't conclude anything about how soon the next right-factor match could occur (part or all of the right factor may have shifted past the end of the pattern), and therefore a shift that preserves linear time cannot be made.
Of course, if working space is available, a number of other algorithms may be used. KMP is linear-time with O(n) space, and it may be possible to adapt it to something still reasonably efficient with only logarithmic space.
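For the case where O(n) working space is acceptable, the KMP variant can be sketched as below: the failure (prefix) function of the string gives the length of its longest proper border, and the period is the string length minus that border. This is the linear-space version only, not the time-space-optimal approach discussed above, and the function names are illustrative.

```python
def prefix_function(s: str) -> list:
    """pi[i] = length of the longest proper prefix of s[:i+1]
    that is also a suffix of s[:i+1] (the KMP failure function)."""
    n = len(s)
    pi = [0] * n
    for i in range(1, n):
        k = pi[i - 1]
        while k > 0 and s[i] != s[k]:
            k = pi[k - 1]  # fall back to the next shorter border
        if s[i] == s[k]:
            k += 1
        pi[i] = k
    return pi


def period_kmp(s: str) -> int:
    """Smallest period of s: length minus the longest proper border."""
    if not s:
        return 0
    return len(s) - prefix_function(s)[-1]


for example in ("AAAA", "ABAB", "ABCAB", "EFIEFI"):
    print(example, period_kmp(example))
```

This matches the "pattern within itself" view: a border of length b means the string matches itself when shifted by n - b, with the overhang past the end ignored.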

Most efficient way to print an AVL tree of strings?

I'm thinking that an in-order traversal will run in O(n) time. The only thing better than that would be to have something running in log n time. But I don't see how this could be, considering we have to do at least n steps.
Is O(n) the fastest we could do here?
Converting and expanding @C.B.'s comment to an answer:
If you have an AVL tree with n strings in it and you want to print all of them, then you have to do at least Θ(n) total work simply because you have to print out each of the n strings. You can often lower-bound the amount of work required to produce a list or otherwise output a sequence of values simply by counting up how many items are going to be in the list.
We can be even more precise here. Suppose the combined length of all the strings in the tree is L. The time required to print out all the strings in the tree has to be at least Θ(L), since it costs some computational effort to output each individual character. Therefore, we can say that we have to do at least Θ(n + L) work to print out all the strings in the tree.
The bound given here just says that any correct algorithm has to do at least this much work, not that there actually is an algorithm that does this much work. But if you look closely at any of the major tree traversals - inorder, preorder, postorder, level-order - you'll find that they all match this time bound.
Now, one area where you can look for savings is in space complexity. A level-order traversal of the tree might require Ω(n) total space if the tree is perfectly balanced (since it holds a whole layer of the tree in memory and the bottommost layer can have Θ(n) nodes in it), while an inorder, preorder, or postorder traversal would only require O(log n) memory because you only need to store the current access path, which has logarithmic height in an AVL tree.
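As an illustration of the space argument, here is a short Python sketch of an iterative inorder traversal whose explicit stack only ever holds the current access path. The `Node` class is a hypothetical stand-in for an AVL node (no balancing logic or height field is included), so the O(log n) stack bound relies on the tree already being balanced.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class Node:
    """Hypothetical tree node; a real AVL node would also track height."""
    key: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def inorder(root: Optional[Node]) -> Iterator[str]:
    """Yield the keys in sorted order.

    The stack holds only ancestors whose left subtree is being visited,
    so its size never exceeds the height of the tree.
    """
    stack = []
    node = root
    while stack or node is not None:
        while node is not None:   # walk down the left spine
            stack.append(node)
            node = node.left
        node = stack.pop()
        yield node.key
        node = node.right


tree = Node("m", Node("c", Node("a")), Node("x"))
for key in inorder(tree):
    print(key)
```

Each node is pushed and popped exactly once, and printing a key costs time proportional to its length, which lines up with the n + L bound discussed above.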

Search for cyclic strings

I am looking for the most efficient way to store binary strings in a data structure (insert function) and then when getting a string I want to check if some cyclic string of the given string is in my structure.
I thought about storing the input strings in a Trie, but then determining whether some cyclic string of the query string was inserted into the Trie means doing |s| searches in the Trie, one for each possible cyclic string.
Is there any way to do that more efficiently while keeping the space complexity of a Trie?
Note: When I say cyclic strings of a string I mean that for example all the cyclic strings of 1011 are: 0111, 1110, 1101, 1011
Can you come up with a canonicalizing function for cyclic strings based on the following:
1. Find the largest run of zeroes.
2. Rotate the string so that that run of zeroes is at the front.
3. For each run of zeroes of equal size, see if rotating that to the front produces a lexicographically lesser string and if so use that.
This would canonicalize everything in the equivalence class (1011, 1101, 1110, 0111) to the lexicographically least value: 0111.
0101010101 is a thorny instance for which this algo will not perform well, but if your bits are roughly randomly distributed, it should work well in practice for long strings.
You can then hash based on the canonical form or use a trie that will include only the empty string and strings that start with 0 and a single trie run will answer your question.
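One way the canonicalizing function could be written is sketched below in Python. The function name is made up, the input is assumed to contain only '0' and '1', and the rotations are built with string slicing for clarity, so each candidate comparison copies the string instead of using an offset.

```python
def canonical_rotation(s: str) -> str:
    """Rotate s so that a longest run of zeroes comes first, choosing the
    lexicographically least rotation among equally long runs."""
    n = len(s)
    if "0" not in s or "1" not in s:
        return s  # all rotations are identical

    # Re-anchor at a position where a run of zeroes begins (a '0' preceded
    # cyclically by a '1'), so that no run wraps around the end.
    p = next(i for i in range(n) if s[i] == "0" and s[i - 1] == "1")
    t = s[p:] + s[:p]

    # Collect (start, length) for every maximal run of zeroes in t.
    runs = []
    i = 0
    while i < n:
        if t[i] == "0":
            j = i
            while j < n and t[j] == "0":
                j += 1
            runs.append((i, j - i))
            i = j
        else:
            i += 1

    longest = max(length for _, length in runs)
    candidates = [start for start, length in runs if length == longest]
    return min(t[c:] + t[:c] for c in candidates)


for rotation in ("1011", "1101", "1110", "0111"):
    print(rotation, "->", canonical_rotation(rotation))
```

All four rotations in the loop map to 0111, so a single lookup of the canonical form in a hash set or trie answers the query.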
EDIT:
If I have a string of length |s|, it can take a lot of time to find the lexicographically least value... how much time will it actually take?
That's why I said 010101.... is a value for which it performs badly. Let's say the string is of length n and the longest run of 0's is of length r. If the bits are randomly distributed, the length of the longest run is O(log n) according to "Distribution of longest run".
The time to find the longest run is O(n). You can implement shifting using an offset instead of a buffer copy, which should be O(1). The number of runs of that length is worst case O(n / r).
Then, the time to do step 3 should be
Find other long runs: one O(n) pass with O(log n) storage average case, O(n) worst case
For each run: O(log n) average case, O(n) worst case
Shift and compare lexicographically: O(log n) average case since most comparisons of randomly chosen strings fail early, O(n) worst case.
This leads to a worst case of O(n²) but an average case of O(n + log² n) ≅ O(n).
You have n strings s1..sn, and given a string t you want to know whether a cyclic permutation of t is a substring of any of s1..sn. And you want to store the strings as efficiently as possible. Did I understand your question correctly?
If so, here is a solution, but with a large run-time: for a given input t, let t' = concat(t,t), and check t' against every si in s1..sn to see whether the longest common substring of t' and si has length at least |t|. If |si| = k and |t| = l, it runs in O(n·k·l) time. And you can store s1..sn in any data structure you want. Is that good enough for you?
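A brute-force Python sketch of this doubling idea is given below. Instead of computing a longest common substring, it enumerates the length-|t| windows of concat(t,t), which are exactly the cyclic permutations of t, and tests each one as a substring. The function name is an assumption, t is assumed non-empty, and no attempt is made at efficiency.

```python
def has_cyclic_match(stored, t: str) -> bool:
    """True if some cyclic permutation of t occurs as a substring of
    at least one string in stored. Assumes t is non-empty."""
    doubled = t + t
    m = len(t)
    # Every window of length |t| in t+t is one rotation of t.
    rotations = {doubled[i:i + m] for i in range(m)}
    return any(rot in s for s in stored for rot in rotations)


print(has_cyclic_match(["0001110"], "101"))  # True: the rotation 011 occurs
print(has_cyclic_match(["0000"], "101"))     # False
```

The stored strings are kept as a plain list here, in line with the remark that any data structure will do for this approach.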

Resources