Search for cyclic strings - string

I am looking for the most efficient way to store binary strings in a data structure (insert function) and then when getting a string I want to check if some cyclic string of the given string is in my structure.
I thought about storing the input strings in a Trie, but then determining whether some cyclic string of the query string was inserted into the Trie requires |s| searches, one for each possible cyclic string.
Is there a more efficient way to do this while keeping the space complexity of a Trie?
Note: When I say cyclic strings of a string I mean that for example all the cyclic strings of 1011 are: 0111, 1110, 1101, 1011

Can you come up with a canonicalizing function for cyclic strings based on the following:
Find the largest run of zeroes.
Rotate the string so that that run of zeroes is at the front.
For each run of zeroes of equal size, see if rotating that to the front produces a lexicographically lesser string and if so use that.
This would canonicalize everything in the equivalence class (1011, 1101, 1110, 0111) to the lexicographically least value: 0111.
0101010101 is a thorny instance for which this algo will not perform well, but if your bits are roughly randomly distributed, it should work well in practice for long strings.
You can then hash based on the canonical form or use a trie that will include only the empty string and strings that start with 0 and a single trie run will answer your question.
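The three steps above can be sketched in Python roughly as follows. This is only an illustration under assumptions: the function name canonical_rotation is hypothetical, the input is a str of '0' and '1' characters, and candidate rotations are built by copying rather than with the offset trick, so it is simpler but slower than what the answer describes.

```python
def canonical_rotation(bits):
    """Rotate so that a longest circular run of zeroes is at the front,
    breaking ties by picking the lexicographically least candidate."""
    n = len(bits)
    if "0" not in bits or "1" not in bits:
        return bits  # empty or constant string: all rotations are equal
    # Start scanning at a '1' so that no run of zeroes wraps past the end of the scan.
    start = bits.index("1")
    runs = []  # (run_length, start_index) of every maximal circular run of zeroes
    i = 0
    while i < n:
        pos = (start + i) % n
        if bits[pos] == "0":
            length = 0
            while i + length < n and bits[(start + i + length) % n] == "0":
                length += 1
            runs.append((length, pos))
            i += length
        else:
            i += 1
    longest = max(length for length, _ in runs)
    # Step 3: among runs of maximal length, keep the least rotation.
    candidates = [bits[p:] + bits[:p] for length, p in runs if length == longest]
    return min(candidates)
```

Every member of an equivalence class maps to the same string, so the result can be used as a hash key or as the only string inserted into the trie.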
EDIT:
If I have a string of length |s|, it can take a lot of time to find the lexicographically least value. How much time will it actually take?
That's why I said 010101.... is a value for which it performs badly. Let's say the string is of length n and the longest run of zeroes is of length r. If the bits are randomly distributed, the length of the longest run is O(log n) according to "Distribution of longest run".
The time to find the longest run is O(n). You can implement shifting using an offset instead of a buffer copy, which should be O(1). The number of runs of that length is worst case O(n / r).
Then, the time to do step 3 should be
Find other long runs: one O(n) pass with O(log n) storage average case, O(n) worst case
For each run: O(log n) average case, O(n) worst case
Shift and compare lexicographically: O(log n) average case since most comparisons of randomly chosen strings fail early, O(n) worst case.
This leads to a worst case of O(n²) but an average case of O(n + log² n) ≅ O(n).

You have n strings s1..sn and given a string t you want to know whether a cyclic permutation of t is a substring of any s1..sn. And you want to store the strings as efficiently as possible. Did I understand your question correctly?
If so, here is a solution, but with a large run-time: for a given input t, let t' = concat(t,t), and check t' against every s in s1..sn to see whether the longest common substring of t' and s is at least |t| long. If |si| = k and |t| = l, this runs in O(n.k.l) time, and you can store s1..sn in any data structure you want. Is that good enough for you?
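A minimal sketch of that check, assuming the stored strings are kept in a plain Python list and using the textbook dynamic program for the longest common substring. Both function names are made up for the illustration.

```python
def longest_common_substring_len(a, b):
    """Length of the longest contiguous substring shared by a and b, in O(|a|*|b|)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common substring ending at a[i-2], b[j-2].
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def some_rotation_is_substring(t, stored):
    """True if some cyclic rotation of t occurs inside any stored string."""
    doubled = t + t  # every rotation of t is a length-|t| substring of t + t
    return any(longest_common_substring_len(doubled, s) >= len(t) for s in stored)
```

For example, some_rotation_is_substring("1011", ["000111000"]) is true because the rotation 0111 occurs in the stored string.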

Related

Time complexity of String.contains()

What is the time complexity of String.contains();
lets say n is the length of the string that is compared against another string of length k.
There is no answer without knowing the actual implementation of the String.contains() that you're interested in; or what algorithm you intend to use.
A completely naive implementation might take (n+1-k)*k comparisons to decide that a given string of length n does not contain a particular substring of length k. That's O(nk) for the worst case.
Even stopping substring comparisons after the first unequal comparison, while having a smaller coefficient, still is O(nk). Construct a string that's a repetition of many isolated letters, each separated by exactly k-1 spaces, and search that for an occurrence of k consecutive spaces. The search will fail, but each substring comparison will take an amortized k/2 compares to find that out, and you're still at O(nk).
If k is known to be much less than n, you could treat that as O(n).
The average case depends on the actual algorithm used, and also on the distribution of characters in the two strings; and you haven't said what either of those were.
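To make the counting concrete, here is a sketch of such a naive search that also reports how many character comparisons it made. The function naive_contains is hypothetical and is not how any particular library implements String.contains().

```python
def naive_contains(text, pattern):
    """Return (found, comparisons) for a naive substring search
    that stops each attempt at the first unequal character."""
    n, k = len(text), len(pattern)
    comparisons = 0
    for start in range(n - k + 1):
        matched = 0
        while matched < k:
            comparisons += 1
            if text[start + matched] != pattern[matched]:
                break
            matched += 1
        if matched == k:
            return True, comparisons
    return False, comparisons
```

Searching "aaaa" for "ab" fails after exactly (n+1-k)*k = 6 comparisons, since every attempt matches the first character and fails on the second.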

Average Case Big O and the Impact of Sorting

I'm looking at the time complexity for implementations of a method which determines if a String contains all unique characters.
The basic, brute force, approach would be to iterate through the String one character at a time maintaining a HashSet of seen characters. For each character in the iteration we check if the Set already contains it, and if so return false. We return true if the entire String has been searched. This would be O(n) as a worst case complexity. What would be the average case? O(n/2)?
If we try to optimise this by sorting the String into a char array, would it be more or less efficient? Sorting typically takes O(n log n) which is worse than O(n), but a sorted String allows for duplicate characters to be detected much earlier (especially for long strings).
Do we say the worst case is O(n^2 log n) but the average case is better? If so, what is it?
In the un-sorted case, the average case depends entirely on the string! Without knowing/assuming any distribution, it's hard to make any assumption.
A simple case, for a string with randomly-placed characters, where one of the characters repeats once:
the number of possibilities for the repeated characters being arranged is n*(n-1)/2
the probability it is detected in exactly k steps is 2*(k-1)/(n*(n-1))
the probability it is detected in at most k steps is (k*(k-1))/(n*(n-1)), so for large n the median detection step is about 0.7071*n (and the mean is 2*(n+1)/3)... [incomplete]
For multiple characters that occur with different frequencies, or you make different assumptions on how characters are distributed in the string, you'll get different probabilities.
Hopefully someone can extend on my answer! :)
If the string is sorted, then you don't need the HashSet.
However, the average case still depends on the distribution of characters in the string: if you get two aa in the beginning, it's pretty efficient; if you get two zz, then you didn't win anything.
The worst case is sorting plus detecting-duplicates, so O(n log n + n), or just O(n log n).
So, it appears it's not advantageous to sort the string beforehand, due to the increased complexity, both in average-case and worst-case.
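For comparison, the two approaches discussed above might look like this in Python (the question reads like Java, so treat this as a language-neutral sketch; both function names are invented here):

```python
def all_unique_hashset(s):
    """O(n) expected time: stop at the first character seen before."""
    seen = set()
    for ch in s:
        if ch in seen:
            return False
        seen.add(ch)
    return True


def all_unique_sorted(s):
    """O(n log n): sort first, then any duplicates are adjacent."""
    chars = sorted(s)
    return all(a != b for a, b in zip(chars, chars[1:]))
```

Both return the same answer; the sorted version pays the full sorting cost before it can detect anything, which is why it does not win on average.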

Find most common substring in given string? Overlapping is allowed

I already searched for posts on this question. But none of them have clear answers.
Find the occurrence of most common substring with length n in given string.
For example, "deded", we set the length of substring to be 3. "ded" will be the most common substring and its occurrence is 2.
A few posts suggest using a suffix tree, with O(nlgn) time complexity and O(n) space complexity.
First, I'm not familiar with suffix trees. My idea is to use a hashmap to store the occurrence count of each substring of length 3. The time is O(n) while space is also O(n). Is this better than a suffix tree? Should I take hashmap collisions into account?
Extra: if above problem is addressed, how can we solve the problem that length of substring doesn't matter. Just find the most common substring in given string.
If the length of the most common substring doesn't matter (but, say, you want it to be greater than 1), then the best solution is to look for the most common substring of length 2. You can do this with a suffix tree in linear time; if you look up suffix trees, it will be clear how to do this.
If you want the length M of the most common substring to be an input parameter, then you can hash all substrings of length M in linear time using multiply-and-add hashing: multiply the previous string hash value by a constant, add the value of the next character in the string, and reduce the result modulo a prime P. If you choose P at random such that you can afford O(P) memory, then this will do the trick in linear time, assuming your hashing has no collisions.
If your hashing might have a lot of collisions, the substring length is M, and the total string length is N, then the running time would be O(MN), because you have to check all collisions, which in the worst case could mean checking all substrings of length M, for example if your string consists of a single repeated character.
Suffix trees are better in the worst case. Let me know if you want some details (but not completely, because suffix trees are complicated) and I can explain at a high level how to get a faster solution with suffix trees.
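As a rough sketch of the fixed-length case, the counting can be done with Python's collections.Counter. Note the swap: this hashes each slice directly instead of using the rolling multiply-and-add hash described above, so it costs O(MN) character work rather than O(N). The function name most_common_substring is an assumption for the example.

```python
from collections import Counter


def most_common_substring(s, m):
    """Return (substring, count) for the most frequent length-m substring
    of s, counting overlapping occurrences, or None if there is none."""
    if m <= 0 or m > len(s):
        return None
    counts = Counter(s[i:i + m] for i in range(len(s) - m + 1))
    return counts.most_common(1)[0]
```

With the example from the question, most_common_substring("deded", 3) gives ("ded", 2).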

String pattern matching with one or zero mismatch

Given a string and a pattern to be matched, how efficiently can the matches be found having zero or one mismatch.
e.g.
S = abbbaaabbbabab
P = abab
Matches are abbb (index 0), aaab (index 4), abbb (index 6), bbab (index 8), abab (index 10)
I tried to modify KMP algorithm but I'm not sure about the approach.
Please give me idea to proceed with the problem.
Thanks.
Ok I found it! I found the best algorithm!
This might sound a bit bold, but since the algorithm I am going to propose has O(m + n) running time and O(m + n) memory consumption, and the input data itself has the same size, the algorithm can only be improved by a constant factor.
Algorithms used
I am going to use a mix of the KMP and Rabin Karp algorithms for my solution. Rabin Karp uses rolling hashes for comparing substrings of the initial strings. It requires linear-time precomputation that uses linear additional memory, but from then on the comparison between substrings of the two strings is constant O(1) (this is amortized if you handle collisions properly).
What my solution will not do
My solution will not find all the occurrences in the first string that match the second string with at most 1 difference. However, the algorithm can be modified so that for every starting index in the first string if there is such matching at least one of them will be found (this is left to the reader).
Observations
Let m be the length of the second string and n the length of the first string. I am going to split the task in two parts: if I am aiming to find a matching with at most one difference, I want to find two substrings of the first string: PREF is going to be the substring before the single difference and SUFF the substring after the difference. I want len(PREF) + len(SUFF) + 1 = m, where PREF or SUFF will be artificially shortened if required (when the strings match without difference).
I am going to base my solution on one very important observation: suppose there is a substring of the first string starting at index i with length m that matches the second string with at most one difference. Then if we take PREF as long as possible there will still be solution for SUFF. This is obvious: I am just pushing the difference as much to the end as possible.
The algorithm
And now follows the algorithm itself. Start off with the usual KMP. Every time the extension of the prefix fails and the fail links are to be followed, first check whether, if you skip the next letter, the remaining suffix matches the remainder of the second string. If so, the sought match with at most one character difference is found. If not, we go on with the ordinary KMP, making the Rabin Karp check every time a fail link is to be followed.
Let me clarify further the Rabin Karp check with an example. Suppose we are at a certain step of the KMP and we have found that first.substring[i, i + k - 1] matches the first k letters of the second string. Suppose also that the letter first[i + k] is different from second[k]. Then you check whether first.substring[i + k + 1, i + m - 1] matches exactly second.substring[k + 1, m - 1] using Rabin Karp. This is exactly the case in which you have extended the starting prefix from index i as much as possible and you now try whether there is a match with at most one difference.
Rabin Karp will be used only when a fail link is followed, which moves the starting index of the prefix with at least one, which means that at most O(n) Rabin Karp calls are used, every one with complexity O(1) for a total of linear complexity.
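The O(1) substring comparison that this check relies on could be sketched with polynomial prefix hashes as below. This covers only the "Rabin Karp check" step, not the surrounding KMP loop. The class PrefixHash, the helper rest_matches, and the constants BASE and MOD are assumptions chosen for the illustration, and equal hashes are treated as equal substrings without handling collisions.

```python
MOD = (1 << 61) - 1   # assumed large prime modulus
BASE = 131            # assumed multiplier


class PrefixHash:
    """Linear-time precomputation, then O(1) hash of any substring."""

    def __init__(self, s):
        self.prefix = [0]
        self.power = [1]
        for ch in s:
            self.prefix.append((self.prefix[-1] * BASE + ord(ch) + 1) % MOD)
            self.power.append(self.power[-1] * BASE % MOD)

    def get(self, lo, hi):
        """Hash of s[lo:hi] (half-open range)."""
        return (self.prefix[hi] - self.prefix[lo] * self.power[hi - lo]) % MOD


def rest_matches(first_hash, second_hash, i, k, m):
    """After a mismatch at first[i + k] vs second[k], test whether
    first[i+k+1 .. i+m-1] equals second[k+1 .. m-1] (by hash only)."""
    return first_hash.get(i + k + 1, i + m) == second_hash.get(k + 1, m)
```

For instance, with first = abbbaaabbbabab and second = abab, at i = 0 the prefix "ab" matches (k = 2), first[2] differs from second[2], and rest_matches confirms that the single remaining letter agrees.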
This is known as the approximate string matching problem. In your particular case, you want a maximum edit distance of 1.
The bitap algorithm is a fairly fast way of solving it.
To find all submatches including one mismatch you need 2 z-functions (one for the original P, and another for reversed P).
After that build an array of longest prefix submatches for the original and reversed string S.
Later you need to reverse the second array.
And in the end everything is easy: run through the first array and check if the length of longest prefix is equal to the length of P. If it is, then it is a match without mistakes.
If it is shorter, then check the second array at position (i + length(P) - 1). If the sum of the two values is equal to length(P) - 1, then it is a submatch with one mistake.
Complexity is O(len(P) + len(S))
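A sketch of this z-function approach follows. It assumes a separator character (here "\x00") that occurs in neither string, and the function names are chosen for the example rather than taken from the answer.

```python
def z_function(s):
    """z[i] = length of the longest common prefix of s and s[i:] (z[0] left as 0)."""
    n = len(s)
    z = [0] * n
    l = r = 0
    for i in range(1, n):
        if i < r:
            z[i] = min(r - i, z[i - l])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > r:
            l, r = i, i + z[i]
    return z


def matches_with_at_most_one_mismatch(text, pattern):
    """Start indices where pattern matches text with zero or one mismatch."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    sep = "\x00"  # assumption: not present in text or pattern
    zf = z_function(pattern + sep + text)
    zb = z_function(pattern[::-1] + sep + text[::-1])
    # pref[i]: longest prefix of pattern matching text starting at i.
    pref = [zf[m + 1 + i] for i in range(n)]
    # suff[j]: longest suffix of pattern matching text ending at j
    # (the z-array of the reversed strings, read back to front).
    suff = [zb[m + 1 + (n - 1 - j)] for j in range(n)]
    result = []
    for i in range(n - m + 1):
        if pref[i] == m or pref[i] + suff[i + m - 1] == m - 1:
            result.append(i)
    return result
```

A window with exactly one mismatch at offset p has a matching prefix of length p and a matching suffix of length m - 1 - p, which is why the sum is compared against length(P) - 1.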
A comprehensive overview of the various algorithms and how they compare to each other is given by Gonzalo Navarro in his A guided tour to approximate string matching. Pages 80, 81 and 82 show complexity results, including worst and average cases, and space requirements for the various algorithms.
(In the notation used there, n refers to the length of the text you search, m to the length of the pattern, σ to the size of the alphabet, and k to the maximum edit distance, which is 1 in your case.)

data structure for shift strings

We're interested in a data structure for binary strings. Let S = s_1 s_2 ... s_m be a binary string of size m. Shift(S,i) is a cyclic shift of string S i places to the left. That is, Shift(S,i) = s_i s_(i+1) s_(i+2) ... s_m s_1 ... s_(i-1). Suggest an efficient data structure that supports:
Init() of an empty DS in O(1)
Insert(s) inserts a binary string to the DS in O(|s|^2)
Search_cyclic(s) checks whether Shift(s,i) is in the DS for ANY i, in O(|s|).
Space Complexity: O(|S1|+|S2|+.....+|Sm|) where Si is one of the m strings we've inserted thus far.
If I had to find Search_cyclic(s,i) for some given i, this is quite simple with using a suffix tree and just traversing it in O(|s|). But here in Search_cyclic(s) we don't have a given i, so I don't know what to do in the given complexity. OTOH, Insert(s) generally takes O(|s|) to insert to a suffix tree and here we are given O(|s|^2).
So here is a solution I can propose to you. The complexities are even lower than the ones they asked of you but it may seem a bit complicated.
The data structure in which you keep all the strings will be a Trie or even a Patricia tree. In this tree, for each string you want to insert the minimum cyclic shift (i.e. the lexicographically smallest of all its possible cyclic shifts). You can calculate the minimum cyclic shift of a string in linear time and I will give one possible solution to that a bit later. For the moment let's assume you can do it. Here is how the operations required will be implemented:
Init() - init of both trie and patricia tree are constant - no problem here
Insert(s) - you compute the minimum cyclic shift s' of s in O(|s|) and then you insert it in either of the data structures in O(|s'|) = O(|s|). This is even better than the required complexity
Search_cyclic(s) - again you compute the minimum cyclic shift of s in O(|s|) and then you check in the Patricia or Trie if the string is present, which again is done in O(|s|)
Also the memory complexity is as required and may be even lower if you construct a Patricia.
So all that is left is to explain how to find the minimum cyclic shift. Since you mention suffix trees, I hope you know how to construct one in linear time. The trick is to append your string s to itself (i.e. double it) and then construct a suffix tree for the doubled string. This is still linear with respect to |s|, so no problem there. After that, all you have to do is find the minimum among the substrings of length |s| in this tree. This is not hard at all, I believe: start from the root and always follow the link from the current node that has the minimal string written on it, until you accumulate length at least |s|. Because of the doubling of the string, you will always be able to follow minimal string links until you accumulate length at least |s|.
Hope this answer helps.
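For illustration only, here is a compact sketch of the same idea with two substitutions that are not part of the answer above: the minimum cyclic shift is computed with the well-known two-pointer minimal-rotation scan instead of a suffix tree (also linear time), and a Python set stands in for the Trie or Patricia tree. The names least_rotation and CyclicStringSet are hypothetical.

```python
def least_rotation(s):
    """Lexicographically smallest cyclic shift of s, in O(|s|)."""
    n = len(s)
    i, j, k = 0, 1, 0  # two candidate start positions and the matched length
    while i < n and j < n and k < n:
        a = s[(i + k) % n]
        b = s[(j + k) % n]
        if a == b:
            k += 1
            continue
        # The candidate with the larger character loses, together with
        # the next k starting positions after it.
        if a > b:
            i += k + 1
        else:
            j += k + 1
        if i == j:
            j += 1
        k = 0
    start = min(i, j)
    return s[start:] + s[:start]


class CyclicStringSet:
    """Stores one canonical representative per class of cyclic shifts."""

    def __init__(self):
        self._canon = set()

    def insert(self, s):
        self._canon.add(least_rotation(s))

    def search_cyclic(self, s):
        return least_rotation(s) in self._canon
```

After inserting 1011, a search for 0111, 1110 or 1101 succeeds because all of them canonicalize to 0111, while a search for 0011 fails.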

Resources