Average Case Big O and the Impact of Sorting - string

I'm looking at the time complexity for implementations of a method which determines if a String contains all unique characters.
The basic brute-force approach would be to iterate through the String one character at a time, maintaining a HashSet of seen characters. For each character in the iteration we check if the Set already contains it, and if so return false. We return true if the entire String has been searched. This would be O(n) as a worst case complexity. What would be the average case? O(n/2)?
If we try to optimise this by sorting the String into a char array, would it be more or less efficient? Sorting typically takes O(n log n) which is worse than O(n), but a sorted String allows for duplicate characters to be detected much earlier (especially for long strings).
Do we say the worst case is O(n^2 log n) but the average case is better? If so, what is it?

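In the unsorted case, the average case depends entirely on the string! Without knowing or assuming a distribution of characters, it's hard to say anything general. Note also that O(n/2) is the same class as O(n), because Big O drops constant factors.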
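A simple case: a string of length n with randomly placed characters, where exactly one character occurs twice:
the number of possible position pairs for the two copies is n*(n-1)/2
the probability that the repeat is detected at exactly step k is 2*(k-1)/(n*(n-1)), because the second copy sits at position k and the first copy can be at any of the k-1 earlier positions
the probability that it is detected in at most k steps is (k*(k-1))/(n*(n-1)), meaning that for large n half of all inputs are resolved within about 0.7071*n steps (the median), and the expected number of steps is 2*(n+1)/3, roughly 0.667*n. Either way, the average case is still O(n).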
If multiple characters occur with different frequencies, or if you make different assumptions about how characters are distributed in the string, you'll get different probabilities.
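The single-repeat model can be checked by brute-force enumeration. The sketch below is only an illustration under that model's assumption (the two copies of the repeated character occupy a uniformly random pair of positions); `detection_stats` is a hypothetical helper name, not something from the question.

```python
from fractions import Fraction
from itertools import combinations


def detection_stats(n):
    """Exact distribution of the step at which the single repeat is found.

    Assumes positions 1..n and that the two copies of the repeated
    character occupy a uniformly random pair of positions (i < j).
    A left-to-right scan notices the repeat when it reaches position j.
    """
    pairs = list(combinations(range(1, n + 1), 2))
    total = len(pairs)  # n * (n - 1) / 2
    exactly = {k: Fraction(0) for k in range(1, n + 1)}
    for _, j in pairs:
        exactly[j] += Fraction(1, total)
    mean = sum(k * p for k, p in exactly.items())
    return exactly, mean


exactly, mean = detection_stats(10)
print(exactly[4])  # 2*(4-1)/(10*9) = 1/15
print(mean)        # 2*(10+1)/3 = 22/3
```

Using exact fractions instead of floats keeps the comparison with the closed-form expressions free of rounding noise.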
Hopefully someone can extend on my answer! :)
If the string is sorted, then you don't need the HashSet.
However, the average case still depends on the distribution of characters in the string: if the duplicated pair is aa and lands at the beginning of the sorted array, the scan finishes quickly; if it is zz at the end, you gain nothing. In both cases you have already paid for the sort.
The worst case is sorting plus detecting-duplicates, so O(n log n + n), or just O(n log n).
So, it appears it's not advantageous to sort the string beforehand, due to the increased complexity, both in average-case and worst-case.
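For comparison, here is a minimal sketch of both approaches. It is written in Python rather than Java for brevity, and the function names are made up for illustration.

```python
def all_unique_hashed(s):
    """O(n) expected time: stop at the first character seen before."""
    seen = set()
    for ch in s:
        if ch in seen:
            return False
        seen.add(ch)
    return True


def all_unique_sorted(s):
    """O(n log n): sort first, then duplicates must be adjacent."""
    chars = sorted(s)
    for prev, cur in zip(chars, chars[1:]):
        if prev == cur:
            return False
    return True


print(all_unique_hashed("abcdef"), all_unique_sorted("abcdef"))  # True True
print(all_unique_hashed("abcdea"), all_unique_sorted("abcdea"))  # False False
```

The sorted version cannot exit before the sort has finished, so its early exit in the scan never compensates for the O(n log n) already spent.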

Related

Time complexity of String.contains()

What is the time complexity of String.contains()?
Let's say n is the length of the string being searched and k is the length of the string searched for.
There is no answer without knowing the actual implementation of the String.contains() that you're interested in; or what algorithm you intend to use.
A completely naive implementation might take (n+1-k)*k comparisons to decide that a given string of length n does not contain a particular substring of length k. That's O(nk) for the worst case.
Even stopping substring comparisons after the first unequal comparison, while having a smaller coefficient, still is O(nk). Construct a string that's a repetition of many isolated letters, each separated by exactly k-1 spaces, and search that for an occurrence of k consecutive spaces. The search will fail, but each substring comparison will take about k/2 compares on average to find that out, and you're still at O(nk).
If k is known to be bounded by a small constant, you could treat that as O(n).
The average case depends on the actual algorithm used, and also on the distribution of characters in the two strings; and you haven't said what either of those were.
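To make the counting argument concrete, here is a minimal sketch of a naive search that stops each attempt at the first mismatch and counts character comparisons. It is not how any particular library implements String.contains(); `naive_contains` is a hypothetical name.

```python
def naive_contains(text, pattern):
    """Return (found, comparisons) for a naive left-to-right search."""
    n, k = len(text), len(pattern)
    comparisons = 0
    for start in range(n - k + 1):
        matched = 0
        while matched < k:
            comparisons += 1
            if text[start + matched] != pattern[matched]:
                break  # first unequal character: give up on this start
            matched += 1
        if matched == k:
            return True, comparisons
    return False, comparisons


# The adversarial input described above: isolated letters separated by
# exactly k-1 spaces, searched for k consecutive spaces.
k = 4
text = ("a" + " " * (k - 1)) * 50
found, comparisons = naive_contains(text, " " * k)
print(found, len(text), comparisons)
```

On this input the comparison count grows with both the text length and k, which is the O(nk) behaviour the answer describes.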

Longest Palindromic Substring clarification

Approach #3 (Dynamic Programming) [Accepted]
To improve over the brute force solution, we first observe how we can avoid unnecessary re-computation while validating palindromes. Consider the case "ababa". If we already knew that "bab" is a palindrome, it is obvious that "ababa" must be a palindrome since the two left and right end letters are the same.
This yields a straightforward DP solution, in which we first initialize the one- and two-letter palindromes, and work our way up finding all three-letter palindromes, and so on...
Complexity Analysis
Time complexity: O(n^2).
Space complexity: O(n^2). It uses O(n^2) space to store the table.
I read the above solution to this problem online, and have some questions about it (if this isn't the correct forum to post on please let me know). This is my understanding of how to do this problem: save all the one-char palindromes. Then for each of these, if the char to the left equals the char to the right, keep it. If that condition isn't met, cease dealing with this substring. Continue this until end is reached.
Is this correct? If so, how does this translate to O(N^2) algorithm? Is it because, in the worst case scenario, we have to run through the string N times to increment each potential palindrome by one char? This part isn't intuitive to me.
Your interpretation is correct.
In the worst case we need to check all substrings in order of increasing length. We first check all substrings of length 1, then all substrings of length 3, and so on. In addition, we need to take palindromes of the kind "abba" into account, so we also need to check all candidates with even length. So in the worst case, we need to validate every possible substring of the input string.
The total number of substrings of a string of length n is n(n + 1)/2:
n * (n + 1) / 2 = n ^ 2 / 2 + n / 2 = O(n ^ 2)
A single validation step for a palindrome takes O(1), because the result for the inner substring is already stored and only the two end characters need to be compared. Thus the total runtime is O(n ^ 2).
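A minimal sketch of the table-based approach may make the O(1) validation step easier to see. The names (`longest_palindrome`, `is_pal`) are illustrative and not taken from the quoted solution.

```python
def longest_palindrome(s):
    """Longest palindromic substring via an O(n^2) table."""
    n = len(s)
    if n == 0:
        return ""
    # is_pal[i][j] is True when s[i..j] (inclusive) is a palindrome.
    is_pal = [[False] * n for _ in range(n)]
    best_start, best_len = 0, 1
    for i in range(n):
        is_pal[i][i] = True  # every single character is a palindrome
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            # O(1) check: the ends match and the inner substring is a
            # palindrome (or is empty, when length == 2).
            if s[i] == s[j] and (length == 2 or is_pal[i + 1][j - 1]):
                is_pal[i][j] = True
                if length > best_len:
                    best_start, best_len = i, length
    return s[best_start:best_start + best_len]


print(longest_palindrome("ababa"))  # ababa
print(longest_palindrome("cbbd"))   # bb
```

The two nested loops visit each (start, length) pair once, which is where the n(n + 1)/2 count comes from.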

Search for cyclic strings

I am looking for the most efficient way to store binary strings in a data structure (insert function) so that, when given a string, I can check whether some cyclic string of it is already in my structure.
I thought about storing the input strings in a Trie, but then determining whether some cyclic string of the given string was inserted means doing |s| searches in the Trie, one for each possible cyclic string.
Is there any way to do that more efficiently while keeping the space complexity of a Trie?
Note: When I say cyclic strings of a string I mean that for example all the cyclic strings of 1011 are: 0111, 1110, 1101, 1011
Can you come up with a canonicalizing function for cyclic strings based on the following:
Find the largest run of zeroes.
Rotate the string so that that run of zeroes is at the front.
For each run of zeroes of equal size, see if rotating that to the front produces a lexicographically lesser string and if so use that.
This would canonicalize everything in the equivalence class (1011, 1101, 1110, 0111) to the lexicographically least value: 0111.
0101010101 is a thorny instance for which this algo will not perform well, but if your bits are roughly randomly distributed, it should work well in practice for long strings.
You can then hash based on the canonical form or use a trie that will include only the empty string and strings that start with 0 and a single trie run will answer your question.
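The canonicalizing steps can be sketched as follows. This is an illustration under the assumption that the input contains only the characters "0" and "1"; `canonical_rotation` is a hypothetical name, and rotation is done by slicing here rather than with an offset.

```python
def canonical_rotation(bits):
    """Lexicographically least rotation of a binary string.

    Only rotations that begin at the start of a longest cyclic run of
    zeroes are compared.
    """
    n = len(bits)
    if "0" not in bits or "1" not in bits:
        return bits  # every rotation is identical
    # A cyclic run of zeroes starts where a '0' follows a '1'.
    # bits[i - 1] wraps around to the last character when i == 0.
    starts = [i for i in range(n) if bits[i] == "0" and bits[i - 1] == "1"]

    def run_length(start):
        length = 0
        while bits[(start + length) % n] == "0":
            length += 1
        return length

    longest = max(run_length(i) for i in starts)
    candidates = [i for i in starts if run_length(i) == longest]
    return min(bits[i:] + bits[:i] for i in candidates)


stored = {canonical_rotation("1011")}
print(canonical_rotation("1101") in stored)  # True
print(canonical_rotation("1001") in stored)  # False
```

Storing only canonical forms means a lookup needs one canonicalization and one set or trie search, instead of |s| searches.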
EDIT:
If I have a string of length |s|, it can take a lot of time to find the lexicographically least value... how much time will it actually take?
That's why I said 010101.... is a value for which it performs badly. Let's say the string is of length n and the longest run of 0's is of length r. If the bits are randomly distributed, the expected length of the longest run is O(log n) according to "Distribution of longest run".
The time to find the longest run is O(n). You can implement shifting using an offset instead of a buffer copy, which should be O(1). The number of runs of that length is worst case O(n / r).
Then, the time to do step 3 should be:
Find other long runs: one O(n) pass with O(log n) storage average case, O(n) worst case
For each run: O(log n) average case, O(n) worst case
Shift and compare lexicographically: O(log n) average case since most comparisons of randomly chosen strings fail early, O(n) worst case.
This leads to a worst case of O(n²) but an average case of O(n + log² n) = O(n).
You have n strings s1..sn, and given a string t you want to know whether a cyclic permutation of t is a substring of any of s1..sn. And you want to store the strings as efficiently as possible. Did I understand your question correctly?
If so, here is a solution, but with a large run-time: for a given input t, let t' = concat(t,t), and check t' against every si in s1..sn to see if the longest common substring of t' and si has length at least |t|. If |si| = k and |t| = l, it runs in O(n*k*l) time. And you can store s1..sn in any data structure you want. Is that good enough for you?
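The doubling trick can be sketched in a few lines. This is a simplified illustration rather than the longest-common-substring computation itself: it tries each length-|t| window of t' directly, and both function names are hypothetical.

```python
def rotation_occurs_in(t, s):
    """True if some cyclic rotation of t occurs as a substring of s."""
    doubled = t + t
    # The first len(t) windows of t + t are exactly the rotations of t.
    return any(doubled[i:i + len(t)] in s for i in range(len(t)))


def is_rotation(a, b):
    """True if a is a cyclic rotation of b (equal lengths required)."""
    return len(a) == len(b) and a in b + b


print(is_rotation("1011", "0111"))            # True
print(rotation_occurs_in("011", "0011010"))   # True
print(rotation_occurs_in("111", "0011010"))   # False
```

If the stored strings and the query always have the same length, `is_rotation` alone answers the question with a single substring search per stored string.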

Removing repeated characters in string without using recursion

You are given a string. Develop a function to remove duplicate characters from that string. The string could be of any length. Your algorithm must be in place. If you wish, you can use constant-size extra space that does not depend in any way on the string size. Your algorithm must have a complexity of O(n).
My idea was to define an integer array of size 26, where index 0 corresponds to the letter a and index 25 to the letter z, and initialize all the elements to 0.
Thus we will traverse the entire string once and increment the value at the corresponding index whenever we encounter a letter.
Then we will traverse the string once again, and if the value at the corresponding index is 1 we print out the letter; otherwise we do not.
In this way the time complexity is O(n) and the space used is constant irrespective of the length of the string!!
If anyone can come up with ideas for better efficiency, it will be very helpful!!
Your solution definitely fits the criteria of O(n) time. Instead of an array, which would be very, very large if the allowed alphabet is large (Unicode has over a million code points), you could use a plain hash. Here is your algorithm in (unoptimized!) Ruby:
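def undup(s)
  # First pass: count how often each character occurs.
  seen = Hash.new(0)
  s.each_char { |c| seen[c] += 1 }
  # Second pass: keep only the characters that occur exactly once.
  result = ""
  s.each_char { |c| result << c if seen[c] == 1 }
  result
end
puts(undup "")
puts(undup "abc")
puts(undup "Olé")
puts(undup "asdasjhdfasjhdfasbfdasdfaghsfdahgsdfahgsdfhgt")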
It makes two passes through the string, and since each hash lookup takes constant time on average, the whole thing stays O(n), so you're good.
You can say the Hashtable (like your array) uses constant space, albeit large, because it is bounded above by the size of the alphabet. Even if the size of the alphabet is larger than that of the string, it still counts as constant space.
There are many variations to this problem, many of which are fun. To do it truly in place, you can sort first; this gives O(n log n). There are variations on merge sort where you ignore dups during the merge. In fact, this "no external hashtable" restriction appears in Algorithm: efficient way to remove duplicate integers from an array (also tagged interview question).
Another common interview question starts with a simple string, then they say, okay now a million character string, okay now a string with 100 billion characters, and so on. Things get very interesting when you start considering Big Data.
Anyway, your idea is pretty good. It can generally be tweaked as follows: Use a set, not a dictionary. Go through the string. For each character, if it is not in the set, add it. If it is, delete that occurrence from the string. Sets take up less space, don't need counters, and can be implemented as bitsets if the alphabet is small, and this algorithm does not need two passes.
Python implementation: http://code.activestate.com/recipes/52560-remove-duplicates-from-a-sequence/
You can also use a bitset instead of the additional array to keep track of found chars. Depending on which characters (a-z or more) are allowed you size the bitset accordingly. This requires less space than an integer array.
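A one-pass variant with a bitset might look like the sketch below. It assumes the input contains only the letters a to z, and it keeps the first occurrence of each character, which is a different result from the two-pass counting version that drops every repeated character entirely. `dedupe_lowercase` is a hypothetical name.

```python
def dedupe_lowercase(s):
    """Keep the first occurrence of each character, in one pass.

    Assumes the input contains only 'a'..'z', so the set of seen
    characters fits in the low 26 bits of a single integer.
    """
    chars = list(s)  # Python strings are immutable, so work on a list
    seen = 0         # bit i set <=> chr(ord('a') + i) was already seen
    write = 0        # next slot for a kept character
    for ch in chars:
        bit = 1 << (ord(ch) - ord("a"))
        if not seen & bit:
            seen |= bit
            chars[write] = ch  # compact kept characters to the front
            write += 1
    return "".join(chars[:write])


print(dedupe_lowercase("asdasjhd"))  # asdjh
```

The write index never overtakes the read position, so the compaction is safe to do in the same buffer that is being scanned.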

radix sort on binary strings with arbitrary length

I googled around and saw lots of discussion about radix sort on binary strings, but they all have the same length. How about binary strings of arbitrary length?
Say I have {"001", "10101", "011010", "10", "111"}, how do I do radix sort on them? Thanks!
Find the max length and pad them all to that length with leading zeros. Should still perform well provided there's some upper bound on the length of the longest string.
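A minimal sketch of the padding idea, assuming the strings should be ordered by numeric value (so the padding goes on the left) and contain only "0" and "1". The function name is illustrative.

```python
def radix_sort_binary(strings):
    """LSD radix sort of binary strings by numeric value."""
    if not strings:
        return []
    width = max(len(s) for s in strings)
    # Pair each padded key with its original so the output is unpadded.
    items = [(s.rjust(width, "0"), s) for s in strings]
    # Process digits from least significant (rightmost) to most.
    for pos in range(width - 1, -1, -1):
        zeros = [item for item in items if item[0][pos] == "0"]
        ones = [item for item in items if item[0][pos] == "1"]
        items = zeros + ones  # stable: earlier order kept within buckets
    return [original for _, original in items]


print(radix_sort_binary(["001", "10101", "011010", "10", "111"]))
# ['001', '10', '111', '10101', '011010']
```

Each pass is a stable two-bucket split, so after the last pass the strings are ordered by all digits, with the total cost proportional to the number of strings times the maximum length.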
You could pad them all to be the same length, but there's no real reason to run a sorting algorithm to determine that a length 5 number in binary is larger than a length 2 one (ignoring leading zeros). You would likely get better performance by grouping the numbers by length and running your radix sort within each group. Of course, that's dependent upon how you group them and then on how you sort your groups.
An example of how you might do this would be to run through all the items once and throw them all into a hash table (length --> numbers of that length). This takes linear time, and then let's say n log n time to access them in order. A radix sort runs in O(nk) time where n is the number of items and k is their average length. If k is large, the extra O(n log n) for grouping is small next to the O(nk) cost of the radix sort itself.
If creating a ton of new string instances leaves a nasty taste, write the comparison yourself.
Compare what the lengths of the strings would be without the leading 0's (i.e. find the index of the first "1" in each); the longer string is larger.
If both are the same length, just continue comparing them, character by character, until you find two characters that differ; the string with the "1" is the larger.
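The hand-written comparison can be sketched as below. It assumes inputs made only of "0" and "1" and treats a string with no "1" as the value zero; `compare_binary` is a hypothetical name, and `functools.cmp_to_key` is only used to plug it into Python's built-in sort.

```python
from functools import cmp_to_key


def compare_binary(a, b):
    """Compare two binary strings by value without padding or copying."""
    a_start = a.find("1")  # -1 when the string has no "1"
    b_start = b.find("1")
    # Number of significant digits, i.e. length without leading zeros.
    a_len = len(a) - a_start if a_start != -1 else 0
    b_len = len(b) - b_start if b_start != -1 else 0
    if a_len != b_len:
        return -1 if a_len < b_len else 1
    # Same significant length: the first differing digit decides.
    for offset in range(a_len):
        ca, cb = a[a_start + offset], b[b_start + offset]
        if ca != cb:
            return -1 if ca == "0" else 1
    return 0


data = ["001", "10101", "011010", "10", "111"]
print(sorted(data, key=cmp_to_key(compare_binary)))
# ['001', '10', '111', '10101', '011010']
```

This is a comparison sort rather than a radix sort, so it costs O(n log n) comparisons, but it allocates no padded copies.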
