Removing repeated characters in string without using recursion - string

You are given a string. Develop a function to remove duplicate characters from that string. String could be of any length. Your algorithm must be in space. If you wish you can use constant size extra space which is not dependent any how on string size. Your algorithm must be of complexity of O(n).
My idea was to define an integer array of size of 26 where 0th index would correspond to the letter a and the 25th index for the letter z and initialize all the elements to 0.
Thus we will travel the entire string once and and would increment the value at the desired index as and when we encounter a letter.
and then we will travel the string once again and if the value at the desired index is 1 we print out the letter otherwise we do not.
In this way the time complexity is O(n) and the space used is constant irrespective of the length of the string!!
if anyone can come up with ideas of better efficiency,it will be very helpful!!

Your solution definitely fits the criteria of O(n) time. Instead of an array, which would be very, very large if the allowed alphabet is large (Unicode has over a million characters), you could use a plain hash. Here is your algorithm in (unoptimized!) Ruby:
def undup(s)
seen = Hash.new(0)
s.each_char {|c| seen[c] += 1}
result = ""
s.each_char {|c| result << c if seen[c] == 1}
result
end
puts(undup "")
puts(undup "abc")
puts(undup "Olé")
puts(undup "asdasjhdfasjhdfasbfdasdfaghsfdahgsdfahgsdfhgt")
It makes two passes through the string, and since hash lookup is less than linear, you're good.
You can say the Hashtable (like your array) uses constant space, albeit large, because it is bounded above by the size of the alphabet. Even if the size of the alphabet is larger than that of the string, it still counts as constant space.
There are many variations to this problem, many of which are fun. To do it truly in place, you can sort first; this gives O(n log n). There are variations on merge sort where you ignore dups during the merge. In fact, this "no external hashtable" restriction appears in Algorithm: efficient way to remove duplicate integers from an array (also tagged interview question).
Another common interview question starts with a simple string, then they say, okay now a million character string, okay now a string with 100 billion characters, and so on. Things get very interesting when you start considering Big Data.
Anyway, your idea is pretty good. It can generally be tweaked as follows: Use a set, not a dictionary. Go trough the string. For each character, if it is not in the set, add it. If it is, delete it. Sets take up less space, don't need counters, and can be implemented as bitsets if the alphabet is small, and this algorithm does not need two passes.
Python implementation: http://code.activestate.com/recipes/52560-remove-duplicates-from-a-sequence/

You can also use a bitset instead of the additional array to keep track of found chars. Depending on which characters (a-z or more) are allowed you size the bitset accordingly. This requires less space than an integer array.

Related

Time complexity of String.contains()

What is the time complexity of String.contains();
lets say n is the length of the string that is compared against another string of length k.
There is no answer without knowing the actual implementation of the String.contains() that you're interested in; or what algorithm you intend to use.
A completely naive implementation might take (n+1-k)*kcomparisons to decide that a given string of length n does not contain a particular substring of length k. That's O(nk) for the worst case.
Even stopping substring comparisons after the first unequal comparison, while having a smaller coefficient, still is O(nk). Construct a string that's a repetition of many isolated letters, each separated by exactly k-1 spaces, and search that for an occurrence of k consecutive spaces. The search will fail, but each substring comparison will take an amortized k/2 compares to find that out, and you're still at O(nk).
If k is known to be much less than n, you could treat that as O(n).
The average case depends on the actual algorithm used, and also on the distribution of characters in the two strings; and you haven't said what either of those were.

Average Case Big O and the Impact of Sorting

I'm looking at the time complexity for implementations of a method which determines if a String contains all unique characters.
The basic, brute force, approach would be to iterate through the String one character at a time maintaining a HashSet of seen characters. For each character in the iteration we check if the Set already contains it, and if so return false. We return true if the entire String has been searched. This would be O(n) as a worst case complexity. What would be the average case? O(n/2)?
If we try to optimise this by sorting the String into a char array, would it be more or less efficient? Sorting typically takes O(n log n) which is worse than O(n), but a sorted String allows for duplicate characters to be detected much earlier (especially for long strings).
Do we say the worst case is O(n^2 log n) but the average case is better? If so, what is it?
In the un-sorted case, the average case depends entirely on the string! Without knowing/assuming any distribution, it's hard to make any assumption.
A simple case, for a string with randomly-placed characters, where one of the characters repeats once:
the number of possibilities for the repeated characters being arranged is n*(n-1)/2
the probability it is detected repeated in exactly k steps is (k-1)/(n-1)
the probability it is detected in at most k steps is (k*(k-1))/(n*(n-1)), meaning that on average you will detect it (for large n) in about 0.7071*n... [incomplete]
For multiple characters that occur with different frequencies, or you make different assumptions on how characters are distributed in the string, you'll get different probabilities.
Hopefully someone can extend on my answer! :)
If the string is sorted, then you don't need the HashSet.
However, the average case still depends on the distribution of characters in the string: if you get two aa in the beggining, it's pretty efficient; if you get two zz, then you didn't win anything.
The worst case is sorting plus detecting-duplicates, so O(n log n + n), or just O(n log n).
So, it appears it's not advantageous to sort the string beforehand, due to the increased complexity, both in average-case and worst-case.

Algorithm to un-concatenate words from string without spaces and punctuation

I've been given a problem in my data structures class to find the solution to this problem. It's similar to an interview question. If someone could explain the thinking process or solution to the problem. Pseudocode can be used. So far i've been thinking to use tries to hold the dictionary and look up words that way for efficiency.
This is the problem:
Oh, no! You have just completed a lengthy document when you have an unfortunate Find/Replace mishap. You have accidentally removed all spaces, punctuation, and capitalization in the document. A sentence like "I reset the computer. It still didn't boot!" would become "iresetthecomputeritstilldidntboot". You figure that you can add back in the punctation and capitalization later, once you get the individual words properly separated. Most of the words will be in a dictionary, but some strings, like proper names, will not.
Given a dictionary (a list of words), design an algorithm to find the optimal way of "unconcatenating" a sequence of words. In this case, "optimal" is defined to be the parsing which minimizes the number of unrecognized sequences of characters.
For example, the string "jesslookedjustliketimherbrother" would be optimally parsed as "JESS looked just like TIM her brother". This parsing has seven unrecognized characters, which we have capitalized for clarity.
For each index, n, into the string, compute the cost C(n) of the optimal solution (ie: the number of unrecognised characters in the optimal parsing) starting at that index.
Then, the solution to your problem is C(0).
There's a recurrence relation for C. At each n, either you match a word of i characters, or you skip over character n, incurring a cost of 1, and then parse the rest optimally. You just need to find which of those choices incurs the lowest cost.
Let N be the length of the string, and let W(n) be a set containing the lengths of all words starting at index n in your string. Then:
C(N) = 0
C(n) = min({C(n+1) + 1} union {C(n+i) for i in W(n)})
This can be implemented using dynamic programming by constructing a table of C(n) starting from the end backwards.
If the length of the longest word in your dictionary is L, then the algorithm runs in O(NL) time in the worst case and can be implemented to use O(L) memory if you're careful.
You could use rolling hashes of different lengths to speed up the search.
You can try a partial pattern matcher for example aho-corasick algorithm. Basically it's a special space optimized version of a suffix tree.

Reversing string in ocaml

I have this function for reversing strings in ocaml however it says that I have my types wrong. I am unsure as to why or what I can do :(
Any tips on debugging would also be greatly appreciated!
28 let reverse s =
29 let rec helper i =
30 if i >= String.length s then "" else (helper (i+1))^(s.[i])
31 in
32 helper 0
Error: This expression has type char but an expression was expected of type
string
Thank you
Your implementation does not have the expected (linear) time and space complexity: it is quadratic in both time and space, so it is hardly a correct implementation of the requested feature.
String concatenation sa^sb allocates a new string of size length sa + length sb, and fills it with the two strings; this means that both its time and space complexity are linear in the sum of the lengths. When you iterate this operation once per character, you get an algorithm of quadratic complexity (the total size of memory allocated, and total number of copies, will be 1+2+3+....+n).
To correctly implement this algorithm, you could either:
allocate a string of the expected size, and mutate it in place with the content of the input string, reversed
create a string list made of reversed size-one strings, then use String.concat to concatenate all of them at once (which allocates the result and copies the strings only once)
use the Buffer module which is meant to accumulate characters or strings iteratively without exhibiting a quadratic behavior (it uses a dynamic resizing policy that makes addition of a char amortized constant time)
The first approach is both the simplest and the fastest, but the other two will get more interesting in more complex application where you want to concatenate strings, but it's less straightforward to know in one step what the final result will be.
The error message is pretty clear, I think. The expression s.[i] represents a character (the ith character of the string). But the ^ operator requires strings as its arguments.
To get past the problem you can use String.make 1 s.[i]. This expression gives a 1-character string containing the single character s.[i].
Handling strings recursively in OCaml isn't as nice as it could be, because there's no nice way to destructure a string (break it into parts). The equivalent code to reverse a list looks a lot prettier. For what it's worth :-)
You can also use 3rd party libraries to do so. http://batteries.forge.ocamlcore.org/ already implements a function for reversing strings

radix sort on binary strings with arbitrary length

i googled around and see lots of discussion about radix sort on binary string, but they are all with same lenght, how aobut binary string with arbitrary lenght?
say i have {"001", "10101", "011010", "10", "111"}, how do i do radix sort on them ? Thanks!
Find the max length and pad them all to that length. Should still perform well provided there's some upper bound on the length of the longest string.
You could pad them all to be the same length, but there's no real reason to run a sorting algorithm to determine that a length 5 number in binary is larger than a length 2 one. You would likely get better performance by grouping the numbers by length and running your radix sort within each group. Of course, that's dependent upon how you group them and then on how you sort your groups.
An example of how you might do this would be to run through all the items once and throw them all into a hash table (length --> numbers of that length). This takes linear time, and then let's say nlogn time to access them in order. A radix sort runs in O(nk) time where n is the number of items and k is their average length. If you've got a large k, then the difference between O(nk) and O(nlogn) would be acceptable.
If creating a ton of new string instances leaves a nasty taste, write the comparison yourself.
Compare what the lengths of the strings would be without the leading 0's (ie. find the firstIndexOf("1")); the longer string is larger.
If both are the same length, just continue comparing them, character-by-character, until you find two characters that differ - the string with the "1" is the larger.

Resources