Reversing a string in OCaml

I have this function for reversing strings in OCaml, but the compiler says that I have my types wrong. I am unsure why, or what I can do about it :(
Any tips on debugging would also be greatly appreciated!
let reverse s =
  let rec helper i =
    if i >= String.length s then "" else (helper (i+1))^(s.[i])
  in
  helper 0
Error: This expression has type char but an expression was expected of type
string
Thank you

Your implementation does not have the expected (linear) time and space complexity: it is quadratic in both time and space, so it is hardly a correct implementation of the requested feature.
String concatenation sa^sb allocates a new string of size length sa + length sb, and fills it with the two strings; this means that both its time and space complexity are linear in the sum of the lengths. When you iterate this operation once per character, you get an algorithm of quadratic complexity (the total size of memory allocated, and the total number of characters copied, will be 1 + 2 + 3 + ... + n = n(n+1)/2).
To correctly implement this algorithm, you could either:
allocate a string of the expected size, and mutate it in place with the content of the input string, reversed
create a string list made of reversed size-one strings, then use String.concat to concatenate all of them at once (which allocates the result and copies the strings only once)
use the Buffer module which is meant to accumulate characters or strings iteratively without exhibiting a quadratic behavior (it uses a dynamic resizing policy that makes addition of a char amortized constant time)
The first approach is both the simplest and the fastest, but the other two become more interesting in more complex applications where you want to concatenate strings but it is less straightforward to know up front what the final result will be.
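The three strategies listed above can be sketched roughly as follows. This is written in Python rather than OCaml, purely as an illustration: the function names are made up for this example, a preallocated list stands in for a mutable string, and io.StringIO plays the role of OCaml's Buffer module.

```python
import io


def reverse_prealloc(s):
    """Approach 1: allocate the result at its final size and fill it in place."""
    n = len(s)
    out = [""] * n                  # stand-in for a mutable string of length n
    for i, ch in enumerate(s):
        out[n - 1 - i] = ch         # character i lands at mirrored position
    return "".join(out)


def reverse_concat_once(s):
    """Approach 2: build a list of one-character strings, concatenate once."""
    pieces = [s[i] for i in range(len(s) - 1, -1, -1)]
    return "".join(pieces)          # single allocation and copy of the result


def reverse_buffer(s):
    """Approach 3: accumulate into a growable buffer."""
    buf = io.StringIO()
    for i in range(len(s) - 1, -1, -1):
        buf.write(s[i])
    return buf.getvalue()
```

Each version does a constant amount of work per character, instead of copying the whole accumulated string at every step.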

The error message is pretty clear, I think. The expression s.[i] represents a character (the ith character of the string). But the ^ operator requires strings as its arguments.
To get past the problem you can use String.make 1 s.[i]. This expression gives a 1-character string containing the single character s.[i].
Handling strings recursively in OCaml isn't as nice as it could be, because there's no nice way to destructure a string (break it into parts). The equivalent code to reverse a list looks a lot prettier. For what it's worth :-)

You can also use 3rd party libraries to do so. http://batteries.forge.ocamlcore.org/ already implements a function for reversing strings

Related

Why is Julia giving me StringIndex error?

I'm getting a StringIndex error for one particular string out of 10,000 which I am processing. I don't really know what the issue is with this string. I think it is probably a special character issue.
If I println the string then assign it to txt then pass txt to the function, I don't get an error. I am a little baffled.
I am sorry, I can't post the string as it is protected content and even if I did copying and pasting the string somehow removes the source of error. Any suggestions?
Just to expand. The details of how String is represented in Julia are explained in the Julia manual.
You can use eachindex to get an iterator of valid indices into a String. The reason why it is an iterator is that you cannot efficiently (i.e. in O(1) time) find the index of the i-th character in the string. However, you can use the isascii function on a String to check if it consists only of ASCII characters (in which case byte and character indices are the same).
Also, if you need to reach a specific character in a string, you usually need more than one character, in which case the first, last and chop functions are useful (indeed, last(first(s, n)) gives you the character at position n, although it is not the most efficient way; iterating eachindex will allocate less).
In Julia Strings are indexed by bytes rather than characters. You should use for c in str rather than trying to index manually.
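To see why only one string in many might fail, here is a small analogy in Python (not Julia). It works on the UTF-8 bytes explicitly, since Python's own str is indexed by code point. The helper names are hypothetical; char_start_offsets is a rough, 0-based counterpart of what eachindex reports.

```python
def char_start_offsets(text):
    """Byte offsets (0-based) at which a character starts in the UTF-8 encoding."""
    data = text.encode("utf-8")
    # Continuation bytes look like 0b10xxxxxx; every other byte starts a character.
    return [i for i, byte in enumerate(data) if byte & 0xC0 != 0x80]


def slice_bytes(text, start, stop):
    """Slice by byte offsets; fails if a cut lands inside a multi-byte character."""
    return text.encode("utf-8")[start:stop].decode("utf-8")
```

For a pure ASCII string every offset is valid, so byte indexing happens to work. With "na\u00efve" the offsets are 0, 1, 2, 4, 5: offset 3 sits in the middle of a two-byte character, and slice_bytes("na\u00efve", 0, 3) raises a UnicodeDecodeError.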

How to find the period of a string

I take an input from the user, and it's a string with a certain substring which repeats itself all through the string. I need to output the substring or its length, a.k.a. the period.
Say
S1 = AAAA // substring is A
S2 = ABAB // Substring is AB
S3 = ABCAB // Substring is ABC
S4 = EFIEFI // Substring is EFI
I could start with a single char and check whether it is the same as the next character; if it is not, I could try two characters, then three, and so on. This would be an O(N^2) algorithm. I was wondering if there is a more elegant solution to this.
You can do this in linear time and constant additional space by inductively computing the period of each prefix of the string. I can't recall the details (there are several things to get right), but you can find them in Section 13.6 of "Text algorithms" by Crochemore and Rytter under function Per(x).
Let me assume that the length n of the string is at least twice the period p.
Algorithm
1. Let m = 1, and S the whole string
2. Take m = m*2
3. Find the next occurrence of the substring S[:m]
4. Let k be the start of the next occurrence
5. Check if S[:k] is the period
6. If not, go to 2.
Example
Suppose we have a string
CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC
For each power m of 2 we find repetitions of the first 2^m characters. Then we extend this sequence to its second occurrence. Let's start with 2^1, so CD.
CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC
CDCD   CDCD   CDCD   CDCD   CD
We don't extend CD since the next occurrence is just after that. However CD is not the substring we are looking for so let's take the next power: 2^2 = 4 and substring CDCD.
CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC
CDCD   CDCD
Now let's extend our string to the first repetition. We get
CDCDFBF
we check if this is periodic. It is not so we go further. We try 2^3 = 8, so CDCDFBFC
CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC
CDCDFBFC      CDCDFBFC
we try to extend and we get
CDCDFBFCDCDFDF
and this indeed is our period.
I expect this to work in O(n log n) with some KMP-like algorithm for checking where a given string appears. Note that some edge cases still should be worked out here.
Intuitively this should work, but my intuition failed once on this problem already so please correct me if I'm wrong. I will try to figure out a proof.
A very nice problem though.
You can build a suffix tree for the entire string in linear time (suffix tree is easy to look up online), and then recursively compute and store the number of suffix tree leaves (occurrences of the suffix prefix) N(v) below each internal node v of the suffix tree. Also recursively compute and store the length of each suffix prefix L(v) at each node of the tree. Then, at an internal node v in the tree, the suffix prefix encoded at v is a repeating subsequence that generates your string if N(v) equals the total length of the string divided by L(v).
We can actually optimise the time complexity by creating a Z array. We can create the Z array in O(n) time and O(n) space. Now, let's say we have the string
S1 = abababab
For this the Z array would look like
z[]={8,0,6,0,4,0,2,0};
In order to calculate the period we can iterate over the Z array and
look for the first i >= 1 where i+z[i]=S1.length. That i is the period (here i = 2).
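A minimal Python sketch of this Z-array approach follows. The function names z_array and period are chosen for this example, and z[0] is set to the full length to match the array shown above. If no index satisfies the condition, the string has no shorter period, so its own length is returned.

```python
def z_array(s):
    """z[i] = length of the longest common prefix of s and s[i:]; z[0] = len(s)."""
    n = len(s)
    z = [0] * n
    if n == 0:
        return z
    z[0] = n
    left = right = 0                # s[left:right] is the rightmost prefix match
    for i in range(1, n):
        if i < right:               # reuse information from the current match
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1               # extend the match character by character
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def period(s):
    """Smallest i >= 1 with i + z[i] == len(s), or len(s) if there is none."""
    n = len(s)
    z = z_array(s)
    for i in range(1, n):
        if i + z[i] == n:
            return i
    return n
```

With the examples from the question, period("ABAB") is 2 and period("ABCAB") is 3, so a trailing partial repetition is handled as well.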
Well, if every character in the input string is part of the repeating substring, then all you have to do is store the first character and compare it with the rest of the string's characters one by one. If you find a match, the string up to the matched position is your repeating string.
I too have been looking for the time-space-optimal solution to this problem. The accepted answer by tmyklebu essentially seems to be it, but I would like to offer some explanation of what it's actually about and some further findings.
First, this question by me proposes a seemingly promising but incorrect solution, with notes on why it's incorrect: Is this algorithm correct for finding period of a string?
In general, the problem "find the period" is equivalent to "find the pattern within itself" (in some sense, "strstr(x+1,x)"), but with no constraints matching past its end. This means that you can find the period by taking any left-to-right string matching algorithm, and applying it to itself, considering a partial match that hits the end of the haystack/text as a match, and the time and space requirements are the same as those of whatever string matching algorithm you use.
The approach cited in tmyklebu's answer is essentially applying this principle to String Matching on Ordered Alphabets, also explained here. Another time-space-optimal solution should be possible using the GS algorithm.
The fairly well-known and simple Two Way algorithm (also explained here) unfortunately is not a solution because it's not left-to-right. In particular, the advancement after a mismatch in the left factor depends on the right factor having been a match, and the impossibility of another match misaligned with the right factor modulo the right factor's period. When searching for the pattern within itself and disregarding anything past the end, we can't conclude anything about how soon the next right-factor match could occur (part or all of the right factor may have shifted past the end of the pattern), and therefore a shift that preserves linear time cannot be made.
Of course, if working space is available, a number of other algorithms may be used. KMP is linear-time with O(n) space, and it may be possible to adapt it to something still reasonably efficient with only logarithmic space.
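For the KMP option, a short Python sketch using O(n) extra space follows. It relies on the standard fact that the smallest period equals the string length minus the length of its longest proper border, which is the last entry of the KMP failure function. The names prefix_function and period_kmp are made up for this example.

```python
def prefix_function(s):
    """KMP failure function: pi[i] = longest proper border length of s[:i+1]."""
    pi = [0] * len(s)
    k = 0                           # length of the border currently being extended
    for i in range(1, len(s)):
        while k > 0 and s[i] != s[k]:
            k = pi[k - 1]           # fall back to the next shorter border
        if s[i] == s[k]:
            k += 1
        pi[i] = k
    return pi


def period_kmp(s):
    """Smallest p such that s[i] == s[i + p] wherever both indices are valid."""
    if not s:
        return 0
    return len(s) - prefix_function(s)[-1]
```

On the longer example string from the earlier answer this returns 14, i.e. CDCDFBFCDCDFDF.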

String datastructure supporting append, prepend and search operations

I need to build a text editor as my mini project, and I need to design a data structure or algorithm that supports following operation:
Append : Append a character at the end of the String.
Prepend : Prepend a character at the beginning of the string.
Search : Given a search string s, find all the occurrences of the string.
Each operation should take O(log n) time or less. Search-and-replace operations would be appreciated but are not necessary. The maximum length of the string is constant. Any ideas how to achieve this?
Thanks!
A common data structure for this kind of application is a Rope, where Append and Prepend are O(1), although that depends a bit on whether the tree is balanced. However, as noted by Толя, Search would be linear.
There are certainly data structures that can make the search faster, such as a Suffix Tree, but they are probably not appropriate for a text editor application.
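To make the trade-off concrete, here is a very small Python sketch. It is not a rope: a collections.deque is swapped in as a simpler stand-in that also gives constant-time append and prepend, while search still has to scan the text. The class name EditorText and its methods are hypothetical.

```python
from collections import deque


class EditorText:
    """Deque-backed text buffer: O(1) append/prepend, search scans the text."""

    def __init__(self):
        self._chars = deque()

    def append(self, ch):
        self._chars.append(ch)          # constant time at the right end

    def prepend(self, ch):
        self._chars.appendleft(ch)      # constant time at the left end

    def find_all(self, pattern):
        """Start offsets of all (possibly overlapping) occurrences of pattern."""
        if not pattern:
            return []
        text = "".join(self._chars)     # materialising the text is already O(n)
        hits = []
        start = text.find(pattern)
        while start != -1:
            hits.append(start)
            start = text.find(pattern, start + 1)
        return hits
```

The search cost grows with the text length, which is the limitation noted above; only an index over the text (such as the suffix tree mentioned here) would avoid it.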
I would propose you adapt a Trie. On an append operation add all the suffixes of the string ending at the new character with length up to the maximum length of the string in the datastructure. On prepend add all the prefixes of the string starting at the new char with length up to the fixed length of the string. Asymptotically both operations are constant - they take O(k^2) where k is the fixed length of the string. For each node in the structure keep track of all the strings ending at that node(possibly a list).
A search operation will again be constant: iterate over the string and output all indexes stored in the ending node(if you have not "dropped out the tree").
A drawback of my approach is the memory overhead(at most times the maximum length of a word), but if the maximum string length allowed is reasonable and you only insert real words(from English dictionary for instance), this should not be a big problem.

Removing repeated characters in string without using recursion

You are given a string. Develop a function to remove duplicate characters from that string. The string could be of any length. Your algorithm must work in place. If you wish, you can use constant-size extra space that does not depend in any way on the string size. Your algorithm must have O(n) time complexity.
My idea was to define an integer array of size of 26 where 0th index would correspond to the letter a and the 25th index for the letter z and initialize all the elements to 0.
Thus we will traverse the entire string once and increment the value at the corresponding index whenever we encounter a letter.
Then we will traverse the string once again, and if the value at the corresponding index is 1 we print out the letter; otherwise we do not.
In this way the time complexity is O(n) and the space used is constant irrespective of the length of the string!!
if anyone can come up with ideas of better efficiency,it will be very helpful!!
Your solution definitely fits the criteria of O(n) time. Instead of an array, which would be very, very large if the allowed alphabet is large (Unicode has over a million code points), you could use a plain hash. Here is your algorithm in (unoptimized!) Ruby:
def undup(s)
  # First pass: count how often each character occurs.
  seen = Hash.new(0)
  s.each_char {|c| seen[c] += 1}
  # Second pass: keep only the characters that occur exactly once.
  result = ""
  s.each_char {|c| result << c if seen[c] == 1}
  result
end
puts(undup "")
puts(undup "abc")
puts(undup "Olé")
puts(undup "asdasjhdfasjhdfasbfdasdfaghsfdahgsdfahgsdfhgt")
It makes two passes through the string, and since hash lookup takes expected constant time, the whole thing is O(n), so you're good.
You can say the Hashtable (like your array) uses constant space, albeit large, because it is bounded above by the size of the alphabet. Even if the size of the alphabet is larger than that of the string, it still counts as constant space.
There are many variations to this problem, many of which are fun. To do it truly in place, you can sort first; this gives O(n log n). There are variations on merge sort where you ignore dups during the merge. In fact, this "no external hashtable" restriction appears in Algorithm: efficient way to remove duplicate integers from an array (also tagged interview question).
Another common interview question starts with a simple string, then they say, okay now a million character string, okay now a string with 100 billion characters, and so on. Things get very interesting when you start considering Big Data.
Anyway, your idea is pretty good. It can generally be tweaked as follows: Use a set, not a dictionary. Go through the string. For each character, if it is not in the set, add it. If it is, delete it. Sets take up less space, don't need counters, and can be implemented as bitsets if the alphabet is small, and this algorithm does not need two passes. Note that this toggle only matches the counting version when no character occurs more than twice: a third occurrence puts the character back in the set.
Python implementation: http://code.activestate.com/recipes/52560-remove-duplicates-from-a-sequence/
You can also use a bitset instead of the additional array to keep track of found chars. Depending on which characters (a-z or more) are allowed you size the bitset accordingly. This requires less space than an integer array.
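A rough Python sketch of the bitset idea follows, keeping the asker's semantics (only characters that occur exactly once are emitted). It assumes the input contains only lowercase a-z, so two 26-bit integers are enough. The function name undup_bitset is made up for this example.

```python
def undup_bitset(s):
    """Keep characters that occur exactly once; assumes only lowercase a-z."""
    seen = 0        # bit k set: letter chr(ord("a") + k) has occurred
    repeated = 0    # bit k set: that letter has occurred more than once
    for ch in s:
        bit = 1 << (ord(ch) - ord("a"))
        if seen & bit:
            repeated |= bit
        seen |= bit
    # Second pass: emit the characters whose "repeated" bit is still clear.
    return "".join(
        ch for ch in s if not repeated & (1 << (ord(ch) - ord("a")))
    )
```

The two integers replace the 26 counters, since the algorithm only needs to distinguish "never", "once" and "more than once". A larger alphabet would need a correspondingly larger bitset.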

String recurring subsequences and compression

I'd like to do some kind of "search and replace" algorithm which will, in an efficient manner if possible, identify a substring of a string which occurs more than once and replace all occurrences of that substring with a token.
For example, given a string "AbcAdAefgAbijkAblmnAbAb", notice that "A" recurs, so reduce in pass one to "#1bc#1d#1efg#1bijk#1blmn#1b#1b" where #_ is an indexed pattern (we note the patterns in an indexed table), then notice that "#1b" recurs so reduce to "#2c#1d#1efg#2ijk#2lmn#2#2". No more patterns occur in the string so we're done.
I have found some information on "longest common subsequences" and compression algorithms, but nothing that seems to do this. They are either for comparing two strings or for getting some kind of storage-optimal result.
My objective, on the other hand, is to reduce the genome to its "words" instead of "letters", i.e., instead of gatcatcgatc I want to see 2c1c2c. I could do some regex afterwards to find things like "#42*#42"; it would be cool to see recurring brackets in DNA.
If I could just find that online I would skip doing it myself but I can't see this question answered before in terms I could uncover. To anyone who can point me in the right direction many thanks.
The byte pair encoding does something pretty close to what you want.
Rather than searching directly for the longest repeated string (top-down),
each pass of byte pair encoding searches for repeated byte pairs (bottom-up).
But eventually it discovers the longest repeated string(*).
                  gatcatcgatc
1=at              g1c1cg1c
2=atc             g22g2
3=gatc 2=atc      323
As you can see, it has found the longest repeated string "gatc".
(*) byte pair encoding either eventually finds the longest repeated string,
or else it stops early after making (2^8 - uniquechars(source) ) substitutions.
I suspect it may be possible to tweak byte pair encoding so that the early-stop condition is relaxed a little -- perhaps (2^9 - uniquechars(source) ) or 2^12 or 2^16.
Even if that hurts compression performance, perhaps it will give interesting results for applications like yours.
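The passes shown above can be reproduced with a short Python sketch. It is a simplified, character-level variant, not a reference implementation: tokens are the strings "1", "2", ... (so the input is assumed to contain no digits), ties between equally frequent pairs go to the pair seen first, and there is no cap on the number of substitutions. The names byte_pair_encode and expand are made up for this example.

```python
def byte_pair_encode(text):
    """Repeatedly replace the most frequent adjacent pair with a fresh token."""
    tokens = list(text)
    table = {}                                  # token -> the pair it stands for
    while True:
        counts = {}
        for pair in zip(tokens, tokens[1:]):
            counts[pair] = counts.get(pair, 0) + 1
        if not counts:
            break
        best = max(counts, key=counts.get)      # on ties the first pair seen wins
        if counts[best] < 2:
            break                               # no pair repeats any more
        new_token = str(len(table) + 1)
        table[new_token] = best
        merged, i = [], 0
        while i < len(tokens):                  # left-to-right, non-overlapping
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(new_token)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, table


def expand(token, table):
    """Recursively expand a token back into the original characters."""
    if token not in table:
        return token
    return "".join(expand(part, table) for part in table[token])
```

For gatcatcgatc this yields the tokens 3, 2, 3, with 1 = at, 2 = 1c (atc) and 3 = g2 (gatc), matching the table above.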
Wikipedia: byte pair encoding
Stack Overflow: optimizing byte-pair encoding

Resources