Split string on multiple delimiters of any length in Haskell

I am attempting a Haskell coding challenge where, given a certain string with a prefix indicating which substrings are delimiting markers, a list needs to be built from the input.
I have already solved the problem for multiple single-length delimiters, but I am stuck with the problem where the delimiters can be any length. I use splitOneOf from Data.List.Split, but this works for character (length 1) delimiters only.
For example, given
input ";,\n1;2,3,4;10",
delimiters are ';' and ','
splitting the input on the above delivers
output [1,2,3,4,10]
The problem I'm facing has two parts:
Firstly, a single delimiter of any length, e.g.
"****\n1****2****3****4****10" should result in the list [1,2,3,4,10].
Secondly, more than one delimiter can be specified, e.g.
input "[***][||]\n1***2||3||4***10",
delimiters are "***" and "||"
splitting the input on the above delivers
output [1,2,3,4,10]
My code for retrieving the delimiter in the case of character delimiters:
--This gives the delimiters as a list of characters, i.e. a String.
getDelimiter::String->[Char]
getDelimiter text = head . splitOn "\n" $ text
--drop "[delimiters]\n" from the input
body::String->String
body text = drop ((length . getDelimiter $ text) + 1) text
--returns tuple with fst being the delimiters, snd the body of the input
doc::String->(String,String)
doc text = (getDelimiter text, body text)
--given the delimiters and the body of the input, return a list of strings
numbers::(String,String)->[String]
numbers (delim, rest) = splitOneOf delim rest
--input ",##\n1,2#3#4" gives output ["1","2","3","4"]
getList::String->[String]
getList text = numbers . doc $ text
So my question is, how do I do the processing for when the delimiters are e.g. "***" and "||"?
Any hints are welcome, especially in a functional programming context.

If you don't mind making multiple passes over the input string, you can use splitOn from Data.List.Split, and gradually split the input string using one delimiter at a time.
You can write this fairly succinctly using foldl':
import Data.List
import Data.List.Split
splitOnAnyOf :: Eq a => [[a]] -> [a] -> [[a]]
splitOnAnyOf ds xs = foldl' (\ys d -> ys >>= splitOn d) [xs] ds
Here, the accumulator for the fold operation is a list of strings, or more generally [[a]], so you have to 'lift' xs into a list, using [xs].
Then you fold over the delimiters ds - not the input string to be parsed. For each delimiter d, you split the accumulated list of strings with splitOn, and concatenate them. You could also have used concatMap, but here I arbitrarily chose to use the more general >>= (bind) operator.
This seems to do what is required in the OP:
*Q49228467> splitOnAnyOf [";", ","] "1;2,3,4;10"
["1","2","3","4","10"]
*Q49228467> splitOnAnyOf ["***", "||"] "1***2||3||4***10"
["1","2","3","4","10"]
Since this makes multiple passes over temporary lists, it's most likely not the fastest implementation you can make, but if you don't have too many delimiters, or extremely long lists, this may be good enough.
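To connect this back to the input format in the question, here is a minimal end-to-end sketch for the bracketed header form ("[***][||]\n..."). It uses only base: splitOn' is a hand-written stand-in for Data.List.Split.splitOn, and bracketed and parseNumbers are hypothetical helper names, not part of any library. It assumes delimiters never contain ']' and that every field parses as an Int.

```haskell
import Data.List (foldl', isPrefixOf)

-- Stand-in for Data.List.Split.splitOn, written against base only.
-- An empty delimiter is treated as "never matches".
splitOn' :: Eq a => [a] -> [a] -> [[a]]
splitOn' d = go []
  where
    go acc [] = [reverse acc]
    go acc s@(x:xs)
      | not (null d) && d `isPrefixOf` s = reverse acc : go [] (drop (length d) s)
      | otherwise                        = go (x : acc) xs

-- Same fold as above, but over the stand-in splitter.
splitOnAnyOf :: Eq a => [[a]] -> [a] -> [[a]]
splitOnAnyOf ds xs = foldl' (\ys d -> ys >>= splitOn' d) [xs] ds

-- Extract the delimiters from a header such as "[***][||]".
bracketed :: String -> [String]
bracketed ('[':rest) =
  let (d, rest') = break (== ']') rest
  in  d : bracketed (drop 1 rest')
bracketed _ = []

-- Split the header from the body at the first newline, then split the body.
-- `read` is partial: this assumes every field is a valid Int.
parseNumbers :: String -> [Int]
parseNumbers input = map read (splitOnAnyOf (bracketed header) (drop 1 rest))
  where
    (header, rest) = break (== '\n') input
```

With these definitions, parseNumbers "[***][||]\n1***2||3||4***10" evaluates to [1,2,3,4,10].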

This problem has two kinds of solutions: the simple, and the efficient. I will not cover the efficient one (because it is not simple), though I will hint at it.
But first, the part where you extract the delimiter and body parts of the input, may be simplified with Data.List.break:
delims = splitOn "/" . fst . break (== '\n') -- Presuming the delimiters are delimited with
-- a slash.
body = drop 1 . snd . break (== '\n') -- `drop 1` discards the newline itself.
In any case, we may reduce this problem to finding the positions of all the given patterns in a given string. (By saying "string", I do not mean the Haskell String. Rather, I mean an arbitrarily long sequence (or even an infinite stream) of any symbols for which an equality relation is defined, which is typed in Haskell as Eq a => [a]. I hope this is not too confusing.) As soon as we have the positions, we may slice the string to our hearts' content. If we want to deal with an infinite stream, we must obtain the positions incrementally, and yield the results as we go, which is a restriction that must be kept in mind. Haskell is equipped well enough to handle the stream case as well as the finite string.
A simple approach is to apply isPrefixOf at the current position of the string, once for each of the patterns.
If one of them matches, we replace the matched delimiter with a single Nothing and skip past it.
Otherwise we wrap the first symbol in Just and move to the next position.
Thus, we will have replaced all the different delimiters by a single one: Nothing. We may then readily slice the string by it.
This is fairly idiomatic, and I will bring the code to your judgement shortly. The problem with this approach is that it is inefficient: in fact, if a pattern failed to match, we would rather advance by more than one symbol.
It would be more efficient to base our work on the research that has been done on finding patterns in a string; this problem is well known and there are great, intricate algorithms that solve it an order of magnitude faster. These algorithms are designed to work with a single pattern, so some work must be put into adapting them to our case; however, I believe they are adaptable. The simplest and oldest of such algorithms is Knuth-Morris-Pratt (KMP), and it is already encoded in Haskell. You may wish to take up arms and generalize it: a quick path to some amount of fame.
Here is the code:
module SplitSubstr where
-- stackoverflow.com/questions/49228467
import Data.List (unfoldr, isPrefixOf, elemIndex)
import Data.List.Split (splitWhen) -- Package `split`.
import Data.Maybe (catMaybes, isNothing)
-- | Split a (possibly infinite) string at the occurrences of any of the given delimiters.
--
-- λ take 10 $ splitOnSubstrs ["||", "***"] "la||la***fa"
-- ["la","la","fa"]
--
-- λ take 10 $ splitOnSubstrs ["||", "***"] (cycle "la||la***fa||")
-- ["la","la","fa","la","la","fa","la","la","fa","la"]
--
splitOnSubstrs :: [String] -> String -> [String]
splitOnSubstrs delims
    = fmap catMaybes       -- At this point, there will be only `Just` elements left.
    . splitWhen isNothing  -- Now we may split at nothings.
    . unfoldr f            -- Replace the occurrences of delimiters with a `Nothing`.
  where
    -- | This is the base case. It will terminate the `unfoldr` process.
    f [ ]         = Nothing
    -- | This is the recursive case. It is divided into 2 cases:
    -- * One of the delimiters may match. We will then replace it with a Nothing.
    -- * Otherwise, we will `Just` return the current element.
    --
    -- Notice that, if there are several patterns that match at this point, we will use the first one.
    -- You may sort the patterns by length to always match the longest or the shortest. If you desire
    -- more complicated behaviour, you must plug a more involved logic here. In any case, the index
    -- should point to one of the patterns that matched.
    --
    --                   vvvvvvvvvvvvvv
    f body@(x:xs) = case elemIndex True $ (`isPrefixOf` body) <$> delims of
        Just index -> return (Nothing, drop (length $ delims !! index) body)
        Nothing    -> return (Just x, xs)
It might happen that you will not find this code straightforward. Specifically, the unfoldr part is somewhat dense, so I will add a few words about it.
unfoldr f is an embodiment of a recursion scheme. f is a function that may chip a part from the body: f :: (body -> Maybe (chip, body)).
As long as it keeps chipping, unfoldr keeps applying it to the body. This is called the recursive case.
Once it fails (returning Nothing), unfoldr stops and hands you all the chips it has collected so far. This is called the base case.
In our case, f takes symbols from the string, and fails once the string is empty.
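For a smaller illustration of the same recursion scheme, here is a sketch that chips fixed-size pieces off a list with unfoldr. The name chunks is made up for this example, and the chunk size is clamped to at least 1 so that the step function always makes progress.

```haskell
import Data.List (unfoldr)

-- Chip pieces of (at most) n elements off the front until nothing is left.
chunks :: Int -> [a] -> [[a]]
chunks n = unfoldr step
  where
    size = max 1 n                       -- guard: a size of 0 would never terminate
    step [] = Nothing                    -- base case: stop unfolding
    step xs = Just (splitAt size xs)     -- recursive case: (chip, remaining body)
```

Here chunks 2 "abcde" gives ["ab","cd","e"]: each call to step returns one chip together with the body that is left over.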
That's it. I hope you send me a postcard when you receive a Turing award for a fast splitting algorithm.
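The comment in the code above mentions sorting the patterns by length so that the longest one wins. A minimal base-only sketch of that idea follows; splitLongestFirst is a hypothetical name, and this version uses an explicit accumulator instead of unfoldr and splitWhen, so it is a variation on the approach rather than the same code.

```haskell
import Data.List (isPrefixOf, sortOn)
import Data.Ord (Down (..))

-- Split on any of the delimiters, preferring the longest one that matches
-- at the current position. Empty delimiters are ignored.
splitLongestFirst :: Eq a => [[a]] -> [a] -> [[a]]
splitLongestFirst delims = go []
  where
    ordered = sortOn (Down . length) (filter (not . null) delims)
    go acc [] = [reverse acc]
    go acc s@(x:xs) =
      case filter (`isPrefixOf` s) ordered of
        (d:_) -> reverse acc : go [] (drop (length d) s)  -- emit the piece, skip the delimiter
        []    -> go (x : acc) xs                          -- ordinary symbol, keep collecting
```

With delimiters ["*", "**"] and input "a**b", matching the longest delimiter first yields ["a","b"], whereas taking "*" first would produce an empty piece between the two stars.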

Related

Haskell recursive program

I begin the function from here and don't know what to do next. Please help me in solving this function.
Write a Haskell recursive function noDupl which returns True if there
are no duplicates characters in the given string.
noDupl :: String -> Bool
noDupl = ?
Example Output:
noDupl "abcde"
True
noDupl "aabcdee"
False
Well, you've got the type signature right. Now like all recursion questions you can then think about the base case (where the recursion ends) and the recursive case (which will recurse with a smaller input).
For strings (and lists in general), the base case is usually the empty string (list). The recursive case usually takes the head of the list, processes it, then pushes to the front of the new result.
This probably sounds pretty confusing. It'll make sense when you look at some examples:
import Data.Char (chr, ord)
-- Increment each character by one (by ASCII).
incAll :: String -> String
incAll [] = [] -- Base case: empty string (list).
incAll (x:xs) = chr (ord x + 1) : incAll xs -- Recursive case, process head and prepend to recursed result.
There's a more concise way to write the above, but it demonstrates how recursion could be done.
Of course, you don't have to process each char individually, you could pattern match on two chars like so:
f (x0:x1:xs) = ...
(But you'll need to be careful with the base case.)
Hopefully this provides you with enough hints to write noDupl.
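As one more hint in the same spirit, here is a sketch of the two-element pattern applied to a different predicate (isSorted is just an illustrative name, not the noDupl answer). It shows how the base case absorbs both the empty string and the single-character string.

```haskell
-- True if the characters appear in non-decreasing order.
isSorted :: String -> Bool
isSorted (x0:x1:xs) = x0 <= x1 && isSorted (x1:xs)  -- compare a pair, recurse on the rest
isSorted _          = True                          -- "" and one-character strings
```

Note that the recursive call keeps x1 in the list, so every adjacent pair gets compared.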

Haskell Tree construction

Need some help writing a Haskell function that takes a string and creates a binary tree. Need some help from someone with a little better Haskell experience to fill in some holes for me and describe why because this is a learning experience for me.
I'm given a tree encoded in a single string for a project in Haskell (Example **B**DECA). The star denotes a node any other character denotes a Leaf. I'm trying to fill this data structure with the information read in from input.
data Trie = Leaf Char | Branch Trie Trie
I'm more of a math and imperative programming guy so I made the observation that I can define a subtree by parsing from left to right. A proper tree will have 1 more character than *. Mathematically I would think of a recursive structure.
If the first character is not a *, the solution is the first character. Otherwise the solution is a Branch whose subbranches are fed back into the function: the left Branch is the first run of characters in which characters outnumber *'s, and the right Branch is everything else.
constructTrie :: String -> Trie
constructTrie x = if x !! 1 == '*' then
let leftSubtree = (first time in drop 1 x where characters out number *'s)
rightSubtree = (rest of the characters in the drop 1 x)
in Branch constructTrie(leftSubtree) constructTrie(rightSubtree)
else Leaf x !! 1
Mostly I need help defining the left and right subtree and if there is anything wrong with defining it this way.
!! (which, by the way, is 0-indexed) is usually a no-go. It's a very "imperative" thing to do, and it's especially smelly with constant indices like here. That means you really want a pattern match. Also, splitting a list (type String = [Char]) at an index and operating on the two parts separately is a bad idea, because these lists are linked and immutable, so you'll end up copying the entire first part.
The way you want to do this is as follows:
If the string is empty, fail.
If it starts with a *, then do something that somehow parses the left subtree and gets the remainder of the list in one step, and then parse the right one out of that remainder, making a Branch.
If it's another character, make a Leaf.
There's no need to figure out where to split the string, actually split the string, and then parse the halves; just parse the list until you can't anymore and then whatever's left (or should I say right?) can be parsed again.
So: define a function constructTrie' :: String -> Maybe (Trie, String) that consumes the start of a String into a Trie, and then leaves the unparsed bit behind (and gives Nothing if it just fails to parse). This function will be recursive, which is why it gets that extra output value: it needs extra plumbing to move the remainder of the list around. constructTrie itself can then be defined as a wrapper around it:
-- Maybe Trie because it's perfectly possible that the String just won't parse
constructTrie :: String -> Maybe Trie
constructTrie s = do (t, "") <- constructTrie' s
                     -- patmat failure in a monad calls fail; fail @Maybe _ = Nothing
                     return t
-- can make this local to constructTrie in a where clause
-- or leave it exposed in case it's useful
constructTrie' :: String -> Maybe (Trie, String)
constructTrie' "" = Nothing -- can't parse something from nothing!
constructTrie' ('*':leaves) = do (ln, rs) <- constructTrie' leaves
                                 -- Parse out left branch and leave the right part
                                 -- untouched. Doesn't copy the left half
                                 (rn, rest) <- constructTrie' rs
                                 -- Parse out the right branch. Since, when parsing
                                 -- "**ABC", the recursion will end up with
                                 -- constructTrie' "*ABC", we should allow the slop.
                                 return (Branch ln rn, rest)
constructTrie' (x:xs) = return (Leaf x, xs)
This is a very common pattern: defining a recursive function with extra plumbing to pass around some state and wrapping it in a nicer one. I guess it corresponds to how imperative loops usually mutate variables to keep their state.
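To see the plumbing at work on the example string from the question, here is a self-contained sketch of the same idea. The names parseTrie and fromString are stand-ins for constructTrie' and constructTrie, and the deriving clause is an addition so the result can be compared and printed.

```haskell
data Trie = Leaf Char | Branch Trie Trie
  deriving (Eq, Show)

-- Consume one tree from the front of the input and return the leftover.
parseTrie :: String -> Maybe (Trie, String)
parseTrie "" = Nothing
parseTrie ('*' : rest) = do
  (l, rest')  <- parseTrie rest    -- left subtree, then whatever follows it
  (r, rest'') <- parseTrie rest'   -- right subtree starts where the left one ended
  Just (Branch l r, rest'')
parseTrie (c : rest) = Just (Leaf c, rest)

-- Succeed only if the whole input is consumed by exactly one tree.
fromString :: String -> Maybe Trie
fromString s = case parseTrie s of
  Just (t, "") -> Just t
  _            -> Nothing
```

For "**B**DECA" this produces Just (Branch (Branch (Leaf 'B') (Branch (Branch (Leaf 'D') (Leaf 'E')) (Leaf 'C'))) (Leaf 'A')), while "*A" (a missing subtree) and "AB" (trailing input) both give Nothing.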

Capitalize Every Other Letter in a String -- take / drop versus head / tail for Lists

I have spent the past afternoon or two poking at my computer as if I had never seen one before. Today's topic: Lists.
The exercise is to take a string and capitalize every other letter. I did not get very far...
Let's take a list x = String.toList "abcde" and try to analyze it. If we add the results of take 1 and drop 1 we get back the original list
> x = String.toList "abcde"
['a','b','c','d','e'] : List Char
> (List.take 1 x) ++ (List.drop 1 x)
['a','b','c','d','e'] : List Char
I thought head and tail did the same thing, but I get a big error message:
> [List.head x] ++ (List.tail x)
==================================== ERRORS ====================================
-- TYPE MISMATCH --------------------------------------------- repl-temp-000.elm
The right argument of (++) is causing a type mismatch.
7│ [List.head x] ++ (List.tail x)
^^^^^^^^^^^
(++) is expecting the right argument to be a:
List (Maybe Char)
But the right argument is:
Maybe (List Char)
Hint: I always figure out the type of the left argument first and if it is
acceptable on its own, I assume it is "correct" in subsequent checks. So the
problem may actually be in how the left and right arguments interact.
The error message tells me a lot of what's wrong. Not 100% sure how I would fix it. The list joining operator ++ is expecting [Maybe Char] and instead got Maybe [Char]
Let's just try to capitalize the first letter in a string (which is less cool, but actually realistic).
[String.toUpper ( List.head x)] ++ (List.drop 1 x)
This is wrong since String.toUpper requires a String and List.head x is instead a Maybe Char.
[Char.toUpper ( List.head x)] ++ (List.drop 1 x)
This is also wrong since Char.toUpper requires a Char instead of a Maybe Char.
In real life a user could break a script like this by typing a non-ASCII character (like an emoji). So maybe Elm's feedback is right. This should be an easy problem: it takes "abcde" and turns it into "AbCdE" (or possibly "aBcDe"). How do I handle errors properly?
The same question in JavaScript: How do I make the first letter of a string uppercase in JavaScript?
In Elm, List.head and List.tail both return the Maybe type because either function could be passed an invalid value; specifically, the empty list. Some languages, like Haskell, throw an error when passing an empty list to head or tail, but Elm tries to eliminate as many runtime errors as possible.
Because of this, you must explicitly handle the exceptional case of the empty list if you choose to use head or tail.
Note: There are probably better ways to achieve your end goal of string mixed capitalization, but I'll focus on the head and tail issue because it's a good learning tool.
Since you're using the concatenation operator, ++, you'll need a List for both arguments, so it's safe to say you could create a couple functions that handle the Maybe return values and translate them to an empty list, which would allow you to use your concatenation operator.
myHead list =
case List.head list of
Just h -> [h]
Nothing -> []
myTail list =
case List.tail list of
Just t -> t
Nothing -> []
Using the case statements above, you can handle all possible outcomes and map them to something usable for your circumstances. Now you can swap myHead and myTail into your code and you should be all set.
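For comparison with the Haskell questions elsewhere on this page, here is a sketch of the end goal (capitalizing every other letter) that never touches head or tail. Note that this is Haskell, not Elm, and capitalizeEveryOther is a name invented for the example; it pairs each character with an alternating function instead of taking the list apart by hand.

```haskell
import Data.Char (toUpper)

-- Apply toUpper to the 1st, 3rd, 5th, ... character and leave the others alone.
-- zipWith stops at the end of the input, so the empty string needs no special case.
capitalizeEveryOther :: String -> String
capitalizeEveryOther = zipWith ($) (cycle [toUpper, id])
```

Here capitalizeEveryOther "abcde" gives "AbCdE", and the empty string simply maps to the empty string.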

First attempt at Haskell: Converting lower case letters to upper case

I have recently started learning Haskell, and I've tried creating a function in order to convert a lower case word to an upper case word, it works, but I don't know how good it is and I have some questions.
Code:
lowerToUpperImpl element list litereMari litereMici =
do
if not (null list) then
if (head list) == element then
['A'..'Z'] !! (length ['A'..'Z'] - length (tail list ) -1)
else
lowerToUpperImpl element (tail list) litereMari litereMici
else
'0' --never to be reached
lowerToUpper element = lowerToUpperImpl element ['a'..'z'] ['A'..'Z'] ['a'..'z']
lowerToUpperWordImpl word =
do
if not (null word) then
lowerToUpper (head (word)):(lowerToUpperWordImpl (tail word))
else
""
I don't like the way I have passed the upper case and lower case
letters , couldn't I just declare a global variables or something?
What would your approach be in filling the dead else branch?
What would your suggestions on improving this be?
Firstly, if/else is generally seen as a crutch in functional programming languages, because if is meant to be used as an expression that yields a value, not as a branching statement. Also remember that lists don't know their own lengths in Haskell, so calculating a length is an O(n) step. This is particularly bad for infinite lists, where it never finishes.
I would write it more like this (if I didn't import any libraries):
uppercase :: String -> String
uppercase = map (\c -> if c >= 'a' && c <= 'z' then toEnum (fromEnum c - 32) else c)
Let me explain. This code makes use of the Enum and Ord typeclasses that Char satisfies. fromEnum c translates c to its ASCII code and toEnum takes ASCII codes to their equivalent characters. The function I supply to map simply checks that the character is lowercase and subtracts 32 (the difference between 'A' and 'a') if it is, and leaves it alone otherwise.
Of course, you could always just write:
import Data.Char
uppercase :: String -> String
uppercase = map toUpper
Hope this helps!
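One difference between the two versions is worth a quick sketch: the hand-written one only knows about the ASCII range 'a' to 'z', while Data.Char.toUpper also maps other letters that have an uppercase form. The names asciiUpper and unicodeUpper are made up here so both can live in one module.

```haskell
import Data.Char (toUpper)

-- The arithmetic version: only 'a'..'z' are shifted by 32.
asciiUpper :: String -> String
asciiUpper = map (\c -> if c >= 'a' && c <= 'z' then toEnum (fromEnum c - 32) else c)

-- The library version: follows the Unicode case mapping.
unicodeUpper :: String -> String
unicodeUpper = map toUpper

-- '\233' is e-acute and '\201' is its uppercase form:
--   asciiUpper   "h\233llo" == "H\233LLO"   (the accented letter is untouched)
--   unicodeUpper "h\233llo" == "H\201LLO"
```

For plain ASCII input the two functions agree.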
The things I always recommend to people in your circumstances are these:
Break the problem down into smaller pieces, and write separate functions for each piece.
Use library functions wherever you can to solve the smaller subproblems.
As an exercise after you're done, figure out how to write on your own the library functions you used.
In this case, we can apply the points as follows. First, since String in Haskell is a synonym for [Char] (list of Char), we can break your problem into these two pieces:
Turn a character into its uppercase counterpart.
Transform a list by applying a function separately to each of its members.
Second point: as Alex's answer points out, the Data.Char standard library module comes with a function toUpper that performs the first task, and the Prelude library comes with map which performs the second. So using those two together solves your problem immediately (and this is exactly the code Alex wrote earlier):
import Data.Char
uppercase :: String -> String
uppercase = map toUpper
I'd say that this is the best solution (shortest and clearest), and as a beginner, this is the first answer you should try.
Applying my third point: after you've come up with the standard solution, it is enormously educational to try and write your own versions of the library functions you used. The point is that this way you learn three things:
How to break down problems into easier, smaller pieces, preferably reusable ones;
The contents of the standard libraries of the language;
How to write the simple "foundation" functions that the library provides.
So in this case, you can try writing your own versions of toUpper and map. I'll provide a skeleton for map:
map :: (a -> b) -> [a] -> [b]
map f [] = ???
map f (x:xs) = ???
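If you want something to check your attempt against afterwards, here is one possible way to fill in the skeleton. It is renamed to myMap (a name chosen here) so it does not clash with the Prelude's map.

```haskell
-- Apply f to every element, keeping the order of the list.
myMap :: (a -> b) -> [a] -> [b]
myMap _ []       = []                 -- base case: nothing to transform
myMap f (x : xs) = f x : myMap f xs   -- transform the head, recurse on the tail
```

For example, myMap (+1) [1,2,3] gives [2,3,4].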

How to access nth element in a Haskell tuple

I have this:
get3th (_,_,a,_,_,_) = a
which works fine in GHCI but I want to compile it with GHC and it gives error. If I want to write a function to get the nth element of a tuple and be able to run in GHC what should I do?
My whole program is below; what should I do with it?
get3th (_,_,a,_,_,_) = a
main = do
mytuple <- getLine
print $ get3th mytuple
Your problem is that getLine gives you a String, but you want a tuple of some kind. You can fix your problem by converting the String to a tuple – for example by using the built-in read function. The third line here tries to parse the String into a six-tuple of Ints.
main = do
mystring <- getLine
let mytuple = read mystring :: (Int, Int, Int, Int, Int, Int)
print $ get3th mytuple
Note however that while this is useful for learning about types and such, you should never write this kind of code in practice. There are at least two warning signs:
You have a tuple with more than three or so elements. Such a tuple is very rarely needed and can often be replaced by a list, a vector or a custom data type. Tuples are rarely used more than temporarily to bring two kinds of data into one value. If you start using tuples often, think about whether or not you can create your own data type instead.
Using read to read a structure is not a good idea. read will explode your program with a terrible error message at any tiny little mistake, and that's usually not what you want. If you need to parse structures, it's a good idea to use a real parser. read can be enough for simple integers and such, but no more than that.
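Short of a real parser, a small step up from read is readMaybe from Text.Read (part of base), which returns Nothing instead of crashing on bad input. The sketch below assumes the same six-Int tuple as above; parseSix and third are names invented for the example.

```haskell
import Text.Read (readMaybe)

-- Parse a six-tuple of Ints, yielding Nothing on malformed input.
parseSix :: String -> Maybe (Int, Int, Int, Int, Int, Int)
parseSix = readMaybe

-- Third component of a six-tuple.
third :: (a, b, c, d, e, f) -> c
third (_, _, c, _, _, _) = c
```

With this, fmap third (parseSix "(1,2,3,4,5,6)") is Just 3, and parseSix "oops" is Nothing rather than a runtime exception. It still gives no useful error message, so a proper parser remains the better choice for anything non-trivial.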
The type of getLine is IO String, so your program won't type check because you are supplying a String instead of a tuple.
Your program will work if a proper argument is supplied, e.g.:
main = do
print $ get3th (1, 2, 3, 4, 5, 6)
It seems to me that your confusion is between tuples and lists. That is an understandable confusion when you first meet Haskell, as many other languages only have one similar construct. Tuples use round parens: (1,2). A tuple with n values in it is a type, and each value can be of a different type, which results in a different tuple type. So (Int, Int) is a different type from (Int, Float), although both are 2-tuples (pairs). There are some functions in the Prelude which are polymorphic over 2-tuples, e.g. fst :: (a,b) -> a, which takes the first element. fst is easy to define using pattern matching like your own function:
fst (a,b) = a
Note that fst (1,2) evaluates to 1, but fst (1,2,3) is ill-typed and won't compile.
Now, lists on the other hand, can be of any length, including zero, and still be the same type; but each element must be of the same type. Lists use square brackets: [1,2,3]. The type for a list with elements of type a is written [a]. Lists are constructed by prepending values onto the empty list [], so a list with one element can be written [a], but this is syntactic sugar for a:[], where : is the cons operator, which prepends a value to the front of a list. Just as tuples can be pattern matched, you can use the empty list and the cons operator to pattern match:
head :: [a] -> a
head (x:xs) = x
The pattern match means x is of type a and xs is of type [a], and it is the former we want for head. (This is a prelude function and there is an analogous function tail.)
Note that head is a partial function as we cannot define what it does in the case of the empty list. Calling it on an empty list will result in a runtime error as you can check for yourself in GHCi. A safer option is to use the Maybe type.
safeHead :: [a] -> Maybe a
safeHead (x:xs) = Just x
safeHead [] = Nothing
String in Haskell is simply a synonym for [Char]. So all of these list functions can be used on strings, and getLine returns a String.
Now, in your case you want the 3rd element. There are a couple of ways you could do this, you could call tail a few times then call head, or you could pattern match like (a:b:c:xs). But there is another utility function in the prelude, (!!) which gets the nth element. (Writing this function is a very good beginner exercise). So your program can be written
main = do
myString <- getLine
print $ myString !! 2 --zero indexed
Testing gives
Prelude> main
test
's'
So remember, tuples use () and are strictly of a given length, but can have members of different types; whereas lists use [], can be any length, but each element must be the same type. And Strings are really lists of characters.
EDIT
As an aside, I thought I'd mention that there is a neater way of writing this main function if you are interested.
main = getLine >>= print . (!!2)
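Like head, (!!) is partial: it fails at runtime when the index is out of range, for example when the input line is shorter than three characters. Combining it with the safeHead idea from above gives a total version; safeIndex is a name chosen for this sketch and is not a Prelude function.

```haskell
-- Zero-indexed lookup that returns Nothing instead of crashing.
safeIndex :: [a] -> Int -> Maybe a
safeIndex _ n | n < 0 = Nothing      -- negative indices are never valid
safeIndex [] _        = Nothing      -- ran off the end of the list
safeIndex (x:xs) n
  | n == 0    = Just x
  | otherwise = safeIndex xs (n - 1)
```

Here safeIndex "test" 2 is Just 's', while safeIndex "hi" 2 is Nothing.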
