Functional paragraphs - haskell

Sorry, I don't quite get FP yet. I want to split a sequence of lines into a sequence of sequences of lines, treating an empty line as a paragraph division. I could do it in Python like this:
def get_paragraphs(lines):
    paragraphs = []
    paragraph = []
    for line in lines:
        if line == "":  # I know it could also be "if not line:"
            paragraphs.append(paragraph)
            paragraph = []
        else:
            paragraph.append(line)
    paragraphs.append(paragraph)  # keep the final paragraph
    return paragraphs
How would you go about doing it in Erlang or Haskell?

I'm only a beginning Haskell programmer (and the little Haskell I learnt was 5 years ago), but for a start, I'd write the natural translation of your function, with the accumulator ("the current paragraph") being passed around (I've added types, just for clarity):
type Line = String
type Para = [Line]
-- Takes a list of lines, and returns a list of paragraphs
paragraphs :: [Line] -> [Para]
paragraphs ls = paragraphs2 ls []
-- Helper function: takes a list of lines, and the "current paragraph"
paragraphs2 :: [Line] -> Para -> [Para]
paragraphs2 [] para = [para]
paragraphs2 ("":ls) para = para : (paragraphs2 ls [])
paragraphs2 (l:ls) para = paragraphs2 ls (para++[l])
This works:
*Main> paragraphs ["Line 1", "Line 2", "", "Line 3", "Line 4"]
[["Line 1","Line 2"],["Line 3","Line 4"]]
So that's a solution. But then, Haskell experience suggests that there are almost always library functions for doing things like this :) One related function is called groupBy, and it almost works:
paragraphs3 :: [Line] -> [Para]
paragraphs3 ls = groupBy (\x y -> y /= "") ls
*Main> paragraphs3 ["Line 1", "Line 2", "", "Line 3", "Line 4"]
[["Line 1","Line 2"],["","Line 3","Line 4"]]
Oops. What we really need is a "splitBy", and it's not in the libraries, but we can filter out the bad ones ourselves:
paragraphs4 :: [Line] -> [Para]
paragraphs4 ls = map (filter (/= "")) (groupBy (\x y -> y /= "") ls)
or, if you want to be cool, you can get rid of the argument and do it the pointless way:
paragraphs5 = map (filter (/= "")) . groupBy (\x y -> y /= "")
I'm sure there is an even shorter way. :-)
Edit: ephemient points out that (not . null) is cleaner than (/= ""). So we can write
paragraphs = map (filter $ not . null) . groupBy (const $ not . null)
The repeated (not . null) is a strong indication that we really should abstract this out into a function, and this is what the Data.List.Split module does, as pointed out in the answer below.
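As a rough illustration of that abstraction, here is a minimal sketch that uses only the Prelude. The names splitOnPred and paragraphsBy are hypothetical and chosen to avoid clashing with the split package; this is not the package's implementation.

```haskell
-- Hypothetical helper: split a list at every element satisfying p,
-- dropping the separator elements themselves.
splitOnPred :: (a -> Bool) -> [a] -> [[a]]
splitOnPred p xs = case break p xs of
  (chunk, [])       -> [chunk]
  (chunk, _ : rest) -> chunk : splitOnPred p rest

-- Paragraphs are the non-empty chunks between empty lines.
paragraphsBy :: [String] -> [[String]]
paragraphsBy = filter (not . null) . splitOnPred null
```

Consecutive empty lines produce empty chunks in splitOnPred, which is why paragraphsBy filters them out afterwards.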

I'm also trying to learn Haskell. A solution for this question could be:
paragraphs :: [String] -> [[String]]
paragraphs [] = []
paragraphs lines = p : (paragraphs rest)
    where (p, rest) = span (/= "") (dropWhile (== "") lines)
where I'm using the functions from Data.List. The ones I'm using are already available from the Prelude, but you can find their documentation in the link.
The idea is to find the first paragraph using span (/= ""). This will return the paragraph, and the lines following. We then recurse on the smaller list of lines which I call rest.
Before splitting out the first paragraph, we drop any empty lines using dropWhile (== ""). This is important to eat the empty line(s) separating the paragraphs. My first attempt was this:
paragraphs :: [String] -> [[String]]
paragraphs [] = []
paragraphs lines = p : (paragraphs $ tail rest)
    where (p, rest) = span (/= "") lines
but this fails when we reach the final paragraph, since rest is then the empty list and tail fails on it:
*Main> paragraphs ["foo", "bar", "", "hehe", "", "bla", "bla"]
[["foo","bar"],["hehe"],["bla","bla"]*** Exception: Prelude.tail: empty list
Dropping empty lines solves this, and it also makes the code treat any number of empty lines as a paragraph separator, which is what I would expect as a user.
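To make the two steps concrete, here is a small sketch of the intermediate values. The sample input and the names sample, trimmed, firstPara and remaining are made up for illustration.

```haskell
-- Hypothetical sample input, not from the original post.
sample :: [String]
sample = ["", "foo", "bar", "", "", "hehe"]

-- Step 1: drop any leading empty lines.
trimmed :: [String]
trimmed = dropWhile (== "") sample   -- ["foo","bar","","","hehe"]

-- Step 2: split at the first empty line.
firstPara, remaining :: [String]
(firstPara, remaining) = span (/= "") trimmed
-- firstPara == ["foo","bar"], remaining == ["","","hehe"]
```

The recursive call then starts again from remaining, where step 1 eats both separator lines at once.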

The cleanest solution would be to use something appropriate from the split package.
You'll need to install that first, but then Data.List.Split.splitWhen null should do the job perfectly.

Think recursively.
get_paragraphs [] paras para = paras ++ [para]
get_paragraphs ("":ls) paras para = get_paragraphs ls (paras ++ [para]) []
get_paragraphs (l:ls) paras para = get_paragraphs ls paras (para ++ [l])

You want to group the lines, so groupBy from Data.List seems like a good candidate. It uses a custom function to determine which lines are "equal" so one can supply something that makes lines in the same paragraph "equal". For example:
import Data.List( groupBy )
inpara :: String -> String -> Bool
inpara _ "" = False
inpara _ _ = True
paragraphs :: [String] -> [[String]]
paragraphs = groupBy inpara
This has some limitations, since inpara can only compare two lines at a time and more complex logic doesn't fit into the framework given by groupBy. A more elementary solution is more flexible. Using basic recursion one can write:
paragraphs [] = []
paragraphs as = para : paragraphs (dropWhile null remainder)
    where (para, remainder) = span (not . null) as
          -- splits list at the first empty line
span splits a list at the point the supplied function becomes false (the first empty line), dropWhile removes leading elements for which the supplied function is true (any leading empty lines).
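To compare the two approaches side by side, here is a small sketch. The names groupedParas and recursiveParas are hypothetical stand-ins for the groupBy version and the recursive version described in this answer.

```haskell
import Data.List (groupBy)

-- groupBy version: a new group starts at every empty line,
-- and that empty line stays in the group as its first element.
groupedParas :: [String] -> [[String]]
groupedParas = groupBy (\_ next -> not (null next))

-- Recursive version: separators are dropped by dropWhile.
recursiveParas :: [String] -> [[String]]
recursiveParas [] = []
recursiveParas ls = para : recursiveParas (dropWhile null rest)
  where (para, rest) = span (not . null) ls

-- groupedParas   ["a", "b", "", "c"] == [["a","b"],["","c"]]
-- recursiveParas ["a", "b", "", "c"] == [["a","b"],["c"]]
```

The leftover "" in the second group is one visible form of the limitation mentioned above.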

Better late than never.
import Data.List.Split (splitOn)
paragraphs :: String -> [[String]]
paragraphs s = filter (not . null) $ map words $ splitOn "\n\n" s
paragraphs "a\nb\n\nc\nd" == [["a", "b"], ["c", "d"]]
paragraphs "\n\na\nb\n\n\nc\nd\n\n\n" == [["a", "b"], ["c", "d"]]
paragraphs "\n\na\nb\n\n \n c\nd\n\n\n" == [["a", "b"], ["c", "d"]]

Related

How can I split a string on two conditions?

So basically I want to split my string on two conditions: when there is a space, or when a letter is different from the next one.
An example:
If I have the string "AAA ADDD DD", I want to split it into ["AAA","A","DDD","DD"].
So I made this code:
sliceIt :: String -> [String]
sliceIt xs = words xs
But it only splits the initial string where a space exists.
How can I also split when a character is next to a different one?
Can this problem be solved more easily with recursion?
So you want to split by words and then group equal elements in each word. The functions for doing so already exist:
import Data.List
sliceIt :: String -> [String]
sliceIt s = concatMap group $ words s
sliceItPointFree = concatMap group . words -- Point free notation. Same but cooler
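To see why composing words and group works, the pipeline can be unrolled into its intermediate values. This is only an illustrative sketch; step1, step2 and step3 are hypothetical names.

```haskell
import Data.List (group)

-- words splits on whitespace.
step1 :: [String]
step1 = words "AAA ADDD DD"   -- ["AAA","ADDD","DD"]

-- group splits each word into runs of equal characters.
step2 :: [[String]]
step2 = map group step1       -- [["AAA"],["A","DDD"],["DD"]]

-- concat flattens the nested list; concatMap fuses the last two steps.
step3 :: [String]
step3 = concat step2          -- ["AAA","A","DDD","DD"]
```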
split :: String -> [String]
split [] = []
split (' ':xs) = split xs
split (x:xs) = (takeWhile (== x) (x:xs)) : (split $ dropWhile (== x) (x:xs))
So this is a recursive definition where there are 2 cases:
If head is a space then ignore it.
Otherwise, take as many of the same characters as you can, then call the function on the remaining part of the string.

Getting text between two empty lines in haskell

Hey, I am back with another Haskell question. I asked a question here earlier and can now find the empty lines perfectly; now I want to get the text between two specific empty lines in Haskell. (For example, the text between the beginning and the first empty line.) I can't think of a way to do that because I don't yet understand the syntax well enough to use it efficiently, so I really need your help. My practice with IO so far looks like this:
main = do {
    readFile "/tmp/foo.txt" >>= print . length . filter (== '?');
    readFile "/tmp/foo.txt" >>= print . length . words;
    readFile "/tmp/foo.txt" >>= print . length . filter (== '.');
    readFile "/tmp/foo.txt" >>= print . length . filter null . lines;
}
With this, I can count the number of sentences, question marks, empty lines and so on. Now I want to get the text between two empty lines. I would be very grateful for help with this last exercise, which I couldn't solve. Thanks in advance!
The easiest way is to use the functions lines, groupBy and filter:
lines splits a String into a list of Strings (one element per line)
groupBy then groups all adjacent non-empty lines. This is the most difficult part: you have to write a predicate that is true for two elements when both are non-empty: groupBy (\x y -> ???)
filter then removes the elements of shape [""]
Here is some example usage in ghci:
λ > import Data.List
λ > let groupify = ???
λ > l <- readFile "~/tmux.conf"
λ > map length $ groupify l
[4,7,3,1,4,2,2,5,4,3,3,2,7,4,4,4,3,3,3,2]
you can check with the contents of my tmux config file at my github-repo
UPDATE
the solution for this problem would be
groupify = filter (/= [""]) . groupBy (\x y -> x /= "" && y /= "") . lines
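Building on groupify, one possible way to pick out the text between two specific empty lines is to index into the list of blocks. This is only a sketch: blockAt is a hypothetical helper, and it numbers blocks from 0, so block 0 is the text between the beginning and the first empty line.

```haskell
import Data.List (groupBy)
import Data.Maybe (listToMaybe)

-- groupify as given above: blocks of adjacent non-empty lines.
groupify :: String -> [[String]]
groupify = filter (/= [""]) . groupBy (\x y -> x /= "" && y /= "") . lines

-- Hypothetical helper: the n-th block (0-based), or Nothing if there
-- are not that many blocks.
blockAt :: Int -> String -> Maybe [String]
blockAt n = listToMaybe . drop n . groupify
```

For example, blockAt 1 "a\nb\n\nc\n\n\nd" gives Just ["c"], and asking for a block past the end gives Nothing instead of an exception.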
You can try pattern-matching, it literally says what it does:
betweenEmptyLines :: [String] -> [String]
betweenEmptyLines [] = []
betweenEmptyLines ("":line:"":rest) = line:(betweenEmptyLines $ "":rest)
betweenEmptyLines (line:rest) = betweenEmptyLines rest
How it works:
> betweenEmptyLines ["foo", "", "the bar", "", "and", "also", "", "the baz", "", "but", "not", "rest"]
["the bar","the baz"]

Splitting a String in Haskell

I want to split a String in Haskell.
My initial String would look something like
["Split a String in Haskell"]
and my expected output would be:
["Split","a","String","in","Haskell"].
From what I've seen, words and lines don't work here, because I have the type [String] instead of just String.
I've tried Data.List.Split, but no luck there either.
import Data.List
split = (>>= words)
main = print $ split ["Split a String in Haskell"]
map words makes [["Split","a","String","in","Haskell"]] from ["Split a String in Haskell"], and concat makes [x] from [[x]]. And concat (map f xs) is equal to xs >>= f. And h xs = xs >>= f is equal to h = (>>= f).
A simpler way, if the list always holds exactly one string, would be
split = words . head
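The equivalences mentioned above can be written out as three definitions that should behave identically. The names viaBind, viaConcatMap and viaConcat are hypothetical, chosen only for this sketch.

```haskell
viaBind, viaConcatMap, viaConcat :: [String] -> [String]
viaBind      = (>>= words)             -- list monad bind
viaConcatMap = concatMap words         -- the same thing, spelled out
viaConcat xs = concat (map words xs)   -- map first, then flatten

-- Each of them maps ["Split a String", "in Haskell"]
-- to ["Split","a","String","in","Haskell"].
```

Unlike words . head, these handle any number of input strings, including the empty list.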

identifying number of words in a paragraph using haskell

I am new to Haskell and functional programing. I have a .txt file which contains some paragraphs. I want to count the number of words in each paragraph, using Haskell.
I have written the input/output code
paragraph-words:: String -> int
no_of_words::IO()
no_of_words=
  do
    putStrLn "enter the .txt file name:"
    fileName1<- getLine
    text<- readFile fileName1
    let wordscount= paragraph-words text
Can anyone help me write the function paragraph-words, which will calculate the number of words in each paragraph?
First: you don't want to be bothered with dirty IO() any more than necessary, so the signature should be
wordsPerParagraph :: String -> [Int]
As for doing this: you should first split up the text in paragraphs. Counting the words in each of them is pretty trivial then.
What you basically need is to match on empty lines (two adjacent newline characters). So I'd first use the lines function, giving you a list of lines. Then you separate these at each empty line:
paragraphs :: String -> [String]
paragraphs = split . lines
  where split [] = []
        split (ln : "" : lns) = ln : split lns
        split (ln : lns) = let (hd, tl) = splitAt 1 $ split lns
                           in unwords (ln : hd) : tl  -- join the lines of one paragraph with spaces
A list of lines can be split into paragraphs if one takes all lines until at least one empty line ("") is reached or the list is exhausted (1). We ignore all consecutive empty lines (2) and apply the same method for the rest of our lines:
type Line = String
type Paragraph = [String]
parify :: [Line] -> [Paragraph]
parify [] = []
parify ls
    | null first = parify rest
    | otherwise  = first : parify rest
  where first = takeWhile (/= "") ls  -- (1) take until an empty line or the end
        rest  = dropWhile (== "") . drop (length first) $ ls
                -- ^ (2) drop all empty lines
In order to split a string into its lines, you can simply use lines. To get the number of words in a Paragraph, you simply sum over the number of words in each line
singleParagraphCount :: Paragraph -> Int
singleParagraphCount = sum . map lineWordCount
The words in each line are simply length . words:
lineWordCount :: Line -> Int
lineWordCount = length . words
So all in all we get the following function:
wordsPerParagraph :: String -> [Int]
wordsPerParagraph = map (singleParagraphCount) . parify . lines
First, you can't use - in a function name, you would have to use _ instead (or better, use camelCase as leftroundabout suggests below).
Here is a function which satisfies your type signature:
paragraph_words = length . words
This first splits the text into a list of words, then counts them by returning the length of that list of words.
However this does not completely solve the problem because you haven't written code to split your text into paragraphs.
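One possible way to fill in the missing paragraph splitting and connect it to the file reading is sketched below. The names paragraphLines and reportFile are hypothetical, and the sketch assumes that only completely empty lines separate paragraphs (whitespace-only lines are treated as content).

```haskell
import Data.List (groupBy)

-- Split text into paragraphs (lists of lines). Each empty line ends up
-- in a group of its own, [""], which is then filtered out.
paragraphLines :: String -> [[String]]
paragraphLines =
  filter (/= [""]) . groupBy (\x y -> not (null x) && not (null y)) . lines

-- One word count per paragraph.
wordsPerParagraph :: String -> [Int]
wordsPerParagraph = map (sum . map (length . words)) . paragraphLines

-- Hypothetical IO wrapper: read a file and print one count per line.
reportFile :: FilePath -> IO ()
reportFile path = readFile path >>= mapM_ print . wordsPerParagraph
```

Keeping wordsPerParagraph pure means it can be tried in ghci on a literal string before any file is involved.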

Haskell Assignment - direction needed to split a String into words

we started a paper on Haskell a few weeks ago and just received our first assignment. I'm aware that SO doesn't like homework questions, so I'm not going to ask how to do it. Instead, it would be very much appreciated if anyone could push me in the right direction with this. Seeing as it might not be a specific question, would it be more appropriate in a discussion / community wiki?
Question: Tokenize a String, that is: "Hello, World!" -> ["Hello", "World"]
Coming from a Java background, I have to forget everything about the usual way to go about this. The problem is that I am still very clueless with Haskell. This is what I've come up with:
module Main where
main :: IO()
main = do putStrLn "Type in a string:\n"
          x <- getLine
          putStrLn "The string entered was:"
          putStrLn x
          putStrLn "\n"
          print (tokenize x)
tokenize :: String -> [String]
tokenize [] = []
tokenize l = token l ++ tokenize l
token :: String -> String
token [] = []
token l = takeWhile (isAlphaNum) l
What would be the first glaring mistake?
Thank you.
The first glaring mistake is
tokenize l = token l ++ tokenize l
(++) :: [a] -> [a] -> [a] appends two lists of the same type. Since token :: String -> String (and type String = [Char]), the type of tokenize that is inferred from that line is tokenize :: String -> String.
You should use (:) :: a -> [a] -> [a] here.
The next mistake in that line is that in the recursive call, you pass the same input l once again, so you have an infinite recursion, always doing the same without change. You have to remove the first token (and a bit more) from the input for the argument to the recursive call.
Another problem is that your token supposes that the input begins with alphanumeric characters.
You also need a function that ensures that condition for what you pass to token.
This line results in an infinite list (which is OK, since Haskell is lazy, so the list only gets constructed "on demand"), because it is recurring with no change in the arguments:
tokenize l = token l ++ tokenize l
We can visualise what is happening when tokenize is called as:
tokenize l = token l ++ tokenize l
= token l ++ (token l ++ tokenize l)
= token l ++ (token l ++ (token l ++ tokenize l))
= ...
To stop this happening, you need to change the argument to tokenize so that it recurs sensibly:
tokenize l = token l ++ tokenize <something goes here>
Since others have already pointed out your mistake, just a little hint: while you have already found the very useful takeWhile function, you should have a look at span, as it could be even more helpful here.
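For readers who want to see where these hints lead, here is one possible sketch that combines them: (:) instead of (++), span to get both the token and the rest, and dropWhile to skip separators. It assumes a token is a maximal run of alphanumeric characters, which matches the example in the question but may not match the full assignment.

```haskell
import Data.Char (isAlphaNum)

tokenize :: String -> [String]
tokenize s = case dropWhile (not . isAlphaNum) s of
  []   -> []                                   -- nothing but separators left
  rest -> let (tok, remaining) = span isAlphaNum rest
          in tok : tokenize remaining          -- recurse on the shorter input

-- tokenize "Hello, World!" == ["Hello","World"]
```

Each recursive call receives strictly less input, so the recursion terminates on finite strings.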
This has something in it that feels similar to a parser monad. However, as you're a newcomer to Haskell, it's unlikely that you're in a position to understand how parsing monads work (or use them in your code) quite yet. To give you the basics, consider what you want:
tokenize :: String -> [String]
This takes a String, chomps it up into more pieces, and generates a list of strings corresponding to the words in the input string. How might we represent this? What we want to do is find a function that processes a single string, and at the first sign of whitespace, adds that string on to the sequence of words. But then you have to process what's left over. (I.e., the rest of the string.) For example, let's say you want to tokenize:
The brown fox jumped
You first pull out "The" and then continue processing " brown fox jumped" (note the space at the beginning of the second string). You will do this recursively, so naturally you will need a recursive function.
The natural solution that sticks out is to take something where you accumulate a set of strings you've tokenized so far, keep munching on the current input until you hit whitespace, then also accumulate what you've seen in the current string (this leads to an implementation where you're mostly consing stuff, and then occasionally reversing stuff).
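A minimal sketch of that accumulator style follows, assuming tokens are separated by whitespace only. The names tokenizeAcc, go and flush are hypothetical.

```haskell
import Data.Char (isSpace)

tokenizeAcc :: String -> [String]
tokenizeAcc = go [] []
  where
    -- done: finished tokens, newest first
    -- cur:  characters of the current token, in reverse order
    go done cur [] = reverse (flush cur done)
    go done cur (c : cs)
      | isSpace c = go (flush cur done) [] cs
      | otherwise = go done (c : cur) cs

    -- Move the current token (if any) onto the list of finished tokens.
    flush [] done  = done
    flush cur done = reverse cur : done

-- tokenizeAcc "The brown fox jumped" == ["The","brown","fox","jumped"]
```

Consing onto the front and reversing once at the end avoids repeatedly appending with (++), which copies the accumulated list each time.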
Your exercise seemed a bit challenging to me so I decided to solve it just for self-training. Here's what I came up with:
import Data.List
import Data.Maybe
splitByAnyOf yss xs =
  foldr (\ys acc -> concat $ map (splitBy ys) acc) [xs] yss
splitBy ys xs =
  case (precedingElements ys xs, succeedingElements ys xs) of
    (Just "", Just s) -> splitBy ys s
    (Just p, Just "") -> [p]
    (Just p, Just s) -> p : splitBy ys s
    otherwise -> [xs]
succeedingElements ys xs =
  fromMaybe Nothing . find isJust $ map (stripPrefix ys) $ tails xs
precedingElements ys xs =
  fromMaybe Nothing . find isJust $ map (stripSuffix ys) $ inits xs
  where
    stripSuffix ys xs =
      if ys `isSuffixOf` xs then Just $ take (length xs - length ys) xs
      else Nothing
main = do
  print $ splitBy "!" "Hello, World!"
  print $ splitBy ", " "Hello, World!"
  print $ splitByAnyOf [", ", "!"] "Hello, World!"
outputs:
["Hello, World"]
["Hello","World!"]
["Hello","World"]

Resources