Word count in Haskell

I'm working on this exercise:
Given a phrase, count the occurrences of each word in that phrase.
For the purposes of this exercise you can expect that a word will always be one of:
A number composed of one or more ASCII digits (ie "0" or "1234") OR
A simple word composed of one or more ASCII letters (ie "a" or "they") OR
A contraction of two simple words joined by a single apostrophe (ie "it's" or "they're")
When counting words you can assume the following rules:
The count is case insensitive (ie "You", "you", and "YOU" are 3 uses of the same word)
The count is unordered; the tests will ignore how words and counts are ordered
Other than the apostrophe in a contraction all forms of punctuation are ignored
The words can be separated by any form of whitespace (ie "\t", "\n", " ")
For example, for the phrase "That's the password: 'PASSWORD 123'!", cried the Special Agent.\nSo I fled. the count would be:
that's: 1
the: 2
password: 2
123: 1
cried: 1
special: 1
agent: 1
so: 1
i: 1
fled: 1
My code:
module WordCount (wordCount) where
import qualified Data.Char as C
import qualified Data.List as L
import Text.Regex.TDFA as R
wordCount :: String -> [(String, Int)]
wordCount xs =
  do
    ys <- words xs
    let zs = R.getAllTextMatches (ys =~ "\\d+|\\b[a-zA-Z']+\\b") :: [String]
    g <- L.group $ L.sort [map (C.toLower) w | w <- zs]
    return (head g, length g)
But it fails on the input "one fish two fish red fish blue fish". It outputs one count for each word, even the repeated ones, as if the sort and group aren't doing anything. Why?
I've read this answer, which basically does the same thing in a more advanced way using Control.Arrow.

You don't need words to split the line; the regex alone should achieve the desired splitting:
wordCount :: String -> [(String, Int)]
wordCount xs =
  do
    let zs = R.getAllTextMatches (xs =~ "\\d+|\\b[a-zA-Z']+\\b") :: [String]
    g <- L.group $ L.sort [map C.toLower w | w <- zs]
    return (head g, length g)

wordCount xs =
  do
    ys <- words xs
    let zs = R.getAllTextMatches (ys =~ "\\d+|\\b[a-zA-Z']+\\b") :: [String]
    g <- L.group $ L.sort [map (C.toLower) w | w <- zs]
    return (head g, length g)
You’re splitting the input xs into words by whitespace using words. You iterate over these in the list monad with the binding statement ys <- …. Then you split each of those words into subwords using the regular expression, of which there happens to be only one match in your example. You sort and group each of the subwords in a list by itself.
I believe you can essentially just delete the initial call to words:
wordCount xs =
  do
    let ys = R.getAllTextMatches (xs =~ "\\d+|\\b[a-zA-Z']+\\b") :: [String]
    g <- L.group $ L.sort [map C.toLower w | w <- ys]
    return (head g, length g)
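To see the difference in isolation, here is a minimal sketch that needs only base. It assumes plain whitespace splitting with words instead of the regex, and the names countPerWord and countAll are made up for illustration; only the placement of sort and group matters.

```haskell
import qualified Data.Char as C
import qualified Data.List as L

-- Mirrors the shape of the question's code: sort and group run once per
-- word, inside the list monad, so every group holds exactly one word.
countPerWord :: String -> [(String, Int)]
countPerWord xs = do
  w <- words xs
  g <- L.group (L.sort [map C.toLower w])
  return (head g, length g)

-- Sort and group run once, over the whole list of words.
countAll :: String -> [(String, Int)]
countAll xs = do
  g <- L.group (L.sort (map (map C.toLower) (words xs)))
  return (head g, length g)
```

For "one fish two fish red fish blue fish", countPerWord yields eight pairs that each have a count of 1, while countAll yields [("blue",1),("fish",4),("one",1),("red",1),("two",1)].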

Related

Histogram counting apostrophes as a word

I have to create a histogram of the 20 most frequent words in a text, excluding a list of 20 common words. My code and the result I get are below:
import Data.List(sort, group, sortBy)
toWordList = words
countCommonWords wordList = length (filter isCommon wordList)
  where isCommon word = elem word commonWords
dropCommonWords wordList = filter isUncommon wordList
  where isUncommon w = notElem w commonWords
commonWords = ["the","and","have","not","as","be","a","I","on", "you","to","in","it","with","do","of","that","for","he","at"]
countWords wordList = map (\w -> (head w, length w)) $ group $ sort wordList
compareTuples (w1, n1) (w2, n2) = if n1 < n2 then LT else if n1> n2 then GT else EQ
sortWords wordList = reverse $ sortBy compareTuples wordList
toAsteriskBar x = (replicate (snd x) '*') ++ " -> " ++ (fst x) ++ "\n"
makeHistogram wordList = concat $ map toAsteriskBar (take 20 wordList)
--Do word list
text = "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way--in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only. there were a king with a large jaw and a queen with a plain face, on the throne of England; there were a king with a large jaw and a queen with a fair face, on the throne of France. In both countries it was clearer than crystal to the lords of the State preserves of loaves and fishes, that things in general were settled for ever of."
main = do
  let wordlist = toWordList text
  putStrLn "Report:"
  putStrLn ("\t" ++ (show $ length wordlist) ++ " words")
  putStrLn ("\t" ++ (show $ countCommonWords wordlist) ++ " common words")
  putStrLn "\nHistogram of the most frequent words (excluding common words):\n"
  putStr $ makeHistogram $ sortWords $ countWords $ dropCommonWords $ wordlist
Result:
Report:
186 words
71 common words
Histogram of the most frequent words (excluding common words):
************ -> was
***** -> were
**** -> we
** -> us,
** -> times,
** -> throne
** -> there
** -> season
** -> queen
** -> large
** -> king
** -> jaw
** -> its
** -> had
** -> going
** -> face,
** -> epoch
** -> direct
** -> before
** -> all
Does anybody know why the counter is counting any word with an apostrophe attached eg. us, as a whole word?
In Brief
toWordList = words
This is the function I'd modify to sanitize your words. For example, toWordList = map (filter isAlpha) . words so you get only those characters in words that are alphabetical instead of all blocks of characters that are divided by spaces (which is what words does). EDIT: isAlpha is from the Data.Char module which you'd need to import. Edited the above snippet to add map too.
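As a rough sketch of that suggestion: the name toCleanWordList and the extra step that drops tokens left empty after filtering (such as "--") are my own additions, not part of the original code.

```haskell
import Data.Char (isAlpha)

-- Split on whitespace, keep only alphabetic characters in each token,
-- then discard tokens that became empty (e.g. a bare "--").
toCleanWordList :: String -> [String]
toCleanWordList = filter (not . null) . map (filter isAlpha) . words
```

With this, toCleanWordList "us, times, it's -- ok" gives ["us","times","its","ok"]. Note that the apostrophe in "it's" is removed as well, since isAlpha rejects it.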
Epilog
Moving forward, I'm just going to make some code comments because why not.
import Data.List(sort, group, sortBy)
Yay, using pre-existing code. You will probably also want comparing from Data.Ord.
countCommonWords wordList = length (filter isCommon wordList)
  where isCommon word = elem word commonWords
dropCommonWords wordList = filter isUncommon wordList
  where isUncommon w = notElem w commonWords
These operations are O(n * m), where n is the length of wordList and m is the length of commonWords. You could make this faster by using a Set if you desire.
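A possible Set-based version, assuming the containers package is available (it ships with GHC). The primed names and commonWordSet are hypothetical, chosen so they don't clash with the originals.

```haskell
import qualified Data.Set as Set

-- The same 20 words as commonWords, stored in a Set for O(log m) lookups.
commonWordSet :: Set.Set String
commonWordSet = Set.fromList
  ["the","and","have","not","as","be","a","I","on","you"
  ,"to","in","it","with","do","of","that","for","he","at"]

-- Count the words that appear in the common-word set.
countCommonWords' :: [String] -> Int
countCommonWords' = length . filter (`Set.member` commonWordSet)

-- Keep only the words that are not in the common-word set.
dropCommonWords' :: [String] -> [String]
dropCommonWords' = filter (`Set.notMember` commonWordSet)
```

With only 20 common words the gain is small; the point is that the lookup cost no longer grows linearly with the size of the common-word list.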
commonWords = ["the","and","have","not","as","be","a","I"
              ,"on","you","to","in","it","with","do","of","that"
              ,"for","he","at"]
countWords wordList = map (\w -> (head w, length w)) $ group $ sort wordList
A similar performance comment here. A common method is to use Data.Map.insertWith to keep a counter for each word.
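A minimal sketch of the insertWith approach, assuming Data.Map.Strict from containers. The name countWordsMap is hypothetical.

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as Map

-- Walk the list once, adding 1 to the counter stored under each word.
countWordsMap :: [String] -> [(String, Int)]
countWordsMap = Map.toList . foldl' bump Map.empty
  where
    bump m w = Map.insertWith (+) w 1 m
```

countWordsMap ["b","a","b"] returns [("a",1),("b",2)]; Map.toList yields the pairs in ascending key order, which matches what sort followed by group produced.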
compareTuples (w1, n1) (w2, n2) = if n1 < n2 then LT else if n1> n2 then GT else EQ
This is more easily spelled compareTuples = comparing snd, since the comparison is on the counts (the second components of the tuples).
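Going one step further, the reverse in sortWords can be folded into the comparison with Down from Data.Ord. This is only a sketch, and sortWordsDesc is a hypothetical name.

```haskell
import Data.List (sortBy)
import Data.Ord (Down (..), comparing)

-- Sort by count, largest first, without a separate reverse pass.
sortWordsDesc :: [(String, Int)] -> [(String, Int)]
sortWordsDesc = sortBy (comparing (Down . snd))
```

One difference to be aware of: sortBy is stable, so words with equal counts keep their incoming order here, whereas reverse after sortBy flips them. The histogram may therefore list tied words in a different order.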

Insert space after every punctuation sign in a String Haskell

I have this function that checks if a character is one of these punctuation signs.
checkpunctuation :: Char -> Bool
checkpunctuation c = c `elem` ['.', ',', '?', '!', ':', ';', '(', ')']
I have to write another function that adds a space after every punctuation sign:
format :: String -> String
I know how to add a space after a given number of characters, but I don't know how to add one after specific characters.
Simple recursive option:
format :: String -> String
format [] = []
format (x:xs) | checkpunctuation x = x : ' ' : format xs
              | otherwise = x : format xs
Another option is to use foldr with a helper function:
helper :: Char -> String -> String
helper x xs | checkpunctuation x = x : ' ' : xs
            | otherwise = x : xs
The helper checks if the first character is a punctuation. If so it inserts a space, otherwise it does not.
and then define format as:
format :: String -> String
format = foldr helper []
A sample call:
*Main> format "Hello? Goodbye! You say goodbye!! (and I say Hello)"
"Hello?  Goodbye!  You say goodbye! !  ( and I say Hello) "
This function works also on "infinite strings":
*Main> take 50 $ format $ cycle "Hello?Goodbye!"
"Hello? Goodbye! Hello? Goodbye! Hello? Goodbye! He"
So although we feed it a string that keeps cycle-ing, and thus never ends, we can derive the first 50 characters of the result.
There's probably a more elegant way to do it, but
format :: String -> String
format s = concat [if (checkpunctuation c) then (c:" ") else [c] | c <- s]
will work (thanks, @Shou Ya!).
Edit based on comment
To count the total length of post-formatted punctuation characters, you can use
sumLength :: [String] -> Int
sumLength strings = 2 * (sum $ fmap length (fmap (filter checkpunctuation) strings))
as it is twice the total number of punctuation characters (each one is followed by an inserted space).

Cutting specific chunks from a Haskell String

I'm trying to cut chunks from a list, with a given predicate. I would have preferred to use a double character, e.g. ~/, but have resolved to just using $. What I essentially want to do is this...
A: "Hello, my $name is$ Danny and I $like$ Haskell"
What I want to turn this into is this:
B: "Hello, my Danny and I Haskell"
So I want to strip everything in between the given symbol, $, or my first preference was ~/, if I can figure it out. What I tried was this:
s1 :: String -> String
s1 xs = takeWhile (/= '$') xs
s2 :: String -> String
s2 xs = dropWhile (/= '$') xs
s3 :: String -> String
s3 xs = s3 $ s2 $ s1 xs
This solution seems to just bug my IDE out (possibly infinite looping).
Solution:
s3 :: String -> String
s3 xs
  | '$' `notElem` xs = xs
  | otherwise = takeWhile (/= '$') xs ++ (s3 $ s1 xs)
s1 :: String -> String
s1 xs = drop 1 $ dropWhile (/= '$') $ tail $ snd $ break ('$'==) xs
This seems like a nice application for parsers. A solution using trifecta:
import Control.Applicative
import Data.Foldable
import Data.Functor
import Text.Trifecta
input :: String
input = "Hello, my $name is$ Danny and I $like$ Haskell"
cutChunk :: CharParsing f => f String
cutChunk = "" <$ (char '$' *> many (notChar '$') <* char '$')
cutChunk matches $, followed by 0 or more (many) non-$ characters, then another $. Then we use ("" <$) to make this parser's value always be the empty string, thus discarding all the characters that this parser matches.
includeChunk :: CharParsing f => f String
includeChunk = some (notChar '$')
includeChunk matches the text that we want to include in the result, which is anything that's not the $ character. It's important that we use some (matching one or more characters) and not many (matching zero or more characters) because we're going to include this parser within another many expression next; if this parser matched on the empty string, then that could loop infinitely.
chunks :: CharParsing f => f String
chunks = fold <$> many (cutChunk <|> includeChunk)
chunks is the parser for everything. Read <|> as "or", as in "parse either a cutChunk or an includeChunk". many (cutChunk <|> includeChunk) is a parser that produces a list of chunks e.g. Success ["Hello, my ",""," Danny and I ",""," Haskell"], so we fold the output to concatenate those chunks together into a single string.
result :: Result String
result = parseString chunks mempty input
The result:
Success "Hello, my  Danny and I  Haskell"
Your infinite loop comes from calling s3 recursively with no base case:
s3 :: String -> String
s3 xs = s3 $ s2 $ s1 xs
Adding a base case corrects the infinite loop:
s3 xs
  | '$' `notElem` xs = xs
  | otherwise = ...
This is not the whole answer. Think about what s1 actually does and where you use its return value:
s1 "hello $my name is$ ThreeFx" == "hello "
For further reference, see the break function:
break :: (a -> Bool) -> [a] -> ([a], [a])
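For illustration, here is one way break could be used to finish the job. This is a sketch rather than the asker's final code: cutChunks is a hypothetical name, and it assumes that an unmatched trailing $ simply discards the rest of the string.

```haskell
-- Hypothetical helper: drop every chunk delimited by a pair of '$'.
cutChunks :: String -> String
cutChunks s = case break (== '$') s of
  -- No '$' left: everything that remains is kept.
  (keep, [])       -> keep
  -- Found an opening '$': skip to just past the closing '$', then recurse.
  (keep, _ : rest) -> keep ++ cutChunks (drop 1 (dropWhile (/= '$') rest))
```

cutChunks "Hello, my $name is$ Danny and I $like$ Haskell" gives "Hello, my  Danny and I  Haskell". The spaces on both sides of each removed chunk survive, so the result contains double spaces.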
I think your logic is wrong; it is perhaps easier to write it in an elementary way:
Prelude> let pr xs = go xs True
Prelude|       where go [] _ = []
Prelude|             go (x:xs) f | x=='$' = go xs (not f)
Prelude|                         | f = x : go xs f
Prelude|                         | otherwise = go xs f
Prelude|
Prelude> pr "Hello, my $name is$ Danny and I $like$ Haskell"
"Hello, my  Danny and I  Haskell"
Explanation: the flag f keeps track of the state (pass mode or skip mode). If the current character is the delimiter, it is skipped and the state is flipped.

Haskell - Rename duplicate values in a list of lists

I have a list of lists of strings, e.g.:
[["h","e","l","l","o"], ["g","o","o","d"], ["w","o","o","r","l","d"]]
I want to rename values that repeat across sublists: every value that already occurred in an earlier sublist should be replaced by a new randomly generated value that does not already exist in the list, and the same replacement should be used for it throughout that sublist. A possible result might be:
[["h","e","l","l","o"], ["g","t","t","d"], ["w","s","s","r","z","f"]]
I already have a function that can randomly generate a string of size one called randomStr:
randomStr :: String
randomStr = take 1 $ randomRs ('a','z') $ unsafePerformIO newStdGen
Presuming you want to do what I've outlined in my comment below, it's best to break this problem up into several smaller parts to tackle one at a time. I would also recommend leveraging common modules in base and containers, since it will make the code much simpler and faster. In particular, the modules Data.Map and Data.Sequence are very useful in this case. Data.Map I would say is the most useful here, as it has some very useful functions that would otherwise be difficult to write by hand. Data.Sequence is used for efficiency purposes at the end, as you'll see.
First, imports:
import Data.List (nub)
import Data.Map (Map)
import Data.Sequence (Seq, (|>), (<|))
import qualified Data.Map as Map
import qualified Data.Sequence as Seq
import Data.Foldable (toList)
import System.Random (randomRIO)
import Control.Monad (forM, foldM)
import Control.Applicative ((<$>))
Data.Foldable.toList is needed since Data.Sequence does not have a toList function, but Foldable provides one that will work. On to the code. We first want to be able to take a list of Strings and find all the unique elements in it. For this, we can use nub:
lettersIn :: [String] -> [String]
lettersIn = nub
I like providing my own names for functions like this; it can make the code more readable.
Now that we can get all the unique characters, we want to be able to assign each a random character:
makeRandomLetterMap :: [String] -> IO (Map String String)
makeRandomLetterMap letters
  = fmap Map.fromList
  $ forM (lettersIn letters) $ \l -> do
      newL <- randomRIO ('a', 'z')
      return (l, [newL])
Here we get a new random character and essentially zip it up with our list of letters, then we fmap (<$>) Map.fromList over that result. Next, we need to be able to use this map to replace letters in a list. If a letter isn't found in the Map, we just want the letter back. Luckily, Data.Map has the findWithDefault function which is perfect for this situation:
replaceLetter :: Map String String -> String -> String
replaceLetter m letter = Map.findWithDefault letter letter m
replaceAllLetters :: Map String String -> [String] -> [String]
replaceAllLetters m letters = map (replaceLetter m) letters
Since we want to be able to update this map with new letters that have been encountered in each sublist, overwriting previously encountered letters as needed, we can use Data.Map.union. Since union favors its first argument, we need to flip it:
updateLetterMap :: Map String String -> [String] -> IO (Map String String)
updateLetterMap m letters = flip Map.union m <$> makeRandomLetterMap letters
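The left bias of union is easy to check on its own. In this small sketch the two maps and all names are made up for illustration.

```haskell
import qualified Data.Map as Map

-- Hypothetical maps: "o" has a replacement in both of them.
older, newer :: Map.Map String String
older = Map.fromList [("o", "t"), ("l", "s")]
newer = Map.fromList [("o", "q"), ("d", "f")]

-- union prefers the entry from its first argument when a key collides.
keepOlder :: [(String, String)]
keepOlder = Map.toList (Map.union older newer)

-- Flipped, as in updateLetterMap: the freshly generated entry wins.
keepNewer :: [(String, String)]
keepNewer = Map.toList (flip Map.union older newer)
```

keepOlder is [("d","f"),("l","s"),("o","t")], while keepNewer is [("d","f"),("l","s"),("o","q")].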
Now we have all the tools needed to tackle the problem at hand:
replaceDuplicatesRandomly :: [[String]] -> IO [[String]]
replaceDuplicatesRandomly [] = return []
For the base case, just return an empty list.
replaceDuplicatesRandomly (first:rest) = do
  m <- makeRandomLetterMap first
For a non-empty list, make the initial map off the first sublist
  (_, seqTail) <- foldM go (m, Seq.empty) rest
Fold over the rest, starting with an empty sequence and the first map, and extract the resulting sequence
  return $ toList $ first <| seqTail
Then convert the sequence to a list after prepending the first sublist (it doesn't get changed by this function). The go function is pretty simple too:
  where
    go (m, acc) letters = do
      let newLetters = replaceAllLetters m letters
      newM <- updateLetterMap m letters
      return (newM, acc |> newLetters)
It takes the current map m and an accumulation of all the sublists processed so far acc along with the current sublist letters, replaces the letters in said sublist, builds a new map for the next iteration (newM), and then returns the new map along with the accumulation of everything processed, i.e. acc |> newLetters. All together, the function is
replaceDuplicatesRandomly :: [[String]] -> IO [[String]]
replaceDuplicatesRandomly [] = return []
replaceDuplicatesRandomly (first:rest) = do
  m <- makeRandomLetterMap first
  (_, seqTail) <- foldM go (m, Seq.empty) rest
  return $ toList $ first <| seqTail
  where
    go (m, acc) letters = do
      let newLetters = replaceAllLetters m letters
      newM <- updateLetterMap m letters
      return (newM, acc |> newLetters)
It's always better to keep impure and pure computations separated.
You cannot use replacement letters that already occur in the list, so you need a string of fresh letters:
fresh :: [String] -> String
fresh xss = ['a'..'z'] \\ foldr union [] xss
This function replaces one letter with another in a string:
replaceOne :: Char -> Char -> String -> String
replaceOne y y' = map (\x -> if x == y then y' else x)
This function replaces one letter each time with a new letter for every string in a list of strings:
replaceOnes :: Char -> String -> [String] -> (String, [String])
replaceOnes y = mapAccumL (\(y':ys') xs ->
  if y `elem` xs
    then (ys', replaceOne y y' xs)
    else (y':ys', xs))
For example
replaceOnes 'o' "ijklmn" ["hello", "good", "world"]
returns
("lmn",["helli","gjjd","wkrld"])
A bit tricky one:
replaceMany :: String -> String -> [String] -> (String, [String])
replaceMany ys' ys xss = runState (foldM (\ys' y -> state $ replaceOnes y ys') ys' ys) xss
This function replaces each letter from ys each time with a new letter from ys' for every string in xss.
For example
replaceMany "mnpqstuvxyz" "lod" ["hello", "good", "world"]
returns
("vxyz",["hemmp","gqqt","wsrnu"])
i.e.
'l's in "hello" are replaced by the first letter in "mnpqstuvxyz"
'l' in "world" is replaced by the second letter in "mnpqstuvxyz"
'o' in "hello" is replaced by the third letter in "mnpqstuvxyz"
'o's in "good" are replaced by the fourth letter in "mnpqstuvxyz"
...
'd' in "world" is replaced by the seventh letter in "mnpqstuvxyz"
This function goes through a list of strings and replaces all letters from the head by fresh letters, that ys' contains, for each string in the rest of the list.
replaceDuplicatesBy :: String -> [String] -> [String]
replaceDuplicatesBy ys' [] = []
replaceDuplicatesBy ys' (ys:xss) = ys : uncurry replaceDuplicatesBy (replaceMany ys' ys xss)
I.e. it does what you want, but without any randomness — just picks fresh letters from a list.
All described functions are pure. Here is an impure one:
replaceDuplicates :: [String] -> IO [String]
replaceDuplicates xss = flip replaceDuplicatesBy xss <$> shuffle (fresh xss)
I.e. generate a random permutation of a string, that contains fresh letters, and pass it to replaceDuplicatesBy.
You can take the shuffle function from https://www.haskell.org/haskellwiki/Random_shuffle
And the final test:
main = replicateM_ 3 $ replaceDuplicates ["hello", "good", "world"] >>= print
prints
["hello","gxxd","wcrzy"]
["hello","gyyd","wnrmf"]
["hello","gmmd","wvrtx"]
The whole code (without shuffle): http://lpaste.net/115763
I think this is bound to raise more questions than it answers.
import Control.Monad.State
import Data.List
import System.Random
mapAccumLM _ s [] = return (s, [])
mapAccumLM f s (x:xs) = do
  (s', y) <- f s x
  (s'', ys) <- mapAccumLM f s' xs
  return (s'', y:ys)
pick excluded for w = do
  a <- pick' excluded
  putStrLn $ "replacement for " ++ show for ++ " in " ++ show w ++ " excluded: " ++ show excluded ++ " = " ++ show a
  return a
-- | XXX -- can loop indefinitely
pick' excluded = do
  a <- randomRIO ('a','z')
  if elem a excluded
    then pick' excluded
    else return a
transform w = do
  globallySeen <- get
  let go locallySeen ch =
        case lookup ch locallySeen of
          Nothing -> if elem ch globallySeen
                       then do let excluded = globallySeen ++ (map snd locallySeen)
                               a <- lift $ pick excluded ch w
                               return ( (ch, a):locallySeen, a)
                       else return ( (ch,ch):locallySeen, ch )
          Just ch' -> return (locallySeen, ch')
  (locallySeen, w') <- mapAccumLM go [] w
  let globallySeen' = w' ++ globallySeen
  put globallySeen'
  return w'
doit ws = runStateT (mapM transform ws) []
main = do
  ws' <- doit [ "hello", "good", "world" ]
  print ws'

Do some replacement in Haskell List Comprehensions

My question: given a string such as Hello, today is a Nice Day!!, how can I get rid of the spaces and punctuation and also replace the uppercase letters with lowercase ones?
I know how to delete characters but not how to replace them.
Sorry, I don't know how to work with strings, only numbers.
testList xs = [if x = [,|.|?|!] then " " | x<-xs]
import Data.Char
If you want to convert the punctuation to spaces and the characters from upper case to lower case:
testList xs = [if x `elem` ",.?!" then ' ' else toLower x | x<-xs]
Example: testList "TeST,LiST!" == "test list "
If you want to delete the punctuation and convert the characters from upper case to lower case:
testList2 xs = [toLower x | x<-xs, not (x `elem` ",.?!")]
Example: testList2 "Te..S,!t LiS?T" == "test list"
If you don't want to or cannot import Data.Char, this is an implementation of toLower:
toLower' :: Char -> Char
toLower' char
  | isNotUppercase = char -- no change required
  | otherwise = toEnum (codeChar + diffLowerUpperChar) -- char lowered
  where
    codeChar = fromEnum char -- each character has a numeric code
    code_A = 65
    code_Z = 90
    code_a = 97
    isNotUppercase = codeChar < code_A || codeChar > code_Z
    diffLowerUpperChar = code_a - code_A
I haven't written Haskell code in a long time, but the following should replace the invalid characters with a space and also convert the characters from uppercase to lowercase:
import Data.Char
replace invalid xs = [if elem x invalid then ' ' else toLower x | x <- xs]
Another way of doing the same:
repl invalid [] = []
repl invalid (x:xs) | elem x invalid = ' ' : repl invalid xs
                    | otherwise = toLower x : repl invalid xs
You can call the replace (or repl) function like this:
replace ",.?!" "Hello, today is a Nice Day!!"
The above code will return:
"hello  today is a nice day  "
Edit: I'm using the toLower function from Data.Char in Haskell, but if you want to write it by yourself, check here on Stack Overflow. That question has been asked before.
You will find the functions you need in Data.Char:
import Data.Char
process str = [toLower c | c <- str , isAlpha c]
Though personally, I think the function compositional approach is clearer:
process = map toLower . filter isAlpha
To get rid of the punctuation you can use a filter like this one
[x | x<-[1..10], x `mod` 2 == 0]
The "if" you are using won't filter. Putting an if in the "map" part of a list comprehension only serves to choose between two options; you can't filter elements out there.
As for converting things to lowercase, it's the same trick you can already pull off with numbers:
[x*2 | x <- [1..10]]
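Carried over to strings, the two pieces combine into a single comprehension. This is a sketch: the name keepAndLower and the exact set of characters to drop (space plus ",.?!") are assumptions taken from the question's example.

```haskell
import Data.Char (toLower)

-- The guard after the comma filters characters out;
-- the expression before the bar maps the survivors to lowercase.
keepAndLower :: String -> String
keepAndLower xs = [toLower x | x <- xs, x `notElem` " ,.?!"]
```

keepAndLower "Hello, today is a Nice Day!!" evaluates to "hellotodayisaniceday".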
Here's a version without importing modules, using fromEnum and toEnum to choose which characters to allow:
testList xs =
  filter (\x -> elem (fromEnum x) ([97..122] ++ [32] ++ [48..57])) $ map toLower' xs
  where toLower' x = if elem (fromEnum x) [65..90]
                       then toEnum (fromEnum x + 32)::Char
                       else x
OUTPUT:
*Main> testList "Hello, today is a Nice Day!!"
"hello today is a nice day"
For a module-less replace function, something like this might work:
myReplace toReplace xs = map myReplace' xs where
  myReplace' x
    | elem (fromEnum x) [65..90] = toEnum (fromEnum x + 32)::Char
    | elem x toReplace = ' '
    | otherwise = x
OUTPUT:
*Main> myReplace "!," "Hello, today is a Nice Day!! 123"
"hello  today is a nice day   123"
Using Applicative Style
A direct quote from the book "Learn You a Haskell for Great Good!":
Using the applicative style on lists is often a good replacement for
list comprehensions. In the second chapter, we wanted to see all the
possible products of [2,5,10] and [8,10,11], so we did this:
[ x*y | x <- [2,5,10], y <- [8,10,11]]
We're just drawing from two lists and applying a function between
every combination of elements. This can be done in the applicative
style as well:
(*) <$> [2,5,10] <*> [8,10,11]
This seems clearer to me, because it's easier to see that we're just
calling * between two non-deterministic computations. If we wanted all
possible products of those two lists that are more than 50, we'd just
do:
filter (>50) $ (*) <$> [2,5,10] <*> [8,10,11]
-- [55,80,100,110]
Functors, Applicative Functors and Monoids