I am writing a program that reads a text file, splits it into words and stores them in a list. I have been trying to write a function that takes the whole text of the file as a String and removes punctuation such as ";", "," and ".", but haven't had any luck so far. The program works without the punctuation function, but not when I apply it to the input of toWords, as in toWords (removePunc fileContents). Could someone look at what I have done and tell me what I am doing wrong?
Here is the code that I have so far:
main = do
    contents <- readFile "LargeTextFile.txt"
    let lowContents = map toLower contents
    let outStr = countWords (lowContents)
    let finalStr = sortOccurrences (outStr)
    let reversedStr = reverse finalStr
    putStrLn "Word | Occurrence "
    mapM_ (printList) reversedStr
-- Counts all the words.
countWords :: String -> [(String, Int)]
countWords fileContents = countOccurrences (toWords (removePunc fileContents))
-- Splits words and removes linking words.
toWords :: String -> [String]
toWords s = filter (\w -> w `notElem` ["an","the","for"]) (words s)
-- Remove punctuation from text String.
removePunc :: String -> String
removePunc xs = x | x <- xs, not (x `elem` ",.?!-:;\"\'")
-- Counts, how often each string in the given list appears.
countOccurrences :: [String] -> [(String, Int)]
countOccurrences xs = map (\xs -> (head xs, length xs)) . group . sort $ xs
-- Sort list in order of occurrences.
sortOccurrences :: [(String, Int)] -> [(String, Int)]
sortOccurrences sort = sortBy (comparing snd) sort
-- Prints the list in a format.
printList a = putStrLn((fst a) ++ " | " ++ (show $ snd a))
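You are missing the square brackets around the list comprehension. You probably want:
removePunc xs = [ x | x <- xs, not (x `elem` ",.?!-:;\"\'") ]
Without the brackets, the definition does not parse.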
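The same function can also be written with filter instead of a comprehension. This is only a sketch: cleanWords is a hypothetical helper name that is not in the question, and the punctuation set is copied from the question as-is.

```haskell
import Data.Char (toLower)

-- Keep every character that is not in the punctuation set.
removePunc :: String -> String
removePunc = filter (`notElem` ",.?!-:;\"'")

-- Hypothetical helper: lower-case the text, drop punctuation, split into words.
cleanWords :: String -> [String]
cleanWords = words . removePunc . map toLower
```

Note that deleting '-' glues hyphenated words together; whether that is wanted depends on the text.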
I'm trying to write a function that takes a string and returns it as a list of words (like the built-in words function). So far I've written:
ord :: String -> [String]
ord [] = []
ord xs = let
    ys = groupBy (\x y -> y /= ' ') xs
  in filter (not . null) ys
I thought this would get rid of the empty strings in the list, but I only get this output.
input:
ord "aa b       c   -     dd"
output:
["aa"," b"," "," "," "," "," "," "," c"," "," "," -"," "," "," "," dd"]
when this is the output I want:
["aa", "b", "c", "-", "dd"]
I get the same result if I write
ord :: String -> [String]
ord [] = []
ord xs = filter (not . null) ys
  where
    ys = groupBy (\x y -> y /= ' ') xs
How do I rewrite this code so that the list is rid of its empty strings?
Or use the proper syntax? I'm just learning Haskell and I'm still having trouble with the syntax...
groupBy puts x and y in the same group whenever the predicate holds for them. Your predicate, however, only checks that y is not a space, so every space starts a new group and stays attached to the characters that follow it.
You can instead change the grouping predicate so that x and y end up in the same group when both are spaces or both are non-spaces:
import Data.Char(isSpace)
ord :: String -> [String]
ord [] = []
ord xs = let
    ys = groupBy (\x y -> isSpace x == isSpace y) xs
  in filter (not . null) ys
or shorter:
import Data.Char(isSpace)
import Data.Function(on)
ord :: String -> [String]
ord [] = []
ord xs = let
    ys = groupBy (on (==) isSpace) xs
  in filter (not . null) ys
Now we retrieve:
Prelude Data.List> ord "aa b c - dd"
["aa"," ","b"," ","c"," ","-"," ","dd"]
Of course, we still do not obtain the expected result. Instead of filtering out empty strings, we can filter out strings that contain only whitespace characters:
import Data.Char(isSpace)
import Data.Function(on)
ord :: String -> [String]
ord [] = []
ord xs = let
    ys = groupBy (on (==) isSpace) xs
  in filter (not . all isSpace) ys
We do not need to cover the empty case separately, since groupBy on an empty list produces an empty list. We can thus write a one-liner to do the processing:
import Data.Char(isSpace)
import Data.Function(on)
ord :: String -> [String]
ord = filter (not . all isSpace) . groupBy (on (==) isSpace)
Then we obtain the expected result:
Prelude Data.List Data.Char> ord "aa b c - dd"
["aa","b","c","-","dd"]
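To see why the two predicates behave so differently, it may help to run them side by side. The sketch below relies on the fact that Data.List.groupBy compares each element against the first element of the current group; brokenGroups and runs are illustrative names, not part of the question.

```haskell
import Data.Char (isSpace)
import Data.Function (on)
import Data.List (groupBy)

-- The question's predicate: only the second argument is inspected,
-- so a new group starts at every space.
brokenGroups :: String -> [String]
brokenGroups = groupBy (\_ y -> y /= ' ')

-- The symmetric predicate: maximal runs of spaces or of non-spaces.
runs :: String -> [String]
runs = groupBy ((==) `on` isSpace)
```

For the input "aa b  c", brokenGroups yields ["aa"," b"," "," c"], while runs yields ["aa"," ","b","  ","c"].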
I wouldn't bother with groupBy at all here. In particular, there's no need to build lists of spaces just to throw them away. Let's start with a function that drops initial spaces and then grabs everything to the first space:
grab :: String -> (String, String)
grab = break isSpace . dropWhile isSpace
Note that the first component of grab xs will be empty if and only if all the elements of xs are spaces.
Now we can write
myWords :: String -> [String]
myWords xs = case grab xs of
  ([], _) -> []
  (beginning, end) -> beginning : myWords end
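For experimenting, here is the same pair of functions as a self-contained sketch with the import they need (isSpace comes from Data.Char). The definitions are repeated from above; only the import and the comments are added.

```haskell
import Data.Char (isSpace)

-- Skip leading whitespace, then split at the next whitespace character.
grab :: String -> (String, String)
grab = break isSpace . dropWhile isSpace

-- Repeatedly grab a word until only whitespace (or nothing) is left.
myWords :: String -> [String]
myWords xs = case grab xs of
  ([], _)          -> []
  (beginning, end) -> beginning : myWords end
```

On an input such as "  ab cd", grab returns ("ab"," cd"), and on a string of only spaces it returns ("",""), which is what ends the recursion.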
My input is:
Plaster ["BD..", ".GA.D", ".FEG", "ABDCF", "E..."]
What I'm trying to get:
["BD..", ".GA.D", ".FEG", "ABDCF", "E..."]
My code for now:
go = do --print "Enter file name"
    --path <- getLine
    file <- (readFile "1.txt")
    print file
    let list = consume file
    print list
    let content = (wordsWhen (=='"') list)
    print content
    print (content !! 0)
    print (content !! 1)
    print (content !! 2)
wordsWhen :: (Char -> Bool) -> String -> [String]
wordsWhen p s = case dropWhile p s of
    "" -> []
    ", " -> []
    s' -> w : wordsWhen p s''
      where (w, s'') = break p s'
consume [] = []
consume ('[':xs) = consume' xs
consume (_ :xs) = consume xs
consume' [] = []
consume' (']':xs) = []
consume' (x :xs) = x : consume' xs
So what I'm doing is:
1. Read the file from its destination (hardcoded for now, for testing)
2. Get rid of the word "Plaster" with consume
3. Get all strings from the file with wordsWhen
I tried different separators for wordsWhen, but I can't get what I need. In its current form the output is:
"Plaster [\"BD..\", \".GA.D\", \".FEG\", \"ABDCF\", \"E...\"]"
"\"BD..\", \".GA.D\", \".FEG\", \"ABDCF\", \"E...\""
["BD..",", ",".GA.D",", ",".FEG",", ","ABDCF",", ","E..."]
"BD.."
", "
".GA.D"
Which is quite close, but I want to get rid of the words that contain only commas. I can change the separator to a comma (which it is supposed to be, I think), but then the output comes with all these backslashes and quotation marks, like this:
"Plaster [\"BD..\", \".GA.D\", \".FEG\", \"ABDCF\", \"E...\"]"
"\"BD..\", \".GA.D\", \".FEG\", \"ABDCF\", \"E...\""
["\"BD..\""," \".GA.D\""," \".FEG\""," \"ABDCF\""," \"E...\""]
"\"BD..\""
" \".GA.D\""
" \".FEG\""
Is there a way to fix my code? Or should I do it in a different way?
EDIT: As it is my exercise, I can only use standard types and functions.
Well, you could cheat by defining a datatype with a Read instance that matches your existing input:
{-# OPTIONS_GHC -Wall -Werror -Wno-name-shadowing #-}
module Main where
data Input = Plaster [String] deriving (Read, Show)
main :: IO ()
main = do
  Plaster xs <- readIO =<< readFile "1.txt"
  _ <- traverse print (zip [0 :: Int ..] xs)
  return ()
This works perfectly for me with ghc-8.0.2:
$ cat "1.txt"
Plaster ["BD..", ".GA.D", ".FEG", "ABDCF", "E..."]
$ ghc --make SO44269043.hs && ./SO44269043
(0,"BD..")
(1,".GA.D")
(2,".FEG")
(3,"ABDCF")
(4,"E...")
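The derived Read instance can also be used from pure code through reads, which makes a malformed file easier to handle than with readIO. A minimal sketch, assuming that trailing whitespace after the list should be accepted; parseInput is a hypothetical helper name:

```haskell
data Input = Plaster [String] deriving (Read, Show)

-- Succeeds only if the whole input is one Plaster value,
-- optionally followed by whitespace (assumption: that is acceptable).
parseInput :: String -> Maybe [String]
parseInput s = case reads s of
  [(Plaster xs, rest)] | all (`elem` " \t\r\n") rest -> Just xs
  _                                                  -> Nothing
```

A caller can then pattern match on Just and Nothing instead of catching an IO exception.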
Alternately, you could define your own read instance:
data Input = Plaster [String]
instance Read Input where
  readsPrec p = readParen (p >= 10) . runR $ do
    Plaster <$> (string "Plaster" *> many1 whitespace *> R readList)
If you're unfamiliar with <$> and *>, it may be a little easier to read this as
  readsPrec p = readParen (p >= 10) . runR $ do
    _ <- string "Plaster"
    _ <- many1 whitespace
    xs <- R readList
    return (Plaster xs)
Even without imports, it's not a lot of code to define a parser type R, basically just a wrapper that lets you define a monad instance for String -> [(a, String)]:
newtype R a = R { runR :: ReadS a }
instance Functor R where
  fmap f = R . fmap (map (\(a, s) -> (f a, s))) . runR
instance Applicative R where
  pure a = R $ \s -> [(a, s)]
  mf <*> ma = R $ \s -> do
    (f, s) <- runR mf s
    (a, s) <- runR ma s
    return (f a, s)
instance Monad R where
  m >>= f = R $ \s -> do
    (a, s) <- runR m s
    runR (f a) s
The >>= (or bind) operator just means "parse some of the string
with the parser on the left, then parse the rest of the string
with the parser obtained by passing the resulting value to the function on the right."
We get R readList :: R [String] for free now, so all we need to do
is skip the initial "Plaster" and any whitespace between that and the
list of strings:
string :: String -> R String
string = traverse char
many1 :: R a -> R [a]
many1 r = loop where
  loop = (:) <$> r <*> (loop <|> return [])
whitespace :: R Char
whitespace = char ' ' <|> char '\t' <|> char '\n' <|> char '\r'
Normally we'd use <|> from Control.Applicative (it is defined in GHC.Base), but it's not hard to define a one-off here. Basically
r <|> r' means "try to parse with r and with r', and keep the results of both", so r' still supplies a parse when r fails
(<|>) :: R a -> R a -> R a
r <|> r' = R $ \s -> runR r s ++ runR r' s
And now all we need is the ability to match a single character:
char :: Char -> R Char
char c = R $ \s -> case s of
  (c' : s) | c == c' -> [(c, s)]
  _ -> []
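To try these building blocks in isolation, here is a small self-contained sketch. It restates R, char, string and <|> from this answer; the Functor and Applicative instances are written as list comprehensions, which is my own variation and only assumed to be equivalent to the do-notation versions above.

```haskell
newtype R a = R { runR :: String -> [(a, String)] }

instance Functor R where
  fmap f (R p) = R $ \s -> [ (f a, rest) | (a, rest) <- p s ]

instance Applicative R where
  pure a = R $ \s -> [(a, s)]
  R pf <*> R pa = R $ \s ->
    [ (f a, s2) | (f, s1) <- pf s, (a, s2) <- pa s1 ]

-- Match exactly one given character.
char :: Char -> R Char
char c = R $ \s -> case s of
  (c' : rest) | c == c' -> [(c, rest)]
  _                     -> []

-- Match a fixed string, one character at a time.
string :: String -> R String
string = traverse char

-- Collect the parses of both alternatives.
(<|>) :: R a -> R a -> R a
r <|> r' = R $ \s -> runR r s ++ runR r' s
```

Running runR (string "ab" <|> string "a") "abc" gives [("ab","c"),("a","bc")]: both alternatives succeed, and each result carries its own remaining input.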
If even using Prelude.readList is too easy, we can define our own parsers for lists and quoted strings:
  readsPrec p = readParen (p >= 10) . runR $ do
    Plaster <$> (string "Plaster" *> many1 whitespace *> listOf quotedString)
Where lists just have a leading '[', a trailing ']', and some number of delimited terms:
listOf :: R a -> R [a]
listOf term = char '[' *> (term `sepBy` string ", ") <* char ']'
sepBy :: R a -> R b -> R [a]
sepBy term delim = sepBy1 term delim <|> return []
sepBy1 :: R a -> R b -> R [a]
sepBy1 term delim = loop where
  loop = (:) <$> term <*> ((delim *> loop) <|> return [])
Similarly, a quoted string just has a leading '"', a trailing '"', and some number of escaped characters:
quotedString :: R String
quotedString = char '"' *> many escapedChar <* char '"'
many :: R a -> R [a]
many r = many1 r <|> return []
escapedChar :: R Char
escapedChar = R $ \s -> case s of
  '\\' : '\\' : s -> [('\\', s)]
  '\\' : '"' : s -> [('"', s)]
  c : s | c /= '\\' && c /= '"' -> [(c, s)]
  _ -> []
It's worth noting the similarity between many/many1 and sepBy/sepBy1 - if we were really
lazy, we could define one in terms of the other:
many1 r = r `sepBy1` return ()
term `sepBy1` delim = (:) <$> term <*> many (delim *> term)
This is how it could be done:
import System.Environment
import System.IO
import Data.Maybe
import Text.Read
readStringList :: String -> Maybe [String]
readStringList = readMaybe
main = do --print "Enter file name"
    handle <- openFile "1.txt" ReadMode
    hSeek handle AbsoluteSeek 8
    file <- hGetContents handle
    let list = fromJust (readStringList file )
    print list
    let filterThis = "," :: String
    let filtered = filter (/=filterThis) list
    print filtered
To get first command line argument, use getArgs.
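A sketch of how that could look, taking the file name from the first command-line argument and avoiding fromJust. The names parsePlasterLine and runFromArgs are hypothetical, and the code assumes the file starts with exactly "Plaster " (8 characters), as in the hSeek version above.

```haskell
import System.Environment (getArgs)
import Text.Read (readMaybe)

-- Strip the leading "Plaster " and read the rest as a Haskell list literal.
parsePlasterLine :: String -> Maybe [String]
parsePlasterLine s = case splitAt 8 s of
  ("Plaster ", rest) -> readMaybe rest
  _                  -> Nothing

-- Hypothetical entry point: the file name is the first command-line argument.
runFromArgs :: IO ()
runFromArgs = do
  args <- getArgs
  case args of
    (path : _) -> readFile path >>= print . parsePlasterLine
    []         -> putStrLn "usage: pass the input file name as the first argument"
```

Returning Maybe lets the caller decide what to do with malformed input instead of crashing inside fromJust.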
Here is a quick and dirty parser.
Be careful: it only works for well-formed input, is not performant, and the code is not factored. But there is no cheating ;)
Maybe it could give you some inspiration to solve your exercise.
plaster :: String -> String
plaster ('P':'l':'a':'s':'t':'e':'r':' ':xs) = xs
plaster s = undefined
brackets :: String -> String
brackets ('[':xs) = brackets xs
brackets (x:']':_) = [x]
brackets (x:xs) = x:brackets xs
quotes :: String -> String
quotes ('"':xs) = quotes xs
quotes (x:'"':_) = [x]
quotes (x:xs) = x:quotes xs
sepByComma :: String -> [String]
sepByComma s = go s ""
  where
    go [] acc = [acc]
    go (',':' ':xs) acc = [acc] ++ go xs ""
    go (x:xs) acc = go xs (acc ++ [x])
parse :: String -> [String]
parse s = map quotes . sepByComma . brackets . plaster $ s
Here is an alternative using only the very basics, no Monads, Functors or Applicative operators.
main :: IO()
main = do
  input <- getLine
  let output = parse input
  print output
parse :: String -> [String]
parse = map stripQuotes . parse' . tokenize []
  where
    parse' :: [String] -> [String]
    -- If the input matches the pattern, call parseList on the inner tokens.
    -- Does not nest brackets! This is a simple regex match.
    parse' ("Plaster":"[":tokens) | last tokens == "]" =
      parseList [] (removeLast tokens)
    parse' _ = error "The input does not have the form \"Plaster [...]\"."
    parseList :: [String] -> [String] -> [String]
    -- Empty list.
    parseList tokens [] = tokens
    -- Unexpected tokens.
    parseList _ (",":_) = error "Unexpected comma."
    parseList _ ("[":_) = error "No support for nested brackets."
    parseList _ ("]":_) = error "Unexpected input after \"]\"."
    -- One-element list.
    parseList tokens [x] = tokens ++ [x]
    -- Comma-separated list with at least two elements.
    parseList tokens (x:",":y:ys) = parseList (tokens ++ [x]) (y:ys)
    -- Comma at end of list, so we don’t want to give the "expected comma" error!
    parseList _ [_,","] = error "Extra comma at end of list."
    -- More than one element not separated by commas.
    parseList _ (x:_) = error $ "Expected comma after \"" ++ x ++ "\"."
stripQuotes :: String -> String
stripQuotes ('"':xs) | last xs == '"' = removeLast xs
stripQuotes xs = error $ "Expected string literal instead of " ++ xs ++ "."
removeLast :: [a] -> [a]
removeLast xs = take ((length xs) - 1) xs
whitespace :: [Char]
whitespace = [' ', '\n', '\t'] -- Incomplete, but sufficient.
isWhitespace :: Char -> Bool
isWhitespace c = elem c whitespace
tokenize :: [String] -> String -> [String]
-- If we’ve consumed all the input, we’re done.
tokenize tokens [] = tokens
-- We’d need something a little more complicated for longer operators:
tokenize tokens ('[':xs) = tokenize (tokens ++ ["["]) xs
tokenize tokens (']':xs) = tokenize (tokens ++ ["]"]) xs
tokenize tokens (',':xs) = tokenize (tokens ++ [","]) xs
-- Not currently processing a token, so skip whitespace.
-- Otherwise, start a new token.
tokenize tokens (x:xs) | isWhitespace x = tokenize tokens xs
                       | otherwise = tokenize' tokens [x] xs
  where
    tokenize' :: [String] -> String -> String -> [String]
    -- If we’ve consumed all the input, the current token is the last.
    tokenize' ts t [] = ts ++ [t]
    -- If we encounter an operator, it is the token after the current one.
    tokenize' ts t ('[':ys) = tokenize (ts ++ [t] ++ ["["]) ys
    tokenize' ts t (']':ys) = tokenize (ts ++ [t] ++ ["]"]) ys
    tokenize' ts t (',':ys) = tokenize (ts ++ [t] ++ [","]) ys
    -- Whitespace means the current token is complete.
    -- Otherwise, append y to the current token and continue.
    tokenize' ts t (y:ys) | isWhitespace y = tokenize (ts ++ [t]) ys
                          | otherwise = tokenize' ts (t ++ [y]) ys
You wouldn’t do this in production code; this is simple enough to do with a regex, and parsing is (more or less) a solved problem. Parser combinators are the trendy way to go.
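For comparison, here is roughly what the parser-combinator route could look like with Text.ParserCombinators.ReadP, which ships with base. This is a sketch under simplifying assumptions: strings contain no escaped quotes, and there is no whitespace directly inside the brackets. The names plasterP and parsePlaster are made up for the example.

```haskell
import Text.ParserCombinators.ReadP
  (ReadP, between, char, eof, munch, readP_to_S, sepBy, skipSpaces, string)

-- Parses: Plaster ["...", "...", ...]
plasterP :: ReadP [String]
plasterP = do
  _ <- string "Plaster"
  skipSpaces
  xs <- between (char '[') (char ']') (quoted `sepBy` (char ',' >> skipSpaces))
  skipSpaces
  eof
  return xs
  where
    -- Assumption: no escape sequences inside the quotes.
    quoted = between (char '"') (char '"') (munch (/= '"'))

-- Accept only an unambiguous parse that consumes the whole input.
parsePlaster :: String -> Maybe [String]
parsePlaster s = case readP_to_S plasterP s of
  [(xs, "")] -> Just xs
  _          -> Nothing
```

Compared with the hand-written tokenizer, malformed input simply yields Nothing rather than a specific error message.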
I want to write a while loop that concatenates the strings in the list xs until I find an empty string, but it seems that in Haskell we can neither increment an Int nor write a while loop.
The following is only pseudocode, so how can I actually implement my solution in Haskell?
prt :: String -> [String] -> Int -> String
prt str xs x = do
    while((xs !! (x)) /= "")
        str = str ++ (xs !! (x++))
Forget array indexes: they are often not needed. All you have to do for your task is getting the longest prefix of your list containing only non-empty strings.
takeWhile (not . null) xs
-- or
takeWhile (/= "") xs
Then you want to concatenate these strings.
concat $ takeWhile (/= "") xs
If you want to start after n strings for some reason, just drop the first n before beginning:
concat $ takeWhile (/= "") $ drop n xs
If you really want to do a custom "loop", use recursion:
g xs n = f $ drop n xs
f [] = ""
f ("":xs) = ""
f (x:xs) = x ++ f xs
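Put together as one function, the takeWhile version might look like the sketch below; concatUntilEmpty is a hypothetical name, and n is the number of strings to skip first.

```haskell
-- Skip the first n strings, then concatenate strings up to
-- (but not including) the first empty one.
concatUntilEmpty :: Int -> [String] -> String
concatUntilEmpty n = concat . takeWhile (not . null) . drop n
```

Since drop and takeWhile are total, an n past the end of the list just gives the empty string, unlike indexing with !!.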
How can I add a filter to (x:xs) = (x, length (x:xs)) so that the length is only emitted when it is greater than 1?
Currently, for the input abcaaabbb the output is [('a',1),('b',1),('c',1),('a',3),('b',3)], but I'm looking for abca3b3.
My code:
import Data.List
encode :: [Char] -> [(Char, Int)]
encode s = map go (group s)
  where go (x:xs) = (x, length (x:xs))
main = do
    s <- getLine
    print (encode s)
The last line will become putStrLn (concat (map (\(x,y) -> x : [y]) (encode s))) to convert the list to a string.
As I am a newbie myself, this is probably not very idiomatic Haskell. But you can do it roughly like this (where xs is the list [('a', 1), ('b', 2), ('a', 3)]):
Create "a1b2a3":
concat $ map (\(c, l) -> c:(show l)) xs
Filter out 1s:
filter (\x -> x /= '1') "a1b2a3"
will give you "ab2a3". Note that this also strips the digit 1 from counts such as 10 or 12, so it only works while every run is shorter than 10.
You can't have a list like this in Haskell:
[('a'),('b'),('c'),('a',3),('b',3)]
Each element of a list needs to have the same type in Haskell, and ('c') (a Char, as in ('a') :: Char) and ('b',3) (a pair, as in ('a',1) :: Num t => (Char, t)) have different types.
Maybe also have a look at "List of different types?"
I would suggest that you change your list to use a (Char, Maybe num) data structure.
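A minimal sketch of that suggestion, with the count fixed to Int for simplicity; encodeMaybe and render are hypothetical names:

```haskell
import Data.List (group)

-- A run of length 1 carries no count; longer runs carry Just the length.
encodeMaybe :: String -> [(Char, Maybe Int)]
encodeMaybe s = [ (c, count (length g)) | g@(c:_) <- group s ]
  where
    count n | n > 1     = Just n
            | otherwise = Nothing

-- Print the character, followed by the count when there is one.
render :: [(Char, Maybe Int)] -> String
render = concatMap (\(c, n) -> c : maybe "" show n)
```

With this, render (encodeMaybe "abcaaabbb") gives "abca3b3".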
Edit:
From your new question, I think you have been searching for this:
import Data.List
encode :: [Char] -> [(Char, Int)]
encode s = map go (group s)
  where go (x:xs) = (x, length (x:xs))
f :: (Char, Int) -> String
f (a, b) = if b == 1 then [a] else [a] ++ show b
encode2 :: [(Char, Int)] -> String
encode2 [] = []
encode2 (x:xs) = f(x) ++ encode2 xs
main = do
    s <- getLine
    putStrLn $ encode2 $ encode s
Not sure if this suits your needs, but if you do not need filtering, this does the work:
encode::String -> String
encode "" = ""
encode (x:xs) = doIt0 xs x 1 where
    doIt0 [] ch currentPos = [ch] ++ showPos currentPos
    doIt0 (x:xs) ch currentPos
        | x == ch   = doIt0 xs ch $ currentPos + 1
        | otherwise = [ch] ++ (showPos currentPos) ++ (doIt0 xs x 1)
    showPos pos = if pos > 1 then show pos else ""
main = do
    s <- getLine
    print (encode s)
Hey. For a tutorial this week, one of the questions asks to create a function formatLines by using other functions formatLine and formatList, to format a list of lines.
My code looks like this;
type Line = String
formatLine :: Line -> String
formatLine l = l ++ "\n"
formatList :: (a -> String) -> [a] -> String
formatList f [] = []
formatList f xs = f (head xs) ++ formatList f (tail xs)
formatLines :: [Line] -> String
formatLines xs = formatList formatLine xs
The code seems (to me, at least) like it should work, but instead of a new line starting where "\n" is, a literal \n shows up in the string.
Any help would be greatly appreciated.
That is because you are probably using print to print the result. Instead, use putStr. Observe:
Prelude> print "test\ntest\n"
"test\ntest\n"
Prelude> putStr "test\ntest\n"
test
test
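The difference comes from show: print x is putStrLn (show x), and show on a String adds the quotes and escapes the newlines. A small sketch to make that visible (printed and demo are made-up names):

```haskell
-- What print writes for a String, minus its final newline.
printed :: String -> String
printed = show

-- Run in GHCi to compare the two outputs.
demo :: IO ()
demo = do
  print "test\ntest\n"   -- escaped, in quotes, on one line
  putStr "test\ntest\n"  -- the raw characters, so two lines
```

So the string itself already contains real newlines; only the way it is displayed differs.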
Other than that, you can use pattern matching to write formatList without head and tail:
formatList :: (a -> String) -> [a] -> String
formatList f [] = []
formatList f (x:xs) = f x ++ formatList f xs
But there is actually no need to define formatList yourself, as it is identical to the function concatMap:
formatList :: (a -> String) -> [a] -> String
formatList = concatMap
Combining all this, you can also just write (note that (++ "\n") is a section):
formatLines :: [String] -> String
formatLines = concatMap (++ "\n")
...which in turn is equivalent to unlines:
formatLines :: [String] -> String
formatLines = unlines
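As a quick sanity check that these rewrites really are equivalent, the variants can be compared against unlines directly. The suffixed names below are only there so that they can live in one file:

```haskell
-- Explicit recursion, as in the pattern-matching version above.
formatLinesRec :: [String] -> String
formatLinesRec []     = []
formatLinesRec (x:xs) = (x ++ "\n") ++ formatLinesRec xs

-- concatMap with a section.
formatLinesMap :: [String] -> String
formatLinesMap = concatMap (++ "\n")

-- True when both variants agree with unlines on the given input.
agreeWithUnlines :: [String] -> Bool
agreeWithUnlines ls =
  formatLinesRec ls == unlines ls && formatLinesMap ls == unlines ls
```

For example, all three turn ["a","b"] into "a\nb\n".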
Just try
formatLines = unlines
(unwords would join the lines with spaces rather than newlines.)