I'm a beginner with Megaparsec and Haskell in general, and I'm trying to write a parser for the following grammar:
A word will always be one of:
A number composed of one or more ASCII digits (e.g. "0" or "1234"), OR
A simple word composed of one or more ASCII letters (e.g. "a" or "they"), OR
A contraction of two simple words joined by a single apostrophe (e.g. "it's" or "they're")
So far, I've got the following (this can probably be simplified):
data Word = Number String | SimpleWord String | Contraction String deriving (Show)
word :: Parser MyParser.Word
word = M.choice
[ Number <$> number
, Contraction <$> contraction
, SimpleWord <$> simpleWord
]
number :: Parser String
number = M.some C.numberChar
simpleWord :: Parser String
simpleWord = M.some C.letterChar
contraction :: Parser String
contraction = do
left <- simpleWord
void $ C.char '\''
right <- simpleWord
return (left ++ "'" ++ right)
But I'm having a problem defining a parser that skips whitespace and anything else that is non-alphanumeric. For example, given the input 'abc', the parser should discard the apostrophes and just take the "simple word".
The following doesn't compile:
filler :: Parser Char
filler = M.some (C.spaceChar A.<|> not C.alphaNumChar)
spaceConsumer :: Parser ()
spaceConsumer = L.space filler A.empty A.empty
lexeme :: Parser a -> Parser a
lexeme = L.lexeme spaceConsumer
Here is the complete working code that I came up with.
type Parser =
M.Parsec
-- The type for custom error messages. We have none, so use `Void`.
Void
-- The input stream type. Let's use `String` for now.
String
data Word = Number String | SimpleWord String | Contraction String deriving (Eq)
instance Show WordCount.Word where
show (Number x) = x
show (SimpleWord x) = x
show (Contraction x) = x
words :: String -> Either String [String]
-- Force parser to consume entire input
-- <* Sequence actions, discarding the value of the second argument.
words input = case M.parse (M.some WordCount.word A.<* M.eof) "" input of
-- :t err = M.ParseErrorBundle String Void
Left err ->
let e = M.errorBundlePretty err
_ = putStr e
in Left e
Right (x) -> Right $ map (show) x
word :: Parser WordCount.Word
word =
M.skipManyTill filler $
lexeme $
M.choice
-- <$> is infix for 'fmap'
[ Number <$> number,
Contraction <$> M.try contraction,
SimpleWord <$> simpleWord
]
number :: Parser String
number = M.some MC.numberChar
simpleWord :: Parser String
simpleWord = M.some MC.letterChar
contraction :: Parser String
contraction = do
left <- simpleWord
void $ MC.char '\''
right <- simpleWord
return $ left ++ "'" ++ right
-- Define separator characters
isSep :: Char -> Bool
isSep x = C.isSpace x || (not . C.isAlphaNum) x
-- Fillers fill the space between tokens
filler :: Parser ()
filler = void $ M.some $ M.satisfy isSep
-- 3rd and 4th arguments are for ignoring comments
spaceConsumer :: Parser ()
spaceConsumer = L.space filler A.empty A.empty
-- A parser that discards trailing space
lexeme :: Parser a -> Parser a
lexeme = L.lexeme spaceConsumer
First, you probably want to use some1 for number and simple words, otherwise "" would be a number.
Your filler parser is good, except that it should use many rather than some, because you want to allow for e.g. "they1234" to parse as SimpleWord "they" followed by Number "1234" with no filler in between.
What you need to say for the overall parser is that your text consists of zero or more words separated by filler, with optional filler before and after. Fortunately megaparsec re-exports lots of useful stuff from Control.Monad.Combinators for doing this.
So we can use sepBy for the words separated by filler:
document :: Parser [Word]
document = do
_ <- filler -- Throw away any filler at the start.
result <- word `sepBy` filler
_ <- filler -- Throw away any filler at the end.
return result
We don't need optional for the start and end filler because filler can be zero length.
Finally, a style point: in a real parser you would want to make the Word type a bit more sophisticated. Something like:
data SimpleWord = Number String | SimpleWord String
data Word = Word SimpleWord | Contraction SimpleWord SimpleWord
That way whatever bit of code deals with Contraction downstream doesn't have to find the apostrophe all over again or deal with the "impossible" case where there isn't one. Once you've found the structure information in your input, don't throw it away. But that's a side issue for this exercise.
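For example, here is a sketch of that idea (purely illustrative, not needed for the exercise), reusing the Parser alias and the qualified imports from the first snippet (M for Text.Megaparsec, C for Text.Megaparsec.Char):
data SimpleWord = Number String | SimpleWord String deriving (Show)
-- As in the question, this Word shadows Prelude's Word, so qualify it
-- (MyParser.Word) or hide the Prelude one.
data Word = Word SimpleWord | Contraction SimpleWord SimpleWord deriving (Show)

word :: Parser MyParser.Word
word = M.choice
  [ M.try (Contraction <$> simple <* C.char '\'' <*> simple)
  , Word . Number <$> M.some C.numberChar
  , Word <$> simple
  ]
  where
    simple = SimpleWord <$> M.some C.letterChar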
I want to know how to make modifications to a text that is full of special characters and codes, replacing those codes with strings.
I have the following text:
text=
"#chomsky/syntactic structures/chomskySynt/: published in 1957. #bloomfield/language/bloomfieldLan/: published in 1933. #chomsky/aspects of a theory of syntax/chomskyAsp/: published in 1965. ... #see/chomskySynt/ is considered the starting point of generative linguistics.... Another hypothesis was introduced in #see/chomskyAsp/."
I want to turn it into:
"Chomsky 1: Syntactic structures : published in 1957. Bloomfield 1: Language : published in 1933. Chomsky 2: Aspects of a theory of syntax : published in 1965. ... Chomsky 1 is considered the starting point of generative linguistics ... Another hypothesis was introduced in Chomsky 2..."
Explanation of the special characters and codes: the information on a book starts with #, followed by the name of the author (chomsky, for example), then /, then the title of the book, then /, then the special code for the book (chomskyAsp), then /.
A citation of a book starts with #see, followed by /, then the special code of the book (e.g. chomskySynt), then /.
The modifications are:
Count how many times an author is cited and append that number to the name: Chomsky 1, for example.
The author name should start with a capital letter.
Remove the special code (e.g. chomskySynt), which serves only as an identification code.
Replace the reference #see/chomskyAsp with Chomsky 2; that is, replace the reference with the actual author and number.
Here is my code:
removeSlash = myReplace "/" " " text
removeDash = map lines $ (filter (any isLetter) . groupBy ((==) `on` (== '#'))) $ removeSlash
flattenList = concat removeDash
splitIntoWords = map words flattenList
And here is the myReplace function:
myReplace _ _ [] = []
myReplace a b s@(x:xs) = if a `isPrefixOf` s
  then b ++ myReplace a b (drop (length a) s)
  else x : myReplace a b xs
Here is the result so far:
[["chomsky syntactic structures chomskySynt published in 1957. "], ["bloomfield language bloomfieldLan published in 1933. "],["chomsky aspects of a theory of syntax chomskyAsp published in 1965. ... "],["see chomskySynt is considered the starting point of generative linguistics.... Another hypothesis was introduced in "],["see chomskyAsp"]]
The reason I flattened the list and split it into words is that now, if I do:
map head splitIntoWords
I get ["chomsky","bloomfield","chomsky","see","see"]
I am stuck at this stage. How do I count how many times an author is cited and concatenate the number to the name? I thought of using the zip function:
zipChomsky = zip [1, 2] [x | x <- diviser, head x == "chomsky"]
This gives:
[(1["chomsky","syntactic","structures","chomskySynt","published","in","1957."]),(2,["chomsky","aspects","of","a","theory","of","syntax","chomskyAsp","published","in","1965.","..."])]
But the result is very different from: Chomsky 1: ...
EDIT: I didn't mean to make the answer this long, but the problem turned out to be a non-trivial task, and I'm not quite sure how much detail I should put in the answer. If you already understand all the tools I'm using, the full code is at the end of this answer.
In your case, you'll need:
an approach to parse your input document
a suitable data structure to store the input information
displaying the data as output format
For the parsing part, a regex might be enough (maybe), but I think the Parsec library is a better choice. For detailed usage of Parsec please refer to the link; here I'll only show how to use it in your case:
First, import Text.ParserCombinators.Parsec.
A document is a list of
a literal string
a definition, with format #<Author>/<Title>/<Code>/, as in "#chomsky/syntactic structures/chomskySynt/"
a citation, with format #see/<Code>/, as in "#see/chomskyAsp/"
Hence we define
data Index = Index {
getAuthor :: String,
getTitle :: String,
getSpecialCode :: String,
getAuthorCount :: Int
-- For counting author later.
} deriving (Show)
data Content = Def Index
| Cite String Index
-- We'll fill in Index later.
| Literal String
deriving (Show)
and our input document will just be turned into [Content].
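For example, under these definitions the first book entry of your text should come out roughly as the following value (the author count is still unset at this point):
exampleContents :: [Content]
exampleContents =
  [ Def (Index "chomsky" "syntactic structures" "chomskySynt" 0)
  , Literal ": published in 1957. "
  ]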
Correspondingly, we'll use the following function (actually, parser) to parse the input:
document = many (try cite <|> try def <|> literal)  -- try cite before def, so "#see/..." is not parsed as a definition
literal = Literal <$> many1 (noneOf "#")
def = do
  char '#'
  author <- many1 $ noneOf "/"
  char '/'
  title <- many1 $ noneOf "/"
  char '/'
  code <- many1 $ noneOf "/"
  char '/'
  return $ Def $ Index author title code 0
cite = do
  try $ string "#see/"
  code <- many1 $ noneOf "/"
  char '/'
  return $ Cite code nullIndex
A short explanation:
A document is many (def or cite or literal), with operator <|> combining parsers.
A literal is a string that stops at '#' and has at least 1 char (using many1); a parser inside many must not accept empty input (think about why!).
A def is #<Author>/<Title>/<Code>/, and we can write in do-notation since Parser is a monad.
A cite goes similarly.
A def, a cite, and the string "#see/" each parse multiple characters, so they can fail after having already consumed some input; therefore we wrap them in the combinator try.
By the way, nullIndex is just a placeholder before we actually fill this record:
nullIndex :: Index
nullIndex = Index "" "" "" 0
Now we only need a function with signature [Content] -> String.
We can start with capitalizing the author name:
capitalizeAuthor :: Content -> Content
capitalizeAuthor (Def x) = Def (x {getAuthor = author'}) where
author' = toUpper (head author) : tail author
author = getAuthor x
capitalizeAuthor y = y
The other tasks are not local, since they depend on relations between several Contents, so we will use a foldl across the list.
Define
import Data.Map.Strict ((!), (!?))
import qualified Data.Map.Strict as M
type CodeDict = M.Map String Index
-- Map Code Index
type AuthorDict = M.Map String Int
-- Map Author Count
type Fold = (CodeDict, AuthorDict, [Content])
emptyFold :: Fold
emptyFold = (M.empty, M.empty, [])
The Fold type stores the state as we walk along the original [Content].
(I realize the code would be much clearer if I used the State monad, but then I'm not sure how much of it I would need to explain...)
In addition, we need a folding function for foldl:
accum :: Fold -> Content -> Fold
accum (c,a,ls) (Def x) = (c',a',Def x':ls) where
a' = M.insertWith (+) author 1 a
c' = M.insert code x' c
x' = x {getAuthorCount = count}
count = maybe 1 (+1) $ a !? author
author = getAuthor x
code = getSpecialCode x
accum (c,a,ls) (Cite code _) = (c,a,Cite code (c ! code) : ls)
accum (c,a,ls) y = (c,a,y:ls)
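The counting trick here is Data.Map's insertWith combined with the !? lookup. A quick GHCi illustration (the example keys and counts are made up):
ghci> M.insertWith (+) "chomsky" 1 M.empty
fromList [("chomsky",1)]
ghci> M.insertWith (+) "chomsky" 1 (M.fromList [("chomsky",1)])
fromList [("chomsky",2)]
ghci> M.fromList [("chomsky",1)] !? "bloomfield"
Nothing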
After the foldl, the resulting list will contain the contents with
getAuthorCount correctly filled, and
each Cite now carrying the Index of its corresponding Def, since both are printed in the same format.
The resulting list is built in reverse order, so you'll need Data.List.reverse.
Finally, you can define your own version of Show for Content. For example,
instance Show Index where
show x = getAuthor x ++ " "
++ show (getAuthorCount x) ++ ": "
++ getTitle x ++ " "
instance Show Content where
show (Def idx) = show idx
show (Cite x idx) = getAuthor idx ++ " "
++ show (getAuthorCount idx)
show (Literal x) = x
as I figured out from your output sample.
The full length code:
import Data.Char
import Data.List (reverse)
import Data.Map.Strict ((!),(!?))
import qualified Data.Map.Strict as M
import Text.ParserCombinators.Parsec
data Index = Index {
getAuthor :: String,
getTitle :: String,
getSpecialCode :: String,
getAuthorCount :: Int
-- For counting author later.
}
nullIndex :: Index
nullIndex = Index "" "" "" 0
instance Show Index where
show x = getAuthor x ++ " "
++ show (getAuthorCount x) ++ ": "
++ getTitle x ++ " "
data Content = Def Index
| Cite String Index
| Literal String
instance Show Content where
show (Def idx) = show idx
show (Cite x idx) = getAuthor idx ++ " "
++ show (getAuthorCount idx)
show (Literal x) = x
document = many (try cite <|> try def <|> literal)
literal = Literal <$> many1 (noneOf "#")
def = do
char '#'
author <- many1 $ noneOf "/"
char '/'
title <- many1 $ noneOf "/"
char '/'
code <- many1 $ noneOf "/"
char '/'
return $ Def $ Index author title code 0
cite = do
try $ string "#see/"
code <- many1 $ noneOf "/"
char '/'
return $ Cite code nullIndex
capitalizeAuthor :: Content -> Content
capitalizeAuthor (Def x) = Def (x {getAuthor = author'}) where
author' = toUpper (head author) : tail author
author = getAuthor x
capitalizeAuthor y = y
type CodeDict = M.Map String Index
-- Map Code Index
type AuthorDict = M.Map String Int
-- Map Author Count
type Fold = (CodeDict, AuthorDict, [Content])
emptyFold :: Fold
emptyFold = (M.empty, M.empty, [])
accum :: Fold -> Content -> Fold
accum (c,a,ls) (Def x) = (c',a',Def x':ls) where
a' = M.insertWith (+) author 1 a
c' = M.insert code x' c
x' = x {getAuthorCount = count}
count = maybe 1 (+1) $ a !? author
author = getAuthor x
code = getSpecialCode x
accum (c,a,ls) (Cite code _) = (c,a,Cite code (c ! code) : ls)
accum (c,a,ls) y = (c,a,y:ls)
main :: IO ()
main = do
line <- getLine
let parsed = parse document "" line
case parsed of
Left x -> print x
Right cs -> do
let cs1 = map capitalizeAuthor cs
let (_,_,cs2) = foldl accum emptyFold cs1
let output = concatMap show $ reverse cs2
putStrLn output
I have actually asked this question before (here), but it turns out that the solution provided did not handle all test cases. Also, I need a 'Text' parser rather than a 'String' one, so I need Parsec 3.
OK, the parser should allow every type of char in between the quotes, even quotes themselves. The end of the quoted text is marked by a ' character followed by |, a space, or end of input.
So,
'aa''''|
should return a string
aa'''
This is what I have:
import Control.Monad (liftM)
import Data.Text (Text, pack)
import Text.Parsec
import Text.Parsec.Text

quotedLabel :: Parser Text
quotedLabel = do -- reads the first quote.
  spaces
  string "'"
  lab <- liftM pack $ endBy1 anyChar endOfQuote
  return lab
endOfQuote = do
string "'"
try(eof) <|> try( oneOf "| ")
Now, the problem here is of course that eof has a different type than oneOf "| ", so compilation fails.
How do I fix this? Is there a better way to achieve what I am trying to do?
Whitespace
First a comment on handling white space...
Generally the practice is to write your parsers so that they
consume the whitespace following a token
or syntactic unit. It's common to define a combinator like:
lexeme p = p <* spaces
to easily convert a parser p to one that discards the whitespace
following whatever p parses. E.g., if you have
number = many1 digit
simply use lexeme number whenever you want to eat up the
whitespace following the number.
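A tiny self-contained illustration in Parsec (the names number and numbers are only for this example):
import Text.Parsec
import Text.Parsec.String (Parser)

lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

number :: Parser String
number = lexeme (many1 digit)

numbers :: Parser [String]
numbers = many number <* eof
-- parse numbers "" "12  34 5"  ==  Right ["12","34","5"]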
For more on this approach to handling whitespace and other advice
on parsing languages, see this Megaparsec tutorial.
Label expressions
Based on your previous SO question it appears you want
to parse expressions of the form:
label1 | label2 | ... | labeln
where each label may be a simple label or a quoted label.
The idiomatic way to parse this pattern is to use sepBy like this:
labels :: Parser [String]
labels = sepBy1 (try quotedLabel <|> simpleLabel) (char '|')
We define both simpleLabel and quotedLabel in terms of
what characters may occur in them. For simpleLabel a valid
character is a non-| and non-space:
simpleLabel :: Parser String
simpleLabel = many (noneOf "| ")
A quotedLabel is a single quote followed by a run
of valid quotedLabel-characters followed by an ending
single quote:
sq = '\''
quotedLabel :: Parser String
quotedLabel = do
  char sq
  chs <- many validChar
  char sq
  return chs
A validChar is either a non-single quote or a single
quote not followed by eof or a vertical bar:
validChar = noneOf [sq] <|> try validQuote
validQuote = do
char sq
notFollowedBy eof
notFollowedBy (char '|')
return sq
The first notFollowedBy will fail if the single quote appears just
before the end of input. The second notFollowedBy will fail if the
next character is a vertical bar. Therefore the sequence of the two
will succeed only if there is a non-vertical bar character following
the single quote. In this case the single quote should be interpreted
as part of the string and not the terminating single quote.
Unfortunately this doesn't quite work because the
current implementation of notFollowedBy
will always succeed with a parser which does not consume any
input -- i.e. like eof. (See this issue for more details.)
To work around this problem we can use this alternate
implementation:
notFollowedBy' :: (Stream s m t, Show a) => ParsecT s u m a -> ParsecT s u m ()
notFollowedBy' p = try $ join $
do {a <- try p; return (unexpected (show a));}
<|> return (return ())
Here is the complete solution with some tests. By adding a few lexeme
calls you can make this parser eat up any white space where you decide
it is not significant.
import Text.Parsec hiding (labels)
import Text.Parsec.String
import Control.Monad
notFollowedBy' :: (Stream s m t, Show a) => ParsecT s u m a -> ParsecT s u m ()
notFollowedBy' p = try $ join $
do {a <- try p; return (unexpected (show a));}
<|> return (return ())
sq = '\''
validChar = do
noneOf "'" <|> try validQuote
validQuote = do
char sq
notFollowedBy' eof
notFollowedBy (char '|')
return sq
quotedLabel :: Parser String
quotedLabel = do
char sq
str <- many validChar
char sq
return str
plainLabel :: Parser String
plainLabel = many (noneOf "| ")
labels :: Parser [String]
labels = sepBy1 (try quotedLabel <|> try plainLabel) (char '|')
test input expected = do
case parse (labels <* eof) "" input of
Left e -> putStrLn $ "error: " ++ show e
Right v -> if v == expected
then putStrLn $ "OK - got: " ++ show v
else putStrLn $ "NOT OK - got: " ++ show v ++ " expected: " ++ show expected
test1 = test "a|b|c" ["a","b","c"]
test2 = test "a|'b b'|c" ["a", "b b", "c"]
test3 = test "'abc''|def" ["abc'", "def" ]
test4 = test "'abc'" ["abc"]
test5 = test "x|'abc'" ["x","abc"]
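To run all of them in one go, a small driver like this works (the main here is my addition, not part of the solution itself):
main :: IO ()
main = sequence_ [test1, test2, test3, test4, test5]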
To change the result of any functor computation you can just use:
fmap (const x) functor_comp
e.g.:
getLine :: IO String
fmap (const ()) getLine :: IO ()
eof :: Parser ()
oneOf "| " :: Parser Char
fmap (const ()) (oneOf "| ") :: Parser ()
Another option is to use operators from Control.Applicative:
getLine *> return 3 :: IO Integer
This performs getLine, discards the result and returns 3.
In your case, you might use:
try eof <|> try (oneOf "| " *> return ())
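So one version of the original endOfQuote that type-checks might look like this (just a sketch; void from Control.Monad does the same job as fmap (const ())):
import Control.Monad (void)

endOfQuote :: Parser ()
endOfQuote = do
  string "'"
  try eof <|> void (oneOf "| ")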
This question might sound trivial, but I am a Haskell beginner and I've read some tutorials on Parsec, yet I can't figure out how to store the result from a parser in a list (or, in my case, a list of lists of strings).
Here is the code of the parser:
-- Adapted from http://book.realworldhaskell.org/read/using-parsec.html -> ch16/csv9.hs and ch16/csv6.hs
import Text.ParserCombinators.Parsec
pgmFile = endBy line eol
line = sepBy cell (char ' ')
cell = many (noneOf " \n")
eol = try (string "\n\r")
<|> try (string "\r\n")
<|> string "\n"
<|> string "\r"
<?> "end of line"
parsePGM :: String -> Either ParseError [[String]]
parsePGM input = parse pgmFile "(unknown)" input
main =
do c <- getContents
case parse pgmFile "(stdin)" c of
Left e -> do putStrLn "Error parsing input:"
print e
Right r -> mapM_ print r
returnString =
do c <- getContents
case parse pgmFile "(stdin)" c of
Left e -> do putStrLn "Error parsing input:"
print e
Right r -> r
The returnString function does not work, but the main function accepts an ASCII PGM file and parses it into a list of lists of strings, where every line becomes a list whose elements are the words on that line.
I want to store the resulting list in a variable to work with it later.
So how can I do this?
I am really grateful for any help I can get!
Edit:
The error message for the function returnString is:
pgmCH9.hs:32:24:
    Couldn't match type ‘[]’ with ‘IO’
    Expected type: IO ()
      Actual type: [[[Char]]]
    In the expression: r
    In a case alternative: Right r -> r
I guess this is exactly what Sarah wrote in her comment: the types in the Left and Right cases must be identical. The output of main's Right case looked like the list I want to store, which is why I tried to save it via a return value.
Here is the type of parse pgmFile "(stdin)":
ghci> :type parse pgmFile "(stdin)"
parse pgmFile "(stdin)" :: [Char] -> Either ParseError [[[Char]]]
Edit 2: Here is the revised code with the additions based on the comments:
-- Adapted from http://book.realworldhaskell.org/read/using-parsec.html -> ch16/csv9.hs and ch16/csv6.hs
import Text.ParserCombinators.Parsec
import System.IO
--
{- A PGM file contains a header with the type, a comment, the width and height of the picture,
and the maximum value for all pixels. The picture consists of width*height pixels, each of
which is separated by a space or the end-of-line character (eol). -}
--pgmFile :: Text.Parsec.Prim.ParsecT [Char] u Data.Functor.Identity.Identity [[[Char]]]
pgmFile = endBy line eol
-- Each line contains 1 or more pixels, separated by a space
--line :: Text.Parsec.Prim.ParsecT [Char] u Data.Functor.Identity.Identity [[Char]]
line = sepBy pixel (char ' ')
-- Each pixel contains of characters and is limited by space or a newline
--pixel :: Text.Parsec.Prim.ParsecT [Char] u Data.Functor.Identity.Identity [Char]
pixel = many (noneOf " \n")
--eol :: Text.Parsec.Prim.ParsecT [Char] u Data.Functor.Identity.Identity String
eol = try (string "\n\r")
<|> try (string "\r\n")
<|> string "\n"
<|> string "\r"
<?> "end of line"
parsePGM :: String -> Either ParseError [[String]]
parsePGM input = parse pgmFile "(unknown)" input
main :: IO ()
main =
do c <- getContents
case parse pgmFile "(stdin)" c of
Left e -> do putStrLn "Error parsing input:"
print e
Right r -> mapM_ print r
returnString :: IO ()
returnString =
do c <- readFile "test_ascii.pgm"
case parse pgmFile "(stdin)" c of
Left e -> do putStrLn "Error parsing input:"
print e
Right r -> print r
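If the goal is to get the parsed list out of the IO action so other code can use it, another option is to return it instead of printing it. A sketch (the name loadPGM and the decision to return [] on a parse error are mine, not from the question):
loadPGM :: FilePath -> IO [[String]]
loadPGM path = do
  c <- readFile path
  case parse pgmFile path c of
    Left e -> do
      putStrLn "Error parsing input:"
      print e
      return []  -- both branches must have the same type, IO [[String]]
    Right r -> return r

-- Usage from another IO action:
-- main = do rows <- loadPGM "test_ascii.pgm"
--           print (length rows)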
I'm quite new to Haskell and I want to work a bit with parsers.
I'm trying to create a small program that will read an expression from a text.txt file and will return the result.
First of all I read the text and convert what I read to a string:
module Main where
import Data.Char
convert :: String -> String
convert = unlines . (map convertLine) . lines
convertLine :: String -> String
convertLine = unwords . (map convertWord) . words
convertWord :: String -> String
convertWord s = (toUpper (head s)):(tail s)
main = do
name <- readFile "test.txt"
putStr $ convert name
Now, I want to parse that string and I start with:
parse :: (Read a, Num a, Fractional a) => String -> [SyntacticalElement a]
parse "" = []
parse putStr = element : (parse rest)
where (element, rest) = next_elem putStr
Here I get the error: next_elem is not in scope
Any idea why?
Update:
module Main where
import Data.Char
convert :: String -> String
convert = unlines . (map convertLine) . lines
convertLine :: String -> String
convertLine = unwords . (map convertWord) . words
convertWord :: String -> String
convertWord s = (toUpper (head s)):(tail s)
main = do
name <- readFile "test.txt"
putStr $ convert name
number = ['0'..'9'] ++ ['.']
operator = ['+', '-', '*', '/']
open_brackets = ['(', '[']
close_brackets = [')', ']']
brackets = open_brackets ++ close_brackets
allowed_chars = number ++ operator ++ brackets
parse :: (Read a, Num a, Fractional a) => String -> [SyntacticalElement a]
parse "" = []
parse putStr = element : (parse rest)
where (element, rest) = next_elem putStr
next_elem :: (Read a, Num a, Fractional a) => String -> (SyntacticalElement a, String)
next_elem s@(first:_)
| is_open_bracket first = (to_sublist content, rest_b)
| is_operator first = (to_operator operator, rest_o)
| is_number first = (to_number number, rest_n)
| is_close_bracket first = error "Unexpected closing bracket!"
| otherwise = error $ "Invalid Expression: \"" ++ s ++ "\""
where (number, rest_n) = span is_number s
(operator, rest_o) = span is_operator s
(content, rest_b) = parse_bracket s
Because the compiler doesn't know what next_elem means. Why would it? Where is it defined?
By the way, for these situations it's often handy to use interact rather than bothering to read from a file.
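For example, the main from the question could be reduced to the following sketch, which pipes stdin through the pure convert function (run it as e.g. runghc Main.hs < test.txt):
main :: IO ()
main = interact convert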