Monadic Parser - handling string with one character - haskell

I was reading this Monadic Parsing article while I was trying to implement a pretty simple string parser in Haskell and also get a better understanding of using monads. Down below you can see my code, implementing functions for matching a single character or a whole string. It works as expected, but I observed two strange behaviors that I can't explain.
I have to handle single characters in string, otherwise, the parser will return only empty lists. To be exact, if I remove this line string [c] = do char c; return [c] it won't work anymore. I was expecting that string (c:s) would handle string (c:[]) properly. What could be the cause here?
In my opinion, string definition should be equivalent to string s = mapM char s as it would create a list of [Parser Char] for each character in s and collect the results as Parser [Char]. If I use the definition based on mapM, the program would get stuck in an infinite loop and won't print anything. Is something about lazy evalutation that I miss here?
.
module Main where
newtype Parser a = Parser { apply :: String->[(a, String)] }
instance Monad Parser where
return a = Parser $ \s -> [(a, s)]
ma >>= k = Parser $ \s -> concat [apply (k a) s' | (a, s') <- apply ma s]
instance Applicative Parser where
pure = return
mf <*> ma = do { f <- mf; f <$> ma; }
instance Functor Parser where
fmap f ma = f <$> ma
empty :: Parser a
empty = Parser $ const []
anychar :: Parser Char
anychar = Parser f where
f [] = []
f (c:s) = [(c, s)]
satisfy :: (Char -> Bool) -> Parser Char
satisfy prop = do
c <- anychar
if prop c then return c
else empty
char :: Char -> Parser Char
char c = satisfy (== c)
string :: String -> Parser String
string [] = empty
string [c] = do char c; return [c] --- if I remove this line, all results will be []
string (c:s) = do char c; string s; return (c:s)
main = do
let s = "12345"
print $ apply (string "123") s
print $ apply (string "12") s
print $ apply (string "1") s
print $ apply (string []) s
PS. I think the title of the question is not suggestive enough, please propose an edit if you have a better idea.

Since you did string [] = empty instead of string [] = return [], you can't use it as a base case for recursion that builds up a list.
fmap f ma = f <$> ma is wrong, since <$> is defined in terms of fmap. If you want to define fmap in terms of your other instances, then do fmap = liftA or fmap = liftM. Since mapM uses fmap internally but your original string didn't, this problem didn't come up in your first simple test.

string [] = empty
means: "If you need to parse an empty string, fail -- it can not be parsed at all, no matter what's the input string".
By comparison,
string [] = return ""
means: "If you need to parse an empty string, succeed and return the empty string -- it can always be parsed, no matter what's the input string".
By using the first equation, when you recurse in the case string (c:cs) you need to stop at one character (string [c]) since reaching zero characters will run empty and make the whole parser fail.
Hence, you need to either use that string [c] = return [c] equation, or modify the base "empty string" case so that it succeeds. Arguably, the latter would be more natural.

Related

string to Integer parsing with Exception Handling in Haskell

Trying to make type nat which return [ (int, string) ]; handle zero as string
and type int which can handle negative integer also.
import Data.Char ( isDigit )
type Parser tok a = [tok] -> [(a, [tok])]
return :: a -> Parser tok a
return v = \ts -> [(v, ts)]
failure :: Parser tok a
failure = \_ -> []
item :: Parser tok tok
item [] = []
item (t:ts) = [(t, ts)]
sat :: (tok -> Bool) -> Parser tok tok
sat test = item >>= \t ->
if test t then return t
else failure
(<|>) :: Parser tok a -> Parser tok a -> Parser tok a
p1 <|> p2 = \ts -> case p1 ts of
[] -> p2 ts
rs1 -> rs1
many1 :: Parser tok a -> Parser tok [a]
many p = many1 p <|> return []
many1 p = p >>= \v ->
many p >>= \vs ->
return (v:vs)
digit = sat isDigit
first I make nat and use if ~ then ~ else make exception of 0.
nat :: Parser Char Int
nat = many1 digit >>= \s ->
if (s == "0") then failure
else return (read s)
It was able to handle one zero.
nat "0"
> []
nat "0def"
> []
but can't handle when consecutive numbers starting with 0 come to input.
nat "012def"
> [(12, "def")] -- probably my code only handle 1 Zero
-- expect return []
nat "000def"
> [(0, "def")] -- but i really don't know why this output is coming
-- expect return []
I tried to put aside the problem about nat and make an int first.
first I tried to use nat to define int.
int :: Parser Char Int
int = nat >>= \s ->
if (s < 0) then return (-s)
else return s
And I realize that i can't make to compiler recognize that s is negative.
so i tried to make it simillar as nat. which I want to add char '-'?
int :: Parser Char Int
int = many1 digit >>= \s ->
if (s == "-") then return (read s) ++ "-"
else return (read s)
I have two question
Why nat only handle 1 zero? I thought many1 is recursive parsing string step-by-step.
How can I add "-" in int? does it related with thins like synthetic function?
I am weak at using Haskell. is there something i'm missing?
Don't panic. It is not so difficult.
Why nat only handle 1 zero? I thought many1 is recursive parsing
string step-by-step.
You are right many1 digit will apply the parser digit on the longer chain possible. This means that when you are parsing 012def the parsed part is "012" and the remaining is "def". But you are missing where the error is. It is in the following
\s -> if (s == "0") then failure
else return (read s)
Once, many1 digit finish it provides "012" to your function via the bind (>>=) and it is process. But "012" is not "0". So the else part is applied : return (read "012"). The context "def" is not modified. So this results to the answer [12,"def"]. As you can see the problem is not the first part of the bind, but the second one. Since you know that you will only get digit (because many1 digit can only parse digit), I suggest that you change your binder with
let n = read s
in if n == 0 then failure
else n
More elegant (monadic) solution are possibles, but this one is probably quite readable and explicit for new haskeller.
How can I add "-" in int? does it related with thins like synthetic function?
You can't. You want to read a Char that cannot be translate as a digit. So you can't read it as digit. This is a simple parsing problem. You have to deal with the '-' separately. You can use a choice between two parsers. The first parses the Char '-' followed by a natural. The second parses only natural. Then translate the result in an Int. Take care to not forget the 0. It can be handle in the first, the second or both. This means that you can't use nat in both... or you will have to make a third parser that only parse 0. So remember that the order is important.
I am pretty sure that can be write this considering the previous work. You already have the digit and the choice combinator. The char parser is pretty obvious. You just have to think about it.
Hope this will help.

Stop Parser From Accepting Extra Characters

I have a very simple parser with pretty standard parsing functions used with Haskell. I have the parser recognizing what I want, but the problem is that is accepts any extra input after what I'm looking for has been recognized. The simple example code I'll give should recognize a string "x", then return, Just 'x', or return Nothing for an input string other than "x". Given input "x", it returns Just 'x', as it should, given the string "d", it returns Nothing, as it should, given "dx", it returns Nothing, as it should, but if given "xd", it will return Just 'x', while I want it to return Nothing.
newtype Parser a = P (String -> [(a,String)])
instance Functor Parser where
fmap g p = P (\inp -> case parse p inp of
[] -> []
[(v,out)] -> [(g v, out)])
instance Applicative Parser where
pure v = P (\inp -> [(v,inp)])
pg <*> px = P (\inp -> case parse pg inp of
[] -> []
[(g,out)] -> parse (fmap g px) out)
instance Monad Parser where
p >>= f = P (\inp -> case parse p inp of
[] -> []
[(v,out)] -> parse (f v) out)
parse (P p) inp = p inp
item :: Parser Char
item = P (\inp -> case inp of
[] -> []
(x:xs) -> [(x,xs)])
sat :: (Char -> Bool) -> Parser Char
sat p = do x <- item
if p x then return x else empty
empty = P (\inp -> [])
test = do { x <- sat (\x -> x == 'x'); return x}
run s = case parse test s of
[(a, _)] -> Just a
otherwise -> Nothing
given:
run "xd"
returns:
Just 'x'
My most reasonable attempt was to try to recognize anything that's not an alpha-numeric character, but that just seems to make the parser try to parse beyond the string length and it returns Nothing. Is there a standard way of dealing with this?
The fix is actually really simple. Just check that the remaining string after parsing is empty in the run function. Like this:
run s = case parse test s of
[(a, "")] -> Just a
otherwise -> Nothing
Now any spurious characters will cause the function to return Nothing as you want.
An alternative is to split the check into a separate eof parser that succeeds if the string is empty and fails if there are any characters left and add that to the end of your parser.

How far does "try" back track?

So ... I messed up a recording in CSV format:
23,95489,0,20,9888
Due to language settings floating point numbers were written with commas as seperator ... in a comma separated value file ...
Problem is that the file does not have a nice formatting for every float. Some have no point at all and the number of numbers behind the point varies too.
My idea was to build a MegaParsec parser that would try to read every possible floating point formatting, move on and if back track if it finds an error.
Eg for the example above:
read 23,95489 -> good
read 0,20 -> good (so far)
read 9888 -> error (because value is too high for column (checked by guard))
(back tracking to 2.) read 0 -> good again
read 20,9888 -> good
done
I've implemented that as (pseudo code here):
floatP = try pointyFloatP <|> unpointyFloatP
lineP = (,,) <$> floatP <* comma <*> floatP <* comma <*> floatP <* comma
My problem is that apparently the try only works in the 'current' float. There is no backtracking to previous positions. Is this correct?
And if so ... how would I go about implementing further back tracking?
How far does “try” back track?
The parser try p consumes exactly as much input as p if p parses successfully, otherwise it does not consume any input at all. So if you look at that in terms of backtracking, it backtracks to the point where you were when you invoked it.
My problem is that apparently the try only works in the 'current' float. There is no backtracking to previous positions. Is this correct?
Yes, try does not "unconsume" input. All it does is to recover from a failure in the parser you give it without consuming any input. It does not undo the effects of any parsers that you've applied previously, nor does it affect subsequent parsers that you apply after try p succeeded.
And if so ... how would I go about implementing further back tracking?
Basically what you want is to not only know whether pointyFloatP succeeds on the current input, but also whether the rest of your lineP would succeed after successfully pointyFloatP - and if it doesn't you want to backtrack back to before you applied pointyFloatP. So basically you want the parser for the whole remaining line in the try, not just the float parser.
To achieve that you can make floatP take the parser for the remaining line as an argument like this:
floatP restP = try (pointyFloatP <*> restP) <|> unpointyFloatP <*> restP
Note that this kind of backtracking isn't going to be very efficient (but I assume you knew that going in).
Update: Include a custom monadic parser for more complex rows.
Using the List Monad for Simple Parsing
The list monad makes a better backtracking "parser" than Megaparsec. For example, to parse the cells:
row :: [String]
row = ["23", "95489", "0", "20", "9888"]
into exactly three columns of values satisfying a particular bound (e.g., less than 30), you can generate all possible parses with:
{-# OPTIONS_GHC -Wall #-}
import Control.Monad
import Control.Applicative
rowResults :: [String] -> [[Double]]
rowResults = cols 3
where cols :: Int -> [String] -> [[Double]]
cols 0 [] = pure [] -- good, finished on time
cols 0 _ = empty -- bad, didn't use all the data
-- otherwise, parse exactly #n# columns from cells #xs#
cols n xs = do
-- form #d# from one or two cells
(d, ys) <- num1 xs <|> num2 xs
-- only accept #d < 30#
guard $ d < 30
ds <- cols (n-1) ys
return $ d : ds
-- read number from a single cell
num1 (x:xs) | ok1 x = pure (read x, xs)
num1 _ = empty
-- read number from two cells
num2 (x:y:zs) | ok1 x && ok2 y = pure (read (x ++ "." ++ y), zs)
num2 _ = empty
-- first cell: "0" is okay, but otherwise can't start with "0"
ok1 "0" = True
ok1 (c:_) | c /= '0' = True
ok1 _ = False
-- second cell: can't end with "0" (or *be* "0")
ok2 xs = last xs /= '0'
The above list-based parser tries to reduce ambiguity by assuming that if "xxx,yyy" is a number, the "xxx" won't start with zeros (unless it's just "0"), and the "yyy" won't end with a zero (or, for that matter, be a single "0"). If this isn't right, just modify ok1 and ok2 as appropriate.
Applied to row, this gives the single unambiguous parse:
> rowResults row
[[23.95489,0.0,20.9888]]
Applied to an ambiguous row, it gives all parses:
> rowResults ["0", "12", "5", "0", "8601"]
[[0.0,12.5,0.8601],[0.0,12.5,0.8601],[0.12,5.0,0.8601]]
Anyway, I'd suggest using a standard CSV parser to parse your file into a matrix of String cells like so:
dat :: [[String]]
dat = [ ["23", "95489", "0", "20", "9888"]
, ["0", "12", "5", "0", "8601"]
, ["23", "2611", "2", "233", "14", "422"]
]
and then use rowResults above get the row numbers of rows that were ambiguous:
> map fst . filter ((>1) . snd) . zip [1..] . map (length . rowResults) $ dat
[2]
>
or unparsable:
> map fst . filter ((==0) . snd) . zip [1..] . map (length . rowResults) $ dat
[]
>
Assuming there are no unparsable rows, you can regenerate one possible fixed file, even if some rows are ambiguous, but just grabbing the first successful parse for each row:
> putStr $ unlines . map (intercalate "," . map show . head . rowResults) $ dat
23.95489,0.0,20.9888
0.0,12.5,0.8601
23.2611,2.233,14.422
>
Using a Custom Monad based on the List Monad for More Complex Parsing
For more complex parsing, for example if you wanted to parse a row like:
type Stream = [String]
row0 :: Stream
row0 = ["Apple", "15", "1", "5016", "2", "5", "3", "1801", "11/13/2018", "X101"]
with a mixture of strings and numbers, it's actually not that difficult to write a monadic parser, based on the list monad, that generates all possible parses.
The key idea is to define a parser as a function that takes a stream and generates a list of possible parses, with each possible parse represented as a tuple of the object successfully parsed from the beginning of the stream paired with the remainder of the stream. Wrapped in a newtype, our parallel parser would look like:
newtype PParser a = PParser (Stream -> [(a, Stream)]) deriving (Functor)
Note the similarity to the type ReadS from Text.ParserCombinators.ReadP, which is also technically an "all possible parses" parser (though you usually only expect one, unambiguous parse back from a reads call):
type ReadS a = String -> [(a, String)]
Anyway, we can define a Monad instance for PParser like so:
instance Applicative PParser where
pure x = PParser (\s -> [(x, s)])
(<*>) = ap
instance Monad PParser where
PParser p >>= f = PParser $ \s1 -> do -- in list monad
(x, s2) <- p s1
let PParser q = f x
(y, s3) <- q s2
return (y, s3)
There's nothing too tricky here: pure x returns a single possible parse, namely the result x with an unchanged stream s, while p >>= f applies the first parser p to generate a list of possible parses, takes them one by one within the list monad to calculate the next parser q to use (which, as per usual for a monadic operation, can depend on the result of the first parse), and generates a list of possible final parses that are returned.
The Alternative and MonadPlus instances are pretty straightforward -- they just lift emptiness and alternation from the list monad:
instance Alternative PParser where
empty = PParser (const empty)
PParser p <|> PParser q = PParser $ \s -> p s <|> q s
instance MonadPlus PParser where
To run our parser, we have:
parse :: PParser a -> Stream -> [a]
parse (PParser p) s = map fst (p s)
and now we can introduce primitives:
-- read a token as-is
token :: PParser String
token = PParser $ \s -> case s of
(x:xs) -> pure (x, xs)
_ -> empty
-- require an end of stream
eof :: PParser ()
eof = PParser $ \s -> case s of
[] -> pure ((), s)
_ -> empty
and combinators:
-- combinator to convert a String to any readable type
convert :: (Read a) => PParser String -> PParser a
convert (PParser p) = PParser $ \s1 -> do
(x, s2) <- p s1 -- for each possible String
(y, "") <- reads x -- get each possible full read
-- (normally only one)
return (y, s2)
and parsers for various "terms" in our CSV row:
-- read a string from a single cell
str :: PParser String
str = token
-- read an integer (any size) from a single cell
int :: PParser Int
int = convert (mfilter ok1 token)
-- read a double from one or two cells
dbl :: PParser Double
dbl = dbl1 <|> dbl2
where dbl1 = convert (mfilter ok1 token)
dbl2 = convert $ do
t1 <- mfilter ok1 token
t2 <- mfilter ok2 token
return $ t1 ++ "." ++ t2
-- read a double that's < 30
dbl30 :: PParser Double
dbl30 = do
x <- dbl
guard $ x < 30
return x
-- rules for first cell of numbers:
-- "0" is okay, but otherwise can't start with "0"
ok1 :: String -> Bool
ok1 "0" = True
ok1 (c:_) | c /= '0' = True
ok1 _ = False
-- rules for second cell of numbers:
-- can't be "0" or end in "0"
ok2 :: String -> Bool
ok2 xs = last xs /= '0'
Then, for a particular row schema, we can write a row parser as we normally would with a monadic parser:
-- a row
data Row = Row String Int Double Double Double
Int String String deriving (Show)
rowResults :: PParser Row
rowResults = Row <$> str <*> int <*> dbl30 <*> dbl30 <*> dbl30
<*> int <*> str <*> str <* eof
and get all possible parses:
> parse rowResults row0
[Row "Apple" 15 1.5016 2.0 5.3 1801 "11/13/2018" "X101"
,Row "Apple" 15 1.5016 2.5 3.0 1801 "11/13/2018" "X101"]
>
The full program is:
{-# LANGUAGE DeriveFunctor #-}
{-# OPTIONS_GHC -Wall #-}
import Control.Monad
import Control.Applicative
type Stream = [String]
newtype PParser a = PParser (Stream -> [(a, Stream)]) deriving (Functor)
instance Applicative PParser where
pure x = PParser (\s -> [(x, s)])
(<*>) = ap
instance Monad PParser where
PParser p >>= f = PParser $ \s1 -> do -- in list monad
(x, s2) <- p s1
let PParser q = f x
(y, s3) <- q s2
return (y, s3)
instance Alternative PParser where
empty = PParser (const empty)
PParser p <|> PParser q = PParser $ \s -> p s <|> q s
instance MonadPlus PParser where
parse :: PParser a -> Stream -> [a]
parse (PParser p) s = map fst (p s)
-- read a token as-is
token :: PParser String
token = PParser $ \s -> case s of
(x:xs) -> pure (x, xs)
_ -> empty
-- require an end of stream
eof :: PParser ()
eof = PParser $ \s -> case s of
[] -> pure ((), s)
_ -> empty
-- combinator to convert a String to any readable type
convert :: (Read a) => PParser String -> PParser a
convert (PParser p) = PParser $ \s1 -> do
(x, s2) <- p s1 -- for each possible String
(y, "") <- reads x -- get each possible full read
-- (normally only one)
return (y, s2)
-- read a string from a single cell
str :: PParser String
str = token
-- read an integer (any size) from a single cell
int :: PParser Int
int = convert (mfilter ok1 token)
-- read a double from one or two cells
dbl :: PParser Double
dbl = dbl1 <|> dbl2
where dbl1 = convert (mfilter ok1 token)
dbl2 = convert $ do
t1 <- mfilter ok1 token
t2 <- mfilter ok2 token
return $ t1 ++ "." ++ t2
-- read a double that's < 30
dbl30 :: PParser Double
dbl30 = do
x <- dbl
guard $ x < 30
return x
-- rules for first cell of numbers:
-- "0" is okay, but otherwise can't start with "0"
ok1 :: String -> Bool
ok1 "0" = True
ok1 (c:_) | c /= '0' = True
ok1 _ = False
-- rules for second cell of numbers:
-- can't be "0" or end in "0"
ok2 :: String -> Bool
ok2 xs = last xs /= '0'
-- a row
data Row = Row String Int Double Double Double
Int String String deriving (Show)
rowResults :: PParser Row
rowResults = Row <$> str <*> int <*> dbl30 <*> dbl30 <*> dbl30
<*> int <*> str <*> str <* eof
row0 :: Stream
row0 = ["Apple", "15", "1", "5016", "2", "5", "3", "1801", "11/13/2018", "X101"]
main = print $ parse rowResults row0
Off-the-shelf Solutions
I find it a little surprising I can't find an existing parser library out there that provides this kind of "all possible parses" parser. The stuff in Text.ParserCombinators.ReadP takes the right approach, but it assumes that you're parsing characters from a String rather than arbitrary tokens from some other stream (in our case, Strings from a [String]).
Maybe someone else can point out an off-the-shelf solution that would save you from having to role your own parser type, instances, and primitives.

Function Loop in string too long

i have these functions:
item :: Parser Char
item = Parser i
where i [] = []
i (x:xs) = [(x,xs)]
many :: Eq a=> Parser a -> Parser [a]
many p = many1 p +++ return []
many1 :: Eq a=> Parser a -> Parser [a]
many1 p = do v <- p
vs <- many p
returns (v:vs)
Im getting strange results applying differents strings. If i execute:
parse (many item) "x:=0"
i get
[('x',0)]
while if i use another string longer like "if x=0 then x:=0 else x:=1" it looks like go in loop. Doing some attempts it seems that if the string is longer more than 19 chars the function doesnt end. It's strange because on other string long less 19 chars it works good. What it could be?
Other definitions:
newtype Parser a = Parser { parse :: String -> [(a, String)] }
instance Monad Parser where
return t = Parser $ \s -> [(t, s)]
m >>= k = Parser $ \s -> [(x, y) | (u, v) <- parse m s, (x, y) <- parse (k u) v]
(+++) :: Eq a => Parser a -> Parser a -> Parser a
p +++ q = Parser $ \s -> if((parse p s)==[]) then parse q s else parse p s
Your code works fine, it's just that you've written your parser to have infinite backtracking and therefore O(2^n) runtime. Every character you add doubles the time it takes to complete:
$ time hugs foo.hs <<< 'parse (many item) "if x=0 then x:=0 els"'
[...]
Main> [("if x=0 then x:=0 els","")]
Main> [Leaving Hugs]
real 0m11.076s
user 0m10.578s
sys 0m0.016s
vs
$ time hugs foo.hs <<< 'parse (many item) "if x=0 then x:=0 else"'
[...]
Main> [("if x=0 then x:=0 else","")]
Main> [Leaving Hugs]
real 0m22.346s
user 0m22.048s
sys 0m0.036s
Your implementation of (+++) doesn’t do what you think it does. In particular, it will only return successful parsed from one of its arguments, rather than from both. Here is how to do what you want:
(+++) :: Parser a -> Parser a -> Parser a
p +++ q = Parser $ \s -> parse p s ++ parse q s
Although this doesn’t remove duplicates so be warned that you can end up with an explosion of parses by doing eg many (many item).

how to parse a string of comma-separated values into a list of strings in haskell?

so if I have a string "(this, is, a, story, all, about, how)" into a list of the words inside it ["this", "is", "a", "story", "all", "about", "how"] as an instance of ReadP String? I've tried a bunch of different ways, one of which being this:
parseStr :: ReadP String
parseStr = do
skipSpaces
n <- munch1 isAlphaOrDigit
skipComma
return $ n
which parses all values but the last. I thought if I combined it with this parse:
parseLast :: ReadP String
parseLast = do
skipSpaces
n <- munch1 isAlphaOrDigit
return $ n
as
parseLet = (many parseStr) +++ parseLast
but that didn't work either. Any tips?
edit: more definitions
isAlphaOrDigit :: Char -> Bool
isAlphaOrDigit a = (isDigit a) || (isAlpha a)
comma = satisfy (','==)
skipComma = const () <$> some comma
The parser a +++ b sends the entire input string to a and the entire input string to b, producing all the results that either parser produced. You instead want a parser that sends the first part of the string to a and the second part to b, then lets you combine the results. Try this instead:
parseLet = liftA2 (\ss s -> ss ++ [s]) (many parseStr) parseLast
Many parser libraries also offer a manySepBy combinator (perhaps with a slightly different name) for this exact use case; you might consider looking through the ReadP library for such a thing.

Resources