Unable to understand a mutual recursion - haskell

I am reading Programming In Haskell, in the 8th chapter, the author gives an example of writing parsers.
The full source is here: http://www.cs.nott.ac.uk/~gmh/Parsing.lhs
I can't understand the following part: many permits zero or more applications of p,
whereas many1 requires at least one successful application:
many :: Parser a → Parser [a ]
many p = many1 p +++ return [ ]
many1 :: Parser a → Parser [a ]
many1 p = do v ← p
vs ← many p
return (v : vs)
How the recursive call happens at
vs <- many p
vs is the result value of many p, but many p called many1 p, all many1 has in its definition is a do notation, and again has result value v, and vs, when does the recursive call return?
Why does the following snippet can return [("123","abc")] ?
> parse (many digit) "123abc"
[("123", "abc")]

The recursion stops at the v <- p line. The monadic behavior of the Parser will just propagate a [] to the end of the computation when p cannot be parsed anymore.
p >>= f = P (\inp -> case parse p inp of
[] -> [] -- this line here does not call f
[(v,out)] -> parse (f v) out)
The second function is written in do-notation, which is just a nice syntax for the following:
many1 p = p >>= (\v -> many p >>= (\vs -> return (v : vs)))
If parsing p produces an empty list [] the function \v -> many p >>= (\vs -> return (v : vs)) will not be called, stopping the recursion.

For the last question:
> parse (many digit) "123abc"
[("123", "abc")]
Means that parsing has been successful as at least one result has been returned in the answer list. Hutton parsers always return a list - the empty list means parsing failure.
The result ("123", "abc") means that parsing has found three digits "123" and stopped at 'a' which is not a digit - so the "rest of the input" is "abc".
Note that many means "as many as possibly" not "one or more". If it were "one or more" you'd get this result instead:
[("1", "23abc"), ("12", "3abc"), ("123", "abc")]
This behaviour wouldn't be very good for deterministic parsing, though it might sometimes be needed for natural language parsing.

Let me strip this down to the barest bones to make absolutely clear why do-blocks can be misunderstood if they're read simply as imperative code. Consider this snippet:
doStuff :: Maybe Int
doStuff = do
a <- Nothing
doStuff
It looks like doStuff will recurse forever, after all, it's defined to do a sequence of things ending with doStuff. But the sequence of lines in a do-block is not simply a sequence of operations that is performed in order. If you're at a point in a do-block, the way the rest of the block is handled is determined by the definition of >>=. In my example, the second argument to >>= is only used if the first argument isn't Nothing. So the recursion never happens.
Something similar can happen in many different monads. Your example is just a little more complex: when there are no more ways to parse something, the stuff after the >>= is ignored.

Related

Parser Combinators (ReadP) - return entire list if it fails, otherwise return only the passing one

I have a string, for example "MMMMABCNNNXYZPPPPP". I know that this string may have ABC in it and may have XYZ in it, but it is not required to have either. Additionally, the XYZ may be swapped for DEF (e.g. "MMMMABCNNNDEFPPPPP") and the behavior should remain the same.
I would like to parse the string and return the sequences between them, as well as which one of XYZ or DEF was present. Example:
data Divider1 = Abc
data Divider2 = Xyz | Def
--"MMMMABCNNNXYZPPPPP" should return ("MMMM", Just Abc, "NNN", Just Xyz, "PPPPP")
--"MMMMABCNNNDEFPPPPP" should return ("MMMM", Just Abc, "NNN", Just Def, "PPPPP")
Note that if ABC is not present, I would like to return everything before the divider2 and if XYZ and DEF are both not present, I would like to return everything after divider 1.
Example:
--"MMMMNNNXYZPPPPP" should return ("MMMM", Nothing, "NNN", Just Xyz, "PPPPP")
--"MMMMABCNNNPPPPP" should return ("MMMM", Just Abc, "NNN", Nothing, "PPPPP")
If neither ABC nor XYZ is present then I don't care if it returns nothing, or if it returns the entire string.
Currently my code is
parseEverything = many $ satisfy someGeneralCondition--check if all characters are valid
parseAbc = (\str -> Abc) <$> string "ABC"
parseXyz = (\str -> Xyz) <$> string "XYZ"
parseDef = (\str -> Def) <$> string "DEF"
parseFull = do
beforeAbc <- gather parseEverything
parseAbc <- (Just <$> parseAbc) <++ return Nothing
beforeDivider2 <- gather parseEverything
parseDivider2 <- (Just <$> parseXyz) <++ (Just <$> parseDef) <++ (Just <$> Nothing)
everythingElse <- look
return (beforeAbc, parseAbc, beforeDivider2, parseDivider2, everythingElse)
But when I run this on the example string "MMMMABCNNNXYZPPPPP", I get mostly failed results with just one result that I want. The problem is that I need to return everything in beforeAbc if parseAbc fails, but if parseAbc passes then I just need to return that. And the same thing with parseXyz and parseDef. I don't think that <++ is the correct operator to do this. I also tried a variant of this code using option, but it gave the same result. Is there a simple solution that I am missing, and/or should I set up the parsers in a different way?
Thanks in advance!
Edit: does this have anything to do with chainl or chainr or manyTill?
Updated: See note on applicative parsers below.
Here's what's going wrong with your current approach. As you undoubtedly know, the parsers in Text.ParserCombinators.ReadP generate all possible valid parses of all possible prefixes of the string. If you write a parser:
letterAndOther = do
letters <- many (satisfy isLetter)
others <- many get
return (letters, others)
which grabs an initial string of letters followed by the "rest" of the string and run it on a simple test string, you'll usually get way more than you bargained for:
> readP_to_S letterAndOther "abc"
[(("",""),"abc"),(("","a"),"bc"),(("a",""),"bc"),(("","ab"),"c"),
(("a","b"),"c"),(("ab",""),"c"),(("","abc"),""),(("a","bc"),""),
(("ab","c"),""),(("abc",""),"")]
In other words, in a do-block, each monadic action will typically generate a tree of possible parses. In your current code, the very first line of the do-block:
beforeAbc <- gather parseEverything
introduces a whole tree of parse branches, one branch for each possible initial prefix. These branches only get pruned if a later line of the do-block introduces a parse that fails. But, every line of your do-block represents a parser that always succeeds. For example, this always succeeds:
parseAbc <- (Just <$> parseAbc) <++ return Nothing
because even if the first divider isn't found, the right-hand side parser return Nothing will always succeed.
I would suggest the following approach. First, as we discovered in the comments, the first thing you want to do is figure out what your parser should return. Instead of trying to shoehorn the result into a weird tuple, it's a good idea to leverage Haskell's best feature, it's algebraic data types. Define a return type for your parse:
data Result
= TwoDividers String Divider1 String Divider2 String
| FirstDivider String Divider1 String
| SecondDivider String Divider2 String
| NoDividers String
This is unambiguous and covers all possibilities. Admittedly, including Divider1 in the first two constructors is redundant, since there's only one possible Divider1, but programs are for humans to read, too, and keeping Divider1 explicit improves readability.
Now, let's define parsers for the first and second dividers:
divider1 = Abc <$ string "ABC"
divider2 = (Def <$ string "DEF") +++ (Xyz <$ string "XYZ")
Note that I've chosen to define a single divider2 instead of separate parsers for Def and Xyz. Since, in your grammar, it's always the case that "DEF" can appear anywhere "XYZ" can and vice versa, it makes sense to combine them into one parser.
We'll also want a parser for arbitrary strings (basically your parseEverything):
anything = many $ satisfy isLetter -- valid characters
Now, let's write a parser for the full string. A key insight here is that we have four alternatives (i.e., the four constructors for our Result type). It's true that they share some structure, but a first crack at a parser can just treat them as independent alternatives. We'll use the <++ operator to choose the best match:
result =
(TwoDividers <$> anything <*> divider1 <*> anything <*> divider2 <*> anything)
<++ (FirstDivider <$> anything <*> divider1 <*> anything)
<++ (SecondDivider <$> anything <*> divider2 <*> anything)
<++ (NoDividers <$> anything)
A quick test of this will show we've forgotten something:
> readP_to_S result "MMMMABCNNNXYZPPPPP"
[(TwoDividers "MMMM" Abc "NNN" Xyz "","PPPPP"),...]
By default, the parser combinators will try every possible prefix of the input string, leaving more for later parsers. So, we should wrap this up in a final parser function that checks for the end-of-string:
parseResult = readP_to_S (result <* eof)
and with the tests:
main = mapM_ (print . parseResult)
[ "MMMMABCNNNXYZPPPPP"
, "MMMMABCNNNDEFPPPPP"
, "MMMMNNNXYZPPPPP"
, "MMMMABCNNNPPPPP"
]
we get the expected unique parsed output:
[(TwoDividers "MMMM" Abc "NNN" Xyz "PPPPP","")]
[(TwoDividers "MMMM" Abc "NNN" Def "PPPPP","")]
[(SecondDivider "MMMMNNN" Xyz "PPPPP","")]
[(FirstDivider "MMMM" Abc "NNNPPPPP","")]
Note on Applicative Parsers. I've used applicative syntax here, rather than the monad syntax. The difference isn't purely syntactical -- you can always write an applicative expression in monadic form, but there are monadic operations that can't be expressed applicatively, so the monadic syntax is strictly more powerful. However, when an expression can be written both ways, often the applicative syntax is more succinct and easier to write and understand, at least once you get used to it.
In a nutshell, the expression p <*> x <*> y <*> z creates a new parser that applies the parsers p, x, y, and z in order, and then applies the result from parser p (which needs to be a function f) to the results from the rest of the parsers (which must be appropriate arguments for f). In many cases, the function f is a known function and doesn't need to be returned by a parser, so a common variant is to write f <$> x <*> y <*> z. This applies the parsers x, y, and z in order, and then applies f (given directly instead of returned by a parser) to the results from those parsers. For example, the expression:
FirstDivider <$> anything <*> divider1 <*> anything
runs three parsers in order to get anything, followed by a divider1, followed by anything, and then applies the function/contructor FirstDivider to the three arguments resulting from those parsers.
The operators <* and *> can be thought of as variants of <*>. The expression p <*> x first parses p, then parses x, then applies the result of the former to the latter. The expression p <* x first parses p, then parses x, but instead of applying the former to the latter, it returns the value the arrow is pointing to (i.e., whatever p produced) and throws away the other value. Similarly p *> x parses p then parses x, then returns whatever x produced. In particular:
someParser <* eof
first runs someParser, then parses (i.e., checks for) EOF, then returns whatever someParser produced.
This syntax can really shine when parsing more traditional languages into an abstract syntax tree. If you want to parse statements like:
let x = 1 + 5
into a Statement type like:
data Statement = ... | Let Var Expr | ...
you can write a Parsec parser that looks like:
statement = ...
<|> Let <$ string "let" <*> var <* symbol "=" <*> expr
...
The monadic equivalent in do-notation looks like this:
do string "let"
v <- var
symbol "="
e <- expr
return $ Let v e
which is fine, I suppose, but kind of obscures the simple structure of the parse. The applicative version is basically just the list of tokens to parse, with a little bit of syntactic sugar sprinkled in.
Anyway, here's the full program:
import Data.Char
import Text.ParserCombinators.ReadP
data Divider1 = Abc deriving (Show)
data Divider2 = Xyz | Def deriving (Show)
data Result
= TwoDividers String Divider1 String Divider2 String
| FirstDivider String Divider1 String
| SecondDivider String Divider2 String
| NoDividers String
deriving (Show)
anything :: ReadP String
anything = many $ satisfy isLetter -- valid characters
divider1 :: ReadP Divider1
divider1 = Abc <$ string "ABC"
divider2 :: ReadP Divider2
divider2 = (Def <$ string "DEF") +++ (Xyz <$ string "XYZ")
result :: ReadP Result
result =
(TwoDividers <$> anything <*> divider1 <*> anything <*> divider2 <*> anything)
<++ (FirstDivider <$> anything <*> divider1 <*> anything)
<++ (SecondDivider <$> anything <*> divider2 <*> anything)
<++ (NoDividers <$> anything)
parseResult :: String -> [(Result, String)]
parseResult = readP_to_S (result <* eof)
main :: IO ()
main = mapM_ (print . parseResult)
[ "MMMMABCNNNXYZPPPPP"
, "MMMMABCNNNDEFPPPPP"
, "MMMMNNNXYZPPPPP"
, "MMMMABCNNNPPPPP"
]

Why is sequence [getLine, getLine, getLine] not evaluated lazily?

main = do
input <- sequence [getLine, getLine, getLine]
mapM_ print input
Let's see this program in action:
m#m-X555LJ:~$ runhaskell wtf.hs
asdf
jkl
powe
"asdf"
"jkl"
"powe"
Surprisingly to me, there seems to be no laziness here. Instead, all 3 getLines are evaluated eagerly, the read values are stored in memory and then, not before, all are printed.
Compare to this:
main = do
input <- fmap lines getContents
mapM_ print input
Let's see this in action:
m#m-X555LJ:~$ runhaskell wtf.hs
asdf
"asdf"
lkj
"lkj"
power
"power"
Totally different stuff. Lines are read one by one and printed one by one. Which is odd to me because I don't really see any differences between these two programs.
From LearnYouAHaskell:
When used with I/O actions, sequenceA is the same thing as sequence!
It takes a list of I/O actions and returns an I/O action that will
perform each of those actions and have as its result a list of the
results of those I/O actions. That's because to turn an [IO a] value
into an IO [a] value, to make an I/O action that yields a list of
results when performed, all those I/O actions have to be sequenced so
that they're then performed one after the other when evaluation is
forced. You can't get the result of an I/O action without performing
it.
I'm confused. I don't need to perform ALL IO actions to get the results of just one.
A few paragraphs earlier the book shows a definition of sequence:
sequenceA :: (Applicative f) => [f a] -> f [a]
sequenceA [] = pure []
sequenceA (x:xs) = (:) <$> x <*> sequenceA xs
Nice recursion; nothing here hints me that this recursion should not be lazy;just like in any other recursion, to get the head of the returned list Haskell doesn't have to go down through ALL steps of recursion!
Compare:
rec :: Int -> [Int]
rec n = n:(rec (n+1))
main = print (head (rec 5))
In action:
m#m-X555LJ:~$ runhaskell wtf.hs
5
m#m-X555LJ:~$
Clearly, the recursion here is performed lazily, not eagerly.
Then why is the recursion in the sequence [getLine, getLine, getLine] example performed eagerly?
As to why it is important that IO actions are run in order
regardless of the results: Imagine an action createFile :: IO () and
writeToFile :: IO (). When I do a sequence [createFile,
writeToFile] I'd hope that they're both done and in order, even
though I don't care about their actual results (which are both the
very boring value ()) at all!
I'm not sure how this applies to this Q.
Maybe I'll word my Q this way...
In my mind this:
do
input <- sequence [getLine, getLine, getLine]
mapM_ print input
should detoriate to something like this:
do
input <- do
input <- concat ( map (fmap (:[])) [getLine, getLine, getLine] )
return input
mapM_ print input
Which, in turn, should detoriate to something like this (pseudocode, sorry):
do
[ perform print on the result of getLine,
perform print on the result of getLine,
perform print on the result of getLine
] and discard the results of those prints since print was applied with mapM_ which discards the results unlike mapM
getContents is lazy, getLine isn't. Lazy IO isn't a feature of Haskell per se, it's a feature of some particular IO actions.
I'm confused. I don't need to perform ALL IO actions to get the results of just one.
Yes you do! That is one of the most important features of IO, that if you write a >> b or equivalently,
do a
b
then you can be sure that a is definitely "run" before b (see footnote). getContents is actually the same, it "runs" before whatever comes after it... but the result it returns is a sneaky result that sneakily does more IO when you try to evaluate it. That is actually the surprising bit, and it can lead to some very interesting results in practice (like the file you're reading the contents of being deleted or changed while you're processing the results of getContents), so in practical programs you probably shouldn't be using it, it mostly exists for convenience in programs where you don't care about such things (Code Golf, throwaway scripts or teaching for instance).
As to why it is important that IO actions are run in order regardless of the results: Imagine an action createFile :: IO () and writeToFile :: IO (). When I do a sequence [createFile, writeToFile] I'd hope that they're both done and in order, even though I don't care about their actual results (which are both the very boring value ()) at all!
Addressing the edit:
should detoriate to something like this:
do
input <- do
input <- concat ( map (fmap (:[])) [getLine, getLine, getLine] )
return input
mapM_ print input
No, it actually turns into something like this:
do
input <- do
x <- getLine
y <- getLine
z <- getLine
return [x,y,z]
mapM_ print input
(the actual definition of sequence is more or less this:
sequence [] = return []
sequence (a:as) = do
x <- a
fmap (x:) $ sequence as
Technically, in
sequenceA (x:xs) = (:) <$> x <*> sequenceA xs
we find <*>, which first runs the action on the left, then the action on the right, and finally applies their result together. This is what makes the first effect in the list to be occur first, and so on.
Indeed, on monads, f <*> x is equivalent to
do theF <- f
theX <- x
return (theF theX)
More in general, note that all the IO actions are generally executed in order, first to last (see below for a few rare exceptions). Doing IO in a completely lazy way would be a nightmare for the programmer. For instance, consider:
do let aX = print "x" >> return 4
aY = print "y" >> return 10
x <- aX
y <- aY
print (x+y)
Haskell guarantees that the output is x y 14, in that order. If we had completely lazy IO we could also get y x 14, depending on which argument is forced first by +. In such case, we would need to know exactly the order in which the lazy thunks are demanded by every operation, which is something the programmer definitely does not want to care about. Under such detailed semantics, x + y is no longer equivalent to y + x, breaking equational reasoning in many cases.
Now, if we wanted to force IO to be lazy we could use one of the forbidden functions, e.g.
do let aX = unsafeInterleaveIO (print "x" >> return 4)
aY = unsafeInterleaveIO (print "y" >> return 10)
x <- aX
y <- aY
print (x+y)
The above code makes aX and aY lazy IO actions, and the order of the output is now at the whim of the compiler and the library implementation of +. This is in general dangerous, hence the unsafeness of lazy IO.
Now, about the exceptions. Some IO actions which only read from the environment, like getContents were implemented with lazy IO (unsafeInterleaveIO). The designers felt that for such reads, lazy IO can be acceptable, and that the precise timing of the reads is not that important in many cases.
Nowadays, this is controversial. While it can be convenient, lazy IO can be too unpredictable in many cases. For instance, we can't know where the file will be closed, and that could matter if we're reading from a socket. We also need to be very careful not to force the reads too early: that often leads to a deadlock when reading from a pipe. Today, it is usually preferred to avoid lazy IO, and resort to some library like pipes or conduit for "streaming"-like operations, where there is no ambiguity.

In Parsec, how do I run second parser, only if the first parser consumed some input?

I need a combinator like p1 << p2, but p2 should run only if p1 has succeeded and consumed some input.
If p1 succeeded without consuming input, p2 should not run.
If p1 failed, then p2 is also ignored?
Overall result is r1's result
Parsec primitives make an internal distinction between a parser that succeeds after consuming some input and a parser that succeeds after consuming no input which you should be able to leverage. In particular, the following ought to work to parse p and then -- conditioned on p successfully consuming input -- parse q and discard its results:
ifConsumed :: Monad m => ParsecT s u m a -> ParsecT s u m b -> ParsecT s u m a
ifConsumed p q = mkPT k
where -- k :: State s u -> m (Consumed (m (Reply s u a)))
k s = do cons <- runParsecT p s
case cons of
Consumed mrep -> do
rep <- mrep
case rep of
Ok x s' err -> runParsecT (fmap (const x) q) s'
Error err -> return . Consumed . return $ Error err
Empty mrep -> do
rep <- mrep
case rep of
Ok x s' err -> return . Empty . return $ Ok x s' err
Error err -> return . Empty . return $ Error err
It's ugly because Parsec doesn't directly expose the ParsecT constructor, so you have to use the mkPt and runParsecT intermediaries which add a lot of boilerplate.
In a nutshell, it runs the p parser. If this succeeds with input consumed (the Consumed -> Ok branch), it runs the q parser modified via fmap to return the value parsed by p. On the other hand, if p succeeds with no input consumed (the Empty -> Ok branch), it simply returns success without running the q parser.
The only caveat is that I'm not 100% sure how, within the Parsec library itself, the invariant is preserved whereby the Consumed -> Ok branch only gets called when input has been consumed, so I don't know if this is truly reliable. You'll want to test it carefully in your particular use case.
For the following parser --- which parses a list of one or more elements separated commas where each element consists of zero or more digits, then two exclamation marks only if the previous parser consumed some input, then a semicolon --- it seems to work:
p :: Parser [String]
p = ifConsumed (sepBy1 (many digit) (char ',')) (char '!' >> char '!') <* char ';'
runp :: String -> Either ParseError [String]
runp = parse p ""
Some tests:
runp "" -- fails, expecting semicolon
runp ";" -- returns [""]
runp "!!;" -- fails, "!!" w/ no preceding content
runp ",;" -- fails, missing "!!"
runp ",!!;" -- returns ["",""]
runp ",!;" -- fails, expecting second "!"
runp ",1,23;" -- fails, missing "!!"
runp ",1,23!!;" -- returns ["","1","23"]
With a naive parser implementation, you should be able to do this:
(<<) p1 p2 = P $ \inp -> case parse p1 inp of
ErrorResult e -> ErrorResult e
SuccessResult (rem, res) -> if rem == inp
then SuccessResult (rem, res)
else parse p2 rem
Though Parsec is more advanced, you could probably roll your own there as well.
I don't think you can do that for arbitrary parsers p1 and p2: you need them to communicate somehow. If you could do this, it seems to me that you would break referential transparency.
For example, consider parsing the input string repeat 'x': whether p1 consumes a character or not, p2 will see the string as an endless sea of x characters. If it hasn't communicated with p1 somehow (eg by modifying something in the parser's state), then you can't know whether a character was consumed; if your combinator were somehow able to treat these two cases differently, it would be breaking the rules.

How does getLine work in haskell?

Looking at the definition of getLine in the Haskell Prelude,
I get how the recursion works, where you keep asking for a character until you hit a newline and you buildup a list which you then return wrapped in an IO.
However my question is how do the return statements work in this case, specifically how does return (c:....:return "") work when you hit the base case. How do you cons a return "" on to a list?
return isn't a control structure like in most languages. It's a constructor for monadic values. Let's take a look at its type:
return :: Monad m => a -> m a
In this case, given a String value, it produces a IO String value.
The fact that return is the last expression evaluated in each branch of the if doesn't mean return ends execution; other expressions could occur after return. Consider this simple example from the list monad:
foo :: Int -> Int -> [Int]
foo x y = return x ++ return y
In the list monad, return simply creates a new single-item list containing its argument. Those two lists are then concatenated into the final result list returned by the function.
$ return 3 :: [Int]
[3]
$ foo 3 4
[3,4]
do-notation is a syntax sugar.
do x <- e
rest
is equivalent to
e >>= \x -> rest
where >>= is a flatMap or bind operation (it attaches a callback to IO container).
flatMap :: IO a -> (a -> IO b) -> IO b meaning is: given container of type IO a attach a callback of type a -> IO b, fired when container succeeds in its operation, and this produces a new container of type IO b
So
getLine =
getChar >>= \c ->
if c == '\n'
then (return [])
else getLine >>= \rest ->
return (c : rest)
What is means? getLine immediately delegates execution to getChar IO-container, with a callback, which analyses the character passed to it. If its a newline, it does "return """, which is a construction of IO-container, returning empty String immediately.
Otherwise, we call ourselves, grab the rest and return current character attached to rest.
P.S.: return is used to turn a pure value into container, since Monad interface doesn't allow us to bind non-container-producing callbacks (there are very good reasons for this).

How to translate this python to Haskell?

I'm learning Haskell and as an exercise I'm trying to convert write the read_from function following code to Haskell. Taken from Peter Norvig's Scheme interpreter.
Is there a straightforward way do this?
def read(s):
"Read a Scheme expression from a string."
return read_from(tokenize(s))
parse = read
def tokenize(s):
"Convert a string into a list of tokens."
return s.replace('(',' ( ').replace(')',' ) ').split()
def read_from(tokens):
"Read an expression from a sequence of tokens."
if len(tokens) == 0:
raise SyntaxError('unexpected EOF while reading')
token = tokens.pop(0)
if '(' == token:
L = []
while tokens[0] != ')':
L.append(read_from(tokens))
tokens.pop(0) # pop off ')'
return L
elif ')' == token:
raise SyntaxError('unexpected )')
else:
return atom(token)
def atom(token):
"Numbers become numbers; every other token is a symbol."
try: return int(token)
except ValueError:
try: return float(token)
except ValueError:
return Symbol(token)
There is a straightforward way to "transliterate" Python into Haskell. This can be done by clever usage of monad transformers, which sounds scary, but it's really not. You see, due to purity, in Haskell when you want to use effects such as mutable state (e.g. the append and pop operations are performing mutation) or exceptions, you have to make it a little more explicit. Let's start at the top.
parse :: String -> SchemeExpr
parse s = readFrom (tokenize s)
The Python docstring said "Read a Scheme expression from a string", so I just took the liberty of encoding this as the type signature (String -> SchemeExpr). That docstring becomes obsolete because the type conveys the same information. Now... what is a SchemeExpr? According to your code, a scheme expression can be an int, float, symbol, or list of scheme expressions. Let's create a data type that represents these options.
data SchemeExpr
= SInt Int
| SFloat Float
| SSymbol String
| SList [SchemeExpr]
deriving (Eq, Show)
In order to tell Haskell that the Int we are dealing with should be treated as a SchemeExpr, we need to tag it with SInt. Likewise with the other possibilities. Let's move on to tokenize.
tokenize :: String -> [Token]
Again, the docstring turns into a type signature: turn a String into a list of Tokens. Well, what's a Token? If you look at the code, you'll notice that the left and right paren characters are apparently special tokens, which signal particular behaviors. Anything else is... unspecial. While we could create a data type to more clearly distinguish parens from other tokens, let's just use Strings, to stick a little closer to the original Python code.
type Token = String
Now let's try writing tokenize. First, let's write a quick little operator for making function chaining look a bit more like Python. In Haskell, you can define your own operators.
(|>) :: a -> (a -> b) -> b
x |> f = f x
tokenize s = s |> replace "(" " ( "
|> replace ")" " ) "
|> words
words is Haskell's version of split. However, Haskell has no pre-cooked version of replace that I know of. Here's one that should do the trick:
-- add imports to top of file
import Data.List.Split (splitOn)
import Data.List (intercalate)
replace :: String -> String -> String -> String
replace old new s = s |> splitOn old
|> intercalate new
If you read the docs for splitOn and intercalate, this simple algorithm should make perfect sense. Haskellers would typically write this as replace old new = intercalate new . splitOn old, but I used |> here for easier Python audience understanding.
Note that replace takes three arguments, but above I only invoked it with two. In Haskell you can partially apply any function, which is pretty neat. |> works sort of like the unix pipe, if you couldn't tell, except with more type safety.
Still with me? Let's skip over to atom. That nested logic is a bit ugly, so let's try a slightly different approach to clean it up. We'll use the Either type for a much nicer presentation.
atom :: Token -> SchemeExpr
atom s = Left s |> tryReadInto SInt
|> tryReadInto SFloat
|> orElse (SSymbol s)
Haskell doesn't have the automagical coersion functions int and float, so instead we will build tryReadInto. Here's how it works: we're going to thread Either values around. An Either value is either a Left or a Right. Conventionally, Left is used to signal error or failure, while Right signals success or completion. In Haskell, to simulate the Python-esque function call chaining, you just place the "self" argument as the last one.
tryReadInto :: Read a => (a -> b) -> Either String b -> Either String b
tryReadInto f (Right x) = Right x
tryReadInto f (Left s) = case readMay s of
Just x -> Right (f x)
Nothing -> Left s
orElse :: a -> Either err a -> a
orElse a (Left _) = a
orElse _ (Right a) = a
tryReadInto relies on type inference in order to determine which type it is trying to parse the string into. If the parse fails, it simply reproduces the same string in the Left position. If it succeeds, then it performs whatever function is desired and places the result in the Right position. orElse allows us to eliminate the Either by supplying a value in case the former computations failed. Can you see how Either acts as a replacement for exceptions here? Since the ValueExceptions in the Python code are always caught inside the function itself, we know that atom will never raise an exception. Similarly, in the Haskell code, even though we used Either on the inside of the function, the interface that we expose is pure: Token -> SchemeExpr, no outwardly-visible side effects.
OK, let's move on to read_from. First, ask yourself the question: what side effects does this function have? It mutates its argument tokens via pop, and it has internal mutation on the list named L. It also raises the SyntaxError exception. At this point, most Haskellers will be throwing up their hands saying "oh noes! side effects! gross!" But the truth is that Haskellers use side effects all the time as well. We just call them "monads" in order to scare people away and avoid success at all costs. Mutation can be accomplished with the State monad, and exceptions with the Either monad (surprise!). We will want to use both at the same time, so we'll in fact use "monad transformers", which I'll explain in a bit. It's not that scary, once you learn to see past the cruft.
First, a few utilities. These are just some simple plumbing operations. raise will let us "raise exceptions" as in Python, and whileM will let us write a while loop as in Python. For the latter, we simply have to make it explicit in what order the effects should happen: first perform the effect to compute the condition, then if it's True, perform the effects of the body and loop again.
import Control.Monad.Trans.State
import Control.Monad.Trans.Class (lift)
raise = lift . Left
whileM :: Monad m => m Bool -> m () -> m ()
whileM mb m = do
b <- mb
if b
then m >> whileM mb m
else return ()
We again want to expose a pure interface. However, there is a chance that there will be a SyntaxError, so we will indicate in the type signature that the result will be either a SchemeExpr or a SyntaxError. This is reminiscent of how in Java you can annotate which exceptions a method will raise. Note that the type signature of parse has to change as well, since it might raise the SyntaxError.
data SyntaxError = SyntaxError String
deriving (Show)
parse :: String -> Either SyntaxError SchemeExpr
readFrom :: [Token] -> Either SyntaxError SchemeExpr
readFrom = evalStateT readFrom'
We are going to perform a stateful computation on the token list that is passed in. Unlike the Python, however, we are not going to be rude to the caller and mutate the very list passed to us. Instead, we will establish our own state space and initialize it to the token list we are given. We will use do notation, which provides syntactic sugar to make it look like we're programming imperatively. The StateT monad transformer gives us the get, put, and modify state operations.
readFrom' :: StateT [Token] (Either SyntaxError) SchemeExpr
readFrom' = do
tokens <- get
case tokens of
[] -> raise (SyntaxError "unexpected EOF while reading")
(token:tokens') -> do
put tokens' -- here we overwrite the state with the "rest" of the tokens
case token of
"(" -> (SList . reverse) `fmap` execStateT readWithList []
")" -> raise (SyntaxError "unexpected close paren")
_ -> return (atom token)
I've broken out the readWithList portion into a separate chunk of code,
because I want you to see the type signature. This portion of code introduces
a new scope, so we simply layer another StateT on top of the monad stack
that we had before. Now, the get, put, and modify operations refer
to the thing called L in the Python code. If we want to perform these operations
on the tokens, then we can simply preface the operation with lift in order
to strip away one layer of the monad stack.
readWithList :: StateT [SchemeExpr] (StateT [Token] (Either SyntaxError)) ()
readWithList = do
whileM ((\toks -> toks !! 0 /= ")") `fmap` lift get) $ do
innerExpr <- lift readFrom'
modify (innerExpr:)
lift $ modify (drop 1) -- this seems to be missing from the Python
In Haskell, appending to the end of a list is inefficient, so I instead prepended, and then reversed the list afterwards. If you are interested in performance, then there are better list-like data structures you can use.
Here is the complete file: http://hpaste.org/77852
So if you're new to Haskell, then this probably looks terrifying. My advice is to just give it some time. The Monad abstraction is not nearly as scary as people make it out to be. You just have to learn that what most languages have baked in (mutation, exceptions, etc), Haskell instead provides via libraries. In Haskell, you must explicitly specify which effects you want, and controlling those effects is a little less convenient. In exchange, however, Haskell provides more safety so you don't accidentally mix up the wrong effects, and more power, because you are in complete control of how to combine and refactor effects.
In Haskell, you wouldn't use an algorithm that mutates the data it operates on. So no, there is no straightforward way to do that. However, the code can be rewritten using recursion to avoid updating variables. Solution below uses the MissingH package because Haskell annoyingly doesn't have a replace function that works on strings.
import Data.String.Utils (replace)
import Data.Tree
import System.Environment (getArgs)
data Atom = Sym String | NInt Int | NDouble Double | Para deriving (Eq, Show)
type ParserStack = (Tree Atom, Tree Atom)
tokenize = words . replace "(" " ( " . replace ")" " ) "
atom :: String -> Atom
atom tok =
case reads tok :: [(Int, String)] of
[(int, _)] -> NInt int
_ -> case reads tok :: [(Double, String)] of
[(dbl, _)] -> NDouble dbl
_ -> Sym tok
empty = Node $ Sym "dummy"
para = Node Para
parseToken (Node _ stack, Node _ out) "(" =
(empty $ stack ++ [empty out], empty [])
parseToken (Node _ stack, Node _ out) ")" =
(empty $ init stack, empty $ (subForest (last stack)) ++ [para out])
parseToken (stack, Node _ out) tok =
(stack, empty $ out ++ [Node (atom tok) []])
main = do
(file:_) <- getArgs
contents <- readFile file
let tokens = tokenize contents
parseStack = foldl parseToken (empty [], empty []) tokens
schemeTree = head $ subForest $ snd parseStack
putStrLn $ drawTree $ fmap show schemeTree
foldl is the haskeller's basic structured recursion tool and it serves the same purpose as your while loop and recursive call to read_from. I think the code can be improved a lot, but I'm not so used to Haskell. Below is an almost straight transliteration of the above to Python:
from pprint import pprint
from sys import argv
def atom(tok):
try:
return 'int', int(tok)
except ValueError:
try:
return 'float', float(tok)
except ValueError:
return 'sym', tok
def tokenize(s):
return s.replace('(',' ( ').replace(')',' ) ').split()
def handle_tok((stack, out), tok):
if tok == '(':
return stack + [out], []
if tok == ')':
return stack[:-1], stack[-1] + [out]
return stack, out + [atom(tok)]
if __name__ == '__main__':
tokens = tokenize(open(argv[1]).read())
tree = reduce(handle_tok, tokens, ([], []))[1][0]
pprint(tree)

Resources