I'm stuck on a problem with writing a parser in Haskell that I hope someone can help out with!
It is a bit more complicated than my usual parser because there are two layers of parsing. First a language definition is parsed into an AST, then that AST is transformed into another parser that parses the actual language.
I have made pretty good progress so far but I'm stuck on implementing recursion in the language definition. As the language definition is transformed from AST into a parser in a recursive function, I can't work out how it can call itself if it doesn't exist yet.
I'm finding it a bit hard to explain my problem, so maybe an example will help.
The language definition might define that a language consists of three keywords in sequence and then optional recursion in brackets.
A B C ($RECURSE)
Which would be parsed into an AST like:
[Keyword "A", Keyword "B", Keyword "C", Optional (Many [Recurse])]
The Many is not really required for this example, but in my actual project, optional blocks can have multiple syntax elements in them so an Optional would contain a Many with n elements.
I would then want it to get transformed into a parser that parses strings like:
A B C
A B C (A B C)
A B C (A B C (A B C))
I've boiled down my project into the simplest possible example. You can see my TODO comment where I'm stuck trying to implement the recursion.
{-# LANGUAGE OverloadedStrings #-}
module Example
( runExample,
)
where
import Control.Applicative hiding (many, some)
import Data.Text (Text)
import Data.Void
import System.IO as SIO
import Text.Megaparsec hiding (State)
import Text.Megaparsec.Char (space1, string')
import qualified Text.Megaparsec.Char.Lexer as L
import Text.Megaparsec.Debug
import Text.Pretty.Simple (pPrint)
-- Types
type Parser = Parsec Void Text
data SyntaxAst = Keyword Text | Recurse | Optional SyntaxAst | Many [SyntaxAst]
-- Megaparsec Base Parsers
-- Space consumer - used by other parsers to ignore whitespace
sc :: Parser ()
sc =
L.space
space1
(L.skipLineComment "--")
(L.skipBlockComment "/*" "*/")
-- Runs a parser, then consumes any left over space with sc
lexeme :: Parser a -> Parser a
lexeme = L.lexeme sc
-- Parses a string, then consumes any left over space with sc
symbol :: Text -> Parser Text
symbol = L.symbol sc
-- Parses something between parentheses
inParens :: Parser a -> Parser a
inParens =
between
(symbol "(")
(symbol ")")
-- Transforms the AST into a parser
transformSyntaxExprToParser :: SyntaxAst -> Parser [Text]
transformSyntaxExprToParser (Many exprs) = dbg "Many" (createParser exprs)
transformSyntaxExprToParser (Keyword text) = dbg "Keyword" (pure <$> lexeme (string' text))
transformSyntaxExprToParser (Optional inner) = dbg "Optional" (option [] (try (inParens (transformSyntaxExprToParser inner))))
transformSyntaxExprToParser Recurse = dbg "Recurse" (pure ["TODO"]) -- TODO: How do I recurse here?
-- transformSyntaxExprToParser s Recurse = dbg "Recurse" (createParser s) -- Seems to work in the example, but in my actual application creates an infinite loop and freezes
-- Walks over the parser AST and convert it to a parser
createParser :: [SyntaxAst] -> Parser [Text]
createParser expressions =
do
foldr1 (liftA2 (<>)) (fmap transformSyntaxExprToParser expressions)
runExample :: IO ()
runExample = do
-- To make the example simple, lets cut out the language definition parsing and just define
-- it literally.
let languageParser = createParser [Keyword "A", Keyword "B", Keyword "C", Optional (Many [Recurse])]
let run p = runParser p "" "A B C (A B C (A B C))"
let result = run languageParser
case result of
Left bundle -> SIO.putStrLn (errorBundlePretty bundle)
Right xs -> pPrint xs
A few things I've tried:
Pass the original AST up to the transformSyntaxExprToParser function and call createParser when the Recurse token is encountered. This didn't work due to infinite loops.
Using mutable references like IORef/STRef to pass in a reference that is updated to reference the final parser once the transformation is finished. I couldn't work out how to thread the IO/ST monads into the parser transform function.
State monads. I couldn't work out how to pass a reference through the state monad.
I hope that makes sense, let me know if I need to elaborate more. I can also push up my full project if it will help.
Thanks for reading!
Edit: I've made changes to my original example to demonstrate the infinite loop problem (integrating the excellent suggestions in the answer below) at https://pastebin.com/DN0JJ9BA
I believe you can use laziness here. Pass the final parser as a parameter to transformSyntaxExprToParser, and when you see a Recurse, return that parser.
transformSyntaxExprToParser :: Parser [Text] -> SyntaxAst -> Parser [Text]
transformSyntaxExprToParser self = go
where
go (Keyword text) = dbg "Keyword" (pure <$> lexeme (string' text))
go (Optional inner) = dbg "Optional" (option [] (try (inParens (go inner))))
go Recurse = dbg "Recurse" self
createParser :: [SyntaxAst] -> Parser [Text]
createParser expressions = parser
where
parser = foldr1 (liftA2 (<>))
(fmap (transformSyntaxExprToParser parser) expressions)
This ought to produce exactly the same kind of recursive parser as if you had written it directly. A Parser is ultimately just a data structure which you can construct using its instances of Monad, Applicative, Alternative, &c.
Your idea of doing this with a mutable reference such as an IORef is essentially what’s happening under the hood anyway when constructing and evaluating a thunk.
Your idea here was almost correct:
Pass the original AST up to the transformSyntaxExprToParser function and call createParser when the Recurse token is encountered. This didn't work due to infinite loops.
The problem is that you were constructing a new parser for every Recurse, from the same input, which contains a Recurse, thus constructing a new parser…and so on. What my code above does is just pass in the same parser.
If you need to perform monadic side effects while constructing the parser, such as logging, then you can use a recursive do, for example, with some hypothetical MonadLog class for illustration:
{-# Language RecursiveDo #-}
transformSyntaxExprToParser :: (MonadLog m) => Parser [Text] -> SyntaxAst -> m (Parser [Text])
transformSyntaxExprToParser self = go
where
go (Keyword text) = do
logMessage "Got ‘Keyword’"
pure $ dbg "Keyword" (pure <$> lexeme (string' text))
go (Optional inner) = do
logMessage "Got ‘Optional’"
inner' <- go inner
pure $ dbg "Optional" (option [] (try (inParens inner')))
go Recurse = do
logMessage "Got ‘Recurse’"
pure $ dbg "Recurse" self
createParser :: (MonadFix m, MonadLog m) => [SyntaxAst] -> m (Parser [Text])
createParser expressions = do
rec
parser <- fmap (foldr1 (liftA2 (<>)))
(traverse (transformSyntaxExprToParser parser) expressions)
pure parser
The rec block introduces a recursive binding which you may construct using side effects. In general, some care is required to ensure that recursive definitions like this are sufficiently lazy, that is, you don’t force the result sooner than intended, but here the recursion pattern is very simple, and you never examine the self parser, only treat it as a black box to hook up to other parsers.
This method also makes it explicit what the scope of a Recurse is, and opens the possibility of introducing local recursive parsers, with a new call to transformSyntaxExprToParser with a new local self argument.
Related
I have a DSL, and a parser for it written in Haskell with the Parsec package. Now I want to deprecate a specific language feature of the DSL. In the next release, I want the parser to accept both the new and the old syntax, but I want the parser to spawn a deprecation message. I could not find how to do this. Is this possible, and if so, how can this be done?
Instead of emitting messages during parsing, it would be better to return extra information at the end of parsing: whether or not deprecated syntax was encountered.
The ParsecT type admits a type parameter for state set by the user during parsing:
ParsecT s u m a is a parser with stream type s, user state type u, underlying monad m and return type a. Parsec is strict in the user state.
The user state can be set with putState and modifyState. It can be obtained using getState.
Most parsec combinators are polymorphic on the user state. Most combinators for your own DSL should be, as well. But parsers for deprecated parts of the syntax should set a "flag" in your user state.
Something like this:
import Text.Parsec
import Text.Parsec.Char
import Data.Functor.Identity
type Parser = ParsecT [Char] Bool Identity -- using a Bool state
myParser :: Parser Char
myParser =
try (do char 'a'
putState True
char 'b')
<|>
try (do char 'a'
char 'c')
main :: IO ()
main = do
print $ runParser ((,) <$> myParser <*> getState) False "" "ab"
print $ runParser ((,) <$> myParser <*> getState) False "" "ac"
-- results:
-- Right ('b',True)
-- Right ('c',False)
Of course, instead of a simple boolean flag, it would be better to put more information into the state.
Notice that state set by a sub-parser is "forgotten" if the sub-parser backtracks. That is the correct behavior for our purposes: otherwise, we would get "false positives" triggered by branches that are ultimately discarded.
A common alternative to parsec is megaparsec. The latter doesn't allow for user-defined state in the parser type itself, but it can be emulated using a StateT transformer over the ParsecT type.
I have some input data of the form below (this is just a small sample).
ID_SID_0_LANG=eng
ID_VIDEO_FORMAT=H264
ID_VIDEO_HEIGHT=574
ID_START_TIME=0.00
ID_SUBTITLE_ID=0
ID_VIDEO_ID=0
ID_VIDEO_FPS=25.000
ID_VIDEO_WIDTH=700
I'm trying to see if I can parse this with Parsec. For the sake of our example, I want to pull out two values, the width and the height. I am trying to see if this can be done with Parsec.
The lines may come in any order
If either the width or the height are missing, I'd like a ParseError
If either the width or the height occur more than once, I'd like a ParseError
The other lines are mixed and varied per input, I can assume nothing beyond their basic format.
I'd like to use Parsec because I'm going to have to parse the values (which, in general, may be of different types - enumerations for codecs, elapsed types, strings, etc.). And I'd like my returned data structure to contain Naturals rather than, say, Maybe Natural, to simplify later code.
My problem is how to "parse" the leading ID_ lines that aren't interesting to me, but pick up only those that are. So I want to parse "any number of uninteresting ID_ lines; a height (or width); any number of uninteresting ID_ lines; a width (or height if width already found); any number of uninteresting ID_ lines). And I'd like to do this without repeating the notion of what constitutes an "interesting" key, because repetition is a primary cause of subtle error when being later maintained.
My best effort so far is to parse lines producing a list of Data Structure Modifiers for the interesting lines, each with a Key, and separately checking for presence of the required lines and lack of duplication of the unique lines; but that's not satisfying because I'm repeating the "interesting" keys.
Can this be elegantly done with Parsec?
Thanks,
Given that you want an "elegant" Parsec solution, I think you're looking for a variant of a permutation parser.
For background reading, see the documentation for Text.Parsec.Perm and its more modern incarnation in module Control.Applicative.Permutation of the parser-combinators library. In addition, this Functional Pearl paper Parsing Permutation Phrases describes the approach and is great fun to read.
There are two special aspects to your problem: First, I'm not aware of an existing permutation parser that allows for "unmatched" content before, between and after matched portions in a clean manner, and hacks like building the skip logic into the component parsers or deriving an extra parser to identify skippable lines for use in intercalateEffect from Control.Applicative.Permutation seem ugly. Second, the special structure of your input -- the fact that the lines can be recognized by the identifier rather than only general component parsers -- means that we can write a more efficient solution than a usual permutation parser, one that looks up identifiers in a map instead of trying a list of parsers in sequence.
Below is a possible solution. On the one hand, it's using a sledgehammer to kill a fly. In your simple situation, writing an ad hoc parser to read in the identifiers and their RHSs, check for required identifiers and duplicates, and then invoke identifier-specific parsers for the RHSs, seems more straightforward. On the other hand, maybe there are more complicated scenarios where the solution below would be justified, and I think it's conceivable it might be useful to others.
Anyway, here's the idea. First, some preliminaries:
{-# OPTIONS_GHC -Wall #-}
module ParseLines where
import Control.Applicative
import Control.Monad
import Data.List (intercalate)
import Text.Parsec (unexpected, eof, parseTest)
import Text.Parsec.Char (char, letter, alphaNum, noneOf, newline, digit)
import Text.Parsec.String (Parser)
import qualified Data.Map.Lazy as Map
import qualified Data.Set as Set
Let's say we have a data type representing the final result of the parse:
data Video = Video
{ width :: Int
, height :: Int
} deriving (Show)
We're going to construct a Permutation a parser. The type a is what we're going to eventually return (and in this case, it's always Video). This Permutation will actually be a Map from "known" identifiers like ID_VIDEO_WIDTH to a special kind of parser that will parse the right-hand side for the given identifier (e.g., an integer like 700) and then return -- not the parsed integer -- but a continuation Permutation a that parses the remaining data to construct a Video, with the parsed integer (e.g., 700) "baked in" to the continuation. The continuation will have a map that recognizes the "remaining" values, and we'll also keep track of known identifiers we've already read to flag duplicates.
We'll use the following type:
type Identifier = String
data Permutation a = Permutation
-- "seen" identifiers for flagging duplicates
(Set.Set Identifier)
(Either
-- if there are more values to read, map identifier to a parser
-- that parses RHS and returns continuation for parsing the rest
(Map.Map Identifier (Parser (Permutation a)))
-- or we're ready for an eof and can return the final value
a)
"Running" such a parser involves converting it to a plain Parser, and this is where we implement the logic for identifying recognized lines, flagging duplicates, and skipping unrecognized identifiers. First, here's a parser for identifiers. If you wanted to be more lenient, you could use many1 (noneOf "\n=") or something.
ident :: Parser String
ident = (:) <$> letter' <*> many alphaNum'
where letter' = letter <|> underscore
alphaNum' = alphaNum <|> underscore
underscore = char '_'
and here's a parser for skipping the rest of a line when we see an unrecognized identifier:
skipLine :: Parser ()
skipLine = void $ many (noneOf "\n") >> newline
Finally, here's how we run the Permutation parser:
runPermutation :: Permutation a -> Parser a
runPermutation p#(Permutation seen e)
= -- if end of file, return the final answer (or error)
eof *>
case e of
Left m -> fail $
"eof before " ++ intercalate ", " (Map.keys m)
Right a -> return a
<|>
-- otherwise, parse the identifier
do k <- ident <* char '='
-- is it one we're waiting for?
case either (Map.lookup k) (const Nothing) e of
-- no, it's not, so check for duplicates and skip
Nothing -> if Set.member k seen
then unexpected ("duplicate " ++ k)
else skipLine *> runPermutation p
-- yes, it is
Just prhs -> do
-- parse the RHS to get a continuation Permutation
-- and run it to parse rest of parameters
(prhs <* newline) >>= runPermutation
To see how this is supposed to work, here's how we would directly construct a Permutation to parse a Video. It's long, but not that complicated:
perm2 :: Permutation Video
perm2 = Permutation
-- nothing's been seen yet
Set.empty
-- parse width or height
$ Left (Map.fromList
[ ("ID_VIDEO_WIDTH", do
-- parse the width
w <- int
-- return a continuation permutation
return $ Permutation
-- we've seen width
(Set.fromList ["ID_VIDEO_WIDTH"])
-- parse height
$ Left (Map.fromList
[ ("ID_VIDEO_HEIGHT", do
-- parse the height
h <- int
-- return a continuation permutation
return $ Permutation
-- we've seen them all
(Set.fromList ["ID_VIDEO_WIDTH", "ID_VIDEO_HEIGHT"])
-- have all parameters, so eof returns the video
$ Right (Video w h))
]))
-- similarly for other permutation:
, ("ID_VIDEO_HEIGHT", do
h <- int
return $ Permutation
(Set.fromList ["ID_VIDEO_HEIGHT"])
$ Left (Map.fromList
[ ("ID_VIDEO_WIDTH", do
w <- int
return $ Permutation
(Set.fromList ["ID_VIDEO_WIDTH", "ID_VIDEO_HEIGHT"])
$ Right (Video w h))
]))
])
int :: Parser Int
int = read <$> some digit
You can test it like so:
testdata1 :: String
testdata1 = unlines
[ "ID_SID_0_LANG=eng"
, "ID_VIDEO_FORMAT=H264"
, "ID_VIDEO_HEIGHT=574"
, "ID_START_TIME=0.00"
, "ID_SUBTITLE_ID=0"
, "ID_VIDEO_ID=0"
, "ID_VIDEO_FPS=25.000"
, "ID_VIDEO_WIDTH=700"
]
test1 :: IO ()
test1 = parseTest (runPermutation perm2) testdata1
You should be able to verify that it provides appropriate errors for missing keys, duplicate entries for known keys, and accepts keys in any order.
Finally, we obviously don't want to construct permutation parsers like perm2 manually, so we take a page from the Text.Parsec.Perm module and introduce the following syntax:
video :: Parser Video
video = runPermutation (Video <$$> ("ID_VIDEO_WIDTH", int) <||> ("ID_VIDEO_HEIGHT", int))
and define operators to construct the necessary Permutation objects. These definitions are a little tricky, but they follow pretty directly from the definition of Permutation.
(<$$>) :: (a -> b) -> (Identifier, Parser a) -> Permutation b
f <$$> xq = Permutation Set.empty (Right f) <||> xq
infixl 2 <$$>
(<||>) :: Permutation (a -> b) -> (Identifier, Parser a) -> Permutation b
p#(Permutation seen e) <||> (x, q)
= Permutation seen (Left (Map.insert x q' m'))
where
q' = (\a -> addQ x a p) <$> q
m' = case e of Right _ -> Map.empty
Left m -> Map.map (fmap (<||> (x, q))) m
infixl 1 <||>
addQ :: Identifier -> a -> Permutation (a -> b) -> Permutation b
addQ x a (Permutation seen e)
= Permutation (Set.insert x seen) $ case e of
Right f -> Right (f a)
Left m -> Left (Map.map (fmap (addQ x a)) m)
and the final test:
test :: IO ()
test = parseTest video testdata1
giving:
> test
Video {width = 700, height = 574}
>
Here's the final code, slightly rearranged:
{-# OPTIONS_GHC -Wall #-}
module ParseLines where
import Control.Applicative
import Control.Monad
import Data.List (intercalate)
import Text.Parsec (unexpected, eof, parseTest)
import Text.Parsec.Char (char, letter, alphaNum, noneOf, newline, digit)
import Text.Parsec.String (Parser)
import qualified Data.Map.Lazy as Map
import qualified Data.Set as Set
-- * Permutation parser for identifier settings
-- | General permutation parser for a type #a#.
data Permutation a = Permutation
-- | "Seen" identifiers for flagging duplicates
(Set.Set Identifier)
-- | Either map of continuation parsers for more identifiers or a
-- final value once we see eof.
(Either (Map.Map Identifier (Parser (Permutation a))) a)
-- | Create a one-identifier 'Permutation' from a 'Parser'.
(<$$>) :: (a -> b) -> (Identifier, Parser a) -> Permutation b
f <$$> xq = Permutation Set.empty (Right f) <||> xq
infixl 2 <$$>
-- | Add a 'Parser' to a 'Permutation'.
(<||>) :: Permutation (a -> b) -> (Identifier, Parser a) -> Permutation b
p#(Permutation seen e) <||> (x, q)
= Permutation seen (Left (Map.insert x q' m'))
where
q' = (\a -> addQ x a p) <$> q
m' = case e of Right _ -> Map.empty
Left m -> Map.map (fmap (<||> (x, q))) m
infixl 1 <||>
-- | Helper to add a parsed component to a 'Permutation'.
addQ :: Identifier -> a -> Permutation (a -> b) -> Permutation b
addQ x a (Permutation seen e)
= Permutation (Set.insert x seen) $ case e of
Right f -> Right (f a)
Left m -> Left (Map.map (fmap (addQ x a)) m)
-- | Convert a 'Permutation' to a 'Parser' that detects duplicates
-- and skips unknown identifiers.
runPermutation :: Permutation a -> Parser a
runPermutation p#(Permutation seen e)
= -- if end of file, return the final answer (or error)
eof *>
case e of
Left m -> fail $
"eof before " ++ intercalate ", " (Map.keys m)
Right a -> return a
<|>
-- otherwise, parse the identifier
do k <- ident <* char '='
-- is it one we're waiting for?
case either (Map.lookup k) (const Nothing) e of
-- no, it's not, so check for duplicates and skip
Nothing -> if Set.member k seen
then unexpected ("duplicate " ++ k)
else skipLine *> runPermutation p
-- yes, it is
Just prhs -> do
-- parse the RHS to get a continuation Permutation
-- and run it to parse rest of parameters
(prhs <* newline) >>= runPermutation
-- | Left-hand side of a setting.
type Identifier = String
-- | Parse an 'Identifier'.
ident :: Parser Identifier
ident = (:) <$> letter' <*> many alphaNum'
where letter' = letter <|> underscore
alphaNum' = alphaNum <|> underscore
underscore = char '_'
-- | Skip (rest of) a line.
skipLine :: Parser ()
skipLine = void $ many (noneOf "\n") >> newline
-- * Parsing video information
-- | Our video data.
data Video = Video
{ width :: Int
, height :: Int
} deriving (Show)
-- | Parsing integers (RHS of width and height settings)
int :: Parser Int
int = read <$> some digit
-- | Some test data
testdata1 :: String
testdata1 = unlines
[ "ID_SID_0_LANG=eng"
, "ID_VIDEO_FORMAT=H264"
, "ID_VIDEO_HEIGHT=574"
, "ID_START_TIME=0.00"
, "ID_SUBTITLE_ID=0"
, "ID_VIDEO_ID=0"
, "ID_VIDEO_FPS=25.000"
, "ID_VIDEO_WIDTH=700"
]
-- | `Video` parser based on `Permutation`.
video :: Parser Video
video = runPermutation (Video <$$> ("ID_VIDEO_WIDTH", int) <||> ("ID_VIDEO_HEIGHT", int))
-- | The final test.
test :: IO ()
test = parseTest video testdata1
Indeed a simple solution would be to parse the file into Map ByteString ByteString, checking for duplicates while parsing, and then build the target result from that, checking that all required fields are present.
parseMap :: Parsec (Map ByteString ByteString)
-- ...
parseValues :: Map ByteString ByteString -> Parsec MyDataStructure
-- ...
Function parseValues can use Parsec again to parse the fields (perhaps using runP on each one) and to report errors or missing fields.
The disadvantage of this solution that parsing is done on two levels (once to get ByteStrings and the second time to parse them). And that this way we can't report correctly the position of errors found in parseValues. However, Parsec allows to get and set the current position in a file, so it might be feasible to include them in the map, and then use them when parsing the individual strings:
parseMap :: Parsec (Map ByteString (SourcePos, ByteString))
Using Parsec directly to parse the full result might be possible, but I'm afraid it'd be tricky to accomplish to allow arbitrary order and at the same time different output types of the fields.
If you don't mind a slight performance loss, write one parser for the width-line and one for the length-line and do something like this:
let ls = lines input in
case ([x | Right x <- parseWidth ls], [x | Right x <- parseLength ls]) of
([w],[l]) -> ...
_ -> parserError ...
It's easy to add separate error cases for repetated/missing values without repeating anything.
I'm looking for a parser that would try to use the first parser, and return Left a if it succeeds, or if it fails try the second parser and return Right b. In other words, something with the signature:
Parser a -> Parser b -> Parser (Either a b)
Where, e.g., type Parser a = P.Parsec String () a
It's not particularly hard to implement it on my own:
parseEither pa pb = (Left <$> pa) <|> (Right <$> pb)
But it seems to be such a useful and trivial construct that I was wondering if anything similar already exists in the Parsec library.
Is it possible to infer the type from many1?
MWE
module Main where
import System.Environment (getArgs)
import Text.ParserCombinators.Parsec
import Data.Either (rights)
type Vertex vertexWeight = (String, vertexWeight)
parseVertex :: Parser (Vertex a)
parseVertex = do
name <- many1 (noneOf "/")
char '/'
weight <- many1 (noneOf "\n")
return $ (name, weight)
main :: IO ()
main = do
putStrLn $ rights $ [parse parseVertex "test" "a/2"]
In the above example, I'd like for the weight parameter to get outputted as an Int, but this does not type-check.
Would it be wiser to represent a vertex as (String, String) and define parsers for the weight?
The type Parser (Vertex a) is shorthand for forall a. Parser (Vertex a), i.e. its type states that for any choice of a, it can have type Parser (Vertex a). This is clearly not what you want: you want to say that parseVertex will always have type Parser (Vertex a) for some choice of a, but this choice is to be made by parseVertex, not at its call site.
What you should do, is use a type T such that Parser (Vertex T) covers all possible return values of parseVertex. For example, if you use Parser (Vertex (Either Int String)), then parseVertex can choose based on the parse results so far if it will return something of the form (s, Left x), or (s, Right t), where s :: String, x :: Int and t :: String.
Of course, that also means that consumers of parseVector now have to be able to handle both cases.
I'm learning Haskell and as an exercise I'm trying to convert write the read_from function following code to Haskell. Taken from Peter Norvig's Scheme interpreter.
Is there a straightforward way do this?
def read(s):
"Read a Scheme expression from a string."
return read_from(tokenize(s))
parse = read
def tokenize(s):
"Convert a string into a list of tokens."
return s.replace('(',' ( ').replace(')',' ) ').split()
def read_from(tokens):
"Read an expression from a sequence of tokens."
if len(tokens) == 0:
raise SyntaxError('unexpected EOF while reading')
token = tokens.pop(0)
if '(' == token:
L = []
while tokens[0] != ')':
L.append(read_from(tokens))
tokens.pop(0) # pop off ')'
return L
elif ')' == token:
raise SyntaxError('unexpected )')
else:
return atom(token)
def atom(token):
"Numbers become numbers; every other token is a symbol."
try: return int(token)
except ValueError:
try: return float(token)
except ValueError:
return Symbol(token)
There is a straightforward way to "transliterate" Python into Haskell. This can be done by clever usage of monad transformers, which sounds scary, but it's really not. You see, due to purity, in Haskell when you want to use effects such as mutable state (e.g. the append and pop operations are performing mutation) or exceptions, you have to make it a little more explicit. Let's start at the top.
parse :: String -> SchemeExpr
parse s = readFrom (tokenize s)
The Python docstring said "Read a Scheme expression from a string", so I just took the liberty of encoding this as the type signature (String -> SchemeExpr). That docstring becomes obsolete because the type conveys the same information. Now... what is a SchemeExpr? According to your code, a scheme expression can be an int, float, symbol, or list of scheme expressions. Let's create a data type that represents these options.
data SchemeExpr
= SInt Int
| SFloat Float
| SSymbol String
| SList [SchemeExpr]
deriving (Eq, Show)
In order to tell Haskell that the Int we are dealing with should be treated as a SchemeExpr, we need to tag it with SInt. Likewise with the other possibilities. Let's move on to tokenize.
tokenize :: String -> [Token]
Again, the docstring turns into a type signature: turn a String into a list of Tokens. Well, what's a Token? If you look at the code, you'll notice that the left and right paren characters are apparently special tokens, which signal particular behaviors. Anything else is... unspecial. While we could create a data type to more clearly distinguish parens from other tokens, let's just use Strings, to stick a little closer to the original Python code.
type Token = String
Now let's try writing tokenize. First, let's write a quick little operator for making function chaining look a bit more like Python. In Haskell, you can define your own operators.
(|>) :: a -> (a -> b) -> b
x |> f = f x
tokenize s = s |> replace "(" " ( "
|> replace ")" " ) "
|> words
words is Haskell's version of split. However, Haskell has no pre-cooked version of replace that I know of. Here's one that should do the trick:
-- add imports to top of file
import Data.List.Split (splitOn)
import Data.List (intercalate)
replace :: String -> String -> String -> String
replace old new s = s |> splitOn old
|> intercalate new
If you read the docs for splitOn and intercalate, this simple algorithm should make perfect sense. Haskellers would typically write this as replace old new = intercalate new . splitOn old, but I used |> here for easier Python audience understanding.
Note that replace takes three arguments, but above I only invoked it with two. In Haskell you can partially apply any function, which is pretty neat. |> works sort of like the unix pipe, if you couldn't tell, except with more type safety.
Still with me? Let's skip over to atom. That nested logic is a bit ugly, so let's try a slightly different approach to clean it up. We'll use the Either type for a much nicer presentation.
atom :: Token -> SchemeExpr
atom s = Left s |> tryReadInto SInt
|> tryReadInto SFloat
|> orElse (SSymbol s)
Haskell doesn't have the automagical coersion functions int and float, so instead we will build tryReadInto. Here's how it works: we're going to thread Either values around. An Either value is either a Left or a Right. Conventionally, Left is used to signal error or failure, while Right signals success or completion. In Haskell, to simulate the Python-esque function call chaining, you just place the "self" argument as the last one.
tryReadInto :: Read a => (a -> b) -> Either String b -> Either String b
tryReadInto f (Right x) = Right x
tryReadInto f (Left s) = case readMay s of
Just x -> Right (f x)
Nothing -> Left s
orElse :: a -> Either err a -> a
orElse a (Left _) = a
orElse _ (Right a) = a
tryReadInto relies on type inference in order to determine which type it is trying to parse the string into. If the parse fails, it simply reproduces the same string in the Left position. If it succeeds, then it performs whatever function is desired and places the result in the Right position. orElse allows us to eliminate the Either by supplying a value in case the former computations failed. Can you see how Either acts as a replacement for exceptions here? Since the ValueExceptions in the Python code are always caught inside the function itself, we know that atom will never raise an exception. Similarly, in the Haskell code, even though we used Either on the inside of the function, the interface that we expose is pure: Token -> SchemeExpr, no outwardly-visible side effects.
OK, let's move on to read_from. First, ask yourself the question: what side effects does this function have? It mutates its argument tokens via pop, and it has internal mutation on the list named L. It also raises the SyntaxError exception. At this point, most Haskellers will be throwing up their hands saying "oh noes! side effects! gross!" But the truth is that Haskellers use side effects all the time as well. We just call them "monads" in order to scare people away and avoid success at all costs. Mutation can be accomplished with the State monad, and exceptions with the Either monad (surprise!). We will want to use both at the same time, so we'll in fact use "monad transformers", which I'll explain in a bit. It's not that scary, once you learn to see past the cruft.
First, a few utilities. These are just some simple plumbing operations. raise will let us "raise exceptions" as in Python, and whileM will let us write a while loop as in Python. For the latter, we simply have to make it explicit in what order the effects should happen: first perform the effect to compute the condition, then if it's True, perform the effects of the body and loop again.
import Control.Monad.Trans.State
import Control.Monad.Trans.Class (lift)
raise = lift . Left
whileM :: Monad m => m Bool -> m () -> m ()
whileM mb m = do
b <- mb
if b
then m >> whileM mb m
else return ()
We again want to expose a pure interface. However, there is a chance that there will be a SyntaxError, so we will indicate in the type signature that the result will be either a SchemeExpr or a SyntaxError. This is reminiscent of how in Java you can annotate which exceptions a method will raise. Note that the type signature of parse has to change as well, since it might raise the SyntaxError.
data SyntaxError = SyntaxError String
deriving (Show)
parse :: String -> Either SyntaxError SchemeExpr
readFrom :: [Token] -> Either SyntaxError SchemeExpr
readFrom = evalStateT readFrom'
We are going to perform a stateful computation on the token list that is passed in. Unlike the Python, however, we are not going to be rude to the caller and mutate the very list passed to us. Instead, we will establish our own state space and initialize it to the token list we are given. We will use do notation, which provides syntactic sugar to make it look like we're programming imperatively. The StateT monad transformer gives us the get, put, and modify state operations.
readFrom' :: StateT [Token] (Either SyntaxError) SchemeExpr
readFrom' = do
tokens <- get
case tokens of
[] -> raise (SyntaxError "unexpected EOF while reading")
(token:tokens') -> do
put tokens' -- here we overwrite the state with the "rest" of the tokens
case token of
"(" -> (SList . reverse) `fmap` execStateT readWithList []
")" -> raise (SyntaxError "unexpected close paren")
_ -> return (atom token)
I've broken out the readWithList portion into a separate chunk of code,
because I want you to see the type signature. This portion of code introduces
a new scope, so we simply layer another StateT on top of the monad stack
that we had before. Now, the get, put, and modify operations refer
to the thing called L in the Python code. If we want to perform these operations
on the tokens, then we can simply preface the operation with lift in order
to strip away one layer of the monad stack.
readWithList :: StateT [SchemeExpr] (StateT [Token] (Either SyntaxError)) ()
readWithList = do
whileM ((\toks -> toks !! 0 /= ")") `fmap` lift get) $ do
innerExpr <- lift readFrom'
modify (innerExpr:)
lift $ modify (drop 1) -- this seems to be missing from the Python
In Haskell, appending to the end of a list is inefficient, so I instead prepended, and then reversed the list afterwards. If you are interested in performance, then there are better list-like data structures you can use.
Here is the complete file: http://hpaste.org/77852
So if you're new to Haskell, then this probably looks terrifying. My advice is to just give it some time. The Monad abstraction is not nearly as scary as people make it out to be. You just have to learn that what most languages have baked in (mutation, exceptions, etc), Haskell instead provides via libraries. In Haskell, you must explicitly specify which effects you want, and controlling those effects is a little less convenient. In exchange, however, Haskell provides more safety so you don't accidentally mix up the wrong effects, and more power, because you are in complete control of how to combine and refactor effects.
In Haskell, you wouldn't use an algorithm that mutates the data it operates on. So no, there is no straightforward way to do that. However, the code can be rewritten using recursion to avoid updating variables. Solution below uses the MissingH package because Haskell annoyingly doesn't have a replace function that works on strings.
import Data.String.Utils (replace)
import Data.Tree
import System.Environment (getArgs)
data Atom = Sym String | NInt Int | NDouble Double | Para deriving (Eq, Show)
type ParserStack = (Tree Atom, Tree Atom)
tokenize = words . replace "(" " ( " . replace ")" " ) "
atom :: String -> Atom
atom tok =
case reads tok :: [(Int, String)] of
[(int, _)] -> NInt int
_ -> case reads tok :: [(Double, String)] of
[(dbl, _)] -> NDouble dbl
_ -> Sym tok
empty = Node $ Sym "dummy"
para = Node Para
parseToken (Node _ stack, Node _ out) "(" =
(empty $ stack ++ [empty out], empty [])
parseToken (Node _ stack, Node _ out) ")" =
(empty $ init stack, empty $ (subForest (last stack)) ++ [para out])
parseToken (stack, Node _ out) tok =
(stack, empty $ out ++ [Node (atom tok) []])
main = do
(file:_) <- getArgs
contents <- readFile file
let tokens = tokenize contents
parseStack = foldl parseToken (empty [], empty []) tokens
schemeTree = head $ subForest $ snd parseStack
putStrLn $ drawTree $ fmap show schemeTree
foldl is the haskeller's basic structured recursion tool and it serves the same purpose as your while loop and recursive call to read_from. I think the code can be improved a lot, but I'm not so used to Haskell. Below is an almost straight transliteration of the above to Python:
from pprint import pprint
from sys import argv
def atom(tok):
try:
return 'int', int(tok)
except ValueError:
try:
return 'float', float(tok)
except ValueError:
return 'sym', tok
def tokenize(s):
return s.replace('(',' ( ').replace(')',' ) ').split()
def handle_tok((stack, out), tok):
if tok == '(':
return stack + [out], []
if tok == ')':
return stack[:-1], stack[-1] + [out]
return stack, out + [atom(tok)]
if __name__ == '__main__':
tokens = tokenize(open(argv[1]).read())
tree = reduce(handle_tok, tokens, ([], []))[1][0]
pprint(tree)