Code generation for compiler in Haskell - haskell

I am writing a compiler for a small imperative language. The target language is Java bytecode, and the compiler is implemented in Haskell.
I've written a frontend for the language - i.e I have a lexer, parser and typechecker. I'm having trouble figuring out how to do code generation.
I keep a data structure representing the stack of local variables. I can query this structure with the name of a local variable and get its position in the stack. This data structure is passed around as I walk the syntax tree, and variables are popped and pushed as I enter and exit new scopes.
What I having trouble figuring out is how to emit the bytecode. Emitting strings at terminals and concatenating them at higher levels seems like a poor solution, both clarity- and performance-wise.
tl;dr How do I emit bytecode while waling the syntax tree?

My first project in Haskell a few months back was to write a c compiler, and what resulted was a fairly naive approach to code generation, which I'll walk through here. Please do not take this as an example of good design for a code generator, but rather view it as a quick and dirty (and ultimately naive) way to get something that works fairly quickly with decent performance.
I began by defining an intermediate representation LIR (Lower Intermediate Representation) which closely corresponded to my instruction set (x86_64 in my case):
data LIRInst = LIRRegAssignInst LIRReg LIRExpr
| LIRRegOffAssignInst LIRReg LIRReg LIRSize LIROperand
| LIRStoreInst LIRMemAddr LIROperand
| LIRLoadInst LIRReg LIRMemAddr
| LIREnterInst LIRInt
| LIRJumpLabelInst LIRLabel
| LIRIfInst LIRRelExpr LIRLabel LIRLabel -- false, then true
| LIRCallInst LIRLabel LIRLabel -- method label, return label
| LIRCalloutInst String
| LIRRetInst [LIRLabel] String -- list of successors, and the name of the method returning from
| LIRLabelInst LIRLabel
deriving (Show, Eq, Typeable)
Next up came a monad that would handle interleaving state throughout the translation (I was blissfully unaware of our friend-the State Monad-at the time):
newtype LIRTranslator a = LIRTranslator
{ runLIR :: Namespace -> (a, Namespace) }
instance Monad LIRTranslator where
return a = LIRTranslator (\s -> (a, s))
m >>= f = LIRTranslator (\s ->
let (a, s') = runLIR m s
in runLIR (f a) s')
along with the state that would be 'threaded' through the various translation phases:
data Namespace = Namespace
{ temp :: Int -- id's for new temporaries
, labels :: Int -- id's for new labels
, scope :: [(LIRLabel, LIRLabel)] -- current program scope
, encMethod :: String -- current enclosing method
, blockindex :: [Int] -- index into the SymbolTree
, successorMap :: Map.Map String [LIRLabel]
, ivarStack :: [(LIRReg, [CFGInst])] -- stack of ivars (see motioned code)
}
For convenience, I also specified a series of translator monadic functions, for example:
-- |Increment our translator's label counter
incLabel :: LIRTranslator Int
incLabel = LIRTranslator (\ns#(Namespace{ labels = l }) -> (l, ns{ labels = (l+1) }))
I then proceeded to recursively pattern-match my AST, fragment-by-fragment, resulting in many functions of the form:
translateBlock :: SymbolTree -> ASTBlock -> LIRTranslator [LIRInst]
translateBlock st (DecafBlock _ [] _) = withBlock (return [])
translateBlock st block =
withBlock (do b <- getBlock
let st' = select b st
declarations <- mapM (translateVarDeclaration st') (blockVars block)
statements <- mapM (translateStm st') (blockStms block)
return (concat declarations ++ concat statements))
(for translating a block of the target language's code) or
-- | Given a SymbolTree, Translate a single DecafMethodStm into [LIRInst]
translateStm st (DecafMethodStm mc _) =
do (instructions, operand) <- translateMethodCall st mc
final <- motionCode instructions
return final
(for translating a method call) or
translateMethodPrologue :: SymbolTree -> DecafMethod -> LIRTranslator [LIRInst]
translateMethodPrologue st (DecafMethod _ ident args _ _) =
do let numRegVars = min (length args) 6
regvars = map genRegVar (zip [LRDI, LRSI, LRDX, LRCX, LR8, LR9] args)
stackvars <- mapM genStackVar (zip [1..] (drop numRegVars args))
return (regvars ++ stackvars)
where
genRegVar (reg, arg) =
LIRRegAssignInst (symVar arg st) (LIROperExpr $ LIRRegOperand reg)
genStackVar (index, arg) =
do let mem = LIRMemAddr LRBP Nothing ((index + 1) * 8) qword -- ^ [rbp] = old rbp; [rbp + 8] = ret address; [rbp + 16] = first stack param
return $ LIRLoadInst (symVar arg st) mem
for an example of actually generating some LIR code. Hopefully these three examples will give you a good starting point; ultimately, you'll want to go slowly, focusing on one fragment (or intermediate type) within your AST at a time.

If you haven't done this before, you can do it in small passes:
1) for every statement produce some byte code (with out properly addressed memory locations)
2) after that is done, if you have looping, gotos, etc, put in the real addresses (you know them
now that you have it all layed out)
3) replace the memory fetches/stores with the correct locations
4) dump it out to a JAR file
Note that this is very simplified and doesn't try to do any performance optimisation. It will give you a functional program which will execute. This also assumes you know the codes for the JVM (which is where I am presuming you are going to execute it.)
To start, just have a subset of the language which does sequential arithmetic statements. This will allow you to figure out how to map variable memory locations to statements via the parse tree. Next add some looping to get jumps to work. Similarly add conditionals. Finally, you can add the final parts of your language.

Related

The "Haskell way" to extract/cumulate results inside an predefined vistor pattern iterator

I'm getting started with Haskell (from many years of C and c++) and have decided to attempt a small database project. I'm using a predefined binder library to a C database library (Database.kyotocabint). I'm struggling to get my head round how to do anything with the iterator interfaces due to the separation of effects when using a pre-defined method.
The toy demo to iterate over the data base and print it out (which works fine) is
test7 = do
db <- openTree "testdatabase/mydb.kct" defaultLoggingOptions (Writer [] [])
let visitor = \k v -> putStr (show k) >> putStr ":" >> putStrLn (show v) >>
return (Left NoOperation)
iterate db visitor False
close db
Where iterate and visitor are provided by the library bindings and the relevant types are
iterate :: forall db. WithDB db => db -> VisitorFull -> Writable -> IO ()
visitor :: ByteString -> ByteString -> IO (Either VisitorAction b)
But I can't see to how extract information out from inside the iterator rather than process each one individually - for example collect all the keys beginning with 'a' in a list or even just count the number of entries.
Am I limited because iterate just has the type IO () and so I can't build in side effects and would have to rebuild this replacing the library versions? The state monad on paper seems to adress this but the visitor type doesn't seem to allow me to maintain the state over subsequent visitor calls.
What would be the Haskell way to solve this ?
Matthew
Edit - many thanks for the clear answer below which siad both 0 its not the Haskell way but also provided a solution - this answer led me to Mutable objects which I found a clear explanation of the options.
The kyotocabinet library unfortunately does not seem to support your operation. Beyond iterate, it should expose some similar operation which returns something more complex than IO (), say IO a or IO [a] while requiring a more complex visitor function.
Still, since we work inside IO, there is a workaround: we can exploit IORefs and collect results. I want to stress, though, that this is not idiomatic code one would write in Haskell, but something one is forced to use because if the limitation of this library.
Anyway, the code would look something like this (untested):
test7 = do
db <- openTree "testdatabase/mydb.kct" defaultLoggingOptions (Writer [] [])
w <- newIORef [] -- create mutable var, initialize to []
let visitor = \k v -> do
putStrLn (show k ++ ":" ++ show v)
modifyIORef w ((k,v):) -- prepend (k,v) to the list w
return (Left NoOperation)
iterate db visitor False
result <- readIORef w -- get the whole list
print result
close db
Since you come from C++, you might want to compare the code above to the following pseudo-C++:
std::vector<std::pair<int,int>> w;
db.iterate([&](int k, int v) {
std::cout << k << ", " << v << "\n";
w.push_back({k,v});
});
// here we can read w, even if db.iterate returns void
Again, this is not something I would consider idiomatic Haskell.

Generating a unique value in Haskell do-notation

To generate x86 assembly code, I have defined a custom type called X86:
data X86 a = X86 { code :: String, counter :: Integer, value :: (X86 a -> a) }
This type is used in do-notation like the following. This makes it easy to write templates for generating if-statements, for-loops, etc...
generateCode :: X86 ()
generateCode = do
label1 <- allocateUniqueLabel
label2 <- allocateUniqueLabel
jmp label1
label label1
jmp label2
label label2
Instructions are defined like this:
jmp :: String -> X86 ()
jmp l = X86 { code = "jmp " ++ l ++ ";\n", counter = 0, value = const () }
label :: String -> X86 ()
label l = X86 { code = l ++ ":\n", counter = 0, value = const () }
And the completed assembly file is printed like so:
printAsm :: X86 a -> String
printAsm X86{code=code} = code
main = do
putStrLn (printAsm generateCode)
I implemented the X86 monad in the following manner. Essentially, the sequence operator concatenates blocks of assembly code in order and ensures the counters are incremented.
instance Monad X86 where
x >> y = X86 { code = code x ++ code y, counter = counter x + counter y, value = value y }
x >>= f = x >> y
where y = f (value x x)
The problem is the labels are not incremented properly, so they are not unique! The following is the output:
jmp Label1;
Label1:
jmp Label1;
Label1:
I desire the output to have a unique value for each label:
jmp Label1;
Label1:
jmp Label2;
Label2:
To complete the example, here is the implementation of the allocatedUniqueLabel function:
allocateUniqueId :: X86 Integer
allocateUniqueId = X86 { code = "", counter = 1, value = counter }
allocateUniqueLabel :: X86 String
allocateUniqueLabel = do
id <- allocateUniqueId
return ("Label" ++ show id)
How can I fix my X86 monad so the labels are unique?
Here is what I've tried:
Incrementing a global counter. => Haskell does not safely allow global state outside the IO monad.
Using the State monad. => I have looked into a number of examples, but do not understand how to integrate them into my existing X86 monad.
Keep track of the counter outside of the monad. => I rather the counter is updated "behind the scenes"; otherwise, a lot of code templates which do not use labels will need to propagate the counter manually.
We can use mtl classes to describe X86 code as effectful programs. We want:
to generate code, this is a Writer effect;
to maintain a counter, this is a State effect.
We worry about instantiating these effects last, and in the description of the programs we use MonadWriter and MonadState constraints.
import Control.Monad.State -- mtl
import Control.Monad.Writer
Allocating a new identifier increments the counter, without generating any code. This only uses the State effect.
type Id = Integer
allocateUniqueLabel :: MonadState Id m => m String
allocateUniqueLabel = do
i <- get
put (i+1) -- increment
return ("Label" ++ show (i+1))
And of course, we have actions to generate code, that don't need to care about the current state. So they use the Writer effect.
jmp :: MonadWriter String m => String -> m ()
jmp l = tell ("jmp " ++ l ++ ";\n")
label :: MonadWriter String m => String -> m ()
label l = tell (l ++ ":\n")
The actual program looks the same as the original, but with more general types.
generateCode :: (MonadState Id m, MonadWriter String m) => m ()
generateCode = do
label1 <- allocateUniqueLabel
label2 <- allocateUniqueLabel
jmp label1
label label1
jmp label2
label label2
The effects are instantiated when we run this program, here using runWriterT/runWriter and runStateT/runState (the order doesn't matter much, these two effects commute).
type X86 = WriterT String (State Id)
runX86 :: X86 () -> String
runX86 gen = evalState (execWriterT gen) 1 -- start counting from 1
-- evalState and execWriterT are wrappers around `runStateT` and `runWriterT`:
-- - execWriterT: discards the result (of type ()), only keeping the generated code.
-- - evalState: discards the final state, only keeping the generated code,
-- and does some unwrapping after there are no effects to handle.
You probably want to use this monad stack:
type X86 a = StateT Integer (Writer String) a
Since you have a state and a writer, you could also consider using RWS (reader-writer-state all in one):
type X86 a = RWS () String Integer a
Let's pick the first one for fun. I'd first define a helper function to increment the counter (monads cannot lawfully increment a counter "automatically"):
instr :: X86 a -> X86 a
instr i = do
x <- i
modify (+1)
return x
Then you could define jmp as:
jmp :: String -> X86 ()
jmp l = instr $ do
lift (tell ("jmp " ++ l ++ ";\n"))
-- 'tell' is one of Writer's operations, and then we 'lift'
-- it into StateT
(The do there is superfluous, however I suspect there will be a pattern of starting instruction definitions with instr $ do)
I would not roll my own monad for this -- it can be instructive to do so, but I think you'll get more mileage using the standard libraries for this one.
As you probably now underestand from the other answers, the problem with your
approach was that even though you were using the counter, you were still
generating your labels locally. In particular
label1 <- allocateUniqueLabel
label label1
was equivalent to
X86 { code = "Label1:\n", counter = 1, value = const () }
We need to assemble the whole code first, generate the labels, and only
afterwards (in some sense) generate the actual code using the labels.
And this is what the other answers are suggesting by having the counter stored
in the State (or RWS) monad.
There is yet another issue that we can address: You want to be able to jump both
forwards and backwards. This is most likely why you have separate
allocateUniqueLabel and label functions. But this allows to set the same
label twice.
It is actually possible to use to do notation with "backwards" binding using
MonadFix,
which defines this monadic operation:
mfix :: (a -> m a) -> m a
Since both State and RWS have MonadFix instances, we can indeed write code
like this:
{-# LANGUAGE GeneralizedNewtypeDeriving, RecursiveDo #-}
module X86
( X86()
, runX86
, label
, jmp
) where
import Control.Monad.RWS
-- In production code it'll be much faster if we replace String with
-- ByteString.
newtype X86 a = X86 (RWS () String Int a)
deriving (Functor, Applicative, Monad, MonadFix)
runX86 :: X86 a -> String
runX86 (X86 k) = snd (execRWS k () 1)
newtype Label = Label { getLabel :: String }
label :: X86 Label
label = X86 $ do
counter <- get
let l = "Label" ++ show counter
tell (l ++ ":\n")
modify (+1)
return (Label l)
jmp :: Label -> X86 ()
jmp (Label l) = X86 . tell $ "jmp " ++ l ++ ";\n"
And use it like this:
example :: X86 ()
example = do
rec l1 <- label
jmp l2
l2 <- label
jmp l1
There are a few things to note:
We need to use the RecursiveDo extension to enable the rec keyword.
Keyword rec delimits a block of mutually recursive definitions. In our case
it could also start one line later (rec jmp l2). GHC then translates it into
using mfix internally. (Using the deprecated mdo keyword instead of rec
would make the code somewhat more natural.)
We wrap the internals in the X86 newtype. First it's always good to hide the
internal implementation, it allows easy refactorings later. Second, mfix
requires that the function passed to it a -> m a isn't strict in its
argument. The effect must not depend on the argument, otherwise mfix
diverges. This is condition is satisfied for our functions, but if the
internals are exposed, someone could define a contrived function like this:
-- | Reset the counter to the specified label.
evilReset :: Label -> X86 ()
evilReset = X86 . put . read . drop 5 . getLabel
Not only it breaks the uniqueness of labels, but also causes the following code
to hang:
diverge :: X86 ()
diverge = do
rec evilReset l2
l2 <- label
return ()
Another quite similar alternative would be to use the
Rand
monad and generate labels with the
Random
instance of
UUID.
Something like WriterT String Rand a, which also has a MonadFix instance.
(From purely academic perspective it might be possible to construct an arrow instead of a monad, that'd implement
ArrowLoop,
but disallow state modifications that depend on values, such as in evilReset. But encapsulation of X86 achieves the same goal, keeping the much friendlier do syntax.)

Could someone provide a machines implementation of the following plan?

I am playing around with the machines module by Edward Kmett, and I'm getting a little confused here and there. I thought the best way to ask a question is to provide a toy use case. Described below.
Machines one and two sit at two prongs of a Wye.
Machine one takes as param a list of ints, and pushes it down stream.
Machine two takes as param a list of chars, and pushes it down stream.
Machine three keeps an internal state, beginning at mempty, then mappends the elements it receives from either machines from the Wye base on some condition (so not both). Machine three then gets the current state and pushes it downtream.
Machine four prints the elements it receives to console.
So far I've got this:
y1 :: PlanT k [Int] m ()
y1 = yield
y2 :: PlanT k [Char] m ()
y2 = yield
But I'm not sure how to combine y1 and y2; or roll an arbitrary process that hides a state, as opposed to using one of the stock combinators exported by Process.hs.
Per suggestion, the link to the machines package:
http://hackage.haskell.org/package/machines
And a pdf giving a very high level description of what it does: https://dl.dropboxusercontent.com/u/4588997/Machines.pdf
I'm also a beginer with machines, here is my result:
import Control.Monad
import Data.Char (intToDigit)
import Data.Machine
import Data.Machine.Plan
import Data.Machine.Source
-- | Produces integers from a list.
m1 :: [Int] -> Source Int
m1 = source
-- | Produces characters from a list.
m2 :: [Char] -> Source Char
m2 = source
-- | Reads a number from its left input. Then reads this many
-- characters from its right input. Outputs the resulting string,
-- together with the number of strings produced so far.
m3 :: Tee Int Char (Int, String)
m3 = construct (loop 0)
where
-- `loop` keeps internal state - the count of strings
-- produced so far.
loop count = do
-- Read a number from L.
n <- awaits L
-- Read this many characters from L.
s <- replicateM n (awaits R)
let count' = count + 1
-- Output the result.
yield (count', s)
loop count'
main = print . run $ tee (m1 [2,3,4,5])
(m2 "Lorem ipsum dolor sit amet") m3
I haven't used a monoid in m3, I used plain numbers instead, but the idea is the same. I also used Tee instead of Wye, because my example needs deterministic input - it chooses if it reads from L or R. But using Wye for a similar purpose would be just the same.
Update: Surely it's possible to use State instead of Identity to keep track of the count. For example:
m3State :: TeeT (State Int) Int Char (Int, String)
m3State = repeatedly $ do
n <- awaits L
s <- replicateM n (awaits R)
lift (modify (+ 1)) -- increment the counter
count <- lift get -- get the counter to output it
yield (count, s)
main = print . flip evalState 0 . runT $ input m3State
I suspect that using repeatedly on a plan is slightly faster than having an explicit monadic loop, but I think in this small example the difference is negligible.
Or if we wanted to just count the number of strings and output it only at the end, we could use Writer (Sum Int) instead. Full code here.

How to translate this python to Haskell?

I'm learning Haskell and as an exercise I'm trying to convert write the read_from function following code to Haskell. Taken from Peter Norvig's Scheme interpreter.
Is there a straightforward way do this?
def read(s):
"Read a Scheme expression from a string."
return read_from(tokenize(s))
parse = read
def tokenize(s):
"Convert a string into a list of tokens."
return s.replace('(',' ( ').replace(')',' ) ').split()
def read_from(tokens):
"Read an expression from a sequence of tokens."
if len(tokens) == 0:
raise SyntaxError('unexpected EOF while reading')
token = tokens.pop(0)
if '(' == token:
L = []
while tokens[0] != ')':
L.append(read_from(tokens))
tokens.pop(0) # pop off ')'
return L
elif ')' == token:
raise SyntaxError('unexpected )')
else:
return atom(token)
def atom(token):
"Numbers become numbers; every other token is a symbol."
try: return int(token)
except ValueError:
try: return float(token)
except ValueError:
return Symbol(token)
There is a straightforward way to "transliterate" Python into Haskell. This can be done by clever usage of monad transformers, which sounds scary, but it's really not. You see, due to purity, in Haskell when you want to use effects such as mutable state (e.g. the append and pop operations are performing mutation) or exceptions, you have to make it a little more explicit. Let's start at the top.
parse :: String -> SchemeExpr
parse s = readFrom (tokenize s)
The Python docstring said "Read a Scheme expression from a string", so I just took the liberty of encoding this as the type signature (String -> SchemeExpr). That docstring becomes obsolete because the type conveys the same information. Now... what is a SchemeExpr? According to your code, a scheme expression can be an int, float, symbol, or list of scheme expressions. Let's create a data type that represents these options.
data SchemeExpr
= SInt Int
| SFloat Float
| SSymbol String
| SList [SchemeExpr]
deriving (Eq, Show)
In order to tell Haskell that the Int we are dealing with should be treated as a SchemeExpr, we need to tag it with SInt. Likewise with the other possibilities. Let's move on to tokenize.
tokenize :: String -> [Token]
Again, the docstring turns into a type signature: turn a String into a list of Tokens. Well, what's a Token? If you look at the code, you'll notice that the left and right paren characters are apparently special tokens, which signal particular behaviors. Anything else is... unspecial. While we could create a data type to more clearly distinguish parens from other tokens, let's just use Strings, to stick a little closer to the original Python code.
type Token = String
Now let's try writing tokenize. First, let's write a quick little operator for making function chaining look a bit more like Python. In Haskell, you can define your own operators.
(|>) :: a -> (a -> b) -> b
x |> f = f x
tokenize s = s |> replace "(" " ( "
|> replace ")" " ) "
|> words
words is Haskell's version of split. However, Haskell has no pre-cooked version of replace that I know of. Here's one that should do the trick:
-- add imports to top of file
import Data.List.Split (splitOn)
import Data.List (intercalate)
replace :: String -> String -> String -> String
replace old new s = s |> splitOn old
|> intercalate new
If you read the docs for splitOn and intercalate, this simple algorithm should make perfect sense. Haskellers would typically write this as replace old new = intercalate new . splitOn old, but I used |> here for easier Python audience understanding.
Note that replace takes three arguments, but above I only invoked it with two. In Haskell you can partially apply any function, which is pretty neat. |> works sort of like the unix pipe, if you couldn't tell, except with more type safety.
Still with me? Let's skip over to atom. That nested logic is a bit ugly, so let's try a slightly different approach to clean it up. We'll use the Either type for a much nicer presentation.
atom :: Token -> SchemeExpr
atom s = Left s |> tryReadInto SInt
|> tryReadInto SFloat
|> orElse (SSymbol s)
Haskell doesn't have the automagical coersion functions int and float, so instead we will build tryReadInto. Here's how it works: we're going to thread Either values around. An Either value is either a Left or a Right. Conventionally, Left is used to signal error or failure, while Right signals success or completion. In Haskell, to simulate the Python-esque function call chaining, you just place the "self" argument as the last one.
tryReadInto :: Read a => (a -> b) -> Either String b -> Either String b
tryReadInto f (Right x) = Right x
tryReadInto f (Left s) = case readMay s of
Just x -> Right (f x)
Nothing -> Left s
orElse :: a -> Either err a -> a
orElse a (Left _) = a
orElse _ (Right a) = a
tryReadInto relies on type inference in order to determine which type it is trying to parse the string into. If the parse fails, it simply reproduces the same string in the Left position. If it succeeds, then it performs whatever function is desired and places the result in the Right position. orElse allows us to eliminate the Either by supplying a value in case the former computations failed. Can you see how Either acts as a replacement for exceptions here? Since the ValueExceptions in the Python code are always caught inside the function itself, we know that atom will never raise an exception. Similarly, in the Haskell code, even though we used Either on the inside of the function, the interface that we expose is pure: Token -> SchemeExpr, no outwardly-visible side effects.
OK, let's move on to read_from. First, ask yourself the question: what side effects does this function have? It mutates its argument tokens via pop, and it has internal mutation on the list named L. It also raises the SyntaxError exception. At this point, most Haskellers will be throwing up their hands saying "oh noes! side effects! gross!" But the truth is that Haskellers use side effects all the time as well. We just call them "monads" in order to scare people away and avoid success at all costs. Mutation can be accomplished with the State monad, and exceptions with the Either monad (surprise!). We will want to use both at the same time, so we'll in fact use "monad transformers", which I'll explain in a bit. It's not that scary, once you learn to see past the cruft.
First, a few utilities. These are just some simple plumbing operations. raise will let us "raise exceptions" as in Python, and whileM will let us write a while loop as in Python. For the latter, we simply have to make it explicit in what order the effects should happen: first perform the effect to compute the condition, then if it's True, perform the effects of the body and loop again.
import Control.Monad.Trans.State
import Control.Monad.Trans.Class (lift)
raise = lift . Left
whileM :: Monad m => m Bool -> m () -> m ()
whileM mb m = do
b <- mb
if b
then m >> whileM mb m
else return ()
We again want to expose a pure interface. However, there is a chance that there will be a SyntaxError, so we will indicate in the type signature that the result will be either a SchemeExpr or a SyntaxError. This is reminiscent of how in Java you can annotate which exceptions a method will raise. Note that the type signature of parse has to change as well, since it might raise the SyntaxError.
data SyntaxError = SyntaxError String
deriving (Show)
parse :: String -> Either SyntaxError SchemeExpr
readFrom :: [Token] -> Either SyntaxError SchemeExpr
readFrom = evalStateT readFrom'
We are going to perform a stateful computation on the token list that is passed in. Unlike the Python, however, we are not going to be rude to the caller and mutate the very list passed to us. Instead, we will establish our own state space and initialize it to the token list we are given. We will use do notation, which provides syntactic sugar to make it look like we're programming imperatively. The StateT monad transformer gives us the get, put, and modify state operations.
readFrom' :: StateT [Token] (Either SyntaxError) SchemeExpr
readFrom' = do
tokens <- get
case tokens of
[] -> raise (SyntaxError "unexpected EOF while reading")
(token:tokens') -> do
put tokens' -- here we overwrite the state with the "rest" of the tokens
case token of
"(" -> (SList . reverse) `fmap` execStateT readWithList []
")" -> raise (SyntaxError "unexpected close paren")
_ -> return (atom token)
I've broken out the readWithList portion into a separate chunk of code,
because I want you to see the type signature. This portion of code introduces
a new scope, so we simply layer another StateT on top of the monad stack
that we had before. Now, the get, put, and modify operations refer
to the thing called L in the Python code. If we want to perform these operations
on the tokens, then we can simply preface the operation with lift in order
to strip away one layer of the monad stack.
readWithList :: StateT [SchemeExpr] (StateT [Token] (Either SyntaxError)) ()
readWithList = do
whileM ((\toks -> toks !! 0 /= ")") `fmap` lift get) $ do
innerExpr <- lift readFrom'
modify (innerExpr:)
lift $ modify (drop 1) -- this seems to be missing from the Python
In Haskell, appending to the end of a list is inefficient, so I instead prepended, and then reversed the list afterwards. If you are interested in performance, then there are better list-like data structures you can use.
Here is the complete file: http://hpaste.org/77852
So if you're new to Haskell, then this probably looks terrifying. My advice is to just give it some time. The Monad abstraction is not nearly as scary as people make it out to be. You just have to learn that what most languages have baked in (mutation, exceptions, etc), Haskell instead provides via libraries. In Haskell, you must explicitly specify which effects you want, and controlling those effects is a little less convenient. In exchange, however, Haskell provides more safety so you don't accidentally mix up the wrong effects, and more power, because you are in complete control of how to combine and refactor effects.
In Haskell, you wouldn't use an algorithm that mutates the data it operates on. So no, there is no straightforward way to do that. However, the code can be rewritten using recursion to avoid updating variables. Solution below uses the MissingH package because Haskell annoyingly doesn't have a replace function that works on strings.
import Data.String.Utils (replace)
import Data.Tree
import System.Environment (getArgs)
data Atom = Sym String | NInt Int | NDouble Double | Para deriving (Eq, Show)
type ParserStack = (Tree Atom, Tree Atom)
tokenize = words . replace "(" " ( " . replace ")" " ) "
atom :: String -> Atom
atom tok =
case reads tok :: [(Int, String)] of
[(int, _)] -> NInt int
_ -> case reads tok :: [(Double, String)] of
[(dbl, _)] -> NDouble dbl
_ -> Sym tok
empty = Node $ Sym "dummy"
para = Node Para
parseToken (Node _ stack, Node _ out) "(" =
(empty $ stack ++ [empty out], empty [])
parseToken (Node _ stack, Node _ out) ")" =
(empty $ init stack, empty $ (subForest (last stack)) ++ [para out])
parseToken (stack, Node _ out) tok =
(stack, empty $ out ++ [Node (atom tok) []])
main = do
(file:_) <- getArgs
contents <- readFile file
let tokens = tokenize contents
parseStack = foldl parseToken (empty [], empty []) tokens
schemeTree = head $ subForest $ snd parseStack
putStrLn $ drawTree $ fmap show schemeTree
foldl is the haskeller's basic structured recursion tool and it serves the same purpose as your while loop and recursive call to read_from. I think the code can be improved a lot, but I'm not so used to Haskell. Below is an almost straight transliteration of the above to Python:
from pprint import pprint
from sys import argv
def atom(tok):
try:
return 'int', int(tok)
except ValueError:
try:
return 'float', float(tok)
except ValueError:
return 'sym', tok
def tokenize(s):
return s.replace('(',' ( ').replace(')',' ) ').split()
def handle_tok((stack, out), tok):
if tok == '(':
return stack + [out], []
if tok == ')':
return stack[:-1], stack[-1] + [out]
return stack, out + [atom(tok)]
if __name__ == '__main__':
tokens = tokenize(open(argv[1]).read())
tree = reduce(handle_tok, tokens, ([], []))[1][0]
pprint(tree)

How do I avoid memory problems when writing to file using the Writer monad?

I am building some moderately large DIMACS files, however with the method used below the memory usage is rather large compared to the size of the files generated, and on some of the larger files I need to generate I run in to out of memory problems.
import Control.Monad.State.Strict
import Control.Monad.Writer.Strict
import qualified Data.ByteString.Lazy.Char8 as B
import Control.Monad
import qualified Text.Show.ByteString as BS
import Data.List
main = printDIMACS "test.cnf" test
test = do
xs <- freshs 100000
forM_ (zip xs (tail xs))
(\(x,y) -> addAll [[negate x, negate y],[x,y]])
type Var = Int
type Clause = [Var]
data DIMACSS = DS{
nextFresh :: Int,
numClauses :: Int
} deriving (Show)
type DIMACSM a = StateT DIMACSS (Writer B.ByteString) a
freshs :: Int -> DIMACSM [Var]
freshs i = do
next <- gets nextFresh
let toRet = [next..next+i-1]
modify (\s -> s{nextFresh = next+i})
return toRet
fresh :: DIMACSM Int
fresh = do
i <- gets nextFresh
modify (\s -> s{nextFresh = i+1})
return i
addAll :: [Clause] -> DIMACSM ()
addAll c = do
tell
(B.concat .
intersperse (B.pack " 0\n") .
map (B.unwords . map BS.show) $ c)
tell (B.pack " 0\n")
modify (\s -> s{numClauses = numClauses s + length c})
add h = addAll [h]
printDIMACS :: FilePath -> DIMACSM a -> IO ()
printDIMACS file f = do
writeFile file ""
appendFile file (concat ["p cnf ", show i, " ", show j, "\n"])
B.appendFile file b
where
(s,b) = runWriter (execStateT f (DS 1 0))
i = nextFresh s - 1
j = numClauses s
I would like to keep the monadic building of clauses since it is very handy, but I need to overcome the memory problem. How do I optimize the above program so that it doesn't use too much memory?
If you want good memory behavior, you need to make sure that you write out the clauses as you generate them, instead of collecting them in memory and dumping them as such, either using lazyness or a more explicit approach such as conduits, enumerators, pipes or the like.
The main obstacle to that approach is that the DIMACS format expects the number of clauses and variables in the header. This prevents the naive implementation from being sufficiently lazy. There are two possibilities:
The pragmatic one is to write the clauses first to a temporary location. After that the numbers are known, so you write them to the real file and append the contents of the temporary file.
The prettier approach is possible if the generation of clauses has no side effects (besides the effects offered by your DIMACSM monad) and is sufficiently fast: Run it twice, first throwing away the clauses and just calculating the numbers, print the header line, run the generator again; now printing the clauses.
(This is from my experience with implementing SAT-Britney, where I took the second approach, because it fitted better with other requirements in that context.)
Also, in your code, addAll is not lazy enough: The list c needs to be retained even after writing (in the MonadWriter sense) the clauses. This is another space leak. I suggest you implement add as the primitive operation and then addAll = mapM_ add.
As explained in Joachim Breitner's answer the problem was that DIMACSM was not lazy enough, both because the strict versions of the monads was used and because the number of variables and clauses are needed before the ByteString can be written to the file. The solution is to use the lazy versions of the Monads and execute them twice. It turns out that it is also necessary to have WriterT be the outer monad:
import Control.Monad.State
import Control.Monad.Writer
...
type DIMACSM a = WriterT B.ByteString (State DIMACSS) a
...
printDIMACS :: FilePath -> DIMACSM a -> IO ()
printDIMACS file f = do
writeFile file ""
appendFile file (concat ["p cnf ", show i, " ", show j, "\n"])
B.appendFile file b
where
s = execState (execWriterT f) (DS 1 0)
b = evalState (execWriterT f) (DS 1 0)
i = nextFresh s - 1
j = numClauses s

Resources