Passing State from a Producer to a Parser - haskell

I'm using pipes, attoparsec, and pipes-attoparsec to write a database dump file converter. The general format of the file is to have a create table command followed by an optional insert command. In addition to transforming the statements in place, the table definitions have to be held in memory until the very end for additional processing (indexes, constraints, etc.).
This works fine, but now I need to allow some of my internal parsers to have access to my Producer's State in order to determine which parser needs to be run while processing the values from the insert command.
I tried something like this:
-- IO
import qualified Data.ByteString.Char8 as BS (putStrLn)
import System.Exit (ExitCode (..), exitSuccess, exitFailure)
import System.IO (hPutStrLn, stderr)
-- Pipes
import Pipes (runEffect, for, liftIO, Producer, Effect)
import Pipes.Attoparsec (parsed, ParsingError)
import Pipes.Lift (runStateP)
import Pipes.Safe (runSafeT)
import qualified Pipes.ByteString as PBS (stdin)
-- State
import Control.Monad.Trans.Class (lift)
import Control.Monad.Trans.State.Strict
dump' :: StateT ParserState Parser Command
dump' = fmap Create createStatements' <|> fmap Insert justData'
doStuff :: MonadIO m => Effect m (Either (ParsingError, Producer ByteString (StateT ParserState m) ()) (), ParserState)
doStuff = runStateP defaultParserState theStuff
theStuff :: MonadIO m => Effect (StateT ParserState m) (Either (ParsingError, Producer ByteString (StateT ParserState m) ()) ())
theStuff = for runParser (liftIO . BS.putStrLn <=< lift . processCommand)
runParser :: MonadIO m => Producer Command (StateT ParserState m) (Either (ParsingError, Producer ByteString (StateT ParserState m) ()) ())
runParser = do
s <- lift get
liftIO $ putStrLn "runParser"
liftIO $ putStrLn $ show s
parsed (evalStateT dump' s) PBS.stdin
processCommand :: MonadIO m => Command -> StateT ParserState m ByteString
processCommand (Create xs) = do
currentState <- get
liftIO $ putStrLn "processCommand"
liftIO $ putStrLn $ show currentState
_ <- put (currentState { constructs = xs ++ (constructs currentState)})
return $ P.firstPass $ P.transformConstructs xs
processCommand (Insert x) = return x
Complete source (including parsers): https://github.com/cimmanon/mysqlnothx/blob/parser-state/src/Main.hs
When I run it, I get a result that looks something like this:
runParser
ParserState {constructs = []}
processCommand
ParserState {constructs = []}
processCommand
ParserState {constructs = [ ... ]}
processCommand
ParserState {constructs = [ ..... ]}
I was expecting runParser (which would grab the latest contents from State) to be run every time processCommand runs, but that's clearly not the case based on the output. When I check the contents of State within the parser, it's always empty no matter how many commands are parsed.
How can I extend State from my Producers to my Parser (dump') so that they share the same State? If my Producer has 4 values in State, the Parser should also see those same 4 values.

I was expecting runParser (which would grab the latest contents from State) to be run every time processCommand runs, but that's clearly not the case.
Your main effect is for runParser (liftIO . BS.putStrLn <=< lift . processCommand). To understand what this effect does you need to understand what for does:
(for p body) loops over p replacing each yield with body
"Loops over p" is accurate if a bit confusing. It doesn't run p once for each value produced by p; that would explode! Instead for replaces every yield in p with body. By replacing yield with body it runs body once for every yielded value. Running the body once for each produced value is similar to how in other languages a for-loop over a list runs the body once for each value in the list.
Your runParser is
runParser = do
s <- lift get
liftIO $ putStrLn "runParser"
liftIO $ putStrLn $ show s
parsed (evalStateT dump' s) PBS.stdin
It reads the state, outputs it, and produces the Commands parsed from stdin. Pipes-autoparsec's parsed parses the source and yields once for each completely successfully parsed value. Your for then replaces each of parsed's yields with liftIO . BS.putStrLn <=< lift . processCommand. The complete effect runs runParser once and processCommand once for each yield, which is what you're observing in the output.

Related

Memoizing and repeating IO monads

EDITED 2015-11-29: see bottom
I'm trying to write an application that has a do-last-action-again button. The command in question can ask for input, and my thought for how to accomplish this was to just rerun the resulting monad with memoized IO.
There are lots of posts on SO with similar questions, but none of the solutions seem to work here.
I lifted the memoIO code from this SO answer, and changed the implementation to run over MonadIO.
-- Memoize an IO function
memoIO :: MonadIO m => m a -> m (m a)
memoIO action = do
ref <- liftIO $ newMVar Nothing
return $ do
x <- maybe action return =<< liftIO (takeMVar ref)
liftIO . putMVar ref $ Just x
return x
I've got a small repro of my app's approach, the only real difference being my app has a big transformer stack instead of just running in IO:
-- Global variable to contain the action we want to repeat
actionToRepeat :: IORef (IO String)
actionToRepeat = unsafePerformIO . newIORef $ return ""
-- Run an action and store it as the action to repeat
repeatable :: IO String -> IO String
repeatable action = do
writeIORef actionToRepeat action
action
-- Run the last action stored by repeatable
doRepeat :: IO String
doRepeat = do
x <- readIORef actionToRepeat
x
The idea being I can store an action with memoized IO in an IORef (via repeatable) when I record what was last done, and then do it again it out with doRepeat.
I test this via:
-- IO function to memoize
getName :: IO String
getName = do
putStr "name> "
getLine
main :: IO ()
main = do
repeatable $ do
memoized <- memoIO getName
name <- memoized
putStr "hello "
putStrLn name
return name
doRepeat
return ()
with expected output:
name> isovector
hello isovector
hello isovector
but actual output:
name> isovector
hello isovector
name> wasnt memoized
hello wasnt memoized
I'm not entirely sure what the issue is, or even how to go about debugging this. Gun to my head, I'd assume lazy evaluation is biting me somewhere, but I can't figure out where.
Thanks in advance!
EDIT 2015-11-29: My intended use case for this is to implement the repeat last change operator in a vim-clone. Each action can perform an arbitrary number of arbitrary IO calls, and I would like it to be able to specify which ones should be memoized (reading a file, probably not. asking the user for input, yes).
the problem is in main you are creating a new memo each time you call the action
you need to move memoized <- memoIO getName up above the action
main :: IO ()
main = do
memoized <- memoIO getName --moved above repeatable $ do
repeatable $ do
--it was here
name <- memoized
putStr "hello "
putStrLn name
return name
doRepeat
return ()
edit: is this acceptable
import Data.IORef
import System.IO.Unsafe
{-# NOINLINE actionToRepeat #-}
actionToRepeat :: IORef (IO String)
actionToRepeat = unsafePerformIO . newIORef $ return ""
type Repeatable a = IO (IO a)
-- Run an action and store the Repeatable part of the action
repeatable :: Repeatable String -> IO String
repeatable action = do
repeatAction <- action
writeIORef actionToRepeat repeatAction
repeatAction
-- Run the last action stored by repeatable
doRepeat :: IO String
doRepeat = do
x <- readIORef actionToRepeat
x
-- everything before (return $ do) is run just once
hello :: Repeatable String
hello = do
putStr "name> "
name <- getLine
return $ do
putStr "hello "
putStrLn name
return name
main :: IO ()
main = do
repeatable hello
doRepeat
return ()
I came up with a solution. It requires wrapping the original monad in a new transformer which records the results of IO and injects them the next time the underlying monad is run.
Posting it here so my answer is complete.
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
{-# LANGUAGE LambdaCase #-}
import Control.Applicative (Applicative(..))
import Data.Dynamic
import Data.Maybe (fromJust)
import Control.Monad.RWS
-- | A monad transformer adding the ability to record the results
-- of IO actions and later replay them.
newtype ReplayT m a =
ReplayT { runReplayT :: RWST () [Dynamic] [Dynamic] m a }
deriving ( Functor
, Applicative
, Monad
, MonadIO
, MonadState [Dynamic]
, MonadWriter [Dynamic]
, MonadTrans
)
-- | Removes the first element from a list State and returns it.
dequeue :: MonadState [r] m
=> m (Maybe r)
dequeue = do
get >>= \case
[] -> return Nothing
(x:xs) -> do
put xs
return $ Just x
-- | Marks an IO action to be memoized after its first invocation.
sample :: ( MonadIO m
, Typeable r)
=> IO r
-> ReplayT m r
sample action = do
a <- dequeue >>= \case
Just x -> return . fromJust $ fromDynamic x
Nothing -> liftIO action
tell [toDyn a]
return a
-- | Runs an action and records all of its sampled IO. Returns a
-- action which when invoked will use the recorded IO.
record :: Monad m
=> ReplayT m a
-> m (m a)
record action = do
(a, w) <- evalRWST (runReplayT action) () []
return $ do
evalRWST (runReplayT action) () w
return a

Handling exceptions (ExceptT) in chain of actions

I am trying to use an exception to skip parts of the code here. Instead of getting caught by catcheE and resuming normal behavior all following actions in the mapM_ chain get skipped.
I looked at this question and it appears that catchE ~ main and checkMaybe ~ intercept.
I also checked the implementation of mapM_to be sure it does what i want it to, but i don't understand how the Left value can escape dlAsset to affect the behavior of mapM_.
I refactored this from a version where i simply used an empty string as an exception marker for the failed lookup. In that version checkMaybe just returned a Right value immediately and it worked (matching on "" to 'catch')
import Data.HashMap.Strict as HM hiding (map)
import qualified Data.ByteString.Lazy as BS
import qualified Data.ByteString.Char8 as BSC8
import qualified JSONParser as P -- my module
retrieveAssets :: (Text -> Text) -> ExceptT Text IO ()
retrieveAssets withName = withManager $ (lift ((HM.keys . P.assets)
<$> P.raw) ) >>= mapM_ f
where
f = \x -> dlAsset x "0.1246" (withName x)
dlAsset :: Text -> Text -> Text -> ReaderT Manager (ExceptT Text IO) ()
dlAsset name size dest = do
req <- lift $ (P.assetLookup name size <$> P.raw) >>= checkMaybe
name >>= parseUrl . unpack -- lookup of a url
res <- httpLbs req
lift $ (liftIO $ BS.writeFile (unpack dest) $ responseBody res)
`catchE` (\_ -> return ()) -- always a Right value?
where
checkMaybe name a = case a of
Nothing -> ExceptT $ fmap Left $ do
BSC8.appendFile "./resources/images/missingFiles.txt" $
BSC8.pack $ (unpack name) ++ "\n"
putStrLn $ "lookup of " ++ (unpack name) ++ " failed"
return name
Just x -> lift $ pure x
(had to reformat to become somewhat readable here)
edit: i'd like to understand what actually happens here, that would probably help me more than knowing which part of the code is wrong.
The problem is that your call to catchE only covered the very last line of dlAsset. It needs to be moved to the left of the do-notation indentation level to cover all of the do notation.

How to parse a large XML file in Haskell with limited amount of resources?

I want to extract information from a large XML file (around 20G) in Haskell. Since it is a large file, I used SAX parsing functions from Hexpath.
Here is a simple code I tested:
import qualified Data.ByteString.Lazy as L
import Text.XML.Expat.SAX as Sax
parse :: FilePath -> IO ()
parse path = do
inputText <- L.readFile path
let saxEvents = Sax.parse defaultParseOptions inputText :: [SAXEvent Text Text]
let txt = foldl' processEvent "" saxEvents
putStrLn txt
After activating profiling in Cabal, it says that parse.saxEvents took 85% of allocated memory. I also used foldr and the result is the same.
If processEvent becomes complex enough, the program crashes with a stack space overflow error.
What am I doing wrong?
You don't say what processEvent is like. In principle, it ought to be unproblematic to use lazy ByteString for a strict left fold over lazily generated input, so I'm not sure what is going wrong in your case. But one ought to use streaming-appropriate types when dealing with gigantic files!
In fact, hexpat does have 'streaming' interface (just like xml-conduit). It uses the not-too-well known List library and the rather ugly List class it defines. In principle the ListT type from the List package should work well. I gave up quickly because of a lack of combinators, and wrote an appropriate instance of the ugly List class for a wrapped version of Pipes.ListT which I then used to export ordinary Pipes.Producer functions like parseProduce. The trivial manipulations needed for this are appended below as PipesSax.hs
Once we have parseProducer we can convert a ByteString or Text Producer into a Producer of SaxEvents with Text or ByteString components. Here are some simple operations. I was using a 238M "input.xml"; the programs never need more than 6 mb of memory, to judge from looking at top.
-- Sax.hs Most of the IO actions use a registerIds pipe defined at the bottom which is tailored to a giant bit of xml of which this is a valid 1000 fragment http://sprunge.us/WaQK
{-#LANGUAGE OverloadedStrings #-}
import PipesSax ( parseProducer )
import Data.ByteString ( ByteString )
import Text.XML.Expat.SAX
import Pipes -- cabal install pipes pipes-bytestring
import Pipes.ByteString (toHandle, fromHandle, stdin, stdout )
import qualified Pipes.Prelude as P
import qualified System.IO as IO
import qualified Data.ByteString.Char8 as Char8
sax :: MonadIO m => Producer ByteString m ()
-> Producer (SAXEvent ByteString ByteString) m ()
sax = parseProducer defaultParseOptions
-- stream xml from stdin, yielding hexpat tagstream to stdout;
main0 :: IO ()
main0 = runEffect $ sax stdin >-> P.print
-- stream the extracted 'IDs' from stdin to stdout
main1 :: IO ()
main1 = runEffect $ sax stdin >-> registryIds >-> stdout
-- write all IDs to a file
main2 =
IO.withFile "input.xml" IO.ReadMode $ \inp ->
IO.withFile "output.txt" IO.WriteMode $ \out ->
runEffect $ sax (fromHandle inp) >-> registryIds >-> toHandle out
-- folds:
-- print number of IDs
main3 = IO.withFile "input.xml" IO.ReadMode $ \inp ->
do n <- P.length $ sax (fromHandle inp) >-> registryIds
print n
-- sum the meaningful part of the IDs - a dumb fold for illustration
main4 = IO.withFile "input.xml" IO.ReadMode $ \inp ->
do let pipeline = sax (fromHandle inp) >-> registryIds >-> P.map readIntId
n <- P.fold (+) 0 id pipeline
print n
where
readIntId :: ByteString -> Integer
readIntId = maybe 0 (fromIntegral.fst) . Char8.readInt . Char8.drop 2
-- my xml has tags with attributes that appear via hexpat thus:
-- StartElement "FacilitySite" [("registryId","110007915364")]
-- and the like. This is just an arbitrary demo stream manipulation.
registryIds :: Monad m => Pipe (SAXEvent ByteString ByteString) ByteString m ()
registryIds = do
e <- await -- we look for a 'SAXEvent'
case e of -- if it matches, we yield, else we go to the next event
StartElement "FacilitySite" [("registryId",a)] -> do yield a
yield "\n"
registryIds
_ -> registryIds
-- 'library': PipesSax.hs
This just newtypes Pipes.ListT to get the appropriate instances. We don't export anything to do with List or ListT but just use the standard Pipes.Producer concept.
{-#LANGUAGE TypeFamilies, GeneralizedNewtypeDeriving #-}
module PipesSax (parseProducerLocations, parseProducer) where
import Data.ByteString (ByteString)
import Text.XML.Expat.SAX
import Data.List.Class
import Control.Monad
import Control.Applicative
import Pipes
import qualified Pipes.Internal as I
parseProducer
:: (Monad m, GenericXMLString tag, GenericXMLString text)
=> ParseOptions tag text
-> Producer ByteString m ()
-> Producer (SAXEvent tag text) m ()
parseProducer opt = enumerate . enumerate_
. parseG opt
. Select_ . Select
parseProducerLocations
:: (Monad m, GenericXMLString tag, GenericXMLString text)
=> ParseOptions tag text
-> Producer ByteString m ()
-> Producer (SAXEvent tag text, XMLParseLocation) m ()
parseProducerLocations opt =
enumerate . enumerate_ . parseLocationsG opt . Select_ . Select
newtype ListT_ m a = Select_ { enumerate_ :: ListT m a }
deriving (Functor, Monad, MonadPlus, MonadIO
, Applicative, Alternative, Monoid, MonadTrans)
instance Monad m => List (ListT_ m) where
type ItemM (ListT_ m) = m
joinL = Select_ . Select . I.M . liftM (enumerate . enumerate_)
runList = liftM emend . next . enumerate . enumerate_
where
emend (Right (a,q)) = Cons a (Select_ (Select q))
emend _ = Nil

Composing IO Monads using do

I have code in the Reader Monad, so as to pass a file handle as an invisible parameter down the Reader chain.
In writeMail, I am trying to create a Reader, which, when run using runReader, produces an IO () output which is itself the result of a chain of IO monads
writeMail :: Reader Handle (IO ())
writeMail mail = do
wmh <- writeMailHeaders mail
wmb <- writeMailBody mail
return $ wmh >>= \_ -> wmb
However I am finding that only the last in the IO chain i.e. wmb, prints at the console.
Can anyone see what I should be doing to get wmh, then wmb to print?
With simpler example:
module Read where
import Data.Functor.Identity
write :: Monad m => m (IO ())
write = do
a <- return $ putStrLn "foo"
b <- return $ putStrLn "bar"
return $ a >> b
main :: IO ()
main = runIdentity write
main prints both "foo" and "bar". So I suspect the error is in writeMailHeaders.
What you need is not just a reader, but a ReaderT monad transformer with IO as a base monad.
Since your example was incomplete, I made some changes to show your options:
import Control.Monad.Reader
writeMail :: ReaderT Handle IO ()
writeMail = do
-- Here's how you get your handle to further do something to it:
handle <- ask
-- Here's how you do the IO actions.
-- Notice the `lift` function,
-- which allows us to run actions of the base monad,
-- which in that case is `IO`.
lift $ do
print "bla bla"
print "bla"

How do I break out of a loop in Haskell?

The current version of the Pipes tutorial, uses the following two functions in one of the example:
stdout :: () -> Consumer String IO r
stdout () = forever $ do
str <- request ()
lift $ putStrLn str
stdin :: () -> Producer String IO ()
stdin () = loop
where
loop = do
eof <- lift $ IO.hIsEOF IO.stdin
unless eof $ do
str <- lift getLine
respond str
loop
As is mentinoed in the tutorial itself, P.stdin is a bit more complicated due to the need to check for the end of input.
Are there any nice ways to rewrite P.stdin to not need a manual tail recursive loop and use higher order control flow combinators like P.stdout does? In an imperative language I would use a structured while loop or a break statement to do the same thing:
while(not IO.isEOF(IO.stdin) ){
str <- getLine()
respond(str)
}
forever(){
if(IO.isEOF(IO.stdin) ){ break }
str <- getLine()
respond(str)
}
I prefer the following:
import Control.Monad
import Control.Monad.Trans.Either
loop :: (Monad m) => EitherT e m a -> m e
loop = liftM (either id id) . runEitherT . forever
-- I'd prefer 'break', but that's in the Prelude
quit :: (Monad m) => e -> EitherT e m r
quit = left
You use it like this:
import Pipes
import qualified System.IO as IO
stdin :: () -> Producer String IO ()
stdin () = loop $ do
eof <- lift $ lift $ IO.hIsEOF IO.stdin
if eof
then quit ()
else do
str <- lift $ lift getLine
lift $ respond str
See this blog post where I explain this technique.
The only reason I don't use that in the tutorial is that I consider it less beginner-friendly.
Looks like a job for whileM_:
stdin () = whileM_ (lift . fmap not $ IO.hIsEOF IO.stdin) (lift getLine >>= respond)
or, using do-notation similarly to the original example:
stdin () =
whileM_ (lift . fmap not $ IO.hIsEOF IO.stdin) $ do
str <- lift getLine
respond str
The monad-loops package offers also whileM which returns a list of intermediate results instead of ignoring the results of the repeated action, and other useful combinators.
Since there is no implicit flow there is no such thing like "break". Moreover your sample already is small block which will be used in more complicated code.
If you want to stop "producing strings" it should be supported by your abstraction. I.e. some "managment" of "pipes" using special monad in Consumer and/or other monads that related with this one.
You can simply import System.Exit, and use exitWith ExitSuccess
Eg. if (input == 'q')
then exitWith ExitSuccess
else print 5 (anything)

Resources