I'm writing a program that creates a shell script containing one command for each image file in a directory. There are 667,944 images in the directory, so I need to handle the strictness/laziness issue properly.
Here's a simple example that gives me Stack space overflow. It does work if I give it more space using +RTS -Ksize -RTS, but it should be able run with little memory, producing output immediately. So I've been reading the stuff about strictness in the Haskell wiki and the wikibook on Haskell, trying to figure out how to fix the problem, and I think it's one of the mapM commands that is giving me grief, but I still don't understand enough about strictness to sort the problem.
I've found some other questions on SO that seem relevant (Is mapM in Haskell strict? Why does this program get a stack overflow? and Is Haskell's mapM not lazy?), but enlightenment still eludes me.
import System.Environment (getArgs)
import System.Directory (getDirectoryContents)
genCommand :: FilePath -> FilePath -> FilePath -> IO String
genCommand indir outdir file = do
let infile = indir ++ '/':file
let angle = 0 -- have to actually read the file to calculate this for real
let outfile = outdir ++ '/':file
return $! "convert " ++ infile ++ " -rotate " ++ show angle ++
" -crop 143x143+140+140 " ++ outfile
main :: IO ()
main = do
putStrLn "#!/bin/sh"
(indir:outdir:_) <- getArgs
files <- getDirectoryContents indir
let imageFiles = filter (`notElem` [".", ".."]) files
commands <- mapM (genCommand indir outdir) imageFiles
mapM_ putStrLn commands
EDIT: TEST #1
Here's the newest version of the example.
import System.Environment (getArgs)
import System.Directory (getDirectoryContents)
import Control.Monad ((>=>))
genCommand :: FilePath -> FilePath -> FilePath -> IO String
genCommand indir outdir file = do
let infile = indir ++ '/':file
let angle = 0 -- have to actually read the file to calculate this for real
let outfile = outdir ++ '/':file
return $! "convert " ++ infile ++ " -rotate " ++ show angle ++
" -crop 143x143+140+140 " ++ outfile
main :: IO ()
main = do
putStrLn "TEST 1"
(indir:outdir:_) <- getArgs
files <- getDirectoryContents indir
putStrLn $ show (length files)
let imageFiles = filter (`notElem` [".", ".."]) files
-- mapM_ (genCommand indir outdir >=> putStrLn) imageFiles
mapM_ (\filename -> genCommand indir outdir filename >>= putStrLn) imageFiles
I compile it with the command ghc --make -O2 amy2.hs -rtsopts. If I run it with the command ./amy2 ~/nosync/GalaxyZoo/table2/images/ wombat, I get
TEST 1
Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize -RTS' to increase it.
If I instead run it with the command ./amy2 ~/nosync/GalaxyZoo/table2/images/ wombat +RTS -K20M, I get the correct output...eventually:
TEST 1
667946
convert /home/amy/nosync/GalaxyZoo/table2/images//587736546846572812.jpeg -rotate 0 -crop 143x143+140+140 wombat/587736546846572812.jpeg
convert /home/amy/nosync/GalaxyZoo/table2/images//587736542558617814.jpeg -rotate 0 -crop 143x143+140+140 wombat/587736542558617814.jpeg
...and so on.
This isn't really a strictness issue(*), but an order of evaluation issue. Unlike lazily evaluated pure values, monadic effects must happen in deterministic order. mapM executes every action in the given list and gathers the results, but it cannot return until the whole list of actions is executed, so you don't get the same streaming behavior as with pure list functions.
The easy fix in this case is to run both genCommand and putStrLn inside the same mapM_. Note that mapM_ doesn't suffer from the same issue since it is not building an intermediate list.
mapM_ (genCommand indir outdir >=> putStrLn) imageFiles
The above uses the "kleisli composition operator" >=> from Control.Monad which is like the function composition operator . except for monadic functions. You can also use the normal bind and a lambda.
mapM_ (\filename -> genCommand indir outdir filename >>= putStrLn) imageFiles
For more complex I/O applications where you want better composability between small, monadic stream processors, you should use a library such as conduit or pipes.
Also, make sure you are compiling with either -O or -O2.
(*) To be exact, it is also a strictness issue, because in addition to building a large, intermediate list in memory, laziness causes mapM to build unnecessary thunks and use up stack.
EDIT: So it seems the main culprit might be getDirectoryContents. Looking at the function's source code, it essentially does the same kind of list accumulation internally as mapM.
In order to do streaming directory listing, we need to use System.Posix.Directory which unfortunately makes the program incompatible with non-POSIX systems (like Windows). You can stream the directory contents by e.g. using continuation passing style
import System.Environment (getArgs)
import Control.Monad ((>=>))
import System.Posix.Directory (openDirStream, readDirStream, closeDirStream)
import Control.Exception (bracket)
genCommand :: FilePath -> FilePath -> FilePath -> IO String
genCommand indir outdir file = do
let infile = indir ++ '/':file
let angle = 0 -- have to actually read the file to calculate this for real
let outfile = outdir ++ '/':file
return $! "convert " ++ infile ++ " -rotate " ++ show angle ++
" -crop 143x143+140+140 " ++ outfile
streamingDirContents :: FilePath -> (FilePath -> IO ()) -> IO ()
streamingDirContents root cont = do
let loop stream = do
fp <- readDirStream stream
case fp of
[] -> return ()
_ | fp `notElem` [".", ".."] -> cont fp >> loop stream
| otherwise -> loop stream
bracket (openDirStream root) loop closeDirStream
main :: IO ()
main = do
putStrLn "TEST 1"
(indir:outdir:_) <- getArgs
streamingDirContents indir (genCommand indir outdir >=> putStrLn)
Here's how you could do the same thing using conduit:
import System.Environment (getArgs)
import System.Posix.Directory (openDirStream, readDirStream, closeDirStream)
import Data.Conduit
import qualified Data.Conduit.List as L
import Control.Monad.IO.Class (liftIO, MonadIO)
genCommand :: FilePath -> FilePath -> FilePath -> IO String
genCommand indir outdir file = do
let infile = indir ++ '/':file
let angle = 0 -- have to actually read the file to calculate this for real
let outfile = outdir ++ '/':file
return $! "convert " ++ infile ++ " -rotate " ++ show angle ++
" -crop 143x143+140+140 " ++ outfile
dirSource :: (MonadResource m, MonadIO m) => FilePath -> Source m FilePath
dirSource root = do
bracketP (openDirStream root) closeDirStream $ \stream -> do
let loop = do
fp <- liftIO $ readDirStream stream
case fp of
[] -> return ()
_ -> yield fp >> loop
loop
main :: IO ()
main = do
putStrLn "TEST 1"
(indir:outdir:_) <- getArgs
let files = dirSource indir $= L.filter (`notElem` [".", ".."])
commands = files $= L.mapM (liftIO . genCommand indir outdir)
runResourceT $ commands $$ L.mapM_ (liftIO . putStrLn)
The nice thing about conduit is that you regain the ability to compose pieces of functionality with things like conduit versions of filter and mapM. The $= operator streams stuff forward in the chain and $$ connects the stream to a consumer.
The not-so-nice thing is that real world is complicated and writing efficient and robust code requires us to jump through some hoops with resource management. That's why all the operations work in the ResourceT monad transformer which keeps track of e.g. open file handles and cleans them up promptly and deterministically when they are no longer needed or e.g. if the computation gets aborted by an exception (this is in contrast to using lazy I/O and relying on the garbage collector to eventually release any scarce resources).
However, this means that we a) need to run the final resulting conduit operation with runResourceT and b) we need to explicitly lift I/O operations to the transformed monad using liftIO instead of being able to directly write e.g. L.mapM_ putStrLn.
Related
I'm trying to learn how to work with IO in Haskell by writing a function that, if there is a flag, will take a list of points from a file, and if there is no flag, it asks the user to enter them.
dispatch :: [String] -> IO ()
dispatch argList = do
if "file" `elem` argList
then do
let (path : otherArgs) = argList
points <- getPointsFile path
else
print "Enter a point in the format: x;y"
input <- getLine
if (input == "exit")
then do
print "The user inputted list:"
print $ reverse xs
else (inputStrings (input:xs))
if "help" `elem` argList
then help
else return ()
dispatch [] = return ()
dispatch _ = error "Error: invalid args"
getPointsFile :: String -> IO ([(Double, Double)])
getPointsFile path = do
handle <- openFile path ReadMode
contents <- hGetContents handle
let points_str = lines contents
let points = foldl (\l d -> l ++ [tuplify2 $ splitOn ";" d]) [] points_str
hClose handle
return points
I get this: do-notation in pattern Possibly caused by a missing 'do'?` after `if "file" `elem` argList.
I'm also worried about the binding issue, assuming that I have another flag that says which method will be used to process the points. Obviously it waits for points, but I don't know how to make points visible not only in if then else, constructs. In imperative languages I would write something like:
init points
if ... { points = a}
else points = b
some actions with points
How I can do something similar in Haskell?
Here's a fairly minimal example that I've done half a dozen times when I'm writing something quick and dirty, don't have a complicated argument structure, and so can't be bothered to do a proper job of setting up one of the usual command-line parsing libraries. It doesn't explain what went wrong with your approach -- there's an existing good answer there -- it's just an attempt to show what this kind of thing looks like when done idiomatically.
import System.Environment
import System.Exit
import System.IO
main :: IO ()
main = do
args <- getArgs
pts <- case args of
["--help"] -> usage stdout ExitSuccess
["--file", f] -> getPointsFile f
[] -> getPointsNoFile
_ -> usage stderr (ExitFailure 1)
print (frobnicate pts)
usage :: Handle -> ExitCode -> IO a
usage h c = do
nm <- getProgName
hPutStrLn h $ "Usage: " ++ nm ++ " [--file FILE]"
hPutStrLn h $ "Frobnicate the points in FILE, or from stdin if no file is supplied."
exitWith c
getPointsFile :: FilePath -> IO [(Double, Double)]
getPointsFile = {- ... -}
getPointsNoFile :: IO [(Double, Double)]
getPointsNoFile = {- ... -}
frobnicate :: [(Double, Double)] -> Double
frobnicate = {- ... -}
if in Haskell doesn't inherently have anything to do with control flow, it just switches between expressions. Which, in Haskell, happen to include do blocks of statements (if we want to call them that), but you still always need to make that explicit, i.e. you need to say both then do and else do if there are multiple statements in each branch.
Also, all the statements in a do block need to be indented to the same level. So in your case
if "file" `elem` argList
...
if "help" `elem` argList
Or alternatively, if the help check should only happen in the else branch, it needs to be indented to the statements in that do block.
Independent of all that, I would recommend to avoid parsing anything in an IO context. It is usually much less hassle and easier testable to first parse the strings into a pure data structure, which can then easily be processed by the part of the code that does IO. There are libraries like cmdargs and optparse-applicative that help with the parsing part.
I have revisited Haskell lateley and constructed a toy programming language parser/interpreter. Using Parsec for lexing and parsing and a separate interpreter. I'm running in to some issues with feeding the result from the parser to my interpreter and handle the potential error from both the interpreter and parser. I end up with something like this:
main = do
fname <- getArgs
input <- readFile (head fname)
case lparse (head fname) input of
Left msg -> putStrLn $ show msg
Right p -> case intrp p of
Left msg -> putStrLn $ show msg
Right r -> putStrLn $ show r
This dosn't look pretty at all. My problem is that lparse returns Either ParseError [(String, Stmt)] and itrp returns the type Either ItrpError Stmt so I'm having a real hard time feeding the Right result from the parser to the interpreter and at the same time bail and print the possible ParseError or IntrpError.
The closest to what i want is something like this
main = do
fname <- getArgs
input <- readFile (head fname)
let prog = lparse (head fname) input
(putStrLn . show) (intrp <$> prog)
But this will not surprisingly yield a nested Either and not print pretty either.
So are there any nice Haskell ideomatic way of doing this threading results from one computation to another and handling errors (Lefts) in a nice way without nesting cases?
Edit
adding types of lparse and itrp
lparse :: Text.Parsec.Pos.SourceName -> String -> Either Text.Parsec.Error.ParseError [([Char], Stmt)]
intrp :: [([Char], Stmt)] -> Either IntrpError Stmt
While not perfect, I'd create a helper function for embedding any Showable error from Either into MonadError:
{-# LANGUAGE FlexibleContexts #-}
import Control.Monad.Except
strErr :: (MonadError String m, Show e) => Either e a -> m a
strErr = either (throwError . show) return
Then if you have a computation that can fail with errors, like
someFn :: ExceptT String IO ()
someFn = strErr (Left 42)
you can run it (printing errors to stdout) as
main :: IO ()
main = runExceptT someFn >>= either putStrLn return
In your case it'd be something like
main = either putStrLn return <=< runExceptT $ do
fname <- liftIO getArgs
input <- liftIO $ readFile (head fname)
prog <- strErr $ lparse (head fname) input
r <- strErr $ interp prog
print r
Well, if you want to chain successful computations, you can always use >>= to do that. For instance in your case:
lparse (head fname) input >>= intrp
And if you want to print out either your error message you can use the either class that takes two handler functions, one for the case when you have Left a (error in your case) and another for Right b (in your case a successful thing). An example:
either (putStrLn . show) (putStrLn . show) (lparse (head fname) input >>= intrp)
And if anything fails in your chain (any step of your monadic chain becomes Left a) it stops and can for instance print out the error message in the above case.
I want to read a file, process it, and write the results to another file; the input file name is to be supplied through a console argument, and the output file name is generated from the input file name.
The catch is I want it to transparently “fail over” to stdin/stdout if no arguments are supplied; essentially, in case a file name is supplied, I redirect stdin/stdout to the respective file names so I can transparently use interact whether the file name was supplied or not.
Here's the code hacked together with dummy output in a superfluous else. What will be the proper, idiomatic form of doing it?
It probably could have something to do with Control.Monad's when or guard, as was pointed out in a similar question, but maybe somebody wrote this already.
import System.IO
import Data.Char(toUpper)
import System.Environment
import GHC.IO.Handle
main :: IO ()
main = do
args <- getArgs
if(not $ null args) then
do
print $ "working with "++ (head args)
finHandle <- openFile (head args) ReadMode --open the supplied input file
hDuplicateTo finHandle stdin --bind stdin to finName's handle
foutHandle <- openFile ((head args) ++ ".out") WriteMode --open the output file for writing
hDuplicateTo foutHandle stdout --bind stdout to the outgoing file
else print "working through stdin/redirect" --get to know
interact ((++) "Here you go---\n" . map toUpper)
There's nothing very special about interact - here is its definition:
interact :: (String -> String) -> IO ()
interact f = do s <- getContents
putStr (f s)
How about something like this:
import System.Environment
import Data.Char
main = do
args <- getArgs
let (reader, writer) =
case args of
[] -> (getContents, putStr)
(path : _) -> let outpath = path ++ ".output"
in (readFile path, writeFile outpath)
contents <- reader
writer (process contents)
process :: String -> String
process = (++) "Here you go---\n" . map toUpper
Based on the command line arguments we set reader and writer to the IO-actions which will read the input and write the output.
This seems fairly idiomatic to me already. The one note I have is to avoid head, as it is an unsafe function (it can throw a runtime error). In this case it is fairly easy to do so by using case to pattern match.
main :: IO ()
main = do
args <- getArgs
case args of
fname:_ -> do
print $ "working with " ++ fname
finHandle <- openFile fname ReadMode
hDuplicateTo finHandle stdin
foutHandle <- openFile (fname ++ ".out") WriteMode
hDuplicateTo foutHandle stdout
[] -> do
print "working through stdin/redirect"
interact ((++) "Here you go---\n" . map toUpper)
Is it possible to split a Shell in Turtle library (Haskell) and do different things to either split of the shell, such that the original Shell is only run once ?
/---- shell2
---Shell1 --/
\
\-----shell3
For instance, how to do
do
let lstmp = lstree "/tmp"
view lstmp
view $ do
path <- lstmp
x <- liftIO $ testdir path
return x
such that lstree "/tmp" would only run once.
Specifically I would like to send Shell 2 and Shell 3 to different files using output.
You won't be able to split a Shell into two separate shells that run simultaneously, unless there's some magic I don't know. But file writing is a fold over the contents of a shell or some other succession of things. It is built into turtle that you can always combine many folds and make them run simultaneously using the Control.Foldl material - here
foldIO :: Shell a -> FoldM IO a r -> IO r -- specializing
A shell is secretly a FoldM IO a r -> IO r under the hood anyway, so this is basically runShell. To do this we need to get the right Shell and the right combined FoldM IO. The whole idea of the Fold a b and FoldM m a b types from the foldl package is simultaneous folding.
I think the easiest way to get the right shell is just to make the lstree fold return a FilePath together with the result of testdir. You basically wrote this:
withDirInfo :: FilePath -> Shell (Bool, FilePath)
withDirInfo tmp = do
let lstmp = lstree tmp
path <- lstmp
bool <- liftIO $ testdir path
return (bool, path)
So now we can get a Shell (Bool, FilePath) from /tmp This has all the information our two folds will need, and thus that our combined fold will need.
Next we might write a helper fold that prints the Text component of the FilePath to a given handle:
sinkFilePaths :: Handle -> FoldM IO FilePath ()
sinkFilePaths handle = L.sink (T.hPutStrLn handle . format fp)
Then we can use this Handle -> FoldM IO FilePath () to define two FoldM IO (Bool, FilePath) (). Each will write different stuff to different handles, and we can unite them into a single simultaneous fold with <*. This is an independent FoldM IO ... and can be applied e.g. to a pure list of type [(Bool, FilePath)] using L.fold and it will write different things from the list to the different handles. In our case, though, we will apply it to the Shell (Bool, FilePath) we defined.
The only subtle part of this is the use of L.handlesM to print only the second element, in both cases, and only those filtered as directories in the other. This uses the _2 lens and filtered from the lens libraries. This could probably be simplified, but see what you think:
{-#LANGUAGE OverloadedStrings #-}
import Turtle
import qualified Control.Foldl as L
import qualified System.IO as IO
import Control.Lens (_2,filtered)
import qualified Data.Text.IO as T
main = IO.withFile "tmpfiles.txt" IO.WriteMode $ \h ->
IO.withFile "tmpdirs.txt" IO.WriteMode $ \h' -> do
foldIO (withDirInfo "/tmp") (sinkFilesDirs h h')
withDirInfo :: Turtle.FilePath -> Shell (Bool, Turtle.FilePath)
withDirInfo tmp = do
let lstmp = lstree tmp
path <- lstmp
bool <- liftIO $ testdir path
return (bool, path)
sinkFilePaths :: Handle -> FoldM IO Turtle.FilePath ()
sinkFilePaths handle = L.sink (T.hPutStrLn handle . format fp)
sinkFilesDirs :: Handle -> Handle -> FoldM IO (Bool, Turtle.FilePath) ()
sinkFilesDirs h h' = allfiles <* alldirs where
allfiles :: L.FoldM IO (Bool, Turtle.FilePath) ()
allfiles = L.handlesM _2 (sinkFilePaths h)
-- handle the second element of pairs with sinkFilePaths
alldirs :: FoldM IO (Bool, Turtle.FilePath) ()
alldirs = L.handlesM (filtered (\(bool,file) -> bool) . _2) (sinkFilePaths h')
-- handle the second element of pairs where the first element
-- is true using sinkFilePaths
It sounds like you're looking for something like async to split off your shells from the first shell and then wait for them to return. async is a pretty capable library that can achieve much more than the below example, but it provides a pretty simple solution to what you're asking for:
import Control.Concurrent.Async
import Turtle.Shell
import Turtle.Prelude
main :: IO ()
main = do
let lstmp1 = lstree "/tmp"
let lstmp2 = lstree "/etc"
view lstmp1
view lstmp2
job1 <- async $ view $ do
path <- lstmp1
x <- liftIO $ testdir path
return x
job2 <- async $ view $ do
path <- lstmp2
x <- liftIO $ testdir path
return x
wait job1
wait job2
Is this what you're looking for?
In the example below, I'd like to be able to call the 'ls' function directly (see the last commented out line of the example) but I have not been able to figure out the correct syntax.
Thanks in advance.
module Main (main) where
import System.Directory
ls :: FilePath -> IO [FilePath]
ls dir = do
fileList <- getDirectoryContents dir
return fileList
main = do
fileList <- ls "."
mapM putStrLn fileList
-- How can I just use the ls call directly like in the following (which doesn't compile)?
-- mapM putStrLn (ls".")
You can't just use
mapM putStrLn (ls ".")
because ls "." has type IO [FilePath], and mapM putStrLn expects just [FilePath], so you need to use bind, or >>= in Haskell. So your actual line would be
main = ls "." >>= mapM_ putStrLn
Notice the mapM_ function, not just mapM. mapM will give you IO [()] type, but for main you need IO (), and that's what mapM_ is for.