Does GHC generate optimized code for `mapM_`?

I wrote this code,
import System.FilePath ((</>))
fp = "/Users/USER/Documents/Test/"
fpAcc = fp </> "acc.txt"
paths = map (fp </>) ["A.txt", "B.txt", "C.txt"]
main :: IO ()
main =
  writeFile fpAcc ""
    >> return paths
    >>= mapM_ ((appendFile fpAcc =<<) . readFile)
    >> readFile fpAcc >>= putStrLn
and this is the definition of mapM_:
mapM_ :: (Foldable t, Monad m) => (a -> m b) -> t a -> m ()
mapM_ f = foldr c (return ())
  -- See Note [List fusion and continuations in 'c']
  where
    c x k = f x >> k
    {-# INLINE c #-}
Is the expression mapM_ ((appendFile fpAcc =<<) . readFile) producing 3 write operations to the disk, or due to some kind of GHC optimization, only one?
It would be great if the compiler could generate code that uses an intermediate buffer to hold the appended data and then writes only once. But since mapM_ maps each element of the structure to a monadic action, perhaps each step performs one write operation.

No, this is not optimized. The code you wrote will open fpAcc, write some bytes, and close fpAcc once for each element in paths. Indeed, it would be incorrect for the compiler to optimize this: opening and closing a file is observable from outside the program, so an optimization that opened the file just once would be a behavioral change, not just a speed change.
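To see why there is one action per element, unfold the definition above for a three-element list: mapM_ f [x1, x2, x3] reduces to f x1 >> (f x2 >> (f x3 >> return ())). If you want a single open/write/close, you can do the buffering yourself. A minimal sketch, reusing fp, fpAcc and paths from the question:

main :: IO ()
main = do
  contents <- mapM readFile paths    -- read the three inputs
  writeFile fpAcc (concat contents)  -- one open, one write, one close
  readFile fpAcc >>= putStrLn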

Related

Lazy evaluation of IO actions

I'm trying to write code in source -> transform -> sink style, for example:
let (|>) = flip ($)
repeat 1 |> take 5 |> sum |> print
But would like to do that using IO. I have this impression that my source can be an infinite list of IO actions, and each one gets evaluated once it is needed downstream. Something like this:
-- prints the number of lines entered before "quit" is entered
[getLine..] >>= takeWhile (/= "quit") >>= length >>= print
I think this is possible with the streaming libraries, but can it be done along the lines of what I'm proposing?
Using the repeatM, takeWhile and length_ functions from the streaming library:
import Streaming
import qualified Streaming.Prelude as S
count :: IO ()
count = do
  r <- S.length_ . S.takeWhile (/= "quit") . S.repeatM $ getLine
  print r
This seems to be in that spirit:
let (|>) = flip ($)
let (.>) = flip (.)
getContents >>= lines .> takeWhile (/= "quit") .> length .> print
The issue here is that Monad is not the right abstraction for this, and attempting something like it breaks referential transparency.
Firstly, we can do a lazy IO read like so:
module Main where
import System.IO.Unsafe (unsafePerformIO)
import Control.Monad(forM_)
lazyIOSequence :: [IO a] -> IO [a]
lazyIOSequence = pure . go
  where
    go :: [IO a] -> [a]
    go []     = []
    go (l:ls) = unsafePerformIO l : go ls

main :: IO ()
main = do
  l <- lazyIOSequence (repeat getLine)
  forM_ l putStrLn
When run, this behaves like cat: it will read lines and output them. Everything works fine.
But consider changing the main function to this:
main :: IO ()
main = do
  l <- lazyIOSequence (map (putStrLn . show) [1..])
  putStrLn "Hello World"
This outputs Hello World only, as we didn't need to evaluate any of l. But now consider replacing the last line with the following:
main :: IO ()
main = do
  x <- lazyIOSequence (map (putStrLn . show) [1..])
  seq (head x) (putStrLn "Hello World")
Same program, but the output is now:
1
Hello World
This is bad: we've changed the result of a program just by evaluating a value. That is not supposed to happen in Haskell; evaluating something should just evaluate it, not change the outside world.
So if you restrict your IO actions to something like reading from a file that nothing else is writing to, you might be able to lazily evaluate things sensibly, because where the reads fall relative to your program's other IO actions doesn't matter. But you don't want to allow this for IO in general, because skipping actions or performing them in a different order can matter (and above, it certainly does). Even in the lazy file-reading case, if something else in your program writes to the file, then whether you evaluate the list before or after the write will affect the output of your program, which again breaks referential transparency (because evaluation order shouldn't matter).
So for a restricted subset of IO actions, you can sensibly define Functor, Applicative and Monad on a stream type to work in a lazy way, but doing so in the IO monad in general is a minefield and often just plain incorrect. Instead you want a specialised streaming type, and indeed Conduit defines Functor, Applicative and Monad on a lot of its types so you can still use all your favourite functions.
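For instance, here is a minimal sketch of the count-until-"quit" program, assuming the Conduit module from the conduit package (repeatMC, takeWhileC and lengthC are the combinator names there):

import Conduit

count :: IO ()
count = do
  -- Stream getLine results until "quit" and count them, with no unsafePerformIO.
  n <- runConduit $ repeatMC getLine .| takeWhileC (/= "quit") .| lengthC
  print (n :: Int)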

Parallel Haskell with HXT

I'm trying to get performance increases in a program I have that parses XML. The program can parse multiple XML files so I thought that I could make this run in parallel, but all my attempts have resulted in lower performance!
For XML parsing, I am using HXT.
I have a run function defined like this:
run printTasks xs = pExec xs >>= return . concat >>= doPrint printTasks 1
'pExec' is given a list of file names and is defined as:
pExec xs = do
  ex <- mapM exec xs
  as <- ex `usingIO` parList rdeepseq
  return as
where 'exec' is defined as:
exec = runX . process
ThreadScope shows only one thread ever being used (until the very end).
Can anyone explain why I have failed so miserably to parallelise this code?
In case it helps:
exec :: FilePath -> [CV_scene]
pExec :: [FilePath] -> IO [[CV_scene]]

data CV_scene = Scene [CV_layer] Time deriving (Show)
data CV_layer = Layer [DirtyRect] SourceCrop deriving (Show)
data Rect = Rect Int Int Int Int deriving (Show) -- Left Top Width Height

instance NFData CV_scene where
  rnf = foldScene reduceScene
    where reduceScene l t = rnf (seq t l)

instance NFData CV_layer where
  rnf = foldLayer reduceLayer
    where reduceLayer d s = rnf (seq s d)

instance NFData Rect where
  rnf = foldRect reduceRect
    where reduceRect l t w h = rnf [l,t,w,h]

type SourceCrop = Rect
type DirtyRect = Rect
type Time = Int64
Thanks in advance for your help!
First, it looks like you mislabeled the signature of exec, which should probably be:
exec :: FilePath -> IO [CV_scene]
Now for the important part. I've commented inline on what I think you think is going on.
pExec xs = do
  -- A. Parse the file found at each location via exec.
  ex <- mapM exec xs
  -- B. Force the lazy parsing in parallel.
  as <- ex `usingIO` parList rdeepseq
  return as
Note that line A does not happen in parallel, which you might think is okay since it will just set up the parsing thunks that are forced in parallel in B. This is a fair assumption, and a clever use of laziness, but the results call it into question for me.
I suspect that the implementation of exec forces most of the parsing before line B is even reached, so that the deep seq doesn't do much. That fits pretty well with my experience of parsing, and the profiling supports that explanation.
Without the ability to test your code, I can only make the following suggestions. First, try separating the parsing of the file from the IO, and put the parsing in the parallel execution strategy. In that case lines A and B become something like:
ex <- mapM readFile xs
as <- map exec' ex `usingIO` parList rdeepseq
with exec' the portion of exec after the file is read from disk.
exec' :: String -> [CV_scene]
Also, you may not even need rdeepseq after this change.
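Putting the suggestion together, a minimal sketch of the revised pExec under these assumptions (exec' is the pure parsing stage; usingIO and parList come from Control.Parallel.Strategies, as in the question):

pExec' :: [FilePath] -> IO [[CV_scene]]
pExec' xs = do
  -- Do the file IO sequentially...
  contents <- mapM readFile xs
  -- ...then force the pure parses in parallel.
  map exec' contents `usingIO` parList rdeepseq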
As an alternative, you can do the IO and parsing in parallel using Software Transactional Memory. STM approaches are normally used for separate IO threads which act more like services, rather than pure computations. But if for some reason you can't get the strategies-based approach to work, this might be worth a try.
{-# LANGUAGE TupleSections #-}

import Control.Concurrent.STM       -- atomically (from the stm package)
import Control.Concurrent.STM.TChan -- TChan, newTChanIO, readTChan, writeTChan
import Control.Concurrent (forkIO)

pExec'' :: [FilePath] -> IO [[CV_scene]]
pExec'' xs = do
  -- A. Create [(FilePath, TChan [CV_scene])].
  tcx <- mapM (\x -> (x,) <$> newTChanIO) xs
  -- B. Do the reading/parsing in separate threads.
  mapM_ (forkIO . exec'') tcx
  -- C. Collect the results.
  mapM (atomically . readTChan . snd) tcx

exec'' :: (FilePath, TChan [CV_scene]) -> IO ()
exec'' (x, tch) = do
  -- D. The original exec function.
  cv <- exec x
  -- E. Put the result on the channel's FIFO buffer.
  atomically $ writeTChan tch cv
Good luck!

Join two consumers into a single consumer that returns multiple values?

I have been experimenting with the new pipes-http package and I had a thought. I have two parsers for a web page, one that returns line items and another a number from elsewhere in the page. When I grab the page, it'd be nice to string these parsers together and get their results at the same time from the same bytestring producer, rather than fetching the page twice or fetching all the html into memory and parsing it twice.
In other words, say you have two Consumers:
c1 :: Consumer a m r1
c2 :: Consumer a m r2
Is it possible to make a function like this:
combineConsumers :: Consumer a m r1 -> Consumer a m r2 -> Consumer a m (r1, r2)
combineConsumers = undefined
I have tried a few things, but I can't figure it out. I understand if it isn't possible, but it would be convenient.
Edit:
I'm sorry; it turns out I was making an assumption about pipes-attoparsec, due to my experience with conduit-attoparsec, that caused me to ask the wrong question. Pipes-attoparsec turns an attoparsec parser into a pipes Parser, when I had just assumed that it would return a pipes Consumer. That means that I can't actually turn two attoparsec parsers into consumers that take text and return a result, then use them with the plain old pipes ecosystem. I'm sorry, but I just don't understand pipes-parse.
Even though it doesn't help me, Arthur's answer is pretty much what I envisioned when I asked the question, and I'll probably end up using his solution in the future. In the meantime I'm just going to use conduit.
If the results are "monoidal", you can use the tee function from the Pipes prelude, in combination with a WriterT.
{-# LANGUAGE OverloadedStrings #-}

import Data.Monoid
import Control.Monad
import Control.Monad.Writer
import Control.Monad.Writer.Class
import Pipes
import qualified Pipes.Prelude as P
import qualified Data.Text as T

textSource :: Producer T.Text IO ()
textSource = yield "foo" >> yield "bar" >> yield "foo" >> yield "nah"

counter :: Monoid w => T.Text
        -> (T.Text -> w)
        -> Consumer T.Text (WriterT w IO) ()
counter word inject = P.filter (== word) >-> P.mapM (tell . inject) >-> P.drain

main :: IO ()
main = do
    result <- runWriterT $ runEffect $
        hoist lift textSource >->
        P.tee (counter "foo" inject1) >-> (counter "bar" inject2)
    putStrLn . show $ result
  where
    inject1 _ = (,) (Sum 1) mempty
    inject2 _ = (,) mempty (Sum 1)
Update: As mentioned in a comment, the real problem I see is that in pipes parsers aren't Consumers. And how can you run two parsers concurrently if they have different behaviours regarding leftovers? What happens if one of the parsers wants to "un-draw" some text and the other parser doesn't?
One possible solution is to run the parsers in a truly concurrent manner, in different threads. The primitives in the pipes-concurrency package let you "duplicate" a Producer by writing the same data to two different mailboxes. And then each parser can do whatever it wants with its own copy of the producer. Here's an example which also uses the pipes-parse, pipes-attoparsec and async packages:
{-# LANGUAGE OverloadedStrings #-}

import Data.Monoid
import qualified Data.Text as T
import Data.Attoparsec.Text hiding (takeWhile)
import Data.Attoparsec.Combinator
import Control.Applicative
import Control.Monad
import Control.Monad.State.Strict
import Pipes
import qualified Pipes.Prelude as P
import qualified Pipes.Attoparsec as P
import qualified Pipes.Concurrent as P
import qualified Control.Concurrent.Async as A

parseChars :: Char -> Parser [Char]
parseChars c = fmap mconcat $
    many (notChar c) *> many1 (some (char c) <* many (notChar c))

textSource :: Producer T.Text IO ()
textSource = yield "foo" >> yield "bar" >> yield "foo" >> yield "nah"

parseConc :: Producer T.Text IO ()
          -> Parser a
          -> Parser b
          -> IO (Either P.ParsingError a, Either P.ParsingError b)
parseConc producer parser1 parser2 = do
    (outbox1, inbox1, seal1) <- P.spawn' P.Unbounded
    (outbox2, inbox2, seal2) <- P.spawn' P.Unbounded
    feeding <- A.async $ runEffect $ producer >-> P.tee (P.toOutput outbox1)
                                              >-> P.toOutput outbox2
    sealing <- A.async $ A.wait feeding >> P.atomically seal1 >> P.atomically seal2
    r <- A.runConcurrently $
        (,) <$> (A.Concurrently $ parseInbox parser1 inbox1)
            <*> (A.Concurrently $ parseInbox parser2 inbox2)
    A.wait sealing
    return r
  where
    parseInbox parser inbox = evalStateT (P.parse parser) (P.fromInput inbox)

main :: IO ()
main = do
    (Right a, Right b) <- parseConc textSource (parseChars 'o') (parseChars 'a')
    putStrLn . show $ (a, b)
The result is:
("oooo","aa")
I'm not sure how much overhead this approach introduces.
I think something is wrong with the way you are going about this, for the reasons Davorak mentions in his remark. But if you really need such a function, you can define it.
import Pipes.Internal
import Pipes.Core

zipConsumers :: Monad m => Consumer a m r -> Consumer a m s -> Consumer a m (r, s)
zipConsumers p q = go (p, q)
  where
    go (p, q) = case (p, q) of
      (Pure r     , Pure s)      -> Pure (r, s)
      (M mpr      , ps)          -> M (do pr <- mpr
                                          return (go (pr, ps)))
      (pr         , M mps)       -> M (do ps <- mps
                                          return (go (pr, ps)))
      (Request _ f, Request _ g) -> Request () (\a -> go (f a, g a))
      (Request _ f, Pure s)      -> Request () (\a -> do r <- f a
                                                         return (r, s))
      (Pure r     , Request _ g) -> Request () (\a -> do s <- g a
                                                         return (r, s))
      (Respond x _, _)           -> closed x
      (_          , Respond y _) -> closed y
If you are 'zipping' consumers without using their return values, only their 'effects', you can just use tee consumer1 >-> consumer2.
The idiomatic solution is to rewrite your Consumers as a Fold or FoldM from the foldl library and then combine them using Applicative style. You can then convert this combined fold to one that works on pipes.
Let's assume that you either have two Folds:
fold1 :: Fold a r1
fold2 :: Fold a r2
... or two FoldMs:
foldM1 :: Monad m => FoldM m a r1
foldM2 :: Monad m => FoldM m a r2
Then you combine these into a single Fold/FoldM using Applicative style:
import Control.Applicative
foldBoth :: Fold a (r1, r2)
foldBoth = (,) <$> fold1 <*> fold2

foldBothM :: Monad m => FoldM m a (r1, r2)
foldBothM = (,) <$> foldM1 <*> foldM2

-- or: foldBoth  = liftA2 (,) fold1 fold2
--     foldBothM = liftA2 (,) foldM1 foldM2
You can turn either fold into a Pipes.Prelude-style fold or a Parser. Here are the necessary conversion functions:
import Control.Foldl (purely, impurely)
import qualified Pipes.Prelude as Pipes
import qualified Pipes.Parse as Parse
purely Pipes.fold
    :: Monad m => Fold a b -> Producer a m () -> m b

impurely Pipes.foldM
    :: Monad m => FoldM m a b -> Producer a m () -> m b

purely Parse.foldAll
    :: Monad m => Fold a b -> Parser a m b

impurely Parse.foldMAll
    :: Monad m => FoldM m a b -> Parser a m b
The reason for the purely and impurely functions is so that foldl and pipes can interoperate without either one incurring a dependency on the other. Also, they allow libraries other than pipes (like conduit) to reuse foldl without a dependency, too (hint hint, @MichaelSnoyman).
I apologize that this feature is not documented, mainly because it took me a while to figure out how to get pipes and foldl to interoperate in a dependency-free manner, and that was after I wrote the pipes tutorial. I will update the tutorial to point out this trick.
To learn how to use foldl, just read the documentation in the main module. It's a very small and easy-to-learn library.
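For instance, here is a minimal end-to-end sketch that combines L.sum and L.length from Control.Foldl applicatively and runs the pair over a pipes Producer in a single pass (the producer is just stand-in data):

import qualified Control.Foldl as L
import Control.Foldl (purely)
import Pipes
import qualified Pipes.Prelude as Pipes

sumAndLength :: L.Fold Int (Int, Int)
sumAndLength = (,) <$> L.sum <*> L.length

main :: IO ()
main = do
  r <- purely Pipes.fold sumAndLength (each [1 .. 10 :: Int])
  print r  -- (55,10), both results from one traversal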
For what it's worth, in the conduit world, the relevant function is zipSinks. There might be some way to adapt this function to work for pipes, but automatic termination may get in the way.
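For illustration, a minimal conduit sketch of the same idea (in current conduit versions the applicative ZipSink wrapper from Data.Conduit plays the role of zipSinks; sumC, lengthC and yieldMany are from the combinator interface):

import Conduit
import Data.Conduit (ZipSink (..))
import Data.Void (Void)

-- Feed one stream to two sinks and return both results.
sumAndLen :: Monad m => ConduitT Int Void m (Int, Int)
sumAndLen = getZipSink $ (,) <$> ZipSink sumC <*> ZipSink lengthC

main :: IO ()
main = print =<< runConduit (yieldMany [1 .. 10 :: Int] .| sumAndLen)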
Consumer forms a Monad, so
combineConsumers = liftM2 (,)
will type check. Unfortunately, the semantics might be unlike what you're expecting: the first consumer will run to completion and then the second.

How do I avoid memory problems when writing to file using the Writer monad?

I am building some moderately large DIMACS files; however, with the method used below the memory usage is rather large compared to the size of the files generated, and on some of the larger files I need to generate I run into out-of-memory problems.
import Control.Monad.State.Strict
import Control.Monad.Writer.Strict
import qualified Data.ByteString.Lazy.Char8 as B
import Control.Monad
import qualified Text.Show.ByteString as BS
import Data.List

main = printDIMACS "test.cnf" test

test = do
  xs <- freshs 100000
  forM_ (zip xs (tail xs))
        (\(x,y) -> addAll [[negate x, negate y],[x,y]])

type Var = Int
type Clause = [Var]

data DIMACSS = DS
  { nextFresh  :: Int
  , numClauses :: Int
  } deriving (Show)

type DIMACSM a = StateT DIMACSS (Writer B.ByteString) a

freshs :: Int -> DIMACSM [Var]
freshs i = do
  next <- gets nextFresh
  let toRet = [next..next+i-1]
  modify (\s -> s{nextFresh = next+i})
  return toRet

fresh :: DIMACSM Int
fresh = do
  i <- gets nextFresh
  modify (\s -> s{nextFresh = i+1})
  return i

addAll :: [Clause] -> DIMACSM ()
addAll c = do
  tell
    (B.concat .
     intersperse (B.pack " 0\n") .
     map (B.unwords . map BS.show) $ c)
  tell (B.pack " 0\n")
  modify (\s -> s{numClauses = numClauses s + length c})

add h = addAll [h]

printDIMACS :: FilePath -> DIMACSM a -> IO ()
printDIMACS file f = do
  writeFile file ""
  appendFile file (concat ["p cnf ", show i, " ", show j, "\n"])
  B.appendFile file b
  where
    (s,b) = runWriter (execStateT f (DS 1 0))
    i = nextFresh s - 1
    j = numClauses s
I would like to keep the monadic building of clauses since it is very handy, but I need to overcome the memory problem. How do I optimize the above program so that it doesn't use too much memory?
If you want good memory behavior, you need to make sure that you write out the clauses as you generate them, instead of collecting them in memory and dumping them all at once, either using laziness or a more explicit approach such as conduits, enumerators, pipes or the like.
The main obstacle to that approach is that the DIMACS format expects the number of clauses and variables in the header. This prevents the naive implementation from being sufficiently lazy. There are two possibilities:
The pragmatic one is to write the clauses first to a temporary location. After that the numbers are known, so you write them to the real file and append the contents of the temporary file.
The prettier approach is possible if the generation of clauses has no side effects (besides the effects offered by your DIMACSM monad) and is sufficiently fast: run it twice, first throwing away the clauses and just calculating the numbers; print the header line; then run the generator again, this time printing the clauses.
(This is from my experience with implementing SAT-Britney, where I took the second approach, because it fitted better with other requirements in that context.)
Also, in your code, addAll is not lazy enough: The list c needs to be retained even after writing (in the MonadWriter sense) the clauses. This is another space leak. I suggest you implement add as the primitive operation and then addAll = mapM_ add.
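A sketch of that suggestion, keeping the names from the question (the numClauses bookkeeping moves into add, so nothing retains the whole list):

add :: Clause -> DIMACSM ()
add h = do
  tell (B.unwords (map BS.show h))
  tell (B.pack " 0\n")
  modify (\s -> s{numClauses = numClauses s + 1})

addAll :: [Clause] -> DIMACSM ()
addAll = mapM_ add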
As explained in Joachim Breitner's answer, the problem was that DIMACSM was not lazy enough, both because the strict versions of the monads were used and because the number of variables and clauses are needed before the ByteString can be written to the file. The solution is to use the lazy versions of the monads and execute them twice. It turns out that it is also necessary to have WriterT be the outer monad:
import Control.Monad.State
import Control.Monad.Writer
...
type DIMACSM a = WriterT B.ByteString (State DIMACSS) a
...
printDIMACS :: FilePath -> DIMACSM a -> IO ()
printDIMACS file f = do
  writeFile file ""
  appendFile file (concat ["p cnf ", show i, " ", show j, "\n"])
  B.appendFile file b
  where
    s = execState (execWriterT f) (DS 1 0)
    b = evalState (execWriterT f) (DS 1 0)
    i = nextFresh s - 1
    j = numClauses s

Dealing with IO vs pure code in Haskell

I'm writing a shell script (my first non-example program in Haskell) which is supposed to list a directory, get every file size, do some string manipulation (pure code) and then rename some files. I'm not sure what I'm doing wrong, so two questions:
How should I arrange the code in such a program?
I have a specific issue as well: I get the following error. What am I doing wrong?
error:
Couldn't match expected type `[FilePath]'
       against inferred type `IO [FilePath]'
In the second argument of `mapM', namely `fileNames'
In a stmt of a 'do' expression:
    files <- (mapM getFileNameAndSize fileNames)
In the expression:
    do { fileNames <- getDirectoryContents;
         files <- (mapM getFileNameAndSize fileNames);
         sortBy cmpFilesBySize files }
code:
getFileNameAndSize fname = do (fname, (withFile fname ReadMode hFileSize))

getFilesWithSizes = do
  fileNames <- getDirectoryContents
  files <- (mapM getFileNameAndSize fileNames)
  sortBy cmpFilesBySize files
Your second, specific, problem is with the types of your functions. However, your first issue (not really a type thing) is the do statement in getFileNameAndSize. While do is used with monads, it's not a monadic panacea; it's actually implemented as some simple translation rules. The Cliff's Notes version (which isn't exactly right, thanks to some details involving error handling, but is close enough) is:
do a ≡ a
do a ; b ; c ... ≡ a >> do b ; c ...
do x <- a ; b ; c ... ≡ a >>= \x -> do b ; c ...
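Applying the third rule repeatedly to the asker's getFilesWithSizes, for instance, gives:

getFilesWithSizes =
  getDirectoryContents              >>= \fileNames ->
  mapM getFileNameAndSize fileNames >>= \files ->
  sortBy cmpFilesBySize files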
In other words, getFileNameAndSize is equivalent to the version without the do block, and so you can get rid of the do. This leaves you with
getFileNameAndSize fname = (fname, withFile fname ReadMode hFileSize)
We can find the type for this: since fname is the first argument to withFile, it has type FilePath; and hFileSize returns an IO Integer, so that's the type of withFile .... Thus, we have getFileNameAndSize :: FilePath -> (FilePath, IO Integer). This may or may not be what you want; you might instead want FilePath -> IO (FilePath,Integer). To change it, you can write any of
getFileNameAndSize_do fname = do size <- withFile fname ReadMode hFileSize
                                 return (fname, size)

getFileNameAndSize_fmap fname = fmap ((,) fname) $
                                withFile fname ReadMode hFileSize

-- With `import Control.Applicative ((<$>))`; (<$>) is a synonym for fmap.
getFileNameAndSize_fmap2 fname = ((,) fname)
                                 <$> withFile fname ReadMode hFileSize

-- With {-# LANGUAGE TupleSections #-} at the top of the file:
getFileNameAndSize_ts fname = (fname,) <$> withFile fname ReadMode hFileSize
Next, as KennyTM pointed out, you have fileNames <- getDirectoryContents; since getDirectoryContents has type FilePath -> IO [FilePath], you need to give it an argument (e.g. getFilesWithSizes dir = do fileNames <- getDirectoryContents dir ...). This is probably just a simple oversight.
Next, we come to the heart of your error: files <- (mapM getFileNameAndSize fileNames). I'm not sure why it gives you the precise error it does, but I can tell you what's wrong. Remember what we know about getFileNameAndSize. In your code, it returns a (FilePath, IO Integer). However, mapM is of type Monad m => (a -> m b) -> [a] -> m [b], and so mapM getFileNameAndSize is ill-typed. You want getFileNameAndSize :: FilePath -> IO (FilePath, Integer), like I implemented above.
Finally, we need to fix your last line. First of all, although you don't give it to us, cmpFilesBySize is presumably a function of type (FilePath, Integer) -> (FilePath, Integer) -> Ordering, comparing on the second element. This is really simple, though: using Data.Ord.comparing :: Ord a => (b -> a) -> b -> b -> Ordering, you can write this comparing snd, which has type Ord b => (a, b) -> (a, b) -> Ordering. Second, you need to return your result wrapped up in the IO monad rather than just as a plain list; the function return :: Monad m => a -> m a will do the trick.
Thus, putting this all together, you'll get
import System.IO (FilePath, withFile, IOMode(ReadMode), hFileSize)
import System.Directory (getDirectoryContents)
import Control.Applicative ((<$>))
import Data.List (sortBy)
import Data.Ord (comparing)

getFileNameAndSize :: FilePath -> IO (FilePath, Integer)
getFileNameAndSize fname = ((,) fname) <$> withFile fname ReadMode hFileSize

getFilesWithSizes :: FilePath -> IO [(FilePath, Integer)]
getFilesWithSizes dir = do fileNames <- getDirectoryContents dir
                           files     <- mapM getFileNameAndSize fileNames
                           return $ sortBy (comparing snd) files
This is all well and good, and will work fine. However, I might write it slightly differently. My version would probably look like this:
{-# LANGUAGE TupleSections #-}

import System.IO (FilePath, withFile, IOMode(ReadMode), hFileSize)
import System.Directory (getDirectoryContents)
import Control.Applicative ((<$>))
import Control.Monad ((<=<))
import Data.List (sortBy)
import Data.Ord (comparing)

preservingF :: Functor f => (a -> f b) -> a -> f (a, b)
preservingF f x = (x,) <$> f x
-- Or liftM2 (<$>) (,), but I am not entirely sure why.

fileSize :: FilePath -> IO Integer
fileSize fname = withFile fname ReadMode hFileSize

getFilesWithSizes :: FilePath -> IO [(FilePath, Integer)]
getFilesWithSizes = return . sortBy (comparing snd)
                <=< mapM (preservingF fileSize)
                <=< getDirectoryContents
(<=< is the monadic equivalent of ., the function composition operator.) First off: yes, my version is longer. However, I'd probably already have preservingF defined somewhere, making the two equivalent in length.* (I might even inline fileSize if it weren't used elsewhere.) Second, I like this version better because it involves chaining together simpler pure functions we've already written. While your version is similar, mine (I feel) is more streamlined and makes this aspect of things clearer.
So this is a bit of an answer to your first question of how to structure these things. I personally tend to lock my IO down into as few functions as possible—only functions which need to touch the outside world directly (e.g. main and anything which interacts with a file) get an IO. Everything else is an ordinary pure function (and is only monadic if it's monadic for general reasons, along the lines of preservingF). I then arrange things so that main, etc., are just compositions and chains of pure functions: main gets some values from IO-land; then it calls pure functions to fold, spindle, and mutilate the data; then it gets more IO values; then it operates more; etc. The idea is to separate the two domains as much as possible, so that the more compositional non-IO code is always free, and the black-box IO is only done precisely where necessary.
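A tiny sketch of that shape (the file names and the transformation are invented for illustration):

main :: IO ()
main = do
  input <- readFile "in.txt"     -- IO at the boundary
  let output = transform input   -- all the real work is pure
  writeFile "out.txt" output     -- IO at the boundary

-- An ordinary pure function, testable without touching the filesystem.
transform :: String -> String
transform = unlines . map reverse . lines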
Operators like <=< really help with writing code in this style, as they let you operate on functions which interact with monadic values (such as the IO-world) just as you would operate on normal functions. You should also look at Control.Applicative's function <$> liftedArg1 <*> liftedArg2 <*> ... notation, which lets you apply ordinary functions to any number of monadic (really Applicative) arguments. This is really nice for getting rid of spurious <-s and just chaining pure functions over monadic code.
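For example, here is a two-line sketch of that notation over IO:

pairOfLines :: IO (String, String)
pairOfLines = (,) <$> getLine <*> getLine  -- apply the pair constructor to two IO-produced Strings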
*: I feel like preservingF, or at least its sibling preserving :: (a -> b) -> a -> (a,b), should be in a package somewhere, but I've been unable to find either.
getDirectoryContents is a function. You should supply an argument to it, e.g.
fileNames <- getDirectoryContents "/usr/bin"
Also, the type of getFileNameAndSize is FilePath -> (FilePath, IO Integer), as you can check from ghci:
Prelude> :m + System.IO
Prelude System.IO> let getFileNameAndSize fname = do (fname, (withFile fname ReadMode hFileSize))
Prelude System.IO> :t getFileNameAndSize
getFileNameAndSize :: FilePath -> (FilePath, IO Integer)
But mapM requires the input function to return an IO value:
Prelude System.IO> :t mapM
mapM :: (Monad m) => (a -> m b) -> [a] -> m [b]
--                   ^^^^^^^^^^
You should change its type to FilePath -> IO (FilePath, Integer) to match the type.
getFileNameAndSize fname = do
  fsize <- withFile fname ReadMode hFileSize
  return (fname, fsize)
