I am trying to read the first line of each file in the current directory:
import System.IO(IOMode(ReadMode), withFile, hGetLine)
import System.Directory (getDirectoryContents, doesFileExist, getFileSize)
import System.FilePath ((</>))
import Control.Monad(filterM)
readFirstLine :: FilePath -> IO String
readFirstLine fp = withFile fp ReadMode System.IO.hGetLine
getAbsoluteDirContents :: String -> IO [FilePath]
getAbsoluteDirContents dir = do
  contents <- getDirectoryContents dir
  return $ map (dir </>) contents
main :: IO ()
main = do
  -- get a list of all files & dirs
  contents <- getAbsoluteDirContents "."
  -- filter out dirs
  files <- filterM doesFileExist contents
  -- read first line of each file
  d <- mapM readFirstLine files
  print d
It compiles and runs, but it aborts with the following error when it hits a binary file:
mysrcfile: ./aBinaryFile: hGetLine: invalid argument (invalid byte sequence)
I want to detect and skip such files and move on to the next one.
A binary file is a file that contains byte sequences that cannot be decoded to a valid string. But without inspecting its content, a binary file is indistinguishable from a text file.
It might be better to use an "It's Easier to Ask Forgiveness than Permission (EAFP)" approach: we try to read the first line, and if that fails, we ignore the output.
import Control.Exception(catch, IOException)
import System.IO(IOMode(ReadMode), withFile, hGetLine)
readFirstLine :: FilePath -> IO (Maybe String)
readFirstLine fp = withFile fp ReadMode $
  \h -> (catch (fmap Just (hGetLine h))
               ((const :: a -> IOException -> a) (return Nothing)))
For a FilePath this returns an IO (Maybe String). If we run the IO (Maybe String), it will return Just x, with x the first line, if the line can be read, and Nothing if an IOException was encountered.
We can then make use of catMaybes :: [Maybe a] -> [a] to obtain the Just xs:
import Data.Maybe(catMaybes)
main :: IO ()
main = do
  -- get a list of all files & dirs
  contents <- getAbsoluteDirContents "."
  -- filter out dirs
  files <- filterM doesFileExist contents
  -- read first line of each file
  d <- mapM readFirstLine files
  print (catMaybes d)
Or you can make use of mapMaybeM :: Monad m => (a -> m (Maybe b)) -> [a] -> m [b] from the extra package on Hackage, which automates that work for you.
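For example, a sketch of main using mapMaybeM, assuming the getAbsoluteDirContents helper above and the Maybe-returning readFirstLine:
import Control.Monad (filterM)
import Control.Monad.Extra (mapMaybeM)
import System.Directory (doesFileExist)

main :: IO ()
main = do
  contents <- getAbsoluteDirContents "."
  files <- filterM doesFileExist contents
  -- unreadable (e.g. binary) files simply contribute no entry to the result
  firstLines <- mapMaybeM readFirstLine files
  print firstLines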
I'm fairly new to Haskell and I'm working on an existing code base that collects files from file shares. To parallelize the processing of the file shares, Conduit is used. The scaffold is based on this tutorial. To continuously read the file shares I added a delay and a recursive call to the streamFile function. I am not sure whether this is the problem, but the memory usage increases constantly, up to several gigabytes.
What could be the problem that causes the memory leak?
module FileScraper(runFileScraperFinal, FileScraper, watch, watchDirectories) where
import Actions (PostProcAction)
import Colog (LogAction, Msg, Severity)
import Conduit (ConduitM, ConduitT, MonadIO (..), MonadResource, MonadTrans (lift), MonadUnliftIO (withRunInIO), ResourceT, await, bracketP, mapMC, mapM_C, runConduit, runResourceT, yield, (.|), takeWhileC)
import Control.Concurrent (threadDelay)
import qualified Control.Concurrent.Async as Async
import qualified Control.Concurrent.STM as STM
import qualified Control.Concurrent.STM.TBMQueue as STM
import Data.ByteString (ByteString, readFile)
import Data.Conduit.Combinators (filterM, yieldMany)
import Data.Functor ((<&>))
import Data.Text (Text, unpack)
import Filters (FileFilter, DirectoryFilter)
import Polysemy (Final, Inspector (inspect), Member, Sem, makeSem)
import Polysemy.Final (bindS, getInitialStateS, getInspectorS, interpretFinal, liftS)
import Prelude hiding (filter, init, readFile)
import System.FilePath.Find (find, RecursionPredicate, (/~?), filePath, (&&?), (==?), fileType, FileType (RegularFile), always)
import System.Posix (raiseSignal, sigTERM)
data FileScraper m a where
Watch :: [(Text, Text, FileFilter, DirectoryFilter, PostProcAction)] -> (FilePath -> ByteString -> Text -> PostProcAction -> m Bool) -> FileScraper m ()
makeSem ''FileScraper
runFileScraperFinal :: forall m. (MonadUnliftIO m => forall r a. (Member (Final m) r) => LogAction m (Msg Severity) -> Sem (FileScraper ': r) a -> Sem r a)
runFileScraperFinal _ = do
interpretFinal @m (\case
Watch sources callback -> do
is <- getInitialStateS
ins <- getInspectorS
cb' <- bindS $ uncurry4 callback
liftS $ withRunInIO $ \runInIO -> liftIO $ do
runResourceT . runConduit $ watchDirectories sources .| mapMC (\(fp,fc,dest,ppa) -> lift $ do
eff <- runInIO $ cb' ((fp,fc,dest,ppa) <$ is)
case inspect ins eff of
Nothing -> do
raiseSignal sigTERM
pure False
Just v -> do
pure v
) .| takeWhileC id .| mapM_C (const $ pure ())
)
uncurry4 :: (a -> b -> c -> d -> e) -> ((a, b, c, d) -> e)
uncurry4 f ~(a,b,c,d) = f a b c d
watchDirectories :: MonadResource m => [(Text, Text, FileFilter, DirectoryFilter, PostProcAction)] -> ConduitM a (FilePath, ByteString, Text, PostProcAction) m ()
watchDirectories sourceToFilterMap = parSources (fmap (\(src, dest, filter, dirFilter, postProcActions) -> streamFile (unpack src) dest filter dirFilter postProcActions) sourceToFilterMap)
streamFile :: MonadResource m => FilePath -> Text -> FileFilter -> DirectoryFilter -> PostProcAction -> ConduitM a (FilePath, ByteString, Text, PostProcAction) m ()
streamFile baseDir destination filter dirFilter postProcActions = do
newFiles <- liftIO $ find (recursionPredicate dirFilter) (fileType ==? RegularFile) baseDir
yieldMany newFiles .| filterM (liftIO . filter) .| mapMC (\entry -> do
liftIO $ readFile entry <&> (entry,,destination,postProcActions))
let minutes :: Int = 60_000_000
liftIO $ threadDelay (5 * minutes)
streamFile baseDir destination filter dirFilter postProcActions
where
recursionPredicate :: DirectoryFilter -> RecursionPredicate
recursionPredicate df = case df of
[] -> always
excludes -> foldl1 (&&?) $ map ((/~?) filePath . unpack) excludes
parSources :: (MonadResource m, Foldable f) => f (ConduitM () o (ResourceT IO) ()) -> ConduitT i o m ()
parSources sources = bracketP init cleanup finalSource
where
init = do
-- create the queue where all sources will put their items
queue <- STM.newTBMQueueIO 100
-- In a separate thread, run concurrently all conduits
a <- Async.async $ do
Async.mapConcurrently_ (\source -> runResourceT $ runConduit (source .| sinkQueue queue)) sources
-- once all conduits are done, close the queue
STM.atomically (STM.closeTBMQueue queue)
pure (a, queue)
cleanup (async, queue) = do
-- upon exception or cancellation, close the queue and cancel the threads
STM.atomically (STM.closeTBMQueue queue)
Async.cancel async
finalSource (_, queue) = sourceQueue queue
sourceQueue :: MonadIO m => STM.TBMQueue o -> ConduitT i o m ()
sourceQueue queue = do
mbItem <- liftIO $ STM.atomically (STM.readTBMQueue queue)
case mbItem of
Nothing -> pure () -- queue closed
Just item -> yield item *> sourceQueue queue
sinkQueue :: MonadIO m => STM.TBMQueue a -> ConduitT a o m ()
sinkQueue queue = do
mbItem <- await
case mbItem of
Nothing -> pure () -- no more items to come
Just item -> do
liftIO $ STM.atomically (STM.writeTBMQueue queue item)
sinkQueue queue
Update (Added function that uses the callback):
...
void $ async $ watch normalisedPrefixedSources (\fp content dest ppa -> do
log Info $ "Sending file " <> pack fp
result <- await =<< send (unpack dest) content
case result of
Just True -> do
log Info $ "File sent " <> pack fp
res <- embed @m $ liftIO $ ppa fp
if res then pure True else do
log Error "Raise signal for graceful shutdown."
embed @m $ liftIO $ raiseSignal sigTERM
pure False
_ -> do
log Error $ "Error sending file " <> pack fp <> ". Raise signal for graceful shutdown."
embed #m $ liftIO $ raiseSignal sigTERM
pure False
)
...
Update 2:
After removing the idempotent filter from the configuration (with the changes from K. A. Buhr's answer still in place), the memory consumption is constant.
type FileFilter = FilePath -> IO Bool
createIdempotentFilter :: LogAction IO Message -> M.Idempotent -> IO FileFilter
createIdempotentFilter la filterConfig = do
cache <- newIORef []
let configuredCacheSize :: Int = fromIntegral $ M.lruCacheSize filterConfig
pure $ \path -> do
fileModificationEpoch <- getModificationTime path
cache' <- readIORef cache
if (path, fileModificationEpoch) `elem` cache' then do
la <& logText Debug ("File already in cache " <> pack path <> " | " <> pack (show fileModificationEpoch))
pure False
else do
la <& logText Debug ("File not in cache " <> pack path <> " | " <> pack (show fileModificationEpoch))
let alreadyScanned' = cache' <> [(path, fileModificationEpoch)]
writeIORef cache $ drop (length alreadyScanned' - configuredCacheSize) alreadyScanned'
pure True
Is there any problematic code in createIdempotentFilter that could cause a memory leak?
First, make sure you rule out the ByteStrings of file contents as a source of the leak. You will have a maximum number of files in flight equal to the length of the bounded queue, and so your high watermark will be the contents of some arbitrary collection of 100 files from the input filesystems. If you're processing filesystems with large video/image files, you could see erratic, transient spikes from that. Also, if your callback is holding references to the pathnames and/or contents of (some or all of) those files, you'll see a very severe space leak as a result. Rule all this out by replacing readFile entry with return mempty and using a null callback (\_ _ _ _ -> return True).
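Concretely, a hedged sketch of that diagnostic change inside the question's streamFile pipeline (same names as the question; pure mempty stands in for the real read):
-- leak diagnosis only: keep the pipeline shape but drop the file contents
yieldMany newFiles .| filterM (liftIO . filter) .| mapMC (\entry ->
  pure (entry, mempty, destination, postProcActions))
At the watch call site, the corresponding null callback is \_ _ _ _ -> pure True.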
After making a similar change myself, I was able to duplicate your space leak and tracked it down to two technical issues.
The first was:
.| takeWhileC id .| mapM_C (const $ pure ())
Replacing this with:
.| Control.Monad.void andC
reduced the maximum residency for a single pass through a test filesystem from 130MB to 15MB, but still with a characteristic linear increase in heap usage on a heap profile.
The second was:
yield item *> sourceQueue queue
Replacing this with:
yield item >> sourceQueue queue
removed the leak entirely. Maximum residency was only 2MB, and there was no discernible leak on a heap profile for multiple passes through the test filesystem.
I'm not exactly sure what's going on here, for either issue. The *> versus >> issue is a problem I've seen before. While these are semantically equivalent, they don't necessarily have the same implementation, and sometimes *> leaks space where >> doesn't. However, the takeWhileC problem is a mystery to me.
I have code that reads files and parses them using the UU.Parsing library, which returns an abstract syntax tree that is then shown on the screen.
I received the error message "No instance for Show". It originated in my functions tokensParserToByteString and applyParser, which use parseIO (from the UU.Parsing library), and it propagated through the signatures up to main. I fixed the signatures, but my problem is now in the main function: I added the Show constraint to its signature, but I get the following compilation error:
No instance for (Show (IO J2s)) arising from a use of ‘main’
In the expression: main
When checking the type of the IO action ‘main’
The complete error message is:
$ cabal build
Building java2scala-1.0...
Preprocessing library java2scala-1.0...
In-place registering java2scala-1.0...
Preprocessing executable 'java2scala' for java2scala-1.0...
Preprocessing executable 'test' for java2scala-1.0...
[5 of 5] Compiling Main ( test/Main.hs, dist/build/test/test-tmp/Main.o )
test/Main.hs:27:1:
No instance for (Show (IO J2s)) arising from a use of ‘main’
In the expression: main
When checking the type of the IO action ‘main’
Any idea about the problem?
Main module
{-# LANGUAGE FlexibleContexts #-}
module Main where
import UU.Parsing
...
import Content
main :: (Show (IO J2s)) => IO()
main = do
  f <- getLine
  let command = test f
  command
test :: (Show (IO J2s)) => String -> IO()
test "testparser" = testParser
Test module
{-# LANGUAGE FlexibleContexts #-}
module J2s.Parser.Test where
import Content
import J2s.Ast.Sintax
import J2s.Parser
import UU.Parsing
...
testParser :: (Show (IO J2s)) => IO()
testParser = (runSafeIO $ runProxy $ runEitherK $
contentsRecursive "path/of/my/tests" />/ handlerParser) :: (Show (IO J2s)) => IO()
Content module
{-# LANGUAGE FlexibleContexts #-}
module Content where
import Control.Monad(forM, liftM)
import System.Directory (doesDirectoryExist, getDirectoryContents)
import System.FilePath ((</>), splitExtension, splitFileName)
import J2s.Parser
import J2s.Ast.Sintax
import UU.Parsing
import Control.Monad (when, unless)
import Control.Proxy
import Control.Proxy.Safe hiding (readFileS)
import J2s.Scanner.Token
import Text.Show
import UU.Parsing
contentsRecursive
:: (CheckP p)
=> FilePath -> () -> Producer (ExceptionP p) FilePath SafeIO ()
contentsRecursive path () = loop path
where
loop path = do
contents path () //> \newPath -> do
respond newPath
isDir <- tryIO $ doesDirectoryExist newPath
let isChild = not $ takeFileName newPath `elem` [".", ".."]
when (isDir && isChild) $ loop newPath
applyParser :: (Proxy p, Show (IO J2s)) => String -> Consumer p B.ByteString IO ()
applyParser path = runIdentityP loop
where
loop = do
bs <- request ()
let sc = classify (initPos path) (B8.unpack bs)
lift $ B8.putStrLn (tokensParserToByteString sc)
tokensParserToByteString :: (Show (IO J2s)) => [Token] -> B.ByteString
tokensParserToByteString tokens = B8.pack(show (parseIO pJ2s tokens))
handlerParser :: (CheckP p, Show (IO J2s)) => FilePath -> Session (ExceptionP p) SafeIO ()
handlerParser path = do
canRead <- tryIO $ fmap readable $ getPermissions path
isDir <- tryIO $ doesDirectoryExist path
isValidExtension <- tryIO $ evaluate ((snd (splitExtension path) == ".java" || snd (splitExtension path) == ".mora") && (snd (splitFileName path) /= "EncodeTest.java") && (snd (splitFileName path) /= "T6302184.java") && (snd (splitFileName path) /= "Unmappable.java"))
when (not isDir && canRead && isValidExtension) $
(readFileSP 10240 path >-> try . applyParser) path
readFileSP
:: (CheckP p)
=> Int -> FilePath -> () -> Producer (ExceptionP p) B.ByteString SafeIO ()
readFileSP chunkSize path () =
bracket id (openFile path ReadMode) hClose $ \handle -> do
let loop = do
eof <- tryIO $ hIsEOF handle
unless eof $ do
bs <- tryIO $ B.hGetSome handle chunkSize
respond bs
loop
loop
A signature like Show (IO J2s) => IO () almost never makes sense. What this expresses is basically “provided the universe is crafted such that IO J2s has a Show instance, I give you an IO () action”. Well, if the universe has that property, then just give us the IO () action right away, and keep the nasty constraint!
Constraints only really make sense if you apply them to type variables, i.e. if you're writing code that's polymorphic over several different, but not all, types (like with CheckP p). A constraint applied to a concrete type does little more than defer type errors.
IO J2s has no Show instance. And it can't have such an instance: this is an IO action. It could be a complete subprogram that might execute costly computations, call commercial third-party library code, launch some missiles... and only in the very end return a J2s value. How do you expect to pack all the information of something so possibly complex into a simple string?
What possibly does have a Show instance is J2s. If you're in the IO monad anyway and have an IO J2s action, you can at any point fetch the J2s value from it by monad-binding that action (i.e. executing the subprogram) and just showing the J2s value. In your case:
tokensParserToByteString :: [Token] -> IO B.ByteString
tokensParserToByteString tokens = fmap (B8.pack . show) $ parseIO pJ2s tokens
In case you're confused about fmapping in the IO functor, this is equivalent to
tokensParserToByteString :: [Token] -> IO B.ByteString
tokensParserToByteString tokens = do
  j2sValue <- parseIO pJ2s tokens
  return . B8.pack $ show j2sValue
Of course you then need to adapt applyParser because tokensParserToByteString is now an IO action. Easy enough with the =<< operator:
applyParser :: Proxy p => String -> Consumer p B.ByteString IO ()
applyParser path = runIdentityP loop
  where
    loop = do
      bs <- request ()
      let sc = classify (initPos path) (B8.unpack bs)
      lift $ B8.putStrLn =<< tokensParserToByteString sc
Is it possible to split a Shell in the Turtle library (Haskell) and do different things with each branch, such that the original Shell is only run once?
           /---- shell2
 shell1 --<
           \---- shell3
For instance, how to do
do
  let lstmp = lstree "/tmp"
  view lstmp
  view $ do
    path <- lstmp
    x <- liftIO $ testdir path
    return x
such that lstree "/tmp" would only run once.
Specifically I would like to send Shell 2 and Shell 3 to different files using output.
You won't be able to split a Shell into two separate shells that run simultaneously, unless there's some magic I don't know about. But file writing is a fold over the contents of a shell (or some other succession of things), and it is built into turtle that you can always combine many folds and run them simultaneously using the Control.Foldl material:
foldIO :: Shell a -> FoldM IO a r -> IO r -- specializing
A shell is secretly a FoldM IO a r -> IO r under the hood anyway, so this is basically runShell. To do this we need to get the right Shell and the right combined FoldM IO. The whole idea of the Fold a b and FoldM m a b types from the foldl package is simultaneous folding.
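As a tiny illustration of that simultaneous-folding idea (pure folds here, but FoldM values combine the same way through their Applicative instance):
import qualified Control.Foldl as L

-- sum and length are computed in a single traversal of the input
sumAndLength :: [Double] -> (Double, Int)
sumAndLength xs = L.fold ((,) <$> L.sum <*> L.length) xs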
I think the easiest way to get the right shell is just to make the lstree fold return a FilePath together with the result of testdir. You basically wrote this:
withDirInfo :: FilePath -> Shell (Bool, FilePath)
withDirInfo tmp = do
  let lstmp = lstree tmp
  path <- lstmp
  bool <- liftIO $ testdir path
  return (bool, path)
So now we can get a Shell (Bool, FilePath) from /tmp. This has all the information our two folds will need, and thus all that our combined fold will need.
Next we might write a helper fold that writes a FilePath (formatted as Text) to a given handle:
sinkFilePaths :: Handle -> FoldM IO FilePath ()
sinkFilePaths handle = L.sink (T.hPutStrLn handle . format fp)
Then we can use this Handle -> FoldM IO FilePath () to define two FoldM IO (Bool, FilePath) (). Each will write different things to a different handle, and we can unite them into a single simultaneous fold with <*. This is an independent FoldM IO ... and could be applied, e.g., to a pure list of type [(Bool, FilePath)] with L.foldM, and it would write different things from the list to the different handles. In our case, though, we will apply it to the Shell (Bool, FilePath) we defined.
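For instance, a minimal sketch (with a hypothetical log file) that unites two effectful folds with <* and runs them over a plain list in one pass, assuming import qualified Control.Foldl as L:
demo :: IO ()
demo = L.foldM (L.sink print <* L.sink (appendFile "log.txt" . (++ "\n") . show)) [1 .. 5 :: Int]
-- each element is both printed and appended to log.txt, in a single traversal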
The only subtle part of this is the use of L.handlesM so that both folds see only the second element of each pair, and so that the directories-only fold also keeps just the pairs whose first element is True. This uses the _2 lens and filtered from the lens libraries. It could probably be simplified, but see what you think:
{-#LANGUAGE OverloadedStrings #-}
import Turtle
import qualified Control.Foldl as L
import qualified System.IO as IO
import Control.Lens (_2,filtered)
import qualified Data.Text.IO as T
main = IO.withFile "tmpfiles.txt" IO.WriteMode $ \h ->
IO.withFile "tmpdirs.txt" IO.WriteMode $ \h' -> do
foldIO (withDirInfo "/tmp") (sinkFilesDirs h h')
withDirInfo :: Turtle.FilePath -> Shell (Bool, Turtle.FilePath)
withDirInfo tmp = do
  let lstmp = lstree tmp
  path <- lstmp
  bool <- liftIO $ testdir path
  return (bool, path)
sinkFilePaths :: Handle -> FoldM IO Turtle.FilePath ()
sinkFilePaths handle = L.sink (T.hPutStrLn handle . format fp)
sinkFilesDirs :: Handle -> Handle -> FoldM IO (Bool, Turtle.FilePath) ()
sinkFilesDirs h h' = allfiles <* alldirs where
  allfiles :: L.FoldM IO (Bool, Turtle.FilePath) ()
  allfiles = L.handlesM _2 (sinkFilePaths h)
  -- handle the second element of pairs with sinkFilePaths
  alldirs :: FoldM IO (Bool, Turtle.FilePath) ()
  alldirs = L.handlesM (filtered (\(bool,file) -> bool) . _2) (sinkFilePaths h')
  -- handle the second element of pairs where the first element
  -- is true using sinkFilePaths
It sounds like you're looking for something like async to split off your shells from the first shell and then wait for them to return. async is a pretty capable library that can achieve much more than the below example, but it provides a pretty simple solution to what you're asking for:
import Control.Concurrent.Async
import Turtle.Shell
import Turtle.Prelude
main :: IO ()
main = do
  let lstmp1 = lstree "/tmp"
  let lstmp2 = lstree "/etc"
  view lstmp1
  view lstmp2
  job1 <- async $ view $ do
    path <- lstmp1
    x <- liftIO $ testdir path
    return x
  job2 <- async $ view $ do
    path <- lstmp2
    x <- liftIO $ testdir path
    return x
  wait job1
  wait job2
Is this what you're looking for?
I currently have this code which will perform the main' function on each of the filenames in the list files.
Ideally I have been trying to combine main and main' but I haven't made much progress. Is there a better way to simplify this or will I need to keep them separate?
{- Start here -}
main :: IO [()]
main = do
  files <- getArgs
  mapM main' files
{- Main's helper function -}
main' :: FilePath -> IO ()
main' file = do
  contents <- readFile file
  case (runParser parser 0 file $ lexer contents) of
    Left err -> print err
    Right xs -> putStr xs
Thanks!
Edit: As most of you are suggesting, I was trying a lambda abstraction for this but wasn't getting it right. I should have specified this above. With the examples I can see it better now.
The Control.Monad library defines the function forM, which is mapM with its arguments reversed. That makes it easier to use in your situation, i.e.
main :: IO ()
main = do
  files <- getArgs
  forM_ files $ \file -> do
    contents <- readFile file
    case (runParser parser 0 file $ lexer contents) of
      Left err -> print err
      Right xs -> putStr xs
The version with the underscore at the end of the name is used when you are not interested in the resulting list (like in this case), so main can simply have the type IO (). (mapM has a similar variant called mapM_).
You can use forM, which equals flip mapM, i.e. mapM with its arguments flipped, like this:
forM_ files $ \file -> do
  contents <- readFile file
  ...
Also notice that I used forM_ instead of forM. This is more efficient when you are not interested in the result of the computation.
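For illustration, a self-contained sketch of the same pattern, with putStrLn standing in for the question's parsing action:
import Control.Monad (forM_)
import System.Environment (getArgs)

main :: IO ()
main = do
  files <- getArgs
  -- forM_ files action is the same as mapM_ action files
  forM_ files $ \file -> putStrLn ("processing " ++ file)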
I'm writing a shell script (my first non-example program in Haskell) which is supposed to list a directory, get every file's size, do some string manipulation (pure code) and then rename some files. I'm not sure what I'm doing wrong, so two questions:
How should I arrange the code in such a program?
I have a specific issue: I get the following error. What am I doing wrong?
error:
Couldn't match expected type `[FilePath]'
against inferred type `IO [FilePath]'
In the second argument of `mapM', namely `fileNames'
In a stmt of a 'do' expression:
files <- (mapM getFileNameAndSize fileNames)
In the expression:
do { fileNames <- getDirectoryContents;
files <- (mapM getFileNameAndSize fileNames);
sortBy cmpFilesBySize files }
code:
getFileNameAndSize fname = do (fname, (withFile fname ReadMode hFileSize))
getFilesWithSizes = do
  fileNames <- getDirectoryContents
  files <- (mapM getFileNameAndSize fileNames)
  sortBy cmpFilesBySize files
Your second, specific, problem is with the types of your functions. However, your first issue (not really a type thing) is the do statement in getFileNameAndSize. While do is used with monads, it's not a monadic panacea; it's actually implemented as some simple translation rules. The Cliff's Notes version (which isn't exactly right, thanks to some details involving error handling, but is close enough) is:
do a ≡ a
do a ; b ; c ... ≡ a >> do b ; c ...
do x <- a ; b ; c ... ≡ a >>= \x -> do b ; c ...
In other words, getFileNameAndSize is equivalent to the version without the do block, and so you can get rid of the do. This leaves you with
getFileNameAndSize fname = (fname, withFile fname ReadMode hFileSize)
We can find the type for this: since fname is the first argument to withFile, it has type FilePath; and hFileSize returns an IO Integer, so that's the type of withFile .... Thus, we have getFileNameAndSize :: FilePath -> (FilePath, IO Integer). This may or may not be what you want; you might instead want FilePath -> IO (FilePath,Integer). To change it, you can write any of
getFileNameAndSize_do fname = do
  size <- withFile fname ReadMode hFileSize
  return (fname, size)

getFileNameAndSize_fmap fname = fmap ((,) fname) $
  withFile fname ReadMode hFileSize

-- With `import Control.Applicative ((<$>))`, which is a synonym for fmap.
getFileNameAndSize_fmap2 fname = ((,) fname)
  <$> withFile fname ReadMode hFileSize

-- With {-# LANGUAGE TupleSections #-} at the top of the file
getFileNameAndSize_ts fname = (fname,) <$> withFile fname ReadMode hFileSize
Next, as KennyTM pointed out, you have fileNames <- getDirectoryContents; since getDirectoryContents has type FilePath -> IO [FilePath], you need to give it an argument (e.g. getFilesWithSizes dir = do fileNames <- getDirectoryContents dir ...). This is probably just a simple oversight.
Next, we come to the heart of your error: files <- (mapM getFileNameAndSize fileNames). I'm not sure why it gives you the precise error it does, but I can tell you what's wrong. Remember what we know about getFileNameAndSize. In your code, it returns a (FilePath, IO Integer). However, mapM is of type Monad m => (a -> m b) -> [a] -> m [b], and so mapM getFileNameAndSize is ill-typed. You want getFileNameAndSize :: FilePath -> IO (FilePath,Integer), like I implemented above.
Finally, we need to fix your last line. First of all, although you don't give it to us, cmpFilesBySize is presumably a function of type (FilePath, Integer) -> (FilePath, Integer) -> Ordering, comparing on the second element. This is really simple, though: using Data.Ord.comparing :: Ord a => (b -> a) -> b -> b -> Ordering, you can write this as comparing snd, which has type Ord b => (a, b) -> (a, b) -> Ordering. Second, you need to return your result wrapped up in the IO monad rather than just as a plain list; the function return :: Monad m => a -> m a will do the trick.
Thus, putting this all together, you'll get
import System.IO (FilePath, withFile, IOMode(ReadMode), hFileSize)
import System.Directory (getDirectoryContents)
import Control.Applicative ((<$>))
import Data.List (sortBy)
import Data.Ord (comparing)
getFileNameAndSize :: FilePath -> IO (FilePath, Integer)
getFileNameAndSize fname = ((,) fname) <$> withFile fname ReadMode hFileSize
getFilesWithSizes :: FilePath -> IO [(FilePath,Integer)]
getFilesWithSizes dir = do
  fileNames <- getDirectoryContents dir
  files <- mapM getFileNameAndSize fileNames
  return $ sortBy (comparing snd) files
This is all well and good, and will work fine. However, I might write it slightly differently. My version would probably look like this:
{-# LANGUAGE TupleSections #-}
import System.IO (FilePath, withFile, IOMode(ReadMode), hFileSize)
import System.Directory (getDirectoryContents)
import Control.Applicative ((<$>))
import Control.Monad ((<=<))
import Data.List (sortBy)
import Data.Ord (comparing)
preservingF :: Functor f => (a -> f b) -> a -> f (a,b)
preservingF f x = (x,) <$> f x
-- Or liftM2 (<$>) (,), but I am not entirely sure why.
fileSize :: FilePath -> IO Integer
fileSize fname = withFile fname ReadMode hFileSize
getFilesWithSizes :: FilePath -> IO [(FilePath,Integer)]
getFilesWithSizes = return . sortBy (comparing snd)
                <=< mapM (preservingF fileSize)
                <=< getDirectoryContents
(<=< is the monadic equivalent of ., the function composition operator.) First off: yes, my version is longer. However, I'd probably already have preservingF defined somewhere, making the two equivalent in length.* (I might even inline fileSize if it weren't used elsewhere.) Second, I like this version better because it involves chaining together simpler pure functions we've already written. While your version is similar, mine (I feel) is more streamlined and makes this aspect of things clearer.
So this is a bit of an answer to your first question of how to structure these things. I personally tend to lock my IO down into as few functions as possible: only functions which need to touch the outside world directly (e.g. main and anything which interacts with a file) get an IO. Everything else is an ordinary pure function (and is only monadic if it's monadic for general reasons, along the lines of preservingF). I then arrange things so that main, etc., are just compositions and chains of pure functions: main gets some values from IO-land; then it calls pure functions to fold, spindle, and mutilate the data; then it gets more IO values; then it operates more; etc. The idea is to separate the two domains as much as possible, so that the more compositional non-IO code is always free, and the black-box IO is only done precisely where necessary.
Operators like <=< really help with writing code in this style, as they let you operate on functions which interact with monadic values (such as the IO-world) just as you would operate on normal functions. You should also look at Control.Applicative's function <$> liftedArg1 <*> liftedArg2 <*> ... notation, which lets you apply ordinary functions to any number of monadic (really Applicative) arguments. This is really nice for getting rid of spurious <-s and just chaining pure functions over monadic code.
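For example, a small hypothetical helper in that style, reusing fileSize from the listing above: (+) is an ordinary pure function, and <$> / <*> lift it over the two IO results.
totalSize :: FilePath -> FilePath -> IO Integer
totalSize f g = (+) <$> fileSize f <*> fileSize g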
*: I feel like preservingF, or at least its sibling preserving :: (a -> b) -> a -> (a,b), should be in a package somewhere, but I've been unable to find either.
getDirectoryContents is a function. You should supply an argument to it, e.g.
fileNames <- getDirectoryContents "/usr/bin"
Also, the type of getFileNameAndSize is FilePath -> (FilePath, IO Integer), as you can check from ghci:
Prelude> :m + System.IO
Prelude System.IO> let getFileNameAndSize fname = do (fname, (withFile fname ReadMode hFileSize))
Prelude System.IO> :t getFileNameAndSize
getFileNameAndSize :: FilePath -> (FilePath, IO Integer)
But mapM requires the input function to return a monadic value (here, IO something):
Prelude System.IO> :t mapM
mapM :: (Monad m) => (a -> m b) -> [a] -> m [b]
-- # ^^^^^^^^
You should change its type to FilePath -> IO (FilePath, Integer) to match what mapM expects:
getFileNameAndSize fname = do
  fsize <- withFile fname ReadMode hFileSize
  return (fname, fsize)