Understanding `withFile` with Example - haskell

I implemented withFile in Haskell:
withFile' :: FilePath -> IOMode -> (Handle -> IO a) -> IO a
withFile' path iomode f = do
handle <- openFile path iomode
result <- f handle
hClose handle
return result
When I ran the main provided by Learn You a Haskell, it printed out the content of "girlfriend.txt," as expected:
import System.IO
main = do
withFile' "girlfriend.txt" ReadMode (\handle -> do
contents <- hGetContents handle
putStr contents)
I wasn't sure if my withFile' would've worked with the last 2 lines: (1) close the handle and (2) returning the result as anIO a.
Why didn't the following happen?
result gets lazily bound to f handle
hClose handle closes the file handle
result gets return'd, which results in the actual evaluate of f handle. Since handle was closed, an error gets thrown.

Lazy IO is popularly known as confusing.
It depends on whether putStr executes before hClose or not.
Notice the difference between the first and second uses (the brackets are unnecessary but clarifying in the second example).
ghci> withFile' "temp.hs" ReadMode (hGetContents >=> putStr) -- putStr
import System.IO
import Control.Monad
withFile' :: FilePath -> IOMode -> (Handle -> IO a) -> IO a
withFile' path iomode f = do
handle <- openFile path iomode
result <- f handle
hClose handle
return result
ghci> (withFile' "temp.hs" ReadMode hGetContents) >>= putStr
ghci>
In both cases, the f passed in gets a chance to run before the handle is closed. Because of lazy evaluation, hGetContents only reads the file if it needs to, i.e. is forced to in order to produce output for some other function.
In the first example, since f is (hGetContents >=> putStr), the full contents of the file must be read in order to execute putStr.
In the second example, nothing needs to be evaluated after hGetContents in order to return result, which is a lazy list. (I can quite happily return (show [1..]) which will only fail to terminate if I choose to use the entire output.) This is seen as a problem for lazy IO, which is fixed by alternatives such as strict IO, pipes or conduit.
Maybe returning the empty string for a file when the handle was closed prematurely is a bug, but certainly running the entirety of f before closing it is not.

Equational reasoning means that you can reason about Haskell code by just inlining and substituting things (with certain caveats, but they don't apply here).
This means that all I need to do to understand your code is to take the withFile' here:
import System.IO
main = do
withFile' "girlfriend.txt" ReadMode (\handle -> do
contents <- hGetContents handle
putStr contents)
... and inline its definition:
main = do
handle <- openFile "girlfriend.txt" ReadMode
contents <- hGetContents handle
result <- putStr contents
hClose handle
return result
Once you inline its definition, it's easier to see what is going on. putStr evaluates the entire contents of the file before you close the handle, so there is no error. Also, result is not what you think it is: it's the return value of putStr, which is just (), not the contents of the file.

Most IO actions are not lazily executed.
IO action execution is different from normal Haskell evaluation of values. IO execution is only ever carried out by the outer driver that is trying to execute all the effects of main; it does so in the correct order implied by the monadic sequencing of IO actions.
The driver's need to know what the next IO action is ultimately triggers all evaluation of lazy values in Haskell; if it were happy with an unevaluated lazy value and moved on to the next thing without fully evaluating and executing it, then it would just leave main unevaluated and no Haskell program could ever do anything.
The Haskell value resulting from executing an IO action may of course be an unevaluated lazy value, but each IO action itself is evaluated and executed by the driver (including all sub-actions sequenced with do blocks or binds).
So result doesn't get lazily bound to f handle completely unevaluated; f handle is evaluated to come up with the sub actions hGetContents handle and putStr contents. These are both fully executed before the outer driver moves on to hClose handle, so everything's okay.
Note however that hGetContents is special. Quoting from the documentation:
Computation hGetContents hdl returns the list of characters corresponding to the unread portion of the channel or file managed by hdl, which is put into an intermediate state, semi-closed. In this state, hdl is effectively closed, but items are read from hdl on demand and accumulated in a special list returned by hGetContents hdl.
Any operation that fails because a handle is closed, also fails if a handle is semi-closed. The only exception is hClose. A semi-closed handle becomes closed:
if hClose is applied to it;
if an I/O error occurs when reading an item from the handle;
or once the entire contents of the handle has been read.
Once a semi-closed handle becomes closed, the contents of the associated list becomes fixed. The contents of this final list is only partially specified: it will contain at least all the items of the stream that were evaluated prior to the handle becoming closed.
So executing hGetContents handle actually results in a partially evaluated list, whose lazy evaluation is tied to further IO operations under the hood. This is impossible to do yourself without using the Unsafe family of operations, since it is essentially bypassing the type system and can result in exactly the sort of problem you were concerned about; if you had attempted the following code:
main = do
text <- withFile' "girlfriend.txt" ReadMode (\handle -> do
contents <- hGetContents handle
return contents)
putStr text
(where the function passed to withFile' tries to return the file contents, and they are passed to putStr after the withFile' call), then the putStr would be executed after hClose, and the file may well not have been fully read before it was closed.

Related

Haskell hGetContents error

Is the following error due to lazy evaluation?
epubParsing :: FilePath -> IO [String]
epubParsing f = do
h <- openFile f ReadMode
hSetEncoding h utf8
content <- hGetContents h
hClose h
return . fromJust $ scrapeStringLike content paragraphS
I get an error: hGetContents: illegal operation (delayed read on closed handle)
Why?
Calling hGetContents puts the handle into a special "semi-closed" state. You cannot perform any explicit operations on it after that. In particular, you don't manually close it; it automatically gets closed in the background when you read to the end of the string. You can just remove that hClose and it will work.
This is one of the pitfalls of lazy I/O, and one of the reasons people advise to avoid it; it makes the timing of your I/O operations kind of unpredictable.

How to reuse efficiently input from stdin in Haskell

I understand that I should not try to re-read from stdin because of errors about Haskell IO - handle closed
For example, in below:
main = do
x <- getContents
putStrLn $ map id x
x <- getContents --problem line
putStrLn x
the second call x <- getContents will cause the error:
test: <stdin>: hGetContents: illegal operation (handle is closed)
Of course, I can omit the second line to read from getContents.
main = do
x <- getContents
putStrLn $ map id x
putStrLn x
But will this become a performance/memory issue? Will GHC have to keep all of the contents read from stdin in the main memory?
I imagine the first time around when x is consumed, GHC can throw away the portions of x that are already processed. So theoretically, GHC could only use a small amount of constant memory for the processing. But since we are going to use x again (and again), it seems that GHC cannot throw away anything. (Nor can it read again from stdin).
Is my understanding about the memory implications here correct? And if so, is there a fix?
Yes, your understanding is correct: If you reuse x, ghc has to keep it all in memory.
I think a possible fix is to consume it lazily (once).
Let's say you want to output x to several output handles hdls :: [Handle]. The naive approach is:
main :: IO ()
main = do
x <- getContents
forM_ hdls $ \hdl -> do
hPutStr hdl x
This will read stdin into x as the first hPutStr traverses the string (at least for unbuffered handles, hPutStr is simply a loop that calls hPutChar for each character in the string). From then on it'll be kept in memory for all following hdls.
Alternatively:
main :: IO ()
main = do
x <- getContents
forM_ x $ \c -> do
forM_ hdls $ \hdl -> do
hPutChar hdl c
Here we've transposed the loops: Instead of iterating over the handles (and for each handle iterating over the input characters), we iterate over the input characters, and for each character, we print it to each handle.
I haven't tested it, but this form should guarantee that we don't need a lot of memory because each input character c is used once and then discarded.

Haskell return lazy string from file IO

Here I'm back again with a (for me) really strange behaviour of my newest masterpiece...
This code should read a file, but it doesn't:
readCsvContents :: String -> IO ( String )
readCsvContents fileName = do
withFile fileName ReadMode (\handle -> do
contents <- hGetContents handle
return contents
)
main = do
contents <- readCsvContents "src\\EURUSD60.csv"
putStrLn ("Read " ++ show (length contents) ++ " Bytes input data.")
The result is
Read 0 Bytes input data.
Now I changed the first function and added a putStrLn:
readCsvContents :: String -> IO ( String )
readCsvContents fileName = do
withFile fileName ReadMode (\handle -> do
contents <- hGetContents handle
putStrLn ("hGetContents gave " ++ show (length contents) ++ " Bytes of input data.")
return contents
)
and the result is
hGetContents gave 3479360 Bytes of input data.
Read 3479360 Bytes input data.
WTF ??? Well, I know, Haskell is lazy. But I didn't know I had to kick it in the butt like this.
You're right, this is a pain. Avoid using the old standard file IO module, for this reason – except to simply read an entire file that won't change, as you did; this can be done just fine with readFile.
readCsvContents :: Filepath -> IO String
readCsvContents fileName = do
contents <- readFile fileName
return contents
Note that, by the monad laws, this is exactly the same1 as
readCsvContents = readFile
The problem with what you tried is that the handle is closed unconditionally when the monad exits withFile, without checking whether lazy-evaluation of contents has actually forced the file reads. That is of course horrible; I would never bother to use handles myself. readFile avoids the problem by linking the closing of the handle to garbage-collection of the original result thunk; this isn't altogether nice either but often works quite well.
For proper work with file IO, check out either the conduit or pipes library. The former focuses a bit more on performance, the latter more on elegance (but really, the difference isn't that big).
1And your first try is the same as readCsvContents fn = withFile fn ReadMode hGetContents.
This is a problem with lazy IO. What happens in your code is that withFile opens the file, passes the handle to the lambda. This lambda returns a lazy list containing the contents of the file. Then withFile notices that the callback finished and closes the file.
Since the returned list is lazy, the file contents will only be read when the list is evaluated. This happens in the call to length. However, at this point the file handle is already closed and therefore you can't read anything from the file.
The modified version of your call forces the file contents in the withFile argument, at which point the file is still available, and therefore it works.

Haskell getContents wait for EOF

I want to wait until user input terminates with EOF and then output it all whole. Isn't that what getContents supposed to do? The following code outputs each time user hits enter, what am I doing wrong?
import System.IO
main = do
hSetBuffering stdin NoBuffering
contents <- getContents
putStrLn contents
The fundamental problem is that getContents is an instances of Lazy IO. This means that getContents produces a thunk that can be evaluated like a normal Haskell value, and only does the relevant IO when it's forced.
contents is a lazy list that putStr tries to print, which forces the list and causes getContents to read as much as it can. putStr then prints everything that's forced, and continues trying to force the rest of the list until it hits []. As getContents can read more and more of the stream—the exact behavior depends on buffering—putStr can print more and more of it immediately, giving you the behavior you see.
While this behavior is useful for very simple scripts, it ties in Haskell's evaluation order into observable effects—something it was never meant to do. This means that controlling exactly when parts of contents get printed is awkward because you have to break the normal Haskell abstraction and understand exactly how things are getting evaluated.
This leads to some potentially unintuitive behavior. For example, if you try to get the length of the input—and actually use it—the list is forced before you get to printing it, giving you the behavior you want:
main = do
contents <- getContents
let n = length contents
print n
putStr contents
but if you move the print n after the putStr, you go back to the original behavior because n does not get forced until after printing the input (even though n still got defined before putStr was used):
main = do
contents <- getContents
let n = length contents
putStr contents
print n
Normally, this sort of thing is not a problem because it won't change the behavior of your code (although it can affect performance). Lazy IO just brings it into the realm of correctness by piercing the abstraction layer.
This also gives us a hint on how we can fix your issue: we need some way of forcing contents before printing it. As we saw, we can do this with length because length needs to traverse the whole list before computing its result. Instead of printing it, we can use seq which forces the lefthand expression to be evaluated at the same time as the righthand one, but throws away the actual value:
main = do
contents <- getContents
let n = length contents
n `seq` putStr contents
At the same time, this is still a bit ugly because we're using length just to traverse the list, not because we actually care about it. What we would really like is a function that just traverses the list enough to evaluate it, without doing anything else. Happily, this is exactly what deepseq does (for many data structures, not just lists):
import Control.DeepSeq
import System.IO
main = do
contents <- getContents
contents `deepseq` putStr contents
This is a problem of lazy I/O. One simple solution is to use strict I/O, such as via ByteStrings:
import qualified Data.ByteString as S
main :: IO ()
main = S.getContents >>= S.putStr
You can use the replacement functions from the strict package (link):
import qualified System.IO.Strict as S
main = do
contents <- S.getContents
putStrLn contents
Note that for reading there isn't a need to set buffering. Buffering really only helps when writing to files. See this answer (link) for more details.
The definition of the strict version of hGetContents in System.IO.Strict is pretty simple:
hGetContents :: IO.Handle -> IO.IO String
hGetContents h = IO.hGetContents h >>= \s -> length s `seq` return s
I.e., it forces everything to read into memory by calling length on the string returned by the standard/lazy version of hGetContents.

reading files with references to other files in haskell

I am trying to expand regular markdown with the ability to have references to other files, such that the content in the referenced files is rendered at the corresponding places in the "master" file.
But the furthest I've come is to implement
createF :: FTree -> IO String
createF Null = return ""
createF (Node f children) = ifNExists f (_id f)
(do childStrings <- mapM createF children
withFile (_path f) ReadMode $ \handle ->
do fc <- lines <$> hGetContents handle
return $ merge fc childStrings)
ifNExists is just a helper that can be ignored, the real problem happens in the reading of the handle, it just returns the empty string, I assume this is due to lazy IO.
I thought that the use of withFile filepath ReadMode $ \handle -> {-do stutff-}hGetContents handle would be the right solution as I've read fcontent <- withFile filepath ReadMode hGetContents is a bad idea.
Another thing that confuses me is that the function
createFT :: File -> IO FTree
createFT f = ifNExists f Null
(withFile (_path f) ReadMode $ \handle ->
do let thisParse = fparse (_id f :_parents f)
children <-rights . map ( thisParse . trim) . lines <$> hGetContents handle
c <- mapM createFT children
return $ Node f c)
works like a charm.
So why does createF return just an empty string?
the whole project and a directory/file to test can be found at github
Here are the datatype definitions
type ID = String
data File = File {_id :: ID, _path :: FilePath, _parents :: [ID]}
deriving (Show)
data FTree = Null
| Node { _file :: File
, _children :: [FTree]} deriving (Show)
As you suspected, lazy IO is probably the problem. Here's the (awful) rule you have to follow to use it properly without going totally nuts:
A withFile computation must not complete until all (lazy) I/O required to fully evaluate its result has been performed.
If something forces I/O after the handle is closed, you are not guaranteed to get an error, even though that would be very nice. Instead, you get completely undefined behavior.
You break this rule with return $ merge fc childStrings, because this value is returned before it's been fully evaluated. What you can do instead is something vaguely like
let retVal = merge fc childStrings
deepSeq retVal $ return retVal
An arguably cleaner alternative is to put all the rest of the code that relies on those results into the withFile argument. The only real reason not to do that is if you do a bunch of other work with the results after you're finished with that file. For example, if you're processing a bunch of different files and accumulating their results, then you want to be sure to close each of them when you're done with it. If you're just reading in one file and then acting on it, you can leave it open till you're finished.
By the way, I just submitted a feature request to the GHC team to see if they might be willing to make these kinds of programs more likely to fail early with useful error messages.
Update
The feature request was accepted, and such programs are now much more likely to produce useful error messages. See What caused this "delayed read on closed handle" error? for details.
I'd strongly suggest you to avoid lazy IO as it always creates problems like this, as described in What's so bad about Lazy I/O? As in your case, where you need to keep the file open until it's fully read, but this would mean closing the file somewhere in pure code, when the content is actually consumed.
One possibility would be to use strict ByteStrings and read files using readFile. This would also make many operations more efficient.
Another option would be to use one of the libraries that address the lazy IO problem (see What are the pros and cons of Enumerators vs. Conduits vs. Pipes?). These libraries allow you to separate content production from its processing or consumption. So you could have a producer that reads input files and produces a stream of some tokens, and a pure consumer (not depending on IO) that consumes the stream and produces some result. For example, in conduit-extra there is a module that converts an atto-parsec parser into a consumer.
See also Is there a better way to walk a directory tree?

Resources