hGetContents being too lazy - Haskell

I have the following snippet of code, which I pass to withFile:
text <- hGetContents hand
let code = parseCode text
return code
Here hand is a valid file handle, opened with ReadMode, and parseCode is my own function that reads the input and returns a Maybe. As it is, the function fails and returns Nothing. If, instead, I write:
text <- hGetContents hand
putStrLn text
let code = parseCode text
return code
I get a Just, as I should.
If I do openFile and hClose myself, I have the same problem. Why is this happening? How can I cleanly solve it?
Thanks

hGetContents isn't too lazy, it just needs to be composed with other things appropriately to get the desired effect. Maybe the situation would be clearer if it were renamed exposeContentsToEvaluationAsNeededForTheRestOfTheAction or just listen.
withFile opens the file, does something (or nothing, as you please -- exactly what you require of it in any case), and closes the file.
It will hardly suffice to bring out all the mysteries of 'lazy IO', but consider now this difference in bracketing
import Control.Monad ((>=>))
import System.IO

good, bad :: Show a => FilePath -> (String -> IO a) -> IO ()
good file operation = withFile file ReadMode (hGetContents >=> operation >=> print)
bad  file operation = (withFile file ReadMode hGetContents) >>= operation >>= print
-- *Main> good "lazyio.hs" (return . length)
-- 503
-- *Main> bad "lazyio.hs" (return . length)
-- 0
Crudely put, bad opens and closes the file before it does anything; good does everything in between opening and closing the file. Your first action was akin to bad. withFile should govern all of the action you want done that depends on the handle.
You don't need a strictness enforcer if you are working with String, small files, etc., just an idea how the composition works. Again, in bad all I 'do' before closing the file is exposeContentsToEvaluationAsNeededForTheRestOfTheAction. In good I compose exposeContentsToEvaluationAsNeededForTheRestOfTheAction with the rest of the action I have in mind, then close the file.
The familiar length + seq trick mentioned by Patrick, or length + evaluate, is worth knowing; your second action, with its putStrLn text, was a variant. But reorganization is better, unless lazy IO is wrong for your case.
$ time ./bad
bad: Prelude.last: empty list
-- no, lots of Chars there
real 0m0.087s
$ time ./good
'\n' -- right
()
real 0m15.977s
$ time ./seqing
Killed -- hopeless, attempting to represent the file contents
real 1m54.065s -- in memory as a linked list, before finding out the last char
It goes without saying that ByteString and Text are worth knowing about, but reorganization with evaluation in mind is better, since even with them the Lazy variants are often what you need, and they then involve grasping the same distinctions between forms of composition. If you are dealing with one of the (immense) class of cases where this sort of IO is inappropriate, take a look at enumerator, conduit and co., all wonderful.

hGetContents uses lazy IO; it only reads from the file as you force more of the string, and it only closes the file handle when you evaluate the entire string it returns. The problem is that you're enclosing it in withFile; instead, just use openFile and hGetContents directly (or, more simply, readFile). The file will still get closed once you fully evaluate the string. If you want the file to be fully read and closed immediately, force the entire string beforehand; something like this should do the trick:
import Control.Exception (evaluate)

-- parseCode :: String -> Maybe Code, as in the question
readCode :: FilePath -> IO (Maybe Code)
readCode fileName = do
  text <- readFile fileName
  evaluate (length text)  -- forces the whole string, so the file is read and closed now
  return (parseCode text)
Unintuitive situations like this are one of the reasons people tend to avoid lazy IO these days, but unfortunately you can't change the definition of hGetContents. A strict IO version of hGetContents is available in the strict package, but it's probably not worth depending on the package just for that one function.
If you want to avoid the overhead that comes from traversing the string twice here, then you should probably look into using a more efficient type than String, anyway; the Text type has strict IO equivalents for much of the String-based IO functionality, as does ByteString (if you're dealing with binary data, rather than Unicode text).
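For instance, here is a sketch of the same reader over strict Text I/O; Code and the Text-based parseCodeT are hypothetical stand-ins for the asker's definitions:
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

-- Hypothetical stand-ins for the asker's code:
-- data Code = ...
-- parseCodeT :: T.Text -> Maybe Code

readCodeT :: FilePath -> IO (Maybe Code)
readCodeT fileName = do
  text <- TIO.readFile fileName -- strict: the whole file is read and the handle closed here
  return (parseCodeT text)      -- no second traversal of the string needed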

You can force the contents of text to be evaluated using
length text `seq` return code
as the last line.
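Put together with the withFile snippet from the question, a minimal sketch (parseCode and Code are the asker's, assumed here):
import System.IO

-- Assumed from the question:
-- data Code = ...
-- parseCode :: String -> Maybe Code

readCode :: FilePath -> IO (Maybe Code)
readCode path = withFile path ReadMode $ \hand -> do
  text <- hGetContents hand
  let code = parseCode text
  length text `seq` return code -- forces the read before withFile closes the handle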

Related

IO vs referential transparency

Sorry, newb question here, but how does Haskell know not to apply referential transparency to e.g. readLn, or when putStrLn-ing the same string twice? Is it because IO is involved? IOW, will the compiler not apply referential transparency to functions returning IO?
You need to distinguish between evaluation and execution.
If you evaluate 2 + 7, the result is 9. If you replace one expression that evaluates to 9 with another, different expression that also evaluates to 9, then the meaning of the program has not changed. This is what referential transparency guarantees. We could common up several shared expressions that reduce to 9, or duplicate a shared expression into multiple copies, and the meaning of the program does not change. (Performance might, but not the end result.)
If you evaluate readLn, it evaluates to an I/O command object. You can imagine it as being a data structure that describes what I/O operation(s) you want performed. But the object itself is just data. If you evaluate readLn twice, it returns the same I/O command object twice. You can merge several copies into one; you can split one copy into several. It doesn't change the meaning of the program.
Now if you want to execute the I/O action, that's a different thing. Clearly, I/O operations need to be executed in exactly the way the program specifies, without being randomly duplicated or rearranged. But that's OK, because it's not the Haskell expression evaluation engine that does that. You can pretend that the Haskell runtime runs main, which builds a giant I/O command object representing the entire program. The Haskell runtime then reads this data structure and executes the I/O operations it requests, in the order specified. (Not actually how it works, but a useful mental model.)
Ordinarily you don't need to bother thinking about this strict separation between evaluating readLn to get an I/O command object and then executing the resulting I/O command object to get a result. But strictly that's notionally what it does.
(You may also have heard that I/O "forms a monad". That's merely a fancy way of saying that there's a particular set of operators for chaining I/O command objects together into bigger I/O command objects. It's not central to understanding the separation between evaluate and execute.)
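A small sketch of that separation: the expression getLine can be shared or duplicated freely without changing the program's meaning; lines are only read when the resulting command object is executed.
main :: IO ()
main = do
  let action  = getLine -- evaluation: just an I/O command object; nothing is read
      action' = action  -- duplicating the expression changes nothing
  a <- action           -- execution: reads one line
  b <- action'          -- execution again: reads a second line
  putStrLn (a ++ " " ++ b)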
The IO type is defined as:
newtype IO a = IO (State# RealWorld -> (# State# RealWorld, a #))
Notice it is very similar to the State monad, where the state is the state of the real world; so, IMHO, you can think of IO as referentially transparent and pure. What's impure is the Haskell runtime (interpreter) that runs your IO actions (the algebra).
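To make the analogy concrete, here is a toy model (emphatically not GHC's actual implementation) of IO as a state-passing function, with World standing in for State# RealWorld:
newtype MyIO a = MyIO { runMyIO :: World -> (World, a) }
data World = World -- stand-in for the RealWorld token

instance Functor MyIO where
  fmap f (MyIO g) = MyIO $ \w -> let (w', x) = g w in (w', f x)

instance Applicative MyIO where
  pure x = MyIO $ \w -> (w, x)
  MyIO f <*> MyIO g = MyIO $ \w ->
    let (w', h)  = f w
        (w'', x) = g w'
    in (w'', h x)

instance Monad MyIO where
  MyIO g >>= k = MyIO $ \w -> let (w', x) = g w in runMyIO (k x) w'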
Take a look at the Haskell wiki, it explains IO in greater detail: IO Inside
Because return values are wrapped into IO, you can't reuse them unless you "pull" them out, effectively running the IO action:
getLine :: IO String

twoLines = getLine ++ getLine -- can't do this, because (++) works on Strings, not IO Strings

twoLines' = do
  l1 <- getLine
  l2 <- getLine -- "pulling" out of getLine runs it, so console input is read again
  return (l1 ++ l2) -- so l1 and l2 hold the two lines read, and this works
Sort of. You’ve probably heard that IO is a monad, which means a value wrapped in it has to use monadic operations, such as bind (>>=), sequential composition (>>) and return. So, you could write a greeting program like this:
prompt :: IO String
prompt = putStrLn "Who are you?" >>
         getLine >>=
         \name ->
           return $ "Hello, " ++ name ++ "!"

main :: IO ()
main = prompt >>= putStrLn
You’re more likely to see this with the equivalent do notation, which is just another way of writing the exact same program. In this case, though, I think the unsugared version makes it clearer that the computation is a series of statements chained together with >> and >>=, where we use >> when we want to throw out the result of the previous step, and >>= when we want to pass a result to the next function in the chain. If we need to give that result a name, we can capture it as a parameter, like in the lambda expression \name -> inside prompt. If we need to lift a simple String into an IO String, we use return.
The equivalent in do notation, by the way, is:
prompt :: IO String
prompt = do
  putStrLn "Who are you?"
  name <- getLine
  return $ "Hello, " ++ name ++ "!"

main :: IO ()
main = do
  message <- prompt
  putStrLn message
So how does it know that main, which returns nothing, is not referentially transparent, and that prompt, which returns an IO String, is not either? There is something special to IO, or at least something IO lacks: with many other monads, such as State and Maybe, there is a way to do a lazy computation inside the monad and discard the wrapper, getting a pure value back out. You can declare a State Int monad, do deterministic, sequential, stateful computations inside it for a while, then use evalState to get the pure Int result back. You can do a Maybe Char computation, like searching for a character in a string, check that it worked, and if so, read the pure Char back out.
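As a sketch of that difference (using the State monad from mtl; the example values are made up):
import Control.Monad.State (State, evalState, get, put)

tick :: State Int Int
tick = do
  n <- get
  put (n + 1)
  return n

ticked :: Int
ticked = evalState tick 0            -- the pure Int comes back out: 0

found :: Maybe Char                  -- a Maybe Char computation...
found = lookup 3 (zip [0 ..] "abcd") -- Just 'd'

unwrapped :: Char
unwrapped = case found of            -- ...whose pure Char we can read back out
  Just c  -> c
  Nothing -> '?'

-- IO offers no analogous escape hatch.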
With IO, you cannot do this. If you have an IO String, all you can do with it is bind it to an IO function that takes a String argument, such as putStrLn, or pass it to a function that takes an IO String argument. If you call prompt a second time, it won't silently give you the same result; it will actually run again. If you tell it to pause for a few milliseconds, it won't lazily wait until you need some return value later in the program to do that. If it returns an empty value like IO (), the compiler will not optimize it by just returning that constant.
The way this works internally is to wrap the object together with a state-of-the-world parameter that is different for each call. That means that two different calls to getLine depend on different states of the world, and the return value of main requires computing the final state of the world, which depends on all previous IO operations.

Is it possible to have a lazy version of `sequence` without using `unsafeIO`s?

If I have a possibly infinite list of IO-monads, and I am guaranteed that their sequential execution will not be affected by other IOs, can I somehow make it lazily sequenced (evaluated)?
To clarify my point, here is some pseudo-Haskell code demonstrating what I had in mind:
main = do
  inputs <- sequence . repeat $ getLine -- we are forever stuck here
  mapM_ putStrLn inputs -- not going to run
Now, I know that in the particular example above, we can just use getContents to get the effect I want
main = do
  inputs <- return . lines =<< getContents
  mapM_ putStrLn inputs
but in my application the IO monads are not getLine but an external function get1 :: IO (Maybe Record). However, this actually illustrates my point, because apparently getContents internally uses unsafe IO operations to achieve this lazy effect. My question is, is that necessary? (If you are interested in what exactly I want to do, please refer to this question.)
Maybe you are looking for this?
main = do
  let inputs = repeat getLine
  mapM_ (>>= putStrLn) inputs
Just to bell the cat: No.
If you have something of type IO a, there is no way to get any information out of that a without fully executing that IO block. Conceptually, an act :: IO a could be defined as an arbitrarily complicated set of actions, followed by producing a result dependent on a call to the system random number generator.
The entire purpose of unsafeInterleaveIO is to perform what you're asking, and it can't be done without it.
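For reference, a sketch of such a lazy sequence built on unsafeInterleaveIO; this is essentially what getContents does internally:
import System.IO.Unsafe (unsafeInterleaveIO)

lazySequence :: [IO a] -> IO [a]
lazySequence []     = return []
lazySequence (m:ms) = unsafeInterleaveIO $ do
  x  <- m               -- runs only when the result list is forced this far
  xs <- lazySequence ms
  return (x : xs)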

What is pipes/conduit trying to solve

I have seen people recommending pipes/conduit library for various lazy IO related tasks. What problem do these libraries solve exactly?
Also, when I try to use some hackage related libraries, it is highly likely there are three different versions. Example:
attoparsec
pipes-attoparsec
attoparsec-conduit
This confuses me. For my parsing tasks should I use attoparsec or pipes-attoparsec/attoparsec-conduit? What benefit do the pipes/conduit version give me as compared to the plain vanilla attoparsec?
Lazy IO
Lazy IO works like this
readFile :: FilePath -> IO ByteString
where ByteString is guaranteed to only be read chunk-by-chunk. To do so we could (almost) write
-- given `readChunk`, which reads a chunk beginning at n
readChunk :: FilePath -> Int -> IO (Int, ByteString)

readFile fp = readChunks 0 where
  readChunks n = do
    (n', chunk) <- readChunk fp n
    chunks      <- readChunks n'
    return (chunk <> chunks)
but here we note that the IO action readChunks n' is performed prior to returning even the partial result available as chunk. This means we're not lazy at all. To combat this we use unsafeInterleaveIO
readFile fp = readChunks 0 where
  readChunks n = do
    (n', chunk) <- readChunk fp n
    chunks      <- unsafeInterleaveIO (readChunks n')
    return (chunk <> chunks)
which causes readChunks n' to return immediately, thunking an IO action to be performed only when that thunk is forced.
That's the dangerous part: by using unsafeInterleaveIO we've delayed a bunch of IO actions to non-deterministic points in the future that depend upon how we consume our chunks of ByteString.
Fixing the problem with coroutines
What we'd like to do is slide a chunk processing step in between the call to readChunk and the recursion on readChunks.
readFileCo :: Monoid a => FilePath -> (ByteString -> IO a) -> IO a
readFileCo fp action = readChunks 0 where
  readChunks n = do
    (n', chunk) <- readChunk fp n
    a           <- action chunk
    as          <- readChunks n'
    return (a <> as)
Now we've got the chance to perform arbitrary IO actions after each small chunk is loaded. This lets us do much more work incrementally without completely loading the ByteString into memory. Unfortunately, it's not terrifically compositional--we need to build our consumption action and pass it to our ByteString producer in order for it to run.
Pipes-based IO
This is essentially what pipes solves--it allows us to compose effectful co-routines with ease. For instance, we now write our file reader as a Producer which can be thought of as "streaming" the chunks of the file when its effect gets run finally.
produceFile :: FilePath -> Producer ByteString IO ()
produceFile fp = produce 0 where
  produce n = do
    (n', chunk) <- liftIO (readChunk fp n)
    yield chunk
    produce n'
Note the similarities between this code and readFileCo above—we simply replace the call to the coroutine action with yielding the chunk we've produced so far. This call to yield builds a Producer type instead of a raw IO action which we can compose with other Pipes types in order to build a nice consumption pipeline called an Effect IO ().
All of this pipe building gets done statically without actually invoking any of the IO actions. This is how pipes lets you write your coroutines more easily. All of the effects get triggered at once when we call runEffect in our main IO action.
runEffect :: Effect IO () -> IO ()
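For instance, a hedged sketch of running such a pipeline (produceFile is the producer defined above; "myfile" is a placeholder):
import Pipes
import qualified Pipes.Prelude as P
import qualified Data.ByteString as BS

-- Nothing is read or printed until runEffect executes the composed Effect.
main :: IO ()
main = runEffect $ produceFile "myfile" >-> P.mapM_ BS.putStr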
Attoparsec
So why would you want to plug attoparsec into pipes? Well, attoparsec is optimized for lazy parsing. If you are producing the chunks fed to an attoparsec parser in an effectful way then you'll be at an impasse. You could:
1. Use strict IO and load the entire string into memory, only to consume it lazily with your parser. This is simple and predictable, but inefficient.
2. Use lazy IO and lose the ability to reason about when your production IO effects will actually get run, causing possible resource leaks or closed-handle exceptions according to the consumption schedule of your parsed items. This is more efficient than (1), but can easily become unpredictable.
3. Use pipes (or conduit) to build up a system of coroutines which include your lazy attoparsec parser, allowing it to operate on as little input as it needs while producing parsed values as lazily as possible across the entire stream.
If you want to use attoparsec, use attoparsec
For my parsing tasks should I use attoparsec or pipes-attoparsec/attoparsec-conduit?
Both pipes-attoparsec and attoparsec-conduit transform a given attoparsec Parser into a sink/conduit or pipe. Therefore you have to use attoparsec either way.
What benefit do the pipes/conduit version give me as compared to the plain vanilla attoparsec?
They work with pipes and conduit, where the vanilla one won't (at least not out-of-the-box).
If you don't use conduit or pipes, and you're satisfied with the current performance of your lazy IO, there's no need to change your current flow, especially if you're not writing a big application or processing large files. You can simply use attoparsec.
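For example, a minimal sketch of plain attoparsec over a lazily read file ("numbers.txt" and the expected integer content are made up):
import qualified Data.Attoparsec.ByteString.Lazy as AL
import Data.Attoparsec.ByteString.Char8 (decimal)
import qualified Data.ByteString.Lazy as BL

main :: IO ()
main = do
  contents <- BL.readFile "numbers.txt" -- lazy IO, as discussed
  case AL.eitherResult (AL.parse decimal contents) of
    Right n  -> print (n :: Int)
    Left err -> putStrLn ("parse error: " ++ err)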
However, that assumes that you know the drawbacks of lazy IO.
What's the matter with lazy IO? (Problem study withFile)
Let's not forget your first question:
What problem do these libraries solve exactly ?
They solve the streaming-data problem (see references 1 and 3 below) that occurs within functional languages with lazy IO. Lazy IO sometimes gives you not what you want (see example below), and sometimes it's hard to determine the actual system resources needed by a specific lazy operation (is the data read/written in chunks/bytes/buffered/onclose/onopen…).
Example for over-laziness
import System.IO

main = withFile "myfile" ReadMode hGetContents
         >>= return . take 5
         >>= putStrLn
This won't print anything, since the evaluation of the data happens in putStrLn, but the handle has been closed already at this point.
Fixing fire with poisonous acid
While the following snippet fixes this, it has another nasty feature:
main = withFile "myfile" ReadMode $ \handle -> do
  contents <- hGetContents handle
  length contents `seq` putStrLn (take 5 contents)
In this case hGetContents will read all of the file, something you didn't expect at first. If you just want to check the magic bytes of a file which could be several GB in size, this is not the way to go.
Using withFile correctly
The solution is, obviously, to do everything needed within the withFile context:
main = withFile "myfile" ReadMode $ \handle ->
         fmap (take 5) (hGetContents handle)
           >>= putStrLn
This is by the way, also the solution mentioned by the author of pipes:
This [..] answers a question people sometimes ask me about pipes, which I will paraphrase here:
If resource management is not a core focus of pipes, why should I use pipes instead of lazy IO?
Many people who ask this question discovered stream programming through Oleg, who framed the lazy IO problem in terms of resource management. However, I never found this argument compelling in isolation; you can solve most resource management issues simply by separating resource acquisition from the lazy IO, like this: [see last example above]
Which brings us back to my previous statement:
You can simply use attoparsec [...][with lazy IO, assuming] that you know the drawbacks of lazy IO.
References
1. Iteratee I/O, which explains the example better and provides a better overview
2. Gabriel Gonzalez (maintainer/author of pipes): Reasoning about stream programming
3. Michael Snoyman (maintainer/author of conduit): Conduit versus Enumerator
Here's a great podcast with authors of both libraries:
http://www.haskellcast.com/episode/006-gabriel-gonzalez-and-michael-snoyman-on-pipes-and-conduit/
It'll answer most of your questions.
In short, both of those libraries approach the problem of streaming, which is very important when dealing with IO. In essence, they manage the transfer of data in chunks, allowing you to, e.g., transfer a 1GB file consuming just 64KB of RAM on both the server and the client. Without streaming, you would have had to allocate as much memory on both ends.
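As a hedged sketch with conduit (the file names are placeholders), such a constant-memory transfer can be a one-liner:
import Conduit

-- Streams chunk by chunk; memory use stays flat regardless of file size.
main :: IO ()
main = runConduitRes $ sourceFile "big-input.bin" .| sinkFile "copy.bin"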
An older alternative to those libraries is lazy IO, but it is filled with issues and makes applications error-prone. Those issues are discussed in the podcast.
Concerning which one of those libraries to use, it's more of a matter of taste. I prefer "pipes". The detailed differences are discussed in the podcast too.

Works in GHCi> but not when loaded?

I can't figure out why I get two different results, but I'm sure it has to do with IO, which I am beginning to hate!
For example:
ghci> x <- readFile "foo.txt"
ghci> let y = read x :: [Int]
ghci> :t y
y :: [Int]
Now when I create that file and do the same thing it comes out as IO [Int] ?
foo.txt is a txt file containing only this: 12345
Can someone explain this to me? I'm about to snap!
Thanks for any insight!
Read about ghci. To quote
The syntax of a statement accepted at the GHCi prompt is exactly the same as the syntax of a statement in a Haskell do expression. However, there's no monad overloading here: statements typed at the prompt must be in the IO monad.
Basically you are inside the IO Monad when you are writing anything in ghci.
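In a source file, the equivalent of that GHCi session must live inside an IO action; a minimal sketch (assuming foo.txt holds something readable as [Int], e.g. [1,2,3,4,5]):
main :: IO ()
main = do
  x <- readFile "foo.txt"
  let y = read x :: [Int] -- y :: [Int], just as in ghci
  print y                 -- but main itself has type IO ()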
Main idea
Be clear on the distinction in Haskell between an IO operation that produces a value, and the value itself.
Recommended Reading
A good reference to IO in Haskell that doesn't expect you to want to know the theoretical basis for monads is sigfpe's
The IO Monad for People who Simply Don't Care.
Note it's the monad-theoretical bit that he's assuming you don't care about. He's assuming you do care about doing IO.
It's not very long, and I think it's very much worth a read for you, because it makes explicit some 'rules' that you're not aware of so are causing you irritation.
Your code
Anyway, in your code
x <- readFile "foo.txt"
the readFile "foo.txt" bit has type IO String, which means it's an operation that produces a String.
When you do x <- readFile "foo.txt", you use x to refer to the String it produces.
Notice the distinction between the output, x and the operation that produced it, readFile "foo.txt".
Next let's look at y. You define let y = read x :: [Int], so y is a list of Ints, as you specified.
However, y isn't the same as the whole chunk that defines it.
example = do
  x <- readFile "foo.txt"
  let y = read x :: [Int]
  return y
Here example :: IO [Int], whereas y itself has type [Int].
The cause of your frustration
If you come from an imperative language, this is frustrating at first -
you're used to being able to use functions that produce values wherever you'd use values,
but you're used to those functions also being allowed to execute arbitrary IO operations.
In Haskell, you can do whatever you like with 'pure' functions (that don't use IO) but not IO operations.
Haskell programmers see a whole world of difference between an IO operation that returns a value,
which can only be reused in other IO operations, and a pure function which can be used anywhere.
This means that you can end up trapped in the awkward IO monad all the time,
and all your functions are full of IO datatypes. This is inconvenient and you write messy code.
How to avoid IO mess
First solve the problem you have in its entirety without using external (file or user) data:
Write sample data that you normally read from a file or user as values in your source code.
Write your functions with any data you need in the definition as parameters to the function.
This is the only way you can get data when you're writing pure code. Write pure code first.
Test your functions on the specimen data. (If you like, you can reload in ghci every time you write a new function, making sure it does what you expect.)
Once your program is complete without the IO, you can introduce it as a wrapper around the pure code at the end.
This means that in your program, I don't think you should be writing any readFile or other IO code until you're nearly finished.
It's a completely different workflow - in an imperative language you'd write code to read your data, then do stuff, then write your data.
In Haskell it's better to write the code that does stuff first, then the code to read and write the data at the end, once you know the functionality is right.
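A small sketch of that workflow (the file name and data format are made up): the core logic is a pure function you can test in ghci on specimen data, and the IO wrapper is added last.
sumInts :: String -> Int -- pure core: easy to test without any IO
sumInts = sum . map read . lines

specimen :: String       -- sample data written directly in the source
specimen = "1\n2\n3\n"

-- ghci> sumInts specimen
-- 6

main :: IO ()
main = readFile "numbers.txt" >>= print . sumInts -- thin IO wrapper, added at the end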
You can't actually do exactly the same thing in a Haskell source file, so I suspect what you did actually looks like this:
readFoo = do
  x <- readFile "foo.txt"
  let y = read x :: [Int]
  return y
And are surprised that the type of readFoo comes out as IO [Int], even though it's returning y which is of type [Int].
If this is the case, the source of your confusion is that return in Haskell isn't a return statement from imperative languages.
return in Haskell is a function. A perfectly ordinary function. Like any other function it takes a value of some type as input and gives you a value of some other type as output. Specialised to the case of IO (return can be used with any monad, but we'll keep it simple here), it has this type:
a -> IO a
So it takes a value of any type and gives you a value in the same type wrapped up in the IO monad. So if y has type [Int], then return y has type IO [Int], and that's what you get as the result of readFoo.
There's no way to get the [Int] "out" of the IO [Int]. This is deliberate. The whole point of IO is that any value which is dependent on anything "outside" the program can only appear in an IO x type. So a function which internally reads "foo.txt" and returns a list of the integers in it must be impossible to write in Haskell, or the whole house of cards falls down. If we saw a function with a type like readFoo :: [Int], as you were trying to write, then we know that readFoo can only be one particular list of integers; it can't be a list that depends on the contents of a file on disk.
GHCi does a little bit of magic to make it easier to quickly test stuff at the command prompt. If this were part of a program, then IO [Int] would indeed be the correct type. GHCi is letting you treat the result as just [Int] to save you a bit of typing.
So that's the "why".
Now, if you have a specific question like "how do I do X given that the type signature is this?"...

Haskell: read a file by line

I recently did the Waterloo CCC and I feel that Haskell is the perfect language for answering these types of questions. I am still learning it. I am struggling a bit with the input, though.
Here's what I'm using:
import IO
import System.Environment
import System.FilePath
…
main = do
  name <- getProgName
  args <- getArgs
  input <- readFile $
    if not (null args)
      then head args
      else dropExtension name ++ ".in"
  let (k:code:_) = lines input
  putStrLn $ decode (read k) code
As you can see, I'm reading from the command-line given file path or from j1.in for example, if this program is called j1.hs and compiled to j1.
I am only interested in the first two lines of the file, so I have used pattern matching to get those lines and bind them to k and code, in this example. And I then read k as an integer and pass it and the code string to my decode function, which I output.
I'm wondering if readFile is loading the entire file into memory, which would be bad. But then I started thinking, maybe since Haskell is lazy, it only ever reads the first two lines because that's all it's asked for later on. Am I right?
Also, if there is anything with that code sample that could be better or more idiomatic, please let me know.
The documentation for readFile says:
The readFile function reads a file and returns the contents of the file as a string. The file is read lazily, on demand, as with getContents.
So yes, it will only necessarily read the first two lines of the file (buffering means it will probably read more behind the scenes). But this is a property of readFile specifically, not of all Haskell I/O functions in general.
Lazy I/O is a bad idea for I/O-heavy programs (e.g. webservers) but it works nicely for simple programs that don't do much I/O.
Yes, readFile is lazy. If you want to be explicit about it, you could use:
import Control.Monad (replicateM)
import System.IO
readLines n f = withFile f ReadMode $ replicateM n . hGetLine
-- in main
(k:code:_) <- readLines 2 filename
This will ensure the file is closed as soon as possible.
But the way you've done it is fine.
readFile reads the file lazily, so it won't read the entire file into memory unless you use the entire file. It will not usually read exactly the first two lines, since it reads in blocks, but it will only read as many blocks as needed to find the second newline.
I/O in Haskell isn't usually lazy. However, the readFile function specifically is lazy.
Others have said the same thing. What I haven't seen anybody point out yet is that the file you've opened won't get closed until either the program ends or the garbage collector runs. That just means that the OS file handle might be kept open longer than necessary. In your program that's probably no big deal. But in a more complicated project, it could be.
