How to efficiently reuse input from stdin in Haskell

I understand that I should not try to re-read from stdin, because doing so leads to "handle is closed" errors.
For example, in the code below:
main = do
  x <- getContents
  putStrLn $ map id x
  x <- getContents -- problem line
  putStrLn x
the second call x <- getContents will cause the error:
test: <stdin>: hGetContents: illegal operation (handle is closed)
Of course, I can omit the second read from getContents.
main = do
  x <- getContents
  putStrLn $ map id x
  putStrLn x
But will this become a performance/memory issue? Will GHC have to keep all of the contents read from stdin in the main memory?
I imagine the first time around when x is consumed, GHC can throw away the portions of x that are already processed. So theoretically, GHC could only use a small amount of constant memory for the processing. But since we are going to use x again (and again), it seems that GHC cannot throw away anything. (Nor can it read again from stdin).
Is my understanding about the memory implications here correct? And if so, is there a fix?

Yes, your understanding is correct: if you reuse x, GHC has to keep it all in memory.
I think a possible fix is to consume it lazily (once).
Let's say you want to output x to several output handles hdls :: [Handle]. The naive approach is:
main :: IO ()
main = do
  x <- getContents
  forM_ hdls $ \hdl -> do
    hPutStr hdl x
This will read stdin into x as the first hPutStr traverses the string (at least for unbuffered handles, hPutStr is simply a loop that calls hPutChar for each character in the string). From then on it'll be kept in memory for all following hdls.
Alternatively:
main :: IO ()
main = do
  x <- getContents
  forM_ x $ \c -> do
    forM_ hdls $ \hdl -> do
      hPutChar hdl c
Here we've transposed the loops: Instead of iterating over the handles (and for each handle iterating over the input characters), we iterate over the input characters, and for each character, we print it to each handle.
I haven't tested it, but this form should guarantee that we don't need a lot of memory because each input character c is used once and then discarded.
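For a concrete, runnable version of that idea, here is a minimal sketch of my own (the two output file names are made up for illustration) that copies stdin to several handles in a single pass:
import Control.Monad (forM_)
import System.IO

main :: IO ()
main = do
  x <- getContents
  withFile "copy1.txt" WriteMode $ \h1 ->
    withFile "copy2.txt" WriteMode $ \h2 ->
      -- One pass over the input: each character is written to every handle
      -- and can then be garbage-collected, so memory use stays small.
      forM_ x $ \c ->
        forM_ [h1, h2] $ \hdl ->
          hPutChar hdl c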

Related

Why is sequence [getLine, getLine, getLine] not evaluated lazily?

main = do
  input <- sequence [getLine, getLine, getLine]
  mapM_ print input
Let's see this program in action:
m@m-X555LJ:~$ runhaskell wtf.hs
asdf
jkl
powe
"asdf"
"jkl"
"powe"
Surprisingly to me, there seems to be no laziness here. Instead, all 3 getLines are evaluated eagerly, the read values are stored in memory, and only then are they all printed.
Compare to this:
main = do
  input <- fmap lines getContents
  mapM_ print input
Let's see this in action:
m@m-X555LJ:~$ runhaskell wtf.hs
asdf
"asdf"
lkj
"lkj"
power
"power"
Totally different stuff. Lines are read one by one and printed one by one. Which is odd to me because I don't really see any differences between these two programs.
From LearnYouAHaskell:
When used with I/O actions, sequenceA is the same thing as sequence! It takes a list of I/O actions and returns an I/O action that will perform each of those actions and have as its result a list of the results of those I/O actions. That's because to turn an [IO a] value into an IO [a] value, to make an I/O action that yields a list of results when performed, all those I/O actions have to be sequenced so that they're then performed one after the other when evaluation is forced. You can't get the result of an I/O action without performing it.
I'm confused. I don't need to perform ALL IO actions to get the results of just one.
A few paragraphs earlier the book shows a definition of sequenceA:
sequenceA :: (Applicative f) => [f a] -> f [a]
sequenceA [] = pure []
sequenceA (x:xs) = (:) <$> x <*> sequenceA xs
Nice recursion; nothing here hints to me that this recursion should not be lazy; just like in any other recursion, to get the head of the returned list Haskell doesn't have to go through ALL the steps of the recursion!
Compare:
rec :: Int -> [Int]
rec n = n:(rec (n+1))
main = print (head (rec 5))
In action:
m@m-X555LJ:~$ runhaskell wtf.hs
5
m@m-X555LJ:~$
Clearly, the recursion here is performed lazily, not eagerly.
Then why is the recursion in the sequence [getLine, getLine, getLine] example performed eagerly?
As to why it is important that IO actions are run in order regardless of the results: Imagine an action createFile :: IO () and writeToFile :: IO (). When I do a sequence [createFile, writeToFile] I'd hope that they're both done and in order, even though I don't care about their actual results (which are both the very boring value ()) at all!
I'm not sure how this applies to this Q.
Maybe I'll word my Q this way...
In my mind this:
do
  input <- sequence [getLine, getLine, getLine]
  mapM_ print input
should reduce to something like this:
do
  input <- do
    input <- concat ( map (fmap (:[])) [getLine, getLine, getLine] )
    return input
  mapM_ print input
Which, in turn, should reduce to something like this (pseudocode, sorry):
do
  [ perform print on the result of getLine,
    perform print on the result of getLine,
    perform print on the result of getLine
  ] and discard the results of those prints, since print was applied with mapM_, which discards the results, unlike mapM
getContents is lazy, getLine isn't. Lazy IO isn't a feature of Haskell per se, it's a feature of some particular IO actions.
I'm confused. I don't need to perform ALL IO actions to get the results of just one.
Yes you do! That is one of the most important features of IO, that if you write a >> b or equivalently,
do a
   b
then you can be sure that a is definitely "run" before b. getContents is actually the same, it "runs" before whatever comes after it... but the result it returns is a sneaky result that sneakily does more IO when you try to evaluate it. That is actually the surprising bit, and it can lead to some very interesting results in practice (like the file you're reading the contents of being deleted or changed while you're processing the results of getContents), so in practical programs you probably shouldn't be using it; it mostly exists for convenience in programs where you don't care about such things (code golf, throwaway scripts, or teaching, for instance).
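To make the "sneaky result" idea concrete, here is a small sketch of my own (not from this answer) of how such an action can be built by hand with unsafeInterleaveIO, essentially the mechanism getContents uses internally; the helper name lazyLine is made up:
import System.IO.Unsafe (unsafeInterleaveIO)

-- A made-up helper: the action itself "runs" in order like any other IO
-- action, but the String it returns only performs the real read when
-- somebody evaluates it.
lazyLine :: IO String
lazyLine = unsafeInterleaveIO getLine

main :: IO ()
main = do
  s <- lazyLine            -- nothing has been read from stdin yet
  putStrLn "before the read"
  putStrLn s               -- evaluating s finally performs the getLine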
As to why it is important that IO actions are run in order regardless of the results: Imagine an action createFile :: IO () and writeToFile :: IO (). When I do a sequence [createFile, writeToFile] I'd hope that they're both done and in order, even though I don't care about their actual results (which are both the very boring value ()) at all!
Addressing the edit:
should reduce to something like this:
do
  input <- do
    input <- concat ( map (fmap (:[])) [getLine, getLine, getLine] )
    return input
  mapM_ print input
No, it actually turns into something like this:
do
  input <- do
    x <- getLine
    y <- getLine
    z <- getLine
    return [x,y,z]
  mapM_ print input
The actual definition of sequence is more or less this:
sequence [] = return []
sequence (a:as) = do
  x <- a
  fmap (x:) $ sequence as
Technically, in
sequenceA (x:xs) = (:) <$> x <*> sequenceA xs
we find <*>, which first runs the action on the left, then the action on the right, and finally applies the function produced by the left action to the result of the right one. This is what makes the first effect in the list occur first, and so on.
Indeed, for monads, f <*> x is equivalent to
do theF <- f
   theX <- x
   return (theF theX)
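A quick way to see that left-to-right ordering for yourself (a small sketch of my own, not part of the answer):
main :: IO ()
main = do
  -- <*> runs the left action first, then the right one, so this prints
  -- "left" and then "right" before the final pair.
  r <- (,) <$> (putStrLn "left" >> return 1) <*> (putStrLn "right" >> return 2)
  print r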
More generally, note that all IO actions are generally executed in order, first to last (see below for a few rare exceptions). Doing IO in a completely lazy way would be a nightmare for the programmer. For instance, consider:
do let aX = print "x" >> return 4
       aY = print "y" >> return 10
   x <- aX
   y <- aY
   print (x+y)
Haskell guarantees that the output is "x", "y", 14, in that order. If we had completely lazy IO we could also get "y", "x", 14, depending on which argument is forced first by +. In such a case, we would need to know exactly the order in which the lazy thunks are demanded by every operation, which is something the programmer definitely does not want to care about. Under such detailed semantics, x + y is no longer equivalent to y + x, breaking equational reasoning in many cases.
Now, if we wanted to force IO to be lazy we could use one of the forbidden functions, e.g.
do let aX = unsafeInterleaveIO (print "x" >> return 4)
       aY = unsafeInterleaveIO (print "y" >> return 10)
   x <- aX
   y <- aY
   print (x+y)
The above code makes aX and aY lazy IO actions, and the order of the output is now at the whim of the compiler and the library implementation of +. This is in general dangerous, hence the unsafeness of lazy IO.
Now, about the exceptions. Some IO actions which only read from the environment, like getContents, were implemented with lazy IO (unsafeInterleaveIO). The designers felt that for such reads, lazy IO can be acceptable, and that the precise timing of the reads is not that important in many cases.
Nowadays, this is controversial. While it can be convenient, lazy IO can be too unpredictable in many cases. For instance, we can't know when the file will be closed, and that could matter if we're reading from a socket. We also need to be very careful not to force the reads too early: that often leads to a deadlock when reading from a pipe. Today, it is usually preferred to avoid lazy IO and to resort to a library like pipes or conduit for "streaming"-like operations, where there is no ambiguity.
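As an illustration of avoiding lazy IO without reaching for a streaming library (a sketch of my own, not part of the answer): an explicit loop reads one line at a time, every read happens exactly where it appears in the program, and memory use stays constant:
import System.IO (isEOF)

-- Echo stdin line by line with an explicit loop; no lazy IO involved.
main :: IO ()
main = loop
  where
    loop = do
      eof <- isEOF
      if eof
        then return ()
        else do
          line <- getLine
          putStrLn line   -- or any per-line processing
          loop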

Where is the space leak in this code?

I am trying to split a file into two separate files by alternating lines (i.e. lines 1,3,5,7,... written to file 1 and lines 2,4,6,8,... written to file 2).
The file I am working with is ~700MB, so when I saw the memory usage balloon to over 6GB, I knew something was wrong.
main :: IO ()
main = withFile splitFile ReadMode splitData
  where
    splitData h = do
      dataSet <- lines <$> hGetContents h
      let (s1,s2) = foldl' (\(l,r) x -> (x:r,l)) ([],[]) dataSet
      writeFile testFile $ unlines s1
      writeFile trainingFile $ unlines s2
I initially was using the lazy version of foldl, but after some research it seemed that using the strict version would help. But alas, it made no noticeable difference. I also tried compiling with -O2, but that did nothing either.
I am using GHC 7.10.2
I'm not getting a stack overflow, so what is it using all that memory for?
As mentioned in a comment by @dfeuer, the use of writeFile forces the entire output string to be computed, which in turn forces the entire input to be read. The space leak is caused by the fact that the entire second file must be kept in memory while the first file is being written, when it is clear that only one line at a time needs to be kept in memory. And indeed the solution is to write line by line:
import Control.Monad
import System.IO

main :: IO ()
main =
  withFile splitFile ReadMode $ \hIn ->
    withFile testFile WriteMode $ \hOdd ->
      withFile trainingFile WriteMode $ \hEven ->
        zipWithM_ hPutStrLn (cycle [hOdd, hEven]) . lines =<< hGetContents hIn
This program runs in constant space.

Haskell getContents wait for EOF

I want to wait until user input terminates with EOF and then output it all whole. Isn't that what getContents supposed to do? The following code outputs each time user hits enter, what am I doing wrong?
import System.IO

main = do
  hSetBuffering stdin NoBuffering
  contents <- getContents
  putStrLn contents
The fundamental problem is that getContents is an instance of lazy IO. This means that getContents produces a thunk that can be evaluated like a normal Haskell value, and only does the relevant IO when it's forced.
contents is a lazy list that putStrLn tries to print, which forces the list and causes getContents to read as much as it can. putStrLn then prints everything that's forced, and continues trying to force the rest of the list until it hits []. As getContents reads more and more of the stream (the exact behavior depends on buffering), putStrLn can print more and more of it immediately, giving you the behavior you see.
While this behavior is useful for very simple scripts, it ties Haskell's evaluation order to observable effects, something it was never meant to do. This means that controlling exactly when parts of contents get printed is awkward, because you have to break the normal Haskell abstraction and understand exactly how things are getting evaluated.
This leads to some potentially unintuitive behavior. For example, if you try to get the length of the input—and actually use it—the list is forced before you get to printing it, giving you the behavior you want:
main = do
  contents <- getContents
  let n = length contents
  print n
  putStr contents
but if you move the print n after the putStr, you go back to the original behavior because n does not get forced until after printing the input (even though n still got defined before putStr was used):
main = do
  contents <- getContents
  let n = length contents
  putStr contents
  print n
Normally, this sort of thing is not a problem because it won't change the behavior of your code (although it can affect performance). Lazy IO just brings it into the realm of correctness by piercing the abstraction layer.
This also gives us a hint on how we can fix your issue: we need some way of forcing contents before printing it. As we saw, we can do this with length because length needs to traverse the whole list before computing its result. Instead of printing it, we can use seq which forces the lefthand expression to be evaluated at the same time as the righthand one, but throws away the actual value:
main = do
  contents <- getContents
  let n = length contents
  n `seq` putStr contents
At the same time, this is still a bit ugly because we're using length just to traverse the list, not because we actually care about it. What we would really like is a function that just traverses the list enough to evaluate it, without doing anything else. Happily, this is exactly what deepseq does (for many data structures, not just lists):
import Control.DeepSeq
import System.IO

main = do
  contents <- getContents
  contents `deepseq` putStr contents
This is a problem of lazy I/O. One simple solution is to use strict I/O, such as via ByteStrings:
import qualified Data.ByteString as S
main :: IO ()
main = S.getContents >>= S.putStr
You can use the replacement functions from the strict package:
import qualified System.IO.Strict as S

main = do
  contents <- S.getContents
  putStrLn contents
Note that for reading there isn't a need to set buffering. Buffering really only helps when writing to files.
The definition of the strict version of hGetContents in System.IO.Strict is pretty simple:
hGetContents :: IO.Handle -> IO.IO String
hGetContents h = IO.hGetContents h >>= \s -> length s `seq` return s
I.e., it forces everything to be read into memory by calling length on the string returned by the standard (lazy) version of hGetContents.
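For completeness, the text package gives a similar strict read out of the box; this is a small sketch of my own, not part of the answer above:
import qualified Data.Text.IO as T

-- Data.Text.IO.getContents is strict: all of stdin is read into memory
-- before anything is printed.
main :: IO ()
main = T.getContents >>= T.putStr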

Understanding `withFile` with Example

I implemented withFile in Haskell:
withFile' :: FilePath -> IOMode -> (Handle -> IO a) -> IO a
withFile' path iomode f = do
  handle <- openFile path iomode
  result <- f handle
  hClose handle
  return result
When I ran the main provided by Learn You a Haskell, it printed out the content of "girlfriend.txt," as expected:
import System.IO

main = do
  withFile' "girlfriend.txt" ReadMode (\handle -> do
    contents <- hGetContents handle
    putStr contents)
I wasn't sure if my withFile' would work given its last 2 lines: (1) closing the handle and (2) returning the result as an IO a.
Why didn't the following happen?
1. result gets lazily bound to f handle
2. hClose handle closes the file handle
3. result gets return'd, which results in the actual evaluation of f handle. Since handle was closed, an error gets thrown.
Lazy IO is popularly known as confusing.
It depends on whether putStr executes before hClose or not.
Notice the difference between the first and second uses below: the first prints the contents of temp.hs (which happens to contain this very withFile' definition), while the second prints nothing. (The brackets are unnecessary but clarifying in the second example.)
ghci> withFile' "temp.hs" ReadMode (hGetContents >=> putStr) -- putStr
import System.IO
import Control.Monad
withFile' :: FilePath -> IOMode -> (Handle -> IO a) -> IO a
withFile' path iomode f = do
  handle <- openFile path iomode
  result <- f handle
  hClose handle
  return result
ghci> (withFile' "temp.hs" ReadMode hGetContents) >>= putStr
ghci>
In both cases, the f passed in gets a chance to run before the handle is closed. Because of lazy evaluation, hGetContents only reads the file if it needs to, i.e. is forced to in order to produce output for some other function.
In the first example, since f is (hGetContents >=> putStr), the full contents of the file must be read in order to execute putStr.
In the second example, nothing needs to be evaluated after hGetContents in order to return result, which is a lazy list. (I can quite happily return (show [1..]) which will only fail to terminate if I choose to use the entire output.) This is seen as a problem for lazy IO, which is fixed by alternatives such as strict IO, pipes or conduit.
Maybe returning the empty string for a file when the handle was closed prematurely is a bug, but certainly running the entirety of f before closing it is not.
Equational reasoning means that you can reason about Haskell code by just inlining and substituting things (with certain caveats, but they don't apply here).
This means that all I need to do to understand your code is to take the withFile' here:
import System.IO
main = do
  withFile' "girlfriend.txt" ReadMode (\handle -> do
    contents <- hGetContents handle
    putStr contents)
... and inline its definition:
main = do
  handle <- openFile "girlfriend.txt" ReadMode
  contents <- hGetContents handle
  result <- putStr contents
  hClose handle
  return result
Once you inline its definition, it's easier to see what is going on. putStr evaluates the entire contents of the file before you close the handle, so there is no error. Also, result is not what you think it is: it's the return value of putStr, which is just (), not the contents of the file.
Most IO actions are not lazily executed.
IO action execution is different from normal Haskell evaluation of values. IO execution is only ever carried out by the outer driver that is trying to execute all the effects of main; it does so in the correct order implied by the monadic sequencing of IO actions.
The driver's need to know what the next IO action is ultimately triggers all evaluation of lazy values in Haskell; if it were happy with an unevaluated lazy value and moved on to the next thing without fully evaluating and executing it, then it would just leave main unevaluated and no Haskell program could ever do anything.
The Haskell value resulting from executing an IO action may of course be an unevaluated lazy value, but each IO action itself is evaluated and executed by the driver (including all sub-actions sequenced with do blocks or binds).
So result doesn't get lazily bound to f handle completely unevaluated; f handle is evaluated to come up with the sub actions hGetContents handle and putStr contents. These are both fully executed before the outer driver moves on to hClose handle, so everything's okay.
Note however that hGetContents is special. Quoting from the documentation:
Computation hGetContents hdl returns the list of characters corresponding to the unread portion of the channel or file managed by hdl, which is put into an intermediate state, semi-closed. In this state, hdl is effectively closed, but items are read from hdl on demand and accumulated in a special list returned by hGetContents hdl.
Any operation that fails because a handle is closed, also fails if a handle is semi-closed. The only exception is hClose. A semi-closed handle becomes closed:
if hClose is applied to it;
if an I/O error occurs when reading an item from the handle;
or once the entire contents of the handle has been read.
Once a semi-closed handle becomes closed, the contents of the associated list becomes fixed. The contents of this final list is only partially specified: it will contain at least all the items of the stream that were evaluated prior to the handle becoming closed.
So executing hGetContents handle actually results in a partially evaluated list, whose lazy evaluation is tied to further IO operations under the hood. This is impossible to do yourself without using the Unsafe family of operations, since it is essentially bypassing the type system and can result in exactly the sort of problem you were concerned about; if you had attempted the following code:
main = do
  text <- withFile' "girlfriend.txt" ReadMode (\handle -> do
    contents <- hGetContents handle
    return contents)
  putStr text
(where the function passed to withFile' tries to return the file contents, and they are passed to putStr after the withFile' call), then the putStr would be executed after hClose, and the file may well not have been fully read before it was closed.
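One way to make that last pattern safe (a sketch of mine, assuming the withFile' from the question is in scope) is to force the whole string inside the callback, so the file is fully read while the handle is still open:
import Control.DeepSeq (force)
import Control.Exception (evaluate)
import System.IO

main :: IO ()
main = do
  text <- withFile' "girlfriend.txt" ReadMode $ \handle -> do
    contents <- hGetContents handle
    -- Force the entire contents into memory before the handle is closed.
    evaluate (force contents)
  putStr text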

Better data stream reading in Haskell

I am trying to parse an input stream where the first line tells me how many lines of data there are. I'm ending up with the following code, and it works, but I think there is a better way. Is there?
main = do
  numCases <- getLine
  proc $ read numCases

proc :: Integer -> IO ()
proc numCases
  | numCases == 0 = return ()
  | otherwise = do
      str <- getLine
      putStrLn $ findNextPalin str
      proc (numCases - 1)
Note: The code solves the Sphere problem https://www.spoj.pl/problems/PALIN/ but I didn't think posting the rest of the code would impact the discussion of what to do here.
Use replicate and sequence_.
main, proc :: IO ()
main = do numCases <- getLine
          sequence_ $ replicate (read numCases) proc

proc = do str <- getLine
          putStrLn $ findNextPalin str
sequence_ takes a list of actions, and runs them one after the other, in sequence. (Then it throws away the results; if you were interested in the return values from the actions, you'd use sequence.)
replicate n x makes a list of length n, with each element being x. So we use it to build up the list of actions we want to run.
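As a small aside (my addition, not part of the answer): sequence_ combined with replicate is exactly what replicateM_ from Control.Monad provides, so the same main can be written as:
import Control.Monad (replicateM_)

main :: IO ()
main = do
  numCases <- getLine
  replicateM_ (read numCases) proc   -- proc as defined in the answer above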
Dave Hinton's answer is correct, but as an aside here's another way of writing the same code:
import Control.Applicative
main = (sequence_ . proc) =<< (read <$> getLine)
proc x = replicate x (putStrLn =<< (findNextPalin <$> getLine))
Just to remind everyone that do blocks aren't necessary! Note that in the above, both =<< and <$> stand in for plain old function application. If you ignore both operators, the code reads exactly the same as similarly-structured pure functions would. I've added some gratuitous parentheses to make things more explicit.
Their purpose is that <$> applies a regular function inside a monad, while =<< does the same but then compresses an extra layer of the monad (e.g., turning IO (IO a) into IO a).
The interesting part of looking at code this way is that you can mostly ignore where the monads and such are; typically there's very few ways to place the "function application" operators to make the types work.
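If it helps to see that correspondence spelled out, here is the do-notation equivalent of the pointfree per-case action (a sketch of mine; findNextPalin is only a stand-in for the asker's real function):
-- Stand-in for the asker's function, just so the snippet compiles.
findNextPalin :: String -> String
findNextPalin = id

-- putStrLn =<< (findNextPalin <$> getLine) is the same action as:
step :: IO ()
step = do
  str <- getLine
  putStrLn (findNextPalin str)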
You (and the previous answers) should work harder to divide up the IO from the logic. Make main gather the input and separately (purely, if possible) do the work.
import Control.Monad  -- needed for liftM and replicateM

main = do
  numCases <- liftM read getLine
  lines <- replicateM numCases getLine
  let results = map findNextPalin lines
  mapM_ putStrLn results
When solving SPOJ problems in Haskell, try not to use standard strings at all. ByteStrings are much faster, and I've found you can usually ignore the number of tests and just run a map over everything but the first line, like so:
{-# OPTIONS_GHC -O2 -optc-O2 #-}

import qualified Data.ByteString.Lazy.Char8 as BS

main :: IO ()
main = do
  (l:ls) <- BS.lines `fmap` BS.getContents
  mapM_ findNextPalin ls  -- assumes findNextPalin here is an IO action that prints its result
The SPOJ page in the Haskell Wiki gives a lot of good pointers about how to read Ints from ByteStrings, as well as how to deal with a large quantities of input. It'll help you avoid exceeding the time limit.
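For example (a sketch of my own, not taken from the wiki page), readInt from Data.ByteString.Lazy.Char8 parses the leading test count without ever converting to String:
import qualified Data.ByteString.Lazy.Char8 as BS

main :: IO ()
main = do
  (l:ls) <- BS.lines `fmap` BS.getContents
  let n = maybe 0 fst (BS.readInt l)   -- number of test cases from the first line
  mapM_ BS.putStrLn (take n ls)        -- placeholder for the per-line processing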
