I'm reading "Learn You a Haskell for Great Good!", chapter 9: Input and Output. There is an example for explaining streams:
main = do
    withFile "something.txt" ReadMode (\handle -> do
        contents <- hGetContents handle
        putStr contents)
The book says:
That's why in this case it actually reads a line, prints it to the
output, reads the next line, prints it, etc.
But earlier, for the same example, it also says:
That's really cool because we can treat contents as the whole contents
of the file, but it's not really loaded in memory.
I'm new to functional programming and I'm really confused about this: why can we treat contents as the whole contents if it reads one line at a time? I thought contents in contents <- hGetContents handle was just the content of one line. Does Haskell save the content of every line into temporary memory, or something else?
How to understand streams in Haskell
You can think of it as a function which, when invoked, returns some of the result (not all of it) along with a callback function to get the rest when you need it. So technically it gives you the entire content, but one chunk at a time, and only if you ask for the rest of it.
If Haskell did not have non-strict semantics, you could implement this concept with something like:
data Stream a = Stream [a] (() -> Stream a)

instance (Show a) => Show (Stream a) where
    show (Stream xs _) = show xs ++ " ..."

rest :: Stream a -> Stream a -- ask for the rest of the stream
rest (Stream _ f) = f ()
Then say you want a stream which iterates integers. You can return the first 3 and postpone the rest until the user asks for it:
iter :: Int -> Stream Int
iter x = Stream [x, x + 1, x + 2] (\_ -> iter (x + 3))
then,
> iter 0
[0,1,2] ...
but if you keep asking for the rest, you get the entire content
> take 5 $ iterate rest (iter 0)
[[0,1,2] ...,[3,4,5] ...,[6,7,8] ...,[9,10,11] ...,[12,13,14] ...]
or
> let go (Stream [i, j, k] _) acc = i:j:k:acc
> take 20 . foldr go [] $ iterate rest (iter 0)
[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19]
That is the same story with line buffering under the hood. It reads and returns the first line, but then you can ask for the next line, and the next line, ... So technically you get the entire content even though it only reads one line at a time.
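To see the same effect with a file, here is a minimal sketch (assuming a file named something.txt exists): even though contents stands for the whole file, only as much as take 2 demands is actually read from the handle.

import System.IO

main :: IO ()
main = withFile "something.txt" ReadMode (\handle -> do
    contents <- hGetContents handle
    mapM_ putStrLn (take 2 (lines contents))) -- forces reading only ~2 lines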
why can we treat contents as the whole contents if it reads one line at a time?
First, note that the content is not necessarily read line by line (although it can be; I will come to that later). What the author meant is that even though the entire file is not loaded into memory, you can assume conceptually that the variable contents holds the whole content of the file. This is possible because of the lazy streaming of the file (if you are more interested, you can read the source to see the low-level details; it basically uses unsafeInterleaveIO to achieve this).
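A simplified sketch of that trick, not the real implementation of hGetContents:

import System.IO
import System.IO.Unsafe (unsafeInterleaveIO)

-- Each character is read from the handle only when the corresponding
-- part of the string is actually demanded.
lazyRead :: Handle -> IO String
lazyRead h = unsafeInterleaveIO $ do
    eof <- hIsEOF h
    if eof
        then return []
        else do
            c  <- hGetChar h
            cs <- lazyRead h -- deferred until the tail is demanded
            return (c : cs)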
does Haskell save the content of every line into temporary memory or something else?
That depends on the type of buffering used. According to the documentation it depends on the underlying file system:
The default buffering mode when a handle is opened is
implementation-dependent and may depend on the file system object
which is attached to that handle. For most implementations, physical
files will normally be block-buffered and terminals will normally be
line-buffered.
But you can use hGetBuffering :: Handle -> IO BufferMode to see for yourself which buffering mode you are in.
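For example (file name is illustrative):

import System.IO

main :: IO ()
main = withFile "something.txt" ReadMode $ \h -> do
    mode <- hGetBuffering h
    print mode -- e.g. BlockBuffering Nothing for a regular file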
Note that hGetContents itself never reads anything; it simply returns an IO action that, when ultimately executed, will read from the file. When combined with putStr as hGetContents handle >>= putStr (which is simply the do notation desugared), you get an IO action that reads from the handle and outputs its contents to the screen. At no point in the Haskell program itself do you ever specify how that happens; it is entirely up to the Haskell runtime and how it executes the IO action you create.
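For reference, the example from the question with the do notation desugared:

import System.IO

main :: IO ()
main = withFile "something.txt" ReadMode (\handle ->
    hGetContents handle >>= putStr)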
Related
I want to select the n-th last line from a large text file (~10GB) in a Haskell program.
I found a way to get the n-th last line from an in-memory string:
myLen = 7
n = 3 -- one-based from the end
myLines = lines myText
idx = myLen - n
theLine = head (drop idx myLines)

main :: IO ()
main = do
    putStrLn theLine
The documentation about the readFile function says it "reads the content lazily", so once readFile got to the n-th last line will it have stored all the lines before in memory (and then explodes because I don't have that much memory)?
So, is readFile the right approach here? Plus how do I get the IO String output from readFile "in a lazy way" into a list of lines so that I can then select the n-th last line?
The question has several parts:
The documentation about the readFile function says it "reads the content lazily", so once readFile got to the n-th last line will it have stored all the lines before in memory (and then explodes because I don't have that much memory)?
Not necessarily. If you only iterate over the contents and produce a result, then the garbage collector should deallocate the contents.
So, is readFile the right approach here?
My opinionated answer is that if it's for a serious tool, readFile isn't the right approach because "lazy IO" is a can of worms.
If it's for a quick and dirty script then go ahead, but if not, and if performance is important, then it is probably best to use lower-level calls to read strict ByteStrings, and, for your problem, to read directly from the end of the file and process that.
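A rough sketch of that lower-level approach (the function name and chunk handling are illustrative assumptions; scanning the chunk backwards for newlines is left out):

import System.IO
import qualified Data.ByteString.Char8 as BS

-- Seek near the end of the file and strictly read the final chunk.
lastChunk :: FilePath -> Integer -> IO BS.ByteString
lastChunk path chunkSize = withFile path ReadMode $ \h -> do
    size <- hFileSize h
    hSeek h AbsoluteSeek (max 0 (size - chunkSize))
    BS.hGetContents h -- strict: reads from here to EOF into memory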
The following program will require only about as much memory as the longest n lines in the file being read:
-- like drop, but takes its number encoded as a lazy
-- unary number via the length of the first list
dropUnary :: [a] -> [b] -> [b]
dropUnary [] bs = bs
dropUnary (_:as) (_:bs) = dropUnary as bs

takeLast :: Int -> [a] -> [a]
takeLast n as = dropUnary (drop n as) as

main :: IO ()
main = putStrLn . head . takeLast 3 . lines =<< readFile "input.txt" -- file path assumed; the original omitted it
The Prelude's lines function is already suitably lazy, but some care was taken in writing takeLast here. You can think of this as operating in "one pass" of the file, looking at subsequent chunks of n consecutive lines until it finds the last chunk. Because it does not maintain any references to the contents of the file from before the chunk it's currently looking at, all of the file contents up to the current chunk can be garbage collected (and generally is, fairly soon).
In contrast to the information in "Learn You a Haskell", on my Windows system, GHCi translates Ctrl-D to EOT, not EOF.
Thus, when I do something like:
input <- getContents
doSomething input
where doSomething is a function that consumes the input.
Doing that, I have to press Ctrl-Z to end my input text, which makes sense, since getContents is intended for process piping...
But if I repeat the above steps a second time, it fails because stdin is closed.
So, while browsing System.IO, I could not find an alternative to getContents which would react to EOT.
Do I have to write such a function myself or is it to be found in another package, maybe?
By the way, the version of GHCi I use is 8.2.2.
Also, I do not want single-line processing. I am aware of getLine, but it is not what I want in this case.
Here is the function I was looking for:
getContentsEOT :: IO String
getContentsEOT =
    getChar >>= \c ->
        if c == '\EOT'
            then return ""
            else getContentsEOT >>= \s ->
                     return (c : s)
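A usage sketch, with putStrLn standing in for the question's doSomething:

main :: IO ()
main = do
    input <- getContentsEOT -- first batch, ended by Ctrl-D (EOT)
    putStrLn input
    input2 <- getContentsEOT -- stdin was never closed, so this works again
    putStrLn input2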
I'm playing with the interact function from Prelude, wanting to do a simple REPL evaluating my inputs line by line, and I cannot understand what's going on.
If I make it simple like this:
main :: IO ()
main = interact interaction
interaction :: String -> String
interaction (x:xs) = xs
interaction x = x
Then it behaves ok, and removes the first character from my input, or returns the input if it is only one character long.
What puzzles me is that if I add this line:
interaction :: String -> String
interaction x | length x > 10 = "long word" -- this line causes problem
interaction (x:xs) = xs
interaction x = x
Then interact no longer seems to work correctly.
It just awaits my input, swallows it while awaiting more input, and so on, but never outputs anything.
It seems so simple, yet I cannot see what is going wrong.
Any idea?
(On my path I have GHC 7.6.3; I don't know if that matters.)
With what you've written, you're trying to calculate the length of the whole input sequence, so your program has to wait for the entire sequence to be available.
You could try a pattern match like this, which only demands the first eleven characters:
interaction (x1:x2:x3:x4:x5:x6:x7:x8:x9:x10:x11:_) = "long word"
This allows you to ignore the rest of the input once you know you've reached 10 characters.
A cleaner/more general alternative (suggested by #amalloy) that scales for bigger lengths and allows a variable length guard would be something like:
interaction xs | not . null . drop 10 $ xs = "long word"
If what you really want to do is process your input a line at a time, and produce this message for an individual line longer than 10 characters, you can use lines and unlines to make your interaction function line-oriented rather than character-oriented, e.g.:
main :: IO ()
main = interact (unlines . interaction . lines)
interaction :: [String] -> [String]
interaction (x:_) | length x > 10 = ["long word"] -- this is just looking at the first line
...
or maybe if you want to do that for every line, not just the first:
main :: IO ()
main = interact (unlines . map interaction . lines)
interaction :: String -> String
interaction x | length x > 10 = "long word"
...
interact takes the entirety of standard input at once, as one big string. You call length on all of stdin, and so your function cannot return until stdin is exhausted entirely. You could, for example, hit Ctrl-D (assuming Unix) to send EOF, and then your function will finally find out what stdin's length is.
I am trying to expand regular markdown with the ability to have references to other files, such that the content in the referenced files is rendered at the corresponding places in the "master" file.
But the furthest I've come is to implement
createF :: FTree -> IO String
createF Null = return ""
createF (Node f children) =
    ifNExists f (_id f)
        (do childStrings <- mapM createF children
            withFile (_path f) ReadMode $ \handle -> do
                fc <- lines <$> hGetContents handle
                return $ merge fc childStrings)
ifNExists is just a helper that can be ignored; the real problem happens in the reading of the handle, which just returns the empty string. I assume this is due to lazy IO.
I thought that using withFile filepath ReadMode $ \handle -> {- do stuff -} hGetContents handle would be the right solution, as I've read that fcontent <- withFile filepath ReadMode hGetContents is a bad idea.
Another thing that confuses me is that the function
createFT :: File -> IO FTree
createFT f =
    ifNExists f Null
        (withFile (_path f) ReadMode $ \handle -> do
            let thisParse = fparse (_id f : _parents f)
            children <- rights . map (thisParse . trim) . lines <$> hGetContents handle
            c <- mapM createFT children
            return $ Node f c)
works like a charm.
So why does createF return just an empty string?
The whole project and a directory/file to test with can be found on GitHub.
Here are the datatype definitions:

type ID = String

data File = File { _id :: ID, _path :: FilePath, _parents :: [ID] }
    deriving (Show)

data FTree = Null
           | Node { _file :: File
                  , _children :: [FTree]
                  } deriving (Show)
As you suspected, lazy IO is probably the problem. Here's the (awful) rule you have to follow to use it properly without going totally nuts:
A withFile computation must not complete until all (lazy) I/O required to fully evaluate its result has been performed.
If something forces I/O after the handle is closed, you are not guaranteed to get an error, even though that would be very nice. Instead, you get completely undefined behavior.
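The classic one-liner form of this mistake, for illustration:

import System.IO

main :: IO ()
main = do
    contents <- withFile "something.txt" ReadMode hGetContents
    putStr contents -- forces the lazy read after the handle is closed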
You break this rule with return $ merge fc childStrings, because this value is returned before it's been fully evaluated. What you can do instead is something vaguely like
let retVal = merge fc childStrings
retVal `deepseq` return retVal
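(deepseq comes from Control.DeepSeq.) Applied to your function, a sketch of the fix might look like this, assuming merge and ifNExists from the question:

import Control.DeepSeq (deepseq)
import System.IO

createF :: FTree -> IO String
createF Null = return ""
createF (Node f children) =
    ifNExists f (_id f)
        (do childStrings <- mapM createF children
            withFile (_path f) ReadMode $ \handle -> do
                fc <- lines <$> hGetContents handle
                let retVal = merge fc childStrings
                retVal `deepseq` return retVal)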
An arguably cleaner alternative is to put all the rest of the code that relies on those results into the withFile argument. The only real reason not to do that is if you do a bunch of other work with the results after you're finished with that file. For example, if you're processing a bunch of different files and accumulating their results, then you want to be sure to close each of them when you're done with it. If you're just reading in one file and then acting on it, you can leave it open till you're finished.
By the way, I just submitted a feature request to the GHC team to see if they might be willing to make these kinds of programs more likely to fail early with useful error messages.
Update
The feature request was accepted, and such programs are now much more likely to produce useful error messages. See What caused this "delayed read on closed handle" error? for details.
I'd strongly suggest you avoid lazy IO, as it always creates problems like this, as described in What's so bad about Lazy I/O? As in your case: you need to keep the file open until it's fully read, but that would mean closing the file somewhere in pure code, when the content is actually consumed.
One possibility would be to use strict ByteStrings and read files using readFile. This would also make many operations more efficient.
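A minimal sketch of that strict variant (the helper name is illustrative):

import qualified Data.ByteString.Char8 as BS

-- BS.readFile reads the whole file strictly and closes it before
-- returning, so there is no lazy-IO hazard.
readLinesStrict :: FilePath -> IO [String]
readLinesStrict path = map BS.unpack . BS.lines <$> BS.readFile path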
Another option would be to use one of the libraries that address the lazy IO problem (see What are the pros and cons of Enumerators vs. Conduits vs. Pipes?). These libraries allow you to separate content production from its processing or consumption. So you could have a producer that reads input files and produces a stream of some tokens, and a pure consumer (not depending on IO) that consumes the stream and produces some result. For example, in conduit-extra there is a module that converts an atto-parsec parser into a consumer.
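For a flavour of that streaming style, a sketch using the conduit package (file names are illustrative):

import Conduit
import qualified Data.Text as T

-- Reads, transforms and writes in constant memory, with no lazy IO.
main :: IO ()
main = runConduitRes
     $ sourceFile "input.txt"
    .| decodeUtf8C
    .| mapC T.toUpper
    .| encodeUtf8C
    .| sinkFile "output.txt"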
See also Is there a better way to walk a directory tree?
I am trying to take user input and convert it into a list of tuples.
What I want to do is take the data from the user, convert it into the form
[(Code, Name, Price)], and finally combine this user input with the previous list and write the new list to the same file.
The problem I am facing is that as soon as the program finishes taking user input, WinHugs shows an error like this: Program error: Prelude.read: no parse.
Here is the code:
type Code = Int
type Price = Int
type Name = String
type ProductDatabase = (Code, Name, Price)

finaliser = do
    a <- input_taker
    b <- list_returner
    let w = a ++ b
    outh <- openFile "testx.txt" WriteMode
    hPrint outh w
    hClose outh
The problem is that you're using lazy IO to read from a file while you're writing to it at the same time. This causes problems when read sees data that has been partially written.
We need to force the reading of the input data to be complete before you try writing to the file. One way of doing this is to use seq to force the list of products to be read into memory.
list_returner :: IO [ProductDatabase]
list_returner = do
    inh <- openFile "testx.txt" ReadMode
    product_text <- hGetContents inh
    let product :: [ProductDatabase]
        product = read product_text
    product `seq` hClose inh
    return product
Also, this will fail if the file is empty. The file should contain at least [] before running your code the first time, so that it will parse as the empty list.
The code looks fine to me, except for certain style points; it should work like that. Try to separate concerns more. The exception "no parse" means that the read function was unable to convert its argument string to the desired type. The base library that comes with Hugs may be more restrictive about spaces and line feeds. I would recommend using GHC instead of Hugs in general.
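For example, in GHCi (values illustrative):

ghci> read "[(1,\"tea\",10)]" :: [(Int, String, Int)]
[(1,"tea",10)]
ghci> read "" :: [(Int, String, Int)]
*** Exception: Prelude.read: no parse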
In case you're interested: one style point you may want to consider is using withFile instead of an openFile/hClose combination. You may also want to use writeFile with show:
writeFile "testx.txt" (show w)
Another style point: Your input_taker action should not return a list. There is really no reason to return a list. Return a single tuple instead, so you can use (:) instead of (++). In general the usage of (++) indicates that you may be taking the wrong approach.
Furthermore, your ProductDatabase type name is misleading, because I would interpret [ProductDatabase] as a list of databases. Your tuple is a Product.
Final style point: This is really just about code beauty, so it's controversial. This is not C/C++, so you would really want to write f x instead of f(x):
...
    return product

-- Since your `Product` is just a type alias, I would use
-- a smart constructor:
product :: Code -> Name -> Price -> Product
product = (,,)

readProduct :: IO Product
readProduct = do
    ...
    code <- fmap read getLine
    ...
    name <- getLine
    ...
    price <- fmap read getLine
    return (product code name price)