I recently did the Waterloo CCC and I feel that Haskell is the perfect language for answering these types of questions. I am still learning it. I am struggling a bit with the input, though.
Here's what I'm using:
import System.IO
import System.Environment
import System.FilePath
…
main = do
    name <- getProgName
    args <- getArgs
    input <- readFile $
        if not (null args)
            then head args
            else dropExtension name ++ ".in"
    let (k:code:_) = lines input
    putStrLn $ decode (read k) code
As you can see, I'm reading from the file path given on the command line, or, for example, from j1.in if this program is called j1.hs and compiled to j1.
I am only interested in the first two lines of the file, so I have used pattern matching to get those lines and bind them to k and code, in this example. And I then read k as an integer and pass it and the code string to my decode function, which I output.
I'm wondering if readFile is loading the entire file into memory, which would be bad. But then I started thinking, maybe since Haskell is lazy, it only ever reads the first two lines because that's all it's asked for later on. Am I right?
Also, if there is anything with that code sample that could be better or more idiomatic, please let me know.
The documentation for readFile says:
The readFile function reads a file and returns the contents of the file as a string. The file is read lazily, on demand, as with getContents.
So yes, it will only necessarily read the first two lines of the file (buffering means it will probably read more behind the scenes). But this is a property of readFile specifically, not of all Haskell I/O functions in general.
Lazy I/O is a bad idea for I/O-heavy programs (e.g. webservers) but it works nicely for simple programs that don't do much I/O.
Yes, readFile is lazy. If you want to be explicit about it, you could use:
import Control.Monad (replicateM)
import System.IO
readLines n f = withFile f ReadMode $ replicateM n . hGetLine
-- in main
(k:code:_) <- readLines 2 filename
This will ensure the file is closed as soon as possible.
But the way you've done it is fine.
readFile reads the file lazily, so it won't read the entire file into memory unless you use the entire file. It will not usually read exactly the first two lines, since it reads in blocks, but it will only read as many blocks as needed to find the second newline.
I/O in Haskell isn't usually lazy. However, the readFile function specifically is lazy.
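To see this concretely, here is a small self-contained sketch (the file name sample.in and its contents are made up for illustration). Because readFile's result is produced on demand, pattern matching on the first two lines forces only enough of the file to find the second newline:

```haskell
-- Pure helper: pull out the first two lines of a string. Applied to
-- readFile's lazy result, it only forces reads up to the second newline.
firstTwo :: String -> (String, String)
firstTwo s = case lines s of
    (a : b : _) -> (a, b)
    _           -> error "expected at least two lines"

main :: IO ()
main = do
    writeFile "sample.in" "3\nTOPSECRET\nmore lines that are never needed\n"
    contents <- readFile "sample.in"   -- lazy: nothing is read yet
    let (k, code) = firstTwo contents  -- this demand triggers the reads
    putStrLn (k ++ " " ++ code)
```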
Others have said the same thing. What I haven't seen anybody point out yet is that the file you've opened won't get closed until either the program ends or the garbage collector runs. That just means that the OS file handle might be kept open longer than necessary. In your program that's probably no big deal. But in a more complicated project, it could be.
Related
I'm somewhat new to Haskell. I've worked through one or two tutorials, but I don't have much experience.
Every function in Haskell is pure; this is why we can't have any I/O without the IO monad.
What I don't understand is why accessing the program parameters has to be an IO action as well.
The parameters are passed to the program much like arguments are passed to a function.
Why can't the parameters be accessed like arguments in a function?
To make things clear, I don't understand why the main function has to look like this
main :: IO ()
main = do
    args <- getArgs
    print args
Instead of this
main :: [String] -> IO ()
main args = do
    print args
I can't see any reason for it, and I haven't found an answer googling around.
It's a language design choice. Neither approach is strictly better than the other one.
Haskell could have been designed to have a main of either kind.
When one does need the program arguments, it would be more convenient to have them passed as function arguments to main.
When one does not need the program arguments, having them passed to main is slightly cumbersome, since we need to write a longer type, and an additional _ to discard the [String] argument.
Further, getArgs lets one access the program arguments anywhere in the program (inside IO). If they were passed only to main, one would be forced to thread them through the rest of the program explicitly, which can be inconvenient.
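As a tiny made-up illustration of that point, any IO helper buried deep in the program can consult the arguments itself, without main threading them through:

```haskell
import System.Environment (getArgs)

-- A hypothetical helper: it reads the arguments directly, so main
-- never has to pass them down explicitly.
verbose :: IO Bool
verbose = elem "-v" <$> getArgs

main :: IO ()
main = do
    v <- verbose
    putStrLn (if v then "verbose mode" else "quiet mode")
```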
(Short digression) For what it's worth, I had a similar reaction to yours a long time ago when I discovered that in Java we have void main() instead of int main() as in C. Then I realized that in most programs I always wrote return 0; at the end, so it makes little sense to always require that. In Java that's the implicit default, and when we really need to return something else, we use System.exit(). Even if that is the way it's done in a previous language (C, in this case), new languages can choose a new way to make available the same functionality.
I tend to agree with chi's answer, that there's no clearly compelling reason it has to be done either way, so it really comes down to a somewhat subjective judgement call that was made by a small group of people a long time ago. There's no guarantee that there is going to be any particularly satisfying reason behind it.
Here is some reasoning that comes to my mind (which may or may not have been something the original designers thought of at the time, or even would agree with).
What we'd really love to be able to do is something like:
main :: Integer -> String -> Set Flag -> IO ()
(for some hypothetical program that takes as command line arguments an integer, a string, and a set of flags)
Being able to write small command line programs as if they were just a function of their command line arguments would be great! But that would need the operating system (or at least the shell) to understand the types used in a Haskell program and know how to parse them (and what to do if parsing fails, or if there aren't enough arguments, or etc), which isn't going to happen.
Perhaps we could write a wrapper to do that. It could take care of parsing the raw string command line arguments into Haskell types and generating error messages (if needed), and then call main for us. But wait, we can do exactly that! We just have to call the wrapper main (and rename what we were previously calling main)!
The point is this: if you want to think of your program as a simple function of external inputs, that makes a lot of sense, but main is not that function. main works much better as a wrapper that takes care of the ugly details of receiving input over an untyped interface and calling the function that "really is" your program.
Forcing you to include a call to getArgs in your setup code makes it more apparent that there's more to handling command line arguments than just getting access to them, and possibly nudges you toward writing some of that extra handling code rather than just writing main (arg1 : arg2 : _) = do stuffWith arg1 arg2.
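As a hedged sketch of what that extra handling might look like (realMain and the particular argument shape are made up for illustration), main stays a thin wrapper that validates and parses the raw strings before handing typed values to the function that is really your program:

```haskell
import System.Environment (getArgs, getProgName)
import System.Exit (exitFailure)

-- The function that "really is" the program: it takes typed arguments.
realMain :: Int -> String -> IO ()
realMain n s = putStrLn (concat (replicate n s))

-- main is just the wrapper over the untyped [String] interface.
main :: IO ()
main = do
    args <- getArgs
    case args of
        [nStr, s] | [(n, "")] <- reads nStr -> realMain n s
        _ -> do
            name <- getProgName
            putStrLn ("usage: " ++ name ++ " <count> <string>")
            exitFailure
```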
Also, it is super trivial to convert the interface we have to the one you want:
import System.Environment
main = real_main =<< getArgs
real_main :: [String] -> IO ()
real_main args = print args
So you can have it whichever way you prefer!
If I have a possibly infinite list of IO-monads, and I am guaranteed that their sequential execution will not be affected by other IOs, can I somehow make it lazily sequenced (evaluated)?
To clarify my point, here is some pseudo-Haskell code demonstrating what I had in mind:
main = do
    inputs <- sequence . repeat $ getLine -- we are forever stuck here
    mapM_ putStrLn inputs                 -- not going to run
Now, I know that in the particular example above, we can just use getContents to get the effect I want
main = do
    inputs <- lines <$> getContents
    mapM_ putStrLn inputs
but in my application the IO actions are not getLine but an external function get1 :: IO (Maybe Record). However, this actually brings up my point, because apparently getContents internally uses unsafeInterleaveIO to achieve this lazy effect. My question is: is that necessary? (If you are interested in what exactly I want to do, please refer to this question.)
Maybe you are looking for this?
main = do
    let inputs = repeat getLine
    mapM_ (>>= putStrLn) inputs
Just to bell the cat: No.
If you have something of type IO a, there is no way to get any information out of that a without fully executing the IO action. Conceptually, an act :: IO a could be defined as an arbitrarily complicated set of actions, followed by producing a result dependent on a call to the system random number generator.
The entire purpose of unsafeInterleaveIO is to perform what you're asking, and it can't be done without it.
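For completeness, here is roughly how unsafeInterleaveIO achieves this (a sketch only; lazyResults is a made-up name, and the demo producer stands in for the question's get1 :: IO (Maybe Record)):

```haskell
import System.IO.Unsafe (unsafeInterleaveIO)
import Data.IORef

-- Lazily run an IO (Maybe a) producer until it yields Nothing,
-- returning its results as a lazy list. Each step is deferred with
-- unsafeInterleaveIO, so actions run only as the list is demanded.
lazyResults :: IO (Maybe a) -> IO [a]
lazyResults get1 = unsafeInterleaveIO $ do
    mx <- get1
    case mx of
        Nothing -> return []
        Just x  -> do
            rest <- lazyResults get1   -- deferred again, not run here
            return (x : rest)

-- Demo producer backed by an IORef, playing the role of get1.
main :: IO ()
main = do
    ref <- newIORef [1, 2, 3, 4, 5 :: Int]
    let get1 = do
            xs <- readIORef ref
            case xs of
                []       -> return Nothing
                (y : ys) -> writeIORef ref ys >> return (Just y)
    rs <- lazyResults get1
    print (take 2 rs)          -- only the first two actions actually run
    readIORef ref >>= print    -- the rest of the input is still unconsumed
```

This is essentially the trick getContents uses internally, with all of lazy IO's usual caveats about when the effects actually happen.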
This question already has answers here:
withFile vs. openFile
(5 answers)
Closed 7 years ago.
In my current working directory there is a file named test.txt, which contains "Test\n".
With System.IO.readFile, GHCI returns the content:
Prelude System.IO> readFile "test.txt"
"Test\n"
But not so with the following, which should be equal in my opinion:
Prelude System.IO> withFile "test.txt" ReadMode hGetContents
""
Why is it not the case? How to get the whole file contents within the withFile IO action?
TL;DR: Lazy IO is evil.
What happens is that hGetContents returns an IO-lazy list of the file contents. This means that the file handle will be read only when said list is actually accessed. Then the control passes to withFile which closes the file handle. Finally, the result is printed, and the list is demanded: only now a read is performed on the handle. Alas, it's too late.
As an ugly, manual "flush" of this laziness, you can try e.g.
hGetCont handle = do
    c <- hGetContents handle
    length c `seq` return c
The above forces the length of the list to be computed, hence forcing the whole file to be read. Reid Barton below suggests more beautiful alternatives, which avoid the use of the horribly evil lazy IO.
I have the following snippet of code, which I pass to withFile:
text <- hGetContents hand
let code = parseCode text
return code
Here hand is a valid file handle, opened with ReadMode and parseCode is my own function that reads the input and returns a Maybe. As it is, the function fails and returns Nothing. If, instead I write:
text <- hGetContents hand
putStrLn text
let code = parseCode text
return code
I get a Just, as I should.
If I do openFile and hClose myself, I have the same problem. Why is this happening? How can I cleanly solve it?
Thanks
hGetContents isn't too lazy; it just needs to be composed with other things appropriately to get the desired effect. Maybe the situation would be clearer if it were renamed exposeContentsToEvaluationAsNeededForTheRestOfTheAction or just listen.
withFile opens the file, does something (or nothing, as you please -- exactly what you require of it in any case), and closes the file.
It will hardly suffice to bring out all the mysteries of 'lazy IO', but consider now this difference in bracketing
good file operation = withFile file ReadMode (hGetContents >=> operation >=> print)
bad file operation = (withFile file ReadMode hGetContents) >>= operation >>= print
-- *Main> good "lazyio.hs" (return . length)
-- 503
-- *Main> bad "lazyio.hs" (return . length)
-- 0
Crudely put, bad opens and closes the file before it does anything; good does everything in between opening and closing the file. Your first action was akin to bad. withFile should govern all of the action you want done that depends on the handle.
You don't need a strictness enforcer if you are working with String, small files, etc., just an idea how the composition works. Again, in bad all I 'do' before closing the file is exposeContentsToEvaluationAsNeededForTheRestOfTheAction. In good I compose exposeContentsToEvaluationAsNeededForTheRestOfTheAction with the rest of the action I have in mind, then close the file.
The familiar length + seq trick mentioned by Patrick, or length + evaluate, is worth knowing; your second action with putStrLn text was a variant. But reorganization is better, unless lazy IO is wrong for your case.
$ time ./bad
bad: Prelude.last: empty list
-- no, lots of Chars there
real 0m0.087s
$ time ./good
'\n' -- right
()
real 0m15.977s
$ time ./seqing
Killed -- hopeless, attempting to represent the file contents
real 1m54.065s -- in memory as a linked list, before finding out the last char
It goes without saying that ByteString and Text are worth knowing about, but reorganization with evaluation in mind is better, since even with them the Lazy variants are often what you need, and they then involve grasping the same distinctions between forms of composition. If you are dealing with one of the (immense) class of cases where this sort of IO is inappropriate, take a look at enumerator, conduit and co., all wonderful.
hGetContents uses lazy IO; it only reads from the file as you force more of the string, and it only closes the file handle when you evaluate the entire string it returns. The problem is that you're enclosing it in withFile; instead, just use openFile and hGetContents directly (or, more simply, readFile). The file will still get closed once you fully evaluate the string. Something like this should do the trick, to ensure that the file is fully read and closed immediately by forcing the entire string beforehand:
import Control.Exception (evaluate)
readCode :: FilePath -> IO Code
readCode fileName = do
    text <- readFile fileName
    evaluate (length text)
    return (parseCode text)
Unintuitive situations like this are one of the reasons people tend to avoid lazy IO these days, but unfortunately you can't change the definition of hGetContents. A strict IO version of hGetContents is available in the strict package, but it's probably not worth depending on the package just for that one function.
If you want to avoid the overhead that comes from traversing the string twice here, then you should probably look into using a more efficient type than String, anyway; the Text type has strict IO equivalents for much of the String-based IO functionality, as does ByteString (if you're dealing with binary data, rather than Unicode text).
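As a sketch of that strict-Text route (assuming the text package is available; the file name here is made up for illustration), Data.Text.IO.readFile reads the whole file strictly, so the handle is closed by the time it returns and the contents can be used freely afterwards:

```haskell
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

main :: IO ()
main = do
    TIO.writeFile "code.txt" (T.pack "some source text\n")
    t <- TIO.readFile "code.txt"  -- strict: fully read, handle closed here
    print (T.length t)            -- safe to use t at any later point
```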
You can force the contents of text to be evaluated using
length text `seq` return code
as the last line.
I'm reading 512^2 whitespace delimited doubles written in a text file to my Erlang program by piping them to stdin.
In Erlang this takes 2m25s, in an equivalent Haskell program it takes 3s, so I must be going against the Erlang way of doing it in some way.
Am I using Erlang's IO primitives in a stupid way, or is there something else wrong with my program?
Note that I don't care about the order of the values in the resulting list, so no reverse operation.
Erlang:
-module(iotest).
-import(io).
-export([main/0]).
main() ->
    Values = read(),
    io:write(Values).

read() -> read([]).

read(Acc) ->
    case io:fread("", "~f") of
        {ok, Value} -> read([Value | Acc]);
        eof -> Acc
    end.
Haskell:
module IOTest (
    main
) where

main :: IO ()
main = do
    text <- getContents
    let values = map read (words text) :: [Double]
    print values
Thanks very much for any help.
No, you are not using Erlang IO in a stupid way. The problem is with Erlang's IO itself, which is not known for being fast. Erlang is widely used for writing servers, so socket-oriented IO is excellently tuned. Block-oriented file IO is not so bad either, but using the io module for working with stdin doesn't work well; Erlang is not widely used for this kind of work. If you need this kind of operation, you should write your own specialized input routine. You have two options there:
use io for reading from the file in raw binary mode, then split the input using the binary module and use list_to_float/1 for the conversion.
use a specialized port-oriented stdin reading routine (as you can see, for example, in http://shootout.alioth.debian.org/u64q/program.php?test=regexdna&lang=hipe&id=7; note the read/0 function and the -noshell -noinput parameters for VM invocation) and then continue as in the first option.
In my opinion (and from my previous experience), the biggest impact in your case comes from using a scan-like input routine for float decoding, seconded by the slow (repeated) io invocations, but it would take some nontrivial profiling to prove it.