Reading files with references to other files in Haskell

I am trying to expand regular markdown with the ability to have references to other files, such that the content in the referenced files is rendered at the corresponding places in the "master" file.
But the furthest I've come is this implementation:
createF :: FTree -> IO String
createF Null = return ""
createF (Node f children) = ifNExists f (_id f)
    (do childStrings <- mapM createF children
        withFile (_path f) ReadMode $ \handle -> do
            fc <- lines <$> hGetContents handle
            return $ merge fc childStrings)
ifNExists is just a helper that can be ignored. The real problem happens in the reading of the handle: it just returns the empty string, which I assume is due to lazy IO.
I thought that using withFile (_path f) ReadMode $ \handle -> {- do stuff -} hGetContents handle would be the right solution, as I've read that fcontent <- withFile filepath ReadMode hGetContents is a bad idea.
Another thing that confuses me is that the function
createFT :: File -> IO FTree
createFT f = ifNExists f Null
    (withFile (_path f) ReadMode $ \handle -> do
        let thisParse = fparse (_id f : _parents f)
        children <- rights . map (thisParse . trim) . lines <$> hGetContents handle
        c <- mapM createFT children
        return $ Node f c)
works like a charm.
So why does createF return just an empty string?
The whole project, and a directory/file to test with, can be found on GitHub.
Here are the datatype definitions:
type ID = String

data File = File {_id :: ID, _path :: FilePath, _parents :: [ID]}
    deriving (Show)

data FTree = Null
           | Node { _file :: File
                  , _children :: [FTree]
                  } deriving (Show)

As you suspected, lazy IO is probably the problem. Here's the (awful) rule you have to follow to use it properly without going totally nuts:
A withFile computation must not complete until all (lazy) I/O required to fully evaluate its result has been performed.
If something forces I/O after the handle is closed, you are not guaranteed to get an error, even though that would be very nice. Instead, you get completely undefined behavior.
You break this rule with return $ merge fc childStrings, because this value is returned before it's been fully evaluated. What you can do instead is something vaguely like
let retVal = merge fc childStrings
retVal `deepseq` return retVal
An arguably cleaner alternative is to put all the rest of the code that relies on those results into the withFile argument. The only real reason not to do that is if you do a bunch of other work with the results after you're finished with that file. For example, if you're processing a bunch of different files and accumulating their results, then you want to be sure to close each of them when you're done with it. If you're just reading in one file and then acting on it, you can leave it open till you're finished.
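Putting the pieces together, a fixed createF might look like the following. This is a minimal sketch that assumes your existing ifNExists and merge helpers and the FTree/File types; deepseq comes from the deepseq package and forces the string to normal form while the handle is still open:
import Control.DeepSeq (deepseq)
import System.IO

createF :: FTree -> IO String
createF Null = return ""
createF (Node f children) = ifNExists f (_id f)
    (do childStrings <- mapM createF children
        withFile (_path f) ReadMode $ \handle -> do
            fc <- lines <$> hGetContents handle
            let retVal = merge fc childStrings
            -- fully evaluate the merged result before withFile closes the handle
            retVal `deepseq` return retVal)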
By the way, I just submitted a feature request to the GHC team to see if they might be willing to make these kinds of programs more likely to fail early with useful error messages.
Update
The feature request was accepted, and such programs are now much more likely to produce useful error messages. See What caused this "delayed read on closed handle" error? for details.

I'd strongly suggest you avoid lazy IO, as it always creates problems like this, as described in What's so bad about Lazy I/O? Your case is exactly of that kind: you need to keep the file open until it's fully read, but that would mean closing the file somewhere in pure code, when the content is actually consumed.
One possibility would be to use strict ByteStrings and read files using readFile. This would also make many operations more efficient.
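For example, a strict line reader along these lines consumes the whole file and closes the handle before returning (readLinesStrict is an illustrative name, not a library function):
import qualified Data.ByteString.Char8 as BC

-- The file is fully read and closed inside readFile, so no IO can be
-- forced later from pure code.
readLinesStrict :: FilePath -> IO [String]
readLinesStrict path = map BC.unpack . BC.lines <$> BC.readFile path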
Another option would be to use one of the libraries that address the lazy IO problem (see What are the pros and cons of Enumerators vs. Conduits vs. Pipes?). These libraries allow you to separate content production from its processing or consumption. So you could have a producer that reads input files and produces a stream of some tokens, and a pure consumer (not depending on IO) that consumes the stream and produces some result. For example, in conduit-extra there is a module that converts an attoparsec parser into a consumer.
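To give a flavour of the streaming style, here is a hedged sketch using the conduit package: the file is streamed in chunks, decoded, split into lines, and counted, without ever holding the whole file in memory (countLines is an illustrative name):
import Conduit

countLines :: FilePath -> IO Int
countLines path = runConduitRes $
    sourceFile path     -- stream the file as ByteString chunks
    .| decodeUtf8C      -- decode the chunks to Text
    .| linesUnboundedC  -- split the stream into lines
    .| lengthC          -- consume the stream by counting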
See also Is there a better way to walk a directory tree?

Related

How to understand stream in Haskell

I'm reading Learn You a Haskell for Great Good!, chapter 9: Input and Output. There is an example explaining streams:
main = do
    withFile "something.txt" ReadMode (\handle -> do
        contents <- hGetContents handle
        putStr contents)
The book says:
That's why in this case it actually reads a line, prints it to the output, reads the next line, prints it, etc.
But earlier, for the same example, it also says:
That's really cool because we can treat contents as the whole contents of the file, but it's not really loaded in memory.
I'm new to functional programming and I'm really confused about this. Why can we treat contents as the whole contents if it reads one line at a time? I thought the contents in contents <- hGetContents handle was just the content of one line. Does Haskell save the content of every line into temporary memory, or something else?
You can think of it as a function which, when invoked, returns some of the result (not all of it) along with a callback function to get the rest when you need it. So technically it gives you the entire content, but one chunk at a time, and only if you ask for the rest of it.
If Haskell did not have non-strict semantics, you could implement this concept by something like:
data Stream a = Stream [a] (() -> Stream a)

instance (Show a) => Show (Stream a) where
    show (Stream xs _) = show xs ++ " ..."

-- ask for the rest of the stream
rest :: Stream a -> Stream a
rest (Stream _ f) = f ()
Then say you want a stream which iterates integers. You can return the first 3 and postpone the rest until the user asks for it:
iter :: Int -> Stream Int
iter x = Stream [x, x + 1, x + 2] (\_ -> iter (x + 3))
then,
> iter 0
[0,1,2] ...
but if you keep asking for the rest, you get the entire content
> take 5 $ iterate rest (iter 0)
[[0,1,2] ...,[3,4,5] ...,[6,7,8] ...,[9,10,11] ...,[12,13,14] ...]
or
> let go (Stream [i, j, k] _) acc = i:j:k:acc
> take 20 . foldr go [] $ iterate rest (iter 0)
[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19]
That is the same story with line buffering under the hood. It reads and returns the first line, but then you can ask for the next line, and the next line, ... So technically you get the entire content even though it only reads one line at a time.
why we can take contents as the whole contents if it reads one line in one time?
First, note that it is not necessary that the content is read line by line (although it can be; I will come to that later). What the author means is that even though the entire file is not loaded into memory, you can assume conceptually that the variable contents has the whole content of the file. This is possible because of the lazy streaming of the file (if you are more interested, you can look at the source for the low-level details; it basically uses unsafeInterleaveIO to achieve this).
does Haskell save the content of every line into temporary memory or something else?
That depends on the type of buffering used. According to the documentation, it depends on the underlying file system:
The default buffering mode when a handle is opened is implementation-dependent and may depend on the file system object which is attached to that handle. For most implementations, physical files will normally be block-buffered and terminals will normally be line-buffered.
But you can use hGetBuffering :: Handle -> IO BufferMode to see for yourself which buffering mode you are in.
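For instance, a minimal check (assuming some file something.txt exists):
import System.IO

main :: IO ()
main = withFile "something.txt" ReadMode $ \h -> do
    mode <- hGetBuffering h
    print mode -- e.g. BlockBuffering Nothing for an ordinary file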
Note that hGetContents itself never reads anything; it simply returns an IO action that, when ultimately executed, will read from the file. Combined with putStr as hGetContents handle >>= putStr (which is simply the do notation desugared), you get an IO action that takes a handle and outputs its contents to the screen. At no point in the Haskell program itself do you ever specify how that happens; it is entirely up to the Haskell runtime and how it executes the IO action you create.

Is there something better than unsafePerformIO for this....?

I've so far avoided ever needing unsafePerformIO, but this might have to change today. I would like to see if the community agrees, or if someone has a better solution.
I have a library which needs to use some config data stored in a bunch of files. This data is guaranteed static (during the run), but needs to be in files that can (on very rare occasions) be edited by an end user who cannot compile Haskell programs. (The details are unimportant, but think of "/etc/mime.types" as a pretty good approximation: it is a large, almost static data file used throughout many programs.)
If this weren't a library, I would just use the IO monad. But because it is a library which is called throughout my code, it literally forces a bubbling up of the IO monad through pretty much everything I have written in multiple modules! Although I need to do a one-time read of the data files, this low level call is effectively pure, so this is a pretty unacceptable outcome.
FYI, I plan to also wrap the call in unsafeInterleaveIO, so that only files that are needed will be loaded. My code will look something like this....
dataDir = "<path to files>"

datafiles :: [FilePath]
datafiles = unsafePerformIO $
            unsafeInterleaveIO $
            map (dataDir </>)
            <$> filter (not . ("." `isPrefixOf`))
            <$> getDirectoryContents dataDir

fileData :: [String]
fileData = unsafePerformIO $ unsafeInterleaveIO $ sequence $ readFile <$> datafiles
Given that the data read is referentially transparent, I am pretty sure that unsafePerformIO is safe (this has been discussed in many places, such as "Use of unsafePerformIO appropriate?"). Still, though, if there is a better way, I would love to hear about it.
Update
In response to Anupam's comment....
There are two reasons why I can't break up the lib into IO and non-IO parts.
First, the amount of data is large, and I don't want to read it all into memory at once. Remember that IO always reads strictly; this is the reason I need the unsafeInterleaveIO call, to make the read lazy. IMHO, once you use unsafeInterleaveIO, you might as well use unsafePerformIO, as the risk is already there.
Second, breaking out the IO-specific parts just substitutes the bubbling up of the IO monad with the bubbling up of the IO read code, as well as the passing around of the data (I might actually choose to pass the data around using the state monad anyway, so it really isn't an improvement to substitute the IO monad for the state monad everywhere). This wouldn't be so bad if the low level function itself weren't effectively pure (i.e. think of my /etc/mime.types example above, and imagine a Haskell extensionToMimeType function, which is basically pure, but needs to get the database data from the file. Suddenly everything from low to high in the stack needs to call or pass through a readMimeData :: IO String. Why should main even need to care about the library choice of a submodule many levels deep?).
I agree with Anupam Jain: you would be better off reading these data files at a somewhat higher level, in IO, and then passing the data in them through the rest of your program purely.
You could, for example, put the functions that need the results of fileData into Reader [String], so that they can just ask for the results as needed (or some Reader Config, where Config holds these strings and whatever else you need).
A sketch of what I'm suggesting follows:
import Control.Monad.Reader

type AppResult = String

fileData :: IO [String]
fileData = undefined -- read the files

myApp :: String -> Reader [String] AppResult
myApp s = do
    files <- ask
    return undefined -- do whatever with s and the config files

main :: IO ()
main = do
    config <- fileData
    print $ runReader (myApp "test") config
I gather that you don't want to read all the data at once, because that would be costly. And maybe you don't really know up-front what files you will need to load, so loading all of them at the start would be wasteful.
Here's an attempt at a solution. It requires you to work inside a free monad and relegate the side-effecting operations to an interpreter. Some preliminary imports:
{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Data.ByteString as B
import Data.Monoid
import Data.List
import Data.Functor.Compose
import Control.Applicative
import Control.Monad
import Control.Monad.Free
import System.IO
We define a functor for the free monad. It will offer a value p to the interpreter and continue the computation after receiving a value b:
type LazyLoad p b = Compose ((,) p) ((->) b)
A convenience function to request the loading of a file:
lazyLoad :: FilePath -> Free (LazyLoad FilePath B.ByteString) B.ByteString
lazyLoad path = liftF $ Compose (path,id)
A dummy interpreter function that reads "file contents" from stdin:
interpret :: Free (LazyLoad FilePath B.ByteString) a -> IO a
interpret = iterM $ \(Compose (path, next)) -> do
    putStrLn $ "Enter the contents for file " <> path <> ":"
    B.hGetLine stdin >>= next
Some silly example functions:
someComp :: B.ByteString -> B.ByteString
someComp b = "[" <> b <> "]"

takesAwhile :: Int
takesAwhile = foldl' (+) 0 $ take 400000000 $ intersperse (negate 1) $ repeat 1
An example program:
main :: IO ()
main = do
    r <- interpret $ do
        r1 <- someComp <$> lazyLoad "file1"
        r2 <- return takesAwhile
        if r2 == 1
            then return r1
            else someComp <$> lazyLoad "file2"
    putStrLn . show $ r
When executed, this program will request a line, spend some time computing takesAwhile and only then request another line.
If you want to allow different kinds of "requests", this solution could be extended with something like Data types à la carte so that each function only needs to know about the precise effects it requires.
If you are content with allowing only one type of request, you could also use Clients and Servers from Pipes.Core instead of the free monad.

Create lazy IO list from a non-IO list

I have a lazy list of filenames created by find. I'd like to be able to load the metadata of these files lazily too. That means that if I take 10 elements from metadata, it should only look up the metadata of those ten files. The fact is, find happily gives you 10 files if you ask for them, without churning through your whole disk, whereas my script reads the metadata of all the files.
main = do
    files <- find always always "/"
    metadata <- loadMetaList files

loadMetaList :: [String] -> IO [Metadata]
loadMetaList (file:files) = do
    first <- loadMeta file
    rest <- loadMetaList files
    return (first:rest)

loadMeta :: String -> IO Metadata
As you can see, loadMetaList is not lazy. For it to be lazy, it should use tail recursion. Something like return (first:loadMetaList rest).
How do I make loadMetaList lazy?
The (>>=) of the IO monad is such that in
loadMetaList :: [String] -> IO [Metadata]
loadMetaList (file:files) = do
    first <- loadMeta file
    rest <- loadMetaList files
    return (first:rest)
the action loadMetaList files has to be run before return (first:rest) can be executed.
You can avoid that by deferring the execution of loadMetaList files,
import System.IO.Unsafe

loadMetaList :: [String] -> IO [Metadata]
loadMetaList (file:files) = do
    first <- loadMeta file
    rest <- unsafeInterleaveIO $ loadMetaList files
    return (first:rest)
with unsafeInterleaveIO (which find also uses). That way, the loadMetaList files is not executed until its result is needed, and if you require only the metadata of 10 files, only that will be loaded.
It's not quite as unsafe as its cousin unsafePerformIO, but should be handled with care too.
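To see what that deferral means in practice, here is a tiny self-contained sketch (not from the question; it needs no files, so it can be run anywhere): the action wrapped in unsafeInterleaveIO only runs when its result is demanded.
import System.IO.Unsafe (unsafeInterleaveIO)

main :: IO ()
main = do
    xs <- unsafeInterleaveIO (putStrLn "reading!" >> return [1, 2, 3 :: Int])
    putStrLn "before demand"
    print (sum xs) -- "reading!" is printed here, when xs is first forced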
Here's how you do it the pipes way. I don't really know how you implement loadMeta and find, so I just made something up:
import Pipes
import qualified Pipes.Prelude as Pipes

find :: Producer FilePath IO ()
find = each ["heavy.mp3", "metal.mp3"]

type MetaData = String

loadMeta :: String -> IO MetaData
loadMeta file = return $ "This song is " ++ takeWhile (/= '.') file

loadMetaList :: Pipe FilePath MetaData IO r
loadMetaList = Pipes.mapM loadMeta
To run it, we just compose processing stages like a pipeline and run the pipeline using runEffect:
>>> runEffect $ find >-> loadMetaList >-> Pipes.stdoutLn
This song is heavy
This song is metal
There are a couple of key things to point out:
You can make find a Producer so that it only searches the directory tree lazily, too. I know you don't need this feature because your file set is small now, but it's very easy to include later when your directory gets larger.
It's lazy, but without unsafeInterleaveIO. It generates each output immediately and doesn't wait to first collect the whole list of results.
For example, it will work even if we use an infinite list of files:
>>> import qualified Pipes.Prelude as Pipes
>>> runEffect $ each (cycle ["heavy.mp3", "metal.mp3"]) >-> loadMetaList >-> Pipes.stdoutLn
This song is heavy
This song is metal
This song is heavy
This song is metal
This song is heavy
This song is metal
...
It will only compute as much as necessary. If we specify that we only want three results, it will do the minimum amount of loading necessary to return three results, even if we provide an infinite list of files.
For example, we can cap the number of results using take:
>>> runEffect $ each (cycle ["heavy.mp3", "metal.mp3"]) >-> loadMetaList >-> Pipes.take 3 >-> Pipes.stdoutLn
This song is heavy
This song is metal
This song is heavy
So you asked what is wrong with unsafeInterleaveIO. The main limitation of unsafeInterleaveIO is that you cannot guarantee when the IO actions actually occur, which leads to the following common pitfalls:
Handles accidentally being closed before the file is read
IO actions occurring late or never
Pure code having side effects and throwing IOExceptions
The biggest advantage of Haskell's IO system over other languages is that Haskell completely decouples the evaluation model from the order of side effects. When you use lazy IO, you lose that decoupling, and the order of side effects becomes tightly coupled to Haskell's evaluation model, which is a huge step backwards.
This is why it is generally not wise to use lazy IO, especially now that there are easy and elegant alternatives.
If you want to learn more about how to use pipes to implement lazy IO safely, then you can read the extensive pipes tutorial.

Frege's equivalent of Haskell's getLine and read

Is there any Frege's equivalent of Haskell's getLine and read to parse input from the console in the standard library?
Currently I am doing it like this:
import frege.IO

getLine :: IO String
getLine = do
    isin <- stdin
    isrin <- IO.InputStreamReader.new isin
    brin <- IO.BufferedReader.fromISR isrin
    line <- brin.readLine
    return $ fromExceptionMaybe line

fromExceptionMaybe :: Exception (Maybe a) -> a
fromExceptionMaybe (Right (Just r)) = r
fromExceptionMaybe (Right _) = error "Parse error on input"
fromExceptionMaybe (Left l) = error l.getMessage

pure native parseInt java.lang.Integer.parseInt :: String -> Int

main _ = do
    line <- getLine
    println $ parseInt line
Update:
Frege has evolved, so now we have getLine in the standard library itself. As for read, we have conversion methods on String. The original problem is now simply:
main _ = do
    line <- getLine
    println line.atoi
See Ingo's answer below for more details.
Update: I/O support in more recent versions of Frege
As of version 3.21.80, we have better I/O support in the standard libraries:
The runtime provides stdout and stderr (buffered, UTF8-encoding java.io.PrintWriters wrapped around java.lang.System.out and java.lang.System.err) and stdin (a UTF8-decoding java.io.BufferedReader wrapped around java.lang.System.in)
Functions print, println, putStr, putChar write to stdout
getChar and getLine read from stdin and throw exceptions on end of file.
The Frege equivalents for Java classes like PrintWriter, BufferedWriter etc. are defined in module Java.IO, which is automatically imported. With this, more basic functionality is supported. For example, BufferedReader.readLine has a return type of IO (Maybe String) and signals end of file by returning Nothing, like its Java counterpart, which returns null in such cases.
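For instance, reading a single line with the Nothing-on-EOF convention might look like this (a hedged sketch based only on the API described above):
main _ = do
    mline <- stdin.readLine            -- IO (Maybe String), Nothing on EOF
    println (maybe "<end of file>" id mline)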
Here is a short example program that implements a basic grep:
--- A simple grep
module examples.Grep where

--- exception thrown when an invalid regular expression is compiled
data PatternSyntax = native java.util.regex.PatternSyntaxException
derive Exceptional PatternSyntax

main [] = stderr.println "Usage: java examples.Grep regex [files ...]"
main (pat:xs) = do
        rgx <- return (regforce pat)
        case xs of
            [] -> grepit rgx stdin
            fs -> mapM_ (run rgx) fs
    `catch` badpat
  where
    badpat :: PatternSyntax -> IO ()
    badpat pse = do
        stderr.println "The regex is not valid."
        stderr.println pse.getMessage
    run regex file = do
            rdr <- utf8Reader file
            grepit regex rdr
        `catch` fnf
      where
        fnf :: FileNotFoundException -> IO ()
        fnf _ = stderr.println ("Could not read " ++ file)

grepit :: Regex -> BufferedReader -> IO ()
grepit pat rdr = loop `catch` eof `finally` rdr.close
  where
    eof :: EOFException -> IO ()
    eof _ = return ()
    loop = do
        line <- rdr.getLine
        when (line ~ pat) (println line)
        loop
Because Frege is still quite new, the library support is admittedly still lacking, despite the progress that has already been made in the most basic areas, like lists and monads.
In addition, while the intent is to have a high degree of compatibility with Haskell, especially in the IO system and generally in low-level system-related topics, there is a tension: should we rather go the Java way, or should we really try to emulate Haskell's way (which is in turn obviously influenced by what is available in the standard C/POSIX libraries)?
Anyway, the IO area is probably the most underdeveloped part of the Frege library, unfortunately. This is also because it is relatively easy to quickly write native function declarations for the handful of Java methods one needs in an ad hoc manner, instead of taking the time to develop a well-thought-out library.
Also, a Read class does not exist up to now. As a substitute until this is fixed, the String type has functions to parse all number types (based on the Java parseXXX() methods).
(Side note: Because my days also have only 24h and I have a family, a dog and a job to care about, I would be very happy to have more contributors that help making the Frege system better.)
Regarding your code: yes, I feel it is right to do all character-based I/O through the Reader and Writer interfaces. Your example also shows that convenience functions for obtaining a standard input reader are needed. The same holds for a standard output writer.
However, when you need to read more than one line, I'd definitely create the reader in the main function and pass it to the input-processing actions.

How to combine user input with list of tuples and write the complete list of tuples to the file?

I am trying to take user input and convert it into a list of tuples.
What I want to do is take the data from the user, convert it into the form [(Code, Name, Price)], and finally combine this user input with the previous list and write the new list to the same file.
The problem I am facing is that as soon as the program finishes taking user input, WinHugs shows an error: Program error: Prelude.read: no parse.
Here is the code:
type Code = Int
type Price = Int
type Name = String
type ProductDatabase = (Code, Name, Price)

finaliser = do
    a <- input_taker
    b <- list_returner
    let w = a ++ b
    outh <- openFile "testx.txt" WriteMode
    hPrint outh w
    hClose outh
The problem is that you're using lazy IO to read from a file while you're writing to it at the same time. This causes problems when read sees data that has been partially written.
We need to force the reading of the input data to be complete before you try writing to the file. One way of doing this is to use seq to force the list of products to be read into memory.
list_returner :: IO [ProductDatabase]
list_returner = do
    inh <- openFile "testx.txt" ReadMode
    product_text <- hGetContents inh
    let product :: [ProductDatabase]
        product = read product_text
    product `seq` hClose inh
    return product
Also, this will fail if the file is empty. The file should contain at least [] before running your code the first time, so that it will parse as the empty list.
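A one-time initialisation along these lines would take care of that (a hedged sketch; the helper name initDatabase is made up):
-- Write an empty database so that `read` has something to parse.
initDatabase :: IO ()
initDatabase = writeFile "testx.txt" (show ([] :: [ProductDatabase]))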
The code looks fine to me, except for certain style points. It should work like that. Try to separate concerns more. The "no parse" exception means that the read function was unable to convert its argument string to the desired type. The base library coming with Hugs may be more restrictive about spaces and line feeds. I would recommend using GHC instead of Hugs in general.
In case you're interested: one style point you may want to consider is using withFile instead of an openFile/hClose combination. You may also want to use writeFile with show:
writeFile "testx.txt" (show w)
Another style point: Your input_taker action should not return a list. There is really no reason to return a list. Return a single tuple instead, so you can use (:) instead of (++). In general the usage of (++) indicates that you may be taking the wrong approach.
Further your ProductDatabase type name is misleading, because I would interpret [ProductDatabase] as a list of databases. Your tuple is a Product.
Final style point: This is really just about code beauty, so it's controversial. This is not C/C++, so you would really want to write f x instead of f(x):
...
return product

-- Since your `Product` is just a type alias, I would use
-- a smart constructor:
product :: Code -> Name -> Price -> Product
product = (,,)

readProduct :: IO Product
readProduct = do
    ...
    code <- fmap read getLine
    ...
    name <- getLine
    ...
    price <- fmap read getLine
    return (product code name price)
