I want to select the n-th last line from a large text file (~10 GB) in a Haskell program.
I found a way to get the n-th last line from an in-memory string:
myText :: String
myText = "one\ntwo\nthree\nfour\nfive\nsix\nseven" -- sample input with 7 lines

myLen = 7
n = 3 -- one-based from the end
myLines = lines myText
idx = myLen - n
theLine = head (drop idx myLines)

main :: IO ()
main = do
  putStrLn theLine
The documentation for the readFile function says it "reads the content lazily", so once readFile has reached the n-th last line, will it have stored all the preceding lines in memory (and then explode, because I don't have that much memory)?
So, is readFile the right approach here? And how do I get the IO String output from readFile "in a lazy way" into a list of lines, so that I can then select the n-th last line?
The question has several parts:
The documentation for the readFile function says it "reads the content lazily", so once readFile has reached the n-th last line, will it have stored all the preceding lines in memory (and then explode, because I don't have that much memory)?
Not necessarily. If you only iterate over the contents and produce a result, then the garbage collector should deallocate the contents.
So, is readFile the right approach here?
My opinionated answer is that if it's for a serious tool, readFile isn't the right approach, because "lazy IO" is a can of worms.
If it's for a quick and dirty script then go ahead; but if not, and if performance is important, it is probably best to use lower-level calls that read strict ByteStrings, and for your problem, to read directly from the end of the file and process that.
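As an illustration of that lower-level approach, here is a sketch of my own (not a tuned implementation; the 64 KB chunk size is an arbitrary choice, and it assumes the file has at least n lines):
import qualified Data.ByteString.Char8 as BS
import System.IO

-- Read backwards from the end of the file in fixed-size chunks,
-- stopping once the accumulated suffix contains more than n
-- newlines (so the n-th last line is certainly complete).
nthLastLine :: FilePath -> Int -> IO BS.ByteString
nthLastLine path n = withFile path ReadMode $ \h -> do
    size <- hFileSize h
    go h size BS.empty
  where
    chunkSize = 65536 :: Integer
    go h pos acc
      | pos <= 0 || BS.count '\n' acc > n =
          let ls = BS.lines acc
          in return (ls !! max 0 (length ls - n))
      | otherwise = do
          let newPos = max 0 (pos - chunkSize)
          hSeek h AbsoluteSeek newPos
          chunk <- BS.hGet h (fromIntegral (pos - newPos))
          go h newPos (chunk `BS.append` acc)
Memory use stays bounded by roughly one chunk plus the last n lines, no matter how large the file is.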
The following program will require only about as much memory as the longest n lines in the file being read:
-- like drop, but takes its number encoded as a lazy
-- unary number via the length of the first list
dropUnary :: [a] -> [b] -> [b]
dropUnary [] bs = bs
dropUnary (_:as) (_:bs) = dropUnary as bs
takeLast :: Int -> [a] -> [a]
takeLast n as = dropUnary (drop n as) as
main :: IO ()
main = putStrLn . head . takeLast 3 . lines =<< readFile "input.txt" -- "input.txt" is a placeholder path
The Prelude's lines function is already suitably lazy, but some care was taken in writing takeLast here. You can think of this as operating in "one pass" of the file, looking at subsequent chunks of n consecutive lines until it finds the last chunk. Because it does not maintain any references to the contents of the file from before the chunk it's currently looking at, all of the file contents up to the current chunk can be garbage collected (and generally is, fairly soon).
I'm trying to make a program that reads a file line by line and checks whether each line is a palindrome; if it is, it should be printed.
I'm really new to Haskell, so the only thing I could do is print out each line, with this code:
main :: IO ()
main = do
  filecontent <- readFile "palindrom.txt"
  mapM_ putStrLn (lines filecontent)

isPalindrom w = w == reverse w
The thing is, I don't know how to go line by line and check whether the line is a palindrome (note that in my file, each line contains only one word). Thanks for any help.
Here is one suggested approach:
main :: IO ()
main = do
  filecontent <- readFile "palindrom.txt"
  putStrLn (unlines $ filter isPalindrome $ lines filecontent)

isPalindrome :: String -> Bool
isPalindrome w = w == reverse w
The part in parentheses is pure code; it has type String -> String. It is generally a good idea to isolate pure code as much as possible, because that code tends to be the easiest to reason about, and it is often more easily reusable.
You can think of data as flowing from right to left in that expression, through the stages separated by the ($) operators: first you split the content into separate lines, then you keep only the palindromes, and finally you rebuild the full output as a single string. Also, because Haskell is lazy, even though the code looks like it treats the input as a single String in memory, it actually pulls in the data only as needed.
Edited to add extra info....
OK, so the heart of the solution is the pure portion:
unlines $ filter isPalindrome $ lines filecontent
The way ($) works is by evaluating the expression to its right, then feeding that as the input to the function on its left. In this case, filecontent is the full input from the file (a String, including newline chars), and the result is the full output string (also including newline chars).
Let's follow sample input through this process, "abcba\n1234\nK"
unlines $ filter isPalindrome $ lines "abcba\n1234\nK"
First, lines will break this into a list of lines:
unlines $ filter isPalindrome ["abcba", "1234", "K"]
Note that the output of lines is being fed into the input for filter.
So, what does filter do? Notice its type
filter :: (a -> Bool) -> [a] -> [a]
This takes 2 input parameters: the first is a function (which isPalindrome is), and the second is a list of items. It tests each item in the list using the function, and its output is the same list, minus the items for which the function returned False. In our case, the first and third items are in fact palindromes, and the second is not. Our expression evaluates as follows:
unlines ["abcba", "K"]
Finally, unlines is the opposite of lines: it concatenates the items again, inserting newlines in between.
"abcba\nK"
Since putStrLn takes a String, this is ready for outputting.
Note that it is perfectly OK to output a list of Strings using non-pure functions instead, as follows (forM_ comes from Control.Monad):
forM_ ["1", "2", "3"] $ \item ->
  putStrLn item
This method however mixes pure and impure code, and is considered slightly less idiomatic Haskell code than the former. You will still see this type of thing a lot though!
Have a look at the filter function. You may not want to put all processing on a single line, but use a let expression. Also, your indentation is off:
main :: IO ()
main = do
  filecontent <- readFile "palindrom.txt"
  let selected = filter ... filecontent
  ...
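A filled-in version of that sketch (one guess at the elided parts, reusing isPalindrome from above and filtering over the lines rather than the raw String) might look like:
main :: IO ()
main = do
  filecontent <- readFile "palindrom.txt"
  -- keep only the lines that are palindromes
  let selected = filter isPalindrome (lines filecontent)
  mapM_ putStrLn selected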
I'm reading "Learn You a Haskell for Great Good!", chapter 9: Input and Output. There is an example there used to explain streams:
import System.IO

main = do
  withFile "something.txt" ReadMode (\handle -> do
    contents <- hGetContents handle
    putStr contents)
The book says:
That's why in this case it actually reads a line, prints it to the
output, reads the next line, prints it, etc.
But in previous content, for the same example, it also says:
That's really cool because we can treat contents as the whole contents
of the file, but it's not really loaded in memory.
I'm new to functional programming and I'm really confused about this: why can we treat contents as the whole contents if it reads one line at a time? I thought the contents in contents <- hGetContents handle was just the content of one line. Does Haskell save the content of every line into temporary memory, or something else?
How to understand stream in Haskell
You can think of it as a function which, when invoked, returns some of the result (not all of it) along with a callback function to get the rest when you need it. So technically it gives you the entire content, but one chunk at a time, and only if you ask for the rest of it.
If Haskell did not have non-strict semantics, you could implement this concept by something like:
data Stream a = Stream [a] (() -> Stream a)
instance (Show a) => Show (Stream a) where
show (Stream xs _) = show xs ++ " ..."
rest :: Stream a -> Stream a -- ask for the rest of the stream
rest (Stream _ f) = f ()
Then say you want a stream which iterates integers. You can return the first 3 and postpone the rest until the user asks for it:
iter :: Int -> Stream Int
iter x = Stream [x, x + 1, x + 2] (\_ -> iter (x + 3))
then,
> iter 0
[0,1,2] ...
but if you keep asking for the rest, you get the entire content
> take 5 $ iterate rest (iter 0)
[[0,1,2] ...,[3,4,5] ...,[6,7,8] ...,[9,10,11] ...,[12,13,14] ...]
or
> let go (Stream [i, j, k] _) acc = i:j:k:acc
> take 20 . foldr go [] $ iterate rest (iter 0)
[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19]
That is the same story with line buffering under the hood. It reads and returns the first line, but then you can ask for the next line, and the next line, ... So technically you get the entire content even though it only reads one line at a time.
Why can we treat contents as the whole contents if it reads one line at a time?
First, note that the content is not necessarily read line by line (although it can be; I will come to that later). What the author means is that even though the entire file is not loaded into memory, you can conceptually treat the variable contents as holding the whole content of the file. This is possible because of the lazy streaming of the file (if you are interested, you can look at the source to see the low-level details; it basically uses unsafeInterleaveIO to achieve this).
Does Haskell save the content of every line into temporary memory, or something else?
That depends on the type of buffering used. According to the documentation it depends on the underlying file system:
The default buffering mode when a handle is opened is
implementation-dependent and may depend on the file system object
which is attached to that handle. For most implementations, physical
files will normally be block-buffered and terminals will normally be
line-buffered.
But you can use hGetBuffering :: Handle -> IO BufferMode to check for yourself which buffering mode you are in.
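For example, a minimal check of my own (reusing the "something.txt" file from above) could look like:
import System.IO

main :: IO ()
main = withFile "something.txt" ReadMode $ \h -> do
  -- ask the runtime which buffering mode this handle got
  mode <- hGetBuffering h
  print mode -- e.g. BlockBuffering Nothing for an ordinary file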
Note that hGetContents itself never reads anything when the expression is built; it simply produces an IO action that, when ultimately executed, will read from the file. When combined with putStr as hGetContents handle >>= putStr (which is simply the do notation desugared), you get an IO action that takes a handle and outputs its contents to the screen. At no point in the Haskell program itself do you ever specify how that happens; it is entirely up to the Haskell runtime and how it executes the IO action you create.
I have probably just spent a day of computation time in vain :)
The problem is that I (naively) wrote about 3.5GB of (compressed) [(Text, HashMap Text Int)] data to a file and at that point my program crashed. Of course there is no final ] at the end of the data and the sheer size of it makes editing it by hand impossible.
The data was formatted via Prelude.show, and only at this point do I realize that Prelude.read will need to read the whole dataset into memory (impossible) before any data is returned.
Now ... is there a way to recover the data without resorting to write a parser manually?
Update 1
import qualified Data.Map as M -- assumed; the original only shows the qualifier M

main = do
  s <- getContents
  let hs = read s :: [(String, M.Map String Integer)]
  print $ head hs
This I tried ... but it just keeps consuming more memory until it gets killed by the OS.
Sort of. You will still be writing a parser manually... but it is a very short and very easy-to-write parser, because almost all of it will ship out to reads. The idea is this: read is strict, but reads, when working on a single element, is lazy-ish. So we just need to strip out the bits that reads isn't expecting when working on a single element. Here's an example to get you started:
> let s = "[3,4,5," ++ undefined
> reads (drop 1 s) :: [(Int, String)]
[(3,",4,5,*** Exception: Prelude.undefined
I included the undefined at the end as evidence that it is in fact not reading the entire String before producing the parsed 3 at the head of the list.
Daniel's answer can be extended to parse the whole list at once, using the function below. Then you can directly access it as a list, the way you want:
lazyread :: Read a => String -> [a]
lazyread xs = go (tail xs) -- tail drops the opening '['
  where
    go ys = a : go (tail b) -- tail drops the ',' between elements
      where (a, b) = head $ reads ys
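For example, a minimal driver of my own (assuming the [(String, M.Map String Integer)] type from the question's update, with M being Data.Map) would stream the recovered entries:
import qualified Data.Map as M

main :: IO ()
main = do
  s <- getContents
  -- lazyread yields elements incrementally; it will eventually throw
  -- when it reaches the truncated end of the data, and everything
  -- printed before that point is recovered output.
  mapM_ print (lazyread s :: [(String, M.Map String Integer)])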
Manually delete the opening '['. After that you might be able to use reads (note the s) to incrementally access getContents.
I have a text file which contains two lists on each line. Each list can contain any number of alphanumeric arguments.
e.g. [t1,t2,...] [m1,m2,...]
I can read the file into GHCi, but how can I read it into another main file, and how can the main file recognise each argument separately in order to process it?
I think it's best for you to figure out most of this for yourself, but I've got some pointers for you.
Firstly, try not to deal with the file access until you've got the rest of the code working, otherwise you might end up having IO all over the place. Start with some sample data:
sampleData = "[m1,m2,m3][x1,x2,x3,x4]\n[f3,f4,f5][y7,y8,y123]\n[m4,m5,m6][x5,x6,x7,x8]"
You should not mention sampleData anywhere else in your code, but you should use it in ghci for testing.
Once you have a function that does everything you want, e.g. processLists :: String -> [(String, String)], you can replace readFile "data.txt" :: IO String with
readInLists :: FilePath -> IO [(String,String)]
readInLists filename = fmap processLists (readFile filename)
If fmap makes no sense to you, you could read a tutorial I accidentally wrote.
If they really are alphanumeric, you can split them quite easily. Here are some handy functions, with examples.
tail :: [a] -> [a]
tail "(This)" = "This)"
You can use that to throw away something you don't want at the front of your string.
break :: (Char->Bool) -> String -> (String,String)
break (== ' ') "Hello Mum" = ("Hello"," Mum")
So break uses a test to find the first character of the second string, and breaks the string just before it.
Notice that the break character is still there at the front of the second string. span is the same, but uses a test for what to keep in the first string, so
span :: (Char->Bool) -> String -> (String,String)
span (/= ' ') "Hello Mum" = ("Hello"," Mum")
You can use these functions with things like (==','), or isAlphaNum (you'll have to import Data.Char at the top of your file to use it).
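Putting those pieces together, here is a sketch of my own (hypothetical names like parseLine; not a robust parser) for splitting one line such as "[m1,m2,m3][x1,x2]" into its two lists:
-- Grab the next [...] group: drop everything up to and including
-- '[', take what comes before ']', and return the leftover input.
parseList :: String -> ([String], String)
parseList s =
  let afterOpen      = drop 1 (dropWhile (/= '[') s)
      (inside, rest) = break (== ']') afterOpen
  in (splitOnCommas inside, drop 1 rest)

-- Break "a,b,c" into ["a","b","c"].
splitOnCommas :: String -> [String]
splitOnCommas s = case break (== ',') s of
  (x, [])       -> [x]
  (x, _ : more) -> x : splitOnCommas more

-- Split one full line into its two lists.
parseLine :: String -> ([String], [String])
parseLine s =
  let (first, rest) = parseList s
      (second, _)   = parseList rest
  in (first, second)
A processLists-style wrapper could then be map parseLine . lines (note this yields ([String], [String]) per line rather than the (String, String) pairs sketched above, so adjust the types to taste).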
You might want to look at the functions splitWith and splitOn that I have in this answer. They're based on the definitions of split and words from the Prelude.
Hello Stack Overflow community.
I'm relatively new to Haskell, and I have noticed that writing large strings to a file with writeFile or hPutStr is extremely slow.
For a 1.5 MB string, my program (compiled with GHC) takes about 2 seconds, while the "same" code in C++ takes only about 0.1 seconds.
The string is generated from a list with about 10000 elements and then dumped with writeFile. I have also tried traversing the list with mapM_ and hPutStr, with the same result.
Is there a faster way to write a large string?
Update
As @applicative pointed out, the following code finishes with a 2 MB file in no time:
main = readFile "input.txt" >>= writeFile "output.txt"
So my problem seems to be somewhere else. Here are my two implementations for writing the list (WordIndex and CoordList are type aliases for a Map and a list):
with hPutStrLn
-- Print to file
indexToFile :: String -> WordIndex -> IO ()
indexToFile filename index =
  let
    indexList = map (\(k, v) -> entryToString k v) (Map.toList index)
  in do
    output <- openFile filename WriteMode
    mapM_ (\v -> hPutStrLn output v) indexList
    hClose output

-- Convert a list element to a String
entryToString :: String -> CoordList -> String
entryToString key value = (embedString 25 key) ++ (coordListToString value) ++ "\n"
with writeFile
-- Print to File
indexToFile :: String -> WordIndex -> IO ()
indexToFile filename index = writeFile filename (indexToString "" index)
-- Index to String
indexToString :: String -> WordIndex -> String
indexToString lead index = Map.foldrWithKey (\k v r -> lead ++ (entryToString k v) ++ r) "" index
Maybe you guys can help me a little in finding a speed up here.
Thanks in advance
This is a well-known problem. The default Haskell String type is simply [Char]; it is slow by definition, and dead slow if it is constructed lazily (the usual situation). However, being a list, it allows simple and clean processing with list combinators and is useful when performance is not an issue. When it is, you should use the ByteString or Text packages. ByteString has the advantage of shipping with GHC, but it does not provide Unicode support; ByteString-based UTF-8 packages are available on Hackage.
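For instance, a ByteString-based sketch of my own (with a hypothetical writeEntriesBS helper, not the asker's actual types) could assemble the output with Data.ByteString.Builder and write it through a single handle:
import qualified Data.ByteString.Builder as BB
import System.IO

-- Assemble everything as a Builder (cheap concatenation) and write
-- it once; hPutBuilder handles the buffering efficiently.
writeEntriesBS :: FilePath -> [(String, String)] -> IO ()
writeEntriesBS path entries =
  withFile path WriteMode $ \h ->
    BB.hPutBuilder h $
      foldMap (\(k, v) -> BB.stringUtf8 (k ++ v) <> BB.charUtf8 '\n') entries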
Yes. You could, for instance, use the Text type from the module Data.Text or Data.Text.Lazy, which internally represents text in a more efficient way (namely UTF-16) than a list of Chars does.
When writing binary data (which may or may not contain text encoded in some form) you can use ByteStrings or their lazy equivalents.
When modifying Text or ByteStrings, some operations are faster on the lazy versions. If you only want to read from such a string after creating it, the non-lazy versions are generally recommended.
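As a rough sketch of that approach (my own illustration, assuming the entries are already plain Strings), Data.Text.Lazy.Builder is one common way to assemble many small pieces cheaply:
import qualified Data.Text.Lazy.Builder as B
import qualified Data.Text.Lazy.IO as TLIO

-- Build the whole output as a Builder, then write the resulting
-- lazy Text in one go.
writeEntries :: FilePath -> [(String, String)] -> IO ()
writeEntries path entries =
  TLIO.writeFile path . B.toLazyText $
    foldMap (\(k, v) -> B.fromString k <> B.fromString v <> B.singleton '\n') entries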