Read large lines from a huge file without buffering (Haskell)

I was wondering if there's an easy way to get lines one at a time out of a file without eventually loading the whole file into memory. I'd like to do a fold over the lines with an attoparsec parser. I tried using Data.Text.Lazy.IO with hGetLine and that blows through my memory; I later read that it eventually loads the whole file.
I also tried using pipes-text with folds and view lines:
s <- Pipes.sum $
    folds (\i _ -> (i+1)) 0 id (view Text.lines (Text.fromHandle handle))
print s
to just count the number of lines, but it seems to be doing some wonky stuff ("hGetChunk: invalid argument (invalid byte sequence)") and it takes 11 minutes where wc -l takes 1 minute. I heard that pipes-text might have some issues with gigantic lines? (Each line is about 1 GB.)
I'm really open to any suggestions; I can't find much by searching except for newbie readLine how-tos.
Thanks!

The following code uses Conduit, and will:
UTF8-decode standard input
Run the lineC combinator as long as there is more data available
For each line, simply yield the value 1 and discard the line content, without ever reading the entire line into memory at once
Sum up the 1s yielded and print it
You can replace the yield 1 code with something that processes the individual lines; see the sketch after the code below.
#!/usr/bin/env stack
-- stack --resolver lts-8.4 --install-ghc runghc --package conduit-combinators
import Conduit

main :: IO ()
main = (runConduit
     $ stdinC
    .| decodeUtf8C
    .| peekForeverE (lineC (yield (1 :: Int)))
    .| sumC) >>= print
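For example, here is a minimal sketch that yields each line's length instead (lengthCE folds over a line's chunks without accumulating them) and reports the longest line; the overall shape of the pipeline is unchanged:

#!/usr/bin/env stack
-- stack --resolver lts-8.4 --install-ghc runghc --package conduit-combinators
import Conduit

main :: IO ()
main = do
    -- lengthCE counts the characters of the current line chunk by chunk,
    -- so even a 1 GB line is never resident in memory all at once
    mlen <- runConduit
         $ stdinC
        .| decodeUtf8C
        .| peekForeverE (lineC (lengthCE >>= yield))
        .| maximumC
    print (mlen :: Maybe Int)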

This is probably easiest as a fold over the decoded text stream:
{-# LANGUAGE BangPatterns #-}
import Pipes
import qualified Pipes.Prelude as P
import qualified Pipes.ByteString as PB
import qualified Pipes.Text.Encoding as PT
import qualified Control.Foldl as L
import qualified Control.Foldl.Text as LT

main = do
  n <- L.purely P.fold (LT.count '\n') $ void $ PT.decodeUtf8 PB.stdin
  print n
It takes about 14% longer than wc -l for the file I produced, which was just long lines of commas and digits. As the documentation says, IO should properly be done with Pipes.ByteString; the rest is conveniences of various sorts.
You can map an attoparsec parser over each line, distinguished by view lines, but keep in mind that an attoparsec parser can accumulate the whole text as it pleases, and this might not be a great idea over a 1-gigabyte chunk of text. If there is a repeated figure on each line (e.g. space-separated numbers) you can use Pipes.Attoparsec.parsed to stream them, as in the sketch below.
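For instance, a sketch of that streaming-parse idea, assuming a hypothetical record format of decimal numbers separated by commas and newlines (the parser here is mine, not from the question):

import Control.Monad (void)
import qualified Pipes.Prelude as P
import qualified Pipes.ByteString as PB
import qualified Pipes.Attoparsec as PA
import qualified Data.Attoparsec.ByteString.Char8 as A

-- One record: a decimal number, then any run of separators.
number :: A.Parser Int
number = A.decimal <* A.skipWhile (\c -> c == ',' || c == '\n')

main :: IO ()
main = do
  -- Each parsed Int is streamed downstream one at a time,
  -- so no line is ever held whole in memory.
  s <- P.fold (+) 0 id (void (PA.parsed number PB.stdin))
  print (s :: Int)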

Related

main: Prelude.!!: index too large

I am trying to write a program that will copy a file and allow us to rename it, like the cp command in Unix.
import System.IO
import System.Environment
import qualified Data.Text as T
import qualified Data.Text.IO as TI

main :: IO ()
main = do
  args <- getArgs
  let source = args !! 0
  let dest = args !! 1
  input <- TI.readFile source
  TI.writeFile dest input
I tried this and got main: Prelude.!!: index too large
As the error says, the index is too large: getArgs returned a list containing fewer than two elements, presumably because the program was run without (enough) command-line arguments.
Using (!!) is, however, often not a good idea: there is no guarantee that the element exists, and it runs in 𝓞(k) to obtain the k-th element. You can work with:
main :: IO ()
main = do
  args <- getArgs
  case args of
    (src:dst:_) -> do
      input <- TI.readFile src
      TI.writeFile dst input
    _ -> putStrLn "You should provide a source and destination"
Also, loading the entire content into memory is not a good idea, since the file can be larger than all (available) memory, thus crashing the program. Copying is therefore often done through two file handles: one that reads the source file and one that writes to the target file, with only a small buffer in memory (see the sketch below). Some file systems can also make copying more effective, for example by using two references to the same file, and only effectively making a copy if one of the two files is modified.
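A minimal sketch of such a streaming copy; the 64 KiB chunk size and the helper name copyStream are my choices:

import System.Environment (getArgs)
import System.IO
import qualified Data.ByteString as BS

main :: IO ()
main = do
  args <- getArgs
  case args of
    (src:dst:_) -> copyStream src dst
    _ -> putStrLn "You should provide a source and destination"

-- Copy in fixed-size chunks; only one chunk is in memory at a time.
copyStream :: FilePath -> FilePath -> IO ()
copyStream src dst =
  withFile src ReadMode $ \hIn ->
    withFile dst WriteMode $ \hOut ->
      let loop = do
            chunk <- BS.hGetSome hIn 65536
            if BS.null chunk
              then return ()
              else BS.hPut hOut chunk >> loop
      in loop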

Adding the possibility to write a AST-file to my (rail-)compiler

I'm writing a rail compiler (rail is an esoteric language) in Haskell and I have some problems with the main function of my main module.
1) I want my program to ask whether it should run the whole compiling pipeline or simply stop after the lexer and write the AST (Abstract Syntax Tree) to a file, so another compiler can deal with my AST. Here is my program:
module Main (
        main -- main function to run the program
    )
    where

-- imports --
import InterfaceDT as IDT
import qualified Testing as Test
import qualified Preprocessor as PreProc
import qualified Lexer
import qualified SyntacticalAnalysis as SynAna
import qualified SemanticalAnalysis as SemAna
import qualified IntermediateCode as InterCode
import qualified CodeOptimization as CodeOpt
import qualified Backend

-- functions --
main :: IO()
main = do putStr "Enter inputfile (path): "
          inputfile <- getLine
          input <- readFile inputfile
          putStr "Enter outputfile (path): "
          outputfile <- getLine
          input <- readFile inputfile
          putStr "Only create AST (True/False): "
          onlyAST <- getLine
          when (onlyAST == "True") $ do
              putStrLn "Building AST..."
              writeFile outputfile ((Lexer.process . PreProc.process) input)
          when (onlyAST == "False") $ do
              putStrLn ("Compiling " ++ inputfile ++ " to " ++ outputfile)
              writeFile outputfile ((Backend.process . CodeOpt.process . InterCode.process . SemAna.process . SynAna.process . Lexer.process . PreProc.process) input)
I get an error in line 21 (input <- readFile inputfile), caused by the <-. Why?
How should I do it?
2) Next, I want to refactor the program so that I can call it from the terminal with parameters, like runhaskell Main(AST) (in that way it should just create the AST) or like runhaskell Main.hs (in that way it should do the whole pipeline).
I hope for your help!
For your error in (1): your program doesn't look syntactically incorrect at line 21 to me. However, an error at <- would happen if that line were indented differently from the previous one. I suspect you have an indentation error due to mixing tabs and spaces, in a way that looks correct in your editor but disagrees with Haskell's interpretation of tabs. The simplest recommendation is to always use spaces and never tabs.
You also have an extra copy of that line later, which you might want to remove.
I also suspect you may need to use hFlush stdout after your putStrs, for them to work as prompts.
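For example, a small prompt helper along those lines (the name prompt is mine):

import System.IO (hFlush, stdout)

-- Flush stdout so the prompt text appears before getLine blocks.
prompt :: String -> IO String
prompt msg = do
  putStr msg
  hFlush stdout
  getLine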
For (2), I'd suggest using a library for proper command-line argument and option parsing, such as System.Console.GetOpt, which is included with GHC, or one of the fancier ones you can find on Hackage.
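For illustration, a minimal GetOpt sketch; the flag name and the placeholder actions are mine, not part of the rail compiler:

import System.Console.GetOpt
import System.Environment (getArgs)

data Flag = AstOnly deriving (Eq, Show)

options :: [OptDescr Flag]
options =
  [ Option ['a'] ["ast"] (NoArg AstOnly) "stop after the lexer and write the AST" ]

main :: IO ()
main = do
  argv <- getArgs
  case getOpt Permute options argv of
    (flags, [inFile, outFile], []) ->
      if AstOnly `elem` flags
        then putStrLn ("AST only: " ++ inFile ++ " -> " ++ outFile)       -- run lexer-only pipeline here
        else putStrLn ("Full pipeline: " ++ inFile ++ " -> " ++ outFile)  -- run whole pipeline here
    (_, _, errs) ->
      putStrLn (concat errs ++ usageInfo "Usage: rail [-a] <input> <output>" options)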

Processing a very large text file with lazy Texts and ByteStrings

I'm trying to process a very large Unicode text file (6GB+). What I want is to count the frequency of each unique word. I use a strict Data.Map to keep track of the counts of each word as I traverse the file.
The process takes too much time and too much memory (20GB+). I suspect the Map is huge, but I'm not sure it should reach 5x the size of the file!
The code is shown below. Please note that I tried the following:
Using Data.HashMap.Strict instead of Data.Map.Strict. Data.Map seems to perform better, in the sense of a slower rate of memory-consumption growth.
Reading the file using lazy ByteString instead of lazy Text, then decoding it to Text, doing some processing, and encoding it back to ByteString for IO.
import Data.Text.Lazy (Text(..), cons, pack, append)
import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as TI
import Data.Map.Strict hiding (foldr, map, foldl')
import System.Environment
import System.IO
import Data.Word

dictionate :: [Text] -> Map Text Word16
dictionate = fromListWith (+) . (`zip` [1,1..])

main = do
  [file,out] <- getArgs
  h <- openFile file ReadMode
  hO <- openFile out WriteMode
  mapM_ (flip hSetEncoding utf8) [h,hO]
  txt <- TI.hGetContents h
  TI.hPutStr hO . T.unlines .
    map (uncurry ((. cons '\t' . pack . show) . append)) .
    toList . dictionate . T.words $ txt
  hFlush hO
  mapM_ hClose [h,hO]
  print "success"
What's wrong with my approach? What's the best way to accomplish what I'm trying to do in terms of time and memory performance?
This memory usage is expected. Data.Map.Map consumes about 6N words of memory plus the size of the keys and values (data taken from this excellent post by Johan Tibell). A lazy Text value takes up 7 words plus 2*N bytes (rounded up to a multiple of the machine word size), and a Word16 takes up two words (header + payload). We will assume a 64-bit machine, so the word size is 8 bytes. We will also assume that the average string in the input is 8 characters long.
Taking this all into account, the final formula for the memory usage is 6*N + 7*N + 2*N + 2*N = 17*N words (about 136*N bytes) for N map entries: 6*N for the map structure, 7*N for the Text headers, 2*N for the 16 bytes of text payload, and 2*N for the Word16 values.
In the worst case, all words will be different and there will be about (6 * 1024^3)/8 ~= 800 * 10^6 of them. Plugging that into the formula above gives a worst-case map size of 136 bytes * 800 * 10^6, approx. 102 GiB, which seems to agree with the experimental results. Solving this equation in the reverse direction tells us that your file contains about 200 * 10^6 different words.
As for alternative approaches to this problem, consider using a trie (as suggested by J.Abrahamson in the comments) or an approximate method, such as a count-min sketch; a toy sketch of the latter follows.
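To give a feel for the count-min idea, here is a toy, purely functional sketch, assuming the hashable and vector packages (all names are mine). Memory stays fixed no matter how many distinct words arrive, and queries may overestimate but never underestimate; a serious implementation would use mutable vectors rather than copying a row per insert:

import qualified Data.Text.Lazy as T
import qualified Data.Vector.Unboxed as V
import Data.Hashable (hashWithSalt)

-- d rows of w counters; each row is indexed by a differently salted hash.
data CMS = CMS !Int [V.Vector Int]

emptyCMS :: Int -> Int -> CMS
emptyCMS w d = CMS w (replicate d (V.replicate w 0))

-- Bump one counter per row for the given word.
insertCMS :: T.Text -> CMS -> CMS
insertCMS x (CMS w rows) =
  CMS w [ V.accum (+) row [(hashWithSalt salt x `mod` w, 1)]
        | (salt, row) <- zip [0 ..] rows ]

-- The true count is at most the minimum over all rows.
queryCMS :: T.Text -> CMS -> Int
queryCMS x (CMS w rows) =
  minimum [ row V.! (hashWithSalt salt x `mod` w)
          | (salt, row) <- zip [0 ..] rows ]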
In the world of traditional data processing, this problem would have been solved by sorting (externally, on disk or magtape if needed), then scanning the sorted file to count the grouped-together runs of words. Of course you could do some partial reductions during the early phases of sorting, to save some space and time.

Haskell read first n lines

I'm trying to learn Haskell to get used to functional programming languages. I've decided to try a few problems on interviewstreet to start out. I'm having trouble reading from stdin and doing IO in general with Haskell's lazy IO.
Most of the problems have data coming from stdin in the following form:
n
data line 1
data line 2
data line 3
...
data line n
where n is the number of following lines coming from stdin and the next lines are the data.
How do I run my program on each of the n lines one at a time and return the solution to stdout?
I know the stdin input won't be very large but I'm asking about evaluating each line one at a time pretending the input is larger than what can fit in memory just to learn how to use haskell.
You can use interact, in conjunction with lines to process data from stdin one line at a time. Here's an example program that uses interact to access stdin, lines to split the data on each newline, a list comprehension to apply the function perLine to each line of the input, and unlines to put the output from perLine back together again.
main = interact processInput

processInput input = unlines [perLine line | line <- lines input]

perLine line = reverse line -- do whatever you want to 'line' here!
You don't need to worry about the size of the data you're getting over stdin; Haskell's laziness ensures that you only keep the parts you're actually working on in memory at any time.
EDIT: if you still want to work on only the first n lines, you can use the take function in the above example, like this:
processInput input = unlines [perLine line | line <- take 10 (lines input)]
This will terminate the program after the first ten lines have been read and processed.
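Putting the pieces together for the exact input format in the question (count line first, then n data lines), a minimal sketch:

main :: IO ()
main = interact $ \input ->
  let (countLine : rest) = lines input
      n = read countLine :: Int
  in unlines (map perLine (take n rest))
  where
    perLine = reverse -- put the real per-line solution here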
You can also use a simple recursion:
getMultipleLines :: Int -> IO [String]
getMultipleLines n
  | n <= 0    = return []
  | otherwise = do
      x <- getLine
      xs <- getMultipleLines (n-1)
      return (x:xs)
And then use it in your main:

main :: IO ()
main = do
  line <- getLine
  let numLines = read line :: Int
  inputs <- getMultipleLines numLines
  mapM_ putStrLn inputs -- placeholder: process the captured lines here

How to get good performance when writing a list of integers from 1 to 10 million to a file?

Question
I want a program that will write a sequence like,
1
...
10000000
to a file. What's the simplest code one can write and still get decent performance? My intuition is that there is some lack-of-buffering problem. My C code runs at 100 MB/s, whereas by reference the Linux command-line utility dd runs at 3 GB/s (sorry for the imprecision, see comments -- I'm more interested in the big-picture orders of magnitude anyway).
One would think this would be a solved problem by now ... i.e. any modern compiler would make it immediate to write such programs that perform reasonably well ...
C code
#include <stdio.h>

int main(int argc, char **argv) {
    int len = 10000000;
    for (int a = 1; a <= len; a++) {
        printf("%d\n", a);
    }
    return 0;
}
I'm compiling with clang -O3. A performance skeleton which calls putchar('\n') 8 times gets comparable performance.
Haskell code
A naive Haskell implementation runs at 13 MiB/sec, compiling with ghc -O2 -optc-O3 -optc-ffast-math -fllvm -fforce-recomp -funbox-strict-fields. (I haven't recompiled my libraries with -fllvm; perhaps I need to do that.) Code:

import Control.Monad

main = forM [1..10000000 :: Int] $ \j -> putStrLn (show j)
My best stab with Haskell runs only a bit faster, at 17 MiB/sec. The problem is that I can't find a good way to convert Vectors into ByteStrings (perhaps there's a solution using iteratees?).

import qualified Data.Vector.Unboxed as V
import Data.Vector.Unboxed (Vector, Unbox, (!))
import System.IO

writeVector :: (Unbox a, Show a) => Vector a -> IO ()
writeVector v = V.mapM_ (System.IO.putStrLn . show) v

main = writeVector (V.generate 10000000 id)
It seems that writing ByteStrings is fast, as demonstrated by this code, which writes an equivalent number of characters:

import Data.ByteString.Char8 as B

main = B.putStrLn (B.replicate 76000000 '\n')
This gets 1.3 GB/s, which isn't as fast as dd, but obviously much better.
Some completely unscientific benchmarking first. All programmes have been compiled with the default optimisation level (-O3 for gcc, -O2 for GHC) and run with

time ./prog > outfile
As a baseline, the C programme took 1.07s to produce a ~76MB (78888897 bytes) file, roughly 70MB/s throughput.

1. The "naive" Haskell programme (forM [1 .. 10000000] $ \j -> putStrLn (show j)) took 8.64s, about 8.8MB/s.
2. The same with forM_ instead of forM took 5.64s, about 13.5MB/s.
3. The ByteString version from dflemstr's answer took 9.13s, about 8.3MB/s.
4. The Text version from dflemstr's answer took 5.64s, about 13.5MB/s.
5. The Vector version from the question took 5.54s, about 13.7MB/s.
6. main = mapM_ (C.putStrLn . C.pack . show) $ [1 :: Int .. 10000000], where C is Data.ByteString.Char8, took 4.25s, about 17.9MB/s.
7. putStr . unlines . map show $ [1 :: Int .. 10000000] took 3.06s, about 24.8MB/s.
8. A manual loop,

   main = putStr $ go 1
     where
       go :: Int -> String
       go i
         | i > 10000000 = ""
         | otherwise    = shows i . showChar '\n' $ go (i+1)

   took 2.32s, about 32.75MB/s.
9. main = putStrLn $ replicate 78888896 'a' took 1.15s, about 66MB/s.
10. main = C.putStrLn $ C.replicate 78888896 'a', where C is Data.ByteString.Char8, took 0.143s, about 530MB/s; roughly the same figures hold for lazy ByteStrings.
What can we learn from that?
First, don't use forM or mapM unless you really want to collect the results; performance-wise, that sucks.
Then, ByteString output can be very fast (10.), but if the construction of the ByteString to output is slow (3.), you end up with slower code than the naive String output.
What's so terrible about 3.? Well, all the involved Strings are very short. So you get a list of

Chunk "1234567" Empty

and between any two such, a Chunk "\n" Empty is put; then the resulting list is concatenated, which means all these Emptys are tossed away when a ... (Chunk "1234567" (Chunk "\n" (Chunk "1234568" (...)))) is built. That's a lot of wasteful construct-deconstruct-reconstruct going on. Speed comparable to that of the Text and the fixed "naive" String versions can be achieved by packing to strict ByteStrings and using fromChunks (with Data.List.intersperse for the newlines). Better performance, slightly better than 6., can be obtained by eliminating the costly singletons: if you glue the newlines onto the Strings, using \k -> shows k "\n" instead of show, the concatenation has to deal with half as many, slightly longer, ByteStrings, which pays off.
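A sketch of that fromChunks-plus-glued-newlines variant (the helper name line is mine):

import qualified Data.ByteString.Char8 as S
import qualified Data.ByteString.Lazy as L

-- Glue the newline onto each number with shows, pack each result to a
-- strict chunk, and assemble one lazy ByteString from the chunks.
main :: IO ()
main = L.putStr (L.fromChunks (map line [1 .. 10000000]))
  where
    line :: Int -> S.ByteString
    line k = S.pack (shows k "\n")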
I'm not familiar enough with the internals of either text or vector to offer more than a semi-educated guess concerning the reasons for the observed performance, so I'll leave them out. Suffice it to say that the performance gain is marginal at best compared to the fixed naive String version.
Now, 6. shows that ByteString output is faster than String output, enough that in this case the additional work of packing is more than compensated. However, don't be fooled into believing that is always so: if the Strings to pack are long, the packing can take more time than the String output.
But ten million invocations of putStrLn, be it the String or the ByteString version, take a lot of time. It's faster to grab the stdout Handle just once and construct the output String in non-IO code. unlines already does well, but we still suffer from the construction of the list in map show [1 .. 10^7]. Unfortunately, the compiler didn't manage to eliminate that (but it eliminated [1 .. 10^7], which is already pretty good). So let's do it ourselves, leading to 8. That's not too terrible, but it still takes more than twice as long as the C programme.
One can make a faster Haskell programme by going low-level and directly filling ByteStrings without going through String via show, but I don't know if C speed is reachable that way. Anyway, that low-level code isn't very pretty, so I'll spare you what I have; but sometimes one has to get one's hands dirty if speed matters.
Using lazy ByteStrings gives you some buffering, because the string will be written out as it is produced and more numbers will only be generated as they are needed. This code shows the basic idea (there might be some optimizations that could be made):
import qualified Data.ByteString.Lazy.Char8 as ByteString

main =
  ByteString.putStrLn .
  ByteString.intercalate (ByteString.singleton '\n') .
  map (ByteString.pack . show) $
  ([1..10000000] :: [Int])
I still use Strings for the numbers here, which leads to horrible slowdowns. If we switch to the text library instead of the bytestring library, we get access to "native" show functions for ints, and can do this:
import Data.Monoid
import Data.List
import Data.Text.Lazy.IO as Text
import Data.Text.Lazy.Builder as Text
import Data.Text.Lazy.Builder.Int as Text

main :: IO ()
main =
  Text.putStrLn .
  Text.toLazyText .
  mconcat .
  intersperse (Text.singleton '\n') .
  map Text.decimal $
  ([1..10000000] :: [Int])
I don't know how you are measuring the "speed" of these programs (with the pv tool?) but I imagine that one of these procedures will be the fastest trivial program you can get.
If you are going for maximum performance, then it helps to take a holistic view; i.e., you want to write a function that maps from [Int] to a series of system calls that write chunks of memory to a file.
Lazy ByteStrings are a good representation for a sequence of chunks of memory. Mapping a lazy ByteString to a series of system calls that write chunks of memory is what L.hPut does (assuming an import qualified Data.ByteString.Lazy as L). Hence, we just need a means to efficiently construct the corresponding lazy ByteString. This is what lazy ByteString builders are good at. With the new bytestring builder (here is the API documentation), the following code does the job.
import qualified Data.ByteString.Lazy as L
import Data.ByteString.Lazy.Builder (toLazyByteString, charUtf8)
import Data.ByteString.Lazy.Builder.ASCII (intDec)
import Data.Foldable (foldMap)
import Data.Monoid (mappend)
import System.IO (openFile, IOMode(..))

main :: IO ()
main = do
  h <- openFile "/dev/null" WriteMode
  L.hPut h $ toLazyByteString $
    foldMap ((charUtf8 '\n' `mappend`) . intDec) [1..10000000]
Note that I output to /dev/null to avoid interference by the disk driver. The effort of moving the data to the OS remains the same. On my machine, the above code runs in 0.45 seconds, which is 12 times faster than the 5.4 seconds of your original code. This implies a throughput of 168 MB/s. We can squeeze out an additional 30% speed (220 MB/s) using bounded encodings.
import qualified Data.ByteString.Lazy.Builder.BasicEncoding as E

L.hPut h $ toLazyByteString $
  E.encodeListWithB
    ((\x -> (x, '\n')) E.>$< E.intDec `E.pairB` E.charUtf8)
    [1..10000000]
The syntax looks a bit quirky because a BoundedEncoding a specifies the conversion of a Haskell value of type a to a bounded-length sequence of bytes, such that the bound can be computed at compile time. This allows functions such as E.encodeListWithB to perform some additional optimizations for implementing the actual filling of the buffer. See the documentation of Data.ByteString.Lazy.Builder.BasicEncoding in the above link to the API documentation for more information.
Here is the source of all my benchmarks.
The conclusion is that we can get very good performance from a declarative solution, provided that we understand the cost model of our implementation and use the right data structures. Whenever you are constructing a packed sequence of values (e.g., a sequence of bytes represented as a ByteString), the right data structure to use is a ByteString Builder.
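For what it's worth, in later bytestring releases this builder API ships with the library itself as Data.ByteString.Builder. A minimal sketch under that assumption (bytestring >= 0.10), using hPutBuilder, which fills the handle's buffer directly and so avoids even the intermediate lazy ByteString:

import qualified Data.ByteString.Builder as B
import Data.Monoid ((<>))
import System.IO (stdout, hSetBinaryMode)

main :: IO ()
main = do
  -- binary mode: skip newline translation and encoding overhead
  hSetBinaryMode stdout True
  B.hPutBuilder stdout $
    foldMap (\i -> B.intDec i <> B.charUtf8 '\n') [1 .. 10000000 :: Int]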
