How to read big csv file? - haskell

I am trying to read a big CSV file in Haskell and generate the word count for each column.
The file has more than 4M rows.
So I chose to read one block at a time (5k rows per block) and compute the word count for each block,
and then sum the results together.
When I test the function with 12000 rows and 120000 rows, the time increases almost linearly.
But when reading 180000 rows, the run time increases more than fourfold.
I think this is because there is not enough memory, and swapping to disk makes the function much slower.
I have written my code in a map/reduce style, but how do I make Haskell not hold all the data in memory?
Below are my code and the profiling results.
    import Data.Ord
    import Text.CSV.Lazy.String
    import Data.List
    import System.IO
    import Data.Function (on)
    import System.Environment

    splitLength = 5000

    mySplit' [] = []
    mySplit' xs = [x] ++ mySplit' t
      where
        x = take splitLength xs
        t = drop splitLength xs

    getBlockCount :: Ord a => [[a]] -> [[(a,Int)]]
    getBlockCount t = map
        (map (\x -> ((head x),length x))) $
        map group $ map sort $ transpose t

    foldData :: Ord a => [(a,Int)] -> [(a,Int)] -> [(a,Int)]
    foldData lxs rxs = map combind wlist
      where
        wlist = groupBy ((==) `on` fst) $ sortBy (comparing fst) $ lxs ++ rxs
        combind xs
          | 1 == (length xs) = head xs
          | 2 == (length xs) = (((fst . head) xs ), ((snd . head) xs)+((snd . last) xs))

    loadTestData datalen = do
        testFile <- readFile "data/test_csv"
        let cfile = fromCSVTable $ csvTable $ parseCSV testFile
        let column = head cfile
        let body = take datalen $ tail cfile
        let countData = foldl1' (zipWith foldData) $ map getBlockCount $ mySplit' body
        let output = zip column $ map ( reverse . sortBy (comparing snd) ) countData
        appendFile "testdata" $ foldl1 (\x y -> x ++"\n"++y)$ map show $tail output

    main = do
        s <- getArgs
        loadTestData $ read $ last s
Profiling results:
    loadData +RTS -p -RTS 12000
    total time = 1.02 secs (1025 ticks @ 1000 us, 1 processor)
    total alloc = 991,266,560 bytes (excludes profiling overheads)

    loadData +RTS -p -RTS 120000
    total time = 17.28 secs (17284 ticks @ 1000 us, 1 processor)
    total alloc = 9,202,259,064 bytes (excludes profiling overheads)

    loadData +RTS -p -RTS 180000
    total time = 85.06 secs (85059 ticks @ 1000 us, 1 processor)
    total alloc = 13,760,818,848 bytes (excludes profiling overheads)

So first, a few suggestions.
Lists aren't fast. Okay, okay, cons is constant time, but in general, lists aren't fast. You're using lists. (Data.Sequence would've been faster for two-ended cons'ing and consumption)
Strings are slow. Strings are slow because they're [Char] (List of Char). The library you're currently using is written in terms of lists of Strings. Usually linked-lists of linked-lists of characters aren't what you want for text processing. This is no bueno. Use Text (for, uh, text) or ByteString (for bytes) instead of String in future unless it's something small and not performance sensitive.
The library you are using is just lazy, not streaming. You'd have to handle getting streaming behavior overlaid onto the lazy semantics to get constant memory use. Streaming libraries solve the problem of incrementally processing data and limiting memory use. I'd suggest learning Pipes or Conduit for that general class of problems. Some problem-specific libraries will also offer an iteratee API which can be used for streaming. Iteratee APIs can be used directly or hooked up to Pipes/Conduit/etc.
I don't think the library you're using is a good idea.
I suggest you use one of the following libraries:
http://hackage.haskell.org/package/pipes-csv (Pipes based)
https://hackage.haskell.org/package/cassava-0.4.2.0/docs/Data-Csv-Streaming.html (Common CSV library, not based on a particular streaming library)
https://hackage.haskell.org/package/csv-conduit (Conduit based)
These should give you good performance and constant memory use modulo whatever you might be accumulating.

There are a couple of things to be aware of:
You want to stream the data so that you are only holding in memory a small portion of the input file at any time. You might be able to accomplish this with lazy IO and the lazy-csv package. However, it still is easy to inadvertently hold on to references which keep all of your input in memory. A better option is to use a streaming library like csv-conduit or pipes-csv.
Use ByteString or Text when processing large amounts of string data.
You want to make sure to use strict operations when reducing your data. Otherwise you will just be building up thunks of unevaluated expressions in memory until the very end when you print out the result. One place where thunks could be building up is your foldData function - the word count expressions do not appear to be getting reduced.
Here is an example of a program which will compute the total length of all of the words in each column of a CSV file and does it in constant memory. The main features are:
uses lazy IO
uses the lazy-csv package with (lazy) ByteString instead of String
uses BangPatterns to strictify the computation of the number of lines
uses an unboxed array to hold the column counters
The code:
    {-# LANGUAGE BangPatterns #-}
    import qualified Data.ByteString.Lazy.Char8 as BS
    import Data.ByteString.Lazy (ByteString)
    import Text.CSV.Lazy.ByteString
    import System.Environment (getArgs)
    import Data.List (foldl')
    import Data.Int
    import Data.Array.IO
    import Data.Array.Unboxed
    import Control.Monad

    type Length = Int64 -- use Int on 32-bit systems

    main = do
      (arg:_) <- getArgs
      (line1:lns) <- fmap BS.lines $ BS.readFile arg
      -- line1 contains the header
      let (headers:_) = [ map csvFieldContent r | r <- csvTable (parseCSV line1) ]
          ncols = length headers :: Int
      arr <- newArray (1,ncols) 0 :: IO (IOUArray Int Length)
      let inc i a = do v <- readArray arr i; writeArray arr i (v+a)
      let loop !n [] = return n
          loop !n (b:bs) = do
            let lengths = map BS.length $ head [ map csvFieldContent r | r <- csvTable (parseCSV b) ]
            forM_ (zip [1..] lengths) $ \(i,a) -> inc i a
            loop (n+1) bs
      print headers
      n <- loop 0 lns
      putStrLn $ "n = " ++ show (n :: Int)
      arr' <- freeze arr :: IO (UArray Int Length)
      putStrLn $ "totals = " ++ show arr'
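For the per-column word count that the question asks for, the same strictness idea can be sketched with one Data.Map.Strict per column. This is a minimal sketch rather than a tuned implementation: splitOnComma, addRow and countColumns are names made up for illustration, and the comma splitter assumes there are no quoted fields, so a real program would take its rows from a CSV parser instead.

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as M

-- One counter map per column: word -> number of occurrences.
type Counts = [M.Map String Int]

-- Naive field splitter. ASSUMPTION: no quoted fields or embedded commas;
-- a real program would take rows from a CSV parser instead.
splitOnComma :: String -> [String]
splitOnComma s = case break (== ',') s of
  (field, [])       -> [field]
  (field, _ : rest) -> field : splitOnComma rest

-- Add one row to the running counts. Each updated map and the rest of the
-- list are forced before the new list is returned, so no chain of
-- unevaluated updates builds up from row to row.
addRow :: Counts -> [String] -> Counts
addRow (m : ms) (f : fs) =
  let m'  = M.insertWith (+) f 1 m
      ms' = addRow ms fs
  in m' `seq` ms' `seq` (m' : ms')
addRow ms _ = ms  -- short row: leave the remaining columns untouched

-- Count the words of each column over all data rows (header excluded).
countColumns :: Int -> [String] -> Counts
countColumns ncols = foldl' addRow (replicate ncols M.empty) . map splitOnComma
```

For example, countColumns 2 ["a,x", "b,x", "a,y"] counts a twice and b once in the first column. Fed with the lazily read lines of a file, memory should then grow with the number of distinct words rather than with the number of rows.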

I have had this issue before in another language. The trick is not to read the whole file into memory, but to read it one line at a time. When you read the next line, just overwrite your variables, since you only need the running word count.
Test for the end-of-file (EOF) condition on your I/O stream and exit when you reach it. That way you don't have to split the file.
Hope that helps.
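In Haskell, the read-a-line, update, stop-at-EOF loop described above could look roughly like this. It is only a sketch: foldLines is a hypothetical helper name, and it reads String lines for brevity, where a real program would more likely use ByteString or Text.

```haskell
{-# LANGUAGE BangPatterns #-}
import System.IO

-- Fold an accumulator over the lines of a handle, reading one line at a
-- time and stopping at end of file. The bang pattern forces the
-- accumulator (to weak head normal form) on every iteration.
foldLines :: (a -> String -> a) -> a -> Handle -> IO a
foldLines step = go
  where
    go !acc h = do
      eof <- hIsEOF h
      if eof
        then return acc
        else do
          line <- hGetLine h
          go (step acc line) h
```

A line counter would then be withFile path ReadMode (foldLines (\n _ -> n + 1) (0 :: Int)), where path stands for the input file. Only the current line and the accumulator are live at any time, provided the accumulator itself does not hide unevaluated work in lazy fields.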

Related

How to make foldl consume constant memory?

We define the following data type Stupid:
    import qualified Data.Vector as V
    import Data.List (foldl')
    data Stupid = Stupid {content::V.Vector Int, ul::Int} deriving Show
Now I have two slightly different pieces of code.
    foldl' (\acc x->Stupid{content=(content acc) V.// [(x,x+123)],ul=1}) (Stupid {content=V.replicate 10000 10,ul=1}) $ take 100000 $ cycle [0..9999]
takes constant memory (~100M), while
    foldl' (\acc x->Stupid{content=(content acc) V.// [(x,x+123)],ul=ul acc}) (Stupid {content=V.replicate 10000 10,ul=1}) $ take 100000 $ cycle [0..9999]
takes a huge amount of memory (~8G).
Theoretically, only one copy of the current Stupid object is needed throughout the process in both cases. I don't understand why there is such a difference in memory consumption when I access and record ul acc.
Can someone explain why this happens and give a workaround that runs in constant memory if I need to access ul acc? Thanks.
Note: I know that I can do the vector replacements in a batch; this script is just for demonstration purposes, so please don't modify that part.
I would try to force the fields of Stupid and see if that helps.
    let f acc x = c `seq` a `seq` Stupid{content=c,ul=a}
          where
            c = content acc V.// [(x,x+123)]
            a = ul acc
    in foldl' f (Stupid {content=V.replicate 10000 10,ul=1}) $
         take 100000 $
         cycle [0..9999]
This should be nearly equivalent to forcing the parameters of the function:
    foldl' (\acc x -> acc `seq` x `seq`
              Stupid{content=(content acc) V.// [(x,x+123)],ul=ul acc})
           (Stupid {content=V.replicate 10000 10,ul=1}) $ take 100000 $ cycle [0..9999]
(This can also be written with bang patterns, if one prefers those.)
Another, more aggressive option would be to use strictness annotations in the definition of the Stupid constructor.
    data ... = Stupid { content :: !someType, ul :: !someOtherType }
This will always force those fields, everywhere in the program.
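A likely reason for the difference: foldl' only evaluates the accumulator to its outermost constructor, so with lazy fields ul = ul acc is stored as an unevaluated thunk that still points at the previous record, and with it at the previous vector. The sketch below shows the strict-field variant on a small stand-in record that needs only base. Acc, step and run are hypothetical names, not the types from the question.

```haskell
import Data.List (foldl')

-- Stand-in for the record in the question. The '!' marks make both fields
-- strict, so constructing an Acc evaluates them first and the new record
-- holds plain values instead of thunks that refer to the old record.
data Acc = Acc { total :: !Int, seen :: !Int } deriving (Eq, Show)

-- Carries 'seen' over from the previous accumulator, like 'ul = ul acc'.
step :: Acc -> Int -> Acc
step acc x = Acc { total = total acc + x, seen = seen acc }

run :: Int -> Acc
run n = foldl' step (Acc 0 1) [1 .. n]
```

With the annotations removed, the same fold would still compute the same result, but each step would leave behind a thunk for every field.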

Benchmarking Filter and Partition

I was testing the performance of the partition function for lists and got some strange results, I think.
We have that partition p xs == (filter p xs, filter (not . p) xs), but we chose the first implementation because it performs only a single traversal of the list. Yet the results I got suggest that it may be better to use the implementation that uses two traversals.
Here is the minimal code that shows what I'm seeing
    import Criterion.Main
    import System.Random
    import Data.List (partition)

    mypartition :: (a -> Bool) -> [a] -> ([a],[a])
    mypartition p l = (filter p l, filter (not . p) l)

    randList :: RandomGen g => g -> Integer -> [Integer]
    randList gen 0 = []
    randList gen n = x:xs
      where
        (x, gen') = random gen
        xs = randList gen' (n - 1)

    main = do
      gen <- getStdGen
      let arg10000000 = randList gen 10000000
      defaultMain [
          bgroup "filters -- split list in half " [
            bench "partition100" $ nf (partition (>= 50)) arg10000000
          , bench "mypartition100" $ nf (mypartition (>= 50)) arg10000000
          ]
        ]
I ran the tests both with -O and without it, and both times the double traversal came out faster.
I am using ghc-7.10.3 with criterion-1.1.1.0.
My questions are:
Is this expected?
Am I using Criterion correctly? I know that laziness can be tricky and that (filter p xs, filter (not . p) xs) will only do two traversals if both elements of the tuple are used.
Does this have something to do with the way lists are handled in Haskell?
Thanks a lot!
There is no black or white answer to the question. To dissect the problem consider the following code:
    import Control.DeepSeq
    import Data.List (partition)
    import System.Environment (getArgs)

    mypartition :: (a -> Bool) -> [a] -> ([a],[a])
    mypartition p l = (filter p l, filter (not . p) l)

    main :: IO ()
    main = do
      let cnt = 10000000
          xs = take cnt $ concat $ repeat [1 .. 100 :: Int]
      args <- getArgs
      putStrLn $ unwords $ "Args:" : args
      case args of
        [percent, fun]
          -> let p = (read percent >=)
             in case fun of
               "partition" -> print $ rnf $ partition p xs
               "mypartition" -> print $ rnf $ mypartition p xs
               "partition-ds" -> deepseq xs $ print $ rnf $ partition p xs
               "mypartition-ds" -> deepseq xs $ print $ rnf $ mypartition p xs
               _ -> err
        _ -> err
      where
        err = putStrLn "Sorry, I do not understand."
I do not use Criterion, so as to have better control over the order of evaluation. To get timings, I use the +RTS -s runtime option. The different test cases are selected with command line arguments. The first argument defines the percentage of the data for which the predicate holds. The second argument chooses between the different tests.
The tests distinguish two cases:
The data is generated lazily (2nd argument partition or mypartition).
The data is already fully evaluated in memory (2nd argument partition-ds or mypartition-ds).
The result of the partitioning is always evaluated from left to right, i.e. starting with the list that contains all the elements for which the predicate holds.
In case 1 partition has the advantage that elements of the first resulting list get discarded before all elements of the input list were even produced. Case 1 is especially good, if the predicate matches many elements, i.e. the first command line argument is large.
In case 2, partition cannot play out this advantage, since all elements are already in memory.
For mypartition, in any case all elements are held in memory after the first resulting list is evaluated, because they are needed again to compute the second resulting list. Therefore there is not much of a difference between the two cases.
It seems, the more memory is used, the harder garbage collection gets. Therefore partition is well suited, if the predicate matches many elements and the lazy variant is used.
Conversely, if the predicate does not match many elements or all elements are already in memory, mypartition performs better, since its recursion does not deal with pairs in contrast to partition.
The Stackoverflow question “Irrefutable pattern does not leak memory in recursion, but why?” might give some more insights about the handling of pairs in the recursion of partition.
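To make "deals with pairs" concrete, here is a sketch of a single-traversal partition written with foldr. As far as I know it is close to how Data.List defines partition, but treat it as an illustration rather than the library source; partitionSketch is a made-up name.

```haskell
-- Single-traversal partition: every element is examined once, and each
-- step has to take apart and rebuild the pair produced by the rest of
-- the list.
partitionSketch :: (a -> Bool) -> [a] -> ([a], [a])
partitionSketch p = foldr select ([], [])
  where
    -- The lazy pattern ~(ts, fs) delays matching on the recursive result,
    -- so the first elements can be returned before the whole input has
    -- been traversed.
    select x ~(ts, fs)
      | p x       = (x : ts, fs)
      | otherwise = (ts, x : fs)
```

Each step allocates a new pair plus the selector thunks behind the lazy pattern, which is the overhead that the two-filter version avoids at the price of keeping the input list alive for the second pass.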

GHC profiling file and chart are contradictory

I have a sieve of Eratosthenes program written in ST.Strict, and I was profiling it when I saw that it was taking a ridiculous amount of memory:
    Sun Jul 10 18:27 2016 Time and Allocation Profiling Report (Final)
    Primes +RTS -hc -p -K1000M -RTS 10000000
    total time = 2.32 secs (2317 ticks @ 1000 us, 1 processor)
    total alloc = 5,128,702,952 bytes (excludes profiling overheads)
(where 10^7 is the limit up to which I asked it to generate primes).
Weirdly, the profiling graph shows something completely different:
Am I misreading something in one of these graphs? Or is there something wrong with one of these tools?
For reference, my code is
    {-# LANGUAGE BangPatterns #-}
    import Prelude hiding (replicate, read)
    import qualified Text.Read as T
    import Data.Vector.Unboxed.Mutable(replicate, write, read)
    import Control.Monad.ST.Strict
    import Data.STRef
    import Control.Monad.Primitive
    import Control.Monad
    import System.Environment

    main = print . length . primesUpTo . T.read . head =<< getArgs

    primesUpTo :: Int -> [Int]
    primesUpTo n = runST $ do
        primes <- replicate n True
        write primes 0 False
        write primes 1 False
        sieve 2 primes
        return []
        -- Removed to avoid the memory allocation of creating the list for profiling purposes
        -- filterM (read primes) [0..n-1]
      where
        sieve !i primes | i * i >= n = return primes
        sieve !i primes = do
          v <- read primes i
          counter <- newSTRef $ i * i
          when v $ whileM_ ((< n) <$!> readSTRef counter) $ do
            curr_count <- readSTRef counter
            write primes curr_count False
            writeSTRef counter (curr_count + i)
          sieve (i + 1) primes

    whileM_ :: (Monad m) => m Bool -> m a -> m ()
    whileM_ condition body = do
      cond <- condition
      when cond $ do
        body
        whileM_ condition body
This seems to confuse many people.
> total alloc = 5,128,702,952 bytes (excludes profiling overheads)
This is literally the total size of all the allocations ever performed by your program, including "temporary" objects that become dead almost immediately after being allocated. Allocation itself is nearly free, and generally Haskell programs allocate at a rate of around 1-2 GB/s.
> Weirdly, the profiling graph shows something completely different:
Indeed, the profiling graph shows the total size of all the objects that are live on the heap at any particular time. This reflects the space usage of your program. If your program runs in constant space, then the number shown in this graph will stay constant.
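A tiny illustration of the gap between the two numbers, as a sketch with a made-up function name (sumTo). Compiled without optimization, this is expected to allocate a list cell and a boxed Int per element, so the total allocation grows with n, while the live heap stays small because each cell becomes garbage as soon as it has been consumed. With -O the list may be fused away entirely.

```haskell
import Data.List (foldl')

-- Strict left fold over a lazily generated list: a lot of short-lived
-- allocation, but very little data live at any one moment.
sumTo :: Int -> Int
sumTo n = foldl' (+) 0 [1 .. n]
```

Running such a program with +RTS -s prints both figures side by side: "bytes allocated in the heap" corresponds to total alloc, and "maximum residency" corresponds to the peak of the heap graph.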

Lazily read and manipulate float from stdin in haskell

I'm trying to read data as Doubles from stdin, manipulate them and write them as well. What I've come up with so far is:
    import qualified Data.ByteString.Lazy as B
    import Data.Binary.IEEE754
    import Data.Binary.Get

    -- gives a list of doubles read from stdin
    listOfFloat64le = do
      empty <- isEmpty
      if empty
        then return []
        else do v <- getFloat64le
                rest <- listOfFloat64le
                return (v : rest)

    -- delay signal by one
    delay us = 0 : us

    -- feedback system, add delayed version of signal to signal
    sys us = zipWith (+) us (delay us)

    main = do
      input <- B.getContents
      let hs = sys $ runGet listOfFloat64le input
      print $ take 10 hs
The idea is to feed data to the program, which is then passed through a feedback system before it is written to stdout. Right now, though, it just prints the first 10 values.
This works but does not seem to evaluate lazily, i.e. it has to read all the input into memory.
So:
    dd if=/dev/urandom bs=8 count=10 | runhaskell feedback.hs
will work just fine but:
    dd if=/dev/urandom | runhaskell feedback.hs
will not. My guess is that the listOfFloat64le function is what makes things not work properly. So how do I create an iterable to pass into my sys function without having to read everything into memory?
I'm not a very experienced Haskeller.
I took another route by instead splitting the ByteString at intervals of 8 bytes and mapping over it instead:
    import qualified Data.ByteString.Lazy as L
    import Data.Binary.IEEE754
    import Data.Binary.Get

    -- delay signal by one
    delay us = 0 : us

    -- feedback system, add delayed version of signal to signal
    sys us = zipWith (+) us (delay us)

    -- split ByteString into chunks of size n
    chunk n xs = if (L.null xs)
                   then []
                   else y1 : chunk n y2
      where
        (y1, y2) = L.splitAt n xs

    main = do
      input <- L.getContents
      let signal = map (runGet getFloat64le) (chunk 8 input)
      print $ take 10 (sys signal)
This seems to work, at least, but I don't know what the performance is like.
EDIT: I switched from chunk to chunker, which uses runGetState instead:
    chunker :: Get a -> L.ByteString -> [a]
    chunker f input = if (L.null input)
                        then []
                        else val : chunker f rest
      where
        (val, rest, _) = runGetState f input 0
And I use it like this: let signal = chunker getFloat64le input
See this question. It looks like Binary has become more strict than it was a long time ago when I used it.
This seems like a standard problem where you can easily use something like pipes or conduit. You can make stdin the source and stdout the sink, and apply the transformer as a conduit.
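Without pulling in a streaming library, the chunker idea above can also be sketched with runGetOrFail, which returns the unconsumed input together with the decoded value. The names decodeAll, doubles and encodeDoubles are made up for this sketch, and it assumes a version of binary that provides getDoublele and putDoublele (0.8.4 or later, as far as I know); with older versions, getFloat64le from data-binary-ieee754 would take their place.

```haskell
import qualified Data.ByteString.Lazy as L
import Data.Binary.Get (Get, getDoublele, runGetOrFail)
import Data.Binary.Put (putDoublele, runPut)

-- Decode values one at a time, handing the unconsumed input to the next
-- step. The list is produced lazily, so a consumer such as 'take 10'
-- only pulls as much input as it needs.
decodeAll :: Get a -> L.ByteString -> [a]
decodeAll g input
  | L.null input = []
  | otherwise =
      case runGetOrFail g input of
        Left _             -> []  -- stop on a truncated or malformed value
        Right (rest, _, v) -> v : decodeAll g rest

-- Little-endian doubles, as in the question.
doubles :: L.ByteString -> [Double]
doubles = decodeAll getDoublele

-- Helper used only for trying the decoder out: encode a list of doubles.
encodeDoubles :: [Double] -> L.ByteString
encodeDoubles = runPut . mapM_ putDoublele
```

In main this would be used as print . take 10 . sys . doubles =<< L.getContents, with sys as defined in the question.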

How to use Criterion to measure performance of Haskell programs?

I'm trying to measure the performance of a simple Haar DWT program using the Criterion framework. (It is erroneously slow, but I'll leave that for another question). I can't find any good documentation on the web, unfortunately. My two primary problems are
How can one pass data from one benchmark to another? I want to time each stage of the program.
How does the sampling work, and avoid lazy evaluation reusing its previous computations?
This source is relatively pared down; the first function getRandList generates a list of random numbers; haarStep transforms an input signal into differences and sums, and haarDWT calls the former and recurses on the sums. I'm trying to pass the getRandList to the haarDWT via lazy evaluation, but perhaps my usage is incorrect / unsupported. The timings don't seem to make sense.
    {-# LANGUAGE ViewPatterns #-}
    import Control.Arrow
    import qualified Data.Vector.Unboxed as V
    import System.Random
    import Criterion.Main

    invSqrt2 = 0.70710678118654752440

    getRandList :: RandomGen g => g -> Int -> [Float]
    getRandList gen 0 = []
    getRandList gen n = v:rest where
        (v, gen') = random gen
        rest = getRandList gen' (n - 1)

    haarStep :: V.Vector Float -> (V.Vector Float, V.Vector Float)
    haarStep = (alternatingOp (-) &&& alternatingOp (+)) where
        alternatingOp op x = V.generate (V.length x `div` 2) (\i ->
            ((x V.! (2 * i)) `op` (x V.! (2 * i + 1))) * invSqrt2)

    haarDWT :: V.Vector Float -> V.Vector Float
    haarDWT xl@(V.length -> 1) = xl
    haarDWT (haarStep -> (d, s)) = haarDWT s V.++ d

    main = do
        gen <- getStdGen
        inData <- return $ getRandList gen 2097152
        outData <- return $ haarDWT (V.fromList inData)
        defaultMain [
            bench "get input" $ nf id inData,
            bench "transform" $ nf V.toList outData
            ]
        writeFile "input.dat" (unlines $ map show inData)
        writeFile "output.dat" (unlines $ map show $ V.toList outData)
Finally, I'm getting an error when I try to call it with -s 1; maybe this is just a Criterion bug.
Main: ./Data/Vector/Generic.hs:237 ((!)): index out of bounds (1,1)
Thanks in advance!
The posted benchmark is erroneously slow... or is it?
Are you sure it's erroneous? You're touching (well, the "nf" call is touching) 2 million boxed elements - that's 4 million pointers. You can call this erroneous if you want, but the issue is just what you think you're measuring compared to what you really are measuring.
Sharing Data Between Benchmarks
Data sharing can be accomplished through partial application. In my benchmarks I commonly have
    let var = somethingCommon in
      defaultMain [ bench "one" (nf (func1 var) input1)
                  , bench "two" (nf (func2 var) input2)]
Avoiding Reuse in the Presence of Lazy Evaluation
Criterion avoids sharing by separating out your function and your input. You have signatures such as:
    funcToBenchmark :: (NFData b) => a -> b
    inputForFunc :: a
In Haskell, every time you apply funcToBenchmark inputForFunc it will create a thunk that needs to be evaluated. There is no sharing unless you use the same variable name as a previous computation. There is no automatic memoization - this seems to be a common misunderstanding.
Notice the nuance in what isn't shared. We aren't sharing the final result, but the input is shared. If the generation of the input is what you want to benchmark (i.e. getRandList, in this case) then benchmark that and not just the identity + nf function:
    main = do
      gen <- getStdGen
      let inData = getRandList gen size
          inVec = V.fromList inData
          size = 2097152
      defaultMain
        [ bench "get input for real" $ nf (getRandList gen) size
        , bench "get input for real and run harrDWT and listify a vector" $ nf (V.toList . haarDWT . V.fromList . getRandList gen) size
        , bench "screw generation, how fast is haarDWT" $ whnf haarDWT inVec] -- for unboxed vectors whnf is sufficient
Interpreting Data
The third benchmark is rather instructive. Let's look at what Criterion prints out:
    benchmarking screw generation, how fast is haarDWT
    collecting 100 samples, 1 iterations each, in estimated 137.3525 s
    bootstrapping with 100000 resamples
    mean: 134.7204 ms, lb 134.5117 ms, ub 135.0135 ms, ci 0.950
Based on a single run, Criterion thinks it will take 137 seconds to perform its 100 samples. About ten seconds later it was done - what happened? Well, the first run forced all the inputs (inVec), which was expensive. The subsequent runs found a value instead of a thunk, and thus we truly benchmarked haarDWT and not the StdGen RNG (which is known to be painfully slow).

Resources