I'm trying to run a moderately expensive function over a large list of inputs, using part of that function's output as one of its inputs. The code runs as expected, but it consumes a large amount of memory in the process (just under 22GiB allocated on the heap, just over 1GiB maximum residency). Here is a simplified example of what I mean:
{-# LANGUAGE OverloadedStrings #-}
import Data.List (foldl')
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.IO as TL
import qualified Data.Text.Lazy.Builder as TB
main :: IO ()
main = TL.putStr $ TB.toLazyText showInts

showInts :: TB.Builder
showInts = foldMap fst shownLines
  where
    shownLines = map (showInt maxwidth) [0..10^7]
    maxwidth = foldl' (\n -> max n . snd) 0 shownLines

showInt :: Int -> Int -> (TB.Builder, Int)
showInt maxwidth n = (builder, len)
  where
    builder = TB.fromText "This number: "
      <> TB.fromText (T.replicate (maxwidth - len) " ") <> thisText
      <> TB.singleton '\n'
    (thisText, len) = expensiveShow n

expensiveShow :: Int -> (TB.Builder, Int)
expensiveShow n = (TB.fromText text, T.length text)
  where text = T.pack (show n)
Note that in the where clause of showInts, showInt takes maxwidth as an argument, where maxwidth itself depends on the output of running showInt maxwidth on the whole list.
If, on the other hand, I do the naïve thing and replace the definition of maxwidth with foldl' max 0 $ map (snd . expensiveShow) [0..10^7], then maximum residency falls to just 44KiB. I would hope that performance like this would be achievable without workarounds like precomputing expensiveShow and then zipping it with the list [0..10^7].
I tried consuming the list strictly (using the foldl package), but this did not improve the situation.
I'm trying to have my cake and eat it too: exploiting laziness, while also making things strict enough that we don't build up a mountain of thunks. Is this possible to do? Or is there a better technique for accomplishing this?
You can't do it like this.
The problem is that your showInts has to traverse the list twice, first to find the longest number, second to print the numbers with the necessary format. That means the list has to be held in memory between the first and second passes. This isn't a problem with unevaluated thunks; it is simply that the whole list, completely evaluated, is being traversed twice.
The only solution is to generate the same list twice. In this case it is trivial; just have two [0..10^7] values, one for the maximum length and the second to format them. I suspect in your real application you are reading them from a file or something, in which case you need to read the file twice.
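As a rough sketch of the two-pass approach described above (not the asker's exact code): the width is computed from one enumeration and the formatting runs over a second one, so nothing requires the first list to stay alive while the second is consumed. The names `showPadded`, `showIntsUpTo` and `renderUpTo` are made up for illustration, and plain `show` stands in for the expensive function, which therefore runs twice per element.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.List (foldl')
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Builder as TB

-- Render one line, right-aligning the number in a field of the given width.
showPadded :: Int -> Int -> TB.Builder
showPadded width n =
  TB.fromText "This number: "
    <> TB.fromText (T.replicate (width - T.length text) " ")
    <> TB.fromText text
    <> TB.singleton '\n'
  where
    text = T.pack (show n)

-- Two independent traversals: one enumeration to find the widest number,
-- a second enumeration to format every line with that width.
showIntsUpTo :: Int -> TB.Builder
showIntsUpTo limit = foldMap (showPadded maxWidth) [0 .. limit]
  where
    maxWidth = foldl' max 0 (map (length . show) [0 .. limit])

-- Convenience wrapper (hypothetical) that produces the lazy Text.
renderUpTo :: Int -> TL.Text
renderUpTo = TB.toLazyText . showIntsUpTo
```

Whether this stays in small residency depends on the compiler not sharing the two enumerations; if an optimization pass merges them, the original problem would come back, so it is worth confirming with a heap profile.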
I'm trying to process some Point Cloud data with Haskell, and it seems to use a LOT of memory. The code I'm using is below, it basically parses the data into a format I can work with. The dataset has 440MB with 10M rows. When I run it with runhaskell, it uses up all the ram in a short time (~3-4gb) and then crashes. If I compile it with -O2 and run it, it goes to 100% cpu and takes a long time to finish (~3 minutes). I should mention that I'm using an i7 cpu with 4GB ram and an SSD, so there should be plenty of resources. How can I improve the performance of this?
{-# LANGUAGE OverloadedStrings #-}
import Prelude hiding (lines, readFile)
import Data.Text.Lazy (Text, splitOn, unpack, lines)
import Data.Text.Lazy.IO (readFile)
import Data.Maybe (fromJust)
import Text.Read (readMaybe)
filename :: FilePath
filename = "sample.txt"
readTextMaybe = readMaybe . unpack
data Classification = Classification
  { id :: Int, description :: Text
  } deriving (Show)

data Point = Point
  { x :: Int, y :: Int, z :: Int, classification :: Classification
  } deriving (Show)

type PointCloud = [Point]

maybeReadPoint :: Text -> Maybe Point
maybeReadPoint text = parse $ splitOn "," text
  where
    toMaybePoint :: Maybe Int -> Maybe Int -> Maybe Int -> Maybe Int -> Text -> Maybe Point
    toMaybePoint (Just x) (Just y) (Just z) (Just cid) cdesc = Just (Point x y z (Classification cid cdesc))
    toMaybePoint _ _ _ _ _ = Nothing
    parse :: [Text] -> Maybe Point
    parse [x, y, z, cid, cdesc] = toMaybePoint (readTextMaybe x) (readTextMaybe y) (readTextMaybe z) (readTextMaybe cid) cdesc
    parse _ = Nothing
readPointCloud :: Text -> PointCloud
readPointCloud = map (fromJust . maybeReadPoint) . lines
main = (readFile filename) >>= (putStrLn . show . sum . map x . readPointCloud)
The reason this uses all your memory when compiled without optimization is most likely because sum is defined using foldl. Without the strictness analysis that comes with optimization, that will blow up badly. You can try using this function instead:
import Data.List (foldl')

sum' :: Num n => [n] -> n
sum' = foldl' (+) 0
The reason this is slow when compiled with optimization seems likely related to the way you parse the input. A cons will be allocated for each character when reading in the input, and again when breaking the input into lines, and probably yet again when splitting on commas. Using a proper parsing library (any of them) will almost certainly help; using one of the streaming ones like pipes or conduit may or may not be best (I'm not sure).
Another issue, not related to performance: fromJust is rather poor form in general, and is a really bad idea when dealing with user input. You should instead mapM over the list in the Maybe monad, which will produce a Maybe [Point] for you.
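A minimal sketch of the last two suggestions combined, under some simplifying assumptions: it works on strict `Text` lines rather than the lazy `Text` in the question, it keeps only the three coordinates of each row, and the names `readInt`, `parseRow`, `parseRows` and `sumX` are hypothetical. Numbers are parsed with `Data.Text.Read` instead of unpacking to `String`, and `mapM` in the `Maybe` monad replaces `fromJust`.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.List (foldl')
import qualified Data.Text as T
import qualified Data.Text.Read as TR

-- Parse a whole field as a (possibly negative) Int without going through String.
readInt :: T.Text -> Maybe Int
readInt t = case TR.signed TR.decimal t of
  Right (n, rest) | T.null rest -> Just n
  _ -> Nothing

-- Simplified row: x, y, z are kept, the classification fields are ignored.
parseRow :: T.Text -> Maybe (Int, Int, Int)
parseRow line = case T.splitOn "," line of
  [a, b, c, _cid, _cdesc] -> (,,) <$> readInt a <*> readInt b <*> readInt c
  _ -> Nothing

-- Nothing if any row fails to parse, otherwise all parsed rows.
parseRows :: [T.Text] -> Maybe [(Int, Int, Int)]
parseRows = mapM parseRow

-- Strict left fold over the x coordinates.
sumX :: [(Int, Int, Int)] -> Int
sumX = foldl' (\acc (x, _, _) -> acc + x) 0
```

One trade-off to keep in mind: `mapM` in `Maybe` cannot return anything until every row has been checked, so the whole parsed list is held in memory. For a 10M-row file, a fold that handles a failed row as it goes may be the better fit.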
I have a simple script written in both Python and Haskell. It reads a file with 1,000,000 newline separated integers, parses that file into a list of integers, quick sorts it and then writes it to a different file sorted. This file has the same format as the unsorted one. Simple.
Here is Haskell:
quicksort :: Ord a => [a] -> [a]
quicksort [] = []
quicksort (p:xs) = (quicksort lesser) ++ [p] ++ (quicksort greater)
  where
    lesser = filter (< p) xs
    greater = filter (>= p) xs

main = do
  file <- readFile "data"
  let un = lines file
  let f = map (\x -> read x::Int ) un
  let done = quicksort f
  writeFile "sorted" (unlines (map show done))
And here is Python:
def qs(ar):
    if len(ar) == 0:
        return ar
    p = ar[0]
    return qs([i for i in ar if i < p]) + [p] + qs([i for i in ar if i > p])

def read_file(fn):
    f = open(fn)
    data = f.read()
    f.close()
    return data

def write_file(fn, data):
    f = open('sorted', 'w')
    f.write(data)
    f.close()

def main():
    data = read_file('data')
    lines = data.split('\n')
    lines = [int(l) for l in lines]
    done = qs(lines)
    done = [str(l) for l in done]
    write_file('sorted', "\n".join(done))

if __name__ == '__main__':
    main()
Very simple. Now I compile the Haskell code with
$ ghc -O2 --make quick.hs
And I time those two with:
$ time ./quick
$ time python qs.py
Results:
Haskell:
real 0m10.820s
user 0m10.656s
sys 0m0.154s
Python:
real 0m9.888s
user 0m9.669s
sys 0m0.203s
How can Python possibly be faster than native code Haskell?
Thanks
EDIT:
Python version: 2.7.1
GHC version: 7.0.4
Mac OSX, 10.7.3
2.4GHz Intel Core i5
List generated by
from random import shuffle
a = [str(a) for a in xrange(0, 1000*1000)]
shuffle(a)
s = "\n".join(a)
f = open('data', 'w')
f.write(s)
f.close()
So all numbers are unique.
The Original Haskell Code
There are two issues with the Haskell version:
You're using string IO, which builds linked lists of characters
You're using a non-quicksort that looks like quicksort.
This program takes 18.7 seconds to run on my Intel Core2 2.5 GHz laptop. (GHC 7.4 using -O2)
Daniel's ByteString Version
This is much improved, but notice it still uses the inefficient built-in merge sort.
His version takes 8.1 seconds (and doesn't handle negative numbers, but that's more of a non-issue for this exploration).
Note
From here on this answer uses the following packages: Vector, attoparsec, text and vector-algorithms. Also notice that kindall's version using timsort takes 2.8 seconds on my machine (edit: and 2 seconds using pypy).
A Text Version
I ripped off Daniel's version, translated it to Text (so it handles various encodings) and added better sorting using a mutable Vector in an ST monad:
import Data.Attoparsec.Text.Lazy
import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as TIO
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Algorithms.Intro as I
import Control.Applicative
import Control.Monad.ST
import System.Environment (getArgs)
parser = many (decimal <* char '\n')

main = do
  numbers <- TIO.readFile =<< fmap head getArgs
  case parse parser numbers of
    Done t r | T.null t -> writeFile "sorted" . unlines
                             . map show . vsort $ r
    x -> error $ Prelude.take 40 (show x)

vsort :: [Int] -> [Int]
vsort l = runST $ do
  let v = V.fromList l
  m <- V.unsafeThaw v
  I.sort m
  v' <- V.unsafeFreeze m
  return (V.toList v')
This runs in 4 seconds (and also doesn't handle negatives)
Return to the Bytestring
So now that we know we can make a more general program that's faster, what about making the ASCII-only version fast? No problem!
import qualified Data.ByteString.Lazy.Char8 as BS
import Data.Attoparsec.ByteString.Lazy (parse, Result(..))
import Data.Attoparsec.ByteString.Char8 (decimal, char)
import Control.Applicative ((<*), many)
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Algorithms.Intro as I
import Control.Monad.ST
parser = many (decimal <* char '\n')

main = do
  numbers <- BS.readFile "rands"
  case parse parser numbers of
    Done t r | BS.null t -> writeFile "sorted" . unlines
                              . map show . vsort $ r

vsort :: [Int] -> [Int]
vsort l = runST $ do
  let v = V.fromList l
  m <- V.unsafeThaw v
  I.sort m
  v' <- V.unsafeFreeze m
  return (V.toList v')
This runs in 2.3 seconds.
Producing a Test File
Just in case anyone's curious, my test file was produced by:
import Control.Monad.CryptoRandom
import Crypto.Random
main = do
  g <- newGenIO :: IO SystemRandom
  let rs = Prelude.take (2^20) (map abs (crandoms g) :: [Int])
  writeFile "rands" (unlines $ map show rs)
If you're wondering why vsort isn't packaged in some easier form on Hackage... so am I.
In short, don't use read. Replace read with a function like this:
import Numeric
fastRead :: String -> Int
fastRead s = case readDec s of [(n, "")] -> n
I get a pretty fair speedup:
~/programming% time ./test.slow
./test.slow 9.82s user 0.06s system 99% cpu 9.901 total
~/programming% time ./test.fast
./test.fast 6.99s user 0.05s system 99% cpu 7.064 total
~/programming% time ./test.bytestring
./test.bytestring 4.94s user 0.06s system 99% cpu 5.026 total
Just for fun, the above results include a version that uses ByteString (and hence fails the "ready for the 21st century" test by totally ignoring the problem of file encodings) for ULTIMATE BARE-METAL SPEED. It also has a few other differences; for example, it ships out to the standard library's sort function. The full code is below.
import qualified Data.ByteString as BS
import Data.Attoparsec.ByteString.Char8
import Control.Applicative
import Data.List
parser = many (decimal <* char '\n')

reallyParse p bs = case parse p bs of
  Partial f -> f BS.empty
  v -> v

main = do
  numbers <- BS.readFile "data"
  case reallyParse parser numbers of
    Done t r | BS.null t -> writeFile "sorted" . unlines . map show . sort $ r
More a Pythonista than a Haskellite, but I'll take a stab:
There's a fair bit of overhead in your measured runtime just reading and writing the files, which is probably pretty similar between the two programs. Also, be careful that you've warmed up the cache for both programs.
Most of your time is spent making copies of lists and fragments of lists. Python list operations are heavily optimized, being one of the most-frequently used parts of the language, and list comprehensions are usually pretty performant too, spending much of their time in C-land inside the Python interpreter. There is not a lot of the stuff that is slowish in Python but wicked fast in static languages, such as attribute lookups on object instances.
Your Python implementation throws away numbers that are equal to the pivot, so by the end it may be sorting fewer items, giving it an obvious advantage. (If there are no duplicates in the data set you're sorting, this isn't an issue.) Fixing this bug probably requires making another copy of most of the list in each call to qs(), which would slow Python down a little more.
You don't mention what version of Python you're using. If you're using 2.x, you could probably get Haskell to beat Python just by switching to Python 3.x. :-)
I'm not too surprised the two languages are basically neck-and-neck here (a 10% difference is not noteworthy). Using C as a performance benchmark, Haskell loses some performance for its lazy functional nature, while Python loses some performance due to being an interpreted language. A decent match.
Since Daniel Wagner posted an optimized Haskell version using the built-in sort, here's a similarly optimized Python version using list.sort():
mylist = [int(x.strip()) for x in open("data")]
mylist.sort()
open("sorted", "w").write("\n".join(str(x) for x in mylist))
3.5 seconds on my machine, vs. about 9 for the original code. Pretty much still neck-and-neck with the optimized Haskell. Reason: it's spending most of its time in C-programmed libraries. Also, TimSort (the sort used in Python) is a beast.
This is after the fact, but I think most of the trouble is in the Haskell writing. The following module is pretty primitive -- one should use builders probably and certainly avoid the ridiculous roundtrip via String for showing -- but it is simple and did distinctly better than pypy with kindall's improved python and better than the 2 and 4 sec Haskell modules elsewhere on this page (it surprised me how much they were using lists, so I made a couple more turns of the crank.)
$ time aa.hs real 0m0.709s
$ time pypy aa.py real 0m1.818s
$ time python aa.py real 0m3.103s
I'm using the sort recommended for unboxed vectors from vector-algorithms. The use of Data.Vector.Unboxed in some form is clearly now the standard, naive way of doing this sort of thing -- it's the new Data.List (for Int, Double, etc.) Everything but the sort is irritating IO management, which could I think still be massively improved, on the write end in particular. The reading and sorting together take about 0.2 sec as you can see from asking it to print what's at a bunch of indexes instead of writing to file, so twice as much time is spent writing as in anything else. If the pypy is spending most of its time using timsort or whatever, then it looks like the sorting itself is surely massively better in Haskell, and just as simple -- if you can just get your hands on the darned vector...
I'm not sure why there aren't convenient functions around for reading and writing vectors of unboxed things from natural formats -- if there were, this would be three lines long and would avoid String and be much faster, but maybe I just haven't seen them.
import qualified Data.ByteString.Lazy.Char8 as BL
import qualified Data.ByteString.Char8 as B
import qualified Data.Vector.Unboxed.Mutable as M
import qualified Data.Vector.Unboxed as V
import Data.Vector.Algorithms.Radix
import System.IO
main = do
  unsorted <- fmap toInts (BL.readFile "data")
  vec <- V.thaw unsorted
  sorted <- sort vec >> V.freeze vec
  withFile "sorted" WriteMode $ \handle ->
    V.mapM_ (writeLine handle) sorted

writeLine :: Handle -> Int -> IO ()
writeLine h int = B.hPut h $ B.pack (show int ++ "\n")

toInts :: BL.ByteString -> V.Vector Int
toInts bs = V.unfoldr oneInt (BL.cons ' ' bs)

oneInt :: BL.ByteString -> Maybe (Int, BL.ByteString)
oneInt bs = if BL.null bs then Nothing else
  let bstail = BL.tail bs
  in if BL.null bstail then Nothing else BL.readInt bstail
To follow up on @kindall's interesting answer: those timings depend on the Python/Haskell implementation you use, the hardware configuration on which you run the tests, and the algorithm implementation you write in both languages.
Nevertheless we can try to get some good hints of the relative performance of one language implementation compared to another, or of one language compared to another. Well-known algorithms like quicksort are a good place to start.
To illustrate a python/python comparison, I just tested your script on CPython 2.7.3 and PyPy 1.8 on the same machine:
CPython: ~8s
PyPy: ~2.5s
This shows there can be room for improvement in the language implementation; maybe compiled Haskell is not doing the best possible job of compiling your code. If you are after speed in Python, also consider switching to PyPy, if the rest of your code permits it.
i noticed some problem everybody else didn't notice for some reason; both your haskell and python code have this. (please tell me if it's fixed in the auto-optimizations, I know nothing about optimizations). for this I will demonstrate in haskell.
in your code you define the lesser and greater lists like this:
where lesser = filter (<p) xs
greater = filter (>=p) xs
this is bad, because you compare each element in xs with p twice, once for getting into the lesser list and again for getting into the greater list. this (theoretically; I haven't checked timing) makes your sort use twice as many comparisons; this is a disaster. instead, you should make a function which splits a list into two lists using a predicate, in such a way that
split f xs
is equivalent to
(filter f xs, filter (not.f) xs)
using this kind of function you will only need to compare each element in the list once to know in which side of the tuple to put it.
okay, lets do it:
where
  split :: (a -> Bool) -> [a] -> ([a], [a])
  split _ [] = ([],[])
  split f (x:xs)
    | f x = let (a,b) = split f xs in (x:a,b)
    | otherwise = let (a,b) = split f xs in (a,x:b)
now lets replace the lesser/greater generator with
let (lesser, greater) = split (p>) xs in (insert function here)
full code:
quicksort :: Ord a => [a] -> [a]
quicksort [] = []
quicksort (p:xs) =
  let (lesser, greater) = splitf (p>) xs
  in (quicksort lesser) ++ [p] ++ (quicksort greater)
  where
    splitf :: (a -> Bool) -> [a] -> ([a], [a])
    splitf _ [] = ([],[])
    splitf f (x:xs)
      | f x = let (a,b) = splitf f xs in (x:a,b)
      | otherwise = let (a,b) = splitf f xs in (a,x:b)
for some reason I can't write the greater/lesser part in the where clauses so I had to write it in let clauses.
also, if it is not tail-recursive let me know and fix it for me (I don't know yet how tail recursion fully works)
now you should do the same for the python code. I don't know python so I can't do it for you.
EDIT:
there actually happens to already be such function in Data.List called partition. note this proves the need for this kind of function because otherwise it wouldn't be defined.
this shrinks the code to:
import Data.List (partition)

quicksort :: Ord a => [a] -> [a]
quicksort [] = []
quicksort (p:xs) =
  let (lesser, greater) = partition (p>) xs
  in (quicksort lesser) ++ [p] ++ (quicksort greater)
Python is really optimized for this sort of thing. I suspect that Haskell isn't. Here's a similar question that provides some very good answers.
I'm trying to make a conduit that's sort of a cross between takeWhile and isolate. That is, it will consume from the input and yield to the output until either the predicate no longer holds or it has reached the byte limit. I know the type signature will be
isolateWhile :: (Monad m) => Int -> (Word8 -> Bool) -> Conduit ByteString m ByteString
As an example of its use:
{-# LANGUAGE OverloadedStrings #-}
import Data.Conduit
import qualified Data.Conduit.List as CL
import qualified Data.Conduit.Binary as CB
import Control.Monad.Trans.Class
charToWord = fromIntegral . fromEnum
example :: Int -> Char -> IO ()
example limit upTo = do
  untaken <- CB.sourceLbs "Hello, world!" $= conduit $$ CB.sinkLbs
  putStrLn $ "Left " ++ show untaken
  where
    conduit = do
      taken <- toConsumer $ isolateWhile limit (/= charToWord upTo) =$ CB.sinkLbs
      lift $ putStrLn $ "Took " ++ show taken
      CL.map id -- pass the rest through untouched
I expect that
ghci> example 5 'l'
Took "He"
Left "llo, world!"
ghci> example 5 'w'
Took "Hello"
Left ", world!"
However, the simplest possible definition of isolateWhile:
isolateWhile limit pred = CB.isolate limit =$= CB.takeWhile pred
yields
ghci> example 5 'l'
Took "He"
Left ", world!"
ghci> example 5 'w'
Took "Hello"
Left ", world!"
In other words, isolate will eat up the entire Hello, leaving He to takeWhile and discarding the llo. This data loss is undesirable for my application. However, it is notable that the second case yields the expected result.
If I swap the operands of =$= like so:
isolateWhile limit pred = CB.takeWhile pred =$= CB.isolate limit
Then
ghci> example 5 'l'
Took "He"
Left ", world!"
ghci> example 5 'w'
Took "Hello"
Left ""
Now I've fixed the first test, but broken the second one! This time, takeWhile will take whatever it needs and isolate will take a subset of that; but whatever takeWhile uses that isolate doesn't will be discarded, and this is undesirable.
Lastly, I tried:
isolateWhile limit pred = do
  untaken <- CB.isolate limit =$= (CB.takeWhile pred >> CL.consume)
  mapM_ leftover $ reverse untaken
This actually works! Whatever isolate accepts and takeWhile doesn't is consumed by the CL.consume and placed back into the stream with leftover. Unfortunately, this seems like a horrible kludge, and undesirably (although not unusably so) it will buffer up to limit bytes in memory only to put it back with leftover. That seems like a waste.
The only solution I can think of is to write it in terms of the primitives await, yield and leftover as takeWhile and isolate are themselves written. While this would solve all the problems without wasting much, it seems like there must be a better way.
Am I missing something, or is there really no better way to write this?
There's a known limitation in the current version of conduit: fusion always discards downstream leftovers, which is exactly what you're running into here. There are some discussions right now about an architecture to resolve this, but for the moment, writing your function in terms of the primitives is likely your best option.
In GHCI, I run this simple test:
encodeFile "test" [0..10000000]
The line runs really quickly (<10sec), but my memory usage shoots up to ~500MB before it finishes. Shouldn't encodeFile be lazy since it uses ByteString.Lazy?
Edit: Roman's answer below is great! I also want to point out this answer to another question, that explains why Data.Binary does strict encoding on lists and provides a slightly more elegant work around.
Here's how serialization of lists is defined:
instance Binary a => Binary [a] where
put l = put (length l) >> mapM_ put l
That is, first serialize the length of the list, then serialize the list itself.
In order to find out the length of the list, we need to evaluate the whole list.
But we cannot garbage-collect it, because its elements are needed for the second
part, mapM_ put l. So the whole list has to be stored in memory after the
length is evaluated and before the elements serialization starts.
Here's what the heap profile looks like:
Notice how it grows while the list is being built to compute its length, and
then decreases while the elements are serialized and can be collected by the GC.
So, how to fix this? In your example, you already know the length. So you
can write a function which takes the known length, as opposed to computing it:
import Data.Binary
import Data.ByteString.Lazy as L
import qualified Data.ByteString as B
import Data.Binary.Put
main = do
  let len = 10000001 :: Int
      bs = encodeWithLength len [0..len-1]
  L.writeFile "test" bs

putWithLength :: Binary a => Int -> [a] -> Put
putWithLength len list =
  put len >> mapM_ put list

encodeWithLength :: Binary a => Int -> [a] -> ByteString
encodeWithLength len list = runPut $ putWithLength len list
This program runs within 53k of heap space.
You can also include a safety feature into putWithLength: compute the length while serializing the list, and check with the first argument in the end. If there's a mismatch, throw an error.
Exercise: why do you still need to pass in the length to putWithLength instead of using the computed value as described above?
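The safety feature mentioned above could be sketched roughly as follows. This is one possible shape, not part of Data.Binary: `putWithLengthChecked` and `encodeWithLengthChecked` are hypothetical names, and the check is a plain `error` call. The claimed length is written first, the elements are counted as they are serialized, and the count is compared with the claim only once the list is exhausted.

```haskell
import Data.Binary (Binary, encode, put)
import Data.Binary.Put (Put, runPut)
import qualified Data.ByteString.Lazy as L

-- Write the claimed length, then each element, keeping a strict running count.
-- The mismatch check happens only after the last element has been written.
putWithLengthChecked :: Binary a => Int -> [a] -> Put
putWithLengthChecked len list = put len >> go 0 list
  where
    go n []
      | n == len = return ()
      | otherwise =
          error ("putWithLengthChecked: claimed " ++ show len
                 ++ " elements, saw " ++ show n)
    go n (x : xs) = n `seq` (put x >> go (n + 1) xs)

-- Lazy ByteString version, mirroring encodeWithLength from the answer.
encodeWithLengthChecked :: Binary a => Int -> [a] -> L.ByteString
encodeWithLengthChecked len = runPut . putWithLengthChecked len
```

Because the output is a lazy ByteString, a mismatch would only surface when the end of the output is forced, by which point the earlier bytes (including the wrong length) may already have been written out.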
Greetings,
I'm trying to understand why I'm seeing the entire file loaded into memory with the following program, yet if you comment out the line below "(***)" then the program runs in constant (about 1.5M) space.
EDIT: The file is about 660MB, the field in column 26 is a date string like '2009-10-01', and there are one million lines. The process uses about 810MB by the time it hits the 'getLine'
Am I right in thinking it's related to the splitting of the string using 'split', and that somehow the underlying ByteString that has been read from the file can't be garbage-collected because it's still referenced? But if so, then I thought BS.copy would work around that. Any ideas how to force the computation - I can't seem to get 'seq' into the right place to have an effect.
(NB the source file is tab-separated lines)
Thanks in advance,
Kevin
module Main where
import System.IO
import qualified Data.ByteString.Lazy.Char8 as BS
import Control.Monad
type Record = BS.ByteString
importRecords :: String -> IO [Record]
importRecords filename = do
  liftM (map importRecord.BS.lines) (BS.readFile filename)

importRecord :: BS.ByteString -> Record
importRecord txt = r
  where
    r = getField 26
    getField f = BS.copy $ ((BS.split '\t' txt) !! f)

loopInput :: [Record] -> IO ()
loopInput jrs = do
  putStrLn $ "Done" ++ (show $ last jrs)
  hFlush stdout
  x <- getLine
  return ()
  -- (***)
  loopInput jrs

main = do
  jrs <- importRecords "c:\\downloads\\lcg1m.txt"
  loopInput jrs
Your call to last forces the list, jrs. To figure that out it must run through the entire file building up thunks for each entry in jrs. Because you aren't evaluating each element in jrs (except the last one) these thunks hang out with references to the bytestring, so that must stay in memory.
The solution is to force the evaluation of those thunks. Because we're talking about space the first thing I did was actually to store your info in a smaller format:
type Year = Word16
type Month = Word8
type Day = Word8
data Record = Rec {-# UNPACK #-} !Year {-# UNPACK #-} !Month {-# UNPACK #-} !Day
deriving (Eq, Ord, Show, Read)
This reduces that ugly 10 byte Bytestring (+ overhead of ~16 bytes of structure information) to around 8 bytes.
importRecord now has to call toRecord r to get the right type:
toRecord :: BS.ByteString -> Record
toRecord bs =
  case BS.splitWith (== '-') bs of
    (y:m:d:[]) -> Rec (rup y) (rup m) (rup d)
    _ -> Rec 0 0 0

rup :: (Read a) => BS.ByteString -> a
rup = read . BS.unpack
We'll need to evaluate the data when we convert from ByteString to Record, so let's use the parallel package and define an NFData instance (the class comes from deepseq).
instance NFData Record where
rnf (Rec y m d) = y `seq` m `seq` d `seq` ()
Now we're ready to go, I modified main to use evalList, thus forcing the whole list before your function that wants the last one:
main = do
  jrs <- importRecords "./tabLines"
  let jrs' = using jrs (evalList rdeepseq)
  loopInput jrs'
And we can see the heap profile looks beautiful (and top agrees, the program uses very little memory).
Sorry about that other misleading wrong answer - I was hooked on the fact that incremental processing fixes it and didn't really realize the thunks really were hanging around, not sure why my brain glided over that. Though I do stand by the gist, you should incrementally process this information making all of this answer moot.
FYI the huge bytestring didn't show up in those previous heap profiles I posted because foreign allocations (which includes ByteString) aren't tracked by the heap profiler.
There seem to be two questions here:
why does the memory usage depend on the presence or absence of the line (***);
why is the memory usage with (***) present about 800MB, rather than, say, 40MB.
I don't really know what to say about the first one that TomMD didn't already say; inside the loopInput loop, jrs can never be freed, because it's needed as an argument to the recursive call of loopInput. (You know that return () doesn't do anything when (***) is present, right?)
As for the second question, I think you are right that the input ByteString isn't being garbage collected. The reason is that you never evaluate the elements of your list jrs besides the last one, so they still contain references to the original ByteString (even though they are of the form BS.copy ...). I would think that replacing show $ last jrs with show jrs would reduce your memory usage; does it? Alternatively, you could try a stricter map, like
map' f [] = []
map' f (x:xs) = ((:) $! (f $! x)) (map' f xs)
Replace the map in importRecords with map' and see whether that reduces your memory usage.