I'm trying to process some Point Cloud data with Haskell, and it seems to use a LOT of memory. The code below basically parses the data into a format I can work with. The dataset is 440MB with 10M rows. When I run it with runhaskell, it uses up all the RAM (~3-4GB) in a short time and then crashes. If I compile it with -O2 and run it, it pegs the CPU at 100% and takes a long time to finish (~3 minutes). I should mention that I'm using an i7 CPU with 4GB RAM and an SSD, so there should be plenty of resources. How can I improve the performance of this?
{-# LANGUAGE OverloadedStrings #-}
import Prelude hiding (lines, readFile)
import Data.Text.Lazy (Text, splitOn, unpack, lines)
import Data.Text.Lazy.IO (readFile)
import Data.Maybe (fromJust)
import Text.Read (readMaybe)
filename :: FilePath
filename = "sample.txt"
readTextMaybe = readMaybe . unpack
data Classification = Classification
{ id :: Int, description :: Text
} deriving (Show)
data Point = Point
{ x :: Int, y :: Int, z :: Int, classification :: Classification
} deriving (Show)
type PointCloud = [Point]
maybeReadPoint :: Text -> Maybe Point
maybeReadPoint text = parse $ splitOn "," text
where toMaybePoint :: Maybe Int -> Maybe Int -> Maybe Int -> Maybe Int -> Text -> Maybe Point
toMaybePoint (Just x) (Just y) (Just z) (Just cid) cdesc = Just (Point x y z (Classification cid cdesc))
toMaybePoint _ _ _ _ _ = Nothing
parse :: [Text] -> Maybe Point
parse [x, y, z, cid, cdesc] = toMaybePoint (readTextMaybe x) (readTextMaybe y) (readTextMaybe z) (readTextMaybe cid) cdesc
parse _ = Nothing
readPointCloud :: Text -> PointCloud
readPointCloud = map (fromJust . maybeReadPoint) . lines
main = (readFile filename) >>= (putStrLn . show . sum . map x . readPointCloud)
The reason this uses all your memory when compiled without optimization is most likely that sum is defined using foldl. Without the strictness analysis that comes with optimization, that will blow up badly. You can try using this function instead:
import Data.List (foldl')

sum' :: Num n => [n] -> n
sum' = foldl' (+) 0
The reason this is slow when compiled with optimization seems likely related to the way you parse the input. A cons will be allocated for each character when reading in the input, and again when breaking the input into lines, and probably yet again when splitting on commas. Using a proper parsing library (any of them) will almost certainly help; using one of the streaming ones like pipes or conduit may or may not be best (I'm not sure).
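For example, a minimal sketch with attoparsec (hypothetical and untested; note the description field becomes strict Text here, since that's what takeTill returns):
import Control.Applicative ((<$>), (<*>), (<*))
import Data.Attoparsec.Text
import Data.Text (Text)

data Classification = Classification Int Text deriving (Show)
data Point = Point Int Int Int Classification deriving (Show)

-- decimal parses the numeric fields in place, with no per-field
-- Text copies and no reliance on Read
pointParser :: Parser Point
pointParser =
  Point <$> (signed decimal <* char ',')
        <*> (signed decimal <* char ',')
        <*> (signed decimal <* char ',')
        <*> (Classification <$> (signed decimal <* char ',')
                            <*> takeTill isEndOfLine)

pointCloudParser :: Parser [Point]
pointCloudParser = pointParser `sepBy` endOfLine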
Another issue, not related to performance: fromJust is rather poor form in general, and is a really bad idea when dealing with user input. You should instead mapM over the list in the Maybe monad, which will produce a Maybe [Point] for you.
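For example, a drop-in replacement for the question's readPointCloud (note this makes the parse all-or-nothing: one bad line yields Nothing for the whole cloud):
readPointCloud :: Text -> Maybe PointCloud
readPointCloud = mapM maybeReadPoint . lines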
Related
In an ongoing endeavour to efficiently fiddle with bits (e.g. see this SO question) the newest challenge is the efficient streaming and consumption of bits.
As a first simple task I chose to find the longest sequence of identical bits in a bitstream generated by /dev/urandom. A typical incantation would be head -c 1000000 </dev/urandom | my-exe. The actual goal is to stream bits and decode an Elias gamma code, for example, i.e. codes that are not chunks of bytes or multiples thereof.
For such codes of variable length it is nice to have the take, takeWhile, group, etc. language for list manipulation. Since a BitStream.take would actually consume part of the bitstream, some monad would probably come into play.
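To make the goal concrete, here is what decoding a single Elias gamma code could look like in the plain-list vocabulary (a sketch that ignores performance entirely):
-- Elias gamma: a value v >= 1 is encoded as n zeros followed by the
-- (n+1)-bit binary representation of v, where n = floor (log2 v).
-- Decoding: count the leading zeros, then read n+1 bits as a number.
decodeGamma :: [Bool] -> Maybe (Integer, [Bool])
decodeGamma bits =
  let (zeros, rest) = span not bits
      n = length zeros
      (digits, rest') = splitAt (n + 1) rest
  in if length digits == n + 1
       then Just (foldl (\acc b -> 2 * acc + (if b then 1 else 0)) 0 digits, rest')
       else Nothing -- ran out of input in the middle of a code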
The obvious starting point is the lazy bytestring from Data.ByteString.Lazy.
A. Counting bytes
This very simple Haskell program performs on par with a C program, as is to be expected.
import qualified Data.ByteString.Lazy as BSL
main :: IO ()
main = do
bs <- BSL.getContents
print $ BSL.length bs
B. Adding bytes
Once I start using unpack, things should get worse.
import qualified Data.ByteString.Lazy as BSL

main :: IO ()
main = do
  bs <- BSL.getContents
  print $ sum $ BSL.unpack bs
Surprisingly, Haskell and C show almost the same performance.
C. Longest sequence of identical bits
As a first nontrivial task the longest sequence of identical bits can be found like this:
module Main where
import Data.Bits (shiftR, (.&.))
import qualified Data.ByteString.Lazy as BSL
import Data.List (group)
import Data.Word8 (Word8)
splitByte :: Word8 -> [Bool]
splitByte w = Prelude.map (\i-> (w `shiftR` i) .&. 1 == 1) [0..7]
bitStream :: BSL.ByteString -> [Bool]
bitStream bs = concat $ map splitByte (BSL.unpack bs)
main :: IO ()
main = do
bs <- BSL.getContents
print $ maximum $ length <$> (group $ bitStream bs)
The lazy bytestring is converted to a list [Word8] and then, using shifts, each Word8 is split into its bits, resulting in a list of [Bool] lists, which is then flattened with concat. Having obtained a (lazy) list of Bool, use group to split the list into sequences of identical bits and then map length over it. Finally maximum gives the desired result. Quite simple, but not very fast:
# C
real 0m0.606s
# Haskell
real 0m6.062s
This naive implementation is exactly one order of magnitude slower.
Profiling shows that quite a lot of memory gets allocated (about 3GB for parsing 1MB of input). There is no massive space leak to be observed, though.
From here I start poking around:
There is a bitstream package that promises "Fast, packed, strict bit streams (i.e. list of Bools) with semi-automatic stream fusion.". Unfortunately it is not up-to-date with the current vector package; see here for details.
Next, I investigate streaming. I don't quite see why I should need 'effectful' streaming that brings some monad into play - at least until I start with the reverse of the posed task, i.e. encoding and writing bitstreams to file.
How about just folding over the ByteString? I'd have to introduce state to keep track of consumed bits. That's not quite the nice take, takeWhile, group, etc. language that is desirable.
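For what it's worth, such a fold with explicit state might look roughly like this (a sketch only; the bit order matches splitByte above):
{-# LANGUAGE BangPatterns #-}
import Data.Bits (testBit)
import qualified Data.ByteString.Lazy as BSL
import Data.List (foldl')

-- state is (previous bit, current run length, longest run so far);
-- the bang patterns keep the counters evaluated at each step
longestRunFold :: BSL.ByteString -> Int
longestRunFold = result . BSL.foldl' goByte (False, 0, 0)
  where
    goByte st w = foldl' (\s i -> goBit s (testBit w i)) st [0 .. 7]
    goBit (!prev, !run, !longest) b =
      let run' = if b == prev then run + 1 else 1
      in (b, run', max longest run')
    result (_, _, longest) = longest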
And now I'm not quite sure where to go.
Update:
I figured out how to do this with streaming and streaming-bytestring. I'm probably not doing this right because the result is catastrophically bad.
import Data.Bits (shiftR, (.&.))
import qualified Data.ByteString.Streaming as BSS
import Data.Word8 (Word8)
import qualified Streaming as S
import Streaming.Prelude (Of, Stream)
import qualified Streaming.Prelude as S
splitByte :: Word8 -> [Bool]
splitByte w = (\i-> (w `shiftR` i) .&. 1 == 1) <$> [0..7]
bitStream :: Monad m => Stream (Of Word8) m () -> Stream (Of Bool) m ()
bitStream s = S.concat $ S.map splitByte s
main :: IO ()
main = do
let bs = BSS.unpack BSS.getContents :: Stream (Of Word8) IO ()
gs = S.group $ bitStream bs :: Stream (Stream (Of Bool) IO) IO ()
maxLen <- S.maximum $ S.mapped S.length gs
print $ S.fst' maxLen
This will test your patience with anything beyond a few thousand bytes of input from stdin. The profiler says it spends an insane amount of time (quadratic in the input size) in Streaming.Internal.>>=.loop and Data.Functor.Of.fmap. I'm not quite sure what the first one is, but the fmap indicates (?) that the juggling of these Of a b isn't doing us any good and because we're in the IO monad it can't be optimised away.
I also have the streaming equivalent of the byte adder here: SumBytesStream.hs, which is slightly slower than the simple lazy ByteString implementation, but still decent. Since streaming-bytestring is proclaimed to be "bytestring io done right" I expected better. I'm probably not doing it right, then.
In any case, all these bit-computations shouldn't be happening in the IO monad. But BSS.getContents forces me into the IO monad because getContents :: MonadIO m => ByteString m () and there's no way out.
Update 2
Following the advice of #dfeuer I used the streaming package at master#HEAD. Here's the result.
longest-seq-c 0m0.747s (C)
longest-seq 0m8.190s (Haskell ByteString)
longest-seq-stream 0m13.946s (Haskell streaming-bytestring)
The O(n^2) problem of Streaming.concat is solved, but we're still not getting closer to the C benchmark.
Update 3
Cirdec's solution performs on par with C. The construct that is used is called "Church encoded lists"; see this SO answer or the Haskell Wiki on rank-N types.
Source files:
All the source files can be found on github. The Makefile has all the various targets to run the experiments and the profiling. The default make will just build everything (create a bin/ directory first!) and then make time will do the timing on the longest-seq executables. The C executables get a -c appended to distinguish them.
Intermediate allocations and their corresponding overhead can be removed when operations on streams fuse together. The GHC prelude provides foldr/build fusion for lazy streams in the form of rewrite rules. The general idea is if one function produces a result that looks like a foldr (it has the type (a -> b -> b) -> b -> b applied to (:) and []) and another function consumes a list that looks like a foldr, constructing the intermediate list can be removed.
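A tiny illustration of the idea, using build from GHC.Exts (the rewrite rule foldr k z (build g) = g k z is what removes the intermediate list):
import GHC.Exts (build)

-- map written as a "good producer": the output list exists only via
-- the c/n arguments, so a foldr consumer can fuse with it
mapB :: (a -> b) -> [a] -> [b]
mapB f xs = build (\c n -> foldr (c . f) n xs)

-- e.g. foldr (+) 0 (mapB (+ 1) [1 .. 10 :: Int]) can rewrite to a
-- single loop that never allocates the mapped list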
For your problem I'm going to build something similar, but using strict left folds (foldl') instead of foldr. Instead of using rewrite rules that try to detect when something looks like a foldl, I'll use a data type that forces lists to look like left folds.
-- A list encoded as a strict left fold.
-- (the forall here needs {-# LANGUAGE RankNTypes #-})
newtype ListS a = ListS {build :: forall b. (b -> a -> b) -> b -> b}
Since I've started by abandoning lists we'll be re-implementing part of the prelude for lists.
Strict left folds can be created from the foldl' functions of both lists and bytestrings.
{-# INLINE fromList #-}
fromList :: [a] -> ListS a
fromList l = ListS (\c z -> foldl' c z l)
{-# INLINE fromBS #-}
fromBS :: BSL.ByteString -> ListS Word8
fromBS l = ListS (\c z -> BSL.foldl' c z l)
The simplest example of using one is to find the length of a list.
{-# INLINE length' #-}
length' :: ListS a -> Int
length' l = build l (\z a -> z+1) 0
We can also map and concatenate left folds.
{-# INLINE map' #-}
-- fmap renamed so it can be inlined
map' :: (a -> b) -> ListS a -> ListS b
map' f l = ListS (\c z -> build l (\z a -> c z (f a)) z)
{-# INLINE concat' #-}
concat' :: ListS (ListS a) -> ListS a
concat' ll = ListS (\c z -> build ll (\z l -> build l c z) z)
For your problem we need to be able to split a word into bits.
{-# INLINE splitByte #-}
splitByte :: Word8 -> [Bool]
splitByte w = Prelude.map (\i-> (w `shiftR` i) .&. 1 == 1) [0..7]
{-# INLINE splitByte' #-}
splitByte' :: Word8 -> ListS Bool
splitByte' = fromList . splitByte
And a ByteString into bits
{-# INLINE bitStream' #-}
bitStream' :: BSL.ByteString -> ListS Bool
bitStream' = concat' . map' splitByte' . fromBS
To find the longest run we'll keep track of the previous value, the length of the current run, and the length of the longest run. We make the fields strict so that the strictness of the fold prevents chains of thunks from being accumulated in memory. Making a strict data type for a state is an easy way to get control over both its memory representation and when its fields are evaluated.
data LongestRun = LongestRun !Bool !Int !Int
{-# INLINE extendRun #-}
extendRun :: LongestRun -> Bool -> LongestRun
extendRun (LongestRun previous run longest) x = LongestRun x current (max current longest)
where
current = if x == previous then run + 1 else 1
{-# INLINE longestRun #-}
longestRun :: ListS Bool -> Int
longestRun l = longest
where
(LongestRun _ _ longest) = build l extendRun (LongestRun False 0 0)
And we're done
main :: IO ()
main = do
bs <- BSL.getContents
print $ longestRun $ bitStream' bs
This is much faster, but not quite the performance of C.
longest-seq-c 0m00.12s (C)
longest-seq 0m08.65s (Haskell ByteString)
longest-seq-fuse 0m00.81s (Haskell ByteString fused)
The program allocates about 1 MB to read 1,000,000 bytes from input.
total alloc = 1,173,104 bytes (excludes profiling overheads)
Updated github code
I found another solution that is on par with C. Data.Vector.Fusion.Stream.Monadic has a stream implementation based on the Coutts, Leshchinskiy, Stewart 2007 paper. The idea behind it is to use destroy/unfoldr stream fusion.
Recall that list's unfoldr :: (b -> Maybe (a, b)) -> b -> [a] creates a list by repeatedly applying (unfolding) a step-forward function, starting with an initial value. A Stream is just an unfoldr function with starting state. (The Data.Vector.Fusion.Stream.Monadic library uses GADTs to create constructors for Step that can be pattern-matched conveniently. It could just as well be done without GADTs, I think.)
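As a quick reminder of how unfoldr behaves:
import Data.List (unfoldr)

-- each step yields an element and the next seed; Nothing ends the list
countdown :: Int -> [Int]
countdown = unfoldr (\n -> if n <= 0 then Nothing else Just (n, n - 1))

-- countdown 5 == [5,4,3,2,1]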
The central piece of the solution is the mkBitstream :: BSL.ByteString -> Stream Bool function that turns a ByteString into a stream of Bool. Basically, we keep track of the current ByteString, the current byte, and how much of the current byte is still unconsumed. Whenever a byte is used up, another byte is chopped off the ByteString. When nothing is left, the stream is Done.
The longestRun function is taken straight from #Cirdec's solution.
Here's the etude:
{-# LANGUAGE CPP #-}
#define PHASE_FUSED [1]
#define PHASE_INNER [0]
#define INLINE_FUSED INLINE PHASE_FUSED
#define INLINE_INNER INLINE PHASE_INNER
module Main where
import Control.Monad.Identity (Identity)
import Data.Bits (shiftR, (.&.))
import qualified Data.ByteString.Lazy as BSL
import Data.Functor.Identity (runIdentity)
import qualified Data.Vector.Fusion.Stream.Monadic as S
import Data.Word8 (Word8)
type Stream a = S.Stream Identity a -- no need for any monad, really
data Step = Step BSL.ByteString !Word8 !Word8 -- could use tuples, but this is faster
mkBitstream :: BSL.ByteString -> Stream Bool
mkBitstream bs' = S.Stream step (Step bs' 0 0) where
{-# INLINE_INNER step #-}
step (Step bs w n) | n==0 = case (BSL.uncons bs) of
Nothing -> return S.Done
Just (w', bs') -> return $
S.Yield (w' .&. 1 == 1) (Step bs' (w' `shiftR` 1) 7)
| otherwise = return $
S.Yield (w .&. 1 == 1) (Step bs (w `shiftR` 1) (n-1))
data LongestRun = LongestRun !Bool !Int !Int
{-# INLINE extendRun #-}
extendRun :: LongestRun -> Bool -> LongestRun
extendRun (LongestRun previous run longest) x = LongestRun x current (max current longest)
where current = if x == previous then run + 1 else 1
{-# INLINE longestRun #-}
longestRun :: Stream Bool -> Int
longestRun s = runIdentity $ do
(LongestRun _ _ longest) <- S.foldl' extendRun (LongestRun False 0 0) s
return longest
main :: IO ()
main = do
bs <- BSL.getContents
print $ longestRun (mkBitstream bs)
I have a simple script written in both Python and Haskell. It reads a file with 1,000,000 newline-separated integers, parses it into a list of integers, quicksorts it and then writes the result to a different file, sorted. The output file has the same format as the unsorted one. Simple.
Here is Haskell:
quicksort :: Ord a => [a] -> [a]
quicksort [] = []
quicksort (p:xs) = (quicksort lesser) ++ [p] ++ (quicksort greater)
where
lesser = filter (< p) xs
greater = filter (>= p) xs
main = do
file <- readFile "data"
let un = lines file
let f = map (\x -> read x::Int ) un
let done = quicksort f
writeFile "sorted" (unlines (map show done))
And here is Python:
def qs(ar):
if len(ar) == 0:
return ar
p = ar[0]
return qs([i for i in ar if i < p]) + [p] + qs([i for i in ar if i > p])
def read_file(fn):
f = open(fn)
data = f.read()
f.close()
return data
def write_file(fn, data):
f = open(fn, 'w')
f.write(data)
f.close()
def main():
data = read_file('data')
lines = data.split('\n')
lines = [int(l) for l in lines]
done = qs(lines)
done = [str(l) for l in done]
write_file('sorted', "\n".join(done))
if __name__ == '__main__':
main()
Very simple. Now I compile the Haskell code with
$ ghc -O2 --make quick.hs
And I time those two with:
$ time ./quick
$ time python qs.py
Results:
Haskell:
real 0m10.820s
user 0m10.656s
sys 0m0.154s
Python:
real 0m9.888s
user 0m9.669s
sys 0m0.203s
How can Python possibly be faster than native code Haskell?
Thanks
EDIT:
Python version: 2.7.1
GHC version: 7.0.4
Mac OSX, 10.7.3
2.4GHz Intel Core i5
List generated by
from random import shuffle
a = [str(a) for a in xrange(0, 1000*1000)]
shuffle(a)
s = "\n".join(a)
f = open('data', 'w')
f.write(s)
f.close()
So all numbers are unique.
The Original Haskell Code
There are two issues with the Haskell version:
You're using string IO, which builds linked lists of characters
You're using a non-quicksort that looks like quicksort.
This program takes 18.7 seconds to run on my Intel Core2 2.5 GHz laptop. (GHC 7.4 using -O2)
Daniel's ByteString Version
This is much improved, but notice it still uses the inefficient built-in merge sort.
His version takes 8.1 seconds (and doesn't handle negative numbers, but that's more of a non-issue for this exploration).
Note
From here on this answer uses the following packages: Vector, attoparsec, text and vector-algorithms. Also notice that kindall's version using timsort takes 2.8 seconds on my machine (edit: and 2 seconds using pypy).
A Text Version
I ripped off Daniel's version, translated it to Text (so it handles various encodings) and added better sorting using a mutable Vector in an ST monad:
import Data.Attoparsec.Text.Lazy
import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as TIO
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Algorithms.Intro as I
import Control.Applicative
import Control.Monad.ST
import System.Environment (getArgs)
parser = many (decimal <* char '\n')
main = do
numbers <- TIO.readFile =<< fmap head getArgs
case parse parser numbers of
Done t r | T.null t -> writeFile "sorted" . unlines
. map show . vsort $ r
x -> error $ Prelude.take 40 (show x)
vsort :: [Int] -> [Int]
vsort l = runST $ do
let v = V.fromList l
m <- V.unsafeThaw v
I.sort m
v' <- V.unsafeFreeze m
return (V.toList v')
This runs in 4 seconds (and also doesn't handle negatives)
Return to the Bytestring
So now we know we can make a more general program that's faster, what about making the ASCII-only version fast? No problem!
import qualified Data.ByteString.Lazy.Char8 as BS
import Data.Attoparsec.ByteString.Lazy (parse, Result(..))
import Data.Attoparsec.ByteString.Char8 (decimal, char)
import Control.Applicative ((<*), many)
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Algorithms.Intro as I
import Control.Monad.ST
parser = many (decimal <* char '\n')
main = do
numbers <- BS.readFile "rands"
case parse parser numbers of
Done t r | BS.null t -> writeFile "sorted" . unlines
. map show . vsort $ r
vsort :: [Int] -> [Int]
vsort l = runST $ do
let v = V.fromList l
m <- V.unsafeThaw v
I.sort m
v' <- V.unsafeFreeze m
return (V.toList v')
This runs in 2.3 seconds.
Producing a Test File
Just in case anyone's curious, my test file was produced by:
import Control.Monad.CryptoRandom
import Crypto.Random
main = do
g <- newGenIO :: IO SystemRandom
let rs = Prelude.take (2^20) (map abs (crandoms g) :: [Int])
writeFile "rands" (unlines $ map show rs)
If you're wondering why vsort isn't packaged in some easier form on Hackage... so am I.
In short, don't use read. Replace read with a function like this:
import Numeric
fastRead :: String -> Int
fastRead s = case readDec s of [(n, "")] -> n
I get a pretty fair speedup:
~/programming% time ./test.slow
./test.slow 9.82s user 0.06s system 99% cpu 9.901 total
~/programming% time ./test.fast
./test.fast 6.99s user 0.05s system 99% cpu 7.064 total
~/programming% time ./test.bytestring
./test.bytestring 4.94s user 0.06s system 99% cpu 5.026 total
Just for fun, the above results include a version that uses ByteString (and hence fails the "ready for the 21st century" test by totally ignoring the problem of file encodings) for ULTIMATE BARE-METAL SPEED. It also has a few other differences; for example, it ships out to the standard library's sort function. The full code is below.
import qualified Data.ByteString as BS
import Data.Attoparsec.ByteString.Char8
import Control.Applicative
import Data.List
parser = many (decimal <* char '\n')
reallyParse p bs = case parse p bs of
Partial f -> f BS.empty
v -> v
main = do
numbers <- BS.readFile "data"
case reallyParse parser numbers of
Done t r | BS.null t -> writeFile "sorted" . unlines . map show . sort $ r
More a Pythonista than a Haskellite, but I'll take a stab:
There's a fair bit of overhead in your measured runtime just reading and writing the files, which is probably pretty similar between the two programs. Also, be careful that you've warmed up the cache for both programs.
Most of your time is spent making copies of lists and fragments of lists. Python list operations are heavily optimized, being one of the most-frequently used parts of the language, and list comprehensions are usually pretty performant too, spending much of their time in C-land inside the Python interpreter. There is not a lot of the stuff that is slowish in Python but wicked fast in static languages, such as attribute lookups on object instances.
Your Python implementation throws away numbers that are equal to the pivot, so by the end it may be sorting fewer items, giving it an obvious advantage. (If there are no duplicates in the data set you're sorting, this isn't an issue.) Fixing this bug probably requires making another copy of most of the list in each call to qs(), which would slow Python down a little more.
You don't mention what version of Python you're using. If you're using 2.x, you could probably get Haskell to beat Python just by switching to Python 3.x. :-)
I'm not too surprised the two languages are basically neck-and-neck here (a 10% difference is not noteworthy). Using C as a performance benchmark, Haskell loses some performance for its lazy functional nature, while Python loses some performance due to being an interpreted language. A decent match.
Since Daniel Wagner posted an optimized Haskell version using the built-in sort, here's a similarly optimized Python version using list.sort():
mylist = [int(x.strip()) for x in open("data")]
mylist.sort()
open("sorted", "w").write("\n".join(str(x) for x in mylist))
3.5 seconds on my machine, vs. about 9 for the original code. Pretty much still neck-and-neck with the optimized Haskell. Reason: it's spending most of its time in C-programmed libraries. Also, TimSort (the sort used in Python) is a beast.
This is after the fact, but I think most of the trouble is in the Haskell writing. The following module is pretty primitive -- one should probably use builders, and certainly avoid the ridiculous round trip via String for showing -- but it is simple and did distinctly better than pypy with kindall's improved Python, and better than the 2- and 4-second Haskell modules elsewhere on this page (it surprised me how much they were using lists, so I made a couple more turns of the crank).
$ time aa.hs real 0m0.709s
$ time pypy aa.py real 0m1.818s
$ time python aa.py real 0m3.103s
I'm using the sort recommended for unboxed vectors from vector-algorithms. The use of Data.Vector.Unboxed in some form is clearly now the standard, naive way of doing this sort of thing -- it's the new Data.List (for Int, Double, etc.). Everything but the sort is irritating IO management, which could, I think, still be massively improved, on the write end in particular. The reading and sorting together take about 0.2 sec, as you can see from asking it to print what's at a bunch of indexes instead of writing to file, so twice as much time is spent writing as in everything else. If pypy is spending most of its time in timsort or the like, then the sorting itself is surely massively better in Haskell, and just as simple -- if you can just get your hands on the darned vector...
I'm not sure why there aren't convenient functions around for reading and writing vectors of unboxed things from natural formats -- if there were, this would be three lines long and would avoid String and be much faster, but maybe I just haven't seen them.
import qualified Data.ByteString.Lazy.Char8 as BL
import qualified Data.ByteString.Char8 as B
import qualified Data.Vector.Unboxed.Mutable as M
import qualified Data.Vector.Unboxed as V
import Data.Vector.Algorithms.Radix
import System.IO
main = do unsorted <- fmap toInts (BL.readFile "data")
vec <- V.thaw unsorted
sorted <- sort vec >> V.freeze vec
withFile "sorted" WriteMode $ \handle ->
V.mapM_ (writeLine handle) sorted
writeLine :: Handle -> Int -> IO ()
writeLine h int = B.hPut h $ B.pack (show int ++ "\n")
toInts :: BL.ByteString -> V.Vector Int
toInts bs = V.unfoldr oneInt (BL.cons ' ' bs)
oneInt :: BL.ByteString -> Maybe (Int, BL.ByteString)
oneInt bs = if BL.null bs then Nothing else
let bstail = BL.tail bs
in if BL.null bstail then Nothing else BL.readInt bstail
To follow up on #kindall's interesting answer, those timings depend on the Python / Haskell implementation you use, the hardware configuration on which you run the tests, and the algorithm implementation you write in both languages.
Nevertheless we can try to get some good hints of the relative performance of one language implementation compared to another, or from one language to another language. With well-known algorithms like qsort, it's a good beginning.
To illustrate a python/python comparison, I just tested your script on CPython 2.7.3 and PyPy 1.8 on the same machine:
CPython: ~8s
PyPy: ~2.5s
This shows there can be room for improvement in a language implementation; maybe compiled Haskell simply isn't doing its best on your code here. If you are searching for speed in Python, consider also switching to PyPy if needed and if the code you depend on permits you to do so.
I noticed a problem that everybody else seems to have missed; both your Haskell and Python code have it. (Please tell me if the automatic optimizations fix it; I know nothing about optimizations.) I will demonstrate it in Haskell.
In your code you define the lesser and greater lists like this:
where lesser = filter (<p) xs
greater = filter (>=p) xs
This is bad, because you compare each element in xs with p twice: once for getting into the lesser list, and again for getting into the greater list. This (theoretically; I haven't checked the timing) makes your sort use twice as many comparisons; this is a disaster. Instead, you should write a function which splits a list into two lists using a predicate, in such a way that
split f xs
is equivalent to
(filter f xs, filter (not.f) xs)
Using this kind of function you will only need to compare each element in the list once to know which side of the tuple to put it in.
Okay, let's do it:
where
split :: (a -> Bool) -> [a] -> ([a], [a])
split _ [] = ([],[])
split f (x:xs)
|f x = let (a,b) = split f xs in (x:a,b)
|otherwise = let (a,b) = split f xs in (a,x:b)
Now let's replace the lesser/greater generator with
let (lesser, greater) = split (p>) xs in (insert function here)
Full code:
quicksort :: Ord a => [a] -> [a]
quicksort [] = []
quicksort (p:xs) =
let (lesser, greater) = splitf (p>) xs
in (quicksort lesser) ++ [p] ++ (quicksort greater)
where
splitf :: (a -> Bool) -> [a] -> ([a], [a])
splitf _ [] = ([],[])
splitf f (x:xs)
|f x = let (a,b) = splitf f xs in (x:a,b)
|otherwise = let (a,b) = splitf f xs in (a,x:b)
For some reason I couldn't write the greater/lesser part in the where clause, so I had to write it in a let clause.
Also, if it is not tail-recursive, let me know and fix it for me (I don't yet fully understand how tail recursion works).
Now you should do the same for the Python code. I don't know Python, so I can't do it for you.
EDIT:
There actually already is such a function in Data.List, called partition. (Note this proves the need for this kind of function; otherwise it wouldn't be defined.)
This shrinks the code to:
quicksort :: Ord a => [a] -> [a]
quicksort [] = []
quicksort (p:xs) =
let (lesser, greater) = partition (p>) xs
in (quicksort lesser) ++ [p] ++ (quicksort greater)
Python is really optimized for this sort of thing. I suspect that Haskell isn't. Here's a similar question that provides some very good answers.
I need to parse and process a text file that is a nested list of integers. The file is about 250MB large. This already leads to performance problems: my naive solution takes 20GB or more of RAM.
The question is related to another question.
I have written about the memory problems, and the suggestion was to use Data.Vector to get rid of them.
So the goal is to process a nested list of integers and, say, filter the values so that only values larger than 30 get printed out.
Test file "myfile.tx":
11,22,33,44,55
66,77,88,99,10
Here is my code using Attoparsec, adapted from attoparsec-csv:
{-# Language OverloadedStrings #-}
-- adapted from https://github.com/robinbb/attoparsec-csv
module Text.ParseCSV
(
parseCSV
) where
import Prelude hiding (concat, takeWhile)
import Control.Applicative ((<$>), (<|>), (<*>), (<*), (*>), many)
import Control.Monad (void, liftM)
import Data.Attoparsec.Text
import qualified Data.Text as T (Text, concat, cons, append, pack, lines)
import qualified Data.Text.IO as IO (readFile, putStr)
import qualified Data.ByteString.Char8 as BSCH (readInteger)
lineEnd :: Parser ()
lineEnd =
void (char '\n') <|> void (string "\r\n") <|> void (char '\r')
<?> "end of line"
parserInt :: Parser Integer
parserInt = (signed decimal)
record :: Parser [Integer]
record =
parserInt `sepBy1` char ','
<?> "record"
file :: Parser [[Integer]]
file =
(:) <$> record
<*> manyTill (lineEnd *> record)
(endOfInput <|> lineEnd *> endOfInput)
<?> "file"
parseCSV :: T.Text -> Either String [[Integer]]
parseCSV =
parseOnly file
getValues :: Either String [[Integer]] -> [Integer]
getValues (Right [x]) = x
getValues _ = []
getLines :: FilePath -> IO [T.Text]
getLines = liftM T.lines . IO.readFile
parseAndFilter :: T.Text -> [Integer]
parseAndFilter = ((\x -> filter (>30) x) . getValues . parseCSV)
main = do
list <- getLines "myfile.txt"
putStr $ show $ map parseAndFilter list
But instead of using a list [Integer] I would like to use Data.Vector.
I found a relevant part in the Data.Vector tutorial:
-- The simplest way to parse a file of Int or Integer types is with a strict or lazy ByteString, and the readInt or readInteger functions:
{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString.Lazy.Char8 as L
import qualified Data.Vector as U
import System.Environment
main = do
[f] <- getArgs
s <- L.readFile f
print . U.sum . parse $ s
-- Fill a new vector from a file containing a list of numbers.
parse = U.unfoldr step
where
step !s = case L.readInt s of
Nothing -> Nothing
Just (!k, !t) -> Just (k, L.tail t)
However, this is regular, not a nested list of integers.
I tried to adapt my code but it did not work.
How can I change my code to use a nested Vector (or Vector of Vectors) instead of [Integer], while also running the >30 filter on the Vector?
There is an important question you don't mention in the posting: do you need everything in memory at once? If the processing is local, or if you can summarize all the data up to a point in the file with a few values, you can solve the performance problems by streaming the data through and throwing away all but the current line. This will usually run way faster and allow you to process orders of magnitude larger files. And it usually doesn't even matter (as much) what data structure you use to parse the values.
Here is an example:
import Text.Regex
process::[Int]->String
process = (++"\n") . show . sum --put whatever you want here.
main = interact (concat . map (process . map read . splitRegex (mkRegex ",")) . lines)
The whole program runs lazily, so it processes line by line as the data comes in and frees up the memory for old data (you can check this by typing in data by hand and watching the output come out). There is a performance hit from using the unpacked structures, but this isn't as big a problem as pulling everything into memory.
Many problems that don't seem to fit this criterion at first can be modified to do so (you may have to sort the data first, but there are many performance-effective ways to do this). I once rewrote the full online stats system for a gaming company following this principle, and was able to take the stats crunching time from hours to a couple of minutes (with even more metrics).
Because of its lazy nature, Haskell is a good language to stream data through.
I found a post saying that there is no easy way to parse into a vector with attoparsec.
See this forum post and thread.
But the good news is that the overhead of Data.Vector.fromList isn't so bad.
Attoparsec seems to be quite fast for parsing.
I keep the whole data in memory and this doesn't seem to be a speed overhead. It's more flexible, as perhaps later I'll need to have the whole data in memory, although currently it is not needed per se for my problem.
Currently the code runs in ~30 seconds and uses about 1.5GB of RAM for a 150MB text file. The memory consumption is now quite small compared to the 20GB from before, and I only need to focus on improving the speed.
Here are the changes from the code in my question; commented-out code uses lists, functions with Vector in the type are new (this is not production code or meant to be good code yet):
{-
getValues :: Either String [[Integer]] -> [Integer]
getValues (Right [x]) = x
getValues _ = []
-}
getValues :: Either String [[Integer]] -> Vector Integer
getValues (Right [x]) = V.fromList x
getValues _ = V.fromList [999999,9999999,99999,999999] --- represents an ERROR
getLines :: FilePath -> IO [T.Text]
getLines = liftM T.lines . IO.readFile
{-
parseAndFilter :: T.Text -> [Integer]
parseAndFilter = ((\x -> filter (>30) x) . getValues . parseCSV)
-}
filterLarger :: Vector Integer -> Vector Integer
filterLarger = \x -> V.filter (>37) x
parseVector :: T.Text -> Vector Integer
parseVector = (getValues . parseCSV)
-- mystr = T.pack "3, 6, 7" --, 13, 14, 15, 17, 21, 22, 23, 24, 25, 28, 29, 30, 32, 33, 35, 36"
main = do
list <- getLines "mydata.txt"
--putStr $ show $ parseCSV $ mystr
putStr $ show $ V.map filterLarger $ V.map parseVector $ V.fromList list
--show $ parseOnly parserInt $ T.pack "123"
Thanks to jamshidh and all the comments that pointed me to the right direction.
Here is the final solution. Switching to ByteString and Int in the code, it now runs twice as fast with a bit less memory consumption (time is now ~14 seconds).
{-# Language OverloadedStrings #-}
-- adapted from https://github.com/robinbb/attoparsec-csv
module Main
(
parseCSV, main
) where
import Data.Vector as V (Vector, fromList, map, head, filter)
import Prelude hiding (concat, takeWhile)
import Control.Applicative ((<$>), (<|>), (<*>), (<*), (*>), many)
import Control.Monad (void, liftM)
import Data.Attoparsec.Char8
import qualified Data.ByteString.Char8 as B
lineEnd :: Parser ()
lineEnd =
void (char '\n') <|> void (string "\r\n") <|> void (char '\r')
<?> "end of line"
parserInt :: Parser Int
parserInt = skipSpace *> signed decimal
record :: Parser [Int]
record =
parserInt `sepBy1` char ','
<?> "record"
file :: Parser [[Int]]
file =
(:) <$> record
<*> manyTill (lineEnd *> record)
(endOfInput <|> lineEnd *> endOfInput)
<?> "file"
parseCSV :: B.ByteString -> Either String [[Int]]
parseCSV =
parseOnly file
getValues :: Either String [[Int]] -> Vector Int
getValues (Right [x]) = V.fromList x
getValues _ = error "ERROR in getValues function!"
filterLarger :: Vector Int -> Vector Int
filterLarger = \x -> V.filter (>36) x
parseVector :: B.ByteString -> Vector Int
parseVector = (getValues . parseCSV)
-- MAIN
main = do
fContent <- B.readFile "myfile.txt"
putStr $ show $ V.map filterLarger $ V.map parseVector $ V.fromList $ B.lines fContent
I am building some moderately large DIMACS files; however, with the method used below, the memory usage is rather large compared to the size of the files generated, and on some of the larger files I need to generate I run into out-of-memory problems.
import Control.Monad.State.Strict
import Control.Monad.Writer.Strict
import qualified Data.ByteString.Lazy.Char8 as B
import Control.Monad
import qualified Text.Show.ByteString as BS
import Data.List
main = printDIMACS "test.cnf" test
test = do
xs <- freshs 100000
forM_ (zip xs (tail xs))
(\(x,y) -> addAll [[negate x, negate y],[x,y]])
type Var = Int
type Clause = [Var]
data DIMACSS = DS{
nextFresh :: Int,
numClauses :: Int
} deriving (Show)
type DIMACSM a = StateT DIMACSS (Writer B.ByteString) a
freshs :: Int -> DIMACSM [Var]
freshs i = do
next <- gets nextFresh
let toRet = [next..next+i-1]
modify (\s -> s{nextFresh = next+i})
return toRet
fresh :: DIMACSM Int
fresh = do
i <- gets nextFresh
modify (\s -> s{nextFresh = i+1})
return i
addAll :: [Clause] -> DIMACSM ()
addAll c = do
tell
(B.concat .
intersperse (B.pack " 0\n") .
map (B.unwords . map BS.show) $ c)
tell (B.pack " 0\n")
modify (\s -> s{numClauses = numClauses s + length c})
add h = addAll [h]
printDIMACS :: FilePath -> DIMACSM a -> IO ()
printDIMACS file f = do
writeFile file ""
appendFile file (concat ["p cnf ", show i, " ", show j, "\n"])
B.appendFile file b
where
(s,b) = runWriter (execStateT f (DS 1 0))
i = nextFresh s - 1
j = numClauses s
I would like to keep the monadic building of clauses since it is very handy, but I need to overcome the memory problem. How do I optimize the above program so that it doesn't use too much memory?
If you want good memory behavior, you need to make sure that you write out the clauses as you generate them, instead of collecting them in memory and dumping them at the end, either using laziness or a more explicit approach such as conduits, enumerators, pipes or the like.
The main obstacle to that approach is that the DIMACS format expects the number of clauses and variables in the header. This prevents the naive implementation from being sufficiently lazy. There are two possibilities:
The pragmatic one is to write the clauses first to a temporary location. After that the numbers are known, so you write them to the real file and append the contents of the temporary file.
The prettier approach is possible if the generation of clauses has no side effects (besides the effects offered by your DIMACSM monad) and is sufficiently fast: Run it twice, first throwing away the clauses and just calculating the numbers, print the header line, run the generator again; now printing the clauses.
(This is from my experience with implementing SAT-Britney, where I took the second approach, because it fitted better with other requirements in that context.)
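For the pragmatic first option, a rough sketch (hypothetical: it assumes the generator is changed to write clauses straight to a Handle and merely return the final counters, so clauses never accumulate in memory; openTempFile is from System.IO):
import qualified Data.ByteString.Lazy.Char8 as B
import System.Directory (removeFile)
import System.IO

printDIMACSTmp :: FilePath -> (Handle -> IO DIMACSS) -> IO ()
printDIMACSTmp file gen = do
  (tmp, h) <- openTempFile "." "clauses.tmp"
  s <- gen h -- clauses stream to the scratch file as they are made
  hClose h
  writeFile file
    (concat ["p cnf ", show (nextFresh s - 1), " ", show (numClauses s), "\n"])
  B.readFile tmp >>= B.appendFile file -- lazy read, copied in chunks
  removeFile tmp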
Also, in your code, addAll is not lazy enough: The list c needs to be retained even after writing (in the MonadWriter sense) the clauses. This is another space leak. I suggest you implement add as the primitive operation and then addAll = mapM_ add.
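A sketch of that suggestion (reusing the question's DIMACSM, and going through String for show just to keep it short):
add :: Clause -> DIMACSM ()
add c = do
  -- tell one clause at a time, so the full clause list is never retained
  tell (B.pack (unwords (map show c) ++ " 0\n"))
  modify (\s -> s{numClauses = numClauses s + 1})

addAll :: [Clause] -> DIMACSM ()
addAll = mapM_ add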
As explained in Joachim Breitner's answer, the problem was that DIMACSM was not lazy enough, both because the strict versions of the monads were used and because the number of variables and clauses is needed before the ByteString can be written to the file. The solution is to use the lazy versions of the monads and execute them twice. It turns out that it is also necessary to have WriterT be the outer monad:
import Control.Monad.State
import Control.Monad.Writer
...
type DIMACSM a = WriterT B.ByteString (State DIMACSS) a
...
printDIMACS :: FilePath -> DIMACSM a -> IO ()
printDIMACS file f = do
writeFile file ""
appendFile file (concat ["p cnf ", show i, " ", show j, "\n"])
B.appendFile file b
where
s = execState (execWriterT f) (DS 1 0)
b = evalState (execWriterT f) (DS 1 0)
i = nextFresh s - 1
j = numClauses s
Greetings,
I'm trying to understand why I'm seeing the entire file loaded into memory with the following program, yet if you comment out the line below "(***)" then the program runs in constant (about 1.5M) space.
EDIT: The file is about 660MB, the field in column 26 is a date string like '2009-10-01', and there are one million lines. The process uses about 810MB by the time it hits the 'getLine'
Am I right in thinking it's related to the splitting of the string using 'split', and that somehow the underlying ByteString that has been read from the file can't be garbage-collected because it's still referenced? But if so, then I thought BS.copy would work around that. Any ideas how to force the computation - I can't seem to get 'seq' into the right place to have an effect.
(NB the source file is tab-separated lines)
Thanks in advance,
Kevin
module Main where
import System.IO
import qualified Data.ByteString.Lazy.Char8 as BS
import Control.Monad
type Record = BS.ByteString
importRecords :: String -> IO [Record]
importRecords filename = do
liftM (map importRecord.BS.lines) (BS.readFile filename)
importRecord :: BS.ByteString -> Record
importRecord txt = r
where
r = getField 26
getField f = BS.copy $ ((BS.split '\t' txt) !! f)
loopInput :: [Record] -> IO ()
loopInput jrs = do
putStrLn $ "Done" ++ (show $ last jrs)
hFlush stdout
x <- getLine
return ()
-- (***)
loopInput jrs
main = do
jrs <- importRecords "c:\\downloads\\lcg1m.txt"
loopInput jrs
Your call to last forces the list, jrs. To compute it, the program must run through the entire file, building up thunks for each entry in jrs. Because you aren't evaluating any element of jrs (except the last one), these thunks hold references to the bytestring, so it must stay in memory.
The solution is to force the evaluation of those thunks. Because we're talking about space the first thing I did was actually to store your info in a smaller format:
type Year = Word16
type Month = Word8
type Day = Word8
data Record = Rec {-# UNPACK #-} !Year {-# UNPACK #-} !Month {-# UNPACK #-} !Day
deriving (Eq, Ord, Show, Read)
This reduces that ugly 10-byte ByteString (+ overhead of ~16 bytes of structure information) to around 8 bytes.
importRecord now has to call toRecord r to get the right type:
toRecord :: BS.ByteString -> Record
toRecord bs =
case BS.splitWith (== '-') bs of
(y:m:d:[]) -> Rec (rup y) (rup m) (rup d)
_ -> Rec 0 0 0
rup :: (Read a) => BS.ByteString -> a
rup = read . BS.unpack
We'll need to evaluate the data when we convert from ByteString to Record, so let's use the parallel package and define an NFData instance using DeepSeq.
instance NFData Record where
rnf (Rec y m d) = y `seq` m `seq` d `seq` ()
Now we're ready to go; I modified main to use evalList, thus forcing the whole list before the function that wants the last element:
main = do
jrs <- importRecords "./tabLines"
let jrs' = using jrs (evalList rdeepseq)
loopInput jrs'
And we can see the heap profile looks beautiful (and top agrees, the program uses very little memory).
Sorry about that other misleading, wrong answer - I was hooked on the fact that incremental processing fixes it, and didn't really realize the thunks really were hanging around; not sure why my brain glided over that. Though I do stand by the gist: you should incrementally process this information, making all of this answer moot.
FYI the huge bytestring didn't show up in those previous heap profiles I posted because foreign allocations (which includes ByteString) aren't tracked by the heap profiler.
There seem to be two questions here:
why does the memory usage depend on the presence or absence of the line (***);
why is the memory usage with (***) present about 800MB, rather than, say, 40MB.
I don't really know what to say about the first one that TomMD didn't already say; inside the loopInput loop, jrs can never be freed, because it's needed as an argument to the recursive call of loopInput. (You know that return () doesn't do anything when (***) is present, right?)
As for the second question, I think you are right that the input ByteString isn't being garbage collected. The reason is that you never evaluate the elements of your list jrs besides the last one, so they still contain references to the original ByteString (even though they are of the form BS.copy ...). I would think that replacing show $ last jrs with show jrs would reduce your memory usage; does it? Alternatively, you could try a stricter map, like
map' f [] = []
map' f (x:xs) = ((:) $! (f $! x)) (map' f xs)
Replace the map in importRecords with map' and see whether that reduces your memory usage.