I want to count the unique blocks stored in a file using Haskell.
A block is simply 512 consecutive bytes, and the target file is at least 1 GB in size.
This is my initial attempt:
import Control.Monad
import qualified Data.ByteString.Lazy as LB
import Data.Foldable
import Data.HashMap
import Data.Int
import qualified Data.List as DL
import System.Environment

type DummyDedupe = Map LB.ByteString Int64

toBlocks :: Int64 -> LB.ByteString -> [LB.ByteString]
toBlocks n bs
    | LB.null bs = []
    | otherwise  = let (block, rest) = LB.splitAt n bs
                   in block : toBlocks n rest

dedupeBlocks :: [LB.ByteString] -> DummyDedupe -> DummyDedupe
dedupeBlocks = flip $ DL.foldl' (\acc block -> insertWith (+) block 1 $! acc)

dedupeFile :: FilePath -> DummyDedupe -> IO DummyDedupe
dedupeFile fp dd = LB.readFile fp >>= return . (`dedupeBlocks` dd) . toBlocks 512

main :: IO ()
main = do
    dd <- getArgs >>= (`dedupeFile` empty) . head
    putStrLn . show . (*512) . size $ dd
    putStrLn . show . (*512) . foldl' (+) 0 $ dd
It works, but I was frustrated by its execution time and memory usage, especially compared with the C++ and even the Python implementation listed below: it was 3~5x slower and consumed 2~3x more memory.
import os
import os.path
import sys

def dedupeFile(dd, fp):
    fd = os.open(fp, os.O_RDONLY)
    # b'' sentinel: os.read returns bytes on Python 3 (b'' == '' on Python 2)
    for block in iter(lambda: os.read(fd, 512), b''):
        dd.setdefault(block, 0)
        dd[block] = dd[block] + 1
    os.close(fd)
    return dd

dd = {}
dedupeFile(dd, sys.argv[1])
print(len(dd) * 512)
print(sum(dd.values()) * 512)
I thought it was mainly due to the hash map implementation, so I tried other implementations such as hashmap, hashtables, and unordered-containers.
But there wasn't any noticeable difference.
Please help me improve this program.
I don't think you will be able to beat the performance of Python dictionaries. They are implemented in C, with years of optimization behind them; the hashmap package, on the other hand, is newer and much less optimized. So being within 3x, in my opinion, is good enough. You can optimize your Haskell code in certain places, but it still won't matter much. If you are still adamant about increasing performance, I think you should use a highly optimized C library via the FFI.
Here is a similar discussion: haskell beginners
This may be completely irrelevant depending on your usage, but I am slightly worried about insertWith (+) block 1. If your counts reach high numbers, you will accumulate thunks in the cells of the hash map. It doesn't matter that you used ($!), that only forces the spine -- the values are likely still lazy.
Data.HashMap provides no strict version insertWith' like Data.Map does. But you can implement it:
insertWith' :: (Hashable k, Ord k) => (a -> a -> a) -> k -> a
            -> HashMap k a -> HashMap k a
insertWith' f k v m = maybe id seq maybeval m'
  where
    (maybeval, m') = insertLookupWithKey (const f) k v m
Also, you may want to output (but not input) a list of strict ByteStrings from toBlocks, which will make hashing faster.
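For example, a variant of toBlocks that yields strict chunks could look like this (a minimal sketch, assuming bytestring >= 0.10 for LB.toStrict; the key type of DummyDedupe would change to strict ByteString accordingly):

import qualified Data.ByteString as SB
import qualified Data.ByteString.Lazy as LB
import Data.Int (Int64)

-- Same traversal as toBlocks, but each 512-byte block is converted to a
-- strict ByteString before it reaches the hash map.
toBlocksStrict :: Int64 -> LB.ByteString -> [SB.ByteString]
toBlocksStrict n bs
    | LB.null bs = []
    | otherwise  = let (block, rest) = LB.splitAt n bs
                   in LB.toStrict block : toBlocksStrict n rest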
That's all I've got -- I'm no performance guru, though.
In an ongoing endeavour to efficiently fiddle with bits (e.g. see this SO question) the newest challenge is the efficient streaming and consumption of bits.
As a first simple task I chose to find the longest sequence of identical bits in a bitstream generated by /dev/urandom. A typical incantation would be head -c 1000000 </dev/urandom | my-exe. The actual goal is to stream bits and decode, for example, an Elias gamma code, i.e. codes that are not chunks of bytes or multiples thereof.
For such codes of variable length it is nice to have the take, takeWhile, group, etc. language for list manipulation. Since a BitStream.take would actually consume part of the bitstream, some monad would probably come into play.
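To make that concrete, here is a rough sketch of my own (not code from the thread) of decoding a single Elias gamma code from a [Bool] stream with exactly that vocabulary; instead of a monad, the unconsumed rest of the stream is returned explicitly:

import Data.List (foldl')

-- Elias gamma: n leading zeros, then the value in n+1 bits, MSB first.
-- Returns the decoded value and the remaining bits, or Nothing if the
-- stream ends in the middle of a code.
decodeGamma :: [Bool] -> Maybe (Int, [Bool])
decodeGamma bs =
    let (zeros, rest)    = span not bs            -- unary length prefix
        n                = length zeros
        (payload, rest') = splitAt (n + 1) rest   -- leading one plus n bits
    in if length payload == n + 1
           then Just (foldl' (\acc b -> 2 * acc + fromEnum b) 0 payload, rest')
           else Nothing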
The obvious starting point is the lazy bytestring from Data.ByteString.Lazy.
A. Counting bytes
This very simple Haskell program performs on par with a C program, as is to be expected.
import qualified Data.ByteString.Lazy as BSL
main :: IO ()
main = do
    bs <- BSL.getContents
    print $ BSL.length bs
B. Adding bytes
Once I start using unpack things should get worse.
main = do
    bs <- BSL.getContents
    print $ sum $ BSL.unpack bs
Surprisingly, Haskell and C show almost the same performance.
C. Longest sequence of identical bits
As a first nontrivial task the longest sequence of identical bits can be found like this:
module Main where

import Data.Bits (shiftR, (.&.))
import qualified Data.ByteString.Lazy as BSL
import Data.List (group)
import Data.Word8 (Word8)

splitByte :: Word8 -> [Bool]
splitByte w = Prelude.map (\i -> (w `shiftR` i) .&. 1 == 1) [0..7]

bitStream :: BSL.ByteString -> [Bool]
bitStream bs = concat $ map splitByte (BSL.unpack bs)

main :: IO ()
main = do
    bs <- BSL.getContents
    print $ maximum $ length <$> (group $ bitStream bs)
The lazy ByteString is converted to a list [Word8]; then, using shifts, each Word8 is split into its bits, and the resulting list of lists is flattened with concat. Having obtained a (lazy) list of Bool, use group to split it into sequences of identical bits, map length over those, and finally take the maximum for the desired result. Quite simple, but not very fast:
# C
real 0m0.606s
# Haskell
real 0m6.062s
This naive implementation is exactly one order of magnitude slower.
Profiling shows that quite a lot of memory gets allocated (about 3GB for parsing 1MB of input). There is no massive space leak to be observed, though.
From here I start poking around:
There is a bitstream package that promises "Fast, packed, strict bit streams (i.e. list of Bools) with semi-automatic stream fusion.". Unfortunately it is not up-to-date with the current vector package, see here for details.
Next, I investigate streaming. I don't quite see why I should need 'effectful' streaming that brings some monad into play - at least until I start with the reverse of the posed task, i.e. encoding and writing bitstreams to file.
How about just fold-ing over the ByteString? I'd have to introduce state to keep track of consumed bits (see the sketch after this list). That's not quite the nice take, takeWhile, group, etc. language that is desirable.
And now I'm not quite sure where to go.
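For concreteness, this is roughly what the fold-with-state approach might look like (my own sketch, not a solution from the thread; the strict accumulator type keeps the fold from building thunks):

import Data.Bits (shiftR, (.&.))
import qualified Data.ByteString.Lazy as BSL

data Acc = Acc !Bool !Int !Int  -- previous bit, current run, longest run

longestRunFold :: BSL.ByteString -> Int
longestRunFold = done . BSL.foldl' step (Acc False 0 0)
  where
    done (Acc _ _ best) = best
    -- Unpack the eight bits of each byte and feed them to the bit-level step.
    step acc w = foldl bit acc [(w `shiftR` i) .&. 1 == 1 | i <- [0 .. 7]]
    bit (Acc prev run best) b =
        let run' = if b == prev then run + 1 else 1
        in Acc b run' (max best run')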
Update:
I figured out how to do this with streaming and streaming-bytestring. I'm probably not doing this right because the result is catastrophically bad.
import Data.Bits (shiftR, (.&.))
import qualified Data.ByteString.Streaming as BSS
import Data.Word8 (Word8)
import qualified Streaming as S
import Streaming.Prelude (Of, Stream)
import qualified Streaming.Prelude as S

splitByte :: Word8 -> [Bool]
splitByte w = (\i -> (w `shiftR` i) .&. 1 == 1) <$> [0..7]

bitStream :: Monad m => Stream (Of Word8) m () -> Stream (Of Bool) m ()
bitStream s = S.concat $ S.map splitByte s

main :: IO ()
main = do
    let bs = BSS.unpack BSS.getContents :: Stream (Of Word8) IO ()
        gs = S.group $ bitStream bs :: Stream (Stream (Of Bool) IO) IO ()
    maxLen <- S.maximum $ S.mapped S.length gs
    print $ S.fst' maxLen
This will test your patience with anything beyond a few thousand bytes of input from stdin. The profiler says it spends an insane amount of time (quadratic in the input size) in Streaming.Internal.>>=.loop and Data.Functor.Of.fmap. I'm not quite sure what the first one is, but the fmap indicates (?) that the juggling of these Of a b values isn't doing us any good, and because we're in the IO monad it can't be optimised away.
I also have the streaming equivalent of the byte adder here: SumBytesStream.hs, which is slightly slower than the simple lazy ByteString implementation, but still decent. Since streaming-bytestring is proclaimed to be "bytestring io done right" I expected better. I'm probably not doing it right, then.
In any case, all these bit-computations shouldn't be happening in the IO monad. But BSS.getContents forces me into the IO monad because getContents :: MonadIO m => ByteString m () and there's no way out.
Update 2
Following the advice of #dfeuer I used the streaming package at master#HEAD. Here's the result.
longest-seq-c 0m0.747s (C)
longest-seq 0m8.190s (Haskell ByteString)
longest-seq-stream 0m13.946s (Haskell streaming-bytestring)
The O(n^2) problem of Streaming.concat is solved, but we're still not getting closer to the C benchmark.
Update 3
Cirdec's solution produces performance on par with C. The construct used is called "Church encoded lists"; see this SO answer or the Haskell Wiki on rank-N types.
Source files:
All the source files can be found on github. The Makefile has all the various targets to run the experiments and the profiling. The default make will just build everything (create a bin/ directory first!) and then make time will do the timing on the longest-seq executables. The C executables get a -c appended to distinguish them.
Intermediate allocations and their corresponding overhead can be removed when operations on streams fuse together. The GHC prelude provides foldr/build fusion for lazy streams in the form of rewrite rules. The general idea is if one function produces a result that looks like a foldr (it has the type (a -> b -> b) -> b -> b applied to (:) and []) and another function consumes a list that looks like a foldr, constructing the intermediate list can be removed.
For your problem I'm going to build something similar, but using strict left folds (foldl') instead of foldr. Instead of using rewrite rules that try to detect when something looks like a foldl, I'll use a data type that forces lists to look like left folds.
-- A list encoded as a strict left fold.
newtype ListS a = ListS {build :: forall b. (b -> a -> b) -> b -> b}
Since I've started by abandoning lists, we'll be re-implementing part of the Prelude for this type.
Strict left folds can be created from the foldl' functions of both lists and bytestrings.
{-# INLINE fromList #-}
fromList :: [a] -> ListS a
fromList l = ListS (\c z -> foldl' c z l)
{-# INLINE fromBS #-}
fromBS :: BSL.ByteString -> ListS Word8
fromBS l = ListS (\c z -> BSL.foldl' c z l)
The simplest example of using one is to find the length of a list.
{-# INLINE length' #-}
length' :: ListS a -> Int
length' l = build l (\z a -> z+1) 0
We can also map and concatenate left folds.
{-# INLINE map' #-}
-- fmap renamed so it can be inlined
map' :: (a -> b) -> ListS a -> ListS b
map' f l = ListS (\c z -> build l (\z a -> c z (f a)) z)
{-# INLINE concat' #-}
concat' :: ListS (ListS a) -> ListS a
concat' ll = ListS (\c z -> build ll (\z l -> build l c z) z)
For your problem we need to be able to split a word into bits.
{-# INLINE splitByte #-}
splitByte :: Word8 -> [Bool]
splitByte w = Prelude.map (\i-> (w `shiftR` i) .&. 1 == 1) [0..7]
{-# INLINE splitByte' #-}
splitByte' :: Word8 -> ListS Bool
splitByte' = fromList . splitByte
And a ByteString into bits
{-# INLINE bitStream' #-}
bitStream' :: BSL.ByteString -> ListS Bool
bitStream' = concat' . map' splitByte' . fromBS
To find the longest run we'll keep track of the previous value, the length of the current run, and the length of the longest run. We make the fields strict so that the strictness of the fold prevents chains of thunks from being accumulated in memory. Making a strict data type for a state is an easy way to get control over both its memory representation and when its fields are evaluated.
data LongestRun = LongestRun !Bool !Int !Int

{-# INLINE extendRun #-}
extendRun :: LongestRun -> Bool -> LongestRun
extendRun (LongestRun previous run longest) x = LongestRun x current (max current longest)
  where
    current = if x == previous then run + 1 else 1

{-# INLINE longestRun #-}
longestRun :: ListS Bool -> Int
longestRun l = longest
  where
    (LongestRun _ _ longest) = build l extendRun (LongestRun False 0 0)
And we're done:

main :: IO ()
main = do
    bs <- BSL.getContents
    print $ longestRun $ bitStream' bs
This is much faster, but not quite the performance of C.
longest-seq-c 0m00.12s (C)
longest-seq 0m08.65s (Haskell ByteString)
longest-seq-fuse 0m00.81s (Haskell ByteString fused)
The program allocates about 1 MB to read 1,000,000 bytes from input.
total alloc = 1,173,104 bytes (excludes profiling overheads)
Updated github code
I found another solution that is on par with C. Data.Vector.Fusion.Stream.Monadic has a stream implementation based on the Coutts, Leshchinskiy, Stewart 2007 paper. The idea behind it is to use destroy/unfoldr stream fusion.
Recall that list's unfoldr :: (b -> Maybe (a, b)) -> b -> [a] creates a list by repeatedly applying (unfolding) a step-forward function, starting with an initial value. A Stream is just an unfoldr function with starting state. (The Data.Vector.Fusion.Stream.Monadic library uses GADTs to create constructors for Step that can be pattern-matched conveniently. It could just as well be done without GADTs, I think.)
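As a quick reminder of the shape of unfoldr, a standard toy example (not specific to this thread):

import Data.List (unfoldr)

-- Count down from a seed: each step either yields an element together
-- with the next state, or Nothing to end the list.
countdown :: Int -> [Int]
countdown = unfoldr (\n -> if n == 0 then Nothing else Just (n, n - 1))

-- countdown 5 == [5,4,3,2,1]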
The central piece of the solution is the mkBitstream :: BSL.ByteString -> Stream Bool function that turns a ByteString into a stream of Bool. Basically, we keep track of the current ByteString, the current byte, and how much of the current byte is still unconsumed. Whenever a byte is used up, another byte is chopped off the ByteString. When nothing is left, the stream is Done.
The longestRun function is taken straight from #Cirdec's solution.
Here's the etude:
{-# LANGUAGE CPP #-}
#define PHASE_FUSED [1]
#define PHASE_INNER [0]
#define INLINE_FUSED INLINE PHASE_FUSED
#define INLINE_INNER INLINE PHASE_INNER

module Main where

import Control.Monad.Identity (Identity)
import Data.Bits (shiftR, (.&.))
import qualified Data.ByteString.Lazy as BSL
import Data.Functor.Identity (runIdentity)
import qualified Data.Vector.Fusion.Stream.Monadic as S
import Data.Word8 (Word8)

type Stream a = S.Stream Identity a -- no need for any monad, really

data Step = Step BSL.ByteString !Word8 !Word8 -- could use tuples, but this is faster

mkBitstream :: BSL.ByteString -> Stream Bool
mkBitstream bs' = S.Stream step (Step bs' 0 0) where
    {-# INLINE_INNER step #-}
    step (Step bs w n)
        | n == 0 = case BSL.uncons bs of
            Nothing        -> return S.Done
            Just (w', bs') -> return $
                S.Yield (w' .&. 1 == 1) (Step bs' (w' `shiftR` 1) 7)
        | otherwise = return $
            S.Yield (w .&. 1 == 1) (Step bs (w `shiftR` 1) (n - 1))

data LongestRun = LongestRun !Bool !Int !Int

{-# INLINE extendRun #-}
extendRun :: LongestRun -> Bool -> LongestRun
extendRun (LongestRun previous run longest) x = LongestRun x current (max current longest)
  where current = if x == previous then run + 1 else 1

{-# INLINE longestRun #-}
longestRun :: Stream Bool -> Int
longestRun s = runIdentity $ do
    (LongestRun _ _ longest) <- S.foldl' extendRun (LongestRun False 0 0) s
    return longest

main :: IO ()
main = do
    bs <- BSL.getContents
    print $ longestRun (mkBitstream bs)
I have a simple script written in both Python and Haskell. It reads a file with 1,000,000 newline-separated integers, parses it into a list of integers, quicksorts the list, and writes it to a different file, sorted. The sorted file has the same format as the unsorted one. Simple.
Here is Haskell:
quicksort :: Ord a => [a] -> [a]
quicksort [] = []
quicksort (p:xs) = (quicksort lesser) ++ [p] ++ (quicksort greater)
  where
    lesser  = filter (< p) xs
    greater = filter (>= p) xs

main = do
    file <- readFile "data"
    let un = lines file
    let f = map (\x -> read x :: Int) un
    let done = quicksort f
    writeFile "sorted" (unlines (map show done))
And here is Python:
def qs(ar):
    if len(ar) == 0:
        return ar
    p = ar[0]
    return qs([i for i in ar if i < p]) + [p] + qs([i for i in ar if i > p])

def read_file(fn):
    f = open(fn)
    data = f.read()
    f.close()
    return data

def write_file(fn, data):
    f = open(fn, 'w')
    f.write(data)
    f.close()

def main():
    data = read_file('data')
    lines = data.split('\n')
    lines = [int(l) for l in lines]
    done = qs(lines)
    done = [str(l) for l in done]
    write_file('sorted', "\n".join(done))

if __name__ == '__main__':
    main()
Very simple. Now I compile the Haskell code with
$ ghc -O2 --make quick.hs
And I time those two with:
$ time ./quick
$ time python qs.py
Results:
Haskell:
real 0m10.820s
user 0m10.656s
sys 0m0.154s
Python:
real 0m9.888s
user 0m9.669s
sys 0m0.203s
How can Python possibly be faster than native code Haskell?
Thanks
EDIT:
Python version: 2.7.1
GHC version: 7.0.4
Mac OSX, 10.7.3
2.4GHz Intel Core i5
List generated by
from random import shuffle
a = [str(a) for a in xrange(0, 1000*1000)]
shuffle(a)
s = "\n".join(a)
f = open('data', 'w')
f.write(s)
f.close()
So all numbers are unique.
The Original Haskell Code
There are two issues with the Haskell version:
You're using string IO, which builds linked lists of characters
You're using a non-quicksort that looks like quicksort.
This program takes 18.7 seconds to run on my Intel Core2 2.5 GHz laptop. (GHC 7.4 using -O2)
Daniel's ByteString Version
This is much improved, but notice it still uses the inefficient built-in merge sort.
His version takes 8.1 seconds (and doesn't handle negative numbers, but that's more of a non-issue for this exploration).
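Daniel's code isn't quoted in this thread; purely for illustration, a ByteString version in that spirit (still sorting with the built-in merge sort from Data.List) might look roughly like this:

import qualified Data.ByteString.Lazy.Char8 as BS
import Data.List (sort)
import Data.Maybe (mapMaybe)

-- Parse one Int per line with readInt, sort with Data.List.sort, and
-- write the result back out via String (the slow part).
main :: IO ()
main = do
    contents <- BS.readFile "data"
    let ns = mapMaybe (fmap fst . BS.readInt) (BS.lines contents)
    writeFile "sorted" (unlines (map show (sort ns)))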
Note
From here on this answer uses the following packages: Vector, attoparsec, text and vector-algorithms. Also notice that kindall's version using timsort takes 2.8 seconds on my machine (edit: and 2 seconds using pypy).
A Text Version
I ripped off Daniel's version, translated it to Text (so it handles various encodings) and added better sorting using a mutable Vector in an ST monad:
import Data.Attoparsec.Text.Lazy
import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as TIO
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Algorithms.Intro as I
import Control.Applicative
import Control.Monad.ST
import System.Environment (getArgs)

parser = many (decimal <* char '\n')

main = do
    numbers <- TIO.readFile =<< fmap head getArgs
    case parse parser numbers of
        Done t r | T.null t -> writeFile "sorted" . unlines
                               . map show . vsort $ r
        x -> error $ Prelude.take 40 (show x)

vsort :: [Int] -> [Int]
vsort l = runST $ do
    let v = V.fromList l
    m <- V.unsafeThaw v
    I.sort m
    v' <- V.unsafeFreeze m
    return (V.toList v')
This runs in 4 seconds (and also doesn't handle negatives)
Return to the ByteString
So now we know we can make a more general program that's faster, what about making the ASCII-only version fast? No problem!
import qualified Data.ByteString.Lazy.Char8 as BS
import Data.Attoparsec.ByteString.Lazy (parse, Result(..))
import Data.Attoparsec.ByteString.Char8 (decimal, char)
import Control.Applicative ((<*), many)
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Algorithms.Intro as I
import Control.Monad.ST

parser = many (decimal <* char '\n')

main = do
    numbers <- BS.readFile "rands"
    case parse parser numbers of
        Done t r | BS.null t -> writeFile "sorted" . unlines
                                . map show . vsort $ r

vsort :: [Int] -> [Int]
vsort l = runST $ do
    let v = V.fromList l
    m <- V.unsafeThaw v
    I.sort m
    v' <- V.unsafeFreeze m
    return (V.toList v')
This runs in 2.3 seconds.
Producing a Test File
Just in case anyone's curious, my test file was produced by:
import Control.Monad.CryptoRandom
import Crypto.Random
main = do
    g <- newGenIO :: IO SystemRandom
    let rs = Prelude.take (2^20) (map abs (crandoms g) :: [Int])
    writeFile "rands" (unlines $ map show rs)
If you're wondering why vsort isn't packaged in some easier form on Hackage... so am I.
In short, don't use read. Replace read with a function like this:
import Numeric
fastRead :: String -> Int
fastRead s = case readDec s of [(n, "")] -> n
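The pattern match above is deliberately partial; if you'd rather fail with a clear message on malformed input, a slightly more defensive variant (my own addition, not from the original answer) is:

fastRead' :: String -> Int
fastRead' s = case readDec s of
    [(n, "")] -> n
    _         -> error ("fastRead': no parse: " ++ show s)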
I get a pretty fair speedup:
~/programming% time ./test.slow
./test.slow 9.82s user 0.06s system 99% cpu 9.901 total
~/programming% time ./test.fast
./test.fast 6.99s user 0.05s system 99% cpu 7.064 total
~/programming% time ./test.bytestring
./test.bytestring 4.94s user 0.06s system 99% cpu 5.026 total
Just for fun, the above results include a version that uses ByteString (and hence fails the "ready for the 21st century" test by totally ignoring the problem of file encodings) for ULTIMATE BARE-METAL SPEED. It also has a few other differences; for example, it ships out to the standard library's sort function. The full code is below.
import qualified Data.ByteString as BS
import Data.Attoparsec.ByteString.Char8
import Control.Applicative
import Data.List

parser = many (decimal <* char '\n')

reallyParse p bs = case parse p bs of
    Partial f -> f BS.empty
    v -> v

main = do
    numbers <- BS.readFile "data"
    case reallyParse parser numbers of
        Done t r | BS.null t -> writeFile "sorted" . unlines . map show . sort $ r
More a Pythonista than a Haskellite, but I'll take a stab:
There's a fair bit of overhead in your measured runtime just reading and writing the files, which is probably pretty similar between the two programs. Also, be careful that you've warmed up the cache for both programs.
Most of your time is spent making copies of lists and fragments of lists. Python list operations are heavily optimized, being one of the most-frequently used parts of the language, and list comprehensions are usually pretty performant too, spending much of their time in C-land inside the Python interpreter. There is not a lot of the stuff that is slowish in Python but wicked fast in static languages, such as attribute lookups on object instances.
Your Python implementation throws away numbers that are equal to the pivot, so by the end it may be sorting fewer items, giving it an obvious advantage. (If there are no duplicates in the data set you're sorting, this isn't an issue.) Fixing this bug probably requires making another copy of most of the list in each call to qs(), which would slow Python down a little more.
You don't mention what version of Python you're using. If you're using 2.x, you could probably get Haskell to beat Python just by switching to Python 3.x. :-)
I'm not too surprised the two languages are basically neck-and-neck here (a 10% difference is not noteworthy). Using C as a performance benchmark, Haskell loses some performance for its lazy functional nature, while Python loses some performance due to being an interpreted language. A decent match.
Since Daniel Wagner posted an optimized Haskell version using the built-in sort, here's a similarly optimized Python version using list.sort():
mylist = [int(x.strip()) for x in open("data")]
mylist.sort()
open("sorted", "w").write("\n".join(str(x) for x in mylist))
3.5 seconds on my machine, vs. about 9 for the original code. Pretty much still neck-and-neck with the optimized Haskell. Reason: it's spending most of its time in C-programmed libraries. Also, TimSort (the sort used in Python) is a beast.
This is after the fact, but I think most of the trouble is in the Haskell writing. The following module is pretty primitive -- one should probably use builders and certainly avoid the ridiculous round trip via String for showing -- but it is simple, and it did distinctly better than pypy running kindall's improved Python, and better than the 2- and 4-second Haskell modules elsewhere on this page (it surprised me how much they were using lists, so I made a couple more turns of the crank).
$ time aa.hs          real 0m0.709s
$ time pypy aa.py     real 0m1.818s
$ time python aa.py   real 0m3.103s
I'm using the sort recommended for unboxed vectors from vector-algorithms. The use of Data.Vector.Unboxed in some form is clearly now the standard, naive way of doing this sort of thing -- it's the new Data.List (for Int, Double, etc.). Everything but the sort is irritating IO management, which could, I think, still be massively improved, on the write end in particular. The reading and sorting together take about 0.2 sec, as you can see from asking it to print what's at a bunch of indexes instead of writing to file; so twice as much time is spent writing as on everything else. If pypy is spending most of its time in timsort or whatever, then it looks like the sorting itself is surely massively better in Haskell, and just as simple -- if you can just get your hands on the darned vector...
I'm not sure why there aren't convenient functions around for reading and writing vectors of unboxed things from natural formats -- if there were, this program would be three lines long, would avoid String, and would be much faster; but maybe I just haven't seen them.
import qualified Data.ByteString.Lazy.Char8 as BL
import qualified Data.ByteString.Char8 as B
import qualified Data.Vector.Unboxed.Mutable as M
import qualified Data.Vector.Unboxed as V
import Data.Vector.Algorithms.Radix
import System.IO

main = do
    unsorted <- fmap toInts (BL.readFile "data")
    vec <- V.thaw unsorted
    sorted <- sort vec >> V.freeze vec
    withFile "sorted" WriteMode $ \handle ->
        V.mapM_ (writeLine handle) sorted

writeLine :: Handle -> Int -> IO ()
writeLine h int = B.hPut h $ B.pack (show int ++ "\n")

toInts :: BL.ByteString -> V.Vector Int
toInts bs = V.unfoldr oneInt (BL.cons ' ' bs)

oneInt :: BL.ByteString -> Maybe (Int, BL.ByteString)
oneInt bs = if BL.null bs then Nothing else
    let bstail = BL.tail bs
    in if BL.null bstail then Nothing else BL.readInt bstail
To follow up on kindall's interesting answer: those timings depend on the Python/Haskell implementation you use, on the hardware configuration on which you run the tests, and on the algorithm implementation you write in both languages.
Nevertheless, we can try to get some good hints of the relative performance of one language implementation compared to another, or of one language compared to another. With well-known algorithms like qsort, it's a good place to begin.
To illustrate a python/python comparison, I just tested your script on CPython 2.7.3 and PyPy 1.8 on the same machine:
CPython: ~8s
PyPy: ~2.5s
This shows there can be room for improvement in the language implementation; maybe compiled Haskell is not yet performing at its best on your corresponding code. If you are searching for speed in Python, also consider switching to PyPy, if needed and if your code permits you to do so.
I noticed a problem that, for some reason, everybody else missed; both your Haskell and your Python code have it. (Please tell me if it's fixed by the automatic optimizations; I know nothing about optimizations.) I will demonstrate it in Haskell.
In your code you define the lesser and greater lists like this:
where lesser  = filter (< p) xs
      greater = filter (>= p) xs
This is bad, because you compare each element in xs with p twice: once to get into the lesser list, and again to get into the greater list. This (theoretically; I haven't checked the timing) makes your sort use twice as many comparisons; that's a disaster. Instead, you should make a function which splits a list into two lists using a predicate, in such a way that
split f xs
is equivalent to
(filter f xs, filter (not . f) xs)
Using this kind of function you only need to compare each element in the list once to know which side of the tuple to put it in.
Okay, let's do it:
where
    split :: (a -> Bool) -> [a] -> ([a], [a])
    split _ [] = ([], [])
    split f (x:xs)
        | f x       = let (a, b) = split f xs in (x:a, b)
        | otherwise = let (a, b) = split f xs in (a, x:b)
Now let's replace the lesser/greater generator with
let (lesser, greater) = split (p>) xs in (insert function here)
Full code:

quicksort :: Ord a => [a] -> [a]
quicksort [] = []
quicksort (p:xs) =
    let (lesser, greater) = splitf (p >) xs
    in (quicksort lesser) ++ [p] ++ (quicksort greater)
  where
    splitf :: (a -> Bool) -> [a] -> ([a], [a])
    splitf _ [] = ([], [])
    splitf f (x:xs)
        | f x       = let (a, b) = splitf f xs in (x:a, b)
        | otherwise = let (a, b) = splitf f xs in (a, x:b)
For some reason I couldn't write the lesser/greater pair in the where clause, so I had to write it in a let clause.
Also, if it is not tail-recursive, let me know and fix it for me (I don't yet fully understand how tail recursion works).
Now you should do the same for the Python code. I don't know Python, so I can't do it for you.
EDIT:
It turns out such a function already exists in Data.List; it's called partition. Note that this confirms the need for this kind of function, because otherwise it wouldn't have been defined.
This shrinks the code to:
quicksort :: Ord a => [a] -> [a]
quicksort [] = []
quicksort (p:xs) =
    let (lesser, greater) = partition (p >) xs
    in (quicksort lesser) ++ [p] ++ (quicksort greater)
Python is really optimized for this sort of thing. I suspect that Haskell isn't. Here's a similar question that provides some very good answers.
I am trying to use hash tables in Haskell with the hashtables package, and finding that I cannot get anywhere near Python's performance. How can I achieve similar performance? Is it possible given current Haskell libraries and compilers? If not, what's the underlying issue?
Here is my Python code:
y = {}
for x in xrange(10000000):
    y[x] = x
print y[100]
Here's my corresponding Haskell code:
import qualified Data.HashTable.IO as H
import Control.Monad
main = do
    y <- H.new :: IO (H.CuckooHashTable Int Int)
    forM_ [1..10000000] $ \x -> H.insert y x x
    H.lookup y 100 >>= print
Here is another version using Data.Map, which is slower than both for me:
import qualified Data.Map as Map
import Data.List
import Control.Monad
main = do
    let m = foldl' (\m x -> Map.insert x x m) Map.empty [1..10000000]
    print $ Map.lookup 100 m
Interestingly enough, Data.HashMap performs very badly:
import qualified Data.HashMap.Strict as Map
import Data.List
main = do
    let m = foldl' (\m x -> Map.insert x x m) Map.empty [1..10000000]
    print $ Map.lookup 100 m
My suspicion is that Data.HashMap performs badly because unlike Data.Map, it is not spine-strict (I think), so foldl' is just a foldl, with the associated thunk buildup problems.
Note that I have used -prof and verified that the majority of the time is spent in the hashtables or Data.Map code, not on the forM_ or anything like that. All code was compiled with -O2 and no other parameters.
As reddit.com/u/cheecheeo suggested here, using Data.Judy, you'll get similar performance for your particular microbenchmark:
module Main where

import qualified Data.Judy as J
import Control.Monad (forM_)

main = do
    h <- J.new :: IO (J.JudyL Int)
    forM_ [0..10000000] $ \i -> J.insert (fromIntegral i) i h
    v <- J.lookup 100 h
    putStrLn $ show v
Timing the above:
$ time ./Main
Just 100
real 0m0.958s
user 0m0.924s
sys 0m0.032s
Timing the OP's Python code:
$ time ./main.py
100
real 0m1.067s
user 0m0.886s
sys 0m0.180s
The documentation for hashtables notes that "Cuckoo hashing, like the basic hash table implementation using linear probing, can suffer from long delays when the table is resized." You use new, which creates a new table of the default size. From looking at the source, it appears that the default size is 2. Inserting 10000000 items likely entails numerous resizings.
Try using newSized.
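For instance (a minimal sketch; the size hint of 10000000 simply matches the number of inserts in the benchmark):

import qualified Data.HashTable.IO as H
import Control.Monad (forM_)

main :: IO ()
main = do
    -- Pre-size the table so the inserts don't trigger repeated resizing.
    y <- H.newSized 10000000 :: IO (H.CuckooHashTable Int Int)
    forM_ [1..10000000] $ \x -> H.insert y x x
    H.lookup y 100 >>= print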
Given the times above, I thought I would throw in the Data.Map solution, which seems to be comparable to using newSized.
import qualified Data.Map as M

main = do
    print $ M.lookup 100 $ M.fromList $ map (\x -> (x, x)) [1..10000000]
I am building some moderately large DIMACS files; however, with the method used below, memory usage is rather large compared to the size of the files generated, and on some of the larger files I need to generate I run into out-of-memory problems.
import Control.Monad.State.Strict
import Control.Monad.Writer.Strict
import qualified Data.ByteString.Lazy.Char8 as B
import Control.Monad
import qualified Text.Show.ByteString as BS
import Data.List

main = printDIMACS "test.cnf" test

test = do
    xs <- freshs 100000
    forM_ (zip xs (tail xs))
        (\(x, y) -> addAll [[negate x, negate y], [x, y]])

type Var = Int
type Clause = [Var]

data DIMACSS = DS
    { nextFresh  :: Int
    , numClauses :: Int
    } deriving (Show)

type DIMACSM a = StateT DIMACSS (Writer B.ByteString) a

freshs :: Int -> DIMACSM [Var]
freshs i = do
    next <- gets nextFresh
    let toRet = [next .. next + i - 1]
    modify (\s -> s{nextFresh = next + i})
    return toRet

fresh :: DIMACSM Int
fresh = do
    i <- gets nextFresh
    modify (\s -> s{nextFresh = i + 1})
    return i

addAll :: [Clause] -> DIMACSM ()
addAll c = do
    tell
        (B.concat .
         intersperse (B.pack " 0\n") .
         map (B.unwords . map BS.show) $ c)
    tell (B.pack " 0\n")
    modify (\s -> s{numClauses = numClauses s + length c})

add h = addAll [h]

printDIMACS :: FilePath -> DIMACSM a -> IO ()
printDIMACS file f = do
    writeFile file ""
    appendFile file (concat ["p cnf ", show i, " ", show j, "\n"])
    B.appendFile file b
  where
    (s, b) = runWriter (execStateT f (DS 1 0))
    i = nextFresh s - 1
    j = numClauses s
I would like to keep the monadic building of clauses since it is very handy, but I need to overcome the memory problem. How do I optimize the above program so that it doesn't use too much memory?
If you want good memory behavior, you need to make sure that you write out the clauses as you generate them, instead of collecting them in memory and dumping them at the end, either using laziness or a more explicit approach such as conduits, enumerators, pipes, or the like.
The main obstacle to that approach is that the DIMACS format expects the number of clauses and variables in the header. This prevents the naive implementation from being sufficiently lazy. There are two possibilities:
The pragmatic one is to write the clauses first to a temporary location. After that the numbers are known, so you write them to the real file and append the contents of the temporary file.
The prettier approach is possible if the generation of clauses has no side effects (besides the effects offered by your DIMACSM monad) and is sufficiently fast: Run it twice, first throwing away the clauses and just calculating the numbers, print the header line, run the generator again; now printing the clauses.
(This is from my experience with implementing SAT-Britney, where I took the second approach, because it fitted better with other requirements in that context.)
Also, in your code, addAll is not lazy enough: the list c needs to be retained even after writing (in the MonadWriter sense) the clauses. This is another space leak. I suggest you implement add as the primitive operation and then define addAll = mapM_ add, as sketched below.
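A quick sketch of that refactoring, reusing the types and imports from the question:

-- add writes a single clause and bumps the counter; addAll just
-- traverses the list, so each clause can be collected after writing.
add :: Clause -> DIMACSM ()
add c = do
    tell (B.unwords (map BS.show c))
    tell (B.pack " 0\n")
    modify (\s -> s{numClauses = numClauses s + 1})

addAll :: [Clause] -> DIMACSM ()
addAll = mapM_ add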
As explained in Joachim Breitner's answer, the problem was that DIMACSM was not lazy enough, both because the strict versions of the monads were used and because the number of variables and clauses is needed before the ByteString can be written to the file. The solution is to use the lazy versions of the monads and to run them twice. It turns out it is also necessary to make WriterT the outer monad:
import Control.Monad.State
import Control.Monad.Writer

...

type DIMACSM a = WriterT B.ByteString (State DIMACSS) a

...

printDIMACS :: FilePath -> DIMACSM a -> IO ()
printDIMACS file f = do
    writeFile file ""
    appendFile file (concat ["p cnf ", show i, " ", show j, "\n"])
    B.appendFile file b
  where
    s = execState (execWriterT f) (DS 1 0)
    b = evalState (execWriterT f) (DS 1 0)
    i = nextFresh s - 1
    j = numClauses s
The Question
I have a finite list of values:
values :: [A]
... and an expensive, but pure, function on those values:
expensiveFunction :: A -> Maybe B
How do I run that function on each value in parallel and only return the first n results that complete with a Just and stop computation of the unfinished results?
takeJustsPar :: (NFData b) => Int -> (a -> Maybe b) -> [a] -> [b]
takeJustsPar maxJusts f as = ???
The Motivation
I know how I would do this using Control.Concurrent, but I wanted to experiment with Haskell's parallelism features. Also, the (scant) literature I could find seems to indicate that Haskell's parallelism features make it cheap to spawn parallel computations and to balance the workload across the available capabilities.
I attempted two solutions. The first uses the Par monad (i.e. Control.Monad.Par):
import Control.Monad.Par (Par, NFData)
import Control.Monad.Par.Combinator (parMap)
import Data.Maybe (catMaybes)
import Data.List.Split (chunksOf)
takeJustsPar :: (NFData b) => Int -> Int -> (a -> Maybe b) -> [a] -> Par [b]
takeJustsPar n chunkSize f as = go n (chunksOf chunkSize as) where
go _ [] = return []
go 0 _ = return []
go numNeeded (chunk:chunks) = do
evaluatedChunk <- parMap f chunk
let results = catMaybes evaluatedChunk
numFound = length results
numRemaining = numNeeded - numFound
fmap (results ++) $ go numRemaining chunks
The second attempt used Control.Parallel.Strategies:
import Control.Parallel.Strategies
import Data.List.Split (chunksOf)

chunkPar :: (NFData a) => Int -> Int -> [a] -> [a]
chunkPar innerSize outerSize as =
    concat ((chunksOf innerSize as) `using` (parBuffer outerSize rdeepseq))
The latter one ended up being MUCH more composable, since I could just write:
take n $ catMaybes $ chunkPar 1000 10 $ map expensiveFunction xs
... rather than baking the take and catMaybes behavior into the parallelism strategy.
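Putting the pieces together, a sketch of takeJustsPar matching the signature from the question (chunkPar is repeated from above for self-containedness; the chunk sizes 1000 and 10 are arbitrary, untuned choices):

import Control.Parallel.Strategies
import Data.List.Split (chunksOf)
import Data.Maybe (catMaybes)

chunkPar :: NFData a => Int -> Int -> [a] -> [a]
chunkPar innerSize outerSize as =
    concat (chunksOf innerSize as `using` parBuffer outerSize rdeepseq)

-- Laziness provides the early termination: take n stops demanding
-- further chunks once n Justs have been produced.
takeJustsPar :: NFData b => Int -> (a -> Maybe b) -> [a] -> [b]
takeJustsPar n f xs = take n (catMaybes (chunkPar 1000 10 (map f xs)))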
The latter solution also gives nearly perfect utilization. On the embarrassingly parallel problem I tested it on, it gave 99% utilization for 8 cores. I didn't test the utilization of the Par monad because I was borrowing a colleague's computer and didn't want to waste their time when I was satisfied with the performance of Control.Parallel.Strategies.
So the answer is to use Control.Parallel.Strategies, which gives much more composable behavior and great multi-core utilization.