Unable to force strictness in tail recursion

Unable to force strictness in tail recursion - haskell

I am dealing with memory leaks in my Haskell program and I was able to isolate it to very basic laziness problem in dealing with arrays. I understand what's happening there. First element of the array is computed while the rest produce the delayed computations which consumes the heap. Unfortunately, I was unable to force strictness for the entire array computation.
I tried various combinations of seq, BangPatterns, ($!) without much success.
import Control.Monad
force x = x `seq` x
loop :: [Int] -> IO ()
loop x = do
when (head x `mod` 10000000 == 0) $ print x
let x' = force $ map (+1) x
loop x'
main = loop $ replicate 200 1
The profile with standard profiling options didn't give me any more information than I already know:
ghc -prof -fprof-auto-calls -rtsopts test.hs
./test +RTS -M300M -p -hc
This runs out of memory in the matter of a few seconds.

force x = x `seq` x
That's useless. seq doesn't mean "evaluate this thing now"; it means "evaluate the left thing before returning the result of evaluating the right thing". When they're the same, it does nothing, and your force is equivalent to just id. Try this instead:
import Control.DeepSeq
import Control.Monad
loop :: [Int] -> IO ()
loop x = do
when (head x `mod` 10000000 == 0) $ print x
let x' = map (+1) x
loop $!! x'
main = loop $ replicate 200 1
That evaluates x' and everything in it before loop x', which is useful.
Alternatively, Control.DeepSeq has a force function that is useful. Its semantics in this case are "evaluate all of the elements of your list before returning the result of evaluating any of it". If you used its force function in place of your own, your original code would otherwise work, since the first line of loop does evaluate the beginning of the list.

Related

How to make foldl consume constant memory?

We define the following data type Stupid:
import qualified Data.Vector as V
import Data.List (foldl')
data Stupid = Stupid {content::V.Vector Int, ul::Int} deriving Show
Now I have two slightly different code.
foldl' (\acc x->Stupid{content=(content acc) V.// [(x,x+123)],ul=1}) (Stupid {content=V.replicate 10000 10,ul=1}) $ take 100000 $ cycle [0..9999]
takes constant memory (~100M), while
foldl' (\acc x->Stupid{content=(content acc) V.// [(x,x+123)],ul=ul acc}) (Stupid {content=V.replicate 10000 10,ul=1}) $ take 100000 $ cycle [0..9999]
takes a huge amount of memory(~8G).
Theoretically, only one copy of the current Stupid object is needed though out the process for both cases. I don't understand why there is such a difference in memory consumption if I want to access and record the ul acc.
Can someone explain why this happens and give a workaround for constant memory if I need to access ul acc? Thanks.
Note: I know that I can do replacements of a vector in batch, this script is just for demonstration purpose, so please don't modify that part.

I would try to force the fields of Stupid and see if that helps.
let f acc x = c `seq` a `seq` Stupid{content=c,ul=a}
where
c = content acc V.// [(x,x+123)]
a = ul acc
in foldl' f (Stupid {content=V.replicate 10000 10,ul=1}) $
take 100000 $
cycle [0..9999]
This should be nearly equivalent to forcing the parameters of the function:
foldl' (\acc x -> acc `seq` x `seq`
Stupid{content=(content acc) V.// [(x,x+123)],ul=ul acc})
(Stupid {content=V.replicate 10000 10,ul=1}) $ take 100000 $ cycle [0..9999]
(This can also be written with bang patterns, if one prefers those.)
Another option, more aggressive, would be to use strictness annotations in the definition of the Stupid constructor.
data ... = Stupid { content = ! someType , ul :: ! someOtherType }
This will always force those fields in the whole program.

Haskell parallel search with early abort

I'd like to search through a list, testing each element for property X and then return when an element with property X is found.
This list is very large and would benefit from parallelism, but the cost of the spark is rather high relative to the compute time. parListChunk would be great, but then it must search through the entire list.
Is there some way I can write something like parListChunk but with early abort?
This is the naive search code:
hasPropertyX :: Object -> Bool
anyObjectHasPropertyX :: [Object] -> Bool
anyObjectHasPropertyX [] = False
anyObjectHasPropertyX l
| hasPropertyX (head l) == True = True
| otherwise = anyObjectHasPropertyX (tail l)
and this is my first attempt at parallelism:
anyObjectHasPropertyXPar [] = False
anyObjectHasPropertyXPar [a] = hasPropertyX a
anyObjectHasPropertyXPar (a:b:rest) = runEval $ do c1 <- rpar (force (hasPropertyX a))
c2 <- rpar (force (hasPropertyX b))
rseq c1
rseq c2
if (c1 == True) || (c2 == True) then return True else return (anyObjectHasPropertyXPar rest)
This does run slightly faster than the naive code (even with -N1, oddly enough), but not by much (it helps a little by extending the number of parallel computations). I believe it's not benefitting much because it has to spark one thread for each element in the list.
Is there an approach similar to parListChunk that will only spark n threads and that allows for an early abort?
Edit: I'm having problems thinking about this because it seems that I would need to monitor the return value of all the threads. If I omit the rseq's and have something like
if (c1 == True) || (c2 == True) then ...
Is the runtime environment intelligent enough to monitor both threads and continue when either one of them returns?

I don't think you're going to have much luck using Control.Parallel.Strategies. A key feature of this module is that it expresses "deterministic parallelism" such that the result of the program is unaffected by the parallel evaluation. The problem you've described is fundamentally non-deterministic because threads are racing to find the first match.
Update: I see now that you're only returning True if the element is found, so the desired behavior is technically deterministic. So, perhaps there is a way to trick the Strategies module into working. Still, the implementation below seems to meet the requirements.
Here's an implementation of a parallel find parFind that runs in the IO monad using Control.Concurrent primitives and seems to do what you want. Two MVars are used: runningV keeps count of how many threads are still running to allow the last thread standing to detect search failure; and resultV is used to return Just the result or Nothing when search failure is detected by that last thread. Note that it is unlikely to perform better than a single-threaded implementation unless the test (your hasPropertyX above) is substantially more work than the list traversal, unlike this toy example.
import Control.Monad
import Control.Concurrent
import Data.List
import System.Environment
-- Thin a list to every `n`th element starting with index `i`
thin :: Int -> Int -> [a] -> [a]
thin i n = unfoldr step . drop i
where step [] = Nothing
step (y:ys) = Just (y, drop (n-1) ys)
-- Use `n` parallel threads to find first element of `xs` satisfying `f`
parFind :: Int -> (a -> Bool) -> [a] -> IO (Maybe a)
parFind n f xs = do
resultV <- newEmptyMVar
runningV <- newMVar n
comparisonsV <- newMVar 0
threads <- forM [0..n-1] $ \i -> forkIO $ do
case find f (thin i n xs) of
Just x -> void (tryPutMVar resultV (Just x))
Nothing -> do m <- takeMVar runningV
if m == 1
then void (tryPutMVar resultV Nothing)
else putMVar runningV (m-1)
result <- readMVar resultV
mapM_ killThread threads
return result
myList :: [Int]
myList = [1..1000000000]
-- Use `n` threads to find first element equal to `y` in `myList`
run :: Int -> Int -> IO ()
run n y = do x <- parFind n (== y) myList
print x
-- e.g., stack ghc -- -O2 -threaded SearchList.hs
-- time ./SearchList +RTS -N4 -RTS 4 12345 # find 12345 using 4 threads -> 0.018s
-- time ./SearchList +RTS -N4 -RTS 4 -1 # full search w/o match -> 6.7s
main :: IO ()
main = do [n,y] <- getArgs
run (read n) (read y)
Also, note that this version runs the threads on interleaved sublists rather than dividing the main list up into consecutive chunks. I did it this way because (1) it was easier to demonstrate that "early" elements were found quickly; and (2) my huge list means that memory usage can explode if the whole list needs to be kept in memory.
In fact, this example is a bit of a performance time bomb -- its memory usage is nondeterministic and can probably explode if one thread falls way behind so that a substantial portion of the whole list needs to be kept in memory.
In a real world example where the whole list is probably being kept in memory and the property test is expensive, you may find that breaking the list into chunks is faster.

Benchmarking Filter and Partition

I was testing the performance of the partition function for lists and got some strange results, I think.
We have that partition p xs == (filter p xs, filter (not . p) xs) but we chose the first implementation because it only performs a single traversal over the list. Yet, the results I got say that it maybe be better to use the implementation that uses two traversals.
Here is the minimal code that shows what I'm seeing
import Criterion.Main
import System.Random
import Data.List (partition)
mypartition :: (a -> Bool) -> [a] -> ([a],[a])
mypartition p l = (filter p l, filter (not . p) l)
randList :: RandomGen g => g -> Integer -> [Integer]
randList gen 0 = []
randList gen n = x:xs
where
(x, gen') = random gen
xs = randList gen' (n - 1)
main = do
gen <- getStdGen
let arg10000000 = randList gen 10000000
defaultMain [
bgroup "filters -- split list in half " [
bench "partition100" $ nf (partition (>= 50)) arg10000000
, bench "mypartition100" $ nf (mypartition (>= 50)) arg10000000
]
]
I ran the tests both with -O and without it and both times I get that the double traversals is better.
I am using ghc-7.10.3 with criterion-1.1.1.0
My questions are:
Is this expected?
Am I using Criterion correctly? I know that laziness can be tricky and (filter p xs, filter (not . p) xs) will only do two traversals if both elements of the tuple are used.
Does this has to do something with the way lists are handled in Haskell?
Thanks a lot!

There is no black or white answer to the question. To dissect the problem consider the following code:
import Control.DeepSeq
import Data.List (partition)
import System.Environment (getArgs)
mypartition :: (a -> Bool) -> [a] -> ([a],[a])
mypartition p l = (filter p l, filter (not . p) l)
main :: IO ()
main = do
let cnt = 10000000
xs = take cnt $ concat $ repeat [1 .. 100 :: Int]
args <- getArgs
putStrLn $ unwords $ "Args:" : args
case args of
[percent, fun]
-> let p = (read percent >=)
in case fun of
"partition" -> print $ rnf $ partition p xs
"mypartition" -> print $ rnf $ mypartition p xs
"partition-ds" -> deepseq xs $ print $ rnf $ partition p xs
"mypartition-ds" -> deepseq xs $ print $ rnf $ mypartition p xs
_ -> err
_ -> err
where
err = putStrLn "Sorry, I do not understand."
I do not use Criterion to have a better control about the order of evaluation. To get timings, I use the +RTS -s runtime option. The different test case are executed using different command line options. The first command line option defines for which percentage of the data the predicate holds. The second command line option chooses between different tests.
The tests distinguish two cases:
The data is generated lazily (2nd argument partition or mypartition).
The data is already fully evaluated in memory (2nd argument partition-ds or mypartition-ds).
The result of the partitioning is always evaluated from left to right, i.e. starting with the list that contains all the elements for which the predicate holds.
In case 1 partition has the advantage that elements of the first resulting list get discarded before all elements of the input list were even produced. Case 1 is especially good, if the predicate matches many elements, i.e. the first command line argument is large.
In case 2, partition cannot play out this advantage, since all elements are already in memory.
For mypartition, in any case all elements are held in memory after the first resulting list is evaluated, because they are needed again to compute the second resulting list. Therefore there is not much of a difference between the two cases.
It seems, the more memory is used, the harder garbage collection gets. Therefore partition is well suited, if the predicate matches many elements and the lazy variant is used.
Conversely, if the predicate does not match many elements or all elements are already in memory, mypartition performs better, since its recursion does not deal with pairs in contrast to partition.
The Stackoverflow question “Irrefutable pattern does not leak memory in recursion, but why?” might give some more insights about the handling of pairs in the recursion of partition.

Project euler 10 - [haskell] Why so inefficient?

Alright, so i've picked up project euler where i left off when using java, and i'm at problem 10. I use Haskell now and i figured it'd be good to learn some haskell since i'm still very much a beginner.
http://projecteuler.net/problem=10
My friend who still codes in java came up with a very straight forward way to implement the sieve of eratosthenes:
http://puu.sh/5zQoU.png
I tried implementing a better looking (and what i thought was gonna be a slightly more efficient) Haskell function to find all primes up to 2,000,000.
I came to this very elegant, yet apparently enormously inefficient function:
primeSieveV2 :: [Integer] -> [Integer]
primeSieveV2 [] = []
primeSieveV2 (x:xs) = x:primeSieveV2( (filter (\n -> ( mod n x ) /= 0) xs) )
Now i'm not sure why my function is so much slower than his (he claim his works in 5ms), if anything mine should be faster, since i only check composites once (they are removed from the list when they are found) whereas his checks them as many times as they can be formed.
Any help?

You don't actually have a sieve here. In Haskell you could write a sieve as
import Data.Vector.Unboxed hiding (forM_)
import Data.Vector.Unboxed.Mutable
import Control.Monad.ST (runST)
import Control.Monad (forM_, when)
import Prelude hiding (read)
sieve :: Int -> Vector Bool
sieve n = runST $ do
vec <- new (n + 1) -- Create the mutable vector
set vec True -- Set all the elements to True
forM_ [2..n] $ \ i -> do -- Loop for i from 2 to n
val <- read vec i -- read the value at i
when val $ -- if the value is true, set all it's multiples to false
forM_ [2*i, 3*i .. n] $ \j -> write vec j False
freeze vec -- return the immutable vector
main = print . ifoldl' summer 0 $ sieve 2000000
where summer s i b = if b then i + s else s
This "cheats" by using a mutable unboxed vector, but it's pretty darn fast
$ ghc -O2 primes.hs
$ time ./primes
142913828923
real: 0.238 s
This is about 5x faster than my benchmarking of augustss's solution.

To actually implement the sieve efficiently in Haskell you probably need to do it the Java way (i.e., allocate a mutable array an modify it).
For just generating primes I like this:
primes = 2 : filter (isPrime primes) [3,5 ..]
where isPrime (p:ps) x = p*p > x || x `rem` p /= 0 && isPrime ps x
And then you can print the sum of all primes primes < 2,000,000
main = print $ sum $ takeWhile (< 2000000) primes
You can speed it up by adding a type signature primes :: [Int].
But it works well with Integer as well and that also gives you the correct sum (which 32 bit Int will not).
See The Genuine Sieve of Eratosthenes for more information.

The time complexity of your code is n2 (in n primes produced). It is impractical to run for producing more than first 10...20 thousand primes.
The main problem with that code is not that it uses rem but that it starts its filters prematurely, so creates too many of them. Here's how you fix it, with a small tweak:
{-# LANGUAGE PatternGuards #-}
primes = 2 : sieve primes [3..]
sieve (p:ps) xs | (h,t) <- span (< p*p) xs = h ++ sieve ps [x | x <- t, rem x p /= 0]
-- sieve ps (filter (\x->rem x p/=0) t)
main = print $ sum $ takeWhile (< 100000) primes
This improves the time complexity by about n1/2 (in n primes produced) and gives it a drastic speedup: it gets to 100,000 75x faster. Your 28 seconds should become ~ 0.4 sec. But, you probably tested it in GHCi as interpreted code, not compiled. Marking it1) as :: [Int] and compiling with -O2 flag gives it another ~ 40x speedup, so it'll be ~ 0.01 sec. To reach 2,000,000 with this code takes ~ 90x longer, for a whopping ~ 1 sec of projected run time.
1) be sure to use sum $ map (fromIntegral :: Int -> Integer) $ takeWhile ... in main.
see also: http://en.wikipedia.org/wiki/Analysis_of_algorithms#Empirical_orders_of_growth

How to evaluate tuples in parallel using rpar Strategy in Haskell?

I stumbled upon a problem with Eval monad and rpar Strategy in Haskell. Consider following code:
module Main where
import Control.Parallel.Strategies
main :: IO ()
main = print . sum . inParallel2 $ [1..10000]
inParallel :: [Double] -> [Double]
inParallel xss = runEval . go $ xss
where
go [] = return []
go (x:xs) = do
x' <- rpar $ x + 1
xs' <- go xs
return (x':xs')
inParallel2 :: [Double] -> [Double]
inParallel2 xss = runEval . go $ xss
where
go [] = return []
go [x] = return $ [x + 1]
go (x:y:xs) = do
(x',y') <- rpar $ (x + 1, y + 1)
xs' <- go xs
return (x':y':xs'
I compile and run it like this:
ghc -O2 -Wall -threaded -rtsopts -fforce-recomp -eventlog eval.hs
./eval +RTS -N3 -ls -s
When I use inParallel function parallelism works as expected. In the output runtime statistics I see:
SPARKS: 100000 (100000 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
When I switch to inParallel2 function all parallelism is gone:
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
Why doesn't evaluation of tuples work in parallel? I tried forcing the tuple before passing it to rpar:
rpar $!! (x + 1, y + 1)
but still no result. What am I doing wrong?

The rpar strategy annotates a term for possible evaluation in parallel, but only up to weak head normal form, which essentially means, up to the outermost constructor. So for an integer or double, that means full evaluation, but for a pair, only the pair constructor, not its components, will get evaluated.
Forcing the pair before passing it to rpar is not going to help. Now you're evaluating the pair locally, before annotating the already evaluated tuple for possible parallel evaluation.
You probably want to combine the rpar with the rdeepseq strategy, thereby stating that the term should be completely evaluated, if possible in parallel. You can do this by saying
(rpar `dot` rdeepseq) (x + 1, y + 1)
The dot operator is for composing strategies.
There is, however, yet another problem with your code: pattern matching forces immediate evaluation, and therefore using pattern matching for rpar-annotated expressions is usually a bad idea. In particular, the line
(x',y') <- (rpar `dot` rdeepseq) (x + 1, y + 1)
will defeat all parallelism, because before the spark can be picked up for evaluation by another thread, the local thread will already start evaluating it in order to match the pattern. You can prevent this by using a lazy / irrefutable pattern:
~(x',y') <- (rpar `dot` rdeepseq) (x + 1, y + 1)
Or alternatively use fst and snd to access the components of the pair.
Finally, don't expect actual speedup if you create sparks that are as cheap as adding one to an integer. While sparks themselves are relatively cheap, they are not cost-free, so they work better if the computation you are annotating for parallel evaluation is somewhat costly.
You might want to read some tutorials on using strategies, such as Simon Marlow's
Parallel and Concurrent Programming using Haskell or my own Deterministic Parallel Programming in Haskell.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string