Process large lists in Haskell into a single value

I have 2 lists of Int of equal size (roughly 10,000 elements): say x and y. I need to compute a product of the following expression for each corresponding pair of elements from the lists: x_i/(x_i+y_i), i.e. x_i and y_i are the first elements of the lists, then second, etc.
My approaches work fine on small test cases, but ghci hangs for the larger lists. Any insight as to the cause and solution would be appreciated.
I tried to do this with fold, zipping the lists first:
getP :: [(Int, Int)] -> Double
getP zippedCounts =
  foldr (\(x,y) acc ->
           let intX = fromIntegral x
               intY = fromIntegral y
               intSum = intX + intY
           in acc*(intX/intSum))
        1.0
        zippedCounts
I also tried recursion:
getP lst [] = 1.0
getP [] lst = 1.0
getP (h1:t1) (h2:t2) = ((fromIntegral h1) / ((fromIntegral h1) + (fromIntegral h2))) * (getP t1 t2)
As well as list comprehension:
getP lst1 lst2 = (product [((fromIntegral x) / ((fromIntegral x) + (fromIntegral y)))|x <- lst1, y <- lst2])

All three solutions have space leaks, maybe that's what causes the unresponsiveness.
In Haskell, when reducing a big list to a single summary value, it is very easy to inadvertently cause space leaks if we never "look into" the intermediate values of the computation. We can end up with a gigantic tree of unevaluated thunks hiding behind a seemingly inoffensive single Double value.
The foldr example leaks because foldr has no accumulator to force: acc in the lambda stands for the result of folding the rest of the list, so each multiplication has to wait for the recursive call nested inside it, and the pending work grows with the length of the list. Use the strict left fold foldl' instead (you will need to reorder some function arguments). foldl' forces its accumulator to weak head normal form at every step, so the intermediate Double values remain "small" and thunks don't accumulate.
The explicit recursion example is dangerous because it is not tail-recursive and for large lists can cause a stack overflow (we repeatedly push values onto the stack, waiting for the next recursive call to complete). A solution would involve making the function tail-recursive by passing the intermediate result as an extra parameter, and putting a bang pattern on that parameter to ensure thunks don't accumulate.
The product example leaks because, unfortunately, neither the sum nor the product functions are strict. For large lists, it's better to use foldl' instead. (There's also a bug, as has been mentioned in the comments. The comprehension draws x and y from the two lists independently, so it multiplies over every combination of elements, about 10^8 factors for two 10,000-element lists, rather than over corresponding pairs.)
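The two fixes described above (a strict left fold, and tail recursion with a strict accumulator) might look roughly like the sketch below. The names getPFold and getPRec are made up for illustration and do not come from the question; both assume the lists are paired element by element and stop at the shorter one.

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.List (foldl')

-- Strict left fold over the zipped pairs: foldl' forces the running
-- product at every step, so no chain of thunks builds up.
getPFold :: [Int] -> [Int] -> Double
getPFold xs ys = foldl' step 1.0 (zip xs ys)
  where
    step acc (x, y) =
      let dx = fromIntegral x
          dy = fromIntegral y
      in acc * (dx / (dx + dy))

-- Tail-recursive variant: the running product is an extra parameter,
-- and the bang pattern forces it before each recursive call.
getPRec :: [Int] -> [Int] -> Double
getPRec = go 1.0
  where
    go !acc (x:xs) (y:ys) =
      let dx = fromIntegral x
          dy = fromIntegral y
      in go (acc * (dx / (dx + dy))) xs ys
    go !acc _ _ = acc
```

Either function can be called as getPFold xs ys or getPRec xs ys on the two original lists, with no need to zip them beforehand.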

You could try zipWith followed by product:
getP :: [Int] -> [Int] -> Double
getP xs ys = product $ zipWith func xs ys
  where
    func x y = let dx = fromIntegral x
                   dy = fromIntegral y
               in dx / (dx + dy)
I would avoid using explicit recursion and stick to library functions for speed. You could also compile with GHC optimization flags such as -O2 to speed up the compiled code.

Related

Will let construction optimize something in list comprehension?

I should write a function that sums elements in a list comprehension block.
Let's take these two functions just for example:
letSum :: [Int] -> [Int]
letSum xs = [result | x <- xs, y <- xs, let result = x + y, result > 10]
normalSum :: [Int] -> [Int]
normalSum xs = [x + y | x <- xs, y <- xs, x + y > 10]
Question:
Is the second function summing x and y twice, unlike the first one?
If not, how does it work?

The second function will compute the sum twice: there is no explicit sharing here, nor does Haskell perform memoization automatically (source: When is memoization automatic in GHC Haskell?).
let lets the sum be computed once and used in several places, so the first function will be slightly faster.
EDIT:
Someone in the comments mentioned CSE (common subexpression elimination) as a possible optimization here. I tried compiling your function with -ddump-cse to see whether it happens, but although I didn't find any mention of normalSum, the output was too cryptic for me to be sure. However, my answer should hold if you build your function without an -O* flag. I will update my answer if I find out more.

Are these premises about folds and recursion right?

When using foldr, the recursion occurs inside the function, so
when the given function doesn't strictly evaluate both sides and
can return based on the first one, foldr must be a good solution,
because it will work on infinite lists
findInt :: Int -> [Int] -> Bool
findInt z [] = False
-- The recursion occurs inside the given function
findInt z (x:xs)
  | z == x = True
  | otherwise = findInt z xs
equivalent to:
findInt' :: Int -> [Int] -> Bool
findInt' z = foldr (\x r -> if z == x then True else r) False
-- Where False is the "default value" (when it finds [], ex: findInt z [] = False)
A situation when foldr is not appropriate:
addAll :: Int -> [Int] -> Int
addAll z [] = z
-- The recursion occurs outside the given function (+)
addAll z (x:xs) = addAll (z + x) xs
In this case, because + is strict (needs to evaluate both sides to return)
it would be greatly useful if we applied it in some way which gave us
a redex (reducible expression) at each step, to make it possible to avoid thunks
and run (when evaluation is forced eagerly, not lazily) in constant
space and without pushing too much onto the stack
(similar to the advantages of a for loop in imperative algorithms)
addAll' :: Int -> [Int] -> Int
addAll' z [] = z
addAll' z (x:xs) = let z' = z + x
                   in seq z' $ addAll' z' xs
equivalent to:
addAll'' :: Int -> [Int] -> Int
addAll'' z = foldl' (+) z
In this little case, using foldr (inside recursion) doesn't make sense
because it wouldn't make redexes.
It would be like this:
addAll''' :: Int -> [Int] -> Int
addAll''' z [] = z
addAll''' z (x:xs) = (+) x $ addAll''' z xs
The main objective of this question is, first, to know whether my premises are
right or where they could be better and, second, to help make the differences
between inside and outside recursion clearer for others who are also learning
Haskell, so that it is clear which approach is more appropriate to a given
situation
Helpful links:
Haskell Wiki
Stackoverflow - Implications of foldr vs. foldl (or foldl')

Aside from the fact that foldr is the natural catamorphism of a list, while foldl and foldl' are not, a few guidelines for their use:
you are correct that foldr can return even on infinite lists, as long as the function is non-strict in its second argument, since the elements of the list are made available to the first argument of the function immediately (as opposed to foldl and foldl', where no application of the function can produce a result until the list has been entirely consumed);
foldl' will be a better choice for finite lists if you want to ensure constant space, since it's tail recursive and forces the accumulator at each step, but it will always traverse the entire list regardless of how strict the function passed to it is;
in general, foldr is equivalent to recursion, while foldl and foldl' are analogous to loops;
because of the fact that foldr is the natural catamorphism, if your function needs to recreate the list (for example, if your function is just the list constructor ':'), foldr would be more adequate;
with respect to foldl vs. foldl', foldl' is usually preferable because it will not build a huge thunk but, if the function passed to it is non strict in its first argument and the list is not infinite, foldl may return while foldl' may give an error (there is a good example in the Haskell wiki).
As a side note, I believe that you are using the term "inside recursion" to describe foldr and "outside recursion" for foldl and foldl', but I haven't seen these terms before in the literature. More commonly these functions are just referred to as folding from the right and folding from the left respectively; these terms may not be exactly correct, but they give a good notion of the order in which the elements of the list are passed to the function.
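The guidelines above can be tried out with a small sketch. All names here (containsThree, ?, sample, lazyResult, strictTotal) are hypothetical and chosen only for illustration; the ? operator is written in the spirit of the Haskell wiki example mentioned above, not copied from it.

```haskell
import Data.List (foldl')

-- foldr can stop early: (||) never looks at its second argument once the
-- first is True, so the rest of the infinite list is never demanded.
containsThree :: Bool
containsThree = foldr (\x r -> x == 3 || r) False ([1 ..] :: [Int])

-- A function that ignores its first argument when the second is 0.
(?) :: Int -> Int -> Int
_ ? 0 = 0
x ? y = x * y

sample :: [Int]
sample = [2, 3, undefined, 5, 0]

-- Lazy foldl builds the whole chain of applications first; the outermost one
-- sees 0 as its second argument and returns without touching undefined.
-- foldl' (?) 1 sample would instead force each intermediate accumulator
-- and fail when it reaches the undefined element.
lazyResult :: Int
lazyResult = foldl (?) 1 sample

-- For a strict function over a finite list, foldl' is the usual choice.
strictTotal :: Int
strictTotal = foldl' (+) 0 [1 .. 1000000]
```

Loading this in ghci and evaluating containsThree, lazyResult and strictTotal in turn shows each case; evaluating foldl' (?) 1 sample by hand shows the failure.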

Lazy Evaluation - Space Leak

Thinking Functionally with Haskell provides the following code for calculating the mean of a list of Floats.
mean :: [Float] -> Float
mean [] = 0
mean xs = sum xs / fromIntegral (length xs)
Prof. Richard Bird comments:
Now we are ready to see what is really wrong with mean: it has a space leak. Evaluating mean [1..1000] will cause the list to be expanded and retained in memory after summing because there is a second pointer to it, namely in the computation of its length.
If I understand this text correctly, he's saying that, if there was no pointer to xs in the length computation, then the xs memory could've been freed after calculating the sum?
My confusion is - if the xs is already in memory, isn't the length function simply going to use the same memory that's already being taken up?
I don't understand the space leak here.

The sum function does not need to keep the entire list in memory; it can look at an element at a time then forget it as it moves to the next element.
Because Haskell has lazy evaluation by default, if you have a function that creates a list, sum could consume it without the whole list ever being in memory (each time a new element is generated by the producing function, it would be consumed by sum then released).
The exact same thing happens with length.
On the other hand, the mean function feeds the list to both sum and length. So during the evaluation of sum, we need to keep the list in memory so it can be processed by length later.
[Update] to be clear, the list will be garbage collected eventually. The problem is that it stays longer than needed. In such a simple case it is not a problem, but in more complex functions that operate on infinite streams, this would most likely cause a memory leak.

Others have explained what the problem is. The cleanest solution is probably to use Gabriel Gonzalez's foldl package. Specifically, you'll want to use
import qualified Control.Foldl as L
import Control.Foldl (Fold)
import Control.Applicative
meanFold :: (Eq n, Fractional n) => Fold n (Maybe n)
meanFold = f <$> L.sum <*> L.genericLength where
  f _ 0 = Nothing
  f s l = Just (s/l)
mean :: (Eq n, Fractional n, Foldable f) => f n -> Maybe n
mean = L.fold meanFold

if there was no pointer to xs in the length computation, then the xs memory could've been freed after calculating the sum?
No, you're missing the important aspect of lazy evaluation here. You're right that length will use the same memory as was allocated during the sum call, the memory in which we had expanded the whole list.
But the point here is that allocating memory for the whole list shouldn't be necessary at all. If there was no length computation but only the sum, then memory could've been freed during calculating the sum. Notice that the list [1..1000] is lazily generated only when it is consumed, so in fact the mean [1..1000] should run in constant space.
You might write the function like the following, to get an idea of how to avoid such a space leak:
import Control.Arrow
mean [] = 0
mean xs = uncurry (/) $ foldr (\x -> (x+) *** (1+)) (0, 0) xs
-- or more verbosely
mean xs = let (sum, len) = foldr (\x (s, l) -> (x+s, 1+l)) (0, 0) xs
          in sum / len
which should traverse xs only once. However, Haskell is damn lazy - and computes the first tuple components only when evaluating sum and the second ones only later for len. We need to use some more tricks to actually force the evaluation:
{-# LANGUAGE BangPatterns #-}
import Data.List
mean [] = 0
mean xs = uncurry (/) $ foldl' (\(!s, !l) x -> (x+s, 1+l)) (0,0) xs
which really runs in constant space, as you can confirm in ghci by using :set +s.

The space leak is that the entire evaluated xs is held in memory for the length function. This is wasteful, as we aren't going to be using the actual values of the list after evaluating sum, nor do we need them all in memory at the same time, but Haskell doesn't know that.
A way to remove the space leak would be to recalculate the list each time:
sum [1..1000] / fromIntegral (length [1..1000])
Now the application can immediately start discarding values from the first list as it is evaluating sum, since it is not referenced anywhere else in the expression.
The same applies for length. The thunks it generates can be marked for deletion immediately, since nothing else could possibly want it evaluated further.
EDIT:
Implementation of sum in Prelude:
sum l = sum' l 0
  where
    sum' [] a = a
    sum' (x:xs) a = sum' xs (a+x)

Finding mean of list in Haskell

I think my code to find the mean of a list (of integers) works ok, but has a problem. This is my code
listlen xs = if null xs
             then 0
             else 1 + (listlen (tail xs))
sumx xs = if null xs
          then 0
          else (head xs) + sumx (tail xs)
mean xs = if null xs
          then 0
          else (fromIntegral (sumx xs)) / (fromIntegral (listlen xs))
my mean function has to go through the list twice. Once to get the sum of the elements, and once to get the number of elements. Obviously this is not great.
I would like to know a more efficient way to do this (using elementary Haskell - this is a question from Real World Haskell chapter 3.)

I like the other answers here. But I don't like that they write their recursion by hand. There are lots of ways to do this, but one handy one is to reuse the Monoid machinery we have in place.
Data.Monoid Data.Foldable> foldMap (\x -> (Sum x, Sum 1)) [15, 17, 19]
(Sum {getSum = 51}, Sum {getSum = 3})
The first part of the pair is the sum, and the second part is the length (computed as the sum of as many 1s as there are elements in the list). This is a quite general pattern: many statistics can actually be cast as monoids; and pairs of monoids are monoids; so you can compute as many statistics about a thing as you like in one pass using foldMap. You can see another example of this pattern in this question, which is where I got the idea.
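As a rough sketch of how the pair of monoids above could be turned into a mean function: the name meanFoldMap is hypothetical, the input is assumed to be a list of Double, and an empty list is mapped to Nothing rather than to 0. Note that the tuple is lazy, so on very long lists this version may still build up thunks; it illustrates the one-pass pattern rather than a tuned implementation.

```haskell
import Data.Monoid (Sum(..))

-- One traversal: each element contributes (Sum x, Sum 1), and the pair
-- monoid adds up both components as foldMap combines the results.
meanFoldMap :: [Double] -> Maybe Double
meanFoldMap xs =
  case foldMap (\x -> (Sum x, Sum (1 :: Int))) xs of
    (_, Sum 0)     -> Nothing                       -- empty input
    (Sum s, Sum n) -> Just (s / fromIntegral n)
```

For the list used above, meanFoldMap [15, 17, 19] divides the sum 51 by the count 3.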

What @simonzack is alluding to is that you should write listlen and sumx as folds.
Here is listlen written as a fold:
listlen :: [a] -> Int
listlen xs = go 0 xs               -- 0 = initial value of accumulator
  where go s [] = s                -- return accumulator
        go s (a:as) = go (s+1) as  -- compute the next value of the accumulator
                                   -- and recurse
Here s is an accumulator which is passed from one iteration of the helper function go to the next iteration. It is the value returned when the end of the list has been reached.
Writing sumx as a fold will look like:
sumx :: [Int] -> Int
sumx xs = go 0 xs
  where go s [] = s
        go s (a:as) = go ... as -- fill in the blank ...
The point is that given two folds you can always combine them so they are computed together.
lenAndSum :: [Int] -> (Int,Int)
lenAndSum xs = go (0,0) xs             -- (0,0) = initial values of both accumulators
  where go (s1,s2) [] = (s1,s2)        -- return both accumulators at the end
        go (s1,s2) (a:as) = go ... as  -- left as an exercise
Now you have computed both functions with one traversal of the list.

Define a helper function that only goes through once:
lengthAndSum xs = if null xs
                  then (0,0)
                  else let (a,b) = lengthAndSum (tail xs) in (a + 1, b + head xs)
mean xs = let (a, b) = lengthAndSum xs in (fromIntegral b / fromIntegral a)

Now, there's an idea: a function that takes a bunch of monoids and applies each to a list, all at the same time. But that can come later!
I think what you'll need is a fold and a tuple for a good speed:
avg :: (Fractional a) => [a] -> a
avg [] = error "Cannot take average of empty list"
avg nums = let (sum,count) = foldr (\e (s,c) -> (s+e,c+1)) (0,0) nums
           in sum / count
I've tried this out, and it's decently speedy in GHCi, though it may not be the best. I thought up a recursive method, too, though it requires a helper function:
avg :: (Fractional a) => [a] -> a
avg [] = error "Cannot take average of empty list"
avg nums = let (sum, count) = go nums
           in sum / count
  where go [] = (0,0)
        go (x:xs) = let (sum',count') = go xs
                    in (sum' + x, count' + 1)
Then again, that's really slow. Painfully slow.
Looking at your solution, that's alright, but it's not really idiomatic Haskell. if statements within functions tend to work better as pattern matches, especially if the Eq class instance isn't defined for such-and-such a datatype. Furthermore, as my example illustrated, folds are beautiful! They allow Haskell to be lazy and therefore are faster. That's my feedback and my suggestion in response to your question.

Why is this tail-recursive Haskell function slower?

I was trying to implement a Haskell function that takes as input an array of integers A and produces another array B = [A[0], A[0]+A[1], A[0]+A[1]+A[2], ...]. I know that scanl from Data.List can be used for this with the function (+). I wrote the second implementation (which performs faster) after seeing the source code of scanl. I want to know why the first implementation is slower than the second one, despite being tail-recursive.
-- This function works slow.
ps s x [] = x
ps s x y = ps s' x' y'
  where
    s' = s + head y
    x' = x ++ [s']
    y' = tail y
-- This function works fast.
ps' s [] = []
ps' s y = [s'] ++ (ps' s' y')
  where
    s' = s + head y
    y' = tail y
Some details about the above code:
Implementation 1 : It should be called as
ps 0 [] a
where 'a' is your array.
Implementation 2: It should be called as
ps' 0 a
where 'a' is your array.

You are changing the way that ++ associates. In your first function you are computing ((([a0] ++ [a1]) ++ [a2]) ++ ...) whereas in the second function you are computing [a0] ++ ([a1] ++ ([a2] ++ ..)). Appending a few elements to the start of the list is O(1), whereas appending a few elements to the end of a list is O(n) in the length of the list. This leads to a linear versus quadratic algorithm overall.
You can fix the first example by building the list up in reverse order, and then reversing again at the end, or by using something like dlist. However the second will still be better for most purposes. While tail calls do exist and can be important in Haskell, if you are familiar with a strict functional language like Scheme or ML your intuition about how and when to use them is completely wrong.
The second example is better, in large part, because it's incremental; it immediately starts returning data that the consumer might be interested in. If you just fixed the first example using the double-reverse or dlist tricks, your function will traverse the entire list before it returns anything at all.
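To make the two options above concrete, here is a small sketch of the "build in reverse, then reverse once" fix next to the incremental version. The names prefixSumsRev and prefixSums are hypothetical, and the element type is fixed to Int only to keep the example short.

```haskell
-- Accumulator version made linear: cons each running sum onto the front
-- (O(1) per step) and reverse once at the end, instead of appending with ++.
-- Nothing is returned until the whole input has been traversed.
prefixSumsRev :: [Int] -> [Int]
prefixSumsRev = go 0 []
  where
    go _ acc []       = reverse acc
    go s acc (y : ys) =
      let s' = s + y
      in s' `seq` go s' (s' : acc) ys

-- Incremental version, same shape as ps': each running sum is handed to the
-- consumer before the rest of the list is looked at.
prefixSums :: [Int] -> [Int]
prefixSums = go 0
  where
    go _ []       = []
    go s (y : ys) = let s' = s + y in s' : go s' ys
```

Because prefixSums is incremental, an expression such as take 3 (prefixSums [1 ..]) terminates, whereas prefixSumsRev on the same infinite input would not.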

I would like to mention that your function can be more easily expressed as
drop 1 . scanl (+) 0
Usually, it is a good idea to use predefined combinators like scanl rather than writing your own recursion schemes; it improves readability and makes it less likely that you needlessly squander performance.
However, in this case, both my scanl version and your original ps and ps' can sometimes lead to stack overflows due to lazy evaluation: Haskell does not necessarily immediately evaluate the additions (depends on strictness analysis).
One case where you can see this is if you do last (ps' 0 [1..100000000]). That leads to a stack overflow. You can solve that problem by forcing Haskell to evaluate the additions immediately, for instance by defining your own, strict scanl:
myscanl :: (b -> a -> b) -> b -> [a] -> [b]
myscanl f q [] = []
myscanl f q (x:xs) = q `seq` let q' = f q x in q' : myscanl f q' xs
ps' = myscanl (+) 0
Then, calling last (ps' [1..100000000]) works.
