How to exploit any parallelism in my haskell parallel code? - haskell

I've just started working with Haskell's semi-explicit parallelism in GHC 6.12. I've written the following Haskell code to compute, in parallel, the map of the Fibonacci function over four elements of a list and, at the same time, the map of the function sumEuler over two elements.
import Control.Parallel
import Control.Parallel.Strategies
fib :: Int -> Int
fib 0 = 0
fib 1 = 1
fib n = fib (n-1) + fib (n-2)
mkList :: Int -> [Int]
mkList n = [1..n-1]
relprime :: Int -> Int -> Bool
relprime x y = gcd x y == 1
euler :: Int -> Int
euler n = length (filter (relprime n) (mkList n))
sumEuler :: Int -> Int
sumEuler = sum . (map euler) . mkList
-- parallel initiation of list walk
mapFib :: [Int]
mapFib = map fib [37, 38, 39, 40]
mapEuler :: [Int]
mapEuler = map sumEuler [7600, 7600]
parMapFibEuler :: Int
parMapFibEuler = (forceList mapFib) `par` (forceList mapEuler `pseq` (sum mapFib + sum mapEuler))
-- how to evaluate in whnf form by forcing
forceList :: [a] -> ()
forceList [] = ()
forceList (x:xs) = x `pseq` (forceList xs)
main = do putStrLn (" sum : " ++ show parMapFibEuler)
To improve my program I rewrote it with par and pseq, plus a forcing function to force WHNF evaluation. My problem is that, looking at ThreadScope, it appears that I didn't gain any parallelism. Worse, I didn't gain any speedup.
That is why I have these two questions:
Question 1 How could I modify my code to exploit any parallelism ?
Question 2 How could I write my program in order to use Strategies (parMap, parList, rdeepseq and so on ...) ?
First improvement with Strategies
Following the suggestion in one of the answers:
parMapFibEuler = (mapFib, mapEuler) `using` s `seq` (sum mapFib + sum mapEuler) where
  s = parTuple2 (seqList rseq) (seqList rseq)
Parallelism now appears in ThreadScope, but not enough to give a significant speedup.

The reason you aren't seeing any parallelism here is because your spark has been garbage collected. Run the program with +RTS -s and note this line:
SPARKS: 1 (0 converted, 1 pruned)
The spark has been "pruned", which means it was removed by the garbage collector. In GHC 7 we made a change to the semantics of sparks, such that a spark is now garbage collected (GC'd) if it is not referred to by the rest of the program; the details are in the "Seq no more" paper.
Why is the spark GC'd in your case? Look at the code:
parMapFibEuler :: Int
parMapFibEuler = (forceList mapFib) `par` (forceList mapEuler `pseq` (sum mapFib + sum mapEuler))
The spark here is the expression forceList mapFib. Note that the value of this expression is not required by the rest of the program; it only appears as an argument to par. GHC knows that it isn't required, so it gets garbage collected.
The whole point of the recent changes to the parallel package was to let you easily avoid this bear trap. A good rule of thumb is to use Control.Parallel.Strategies rather than par and pseq directly. My preferred way to write this would be
parMapFibEuler :: Int
parMapFibEuler = runEval $ do
  a <- rpar $ sum mapFib
  b <- rseq $ sum mapEuler
  return (a+b)
but sadly this doesn't work with GHC 7.0.2, because the spark sum mapFib is floated out as a static expression (a CAF), and the runtime doesn't think sparks that point to static expressions are worth keeping (I'll fix this). This wouldn't happen in a real program, of course! So let's make the program a bit more realistic and defeat the CAF optimisation:
parMapFibEuler :: Int -> Int
parMapFibEuler n = runEval $ do
  a <- rpar $ sum (take n mapFib)
  b <- rseq $ sum (take n mapEuler)
  return (a+b)
-- getArgs needs: import System.Environment (getArgs)
main = do
  [n] <- fmap (fmap read) getArgs
  putStrLn (" sum : " ++ show (parMapFibEuler n))
Now I get good parallelism with GHC 7.0.2. However, note that @John's comments also apply: generally you want to look for more fine-grained parallelism so as to let GHC use all your processors.

Your parallelism is far too coarse-grained to have much beneficial effect. The largest chunks of work that can be done in parallel efficiently are in sumEuler, so that's where you should add your par annotations. Try changing sumEuler to:
sumEuler :: Int -> Int
sumEuler = sum . (parMap rseq euler) . mkList
parMap is from Control.Parallel.Strategies; it expresses a map that can be done in parallel. The first argument, rseq having type Strategy a, is used to force the computation to a specific point, otherwise no work would be done, due to laziness. rseq is fine for most numeric types.
It's not useful to add parallelism to fib here; below about fib 40 there isn't enough work to make it worthwhile.
In addition to threadscope, it's useful to run your program with the -s flag. Look for a line like:
SPARKS: 15202 (15195 converted, 0 pruned)
in the output. Each spark is an entry in a work queue to possibly be performed in parallel. Converted sparks are actually done in parallel, while pruned sparks mean that the main thread got to them before a worker thread had the chance to do so. If the pruned number is high, it means your parallel expressions are too fine-grained. If the total number of sparks is low, you aren't trying to do enough in parallel.
Finally, I think parMapFibEuler is better written as:
parMapFibEuler :: Int
parMapFibEuler = sum (mapFib `using` parList rseq) + sum mapEuler
mapEuler is simply too short to have any parallelism usefully expressed here, especially as euler is already performed in parallel. I'm doubtful that it makes a substantial difference for mapFib either. If the lists mapFib and mapEuler were longer, parallelism here would be more useful. Instead of parList you may be able to use parBuffer, which tends to work well for list consumers.
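As a rough sketch of the parBuffer suggestion, the version below sparks the elements of the euler list through a rolling buffer instead of all at once. The name sumEulerBuffered and the buffer size of 100 are assumptions chosen for illustration, not values from the answer; the definition of euler is inlined from the question so the snippet stands alone.

```haskell
import Control.Parallel.Strategies (parBuffer, rseq, using)

-- Euler's totient by brute force, as in the question.
euler :: Int -> Int
euler n = length (filter (\k -> gcd n k == 1) [1 .. n - 1])

-- Hypothetical variant of sumEuler: at most bufferSize sparks are
-- kept ahead of the consumer (sum) while the list is traversed.
sumEulerBuffered :: Int -> Int -> Int
sumEulerBuffered bufferSize n =
  sum (map euler [1 .. n - 1] `using` parBuffer bufferSize rseq)

main :: IO ()
main = print (sumEulerBuffered 100 2000)
```

To see any parallelism this still has to be compiled with -threaded and run with +RTS -N, and the buffer size is something to tune by measurement.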
Making these two changes cuts the runtime from 12s to 8s for me, with GHC 7.0.2.

Hmmm... Maybe?
((forceList mapFib) `par` (forceList mapEuler)) `pseq` (sum mapFib + sum mapEuler)
I.e. spawn mapFib in background and calculate mapEuler and only after it (mapEuler) do (+) of their sums.
Actually I guess you can do something like:
parMapFibEuler = a `par` b `pseq` (a+b) where
  a = sum mapFib
  b = sum mapEuler
About Q2:
As far as I know, strategies are a way to combine data structures with par and seq.
You can write your forceList as withStrategy (seqList rseq).
You can also write your code like this:
parMapFibEuler = (mapFib, mapEuler) `using` s `seq` (sum mapFib + sum mapEuler) where
  s = parTuple2 (seqList rseq) (seqList rseq)
That is, a strategy applied to a tuple of two lists will force their evaluation in parallel, but each list will be forced sequentially.

First off, I assume you know your fib definition is awful and you're just doing this to play with the parallel package.
You seem to be going for parallelism at the wrong level. Parallelizing mapFib and mapEuler won't give a good speed-up because there is more work to compute mapFib. What you should do is compute each of these very expensive elements in parallel, which is slightly finer grain but not overly so:
mapFib :: [Int]
mapFib = parMap rdeepseq fib [37, 38, 39, 40]
mapEuler :: [Int]
mapEuler = parMap rdeepseq sumEuler [7600, 7600, 7600,7600]
parMapFibEuler :: Int
parMapFibEuler = sum a + sum b
  where
    a = mapFib
    b = mapEuler
Also, I originally fought using Control.Parallel.Strategies over Control.Parallel but have come to like it as it is more readable and avoids issues like yours where one would expect parallelism and have to squint at it to figure out why you aren't getting any.
Finally, you should always post how you compile and how you run code you're expecting to be parallelized. For example:
$ ghc --make -rtsopts -O2 -threaded so.hs -eventlog -fforce-recomp
[1 of 1] Compiling Main ( so.hs, so.o )
Linking so ...
$ ./so +RTS -ls -N2
sum : 299045675
Yields:

Related

Project euler 10 - [haskell] Why so inefficient?

Alright, so i've picked up project euler where i left off when using java, and i'm at problem 10. I use Haskell now and i figured it'd be good to learn some haskell since i'm still very much a beginner.
http://projecteuler.net/problem=10
My friend who still codes in java came up with a very straight forward way to implement the sieve of eratosthenes:
http://puu.sh/5zQoU.png
I tried implementing a better looking (and what i thought was gonna be a slightly more efficient) Haskell function to find all primes up to 2,000,000.
I came to this very elegant, yet apparently enormously inefficient function:
primeSieveV2 :: [Integer] -> [Integer]
primeSieveV2 [] = []
primeSieveV2 (x:xs) = x:primeSieveV2( (filter (\n -> ( mod n x ) /= 0) xs) )
Now I'm not sure why my function is so much slower than his (he claims his runs in 5 ms). If anything, mine should be faster, since I only check composites once (they are removed from the list when they are found), whereas his checks them as many times as they can be formed.
Any help?
You don't actually have a sieve here. In Haskell you could write a sieve as
import Data.Vector.Unboxed hiding (forM_)
import Data.Vector.Unboxed.Mutable
import Control.Monad.ST (runST)
import Control.Monad (forM_, when)
import Prelude hiding (read)
sieve :: Int -> Vector Bool
sieve n = runST $ do
  vec <- new (n + 1)          -- Create the mutable vector
  set vec True                -- Set all the elements to True
  forM_ [2..n] $ \i -> do     -- Loop for i from 2 to n
    val <- read vec i         -- Read the value at i
    when val $                -- If the value is True, set all its multiples to False
      forM_ [2*i, 3*i .. n] $ \j -> write vec j False
  freeze vec                  -- Return the immutable vector
-- Note: indices 0 and 1 are never cleared, so the sum below also counts 1.
main = print . ifoldl' summer 0 $ sieve 2000000
  where summer s i b = if b then i + s else s
This "cheats" by using a mutable unboxed vector, but it's pretty darn fast
$ ghc -O2 primes.hs
$ time ./primes
142913828923
real: 0.238 s
This is about 5x faster than my benchmarking of augustss's solution.
To actually implement the sieve efficiently in Haskell you probably need to do it the Java way (i.e., allocate a mutable array and modify it).
For just generating primes I like this:
primes = 2 : filter (isPrime primes) [3,5 ..]
  where isPrime (p:ps) x = p*p > x || x `rem` p /= 0 && isPrime ps x
And then you can print the sum of all primes < 2,000,000:
main = print $ sum $ takeWhile (< 2000000) primes
You can speed it up by adding a type signature primes :: [Int].
But it works well with Integer as well and that also gives you the correct sum (which 32 bit Int will not).
See The Genuine Sieve of Eratosthenes for more information.
The time complexity of your code is n^2 (in n primes produced). It is impractical to run it to produce more than the first 10 to 20 thousand primes.
The main problem with that code is not that it uses rem but that it starts its filters prematurely, so creates too many of them. Here's how you fix it, with a small tweak:
{-# LANGUAGE PatternGuards #-}
primes = 2 : sieve primes [3..]
sieve (p:ps) xs | (h,t) <- span (< p*p) xs = h ++ sieve ps [x | x <- t, rem x p /= 0]
-- sieve ps (filter (\x->rem x p/=0) t)
main = print $ sum $ takeWhile (< 100000) primes
This improves the time complexity by about n^(1/2) (in n primes produced) and gives it a drastic speedup: it gets to 100,000 75x faster. Your 28 seconds should become ~0.4 sec. But you probably tested it in GHCi as interpreted code, not compiled. Marking it [1] as :: [Int] and compiling with the -O2 flag gives it another ~40x speedup, so it'll be ~0.01 sec. To reach 2,000,000 with this code takes ~90x longer, for a whopping ~1 sec of projected run time.
[1] Be sure to use sum $ map (fromIntegral :: Int -> Integer) $ takeWhile ... in main.
see also: http://en.wikipedia.org/wiki/Analysis_of_algorithms#Empirical_orders_of_growth
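The empirical order of growth mentioned in that link can be estimated from just two timings: if run time behaves like t ~ n^b, then b = log(t2/t1) / log(n2/n1). The helper below is a minimal sketch of that formula; the name empiricalOrder and the timings in main are made up for illustration and are not measurements of the code above.

```haskell
-- | Estimate the exponent b in t ~ n^b from two
--   (problem size, run time) measurements.
empiricalOrder :: (Double, Double) -> (Double, Double) -> Double
empiricalOrder (n1, t1) (n2, t2) = logBase (n2 / n1) (t2 / t1)

main :: IO ()
main = do
  -- Hypothetical timings: doubling n quadruples t, i.e. roughly n^2.
  print (empiricalOrder (1000, 0.5) (2000, 2.0))
  -- Hypothetical timings: 4x the size costs 8x the time, i.e. roughly n^1.5.
  print (empiricalOrder (1000, 0.5) (4000, 4.0))
```

Measuring at two or three sizes and comparing the resulting exponents is usually enough to tell an n^2 algorithm from an n^1.5 one.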

How to evaluate tuples in parallel using rpar Strategy in Haskell?

I stumbled upon a problem with the Eval monad and the rpar Strategy in Haskell. Consider the following code:
module Main where
import Control.Parallel.Strategies
main :: IO ()
main = print . sum . inParallel2 $ [1..10000]
inParallel :: [Double] -> [Double]
inParallel xss = runEval . go $ xss
  where
    go [] = return []
    go (x:xs) = do
      x' <- rpar $ x + 1
      xs' <- go xs
      return (x':xs')
inParallel2 :: [Double] -> [Double]
inParallel2 xss = runEval . go $ xss
  where
    go [] = return []
    go [x] = return $ [x + 1]
    go (x:y:xs) = do
      (x',y') <- rpar $ (x + 1, y + 1)
      xs' <- go xs
      return (x':y':xs')
I compile and run it like this:
ghc -O2 -Wall -threaded -rtsopts -fforce-recomp -eventlog eval.hs
./eval +RTS -N3 -ls -s
When I use inParallel function parallelism works as expected. In the output runtime statistics I see:
SPARKS: 100000 (100000 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
When I switch to inParallel2 function all parallelism is gone:
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
Why doesn't evaluation of tuples work in parallel? I tried forcing the tuple before passing it to rpar:
rpar $!! (x + 1, y + 1)
but still no result. What am I doing wrong?
The rpar strategy annotates a term for possible evaluation in parallel, but only up to weak head normal form, which essentially means, up to the outermost constructor. So for an integer or double, that means full evaluation, but for a pair, only the pair constructor, not its components, will get evaluated.
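The difference between evaluating a pair to weak head normal form and evaluating its components can be seen without any parallelism at all. The sketch below uses only base; the names trapped, pairReachesWhnf and throwsWhenForced are invented for this illustration.

```haskell
import Control.Exception (SomeException, evaluate, try)

-- A pair whose components both crash if anyone evaluates them.
trapped :: (Int, Int)
trapped = (error "first component forced", error "second component forced")

-- seq stops at the outermost constructor, so this is True and nothing crashes.
pairReachesWhnf :: Bool
pairReachesWhnf = trapped `seq` True

-- Reports whether evaluating a value to WHNF raised an exception.
throwsWhenForced :: a -> IO Bool
throwsWhenForced x = do
  r <- try (evaluate x)
  return $ case r of
    Left e  -> (e :: SomeException) `seq` True
    Right _ -> False

main :: IO ()
main = do
  print pairReachesWhnf                      -- prints True: only the (,) was built
  throwsWhenForced (fst trapped) >>= print   -- prints True: the component itself fails
```

A spark created by rpar on a pair does the same amount of work as the first line of main: it builds the pair constructor and leaves both additions unevaluated.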
Forcing the pair before passing it to rpar is not going to help. Now you're evaluating the pair locally, before annotating the already evaluated tuple for possible parallel evaluation.
You probably want to combine the rpar with the rdeepseq strategy, thereby stating that the term should be completely evaluated, if possible in parallel. You can do this by saying
(rpar `dot` rdeepseq) (x + 1, y + 1)
The dot operator is for composing strategies.
There is, however, yet another problem with your code: pattern matching forces immediate evaluation, and therefore using pattern matching for rpar-annotated expressions is usually a bad idea. In particular, the line
(x',y') <- (rpar `dot` rdeepseq) (x + 1, y + 1)
will defeat all parallelism, because before the spark can be picked up for evaluation by another thread, the local thread will already start evaluating it in order to match the pattern. You can prevent this by using a lazy / irrefutable pattern:
~(x',y') <- (rpar `dot` rdeepseq) (x + 1, y + 1)
Or alternatively use fst and snd to access the components of the pair.
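A sketch of the fst/snd alternative, assuming the same inParallel2 shape as in the question: the pair is bound to a plain variable (p, a name chosen here), so nothing in the local thread inspects it before the spark exists.

```haskell
import Control.Parallel.Strategies (dot, rdeepseq, rpar, runEval)

inParallel2 :: [Double] -> [Double]
inParallel2 = runEval . go
  where
    go []  = return []
    go [x] = return [x + 1]
    go (x:y:xs) = do
      -- Spark full evaluation of the pair; binding it to a variable
      -- does not force it, unlike matching on (x', y').
      p   <- (rpar `dot` rdeepseq) (x + 1, y + 1)
      xs' <- go xs
      -- fst and snd are only evaluated when the result list is consumed.
      return (fst p : snd p : xs')

main :: IO ()
main = print (sum (inParallel2 [1 .. 10000]))
```

As the answer points out, sparks this small are unlikely to pay for themselves; the sketch only shows how to avoid forcing the pair too early.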
Finally, don't expect actual speedup if you create sparks that are as cheap as adding one to an integer. While sparks themselves are relatively cheap, they are not cost-free, so they work better if the computation you are annotating for parallel evaluation is somewhat costly.
You might want to read some tutorials on using strategies, such as Simon Marlow's Parallel and Concurrent Programming using Haskell or my own Deterministic Parallel Programming in Haskell.

Slowdown when using parallel strategies in Haskell

I was working through the exercises of Andres Löh's Deterministic Parallel Programming in Haskell. I was trying to convert the sequential N-Queens code into parallel code using strategies, but I noticed that the parallel code runs much slower than the sequential code and also errors out with insufficient stack space.
This is the code for the parallel N-Queens:
import Control.Monad
import System.Environment
import GHC.Conc
import Control.Parallel.Strategies
import Data.List
import Data.Function
type PartialSolution = [Int] -- per column, list the row the queen is in
type Solution = PartialSolution
type BoardSize = Int
chunk :: Int -> [a] -> [[a]]
chunk n [] = []
chunk n xs = case splitAt n xs of
  (ys, zs) -> ys : chunk n zs
-- Generate all solutions for a given board size.
queens :: BoardSize -> [Solution]
--queens n = iterate (concatMap (addQueen n)) [[]] !! n
queens n = iterate (\l -> concat (map (addQueen n) l `using` parListChunk (n `div` numCapabilities) rdeepseq)) [[]] !! n
-- Given the size of the problem and a partial solution for the
-- first few columns, find all possible assignments for the next
-- column and extend the partial solution.
addQueen :: BoardSize -> PartialSolution -> [PartialSolution]
addQueen n s = [ x : s | x <- [1..n], safe x s 1 ]
-- Given a row number, a partial solution and an offset, check
-- that a queen placed at that row threatens no queen in the
-- partial solution.
safe :: Int -> PartialSolution -> Int -> Bool
safe x [] n = True
safe x (c:y) n = x /= c && x /= c + n && x /= c - n && safe x y (n + 1)
main = do
  [n] <- getArgs
  print $ length $ queens (read n)
The line (\l -> concat (map (addQueen n) l `using` parListChunk (n `div` numCapabilities) rdeepseq)) is what I changed from the original code. I have seen Simon Marlow's solution but I wanted to know the reason for the slowdown and error in my code.
Thanks in advance.
You are sparking way too much work. The parListChunk parameter of div n numCapabilities is probably, what, 7 on your system (2 cores and you're running with n ~ 14). The list is going to grow large very quickly so there is no point in sparking such small units of work (and I don't see why it makes sense tying it to the value of n).
If I add a factor of ten (making the sparking unit 70 in this case) then I get a clear performance win over single threading. Also, I don't have the stack issue you refer to - if it goes away with a change to your parListChunk value then I'd report that as a bug.
If I make the chunking every 800 then the times top off at 5.375s vs 7.9s. Over 800 and the performance starts to get worse again, ymmv.
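For reference, here is roughly what that change looks like when the chunk size is passed in explicitly instead of being derived from n. The name queensChunked is an assumption for this sketch; addQueen and safe are copied from the question, and 70 and 800 are the chunk sizes mentioned above.

```haskell
import Control.Parallel.Strategies (parListChunk, rdeepseq, using)

type PartialSolution = [Int]

addQueen :: Int -> PartialSolution -> [PartialSolution]
addQueen n s = [x : s | x <- [1 .. n], safe x s 1]

safe :: Int -> PartialSolution -> Int -> Bool
safe _ [] _ = True
safe x (c:y) n = x /= c && x /= c + n && x /= c - n && safe x y (n + 1)

-- Hypothetical variant of queens: each spark extends chunkSize partial
-- solutions, independent of the board size.
queensChunked :: Int -> Int -> [PartialSolution]
queensChunked chunkSize n = iterate step [[]] !! n
  where
    step l = concat (map (addQueen n) l `using` parListChunk chunkSize rdeepseq)

main :: IO ()
main = print (length (queensChunked 800 8))
```

The best chunk size depends on the machine and on n, so it is worth timing a few values rather than trusting any single number.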
EDIT:
[tommd@mavlo Test]$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.0.4
[tommd@mavlo Test]$ ghc -O2 so.hs -rtsopts -threaded -fforce-recomp ; time ./so 13 +RTS -N2
[1 of 1] Compiling Main ( so.hs, so.o )
Linking so ...
73712
real 0m5.404s
[tommd@mavlo Test]$ ghc -O2 so.hs -rtsopts -fforce-recomp ; time ./so 13
[1 of 1] Compiling Main ( so.hs, so.o )
Linking so ...
73712
real 0m8.134s

Project Euler 23: insight on this stackoverflow-ing program needed

Hi haskell fellows. I'm currently working on the 23rd problem of Project Euler. Where I'm at atm is that my code seems right to me - not in the "good algorithm" meaning, but in the "should work" meaning - but produces a Stack memory overflow.
I do know that my algorithm isn't perfect (in particular I could certainly avoid computing such a big intermediate result at each recursion step in my worker function).
Though, being in the process of learning Haskell, I'd like to understand why this code fails so miserably, in order to avoid this kind of mistakes next time.
Any insight on why this program is wrong will be appreciated.
import qualified Data.List as Set ((\\))
main = print $ sum $ worker abundants [1..28123]
-- Limited list of abundant numbers
abundants :: [Int]
abundants = filter (\x -> (sum (divisors x)) - x > x) [1..28123]
-- Given a positive number, returns its divisors unordered.
divisors :: Int -> [Int]
divisors x | x > 0 = [1..squareRoot x] >>=
               (\y -> if mod x y == 0
                      then let d = div x y in
                           if y == d
                           then [y]
                           else [y, d]
                      else [])
           | otherwise = []
worker :: [Int] -> [Int] -> [Int]
worker (a:[]) prev = prev Set.\\ [a + a]
worker (a:as) prev = worker as (prev Set.\\ (map ((+) a) (a:as)))
-- http://www.haskell.org/haskellwiki/Generic_number_type#squareRoot
(^!) :: Num a => a -> Int -> a
(^!) x n = x^n
squareRoot :: Int -> Int
squareRoot 0 = 0
squareRoot 1 = 1
squareRoot n =
   let twopows = iterate (^!2) 2
       (lowerRoot, lowerN) =
          last $ takeWhile ((n>=) . snd) $ zip (1:twopows) twopows
       newtonStep x = div (x + div n x) 2
       iters = iterate newtonStep (squareRoot (div n lowerN) * lowerRoot)
       isRoot r = r^!2 <= n && n < (r+1)^!2
   in head $ dropWhile (not . isRoot) iters
Edit: the exact error is Stack space overflow: current size 8388608 bytes.. Increasing the stack memory limit through +RTS -K... doesn't solve the problem.
Edit2: about the sqrt thing, I just copy pasted it from the link in comments. To avoid having to cast Integer to Doubles and face the rounding problems etc...
In the future, it's polite to attempt a bit of minimalization on your own. For example, with a bit of playing, I was able to discover that the following program also stack-overflows (with an 8M stack):
main = print (worker [1..1000] [1..1000])
...which really nails down just what function is screwing you over. Let's take a look at worker:
worker (a:[]) prev = prev Set.\\ [a + a]
worker (a:as) prev = worker as (prev Set.\\ (map ((+) a) (a:as)))
Even on my first read, this function was red-flagged in my mind, because it's tail-recursive. Tail recursion in Haskell is generally not such a great idea as it is in other languages; guarded recursion (where you produce at least one constructor before recursing, or recurse some small number of times before producing a constructor) is generally better for lazy evaluation. And in fact, here, what's happening is that each recursive call to worker is building a deeper- and deeper-ly nested thunk in the prev argument. When the time comes to finally return prev, we have to go very deeply into a long chain of Set.\\ calls to work out just what it was we finally have.
This problem is obfuscated slightly by the fact that the obvious strictness annotation doesn't help. Let's massage worker until it works. The first observation is that the first clause is completely subsumed by the second one. This is stylistic; it shouldn't affect the behavior (except on empty lists).
worker [] prev = prev
worker (a:as) prev = worker as (prev Set.\\ map (a+) (a:as))
Now, the obvious strictness annotation:
worker [] prev = prev
worker (a:as) prev = prev `seq` worker as (prev Set.\\ map (a+) (a:as))
I was surprised to discover that this still stack overflows! The sneaky thing is that seq on lists only evaluates far enough to learn whether the list matches either [] or _:_. The following does not stack overflow:
import Control.DeepSeq
worker [] prev = prev
worker (a:as) prev = prev `deepseq` worker as (prev Set.\\ map (a+) (a:as))
I didn't plug this final version back into the original code, but it at least works with the minimized main above. By the way, you might like the following implementation idea, which also stack overflows:
import Control.Monad
worker as bs = bs Set.\\ liftM2 (+) as as
but which can be fixed by using Data.Set instead of Data.List, and no strictness annotations:
import Control.Monad
import Data.Set as Set
worker as bs = toList (fromList bs Set.\\ fromList (liftM2 (+) as as))
As Daniel Wagner correctly said, the problem is that
worker (a:as) prev = worker as (prev Set.\\ (map ((+) a) (a:as)))
builds a badly nested thunk. You can avoid that and get somewhat better performance than with deepseq by exploiting the fact that both arguments to worker are sorted in this application. Thus you can get incremental output by noting that at any step everything in prev smaller than 2*a cannot be the sum of two abundant numbers, so
worker (a:as) prev = small ++ worker as (large Set.\\ map (+ a) (a:as))
  where
    (small,large) = span (< a+a) prev
does better. However, it's still bad because (\\) cannot use the sortedness of the two lists. If you replace it with
minus xxs@(x:xs) yys@(y:ys)
  = case compare x y of
      LT -> x : minus xs yys
      EQ -> minus xs ys
      GT -> minus xxs ys
minus xs _ = xs -- originally forgot the case for one empty list
(or use the data-ordlist package's version), calculating the set-difference is O(length) instead of O(length^2).
Ok, I loaded it up and gave it a shot. Daniel Wagner's advice is pretty good, probably better than mine. The problem is indeed with the worker function, but I was going to suggest using Data.MemoCombinators to memoize your function instead.
Also, your divisors algorithm is kind of silly. There's a much better way to do that. It's kind of mathy and would require a lot of TeX, so here's a link to a math.stackexchange page about how to do that. The one I was talking about, was the accepted answer, though someone else gives a recursive solution that I think would run faster. (It doesn't require prime factorization.)
https://math.stackexchange.com/questions/22721/is-there-a-formula-to-calculate-the-sum-of-all-proper-divisors-of-a-number

How to avoid stack space overflows?

I've been a bit surprised by GHC throwing stack overflows when I need to get a value out of a large list containing memory-intensive elements.
I expected GHC to have TCO, so that I would never run into such situations.
To simplify the case as much as possible, look at the following straightforward implementations of functions returning Fibonacci numbers (taken from HaskellWiki). The goal is to display the millionth number.
import Data.List
-- elegant recursive definition
fibs = 0 : 1 : zipWith (+) fibs (tail fibs)
-- a bit tricky using unfoldr from Data.List
fibs' = unfoldr (\(a,b) -> Just (a,(b,a+b))) (0,1)
-- version using iterate
fibs'' = map fst $ iterate (\(a,b) -> (b,a+b)) (0,1)
-- calculate number by definition
fib_at 0 = 0
fib_at 1 = 1
fib_at n = fib_at (n-1) + fib_at (n-2)
main = do
  {- All following expressions abort with
     Stack space overflow: current size 8388608 bytes.
     Use `+RTS -Ksize -RTS' to increase it.
  -}
  print $ fibs !! (10^6)
  print . last $ take (10^6) fibs
  print $ fibs' !! (10^6)
  print $ fibs'' !! (10^6)
  -- following expression does not finish after several
  -- minutes
  print $ fib_at (10^6)
The source is compiled with ghc -O2.
What am I doing wrong ? I'd like to avoid recompiling with increased stack size or other specific compiler options.
These links here will give you a good introduction to your problem of too many thunks (space leaks).
If you know what to look out for (and have a decent model of lazy evaluation), then solving them is quite easy, for example:
{-# LANGUAGE BangPatterns #-}
import Data.List
fibs' = unfoldr (\(!a,!b) -> Just (a,(b,a+b))) (0,1)
main = do
  print $ fibs' !! (10^6) -- no more stack overflow
All of the definitions (except the useless fib_at) will delay all the + operations, which means that when you have selected the millionth element it is a thunk with a million delayed additions. You should try something stricter.
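One way to be "stricter" is to skip the intermediate list and carry the two running values in a loop whose accumulators are forced at every step. This is a sketch rather than a drop-in replacement for the definitions in the question; the name fibAt is chosen here, and it assumes n is non-negative.

```haskell
{-# LANGUAGE BangPatterns #-}

-- | n-th Fibonacci number (fibAt 0 == 0), assuming n >= 0.
--   The bang patterns force both accumulators on every iteration,
--   so no chain of delayed additions can build up.
fibAt :: Int -> Integer
fibAt n = go n 0 1
  where
    go :: Int -> Integer -> Integer -> Integer
    go 0 !a _  = a
    go k !a !b = go (k - 1) b (a + b)

main :: IO ()
main = do
  print (fibAt 10)                          -- prints 55
  -- Print only the digit count of a large result to keep the output short.
  print (length (show (fibAt (10 ^ 5))))
```

Raising the argument to 10^6, as in the question, only makes the loop run longer; the stack depth does not grow with n because each addition is performed before the next iteration starts.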
As others have pointed out, because Haskell is lazy you have to force evaluation of the thunks to avoid the stack overflow.
It appears to me that this version of fibs' should work up to 10^6:
fibs' = unfoldr (\(a,b) -> Just (seq a (a, (b, a + b) ))) (0,1)
I recommend studying this wiki page on Folds and having a look at the seq function.

Resources