How to write parallel code with Haskell vectors?

On the one hand, in Haskell, Vector a seems to be the preferred type to use as an array of numbers. There is even an (incomplete) Vector Tutorial.
On the other hand, the strategies in Control.Parallel.Strategies are defined mostly in terms of Traversable, and the vector library doesn't provide these instances.
A minimal complete definition of Traversable t also requires Foldable and either
traverse :: Applicative f => (a -> f b) -> t a -> f (t b)
or
sequenceA :: Applicative f => t (f a) -> f (t a)
I don't see how sequenceA can be defined for Data.Vector.Unboxed.Vector. So, what is the best approach to writing parallel code with unboxed vectors? Defining some new ad hoc strategies like evalVector, using par and pseq explicitly, or using plain Data.Array instead of vectors?
P.S. Plain Arrays are parallelizable without problems: https://gist.github.com/701888

It's a hack job for parVector but this worked for me:
import qualified Data.Vector as V
import Control.Parallel.Strategies
import Control.Parallel
import Control.DeepSeq
ack :: Int -> Int -> Int
ack 0 n = n+1
ack m 0 = ack (m-1) 1
ack m n = ack (m-1) (ack m (n-1))
main = do
  let vec = V.enumFromN 1 1000
  let res = (V.map (ack 2) vec) `using` parVector
  print res
parVector :: NFData a => Strategy (V.Vector a)
parVector vec = eval vec `seq` Done vec
  where
    chunkSize = 1
    eval v
      | vLen == 0         = ()
      | vLen <= chunkSize = rnf (v V.! 0) -- FIX this to handle chunks > 1
      | otherwise         = eval (V.take half v) `par` eval (V.drop half v)
      where
        vLen = V.length v
        half = vLen `div` 2
And running this code:
[tommd@Mavlo Test]$ ghc --make -O2 -threaded t.hs
... dumb warning ...
[tommd@Mavlo Test]$ time ./t +RTS -N1 >/dev/null
real 0m1.962s
user 0m1.951s
sys 0m0.009s
[tommd@Mavlo Test]$ time ./t +RTS -N2 >/dev/null
real 0m1.119s
user 0m2.221s
sys 0m0.005s
When I run the code with Integer instead of Int in the type signature:
[tommd@Mavlo Test]$ time ./t +RTS -N2 >/dev/null
real 0m4.754s
user 0m9.435s
sys 0m0.028s
[tommd@Mavlo Test]$ time ./t +RTS -N1 >/dev/null
real 0m9.008s
user 0m8.952s
sys 0m0.029s
Rock!
EDIT: And a solution that is a bit closer to your earlier attempt is cleaner (it doesn't use functions from three separate modules) and works great:
parVector :: NFData a => Strategy (V.Vector a)
parVector vec =
  let vLen = V.length vec
      half = vLen `div` 2
      minChunk = 10
  in if vLen > minChunk
       then do
         let v1 = V.unsafeSlice 0 half vec
             v2 = V.unsafeSlice half (vLen - half) vec
         parVector v1
         parVector v2
         return vec
       else
         evalChunk (vLen-1) >>
         return vec
  where
    evalChunk 0 = rpar (rdeepseq (vec V.! 0)) >> return vec
    evalChunk i = rpar (rdeepseq (vec V.! i)) >> evalChunk (i-1)
Things to learn from this solution:
It uses the Eval monad, which is strict so we're sure to spark everything (compared to wrapping things in let and remembering to use bang patterns).
Contrary to your proposed implementation, it (a) doesn't construct a new vector, which is costly, and (b) forces evaluation of each element via rpar and rdeepseq in evalChunk (I don't believe rpar vec forces any of the vector's elements).
Contrary to my belief, slice takes a start index and a length, not a start and an end index (so V.slice 2 3 v selects the three elements at indices 2 through 4). Oops!
We still need to import Control.DeepSeq (for NFData), but I've e-mailed the libraries list to try and fix that issue.
Performance seems similar to the first parVector solution in this answer, so I won't post numbers.
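Since the type is unchanged, usage is the same as in the first version, e.g.:
main = print (V.map (ack 2) (V.enumFromN 1 1000) `using` parVector)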

1) As you probably know, vector is a product of the DPH work that has proven harder than the researchers initially expected.
2) Unboxed vectors can't divide up the work for individual elements across multiple CPUs.
3) I'd be a lot more hopeful for boxed vectors. Something like:
using (map (rnf . (vec !)) [0..V.length vec - 1]) (parList rdeepseq)
Or maybe you can avoid constructing the list and using parList; I think just sparking parts of the array is sufficient. The code below is likely broken, but the concept of making your own parVector using rnf and dividing the vector in half until it reaches a single element (or some tunable chunk size of elements) should work.
parVector :: NFData a => Strategy (V.Vector a)
parVector vec = let !_ = eval vec in Done vec
  where
    chunkSize = 1
    eval v
      | vLen == 0         = ()
      | vLen <= chunkSize = rnf (v V.! 0) -- FIX this to handle chunks > 1
      | otherwise         = eval (V.take half v) `par` eval (V.drop half v)
      where
        vLen = V.length v
        half = vLen `div` 2

Related

Parallelize computation of mutable vector in ST

How can computations done in ST be made to run in parallel?
I have a vector which needs to be filled in by random access, hence the use of ST. The computation runs correctly single-threaded, but I have been unable to figure out how to use more than one core.
Random access is needed because of the meaning of the indices into the vector. There are n things, and every possible way of choosing among n things has an entry in the vector, as in the choice function. Each of these choices corresponds to a binary number (conceptually, a packed [Bool]), and these Int values are the indices. If there are n things, then the size of the vector is 2^n. The natural way the algorithm runs is for every entry corresponding to "n choose 1" to be filled in, then every entry for "n choose 2," and so on. The entries corresponding to "n choose k" depend on the entries corresponding to "n choose (k-1)." For example, with n = 3 the "3 choose 1" entries live at indices 1, 2 and 4 (binary 001, 010, 100), while the "3 choose 2" entries live at 3, 5 and 6. The integers for the different choices do not occur in numerical order, and that's why random access is needed.
Here's a pointless (but slow) computation that follows the same pattern. The example function shows how I tried to break the computation up so that the bulk of the work is done in a pure world (no ST monad). In the code below, bogus is where most of the work is done, with the intent of calling that in parallel, but only one core is ever used.
import qualified Data.Vector as Vb
import qualified Data.Vector.Mutable as Vm
import qualified Data.Vector.Generic.Mutable as Vg
import qualified Data.Vector.Generic as Gg
import Control.Monad.ST as ST ( ST, runST )
import Data.Foldable(forM_)
import Data.Char(digitToInt)

main :: IO ()
main = do
  putStrLn $ show (example 9)

example :: Int -> Vb.Vector Int
example n = runST $ do
  m <- Vg.new (2^n) :: ST s (Vm.STVector s Int)
  Vg.unsafeWrite m 0 (1)
  forM_ [1..n] $ \i -> do
    p <- prev m n (i-1)
    let newEntries = (choiceList n i) :: [Int]
    forM_ newEntries $ \e -> do
      let v = bogus p e
      Vg.unsafeWrite m e v
  Gg.unsafeFreeze m

choiceList :: Int -> Int -> [Int]
choiceList _ 0 = [0]
choiceList n 1 = [ 2^k | k <- [0..(n-1)] ]
choiceList n k
  | n == k    = [2^n - 1]
  | otherwise = choiceList (n-1) k ++ map ((2^(n-1)) +) (choiceList (n-1) (k-1))

prev :: Vm.STVector s Int -> Int -> Int -> ST s Integer
prev m n 0 = return 1
prev m n i = do
  let chs = choiceList n i
  v <- mapM (\k -> Vg.unsafeRead m k) chs
  let e = map (\k -> toInteger k) v
  return (sum e)

bogus :: Integer -> Int -> Int
bogus prior index = do
  let f = fac prior
  let g = (f^index) :: Integer
  let d = (map digitToInt (show g)) :: [Int]
  let a = fromIntegral (head d)^2
  a

fac :: Integer -> Integer
fac 0 = 1
fac n = n * fac (n - 1)
If anyone tests this, using more than 9 or 10 in show (example 9) will take much longer than you want to wait for such a pointless sequence of numbers.
Just do it in IO. If you need to use the result in pure code, then unsafePerformIO is available.
The following version runs about 3-4 times faster with +RTS -N16 than +RTS -N1. My changes involved converting the ST vectors to IO, changing the forM_ to forConcurrently_, and adding a bang annotation to let !v = bogus ....
Full code:
{-# LANGUAGE BangPatterns #-}

import qualified Data.Vector as Vb
import qualified Data.Vector.Mutable as Vm
import qualified Data.Vector.Generic.Mutable as Vg
import qualified Data.Vector.Generic as Gg
import Control.Monad.ST as ST ( ST, runST )
import Data.Foldable(forM_)
import Data.Char(digitToInt)
import Control.Concurrent.Async
import System.IO.Unsafe

main :: IO ()
main = do
  let m = unsafePerformIO (example 9)
  putStrLn $ show m

example :: Int -> IO (Vb.Vector Int)
example n = do
  m <- Vg.new (2^n)
  Vg.unsafeWrite m 0 (1)
  forM_ [1..n] $ \i -> do
    p <- prev m n (i-1)
    let newEntries = (choiceList n i) :: [Int]
    forConcurrently_ newEntries $ \e -> do
      let !v = bogus p e
      Vg.unsafeWrite m e v
  Gg.unsafeFreeze m

choiceList :: Int -> Int -> [Int]
choiceList _ 0 = [0]
choiceList n 1 = [ 2^k | k <- [0..(n-1)] ]
choiceList n k
  | n == k    = [2^n - 1]
  | otherwise = choiceList (n-1) k ++ map ((2^(n-1)) +) (choiceList (n-1) (k-1))

prev :: Vm.IOVector Int -> Int -> Int -> IO Integer
prev m n 0 = return 1
prev m n i = do
  let chs = choiceList n i
  v <- mapM (\k -> Vg.unsafeRead m k) chs
  let e = map (\k -> toInteger k) v
  return (sum e)

bogus :: Integer -> Int -> Int
bogus prior index = do
  let f = fac prior
  let g = (f^index) :: Integer
  let d = (map digitToInt (show g)) :: [Int]
  let a = fromIntegral (head d)^2
  a

fac :: Integer -> Integer
fac 0 = 1
fac n = n * fac (n - 1)
I think this cannot be done in a safe way. In the general case, it seems it would break Haskell's referential transparency.
If we could perform multi-threaded computations within ST s, then we could spawn two threads that race over the same STRef s Bool. Let's say one thread is writing False and the other one True.
After we use runST on the computation, we get an expression of type Bool which is sometimes False and sometimes True. That should not be possible.
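To make the danger concrete, here is a sketch (kept entirely in comments, because the key function does not exist) of the program that safe ST must rule out; forkST is hypothetical:
-- race :: Bool
-- race = runST $ do
--   r <- newSTRef False
--   forkST (writeSTRef r True)  -- hypothetical: no forkST exists, by design
--   readSTRef r                 -- would be False or True depending on scheduling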
If you are absolutely certain that your parallelization does not break referential transparency, you could try using unsafe primitives like unsafeIOToST to spawn new threads. Use with extreme care.
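For illustration only, a minimal sketch of such an escape hatch; parST2_ is a hypothetical helper, not a library function, and it is only sound if the two actions write to disjoint parts of the state:
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad.ST
import Control.Monad.ST.Unsafe (unsafeIOToST, unsafeSTToIO)

-- Run two ST actions concurrently and wait for both to finish.
-- UNSAFE unless the actions touch disjoint parts of the state.
parST2_ :: ST s () -> ST s () -> ST s ()
parST2_ a b = unsafeIOToST $ do
  done <- newEmptyMVar
  _ <- forkIO (unsafeSTToIO a >> putMVar done ())
  unsafeSTToIO b
  takeMVar done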
There might be safer ways to achieve something similar. Outside ST, we do have some parallelism available in Control.Parallel.Strategies.
There are a number of ways to do parallelization in Haskell. Usually they will give comparable performance improvements, but some are better than others, and it mostly depends on the problem that needs parallelizing. This particular use case looked very interesting to me, so I decided to investigate a few approaches.
Approaches
vector-strategies
We are using a boxed vector, therefore we can utilize laziness and built-in spark pool for parallelization. One very simple approach is provided by vector-strategies package, which can iterate over any immutable boxed vector and evaluate all of the thunks in parallel. It is also possible to split the vector in chunks, but as it turns out the chunk size of 1 is the optimal one:
exampleParVector :: Int -> Vb.Vector Int
exampleParVector n = example n `using` parVector 1
parallel
parVector uses par underneath and requires one extra iteration over the vector. In this case we are already iterating over the vector, so it would actually make more sense to use par from parallel directly. This allows us to perform the computation in parallel while continuing to use the ST monad:
import Control.Parallel (par)
...
forM_ [1..n] $ \i -> do
  p <- prev m n (i-1)
  let newEntries = choiceList n i :: [Int]
  forM_ newEntries $ \e -> do
    let v = bogus p e
    v `par` Vg.unsafeWrite m e v
It is important to note that in this problem the computation of each element of the vector is expensive when compared to the total number of elements in the vector. That is why using par is a very good solution here. If it were the opposite, namely if the vector were very large but its elements weren't too expensive to compute, it would be better to use an unboxed vector and switch to a different parallelization method.
async
Another way was described by @K.A.Buhr: switch to IO from ST and use async:
import Control.Concurrent.Async (forConcurrently_)
...
forM_ [1..n] $ \i -> do
  p <- prev m n (i-1)
  let newEntries = choiceList n i :: [Int]
  forConcurrently_ newEntries $ \e -> do
    let !v = bogus p e
    Vg.unsafeWrite m e v
The concern that @chi raised is a valid one; however, in this particular implementation it is safe to use unsafePerformIO instead of runST, because the parallelization does not violate the invariant of deterministic computation. Namely, we can promise that regardless of the input supplied to the example function, the output will always be exactly the same.
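Spelled out, the pure wrapper that this argument justifies would look like the sketch below; exampleVec is an illustrative name, and the NOINLINE pragma is the customary precaution when using unsafePerformIO:
{-# NOINLINE exampleVec #-}
exampleVec :: Int -> Vb.Vector Int
exampleVec n = unsafePerformIO (example n)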
scheduler
Green threads are pretty cheap in Haskell, but they aren't free. The solution above with async package has one slight drawback: it will spin up at least as many threads as there are elements in the newEntries list each time forConcurrently_ is called. It would be better to spin up as many threads as there are capabilities (the -N RTS option) and let them do all the work. For this we can use scheduler package, which is a work stealing scheduler:
import Control.Scheduler (Comp(Par), runBatch_, scheduleWork_, withScheduler_)
...
withScheduler_ Par $ \scheduler ->
  forM_ [1..n] $ \i -> runBatch_ scheduler $ \_ -> do
    p <- prev m n (i-1)
    let newEntries = choiceList n i :: [Int]
    forM_ newEntries $ \e -> scheduleWork_ scheduler $ do
      let !v = bogus p e
      Vg.unsafeWrite m e v
The spark pool in GHC also uses a work stealing scheduler, which is built into the RTS and is unrelated to the package above in any shape or form, but the idea is very similar: a few threads with many units of computation.
Benchmarks
Here are some benchmarks on a 16-core machine for all of the approaches with example 7 (the value 9 takes on the order of seconds, which introduces too much noise for criterion). We only get about a 5x speedup, because a significant part of the algorithm is sequential in nature and can't be parallelized.

How do I memoize?

I have written this function that computes Collatz sequences, and I see wildly varying times of execution depending on the spin I give it. Apparently it is related to something called "memoization", but I have a hard time understanding what it is and how it works, and, unfortunately, the relevant article on the HaskellWiki, as well as the papers it links to, have all proven not to be easily surmountable. They discuss intricate details of the relative performance of highly layman-indifferentiable tree constructions, while what I'm missing must be some very basic, very trivial point that these sources neglect to mention.
This is the code. It is a complete program, ready to be built and executed.
module Main where

import Data.Function
import Data.List (maximumBy)

size :: (Integral a) => a
size = 10 ^ 6

-- Nail the basics.

collatz :: Integral a => a -> a
collatz n | even n    = n `div` 2
          | otherwise = n * 3 + 1

recollatz :: Integral a => a -> a
recollatz = fix $ \f x -> if (x /= 1)
                            then f (collatz x)
                            else x

-- Now, I want to do the counting with a tuple monad.

mocollatz :: Integral b => b -> ([b], b)
mocollatz n = ([n], collatz n)

remocollatz :: Integral a => a -> ([a], a)
remocollatz = fix $ \f x -> if x /= 1
                              then f =<< mocollatz x
                              else return x

-- Trivialities.

collatzLength :: Integral a => a -> Int
collatzLength x = (length . fst $ (remocollatz x)) + 1

collatzPairs :: Integral a => a -> [(a, Int)]
collatzPairs n = zip [1..n] (collatzLength <$> [1..n])

longestCollatz :: Integral a => a -> (a, Int)
longestCollatz n = maximumBy order $ collatzPairs n
  where
    order :: Ord b => (a, b) -> (a, b) -> Ordering
    order x y = snd x `compare` snd y

main :: IO ()
main = print $ longestCollatz size
With ghc -O2 it takes about 17 seconds, and without ghc -O2 about 22 seconds, to deliver the length and the seed of the longest Collatz sequence starting at any point below size.
Now, if I make these changes:
diff --git a/Main.hs b/Main.hs
index c78ad95..9607fe0 100644
--- a/Main.hs
+++ b/Main.hs
@@ -1,6 +1,7 @@
 module Main where
 
 import Data.Function
+import qualified Data.Map.Lazy as M
 import Data.List (maximumBy)
 
 size :: (Integral a) => a
@@ -22,10 +23,15 @@ recollatz = fix $ \f x -> if (x /= 1)
 mocollatz :: Integral b => b -> ([b], b)
 mocollatz n = ([n], collatz n)
 
-remocollatz :: Integral a => a -> ([a], a)
-remocollatz = fix $ \f x -> if x /= 1
-  then f =<< mocollatz x
-  else return x
+remocollatz :: (Num a, Integral b) => b -> ([b], a)
+remocollatz 1 = return 1
+remocollatz x = case M.lookup x (table mutate) of
+  Nothing -> mutate x
+  Just y  -> y
+  where mutate x = remocollatz =<< mocollatz x
+
+table :: (Ord a, Integral a) => (a -> b) -> M.Map a b
+table f = M.fromList [ (x, f x) | x <- [1..size] ]
 
 -- Trivialities.
Then it will take just about 4 seconds with ghc -O2, but I would not live long enough to see it complete without ghc -O2.
Looking at the details of cost centres with ghc -prof -fprof-auto -O2 reveals that the first version enters collatz about a hundred million times, while the patched one does so just about one and a half million times. This must be the reason for the speedup, but I have a hard time understanding the inner workings of this magic. My best idea is that we replace a portion of expensive recursive calls with O(log n) map lookups, but I don't know if it's true, and why it depends so much on some godforsaken compiler flags, while, as I see it, such performance swings should all follow solely from the language.
Can I haz an explanation of what happens here, and why the performance differs so vastly between ghc -O2 and plain ghc builds?
P.S. There are two requirements to the achieving of automagical memoization highlighted elsewhere on Stack Overflow:
Make a function to be memoized a top-level name.
Make a function to be memoized a monomorphic one.
In line with these requirements, I rebuilt remocollatz as follows:
remocollatz :: Int -> ([Int], Int)
remocollatz 1 = return 1
remocollatz x = mutate x
mutate :: Int -> ([Int], Int)
mutate x = remocollatz =<< mocollatz x
Now it's as top level and as monomorphic as it gets. Running time is about 11 seconds, versus the similarly monomorphized table version:
remocollatz :: Int -> ([Int], Int)
remocollatz 1 = return 1
remocollatz x = case M.lookup x (table mutate) of
  Nothing -> mutate x
  Just y  -> y

mutate :: Int -> ([Int], Int)
mutate = \x -> remocollatz =<< mocollatz x

table :: (Int -> ([Int], Int)) -> M.Map Int ([Int], Int)
table f = M.fromList [ (x, f x) | x <- [1..size] ]
Running in less than 4 seconds.
I wonder why the memoization ghc is supposedly performing in the first case here is almost 3 times slower than my dumb table.
Can I haz an explanation of what happens here, and why the performance differs so vastly between ghc -O2 and plain ghc builds?
Disclaimer: this is a guess, not verified by viewing GHC core output. A careful answer would do so to verify the conjectures outlined below. You can try peering through it yourself: add -ddump-simpl to your compilation line and you will get copious output detailing exactly what GHC has done to your code.
You write:
remocollatz x = {- ... -} table mutate {- ... -}
  where mutate x = remocollatz =<< mocollatz x
The expression table mutate in fact does not depend on x; but it appears on the right-hand side of an equation that takes x as an argument. Consequently, without optimizations, this table is recomputed each time remocollatz is called (presumably even from inside the computation of table mutate).
With optimizations, GHC notices that table mutate does not depend on x, and floats it to its own definition, effectively producing:
fresh_variable_name = table mutate
  where mutate x = remocollatz =<< mocollatz x

remocollatz x = case M.lookup x fresh_variable_name of
  {- ... -}
The table is therefore computed just once for the entire program run.
don't know why it [the performance] depends so much on some godforsaken compiler flags, while, as I see it, such performance swings should all follow solely from the language.
Sorry, but Haskell doesn't work that way. The language definition tells clearly what the meaning of a given Haskell term is, but does not say anything about the runtime or memory performance needed to compute that meaning.
Another approach to memoization that works in some situations, like this one, is to use a boxed vector, whose elements are computed lazily. The function used to initialize each element can use other elements of the vector in its calculation. As long as the evaluation of an element of the vector doesn't loop and refer to itself, just the elements it recursively depends on will be evaluated. Once evaluated, an element is effectively memoized, and this has the further benefit that elements of the vector that are never referenced are never evaluated.
The Collatz sequence is a nearly ideal application for this technique, but there is one complication. The next Collatz value(s) in sequence from a value under the limit may be outside the limit, which would cause a range error when indexing the vector. I solved this by just iterating through the sequence until back under the limit and counting the steps to do so.
The following program takes 0.77 seconds to run unoptimized and 0.30 when optimized:
import qualified Data.Vector as V

limit = 10 ^ 6 :: Int

-- The Collatz function, which given a value returns the next in the sequence.
nextCollatz val
  | odd val   = 3 * val + 1
  | otherwise = val `div` 2

-- Given a value, return the next Collatz value in the sequence that is less
-- than the limit and the number of steps to get there. For example, the
-- sequence starting at 13 is: [13, 40, 20, 10, 5, 16, 8, 4, 2, 1], so if
-- limit is 100, then (nextCollatzWithinLimit 13) is (40, 1), but if limit is
-- 15, then (nextCollatzWithinLimit 13) is (10, 3).
nextCollatzWithinLimit val = (firstInRange, stepsToFirstInRange)
  where
    firstInRange = head rest
    stepsToFirstInRange = 1 + length biggerThanLimit
    (biggerThanLimit, rest) = span (>= limit) (tail collatzSeqStartingWithVal)
    collatzSeqStartingWithVal = iterate nextCollatz val

-- A boxed vector holding the Collatz length for each index. The collatzFn
-- used to generate the value for each element refers back to other elements
-- of this vector, but since the vector elements are only evaluated as needed
-- and there aren't any loops in the Collatz sequences, the values are
-- calculated only as needed.
collatzVec :: V.Vector Int
collatzVec = V.generate limit collatzFn
  where
    collatzFn :: Int -> Int
    collatzFn index
      | index <= 1 = 1
      | otherwise  = (collatzVec V.! nextWithinLimit) + stepsToGetThere
      where
        (nextWithinLimit, stepsToGetThere) = nextCollatzWithinLimit index

main :: IO ()
main = do
  -- Use a fold through the vector to find the longest Collatz sequence under
  -- the limit, and keep track of both the maximum length and the initial
  -- value of the sequence, which is the index.
  let (maxLength, maxIndex) = V.ifoldl' accMaxLen (0, 0) collatzVec
      accMaxLen acc@(accMaxLen, accMaxIndex) index currLen
        | currLen <= accMaxLen = acc
        | otherwise            = (currLen, index)
  putStrLn $ "Max Collatz length below " ++ show limit ++ " is "
             ++ show maxLength ++ " at index " ++ show maxIndex

Short-circuiting a function over a lower triangular(ish) array in Haskell: speed leads to ugly code

I've got a function, in my minimal example called maybeProduceValue i j, which is only valid when i > j. Note that in my actual code the js are not uniform, so the data only resembles a triangular matrix; I don't know what the mathematical name for this is.
I'd like my code, which loops over i and j and returns essentially (where js is sorted)
[maximum [f i j | j <- js, j < i] | i <- [0..iMax]]
to not check any more js once one has failed. In C-like languages, this is as simple as
if (j >= i) {break;}
and I'm trying to recreate this behaviour in Haskell. I've got two implementations below:
one which tries to take advantage of laziness by using takeWhile to only inspect at most one value (per i) which fails the test and returns Nothing;
one which remembers the number of js which worked for the previous i and so, for i+1, it doesn't bother doing any safety checks until it exceeds this number.
This latter function is more than twice as fast by my benchmarks but it really is a mess - I'm trying to convince people that Haskell is more concise and safe while still reasonably performant and here is some fast code which is dense, cluttered and does a bunch of unsafe operations.
Is there a solution, perhaps using Cont, Error or Exception, that can achieve my desired behaviour?
n.b. I've tried using Traversable.mapAccumL and Vector.unfoldrN instead of State and they end up being about the same speed and clarity. It's still a very overcomplicated way of solving this problem.
import Criterion.Config
import Criterion.Main
import Control.DeepSeq
import Control.Monad.State
import Data.Maybe
import qualified Data.Traversable as T
import qualified Data.Vector as V

main = deepseq js $ defaultMainWith (defaultConfig{cfgSamples = ljust 10}) (return ()) [
  bcompare [
    bench "whileJust" $ nf whileJust js,
    bench "memoised" $ nf memoisedSection js
  ]]

iMax = 5000
jMax = 10000

-- any sorted vector
js :: V.Vector Int
js = V.enumFromN 0 jMax

maybeProduceValue :: Int -> Int -> Maybe Float
maybeProduceValue i j | j < i     = Just (fromIntegral (i+j))
                      | otherwise = Nothing

unsafeProduceValue :: Int -> Int -> Float
-- unsafeProduceValue i j | j >= i = error "you fool!"
unsafeProduceValue i j = fromIntegral (i+j)

whileJust, memoisedSection :: V.Vector Int -> V.Vector Float

-- mean: 389ms
-- short circuits properly
whileJust inputs' = V.generate iMax $ \i ->
    safeMax . V.map fromJust . V.takeWhile isJust $ V.map (maybeProduceValue i) inputs'
  where safeMax v = if V.null v then 0 else V.maximum v

-- mean: 116ms
-- remembers the (monotonically increasing) length of the section of
-- the vector that is safe. I have tested that this doesn't violate the
-- condition that j < i
memoisedSection inputs' = flip evalState 0 $ V.generateM iMax $ \i -> do
  validSection <- state $ \oldIx ->
    let newIx = oldIx + V.length (V.takeWhile (< i) (V.unsafeDrop oldIx inputs'))
    in (V.unsafeTake newIx inputs', newIx)
  return $ V.foldl' max 0 $ V.map (unsafeProduceValue i) validSection
Here's a simple way of solving the problem with Applicatives, provided that you don't need to keep the rest of the list once you run into an issue:
import Control.Applicative
memoizeSections :: [(Int, Int)] -> Maybe [Float]
memoizeSections [] = Just []
memoizeSections ((x, y):xs) = (:) <$> maybeProduceValue x y <*> memoizeSections xs
This is equivalent to:
import Data.Traversable
memoizeSections :: [(Int, Int)] -> Maybe [Float]
memoizeSections = traverse (uncurry maybeProduceValue)
and will return Nothing on the first occurrence of failure. Note that I don't know how fast this is, but it's certainly concise, and arguably pretty clear (particularly the first example).
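For instance, with the maybeProduceValue from the question (a quick check, not from the original answer):
memoizeSections [(2, 1), (3, 2)]  -- Just [3.0,5.0]: every pair satisfies j < i
memoizeSections [(1, 2), (3, 2)]  -- Nothing: the first pair already fails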
Some minor comments:
-- any sorted vector
js :: V.Vector Int
js = V.enumFromN 0 jMax
If you have a vector of Ints (or Floats, etc), you want to use Data.Vector.Unboxed.
maybeProduceValue :: Int -> Int -> Maybe Float
maybeProduceValue i j | j < i     = Just (fromIntegral (i+j))
                      | otherwise = Nothing
Since Just is lazy in its only field, this will create a thunk for the computation fromIntegral (i+j). You almost always want to apply Just like so
maybeProduceValue i j | j < i = Just $! fromIntegral (i+j)
There are some more thunks in:
memoisedSection inputs' = flip evalState 0 $ V.generateM iMax $ \i -> do
  validSection <- state $ \oldIx ->
    let newIx = oldIx + V.length (V.takeWhile (< i) (V.unsafeDrop oldIx inputs'))
    in (V.unsafeTake newIx inputs', newIx)
  return $ V.foldl' max 0 $ V.map (unsafeProduceValue i) validSection
Namely you want to:
let !newIx = oldIx + V.length (V.takeWhile (< i) (V.unsafeDrop oldIx inputs'))
    !v = V.unsafeTake newIx inputs'
in (v, newIx)
as the pair is lazy in its fields and
return $! V.foldl' max 0 $ V.map (unsafeProduceValue i) validSection
because return in the state monad is lazy in the value.
You can use a guard in a single list comprehension:
[f i j | j <- js, i <- is, j < i]
If you're trying to get the same results as
[foo i j | i <- is, j <- js, j < i]
when you know that js is increasing, just write
[foo i j | i <- is, j <- takeWhile (< i) js]
There's no need to mess around with Maybe for this. Note that making the input list global has a likely-unfortunate effect: instead of fusing the production of the input list with its transformation(s) and ultimate consumption, it's forced to actually construct the list and then keep it in memory. It's quite possible that it will take longer to pull the list into cache from memory than to generate it piece by piece on the fly!
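Putting it together with the per-i maximum from the question, a sketch (assuming js is a plain sorted list here, and using 0 for rows with no valid j, like the safeMax in the question's code; rowMaxima is an illustrative name):
rowMaxima :: [Float]
rowMaxima =
  [ safeMax [ unsafeProduceValue i j | j <- takeWhile (< i) js ]
  | i <- [0 .. iMax - 1] ]
  where safeMax xs = if null xs then 0 else maximum xs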

Caching in Haskell and explicit parallelism

I'm currently trying to optimize my solution to problem 14 of Project Euler.
I really enjoy Haskell and I think it's a very good fit for this kind of problem. Here are three different solutions I've tried:
import Data.List (unfoldr, maximumBy)
import Data.Maybe (fromJust, isNothing)
import Data.Ord (comparing)
import Control.Parallel
next :: Integer -> Maybe Integer
next 1 = Nothing
next n
  | even n = Just (div n 2)
  | odd n  = Just (3 * n + 1)

get_sequence :: Integer -> [Integer]
get_sequence n = n : unfoldr (pack . next) n
  where pack n = if isNothing n then Nothing else Just (fromJust n, fromJust n)

get_sequence_length :: Integer -> Integer
get_sequence_length n
  | isNothing (next n) = 1
  | otherwise          = 1 + (get_sequence_length $ fromJust (next n))

-- 8 seconds
main1 = print $ maximumBy (comparing length) $ map get_sequence [1..1000000]

-- 5 seconds
main2 = print $ maximum $ map (\n -> (get_sequence_length n, n)) [1..1000000]

-- Never finishes
main3 = print solution
  where
    s1 = maximumBy (comparing length) $ map get_sequence [1..500000]
    s2 = maximumBy (comparing length) $ map get_sequence [500001..10000000]
    solution = (s1 `par` s2) `pseq` max s1 s2
Now if you look at the actual problem there's a lot of potential for caching, as most new sequences will contain subsequences that have already been calculated before.
For comparison, I wrote a version in C too:
Running time with caching: 0.03 seconds
Running time without caching: 0.3 seconds
That's just insane! Sure, caching reduced the time by a factor of 10, but even without caching it's still at least 17 times faster than my Haskell code.
What's wrong with my code?
Why doesn't Haskell cache the function calls for me? As the functions are pure, shouldn't caching be trivial, only a matter of available memory?
What's the problem with my third parallel version? Why doesn't it finish?
Regarding Haskell as a language, does the compiler automatically parallelize some code (folds, maps, etc.), or does it always have to be done explicitly using Control.Parallel?
Edit: I stumbled upon this similar question. They mentioned that his function wasn't tail-recursive. Is my get_sequence_length tail-recursive? If not, how can I make it so?
Edit2: To Daniel:
Thanks a lot for the reply, really awesome.
I've been playing around with your improvements and I've found some really bad gotchas.
I'm running the tests on Windows 7 (64-bit), 3.3 GHz quad core with 8GB RAM.
The first thing I did was, as you say, replace all Integer with Int, but whenever I ran any of the mains I ran out of memory, even with the stack size (+RTS -Ksize -RTS) set ridiculously high.
Eventually I found this (stackoverflow is awesome...), which means that since all Haskell programs on Windows are run as 32-bit, the Ints were overflowing, causing infinite recursion, just wow...
I ran the tests in a Linux virtual machine (with the 64-bit ghc) instead and got similar results.
Alright, let's start from the top. First important thing is to give the exact command line you're using to compile and run; for my answer, I'll use this line for the timings of all programs:
ghc -O2 -threaded -rtsopts test && time ./test +RTS -N
Next up: since timings vary greatly from machine to machine, we'll give some baseline timings for my machine and your programs. Here's the output of uname -a for my computer:
Linux sorghum 3.4.4-2-ARCH #1 SMP PREEMPT Sun Jun 24 18:59:47 CEST 2012 x86_64 Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz GenuineIntel GNU/Linux
The highlights are: quad-core, 2.4GHz, 64-bit.
Using main1: 30.42s user 2.61s system 149% cpu 22.025 total
Using main2: 21.42s user 1.18s system 129% cpu 17.416 total
Using main3: 22.71s user 2.02s system 220% cpu 11.237 total
Actually, I modified main3 in two ways: first, by removing one of the zeros from the end of the range in s2, and second, by changing max s1 s2 to maximumBy (comparing length) [s1, s2], since the former only accidentally computes the right answer. =)
I'll now focus on serial speed. (To answer one of your direct questions: no, GHC does not automatically parallelize or memoize your programs. Both of those things have overheads that are very difficult to estimate, and consequently it's very difficult to decide when doing them will be beneficial. I have no idea why even the serial solutions in this answer are getting >100% CPU utilization; perhaps some garbage collection is happening in another thread or some such thing.) We'll start from main2, since it was the faster of the two serial implementations. The cheapest way to get a little boost is to change all the type signatures from Integer to Int:
Using Int: 11.17s user 0.50s system 129% cpu 8.986 total (about twice as fast)
The next boost comes from reducing allocation in the inner loop (eliminating the intermediate Maybe values).
import Data.List
import Data.Ord

get_sequence_length :: Int -> Int
get_sequence_length 1 = 1
get_sequence_length n
  | even n = 1 + get_sequence_length (n `div` 2)
  | odd n  = 1 + get_sequence_length (3 * n + 1)

lengths :: [(Int,Int)]
lengths = map (\n -> (get_sequence_length n, n)) [1..1000000]

main = print (maximumBy (comparing fst) lengths)
Using this: 4.84s user 0.03s system 101% cpu 4.777 total
The next boost comes from using faster operations than even and div:
import Data.Bits
import Data.List
import Data.Ord

even' n = n .&. 1 == 0

get_sequence_length :: Int -> Int
get_sequence_length 1 = 1
get_sequence_length n = 1 + get_sequence_length next where
  next = if even' n then n `quot` 2 else 3 * n + 1

lengths :: [(Int,Int)]
lengths = map (\n -> (get_sequence_length n, n)) [1..1000000]

main = print (maximumBy (comparing fst) lengths)
Using this: 1.27s user 0.03s system 105% cpu 1.232 total
For those following along at home, this is about 17 times faster than the main2 that we started with -- a competitive improvement with switching to C.
For memoization, there are a few choices. The simplest is to use a pre-existing package like data-memocombinators to create an immutable array and read from it. The timings are fairly sensitive to choosing a good size for this array; for this problem, I found 50000 to be a pretty good upper bound.
import Data.Bits
import Data.MemoCombinators
import Data.List
import Data.Ord

even' n = n .&. 1 == 0

pre_length :: (Int -> Int) -> (Int -> Int)
pre_length f 1 = 1
pre_length f n = 1 + f next where
  next = if even' n then n `quot` 2 else 3 * n + 1

get_sequence_length :: Int -> Int
get_sequence_length = arrayRange (1,50000) (pre_length get_sequence_length)

lengths :: [(Int,Int)]
lengths = map (\n -> (get_sequence_length n, n)) [1..1000000]

main = print (maximumBy (comparing fst) lengths)
With this: 0.53s user 0.10s system 149% cpu 0.421 total
The fastest of all is to use a mutable, unboxed array for the memoization bit. It's much less idiomatic, but it's bare-metal speed. The speed is much less sensitive to the size of this array, so long as the array is about as large as the biggest thing you want the answer for.
import Control.Monad
import Control.Monad.ST
import Data.Array.Base
import Data.Array.ST
import Data.Bits
import Data.List
import Data.Ord

even' n = n .&. 1 == 0
next n = if even' n then n `quot` 2 else 3 * n + 1

get_sequence_length :: STUArray s Int Int -> Int -> ST s Int
get_sequence_length arr n = do
  bounds@(lo,hi) <- getBounds arr
  if not (inRange bounds n) then (+1) `fmap` get_sequence_length arr (next n) else do
    let ix = n-lo
    v <- unsafeRead arr ix
    if v > 0 then return v else do
      v' <- get_sequence_length arr (next n)
      unsafeWrite arr ix (v'+1)
      return (v'+1)

maxLength :: (Int,Int)
maxLength = runST $ do
    arr <- newArray (1,1000000) 0
    writeArray arr 1 1
    loop arr 1 1 1000000
  where
    loop arr n len 1 = return (n,len)
    loop arr n len n' = do
      len' <- get_sequence_length arr n'
      if len' > len then loop arr n' len' (n'-1) else loop arr n len (n'-1)

main = print maxLength
With this: 0.16s user 0.02s system 138% cpu 0.130 total (which is competitive with the memoized C version)
GHC won't parallelize anything automatically for you. And as you guessed, get_sequence_length is not tail-recursive. See here. And consider how the compiler (unless it's doing some nice optimizations for you) can't evaluate all those recursive additions until you hit the end; you're "building up thunks", which isn't usually a good thing.
Try instead calling a recursive helper function and passing an accumulator, or try defining it in terms of foldr.
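For instance, a sketch of the accumulator version suggested above (using Int, per the earlier advice): the running count is forced at every step, so no chain of (1 + ...) thunks builds up.
{-# LANGUAGE BangPatterns #-}

get_sequence_length :: Int -> Int
get_sequence_length = go 1
  where
    go !acc 1 = acc
    go !acc n
      | even n    = go (acc + 1) (n `div` 2)
      | otherwise = go (acc + 1) (3 * n + 1)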

Haskell Nested Vector Parallel Strategy

Similar to this related question, I would like to perform a parallel map on a Vector, but in my case I have a nested Vector, and I can't seem to get the evaluation correct.
Here is what I have so far:
import qualified Data.Vector as V
import qualified Data.Vector.Unboxed as U
import Data.Vector.Strategies
import Control.DeepSeq
main = do
  let res = genVVec 200 `using` parVector 2
  print res

genUVec :: Int -> U.Vector Int
genUVec n = U.map (ack 2) $ U.enumFromN n 75

genVVec :: Int -> V.Vector (U.Vector Int)
genVVec n = V.map genUVec $ V.enumFromN 0 n

ack :: Int -> Int -> Int
ack 0 n = n+1
ack m 0 = ack (m-1) 1
ack m n = ack (m-1) (ack m (n-1))

instance (NFData a, U.Unbox a) => NFData (U.Vector a) where
  rnf = rnf . U.toList
gives:
$ ./vectorPar +RTS -N8 -s >/dev/null
SPARKS: 200 (17 converted, 183 pruned)
Total time 1.37s ( 1.30s elapsed)
$ ./vectorPar +RTS -s >/dev/null
SPARKS: 200 (0 converted, 200 pruned)
Total time 1.25s ( 1.26s elapsed)
I have also tried modifying the parVector function in vector-strategies directly, but my attempts are clumsy and ineffective.
I realize REPA was designed for nested workloads, but this seems a simple enough problem, and I'd rather not have to rewrite a lot of code.
Note: Guilty author of vector-strategies here (which is a very small title, seeing as this was just a hacked up function I figured others would find useful).
Your observation that parVector is wrong, in that it allows the sparks to be GCed prior to use, seems to be correct. The advice by SimonM means I must do precisely what I was trying to avoid: construct a new vector, at some cost, in place of the old one. Knowing this is necessary, there is little reason not to change parVector to the much simpler definition of:
-- (liftM is from Control.Monad)
parVector2 :: NFData a => Int -> Strategy (V.Vector a)
parVector2 n = liftM V.fromList . parListChunk n rdeepseq . V.toList
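Usage then matches the question's code, with an explicit chunk size, e.g.:
main = print (genVVec 200 `using` parVector2 20)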
Notice the fix given by John L only works because it "beats" the collector by forcing the computations before collection would occur.
I'll be changing the vector-strategies library so this is unnecessary - making your original code work fine. Unfortunately, this will incur the above-mentioned cost of constructing a new Vector (usually minimal).
The problem appears to be that parVector doesn't force evaluation of the elements of the vector. Each element remains a thunk and nothing is sparked until the vector is consumed (by being printed), which is too late for the sparks to do work. You can force evaluation of each element by composing the parVector strategy with rdeepseq.
import qualified Data.Vector as V
import qualified Data.Vector.Unboxed as U
import Data.Vector.Strategies
import Control.DeepSeq
import Control.Parallel.Strategies
main = do
  let res = genVVec 200 `using` (rdeepseq `dot` parVector 20)
  print res

genUVec :: Int -> U.Vector Int
genUVec n = U.map (ack 2) $ U.enumFromN n 75

genVVec :: Int -> V.Vector (U.Vector Int)
genVVec n = V.map genUVec $ V.enumFromN 0 n

ack :: Int -> Int -> Int
ack 0 n = n+1
ack m 0 = ack (m-1) 1
ack m n = ack (m-1) (ack m (n-1))

instance (NFData a, U.Unbox a) => NFData (U.Vector a) where
  rnf vec = seq vec ()

instance (NFData a) => NFData (V.Vector a) where
  rnf = rnf . V.toList
I also changed your NFData (U.Vector a) instance. Since a U.Vector is unboxed, evaluation to WHNF is sufficient, and forcing each element via the list conversion is wasteful. In fact the default definition for rnf works fine if you like.
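With the deepseq of that era, whose default method was rnf a = a `seq` (), "use the default" means the instance body can be omitted entirely:
instance (NFData a, U.Unbox a) => NFData (U.Vector a)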
With these two changes, I get the following
SPARKS: 200 (200 converted, 0 pruned)
and the runtime has been reduced by nearly 50%. I also adjusted the vector chunk size to 20, but the result is very similar to a chunk size of 2.
