I have written this function that computes Collatz sequences, and I see wildly varying execution times depending on the spin I give it. Apparently this is related to something called "memoization", but I have a hard time understanding what it is and how it works. Unfortunately, the relevant article on HaskellWiki, as well as the papers it links to, have all proven hard to get through. They discuss intricate details of the relative performance of tree constructions that a layman cannot tell apart, while what I am missing must be some very basic, very trivial point that these sources neglect to mention.
This is the code. It is a complete program, ready to be built and executed.
module Main where
import Data.Function
import Data.List (maximumBy)
size :: (Integral a) => a
size = 10 ^ 6
-- Nail the basics.
collatz :: Integral a => a -> a
collatz n | even n = n `div` 2
          | otherwise = n * 3 + 1
recollatz :: Integral a => a -> a
recollatz = fix $ \f x -> if (x /= 1)
                          then f (collatz x)
                          else x
-- Now, I want to do the counting with a tuple monad.
mocollatz :: Integral b => b -> ([b], b)
mocollatz n = ([n], collatz n)
remocollatz :: Integral a => a -> ([a], a)
remocollatz = fix $ \f x -> if x /= 1
                            then f =<< mocollatz x
                            else return x
-- Trivialities.
collatzLength :: Integral a => a -> Int
collatzLength x = (length . fst $ (remocollatz x)) + 1
collatzPairs :: Integral a => a -> [(a, Int)]
collatzPairs n = zip [1..n] (collatzLength <$> [1..n])
longestCollatz :: Integral a => a -> (a, Int)
longestCollatz n = maximumBy order $ collatzPairs n
  where
    order :: Ord b => (a, b) -> (a, b) -> Ordering
    order x y = snd x `compare` snd y
main :: IO ()
main = print $ longestCollatz size
With ghc -O2 it takes about 17 seconds, and without -O2 about 22 seconds, to deliver the length and the seed of the longest Collatz sequence starting at any point below size.
Now, if I make these changes:
diff --git a/Main.hs b/Main.hs
index c78ad95..9607fe0 100644
--- a/Main.hs
+++ b/Main.hs
@@ -1,6 +1,7 @@
 module Main where
 import Data.Function
+import qualified Data.Map.Lazy as M
 import Data.List (maximumBy)
 size :: (Integral a) => a
@@ -22,10 +23,15 @@ recollatz = fix $ \f x -> if (x /= 1)
 mocollatz :: Integral b => b -> ([b], b)
 mocollatz n = ([n], collatz n)
-remocollatz :: Integral a => a -> ([a], a)
-remocollatz = fix $ \f x -> if x /= 1
-                            then f =<< mocollatz x
-                            else return x
+remocollatz :: (Num a, Integral b) => b -> ([b], a)
+remocollatz 1 = return 1
+remocollatz x = case M.lookup x (table mutate) of
+    Nothing -> mutate x
+    Just y -> y
+  where mutate x = remocollatz =<< mocollatz x
+
+table :: (Ord a, Integral a) => (a -> b) -> M.Map a b
+table f = M.fromList [ (x, f x) | x <- [1..size] ]
 -- Trivialities.
Then it will take just about 4 seconds with ghc -O2, but I would not live long enough to see it complete without ghc -O2.
Looking at the details of cost centres with ghc -prof -fprof-auto -O2 reveals that the first version enters collatz about a hundred million times, while the patched one does so only about one and a half million times. This must be the reason for the speedup, but I have a hard time understanding the inner workings of this magic. My best idea is that we replace a portion of expensive recursive calls with O(log n) map lookups, but I don't know if it's true and why it depends so much on some godforsaken compiler flags, while, as I see it, such performance swings should all follow solely from the language.
Can I haz an explanation of what happens here, and why the performance differs so vastly between ghc -O2 and plain ghc builds?
P.S. Elsewhere on Stack Overflow, two requirements for achieving automagical memoization are highlighted:
Make the function to be memoized a top-level name.
Make the function to be memoized a monomorphic one.
In line with these requirements, I rebuilt remocollatz as follows:
remocollatz :: Int -> ([Int], Int)
remocollatz 1 = return 1
remocollatz x = mutate x
mutate :: Int -> ([Int], Int)
mutate x = remocollatz =<< mocollatz x
Now it's as top level and as monomorphic as it gets. Running time is about 11 seconds, versus the similarly monomorphized table version:
remocollatz :: Int -> ([Int], Int)
remocollatz 1 = return 1
remocollatz x = case M.lookup x (table mutate) of
    Nothing -> mutate x
    Just y -> y
mutate :: Int -> ([Int], Int)
mutate = \x -> remocollatz =<< mocollatz x
table :: (Int -> ([Int], Int)) -> M.Map Int ([Int], Int)
table f = M.fromList [ (x, f x) | x <- [1..size] ]
Running in less than 4 seconds.
I wonder why the memoization ghc is supposedly performing in the first case here is almost 3 times slower than my dumb table.
"Can I haz an explanation of what happens here, and why the performance differs so vastly between ghc -O2 and plain ghc builds?"
Disclaimer: this is a guess, not verified by viewing GHC core output. A careful answer would do so to verify the conjectures outlined below. You can try peering through it yourself: add -ddump-simpl to your compilation line and you will get copious output detailing exactly what GHC has done to your code.
You write:
remocollatz x = {- ... -} table mutate {- ... -}
  where mutate x = remocollatz =<< mocollatz x
The expression table mutate in fact does not depend on x; but it appears on the right-hand side of an equation that takes x as an argument. Consequently, without optimizations, this table is recomputed each time remocollatz is called (presumably even from inside the computation of table mutate).
With optimizations, GHC notices that table mutate does not depend on x, and floats it to its own definition, effectively producing:
fresh_variable_name = table mutate
  where mutate x = remocollatz =<< mocollatz x
remocollatz x = case M.lookup x fresh_variable_name of
  {- ... -}
The table is therefore computed just once for the entire program run.
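If you would rather not depend on the optimizer doing this floating, the same sharing can be written out by hand. The following is a minimal sketch, not the asker's program: it counts lengths directly as Int instead of going through the tuple monad, and the names limit, collatzStep, lengthTable and collatzLen are made up for illustration. Because lengthTable is a top-level constant that takes no arguments, it should be built at most once per run with or without -O2. Only positive inputs are assumed.

```haskell
import qualified Data.Map.Lazy as M

-- Upper bound of the memo table (illustrative value).
limit :: Int
limit = 10000

collatzStep :: Int -> Int
collatzStep n
  | even n    = n `div` 2
  | otherwise = 3 * n + 1

-- A top-level constant: the map structure is built once, and because
-- Data.Map.Lazy keeps values lazy, each entry is a thunk that is
-- evaluated the first time it is looked up and shared afterwards.
lengthTable :: M.Map Int Int
lengthTable = M.fromList [ (x, collatzLenUncached x) | x <- [1 .. limit] ]

-- Number of terms in the Collatz sequence starting at n (n >= 1).
collatzLen :: Int -> Int
collatzLen n = case M.lookup n lengthTable of
  Just len -> len
  Nothing  -> collatzLenUncached n   -- n is above the table limit

-- One step of real work, then back through the table.
collatzLenUncached :: Int -> Int
collatzLenUncached 1 = 1
collatzLenUncached n = 1 + collatzLen (collatzStep n)
```

In GHCi, collatzLen 13 gives 10, matching the ten terms 13, 40, 20, 10, 5, 16, 8, 4, 2, 1.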
"... don't know why it [the performance] depends so much on some godforsaken compiler flags, while, as I see it, such performance swings should all follow solely from the language."
Sorry, but Haskell doesn't work that way. The language definition tells clearly what the meaning of a given Haskell term is, but does not say anything about the runtime or memory performance needed to compute that meaning.
Another approach to memoization that works in some situations, like this one, is to use a boxed vector, whose elements are computed lazily. The function used to initialize each element can use other elements of the vector in its calculation. As long as the evaluation of an element of the vector doesn't loop and refer to itself, just the elements it recursively depends on will be evaluated. Once evaluated, an element is effectively memoized, and this has the further benefit that elements of the vector that are never referenced are never evaluated.
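The idea in miniature, as a sketch that uses Data.Array from the array package (which ships with GHC) instead of Data.Vector; both hold lazily evaluated boxed elements, which is all the technique needs. The name fibTable is illustrative only.

```haskell
import Data.Array (Array, listArray, (!))

-- A table of Fibonacci numbers whose elements refer back to earlier
-- elements of the same table. Nothing is computed until an element is
-- demanded, and each element is computed at most once.
fibTable :: Int -> Array Int Integer
fibTable n = table
  where
    table = listArray (0, n) (map fib [0 .. n])

    fib :: Int -> Integer
    fib 0 = 0
    fib 1 = 1
    fib i = table ! (i - 1) + table ! (i - 2)
```

Asking for fibTable 50 ! 50 forces only the entries that element 50 depends on, each of them once.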
The Collatz sequence is a nearly ideal application for this technique, but there is one complication. The next Collatz value(s) in sequence from a value under the limit may be outside the limit, which would cause a range error when indexing the vector. I solved this by just iterating through the sequence until back under the limit and counting the steps to do so.
The following program takes 0.77 seconds to run unoptimized and 0.30 when optimized:
import qualified Data.Vector as V
limit = 10 ^ 6 :: Int
-- The Collatz function, which given a value returns the next in the sequence.
nextCollatz val
  | odd val = 3 * val + 1
  | otherwise = val `div` 2
-- Given a value, return the next Collatz value in the sequence that is less
-- than the limit and the number of steps to get there. For example, the
-- sequence starting at 13 is: [13, 40, 20, 10, 5, 16, 8, 4, 2, 1], so if
-- limit is 100, then (nextCollatzWithinLimit 13) is (40, 1), but if limit is
-- 15, then (nextCollatzWithinLimit 13) is (10, 3).
nextCollatzWithinLimit val = (firstInRange, stepsToFirstInRange)
  where
    firstInRange = head rest
    stepsToFirstInRange = 1 + (length biggerThanLimit)
    (biggerThanLimit, rest) = span (>= limit) (tail collatzSeqStartingWithVal)
    collatzSeqStartingWithVal = iterate nextCollatz val
-- A boxed vector holding Collatz length for each index. The collatzFn used
-- to generate the value for each element refers back to other elements of
-- this vector, but since the vector elements are only evaluated as needed and
-- there aren't any loops in the Collatz sequences, the values are calculated
-- only as needed.
collatzVec :: V.Vector Int
collatzVec = V.generate limit collatzFn
  where
    collatzFn :: Int -> Int
    collatzFn index
      | index <= 1 = 1
      | otherwise = (collatzVec V.! nextWithinLimit) + stepsToGetThere
      where
        (nextWithinLimit, stepsToGetThere) = nextCollatzWithinLimit index
main :: IO ()
main = do
  -- Use a fold through the vector to find the longest Collatz sequence under
  -- the limit, and keep track of both the maximum length and the initial
  -- value of the sequence, which is the index.
  let (maxLength, maxIndex) = V.ifoldl' accMaxLen (0, 0) collatzVec
      accMaxLen acc@(accMaxLen, accMaxIndex) index currLen
        | currLen <= accMaxLen = acc
        | otherwise = (currLen, index)
  putStrLn $ "Max Collatz length below " ++ show limit ++ " is "
             ++ show maxLength ++ " at index " ++ show maxIndex
How can computations done in ST be made to run in parallel?
I have a vector which needs to be filled in by random access, hence the use of ST, and the computation runs correctly single-threaded, but I have been unable to figure out how to use more than one core.
Random access is needed because of the meaning of the indices into the vector. There are n things and every possible way of choosing among n things has an entry in the vector, as in the choice function. Each of these choices corresponds to a binary number (conceptually, a packed [Bool]) and these Int values are the indices. If there are n things, then the size of the vector is 2^n. The natural way the algorithm runs is for every entry corresponding to "n choose 1" to be filled in, then every entry for "n choose 2," etc. The entries corresponding to "n choose k" depend on the entries corresponding to "n choose (k-1)." The integers for the different choices do not occur in numerical order, and that's why random access is needed.
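To make the index scheme concrete, here is a small sketch of how the entries for each k are scattered across the index range. It is a brute-force filter over all 2^n indices, not the choiceList function used in the code below, and the names indicesWithK and chosen are hypothetical.

```haskell
import Data.Bits (popCount, testBit)

-- All indices in [0, 2^n) whose binary form has exactly k bits set,
-- i.e. the vector entries that belong to "n choose k".
indicesWithK :: Int -> Int -> [Int]
indicesWithK n k = [ i | i <- [0 .. 2 ^ n - 1], popCount i == k ]

-- Decode an index back into the 0-based positions of the chosen things.
chosen :: Int -> Int -> [Int]
chosen n i = [ b | b <- [0 .. n - 1], testBit i b ]
```

For n = 4, indicesWithK 4 2 is [3,5,6,9,10,12], which interleaves with the indices for k = 1 ([1,2,4,8]) and k = 3 ([7,11,13,14]); chosen 4 10 decodes index 10 (binary 1010) to positions [1,3].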
Here's a pointless (but slow) computation that follows the same pattern. The example function shows how I tried to break the computation up so that the bulk of the work is done in a pure world (no ST monad). In the code below, bogus is where most of the work is done, with the intent of calling that in parallel, but only one core is ever used.
import qualified Data.Vector as Vb
import qualified Data.Vector.Mutable as Vm
import qualified Data.Vector.Generic.Mutable as Vg
import qualified Data.Vector.Generic as Gg
import Control.Monad.ST as ST ( ST, runST )
import Data.Foldable(forM_)
import Data.Char(digitToInt)
main :: IO ()
main = do
  putStrLn $ show (example 9)
example :: Int -> Vb.Vector Int
example n = runST $ do
  m <- Vg.new (2^n) :: ST s (Vm.STVector s Int)
  Vg.unsafeWrite m 0 (1)
  forM_ [1..n] $ \i -> do
    p <- prev m n (i-1)
    let newEntries = (choiceList n i) :: [Int]
    forM_ newEntries $ \e -> do
      let v = bogus p e
      Vg.unsafeWrite m e v
  Gg.unsafeFreeze m
choiceList :: Int -> Int -> [Int]
choiceList _ 0 = [0]
choiceList n 1 = [ 2^k | k <- [0..(n-1) ] ]
choiceList n k
  | n == k = [2^n - 1]
  | otherwise = (choiceList (n-1) k) ++ (map ((2^(n-1)) +) $ choiceList (n-1) (k-1))
prev :: Vm.STVector s Int -> Int -> Int -> ST s Integer
prev m n 0 = return 1
prev m n i = do
  let chs = choiceList n i
  v <- mapM (\k -> Vg.unsafeRead m k ) chs
  let e = map (\k -> toInteger k ) v
  return (sum e)
bogus :: Integer -> Int -> Int
bogus prior index = do
  let f = fac prior
  let g = (f^index) :: Integer
  let d = (map digitToInt (show g)) :: [Int]
  let a = fromIntegral (head d)^2
  a
fac :: Integer -> Integer
fac 0 = 1
fac n = n * fac (n - 1)
If anyone tests this, using more than 9 or 10 in show (example 9) will take much longer than you want to wait for such a pointless sequence of numbers.
Just do it in IO. If you need to use the result in pure code, then unsafePerformIO is available.
The following version runs about 3-4 times faster with +RTS -N16 than +RTS -N1. My changes involved converting the ST vectors to IO, changing the forM_ to forConcurrently_, and adding a bang annotation to let !v = bogus ....
Full code:
{-# LANGUAGE BangPatterns #-}
import qualified Data.Vector as Vb
import qualified Data.Vector.Mutable as Vm
import qualified Data.Vector.Generic.Mutable as Vg
import qualified Data.Vector.Generic as Gg
import Control.Monad.ST as ST ( ST, runST )
import Data.Foldable(forM_)
import Data.Char(digitToInt)
import Control.Concurrent.Async
import System.IO.Unsafe
main :: IO ()
main = do
  let m = unsafePerformIO (example 9)
  putStrLn $ show m
example :: Int -> IO (Vb.Vector Int)
example n = do
  m <- Vg.new (2^n)
  Vg.unsafeWrite m 0 (1)
  forM_ [1..n] $ \i -> do
    p <- prev m n (i-1)
    let newEntries = (choiceList n i) :: [Int]
    forConcurrently_ newEntries $ \e -> do
      let !v = bogus p e
      Vg.unsafeWrite m e v
  Gg.unsafeFreeze m
choiceList :: Int -> Int -> [Int]
choiceList _ 0 = [0]
choiceList n 1 = [ 2^k | k <- [0..(n-1) ] ]
choiceList n k
  | n == k = [2^n - 1]
  | otherwise = (choiceList (n-1) k) ++ (map ((2^(n-1)) +) $ choiceList (n-1) (k-1))
prev :: Vm.IOVector Int -> Int -> Int -> IO Integer
prev m n 0 = return 1
prev m n i = do
  let chs = choiceList n i
  v <- mapM (\k -> Vg.unsafeRead m k ) chs
  let e = map (\k -> toInteger k ) v
  return (sum e)
bogus :: Integer -> Int -> Int
bogus prior index = do
  let f = fac prior
  let g = (f^index) :: Integer
  let d = (map digitToInt (show g)) :: [Int]
  let a = fromIntegral (head d)^2
  a
fac :: Integer -> Integer
fac 0 = 1
fac n = n * fac (n - 1)
I think this cannot be done in a safe way. In the general case, it seems it would break Haskell's referential transparency.
If we could perform multi-threaded computations within ST s, then we could spawn two threads that race over the same STRef s Bool. Let's say one thread is writing False and the other one True.
After we use runST on the computation, we get an expression of type Bool which is sometimes False and sometimes True. That should not be possible.
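The same race is easy to write in IO, where nondeterminism is allowed. A minimal sketch using only base (raceBool is a made-up name, and IORef stands in for STRef):

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Data.IORef (newIORef, readIORef, writeIORef)

-- Two threads write different values to the same reference. Whichever
-- write happens last wins, and that order is up to the scheduler.
raceBool :: IO Bool
raceBool = do
  ref  <- newIORef False
  done <- newEmptyMVar
  _ <- forkIO (writeIORef ref False >> putMVar done ())
  _ <- forkIO (writeIORef ref True  >> putMVar done ())
  -- Wait for both writers to finish before reading.
  takeMVar done
  takeMVar done
  readIORef ref
```

Here the result stays inside IO, so nothing is violated. If an equivalent computation could be handed to runST, the scheduling dependence would leak into a supposedly pure Bool.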
If you are absolutely certain that your parallelization does not break referential transparency, you could try using unsafe primitives like unsafeIOToST to spawn new threads. Use with extreme care.
There might be safer ways to achieve something similar. Outside ST, we do have some parallelism available in Control.Parallel.Strategies.
There are a number of ways to do parallelization in Haskell. Usually they give comparable performance improvements; however, some are better than others, and which one wins mostly depends on the problem being parallelized. This particular use case looked very interesting to me, so I decided to investigate a few approaches.
Approaches
vector-strategies
We are using a boxed vector, so we can utilize laziness and the built-in spark pool for parallelization. One very simple approach is provided by the vector-strategies package, which can iterate over any immutable boxed vector and evaluate all of the thunks in parallel. It is also possible to split the vector into chunks, but as it turns out a chunk size of 1 is the optimal one here:
exampleParVector :: Int -> Vb.Vector Int
exampleParVector n = example n `using` parVector 1
parallel
parVector uses par underneath and requires one extra iteration over the vector. In this case we are already iterating over the vector, so it would actually make more sense to use par from parallel directly. This would allow us to perform the computation in parallel while continuing to use the ST monad:
import Control.Parallel (par)
...
forM_ [1..n] $ \i -> do
  p <- prev m n (i-1)
  let newEntries = choiceList n i :: [Int]
  forM_ newEntries $ \e -> do
    let v = bogus p e
    v `par` Vg.unsafeWrite m e v
It is important to note that the computation of each element of the vector is expensive when compared to the total number of elements in the vector. That is why using par is a very good solution here. If it was the opposite, namely the vector was very large, but elements weren't too expensive to compute, it would be better to use an unboxed vector and switch it to a different parallelization method.
async
Another way was described by @K.A.Buhr. Switch to IO from ST and use async:
import Control.Concurrent.Async (forConcurrently_)
...
forM_ [1..n] $ \i -> do
  p <- prev m n (i-1)
  let newEntries = choiceList n i :: [Int]
  forConcurrently_ newEntries $ \e -> do
    let !v = bogus p e
    Vg.unsafeWrite m e v
The concern that @chi has raised is a valid one; however, in this particular implementation it is safe to use unsafePerformIO instead of runST, because the parallelization does not violate the invariant of deterministic computation. Namely, we can promise that regardless of the input supplied to the example function, the output will always be exactly the same.
scheduler
Green threads are pretty cheap in Haskell, but they aren't free. The solution above with the async package has one slight drawback: it will spin up at least as many threads as there are elements in the newEntries list each time forConcurrently_ is called. It would be better to spin up as many threads as there are capabilities (the -N RTS option) and let them do all the work. For this we can use the scheduler package, which is a work-stealing scheduler:
import Control.Scheduler (Comp(Par), runBatch_, scheduleWork_, withScheduler_)
...
withScheduler_ Par $ \scheduler ->
  forM_ [1..n] $ \i -> runBatch_ scheduler $ \_ -> do
    p <- prev m n (i-1)
    let newEntries = choiceList n i :: [Int]
    forM_ newEntries $ \e -> scheduleWork_ scheduler $ do
      let !v = bogus p e
      Vg.unsafeWrite m e v
The spark pool in GHC also uses a work-stealing scheduler, which is built into the RTS and is unrelated to the package above in any shape or form, but the idea is very similar: few threads with many units of computation.
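To make "few threads, many units of computation" concrete without pulling in either library, here is a rough sketch built only on base. It is a plain shared queue guarded by an MVar, not work stealing, and it is not how scheduler or the GHC spark pool are implemented; runPool and sumWithPool are hypothetical names. A job that throws would leave the pool waiting forever, so treat it as an illustration only.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (modifyMVar, newEmptyMVar, newMVar, putMVar, takeMVar)
import Control.Monad (replicateM_)
import Data.IORef (atomicModifyIORef', newIORef, readIORef)

-- Run all jobs using a fixed number of worker threads. Each worker
-- repeatedly takes the next job off the shared queue until it is empty.
runPool :: Int -> [IO ()] -> IO ()
runPool workers jobs = do
  queue <- newMVar jobs
  done  <- newEmptyMVar
  let n = max 1 workers
      worker = do
        next <- modifyMVar queue $ \js -> case js of
          []         -> pure ([], Nothing)
          (j : rest) -> pure (rest, Just j)
        case next of
          Nothing -> putMVar done ()   -- queue drained: report and stop
          Just j  -> j >> worker
  replicateM_ n (forkIO worker)
  replicateM_ n (takeMVar done)        -- wait for every worker to finish

-- Example use: many tiny jobs, each adding one number to a shared total.
sumWithPool :: Int -> [Int] -> IO Int
sumWithPool workers xs = do
  total <- newIORef 0
  runPool workers [ atomicModifyIORef' total (\s -> (s + x, ())) | x <- xs ]
  readIORef total
```

The number of threads is fixed by the first argument, no matter how many jobs are queued, which is the property the paragraph above is after.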
Benchmarks
Here are some benchmarks on a 16-core machine for all of the approaches with example 7 (value 9 takes on the order of seconds, which introduces too much noise for criterion). We only get about a 5x speedup, because a significant part of the algorithm is sequential in nature and can't be parallelized.
I have a random number generator
rand :: Int -> Int -> IO Int
rand low high = getStdRandom (randomR (low,high))
and a helper function to remove an element from a list
removeItem _ [] = []
removeItem x (y:ys) | x == y = removeItem x ys
                    | otherwise = y : removeItem x ys
I want to shuffle a given list by randomly picking an item from the list, removing it and adding it to the front of the list. I tried
shuffleList :: [a] -> IO [a]
shuffleList [] = []
shuffleList l = do
    y <- rand 0 (length l)
    return( y:(shuffleList (removeItem y l) ) )
But can't get it to work. I get
hw05.hs:25:33: error:
* Couldn't match expected type `[Int]' with actual type `IO [Int]'
* In the second argument of `(:)', namely
....
Any idea ?
Thanks!
Since shuffleList :: [a] -> IO [a], we have shuffleList (xs :: [a]) :: IO [a].
Obviously, we can't cons (:) :: a -> [a] -> [a] an element of type a onto an IO [a] value; instead we want to cons it onto the list of type [a] whose computation that IO [a] value describes:
do
  y <- rand 0 (length l)
  -- return ( y : (shuffleList (removeItem y l) ) )
  shuffled <- shuffleList (removeItem y l)
  return (y : shuffled)
In do notation, values to the right of <- have types M a, M b, etc., for some monad M (here, IO), and values to the left of <- have the corresponding types a, b, etc.
The x :: a in x <- mx gets bound to the pure value of type a produced / computed by the M-type computation which the value mx :: M a denotes, when that computation is actually performed, as a part of the combined computation represented by the whole do block, when that combined computation is performed as a whole.
And if e.g. the next line in that do block is y <- foo x, it means that a pure function foo :: a -> M b is applied to x and the result is calculated which is a value of type M b, denoting an M-type computation which then runs and produces / computes a pure value of type b to which the name y is then bound.
The essence of Monad is thus this slicing of the pure inside / between the (potentially) impure: two timelines run side by side, one of pure calculations and one of potentially impure computations, with the pure world safely separated and isolated from the impurities of the real world. Or, seen from the other side, the pure code is run by the real impure code interacting with the real world (in case M is IO). Which is what computer programs must do, after all.
Your removeItem is wrong. You should pick and remove items positionally, i.e. by index, not by value; and in any case not remove more than one item after having picked one item from the list.
The y in y <- rand 0 (length l) is indeed an index. Treat it as such. Rename it to i, too, as a simple mnemonic.
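Putting these points together, a positional version might look like the sketch below. It is written against any monad and takes the index picker as a parameter so that it does not depend on a particular random library; shuffleWith and demoReversed are hypothetical names, and pick hi is assumed to return an index between 0 and hi inclusive.

```haskell
import Data.Functor.Identity (Identity (..))

-- Repeatedly pick a position, pull out the element at that position,
-- and shuffle what is left. Exactly one element is removed per step.
shuffleWith :: Monad m => (Int -> m Int) -> [a] -> m [a]
shuffleWith _    [] = return []
shuffleWith pick xs = do
  i <- pick (length xs - 1)              -- highest valid index
  case splitAt i xs of
    (before, x : after) -> do
      rest <- shuffleWith pick (before ++ after)
      return (x : rest)
    _ -> error "shuffleWith: picked index is too large"

-- Deterministic check: a picker that always returns the last index
-- pulls elements off the end, so the list comes out reversed.
demoReversed :: [Int]
demoReversed = runIdentity (shuffleWith Identity [1, 2, 3])
```

With the rand from the question, shuffleList = shuffleWith (rand 0) would have the type [a] -> IO [a], since rand 0 hi already covers the inclusive range 0 to hi.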
Generally, with Haskell it works better to maximize the amount of functional code at the expense of non-functional (IO or randomness-related) code.
In your situation, your “maximum” functional component is not removeItem but rather a version of shuffleList that takes the input list and (as mentioned by Will Ness) a deterministic integer position. The list function splitAt :: Int -> [a] -> ([a], [a]) can come in handy here. Like this:
funcShuffleList :: Int -> [a] -> [a]
funcShuffleList _ [] = []
funcShuffleList pos ls =
  if (pos <=0) || (length(take (pos+1) ls) < (pos+1))
    then ls -- pos is zero or out of bounds, so leave list unchanged
    else let (left,right) = splitAt pos ls
         in (head right) : (left ++ (tail right))
Testing:
λ>
λ> funcShuffleList 4 [0,1,2,3,4,5,6,7,8,9]
[4,0,1,2,3,5,6,7,8,9]
λ>
λ> funcShuffleList 5 "@ABCDEFGH"
"E@ABCDFGH"
λ>
Once you've got this, you can introduce randomness concerns in a simpler fashion. And you do not need to involve IO explicitly, as any randomness-friendly monad will do:
shuffleList :: MonadRandom mr => [a] -> mr [a]
shuffleList [] = return []
shuffleList ls =
  do
    let maxPos = (length ls) - 1
    pos <- getRandomR (0, maxPos)
    return (funcShuffleList pos ls)
... IO being just one instance of MonadRandom.
You can run the code using the default IO-hosted random number generator:
main = do
  let inpList = [0,1,2,3,4,5,6,7,8]::[Integer]
  putStrLn $ "inpList = " ++ (show inpList)
  -- mr automatically instantiated to IO:
  outList1 <- shuffleList inpList
  putStrLn $ "outList1 = " ++ (show outList1)
  outList2 <- shuffleList outList1
  putStrLn $ "outList2 = " ++ (show outList2)
Program output:
$ pickShuffle
inpList = [0,1,2,3,4,5,6,7,8]
outList1 = [6,0,1,2,3,4,5,7,8]
outList2 = [8,6,0,1,2,3,4,5,7]
$
$ pickShuffle
inpList = [0,1,2,3,4,5,6,7,8]
outList1 = [4,0,1,2,3,5,6,7,8]
outList2 = [2,4,0,1,3,5,6,7,8]
$
The output is not reproducible here, because the default generator is seeded by its launch time in nanoseconds.
If what you need is a full random permutation, you could have a look at the Fisher-Yates (a.k.a. Knuth) shuffle algorithm.
I'm trying to write a function that, for a given n, produces an n*n matrix with unique rows and columns (a Latin square).
I have a function that gives me a list of strings "1", "2", ..., "n":
numSymbol:: Int -> [String]
I tried to generate all permutations of this, then all n-length tuples of permutations, and then check whether each one is unique in its rows and columns. But with complexity (n!)^2, this works fine for 2 and 3 and takes forever for n > 3. Is it possible to build Latin squares from the permutations directly, for example from
permutation ( numSymbol 3) = [["1","2","3"],["1","3","2"],["2","1","3"],["2","3","1"],["3","1","2"],["3","2","1"]]
get
[[["1","2","3"],["2","1","3"],["3","1","2"]], ....]
without generating lists like [["1",...],["1",...],...], which we already know are disqualified by their first elements?
Note: since we can easily take a Latin square that's been filled with numbers from 1 to n and re-label it with anything we want, we can write code that uses integer symbols without giving anything away, so let's stick with that.
Anyway, the stateful backtracking/nondeterministic monad:
type StateList s = StateT s []
is helpful for this sort of problem.
Here's the idea. We know that every symbol s is going to appear exactly once in each row r, so we can represent this with an urn of all possible ordered pairs (r,s):
my_rs_urn = [(r,s) | r <- [1..n], s <- [1..n]]
Similarly, as every symbol s appears exactly once in each column c, we can use a second urn:
my_cs_urn = [(c,s) | c <- [1..n], s <- [1..n]]
Creating a Latin square is a matter of filling in each position (r,c) with a symbol s by removing matching balls (r,s) and (c,s) (i.e., removing two balls, one from each urn) so that every ball is used exactly once. Our state will be the content of the urns.
We need backtracking because we might reach a point where for a particular position (r,c), there is no s such that (r,s) and (c,s) are both still available in their respective urns. Also, a pleasant side-effect of list-based backtracking/nondeterminism is that it'll generate all possible Latin squares, not just the first one it finds.
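Before the real thing, here is the pattern in miniature: one urn, and an action that draws a ball from it in every possible way. This is only a sketch with made-up names (picks, draw, drawAll), and it imports from transformers directly rather than from Control.Monad.State.

```haskell
import Control.Monad (replicateM)
import Control.Monad.Trans.Class (lift)
import Control.Monad.Trans.State (StateT, evalStateT, get, put)

-- Every way of pulling one element out of a list, with the leftovers.
picks :: [a] -> [(a, [a])]
picks []       = []
picks (x : xs) = (x, xs) : [ (y, x : ys) | (y, ys) <- picks xs ]

-- Draw one ball from the urn held in the state. The list monad
-- underneath explores each possible draw as a separate branch, and
-- each branch carries its own copy of the remaining urn.
draw :: StateT [a] [] a
draw = do
  urn <- get
  (x, rest) <- lift (picks urn)
  put rest
  return x

-- All ordered ways of drawing k balls without replacement.
drawAll :: Int -> [a] -> [[a]]
drawAll k = evalStateT (replicateM k draw)
```

For example, drawAll 2 [1,2,3] yields [[1,2],[1,3],[2,1],[2,3],[3,1],[3,2]], and asking for more balls than the urn holds yields [], because every branch dead-ends and is silently dropped.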
Given this, our state will look like:
type Urn = [(Int,Int)]
data S = S
  { size :: Int
  , rs :: Urn
  , cs :: Urn }
I've included the size in the state for convenience. It won't ever be modified, so it actually ought to be in a Reader instead, but this is simpler.
We'll represent a square by a list of cell contents in row-major order (i.e., the symbols in positions [(1,1),(1,2),...,(1,n),(2,1),...,(n,n)]):
data Square = Square
  Int -- square size
  [Int] -- symbols in row-major order
  deriving (Show)
Now, the monadic action to generate latin squares will look like this:
type M = StateT S []
latin :: M Square
latin = do
  n <- gets size
  -- for each position (r,c), get a valid symbol `s`
  cells <- forM (pairs n) (\(r,c) -> getS r c)
  return $ Square n cells
pairs :: Int -> [(Int,Int)]
pairs n = -- same as [(x,y) | x <- [1..n], y <- [1..n]]
  (,) <$> [1..n] <*> [1..n]
The worker function getS picks an s so that (r,s) and (c,s) are available in the respective urns, removing those pairs from the urns as a side effect. Note that getS is written non-deterministically, so it'll try every possible way of picking an s and associated balls from the urns:
getS :: Int -> Int -> M Int
getS r c = do
  -- try each possible `s` in the row
  s <- pickSFromRow r
  -- can we put `s` in this column?
  pickCS c s
  -- if so, `s` is good
  return s
Most of the work is done by the helpers pickSFromRow and pickCS. The first, pickSFromRow, picks an s from the given row:
pickSFromRow :: Int -> M Int
pickSFromRow r = do
  balls <- gets rs
  -- "lift" here non-deterministically picks balls
  ((r',s), rest) <- lift $ choices balls
  -- only consider balls in matching row
  guard $ r == r'
  -- remove the ball
  modify (\st -> st { rs = rest })
  -- return the candidate "s"
  return s
It uses a choices helper which generates every possible way of pulling one element out of a list:
choices :: [a] -> [(a,[a])]
choices = init . (zipWith f <$> inits <*> tails)
  where f a (x:b) = (x, a++b)
        f _ _ = error "choices: internal error"
The second, pickCS, checks if (c,s) is available in the cs urn, and removes it if it is:
pickCS :: Int -> Int -> M ()
pickCS c s = do
  balls <- gets cs
  -- only continue if the required ball is available
  guard $ (c,s) `elem` balls
  -- remove the ball
  modify (\st -> st { cs = delete (c,s) balls })
With an appropriate driver for our monad:
runM :: Int -> M a -> [a]
runM n act = evalStateT act (S n p p)
  where p = pairs n
this can generate all 12 Latin squares of size 3:
λ> runM 3 latin
[Square 3 [1,2,3,2,3,1,3,1,2],Square 3 [1,2,3,3,1,2,2,3,1],...]
or the 576 Latin squares of size 4:
λ> length $ runM 4 latin
576
Compiled with -O2, it's fast enough to enumerate all 161280 squares of size 5 in a couple seconds:
main :: IO ()
main = print $ length $ runM 5 latin
The list-based urn representation above isn't very efficient. On the other hand, because the lengths of the lists are pretty small, there's not that much to be gained by finding more efficient representations.
Nonetheless, here's complete code that uses efficient Map/Set representations tailored to the way the rs and cs urns are used. Compiled with -O2, it runs in constant space. For n=6, it can process about 100000 Latin squares per second, but that still means it'll need to run for a few hours to enumerate all 800 million of them.
{-# OPTIONS_GHC -Wall #-}
module LatinAll where
import Control.Monad.State
import Data.List
import Data.Set (Set)
import qualified Data.Set as Set
import Data.Map (Map, (!))
import qualified Data.Map as Map
data S = S
  { size :: Int
  , rs :: Map Int [Int]
  , cs :: Set (Int, Int) }
data Square = Square
  Int -- square size
  [Int] -- symbols in row-major order
  deriving (Show)
type M = StateT S []
-- Get Latin squares
latin :: M Square
latin = do
  n <- gets size
  cells <- forM (pairs n) (\(r,c) -> getS r c)
  return $ Square n cells
-- All locations in row-major order [(1,1),(1,2)..(n,n)]
pairs :: Int -> [(Int,Int)]
pairs n = (,) <$> [1..n] <*> [1..n]
-- Get a valid `s` for position `(r,c)`.
getS :: Int -> Int -> M Int
getS r c = do
  s <- pickSFromRow r
  pickCS c s
  return s
-- Get an available `s` in row `r` from the `rs` urn.
pickSFromRow :: Int -> M Int
pickSFromRow r = do
  urn <- gets rs
  (s, rest) <- lift $ choices (urn ! r)
  modify (\st -> st { rs = Map.insert r rest urn })
  return s
-- Remove `(c,s)` from the `cs` urn.
pickCS :: Int -> Int -> M ()
pickCS c s = do
  balls <- gets cs
  guard $ (c,s) `Set.member` balls
  modify (\st -> st { cs = Set.delete (c,s) balls })
-- Return all ways of removing one element from list.
choices :: [a] -> [(a,[a])]
choices = init . (zipWith f <$> inits <*> tails)
  where f a (x:b) = (x, a++b)
        f _ _ = error "choices: internal error"
-- Run an action in the M monad.
runM :: Int -> M a -> [a]
runM n act = evalStateT act (S n rs0 cs0)
  where rs0 = Map.fromAscList $ zip [1..n] (repeat [1..n])
        cs0 = Set.fromAscList $ pairs n
main :: IO ()
main = do
  print $ runM 3 latin
  print $ length (runM 4 latin)
  print $ length (runM 5 latin)
Somewhat remarkably, modifying the program to produce only reduced Latin squares (i.e., with symbols [1..n] in order in both the first row and the first column) requires changing only two functions:
-- All locations in row-major order, skipping first row and column
-- i.e., [(2,2),(2,3)..(n,n)]
pairs :: Int -> [(Int,Int)]
pairs n = (,) <$> [2..n] <*> [2..n]
-- Run an action in the M monad.
runM :: Int -> M a -> [a]
runM n act = evalStateT act (S n rs0 cs0)
  where -- skip balls [(1,1)..(n,n)] for first row
        rs0 = Map.fromAscList $ map (\r -> (r, skip r)) [2..n]
        -- skip balls [(1,1)..(n,n)] for first column
        cs0 = Set.fromAscList $ [(c,s) | c <- [2..n], s <- skip c]
        skip i = [1..(i-1)]++[(i+1)..n]
With these modifications, the resulting Square will include symbols in row-major order but skipping the first row and column. For example:
λ> runM 3 latin
[Square 3 [3,1,1,2]]
means:
1 2 3    fill in question marks    1 2 3
2 ? ?    =====================>    2 3 1
3 ? ?    in row-major order        3 1 2
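If the full grid is wanted back, the fixed first row and column can be re-attached mechanically. A small sketch, where expandReduced is a hypothetical helper that takes the size and the cell list as plain arguments rather than the Square type above:

```haskell
-- Rebuild all n rows from the (n-1)*(n-1) cells of a reduced square.
-- The first row is [1..n], and row r (for r >= 2) starts with r.
expandReduced :: Int -> [Int] -> [[Int]]
expandReduced n cells =
  [1 .. n] : zipWith (:) [2 .. n] (chunksOf (n - 1) cells)
  where
    -- Split a list into consecutive pieces of length k.
    chunksOf k xs
      | k <= 0 || null xs = []
      | otherwise         = let (h, t) = splitAt k xs in h : chunksOf k t
```

Applied to the example, expandReduced 3 [3,1,1,2] gives [[1,2,3],[2,3,1],[3,1,2]], the grid on the right of the diagram.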
This is fast enough to enumerate all 16,942,080 reduced Latin squares of size 7 in a few minutes:
$ stack ghc -- -O2 -main-is LatinReduced LatinReduced.hs && time ./LatinReduced
[1 of 1] Compiling LatinReduced ( LatinReduced.hs, LatinReduced.o )
Linking LatinReduced ...
16942080
real 3m9.342s
user 3m8.494s
sys 0m0.848s
Here's a simple function. It takes an input Int and returns a (possibly empty) list of (Int, Int) pairs, where the input Int is the sum of the cubed elements of any of the pairs.
cubeDecomposition :: Int -> [(Int, Int)]
cubeDecomposition n = [(x, y) | x <- [1..m], y <- [x..m], x^3 + y^3 == n]
  where m = truncate $ fromIntegral n ** (1/3)
-- cubeDecomposition 1729
-- [(1,12),(9,10)]
I want to test the property that the above is true; if I cube each element and sum any of the return tuples, then I get my input back:
import Control.Arrow
cubedElementsSumToN :: Int -> Bool
cubedElementsSumToN n = all (== n) d
  where d = map (uncurry (+) . ((^3) *** (^3))) (cubeDecomposition n)
For runtime considerations, I'd like to limit the input Ints to a certain size when testing this with QuickCheck. I can define an appropriate type and Arbitrary instance:
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
import Test.QuickCheck
newtype SmallInt = SmallInt Int
  deriving (Show, Eq, Enum, Ord, Num, Real, Integral)
instance Arbitrary SmallInt where
  arbitrary = fmap SmallInt (choose (-10000000, 10000000))
And then I guess I have to define versions of the function and property that use SmallInt rather than Int:
cubeDecompositionQC :: SmallInt -> [(SmallInt, SmallInt)]
cubeDecompositionQC n = [(x, y) | x <- [1..m], y <- [x..m], x^3 + y^3 == n]
  where m = truncate $ fromIntegral n ** (1/3)
cubedElementsSumToN' :: SmallInt -> Bool
cubedElementsSumToN' n = all (== n) d
  where d = map (uncurry (+) . ((^3) *** (^3))) (cubeDecompositionQC n)
-- cubeDecompositionQC 1729
-- [(SmallInt 1,SmallInt 12),(SmallInt 9,SmallInt 10)]
This works fine, and the standard 100 tests pass as expected. But it seems unnecessary to define a new type, instance, and function when all I really need is a custom generator. So I tried this:
smallInts :: Gen Int
smallInts = choose (-10000000, 10000000)
cubedElementsSumToN'' :: Int -> Property
cubedElementsSumToN'' n = forAll smallInts $ \m -> all (== n) (d m)
  where d = map (uncurry (+) . ((^3) *** (^3)))
          . cubeDecomposition
Now, the first few times I ran this, everything worked fine and all tests passed. But on subsequent runs I observed failures. Bumping up the test size reliably finds one:
*** Failed! Falsifiable (after 674 tests and 1 shrink):
0
8205379
I'm a bit confused here due to the presence of two shrunken inputs - 0 and 8205379 - returned from QuickCheck, where I would intuitively expect one. Also, those inputs work as predicted (on my show-able property, at least):
*Main> cubedElementsSumToN 0
True
*Main> cubedElementsSumToN 8205379
True
So there is evidently a problem in the property that uses the custom Gen I defined.
What have I done wrong?
I quickly realized that the property as I've written it is obviously incorrect. Here's the proper way to do it, using the original cubedElementsSumToN property:
quickCheck (forAll smallInts cubedElementsSumToN)
which reads quite naturally.
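To see why the first attempt failed, it may help to strip QuickCheck away. In cubedElementsSumToN'' the argument n is generated by QuickCheck's default Int generator while m comes from smallInts, so the decomposition of m is compared against an unrelated n; that is also why two values appear in the counterexample. The following sketch reproduces this with plain functions. The names cubeSums, buggyCheck and fixedCheck are made up for illustration.

```haskell
cubeDecomposition :: Int -> [(Int, Int)]
cubeDecomposition n = [(x, y) | x <- [1..m], y <- [x..m], x^3 + y^3 == n]
  where m = truncate (fromIntegral n ** (1/3) :: Double)

-- Sums of cubes for every pair in the decomposition of k.
cubeSums :: Int -> [Int]
cubeSums k = [x^3 + y^3 | (x, y) <- cubeDecomposition k]

-- Mirrors the faulty property: decomposes m but compares against n.
buggyCheck :: Int -> Int -> Bool
buggyCheck n m = all (== n) (cubeSums m)

-- Mirrors the corrected property: one value is decomposed and compared.
fixedCheck :: Int -> Bool
fixedCheck m = all (== m) (cubeSums m)
```

With the reported counterexample, buggyCheck 0 8205379 is False because 8205379 = 59^3 + 200^3 has a non-empty decomposition whose sum is not 0, whereas fixedCheck 8205379 is True.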
Task: "Sum the first 15,000,000 even numbers."
Haskell:
nats = [1..] :: [Int]
evens = filter even nats :: [Int]
mySum :: Int
mySum = sum $ take 15000000 evens
...but mySum takes ages. More precisely, it is about 10-20 times slower than C/C++.
Many times I've found that a naturally coded Haskell solution is something like 10 times slower than C. I expected GHC to be a very good optimizing compiler, and a task such as this doesn't seem that tough.
So, one would expect something like 1.5-2x slower than C. Where is the problem?
Can this be solved better?
This is the C code I'm comparing it with:
long long sum = 0;
int n = 0, i = 1;
for (;;) {
    if (i % 2 == 0) {
        sum += i;
        n++;
    }
    if (n == 15000000)
        break;
    i++;
}
Edit 1: I know that it can be computed in O(1). Please, resist.
Edit 2: I know that the evens are [2,4..], but the function even could be some other O(1) predicate, and it needs to be implemented as a function.
Lists are not loops
So don't be surprised if you get slower code when you use lists as a loop replacement and the loop body is small.
nats = [1..] :: [Int]
evens = filter even nats :: [Int]
dumbSum :: Int
dumbSum = sum $ take 15000000 evens
sum is not a "good consumer", so GHC is not (yet) able to eliminate the intermediate lists completely.
If you compile with optimisations (and don't export nats), GHC is smart enough to fuse the filter with the enumeration,
Rec {
Main.main_go [Occ=LoopBreaker]
:: GHC.Prim.Int# -> GHC.Prim.Int# -> [GHC.Types.Int]
[GblId, Arity=1, Caf=NoCafRefs, Str=DmdType L]
Main.main_go =
\ (x_aV2 :: GHC.Prim.Int#) ->
let {
r_au7 :: GHC.Prim.Int# -> [GHC.Types.Int]
[LclId, Str=DmdType]
r_au7 =
case x_aV2 of wild_Xl {
__DEFAULT -> Main.main_go (GHC.Prim.+# wild_Xl 1);
9223372036854775807 -> n_r1RR
} } in
case GHC.Prim.remInt# x_aV2 2 of _ {
__DEFAULT -> r_au7;
0 ->
let {
wild_atm :: GHC.Types.Int
[LclId, Str=DmdType m]
wild_atm = GHC.Types.I# x_aV2 } in
let {
lvl_s1Rp :: [GHC.Types.Int]
[LclId]
lvl_s1Rp =
GHC.Types.:
@ GHC.Types.Int wild_atm (GHC.Types.[] @ GHC.Types.Int) } in
\ (m_aUL :: GHC.Prim.Int#) ->
case GHC.Prim.<=# m_aUL 1 of _ {
GHC.Types.False ->
GHC.Types.: @ GHC.Types.Int wild_atm (r_au7 (GHC.Prim.-# m_aUL 1));
GHC.Types.True -> lvl_s1Rp
}
}
end Rec }
but that's as far as GHC's fusion takes it. You are left with boxing Ints and constructing list cells. If you give it a loop, like you give it to the C compiler,
module Main where
import Data.Bits
main :: IO ()
main = print dumbSum
dumbSum :: Int
dumbSum = go 0 0 1
  where
    go :: Int -> Int -> Int -> Int
    go sm ct n
        | ct >= 15000000 = sm
        | n .&. 1 == 0   = go (sm + n) (ct+1) (n+1)
        | otherwise      = go sm ct (n+1)
you get the approximate relation of running times between the C and the Haskell version you expected.
This sort of algorithm is not what GHC has been taught to optimise well; there are bigger fish to fry elsewhere before the limited manpower is put into these optimisations.
The reason why list fusion can't work here is actually rather subtle. Say we define the right RULE to fuse the list away:
import GHC.Base
sum2 :: Num a => [a] -> a
sum2 = sum
{-# NOINLINE [1] sum2 #-}
{-# RULES "sum" forall (f :: forall b. (a->b->b)->b->b).
sum2 (build f) = f (+) 0 #-}
(The short explanation is that we define sum2 as an alias of sum, which we forbid GHC to inline early, so the RULE has a chance to fire before sum2 gets eliminated. Then we look for sum2 directly next to the list-builder build (see definition) and replace it by direct arithmetic.)
This has mixed success, as it yields the following Core:
Main.$wgo =
\ (w_s1T4 :: GHC.Prim.Int#) ->
case GHC.Prim.remInt# w_s1T4 2 of _ {
__DEFAULT ->
case w_s1T4 of wild_Xg {
__DEFAULT -> Main.$wgo (GHC.Prim.+# wild_Xg 1);
15000000 -> 0
};
0 ->
case w_s1T4 of wild_Xg {
__DEFAULT ->
case Main.$wgo (GHC.Prim.+# wild_Xg 1) of ww_s1T7 { __DEFAULT ->
GHC.Prim.+# wild_Xg ww_s1T7
};
15000000 -> 15000000
}
}
Which is nice, completely fused code - with the sole problem that we have a call to $wgo in a non-tail-call position. This means that we aren't looking at a loop, but actually at a deeply recursive function, with predictable program results:
Stack space overflow: current size 8388608 bytes.
The root problem here is that the Prelude's list fusion can only fuse right folds, and computing the sum as a right fold directly causes the excessive stack consumption.
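The difference between the two fold shapes can be sketched with hand-written recursion. The names sumRight and sumLeft are illustrative, not library functions: the first has the same shape as the fused right fold above, the second is the accumulator loop one would want instead.

```haskell
{-# LANGUAGE BangPatterns #-}

-- Right-fold shape: the addition happens after the recursive call
-- returns, so every element needs its own stack frame.
sumRight :: [Int] -> Int
sumRight []       = 0
sumRight (x : xs) = x + sumRight xs

-- Left-fold shape: the recursive call is in tail position and the
-- accumulator is forced at each step, so it runs in constant stack.
sumLeft :: [Int] -> Int
sumLeft = go 0
  where
    go !acc []       = acc
    go !acc (x : xs) = go (acc + x) xs
```

Both compute the same result on short lists; they differ only in how much stack they need as the list grows.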
The obvious fix would be to use a fusion framework that can actually deal with left folds, such as Duncan's stream-fusion package, which actually implements sum fusion.
Another solution would be to hack around it - and implement the left fold using a right fold:
main = print $ foldr (\x c -> c . (+x)) id [2,4..15000000] 0
This actually produces close-to-perfect code with current versions of GHC. On the other hand, this is generally not a good idea as it relies on GHC being smart enough to eliminate the partially applied functions. Already adding a filter into the chain will break that particular optimization.
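Spelled out as a reusable function, the trick looks roughly like this. The name sumViaFoldr is hypothetical; each element contributes a function that updates a running total, and the composed function is finally applied to 0. Whether GHC compiles this to a tight loop depends on the compiler version and optimisation flags, so treat it as a sketch rather than a guarantee.

```haskell
-- Left fold expressed through foldr: each step builds a function
-- from the running total to the final result.
sumViaFoldr :: [Int] -> Int
sumViaFoldr xs = foldr step id xs 0
  where
    -- k is the continuation for the rest of the list; ($!) forces
    -- the new total before passing it on.
    step x k acc = k $! acc + x
```

The strict application is an addition to the one-liner above; it keeps the running total evaluated even when GHC does not optimise the closures away.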
Sum first 15,000,000 even numbers:
{-# LANGUAGE BangPatterns #-}
g :: Integer -- 15000000*15000001 = 225000015000000
g = go 1 0 0
  where
    go i !a c | c == 15000000 = a
    go i !a c | even i = go (i+1) (a+i) (c+1)
    go i !a c = go (i+1) a c
ought to be the fastest.
If you want to be sure to traverse the list only once, you can write the traversal explicitly:
nats = [1..] :: [Int]
requiredOfX :: Int -> Bool -- this way you can write a different requirement
requiredOfX x = even x
dumbSum :: Int
dumbSum = dumbSum' 0 0 nats
  where dumbSum' acc 15000000 _ = acc
        dumbSum' acc count (x:xs)
          | requiredOfX x = dumbSum' (acc + x) (count + 1) xs
          | otherwise     = dumbSum' acc count xs  -- skipped elements are not counted
First, you can be clever as young Gauss was and compute the sum in O(1).
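For reference, the closed form follows from 2 + 4 + ... + 2k = 2(1 + 2 + ... + k) = k(k+1). A small sketch, with sumFirstEvens as an illustrative name:

```haskell
-- Sum of the first k positive even numbers: 2 + 4 + ... + 2k = k * (k + 1).
sumFirstEvens :: Integer -> Integer
sumFirstEvens k = k * (k + 1)
```

For k = 15000000 this gives 225000015000000.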
Fun stuff aside, your Haskell solution uses lists. I'm quite sure your C/C++ solution doesn't. (Haskell lists are very easy to use so one is tempted to use them even in cases where it might not be appropriate.) Try benchmarking this:
sumBy2 :: Integer -> Integer
sumBy2 = f 0
  where
    f result n | n <= 1    = result
               | otherwise = f (n + result) (n - 2)
Compile it using GHC with the -O2 flag. This function is tail-recursive, so the compiler can implement it very efficiently.
Update: If you want a version that uses the even function, that's possible:
sumBy2 :: Integer -> Integer
sumBy2 = f 0
  where
    f result n | n <= 0    = result
               | even n    = f (n + result) (n - 1)
               | otherwise = f result (n - 1)
You can also easily make the filtering function a parameter:
sumFilter :: (Integral a) => (a -> Bool) -> a -> a
sumFilter filtfn = f 0
  where
    f result n | n <= 0    = result
               | filtfn n  = f (n + result) (n - 1)
               | otherwise = f result (n - 1)
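As a usage sketch (repeating the definition so the snippet stands alone), note that the second argument is an upper bound on the numbers examined rather than a count of matches, so the first 15,000,000 even numbers correspond to a bound of 30,000,000. The names sumEvensUpTo and sumMultiplesOf3UpTo are made up for illustration.

```haskell
sumFilter :: (Integral a) => (a -> Bool) -> a -> a
sumFilter filtfn = f 0
  where
    f result n | n <= 0    = result
               | filtfn n  = f (n + result) (n - 1)
               | otherwise = f result (n - 1)

-- Sum of all even numbers in [1..n].
sumEvensUpTo :: Int -> Int
sumEvensUpTo = sumFilter even

-- Any other predicate works the same way.
sumMultiplesOf3UpTo :: Int -> Int
sumMultiplesOf3UpTo = sumFilter (\x -> x `mod` 3 == 0)
```

Under these assumptions, sumEvensUpTo 30000000 is the call that corresponds to the original task.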
A strict left fold (foldl' from Data.List) works much faster:
foldl' (+) 0 $ take 15000000 [2, 4..]
Another thing to note is that nats and evens are so-called Constant Applicative Forms, or CAFs for short. Basically, those correspond to top-level definitions without any arguments. CAFs are a bit of an odd duck, for instance being the reason for the Dreaded Monomorphism Restriction; I'm not sure the language definition even allows CAFs to be inlined.
In my mental model of how Haskell executes, by the time dumbSum returns a value, evens will be evaluated to look something like 2:4: ... : 30000000 : <thunk> and nats to 1:2: ... : 30000000 : <thunk>, where the <thunk>s represent something that's not been looked at yet. If my understanding is correct, these allocations of : do have to happen and can't be optimized away.
So one way of speeding things up without altering your code too much would be to simply write:
dumbSum :: Int
dumbSum = sum . take 15000000 . filter even $ [1..]
or
dumbSum = sum $ take 15000000 evens where
  nats = [1..]
  evens = filter even nats
On my machine, compiled with -O2, that alone seems to result in a roughly 30% speedup.
I'm no GHC connoisseur (I've never even profiled a Haskell program!), so I could be wildly off the mark, though.