Optimizing Perlin noise in Haskell

(Dependencies for this program: vector --any and JuicyPixels >= 2. Code is available as Gist.)
{-# LANGUAGE Haskell2010 #-}
{-# LANGUAGE BangPatterns #-}
import Control.Arrow
import Data.Bits
import Data.Vector.Unboxed ((!))
import Data.Word
import System.Environment (getArgs)
import qualified Codec.Picture as P
import qualified Data.ByteString as B
import qualified Data.Vector.Unboxed as V
I tried to port Ken Perlin's improved noise
to Haskell, but I'm not entirely sure that my method is correct. The main part
is something that should generalize nicely to higher and lower dimensions, but
that is something for later:
perlin3 :: (Ord a, Num a, RealFrac a, V.Unbox a) => Permutation -> (a, a, a) -> a
perlin3 p (!x', !y', !z')
  = let (!xX, !x) = actuallyProperFraction x'
        (!yY, !y) = actuallyProperFraction y'
        (!zZ, !z) = actuallyProperFraction z'
        !u = fade x
        !v = fade y
        !w = fade z
        !h = xX
        !a = next p h + yY
        !b = next p (h+1) + yY
        !aa = next p a + zZ
        !ab = next p (a+1) + zZ
        !ba = next p b + zZ
        !bb = next p (b+1) + zZ
        !aaa = next p aa
        !aab = next p (aa+1)
        !aba = next p ab
        !abb = next p (ab+1)
        !baa = next p ba
        !bab = next p (ba+1)
        !bba = next p bb
        !bbb = next p (bb+1)
    in
        lerp w
            (lerp v
                (lerp u
                    (grad aaa (x, y, z))
                    (grad baa (x-1, y, z)))
                (lerp u
                    (grad aba (x, y-1, z))
                    (grad bba (x-1, y-1, z))))
            (lerp v
                (lerp u
                    (grad aab (x, y, z-1))
                    (grad bab (x-1, y, z-1)))
                (lerp u
                    (grad abb (x, y-1, z-1))
                    (grad bbb (x-1, y-1, z-1))))
This is accompanied by a few helper functions used in perlin3, which I hope
are as efficient as possible:
fade :: (Ord a, Num a) => a -> a
fade !t | 0 <= t, t <= 1 = t * t * t * (t * (t * 6 - 15) + 10)
lerp :: (Ord a, Num a) => a -> a -> a -> a
lerp !t !a !b | 0 <= t, t <= 1 = a + t * (b - a)
grad :: (Bits hash, Integral hash, Num a, V.Unbox a) => hash -> (a, a, a) -> a
grad !hash (!x, !y, !z) = dot3 (vks `V.unsafeIndex` fromIntegral (hash .&. 15)) (x, y, z)
  where
    vks = V.fromList
        [ (1,1,0), (-1,1,0), (1,-1,0), (-1,-1,0)
        , (1,0,1), (-1,0,1), (1,0,-1), (-1,0,-1)
        , (0,1,1), (0,-1,1), (0,1,-1), (0,-1,-1)
        , (1,1,0), (-1,1,0), (0,-1,1), (0,-1,-1)
        ]
dot3 :: Num a => (a, a, a) -> (a, a, a) -> a
dot3 (!x0, !y0, !z0) (!x1, !y1, !z1) = x0 * x1 + y0 * y1 + z0 * z1
-- Unlike `properFraction`, `actuallyProperFraction` rounds as intended.
actuallyProperFraction :: (RealFrac a, Integral b) => a -> (b, a)
actuallyProperFraction x
  = let (ipart, fpart) = properFraction x
        r = if x >= 0 then (ipart, fpart)
                      else (ipart-1, 1+fpart)
    in r
For the permutation group, I just copied the one Perlin used on his website:
newtype Permutation = Permutation (V.Vector Word8)
mkPermutation :: [Word8] -> Permutation
mkPermutation xs
  | length xs >= 256
  = Permutation . V.fromList $ xs
permutation :: Permutation
permutation = mkPermutation
[151,160,137,91,90,15,
131,13,201,95,96,53,194,233,7,225,140,36,103,30,69,142,8,99,37,240,21,10,23,
190, 6,148,247,120,234,75,0,26,197,62,94,252,219,203,117,35,11,32,57,177,33,
88,237,149,56,87,174,20,125,136,171,168, 68,175,74,165,71,134,139,48,27,166,
77,146,158,231,83,111,229,122,60,211,133,230,220,105,92,41,55,46,245,40,244,
102,143,54, 65,25,63,161, 1,216,80,73,209,76,132,187,208, 89,18,169,200,196,
135,130,116,188,159,86,164,100,109,198,173,186, 3,64,52,217,226,250,124,123,
5,202,38,147,118,126,255,82,85,212,207,206,59,227,47,16,58,17,182,189,28,42,
223,183,170,213,119,248,152, 2,44,154,163, 70,221,153,101,155,167, 43,172,9,
129,22,39,253, 19,98,108,110,79,113,224,232,178,185, 112,104,218,246,97,228,
251,34,242,193,238,210,144,12,191,179,162,241, 81,51,145,235,249,14,239,107,
49,192,214, 31,181,199,106,157,184, 84,204,176,115,121,50,45,127, 4,150,254,
138,236,205,93,222,114,67,29,24,72,243,141,128,195,78,66,215,61,156,180
]
next :: Permutation -> Word8 -> Word8
next (Permutation !v) !idx'
  = v `V.unsafeIndex` (fromIntegral $ idx' .&. 0xFF)
And all this is tied together with JuicyPixels:
main = do
    [target] <- getArgs
    let image = P.generateImage pixelRenderer 512 512
    P.writePng target image
  where
    pixelRenderer, pixelRenderer' :: Int -> Int -> Word8
    pixelRenderer !x !y
        = floor $ ((perlin3 permutation ((fromIntegral x - 256) / 32,
          (fromIntegral y - 256) / 32, 0 :: Double))+1)/2 * 128
    -- This code is much more readable, but also much slower.
    pixelRenderer' x y
        = (\w -> floor $ ((w+1)/2 * 128)) -- w should be in [-1,+1]
        . perlin3 permutation
        . (\(x,y,z) -> ((x-256)/32, (y-256)/32, (z-256)/32))
        $ (fromIntegral x, fromIntegral y, 0 :: Double)
My problem is that perlin3 seems very slow to me. If I profile it, pixelRenderer
is getting a lot of time as well, but I'll ignore that for now. I don't know
how to optimize perlin3. I tried to hint GHC with bang patterns, which cuts
the execution time in half, so that's nice. Explicitly specializing and inlining
barely helps with ghc -O. Is perlin3 supposed to be this slow?
UPDATE: an earlier version of this question mentioned a bug in my code. This problem has been resolved; it turns out my old version of actuallyProperFraction was buggy. It implicitly rounded the integral part of a floating point number to Word8, and then subtracted it from the floating point number to get the fractional part. Since Word8 can only take values between 0 and 255 inclusive, this won't work properly for numbers outside that range, including negative numbers.
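The failure mode described in the update can be reproduced in isolation. The following is a minimal sketch, not the code from the question: `floorFraction`, `wrapIndex` and `buggyFraction` are hypothetical names, and it assumes the lattice coordinate is kept as an Int and only wrapped into 0..255 when the permutation table is indexed.

```haskell
import Data.Bits ((.&.))
import Data.Word (Word8)

-- Split a coordinate into its floor and a fractional part in [0, 1).
-- Keeping the integer part as Int avoids the Word8 wrap-around described above.
floorFraction :: Double -> (Int, Double)
floorFraction x =
  let i = floor x :: Int
  in (i, x - fromIntegral i)

-- Wrap a (possibly negative) lattice coordinate into the 0..255 table range.
wrapIndex :: Int -> Int
wrapIndex i = i .&. 0xFF

-- The buggy variant for comparison: the integer part is narrowed to Word8
-- before it is subtracted, so the "fraction" is wrong outside 0..255.
buggyFraction :: Double -> Double
buggyFraction x = x - fromIntegral (floor x :: Word8)
```

Evaluating `floorFraction` and `buggyFraction` on a negative input in GHCi shows the difference directly.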

This code appears to be mostly computation-bound. It can be improved a little bit, but not by much unless there's a way to use fewer array lookups and less arithmetic.
There are two useful tools for measuring performance: profiling and code dumps. I added an SCC annotation to perlin3 so that it would show up in the profile. Then I compiled with ghc -O2 -fforce-recomp -ddump-simpl -prof -auto. The -ddump-simpl flag prints the simplified code.
Profiling: On my computer, it takes 0.60 seconds to run the program, and about 20% of execution time (0.12 seconds) is spent in perlin3 according to the profile. Note that the precision of my profile info is about +/-3%.
Simplifier output: The simplifier produces fairly clean code. perlin3 gets inlined into pixelRenderer, so that's the part of the output you want to look at. Most of the code consists of unboxed array reads and unboxed arithmetic. To improve performance, we want to eliminate some of this arithmetic.
An easy change is to eliminate the run-time checks on SomeFraction (which doesn't appear in your question, but is part of the code that you uploaded). This reduces the program's execution time to 0.56 seconds.
-- someFraction t | 0 <= t, t < 1 = SomeFraction t
someFraction t = SomeFraction t
Next, there are several array lookups that show up in the simplifier like this:
case GHC.Prim.indexWord8Array#
       ipv3_s23a
       (GHC.Prim.+#
          ipv1_s21N
          (GHC.Prim.word2Int#
             (GHC.Prim.and#
                (GHC.Prim.narrow8Word#
                   (GHC.Prim.plusWord# ipv5_s256 (__word 1)))
                (__word 255))))
The primitive operation narrow8Word# truncates a machine word to 8 bits; it shows up here because the index arithmetic is done on Word8. We can get rid of this coercion by using Int instead of Word8 in the definition of next.
next :: Permutation -> Int -> Int
next (Permutation !v) !idx'
  = fromIntegral $ v `V.unsafeIndex` (fromIntegral idx' .&. 0xFF)
This reduces the program's execution time to 0.54 seconds. Considering just the time spent in perlin3, the execution time has fallen (roughly) from 0.12 to 0.06 seconds. Although it's hard to measure where the rest of the time is going, it's most likely spread out among the remaining arithmetic and array accesses.

On my machine reference code with Heatsink's optimisations takes 0.19 secs.
First, I moved from JuicyPixels to yarr and yarr-image-io, compiled with my favourite flags, -Odph -rtsopts -threaded -fno-liberate-case -funbox-strict-fields -fexpose-all-unfoldings -funfolding-keeness-factor1000 -fsimpl-tick-factor=500 -fllvm -optlo-O3 (they are given here):
import Data.Yarr as Y
import Data.Yarr.IO.Image as Y
...
main = do
    [target] <- getArgs
    image <- dComputeS $ fromFunction (512, 512) (return . pixelRenderer)
    Y.writeImage target (Grey image)
  where
    pixelRenderer, pixelRenderer' :: Dim2 -> Word8
    pixelRenderer (y, x)
        = floor $ ((perlin3 permutation ((fromIntegral x - 256) / 32,
          (fromIntegral y - 256) / 32, 0 :: Double))+1)/2 * 128
    -- This code is much more readable, but also much slower.
    pixelRenderer' (y, x)
        = (\w -> floor $ ((w+1)/2 * 128)) -- w should be in [-1,+1]
        . perlin3 permutation
        . (\(x,y,z) -> ((x-256)/32, (y-256)/32, (z-256)/32))
        $ (fromIntegral x, fromIntegral y, 0 :: Double)
This makes the program 30% faster, 0.13 seconds.
Second, I replaced uses of the standard floor with
doubleToByte :: Double -> Word8
doubleToByte f = fromIntegral (truncate f :: Int)
This is a known issue (google "haskell floor performance"). Execution time drops to 52 ms (0.052 s), an improvement of almost 3 times.
Finally, just for fun, I tried computing the noise in parallel (dComputeP instead of dComputeS, and +RTS -N4 on the command line). The program took 36 ms, including an I/O constant of about 10 ms.
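The truncate-based conversion is safe here because pixel values are non-negative, where truncate and floor agree. For inputs that can be negative, a floor can be layered on top of truncate. This is a sketch under that assumption; `fastFloor` is a hypothetical helper, and whether it beats the built-in floor depends on the GHC version and the target type, so it is worth measuring.

```haskell
import Data.Word (Word8)

-- The conversion from the answer above: truncate to Int, then narrow.
doubleToByte :: Double -> Word8
doubleToByte f = fromIntegral (truncate f :: Int)

-- Hypothetical helper: floor expressed through truncate.
-- truncate rounds toward zero, so negative non-integers need a correction.
fastFloor :: Double -> Int
fastFloor x
  | fromIntegral t > x = t - 1
  | otherwise          = t
  where
    t = truncate x :: Int
```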

Related

Mutable list of mutable non-integral types in Haskell

I'm trying to parse a huge 3d-data array of complex values from binary. Later this should become l matrices (n x m). Since I'm going to work on these matrices, I'm limited to matrix libraries - hmatrix seems to be promising.
The data layout is not in my required format, so I have to jump around in positions (i,j,k) -> (k,i,j), where i and j range over n and m, and k ranges over l.
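The remapping can be written down as two offset functions. This is a minimal sketch assuming an i-major, j-middle, k-minor source and a target made of l consecutive row-major n x m matrices; `srcOffset`, `dstOffset` and `reorder` are hypothetical names, and the list-based `reorder` is only meant for checking the mapping on toy input.

```haskell
-- Offset of element (i, j, k) in the source layout:
-- i-major, j-middle, k-minor, with extents n x m x l.
srcOffset :: Int -> Int -> (Int, Int, Int) -> Int
srcOffset m l (i, j, k) = i * m * l + j * l + k

-- Offset of the same element in the target layout:
-- l consecutive row-major n x m matrices.
dstOffset :: Int -> Int -> (Int, Int, Int) -> Int
dstOffset n m (i, j, k) = k * n * m + i * m + j

-- Reorder a small list purely, for checking the mapping on toy input only.
reorder :: Int -> Int -> Int -> [a] -> [a]
reorder n m l xs =
  [ xs !! srcOffset m l (i, j, k)
  | k <- [0 .. l - 1], i <- [0 .. n - 1], j <- [0 .. m - 1] ]
```

With n = 1, m = 2, l = 2, the input "abcd" comes out as "acbd": the two k = 0 elements first, then the two k = 1 elements.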
I think the only way to read this in is by using mutables, otherwise I'll end up with several terabytes of garbage. My idea was to use boxed mutable arrays or vectors of mutable matrices (STMatrix from Numeric.LinearAlgebra.Devel), so I end up with something like:
data MVector s (STMatrix s t)
But I'm not sure how to use them correctly:
I can modify one single element of the MVector with modify:
modify :: PrimMonad m => MVector (PrimState m) a -> (a -> a) -> Int -> m ()
or use modifyM (Strange: in stack vector-0.12.3.0 does not have modifyM...)
modifyM :: PrimMonad m => MVector (PrimState m) a -> (a -> m a) -> Int -> m ()
so I could pass a function (a -> a) that calls a runST routine to modify the STMatrix. I'm not sure whether I should put an ST inside an IO (?)
Nevertheless, I think this should work, but it is only useful when I want to modify a whole matrix; calling this (a -> a) routine n x m x l times would add quite a bit of overhead (maybe it will be optimized out...).
So I'd end up marshalling the array, modifying the content via pointers (i,j,k) -> (k,i,j), and reading everything matrix by matrix. But this does not feel right, and I wanted to avoid such dirty tricks.
Do you have any ideas for a way to do this a little bit more... cleanly?
Ty
Edit:
Thanks to K. A. Buhr, whose solution works so far. Now I'm only running into performance problems. If I compare the solution:
{-# LANGUAGE BangPatterns #-}
module Main where
import Data.List
import Numeric.LinearAlgebra
import qualified Data.Vector as V
import qualified Data.Vector.Storable as VS
import qualified Data.Vector.Storable.Mutable as VSM
-- Create an l-length list of n x m hmatrix Matrices
toMatrices :: Int -> Int -> Int -> [C] -> [Matrix C]
toMatrices l n m dats = map (reshape m) $ VS.createT $ do
  mats <- V.replicateM l $ VSM.unsafeNew (m*n)
  sequence_ $ zipWith (\(i,j,k) x ->
      VSM.unsafeWrite (mats V.! k) (loc i j) x) idxs (dats ++ repeat 0)
  return $ V.toList mats
  where idxs = (,,) <$> [0..n-1] <*> [0..m-1] <*> [0..l-1]
        loc i j = i*m + j
test1 = toMatrices 1000 1000 100 (fromIntegral <$> [1..])
main = do
  let !a = test1
  print "done"
With the simplest C code:
#include <stdlib.h>
#include <stdio.h>
int main(void)
{
    const int n = 1000;
    const int m = 1000;
    const int l = 100;
    double *src = malloc(n*m*l * sizeof(double));
    for (int i = 0; i < n*m*l; i++) {
        src[i] = (double)i;
    }
    double *dest = malloc(n*m*l * sizeof(double));
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < m; j++) {
            for (int k = 0; k < l; k++) {
                dest[k*n*m+i*m+j] = src[i*m*l+j*l+k];
            }
        }
    }
    printf("done: %f\n", dest[n*m*l - 1]); // Need to access the array, otherwise it'll get lost by -O2
    free(src);
    free(dest);
    return 0;
}
Both compiled with -O2 give the following timings:
real 0m5,611s
user 0m14,845s
sys 0m2,759s
vs.
real 0m0,441s
user 0m0,200s
sys 0m0,240s
That is approximately two orders of magnitude in per-core performance. From profiling I learned that
VSM.unsafeWrite (mats V.! k) (loc i j) x
is the expensive function.
Since I'll run this procedure at roughly one-minute intervals, I want to keep the parsing time as low as the disk access time. I'll see if I can speed this up.
PS: This is a test of whether I could move the usual DSP work from C-like languages to Haskell.
Edit2 :
OK, this is what I got after some trying:
{-# LANGUAGE BangPatterns #-}
module Main where
import Data.List
import qualified Data.Vector as V
import qualified Data.Vector.Storable as VS
import qualified Data.Vector.Storable.Mutable as VSM
import Numeric.LinearAlgebra
-- Create an l-length list of n x m hmatrix Matrices
toMatrices :: Int -> Int -> Int -> VS.Vector C -> V.Vector (Matrix C)
toMatrices l n m dats =
  V.map (reshape m) newMat
  where
    newMat = VS.createT $
      V.generateM l $ \k -> do
        curMat <- VSM.unsafeNew (m * n)
        VS.mapM_
          (\i ->
             VS.mapM_
               (\j -> VSM.unsafeWrite curMat (loc i j) (dats VS.! (oldLoc i j k)))
               idjs)
          idis
        return curMat
    loc i j = i * m + j
    oldLoc i j k = i * m * l + j * l + k
    !idis = VS.generate n (\a->a)
    !idjs = VS.generate m (\a->a)
test1 = toMatrices 100 1000 1000 arr
  where
    arr = VS.generate (1000 * 1000 * 100) fromIntegral :: VS.Vector C
main = do
  let !a = test1
  print "done"
It gives something like:
real 0m1,816s
user 0m1,636s
sys 0m1,120s
So it is about 4 times slower than the C code. I think I can live with this.
I guess I'm destroying all the stream-fusion functionality of the vector package with this code. If there are any suggestions for getting it back at comparable speed, I would be grateful!

As I understand it, you have a "huge" set of data in i-major, j-middling, k-minor order, and you want to load it into matrices indexed by k whose elements have i-indexed rows and j-indexed columns, right? So, you want a function something like:
import Numeric.LinearAlgebra
-- load into "l" matrices of size "n x m"
toMatrices :: Int -> Int -> Int -> [C] -> [Matrix C]
toMatrices l n m dats = ...
Note that you've written n x m matrices above, associating i with n and j with m. It would be more usual to flip the roles of n and m, but I've stuck with your notation, so keep an eye on that.
If the entire data list [C] could fit comfortably in memory, you could do this immutably by writing something like:
import Data.List
import Data.List.Split
import Numeric.LinearAlgebra
toMatrices :: Int -> Int -> Int -> [C] -> [Matrix C]
toMatrices l n m = map (reshape m . fromList) . transpose . chunksOf l
This breaks the input data into l-sized chunks, transposes them into l lists, and converts each list to a matrix. If there was some way to force all the Matrix C values in parallel, this could be done with one traversal through the data, without the need to hold on to the whole list. Unfortunately, the individual Matrix C values can only be forced one-by-one, and the whole list needs to be kept around until all of them can be forced.
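The effect of the chunk-and-transpose step is easy to see on a toy list. The sketch below needs only base: `chunks` is a local stand-in for chunksOf from Data.List.Split, and `regroup` is a hypothetical name for the list part of the pipeline (without reshape and fromList).

```haskell
import Data.List (transpose)

-- Local stand-in for Data.List.Split.chunksOf, so the sketch needs only base.
-- Assumes k > 0.
chunks :: Int -> [a] -> [[a]]
chunks _ [] = []
chunks k xs = let (h, t) = splitAt k xs in h : chunks k t

-- Group a k-minor stream into l lists, one per value of k.
regroup :: Int -> [a] -> [[a]]
regroup l = transpose . chunks l
```

For l = 2, the stream [1,2,3,4,5,6] is split into [[1,2],[3,4],[5,6]] and transposed into [[1,3,5],[2,4,6]], so every second element ends up in the same output list.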
So, if the "huge" [C] list is too big for memory, you're probably right that you need to load the data into a (partially) mutable structure. The code is somewhat challenging to write, but it's not too bad in its final form. I believe the following will work:
import Data.List
import Numeric.LinearAlgebra
import qualified Data.Vector as V
import qualified Data.Vector.Storable as VS
import qualified Data.Vector.Storable.Mutable as VSM
-- Create an l-length list of n x m hmatrix Matrices
toMatrices :: Int -> Int -> Int -> [C] -> [Matrix C]
toMatrices l n m dats = map (reshape m) $ VS.createT $ do
  mats <- V.replicateM l $ VSM.unsafeNew (m*n)
  sequence_ $ zipWith (\(i,j,k) x ->
      VSM.unsafeWrite (mats V.! k) (loc i j) x) idxs (dats ++ repeat 0)
  return $ V.toList mats
  where idxs = (,,) <$> [0..n-1] <*> [0..m-1] <*> [0..l-1]
        loc i j = i*m + j
test1 = toMatrices 4 3 2 (fromIntegral <$> [1..24])
test2 = toMatrices 1000 1000 100 (fromIntegral <$> [1..])
main = do
  print $ test1
  print $ norm_Inf . foldl1' (+) $ test2
Compiled with -O2, the maximum residency is about 1.6 GB, which matches the expected memory needed to hold 100 matrices of one million 16-byte complex values in memory, so that looks right.
Anyway, this version of toMatrices is made somewhat complicated by the use of three different vector variants. There's Vector from hmatrix, which is the same as the immutable storable VS.Vector from vector; and then there are two more types from vector: the immutable boxed V.Vector, and the mutable storable VSM.Vector.
The do-block creates a V.Vector of VSM.Vectors and populates those with a sequence of monadic actions performed across index/value pairs. You can load the data in any order by modifying the definition of idxs to match the order of the data stream. The do-block returns the final VSM.Vectors in a list, the helper function VS.createT freezes them all to VS.Vectors (i.e., Vector from hmatrix), and reshape is mapped across the vectors to turn them into m-column matrices.
Note that you'll have to take care that in your actual application, the list of data items read from the file isn't kept around by code other than toMatrices, either in the original text form or the parsed numeric form. This shouldn't be too tough to get right, but you might want to test on medium-sized test input before locking up your computer on the real dataset.

How do I memoize?

I have written this function that computes Collatz sequences, and I see wildly varying times of execution depending on the spin I give it. Apparently it is related to something called "memoization", but I have a hard time understanding what it is and how it works, and, unfortunately, the relevant article on HaskellWiki, as well as the papers it links to, have all proven to not be easily surmountable. They discuss intricate details of the relative performance of highly layman-indifferentiable tree constructions, while what I miss must be some very basic, very trivial point that these sources neglect to mention.
This is the code. It is a complete program, ready to be built and executed.
module Main where
import Data.Function
import Data.List (maximumBy)
size :: (Integral a) => a
size = 10 ^ 6
-- Nail the basics.
collatz :: Integral a => a -> a
collatz n | even n = n `div` 2
          | otherwise = n * 3 + 1
recollatz :: Integral a => a -> a
recollatz = fix $ \f x -> if (x /= 1)
                          then f (collatz x)
                          else x
-- Now, I want to do the counting with a tuple monad.
mocollatz :: Integral b => b -> ([b], b)
mocollatz n = ([n], collatz n)
remocollatz :: Integral a => a -> ([a], a)
remocollatz = fix $ \f x -> if x /= 1
                            then f =<< mocollatz x
                            else return x
-- Trivialities.
collatzLength :: Integral a => a -> Int
collatzLength x = (length . fst $ (remocollatz x)) + 1
collatzPairs :: Integral a => a -> [(a, Int)]
collatzPairs n = zip [1..n] (collatzLength <$> [1..n])
longestCollatz :: Integral a => a -> (a, Int)
longestCollatz n = maximumBy order $ collatzPairs n
  where
    order :: Ord b => (a, b) -> (a, b) -> Ordering
    order x y = snd x `compare` snd y
main :: IO ()
main = print $ longestCollatz size
With ghc -O2 it takes about 17 seconds, and without -O2 about 22 seconds, to deliver the length and the seed of the longest Collatz sequence starting at any point below size.
Now, if I make these changes:
diff --git a/Main.hs b/Main.hs
index c78ad95..9607fe0 100644
--- a/Main.hs
+++ b/Main.hs
@@ -1,6 +1,7 @@
 module Main where
 import Data.Function
+import qualified Data.Map.Lazy as M
 import Data.List (maximumBy)
 size :: (Integral a) => a
@@ -22,10 +23,15 @@ recollatz = fix $ \f x -> if (x /= 1)
 mocollatz :: Integral b => b -> ([b], b)
 mocollatz n = ([n], collatz n)
-remocollatz :: Integral a => a -> ([a], a)
-remocollatz = fix $ \f x -> if x /= 1
-                            then f =<< mocollatz x
-                            else return x
+remocollatz :: (Num a, Integral b) => b -> ([b], a)
+remocollatz 1 = return 1
+remocollatz x = case M.lookup x (table mutate) of
+    Nothing -> mutate x
+    Just y  -> y
+  where mutate x = remocollatz =<< mocollatz x
+
+table :: (Ord a, Integral a) => (a -> b) -> M.Map a b
+table f = M.fromList [ (x, f x) | x <- [1..size] ]
 -- Trivialities.
Then it will take just about 4 seconds with ghc -O2, but I would not live long enough to see it complete without ghc -O2.
Looking at the details of cost centres with ghc -prof -fprof-auto -O2 reveals that the first version enters collatz about a hundred million times, while the patched one enters it only about one and a half million times. This must be the reason for the speedup, but I have a hard time understanding the inner workings of this magic. My best idea is that we replace a portion of the expensive recursive calls with O(log n) map lookups, but I don't know whether that's true or why it depends so much on some godforsaken compiler flags, while, as I see it, such performance swings should all follow solely from the language.
Can I haz an explanation of what happens here, and why the performance differs so vastly between ghc -O2 and plain ghc builds?
P.S. There are two requirements for achieving automagical memoization highlighted elsewhere on Stack Overflow:
Make the function to be memoized a top-level name.
Make the function to be memoized monomorphic.
In line with these requirements, I rebuilt remocollatz as follows:
remocollatz :: Int -> ([Int], Int)
remocollatz 1 = return 1
remocollatz x = mutate x
mutate :: Int -> ([Int], Int)
mutate x = remocollatz =<< mocollatz x
Now it's as top level and as monomorphic as it gets. Running time is about 11 seconds, versus the similarly monomorphized table version:
remocollatz :: Int -> ([Int], Int)
remocollatz 1 = return 1
remocollatz x = case M.lookup x (table mutate) of
    Nothing -> mutate x
    Just y -> y
mutate :: Int -> ([Int], Int)
mutate = \x -> remocollatz =<< mocollatz x
table :: (Int -> ([Int], Int)) -> M.Map Int ([Int], Int)
table f = M.fromList [ (x, f x) | x <- [1..size] ]
-- Running in less than 4 seconds.
I wonder why the memoization ghc is supposedly performing in the first case here is almost 3 times slower than my dumb table.

Can I haz an explanation of what happens here, and why the performance differs so vastly between ghc -O2 and plain ghc builds?
Disclaimer: this is a guess, not verified by viewing GHC core output. A careful answer would do so to verify the conjectures outlined below. You can try peering through it yourself: add -ddump-simpl to your compilation line and you will get copious output detailing exactly what GHC has done to your code.
You write:
remocollatz x = {- ... -} table mutate {- ... -}
  where mutate x = remocollatz =<< mocollatz x
The expression table mutate in fact does not depend on x; but it appears on the right-hand side of an equation that takes x as an argument. Consequently, without optimizations, this table is recomputed each time remocollatz is called (presumably even from inside the computation of table mutate).
With optimizations, GHC notices that table mutate does not depend on x, and floats it to its own definition, effectively producing:
fresh_variable_name = table mutate
  where mutate x = remocollatz =<< mocollatz x
remocollatz x = case M.lookup x fresh_variable_name of
  {- ... -}
The table is therefore computed just once for the entire program run.
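The difference between a shared and a rebuilt table can be shown with a much simpler, non-recursive table. This is only a sketch of the sharing behaviour, not the Collatz code: `sharedTable`, `lookupShared` and `lookupRebuilt` are hypothetical names, and the table of squares is a placeholder.

```haskell
import qualified Data.Map.Lazy as M

-- A table defined at top level is a constant: it is built at most once,
-- with or without optimization.
sharedTable :: M.Map Int Int
sharedTable = M.fromList [ (x, x * x) | x <- [1 .. 1000] ]

lookupShared :: Int -> Maybe Int
lookupShared x = M.lookup x sharedTable

-- Here the table expression sits under the lambda for x. Without
-- optimization it is rebuilt on each call; with -O2 GHC may float it out.
lookupRebuilt :: Int -> Maybe Int
lookupRebuilt x = M.lookup x (M.fromList [ (y, y * y) | y <- [1 .. 1000 :: Int] ])
```

Both functions return the same results; only the amount of work per call differs. Writing the table as a top-level definition by hand makes the sharing independent of the optimization level.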
don't know why it [the performance] depends so much on some godforsaken compiler flags, while, as I see it, such performance swings should all follow solely from the language.
Sorry, but Haskell doesn't work that way. The language definition tells clearly what the meaning of a given Haskell term is, but does not say anything about the runtime or memory performance needed to compute that meaning.

Another approach to memoization that works in some situations, like this one, is to use a boxed vector, whose elements are computed lazily. The function used to initialize each element can use other elements of the vector in its calculation. As long as the evaluation of an element of the vector doesn't loop and refer to itself, just the elements it recursively depends on will be evaluated. Once evaluated, an element is effectively memoized, and this has the further benefit that elements of the vector that are never referenced are never evaluated.
The Collatz sequence is a nearly ideal application for this technique, but there is one complication. The next Collatz value(s) in sequence from a value under the limit may be outside the limit, which would cause a range error when indexing the vector. I solved this by just iterating through the sequence until back under the limit and counting the steps to do so.
The following program takes 0.77 seconds to run unoptimized and 0.30 when optimized:
import qualified Data.Vector as V
limit = 10 ^ 6 :: Int
-- The Collatz function, which given a value returns the next in the sequence.
nextCollatz val
  | odd val = 3 * val + 1
  | otherwise = val `div` 2
-- Given a value, return the next Collatz value in the sequence that is less
-- than the limit and the number of steps to get there. For example, the
-- sequence starting at 13 is: [13, 40, 20, 10, 5, 16, 8, 4, 2, 1], so if
-- limit is 100, then (nextCollatzWithinLimit 13) is (40, 1), but if limit is
-- 15, then (nextCollatzWithinLimit 13) is (10, 3).
nextCollatzWithinLimit val = (firstInRange, stepsToFirstInRange)
  where
    firstInRange = head rest
    stepsToFirstInRange = 1 + (length biggerThanLimit)
    (biggerThanLimit, rest) = span (>= limit) (tail collatzSeqStartingWithVal)
    collatzSeqStartingWithVal = iterate nextCollatz val
-- A boxed vector holding Collatz length for each index. The collatzFn used
-- to generate the value for each element refers back to other elements of
-- this vector, but since the vector elements are only evaluated as needed and
-- there aren't any loops in the Collatz sequences, the values are calculated
-- only as needed.
collatzVec :: V.Vector Int
collatzVec = V.generate limit collatzFn
  where
    collatzFn :: Int -> Int
    collatzFn index
      | index <= 1 = 1
      | otherwise = (collatzVec V.! nextWithinLimit) + stepsToGetThere
      where
        (nextWithinLimit, stepsToGetThere) = nextCollatzWithinLimit index
main :: IO ()
main = do
  -- Use a fold through the vector to find the longest Collatz sequence under
  -- the limit, and keep track of both the maximum length and the initial
  -- value of the sequence, which is the index.
  let (maxLength, maxIndex) = V.ifoldl' accMaxLen (0, 0) collatzVec
      accMaxLen acc@(accMaxLen, accMaxIndex) index currLen
        | currLen <= accMaxLen = acc
        | otherwise = (currLen, index)
  putStrLn $ "Max Collatz length below " ++ show limit ++ " is "
             ++ show maxLength ++ " at index " ++ show maxIndex
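The same lazy-table idea can be sketched with Data.Array from the array package, which ships with GHC. This is an illustration of the technique rather than a drop-in replacement: `collatzLengths` is a hypothetical name, the table is indexed from 1 instead of 0, it assumes limit >= 1, and it assumes a 64-bit Int so that intermediate Collatz values do not overflow.

```haskell
import Data.Array (Array, listArray, (!))

-- Lengths of the Collatz sequences for all seeds in 1..limit.
collatzLengths :: Int -> Array Int Int
collatzLengths limit = table
  where
    -- Elements are lazy thunks; each is evaluated at most once.
    table = listArray (1, limit) (map len [1 .. limit])

    len 1 = 1
    len n = 1 + walk (step n)

    -- Inside the table range, reuse the stored length; outside, keep stepping.
    walk m
      | m <= limit = table ! m
      | otherwise  = 1 + walk (step m)

    step m
      | even m    = m `div` 2
      | otherwise = 3 * m + 1
```

For example, `collatzLengths 20 ! 13` is 10, matching the ten-element sequence starting at 13 that is listed in the comments above.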

How much space does ridge regression require?

In Haskell, ridge regression can be expressed as:
import Numeric.LinearAlgebra
createReadout :: Matrix Double → Matrix Double → Matrix Double
createReadout a b = oA <\> oB
  where
    μ = 1e-4
    oA = (a <> (tr a)) + (μ * (ident $ rows a))
    oB = a <> (tr b)
However, this operation is very memory-intensive. Here is a minimal example that requires more than 2GB on my machine and takes 3 minutes to execute.
import Numeric.LinearAlgebra
import System.Random
createReadout :: Matrix Double -> Matrix Double -> Matrix Double
createReadout a b = oA <\> oB
  where
    mu = 1e-4
    oA = (a <> (tr a)) + (mu * (ident $ rows a))
    oB = a <> (tr b)
teacher :: [Int] -> Int -> Int -> Matrix Double
teacher labelsList cols' correctRow = fromBlocks $ f <$> labelsList
  where ones = konst 1.0 (1, cols')
        zeros = konst 0.0 (1, cols')
        rows' = length labelsList
        f i | i == correctRow = [ones]
            | otherwise = [zeros]
glue :: Element t => [Matrix t] -> Matrix t
glue xs = fromBlocks [xs]
main :: IO ()
main = do
  let n = 1500 -- <- The constant to be increased
      m = 10000
      cols' = 12
  g <- newStdGen
  -- Stub data
  let labels = take m . map (`mod` 10) . randoms $ g :: [Int]
      a = (n >< (cols' * m)) $ take (cols' * m * n) $ randoms g :: Matrix Double
      teachers = zipWith (teacher [0..9]) (repeat cols') labels
      b = glue teachers
  print $ maxElement $ createReadout a b
  return ()
$ cabal exec ghc -- -O2 Test.hs
$ time ./Test
./Test 190.16s user 5.22s system 106% cpu 3:03.93 total
The goal is to increase the constant n to at least n = 4000 while RAM is limited to 5GB. What is the minimal space that the matrix inversion operation requires in theory? How can this operation be optimized in terms of space? Can ridge regression be efficiently replaced with a cheaper method?

Simple Gauss-Jordan elimination only takes space to store the input and output matrices plus constant auxiliary space. If I'm reading correctly, the matrix oA you need to invert is n x n so that's not a problem.
Your memory usage is completely dominated by storing the input matrix a, which uses at least 1500 * 120000 * 8 bytes = 1.34 GiB. n = 4000 would be 4000 * 120000 * 8 bytes = 3.58 GiB, which is over half of your space budget. I don't know what matrix library you are using or how it stores its matrices, but if they are on the Haskell heap then GC effects could easily account for another factor of 2 in space usage.
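The arithmetic behind these estimates can be reproduced with a couple of one-liners. This is a sketch assuming dense storage at 8 bytes per Double and a 64-bit Int; `denseBytes` and `toGiB` are hypothetical helper names.

```haskell
-- Bytes needed for a dense rows x cols matrix of 8-byte Doubles.
denseBytes :: Int -> Int -> Int
denseBytes rows cols = rows * cols * 8

-- Convert a byte count to GiB (2^30 bytes).
toGiB :: Int -> Double
toGiB bytes = fromIntegral bytes / 2 ^ (30 :: Int)
```

By the same formula, the n x n system for n = 4000 needs 4000 * 4000 * 8 = 128,000,000 bytes, which is small next to the input matrix.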

Well, you can get away with 3*m + n x n space, but I'm not sure how numerically stable this will be.
The basis is the identity
inv(inv(Q) + A'*A) = Q - Q*A'*R*A*Q
where R = inv(I + A*Q*A')
If A is your A matrix and
Q = inv(mu*I*mu*I) = I/(mu*mu)
then the solution to your ridge regression is
inv(inv(Q) + A'*A) * A'*b
A little more algebra shows
inv(inv(Q) + A'*A) = (I - A'*inv(mu2*I + A*A')*A)/mu2
where mu2 = mu*mu
Note that since A is n x m, A*A' is n x n.
So one algorithm would be
Compute C = A*A' + mu2*I
Do a Cholesky decomposition of C, i.e. find upper triangular U so that U'*U = C
Compute the vector y = A'*b
Compute the vector z = A*y
Solve U'*u = z for u, storing the result in z
Solve U*v = z for v, storing the result in z
Compute w = A'*z
Compute x = (y - w)/mu2.
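The recipe can be sanity-checked on the smallest possible case, where A has a single row, so that C is a 1 x 1 matrix and the Cholesky factorization and both triangular solves collapse into one division. This is a sketch of that special case only, written with plain lists; `dot`, `ridgeSteps` and `ridgeDirect` are hypothetical names and no matrix library is involved.

```haskell
-- Dot product of two equally long lists.
dot :: [Double] -> [Double] -> Double
dot xs ys = sum (zipWith (*) xs ys)

-- The steps above for a single-row A (n = 1), so C is a 1 x 1 "matrix"
-- and the two triangular solves collapse into one division.
ridgeSteps :: Double -> [Double] -> Double -> [Double]
ridgeSteps mu2 a b = zipWith (\yi wi -> (yi - wi) / mu2) y w
  where
    c = dot a a + mu2   -- C = A*A' + mu2*I
    y = map (* b) a     -- y = A'*b
    z = dot a y         -- z = A*y
    u = z / c           -- solve C*u = z
    w = map (* u) a     -- w = A'*u

-- Direct evaluation of inv(mu2*I + A'*A) * A'*b for the same special case.
ridgeDirect :: Double -> [Double] -> Double -> [Double]
ridgeDirect mu2 a b = map (\ai -> ai * b / (dot a a + mu2)) a
```

For a = [1, 2], b = 3 and mu2 = 0.5, both functions agree up to rounding, which is what the identity predicts. The general case replaces the division by the two triangular solves with U.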

How do I optimize numerical integration performance in Haskell (with example)

How do I optimize a numerical integration routine (compared to C)?
What has been done so far:
I replaced lists with unboxed vectors (obvious).
I applied profiling techniques described in the book "Real World Haskell" http://book.realworldhaskell.org/read/profiling-and-optimization.html.
I have inlined some trivial functions and inserted a lot of bangs everywhere.
That gave about 10x speedup.
I refactored the code (i.e. extracted iterator function). That gave 3x speedup.
I tried to replace polymorphic signatures with Floats
as in the answer to this question
Optimizing numerical array performance in Haskell.
That gave almost 2x speedup.
I compile like this
cabal exec ghc -- Simul.hs -O2 -fforce-recomp -fllvm -Wall
UPDATE As suggested by cchalmers, type Sample = (F, F) was replaced with
data Sample = Sample {-# UNPACK #-} !F {-# UNPACK #-} !F
The performance now is almost as good as C code. Can we do better?
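For reference, the improved Euler (Heun) step that the program's heun2 function implements can be sketched for a plain scalar ODE x' = f x, without the delayed argument. This is an illustration of the method only: `heunStep` and `heunIntegrate` are hypothetical names, Double is used instead of Float, and the lazy `iterate` is chosen for brevity, not speed.

```haskell
-- One step of Heun's (improved Euler) method for x' = f x with step size h.
heunStep :: Double -> (Double -> Double) -> Double -> Double
heunStep h f x = x + (h / 2) * (f1 + f2)
  where
    f1    = f x          -- slope at the current point
    xPred = x + h * f1   -- Euler predictor
    f2    = f xPred      -- slope at the predicted point

-- Apply n steps starting from x0.
heunIntegrate :: Int -> Double -> (Double -> Double) -> Double -> Double
heunIntegrate n h f x0 = iterate (heunStep h f) x0 !! n
```

For x' = -x with x0 = 1, ten steps of size 0.1 land within about 1e-3 of exp (-1), which is the accuracy expected from a second-order method at this step size.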
{-# LANGUAGE BangPatterns #-}
module Main
where
import qualified Data.Vector.Unboxed as U
import qualified Data.Vector.Unboxed.Mutable as UM
import qualified Control.Monad.Primitive as PrimitiveM
import Dynamics.Nonlin ( birefrP )
type F = Float
type Delay = U.Vector F
type Input = U.Vector F
-- Sample can be a vector of any length (x, y, z, ...)
data Sample = Sample {-# UNPACK #-} !F {-# UNPACK #-} !F
-- Pair is used to define exactly a pair of values
data Pair = Pair {-# UNPACK #-} !F {-# UNPACK #-} !F
type ParametrizedDelayFunction = (Sample, F) -> Sample
getX :: Sample -> F
getX (Sample a _) = a
{-# INLINE getX #-}
toDelay :: [F] -> Delay
toDelay = U.fromList
stepsPerNode :: Int
stepsPerNode = 40 -- Number of integration steps per node
infixl 6 ..+..
(..+..) :: Sample -> Sample -> Sample
(..+..) (Sample x1 y1) (Sample x2 y2) = Sample (x1 + x2) (y1 + y2)
{-# INLINE (..+..) #-}
infixl 7 .*..
(.*..) :: F -> Sample -> Sample
(.*..) c (Sample x2 y2) = Sample (c * x2) (c * y2)
{-# INLINE (.*..) #-}
-- | Ikeda model (dynamical system, DDE)
ikeda_model2
  :: (F -> F) -> (Sample, F) -> Sample
ikeda_model2 f (!(Sample x y), !x_h) = Sample x' y'
  where
    !x' = recip_epsilon * (-x + (f x_h))
    y' = 0
    recip_epsilon = 2^(6 :: Int)
-- | Integrate using improved Euler's method (fixed step).
--
-- hOver2 is already half of step size h
-- f is the function to integrate
-- x is the current argument (x and y)
-- x_h is the historical (delayed) value
-- x_h2 is the value after x_h
heun2 :: F -> ParametrizedDelayFunction
      -> Sample -> Pair -> Sample
heun2 hOver2 f !x !(Pair x_h x_h2) = x_1
  where
    !f1 = f (x, x_h)
    !x_1' = x ..+.. 2 * hOver2 .*.. f1
    !f2 = f (x_1', x_h2)
    !x_1 = x ..+.. hOver2 .*.. (f1 ..+.. f2)
initialCond :: Int -> (Sample, Delay, Int)
initialCond nodesN = (initialSampleXY, initialInterval, samplesPerDelay)
  where cdi = 1.1247695e-4 :: F  -- A fixed point for birefrP
        initialInterval = U.replicate samplesPerDelay cdi
        samplesPerDelay = nodesN * stepsPerNode
        initialSampleXY = Sample 0.0 0.0
integrator
  :: PrimitiveM.PrimMonad m =>
     (Sample -> Pair -> Sample)
  -> Int
  -> Int
  -> (Sample, (Delay, Input))
  -> m (Sample, U.Vector F)
integrator iterate1 len total (xy0, (history0, input)) = do
    !v <- UM.new total
    go v 0 xy0
    history <- U.unsafeFreeze v
    -- Zero y value, currently not used
    let xy = Sample (history `U.unsafeIndex` (total - 1)) 0.0
    return (xy, history)
  where
    h i = history0 `U.unsafeIndex` i
    go !v !i !xy
      -- The first iteration
      | i == 0 = do
          let !r = iterate1 xy (Pair (h 0) (h 1))
          UM.unsafeWrite v i (getX r)
          go v 1 r
      | i < len - 1 = do
          let !r = iterate1 xy (Pair (h i) (h $ i + 1))
          UM.unsafeWrite v i (getX r)
          go v (i + 1) r
      | i == total = do
          return ()
      -- Iterations after the initial history has been exhausted
      | otherwise = do
          !newX0 <- if i == len - 1
                      then return (getX xy0)
                      else UM.unsafeRead v (i - len - 1)
          !newX <- UM.unsafeRead v (i - len)
          let !r = iterate1 xy (Pair newX0 newX)
          UM.unsafeWrite v i (getX r)
          go v (i + 1) r
-- Not used in this version
zero :: Input
zero = U.fromList []
nodes :: Int
nodes = 306
main :: IO ()
main = do
    let delays = 4000
        (sample0, hist0, delayLength) = initialCond nodes
        -- Iterator implements Heun's scheme
        iterator = heun2 (recip 2^(7::Int) :: F) (ikeda_model2 birefrP)
        totalComputedIterations = delayLength * delays
    -- Calculates all the time trace
    (xy1, history1) <- integrator iterator delayLength totalComputedIterations (sample0, (hist0, zero))
    putStrLn $ show $ getX xy1
    return ()
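As a sanity check of the integration scheme used in heun2, here is a scalar sketch of Heun's (improved Euler) method without any delay term. The names heunStep and integrate are hypothetical and exist only for this illustration. It is applied to dx/dt = -x, whose exact solution x(t) = exp(-t) is known, so the result can be compared against it.

```haskell
-- One step of Heun's method for dx/dt = f x with step size h:
-- an Euler predictor followed by a trapezoidal corrector.
heunStep :: Double -> (Double -> Double) -> Double -> Double
heunStep h f x = x + (h / 2) * (k1 + k2)
  where
    k1 = f x                -- slope at the current point
    predictor = x + h * k1  -- plain Euler step
    k2 = f predictor        -- slope at the predicted point

-- Apply n steps of size h starting from x0.
integrate :: Int -> Double -> (Double -> Double) -> Double -> Double
integrate n h f x0 = iterate (heunStep h f) x0 !! n
```

For example, integrate 1000 0.001 negate 1 should be close to exp (-1), since the method is second-order accurate in h.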
The nonlinear function (imported) can look like this:
data Parameters = Parameters { beta :: Float
                             , alpha :: Float
                             , phi :: Float } deriving Show
paramA :: Parameters
paramA = Parameters { beta = 1.1
                    , alpha = 1.0
                    , phi = 0.01 }
birefr :: Parameters -> Float -> Float
birefr par !x = 0.5 * beta' * (1 - alpha' * (cos $ 2.0 * (x + phi')))
  where
    !beta' = beta par
    !alpha' = alpha par
    !phi' = phi par
birefrP :: Float -> Float
birefrP = birefr paramA

Problem with bit swapping in Haskell

As part of a school project I'm implementing some cryptographic algorithms in Haskell. As you probably know, this involves quite a lot of low-level bit fiddling. Now I am stuck on one particular subroutine which is giving me a headache. The routine, which is a permutation on 256 bits, works as follows:
Input: a 256 bit block.
All the even-numbered bits (0, 2, ...) of the input block become the first 128 bits of the output block, while the odd-numbered bits become the last 128 bits. More specifically, the i'th bit of the output is given by the following formula (a_i is the i'th bit of the input block, and b is the output):
b_i = a_{2i}
b_{i + 2^{d-1}} = a_{2i+1}
for i from 0 to 2^{d-1} - 1, d = 8.
As a toy example, assume we used a reduced version of the routine which worked with 16 bit blocks instead of 256 bits. Then the following bitstring would be permuted as follows:
1010 1010 1010 1010 -> 1111 1111 0000 0000
I have not been able to come up with a clean implementation for this function. In particular I have been trying with a ByteString -> ByteString signature, but that sort of forces me to work on a Word8 kind of granularity. But each byte in the output bytestring is a function of bits in all the other bytes, which requires some really messy operations.
I will be really grateful for any kind of hint or advice on how to approach this problem.
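To make the formula above concrete, here is a minimal reference sketch that works on a plain list of bits (index 0 first) rather than on a ByteString. The names permuteBits and permuteByFormula are hypothetical. This version is meant for checking a faster implementation against, not for speed.

```haskell
-- Even-indexed elements first, then odd-indexed elements,
-- keeping the relative order within each group.
permuteBits :: [a] -> [a]
permuteBits xs = evens xs ++ odds xs
  where
    evens (y : ys) = y : odds ys
    evens [] = []
    odds (_ : ys) = evens ys
    odds [] = []

-- Direct transcription of the formula for a block of 2^d bits:
-- b_i = a_{2i} and b_{i + 2^{d-1}} = a_{2i+1}.
permuteByFormula :: Int -> [a] -> [a]
permuteByFormula d a = [a !! src i | i <- [0 .. 2 ^ d - 1]]
  where
    half = 2 ^ (d - 1)
    src i
      | i < half = 2 * i
      | otherwise = 2 * (i - half) + 1
```

On the 16-bit toy example, permuteBits "1010101010101010" gives "1111111100000000", and permuteByFormula 4 agrees with it.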

If you want an efficient implementation, I don't think you can avoid working with bytes. Here is an example solution. It assumes that there is always an even number of bytes in the ByteString. I'm not very familiar with unboxing or strictness tweaking, but I think these would be necessary if you want to be very efficient.
import Data.ByteString (pack, unpack, ByteString)
import Data.Bits
import Data.Word
-- the main attraction
packString :: ByteString -> ByteString
packString = pack . packWords . unpack
-- main attraction equivalent, in [Word8]
packWords :: [Word8] -> [Word8]
packWords ws = evenPacked ++ unevenPacked
  where evenBits = map packEven ws
        unevenBits = map packUneven ws
        evenPacked = consumePairs packNibbles evenBits
        unevenPacked = consumePairs packNibbles unevenBits
-- combines 2 low nibbles (the lowest 4 bits of each argument) into a (high nibble, low nibble) word
-- assumes that only the low nibble of both arguments can be non-zero.
packNibbles :: Word8 -> Word8 -> Word8
packNibbles w1 w2 = (shiftL w1 4) .|. w2
packEven w = packBits w [0, 2, 4, 6]
packUneven w = packBits w [1, 3, 5, 7]
-- packBits 254 [0, 2, 4, 6] = 14
-- packBits 254 [1, 3, 5, 7] = 15
packBits :: Word8 -> [Int] -> Word8
packBits w is = foldr (.|.) 0 $ map (packBit w) is
-- packBit 255 0 = 1
-- packBit 255 1 = 1
-- packBit 255 2 = 2
-- packBit 255 3 = 2
-- packBit 255 4 = 4
-- packBit 255 5 = 4
-- packBit 255 6 = 8
-- packBit 255 7 = 8
packBit :: Word8 -> Int -> Word8
packBit w i = shiftR (w .&. 2^i) ((i `div` 2) + (i `mod` 2))
-- sort of like map, but halves the list in size by consuming two elements.
-- Is there a clearer way to write this with built-in function?
consumePairs :: (a -> a -> b) -> [a] -> [b]
consumePairs f (x : x' : xs) = f x x' : consumePairs f xs
consumePairs _ [] = []
consumePairs _ _ = error "list must contain even number of elements"
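As a possible simplification of packBits, the sketch below gathers the selected bits with testBit and setBit instead of masking and shifting. The name gatherBits is hypothetical; it is intended to produce the same results as packBits for the two position lists used above.

```haskell
import Data.Bits (setBit, testBit)
import Data.Word (Word8)

-- Copy the bits of w found at the given source positions into
-- consecutive low bits of the result (first position -> bit 0).
gatherBits :: [Int] -> Word8 -> Word8
gatherBits positions w = foldl step 0 (zip [0 ..] positions)
  where
    step acc (dst, src)
      | testBit w src = setBit acc dst
      | otherwise = acc
```

With this, gatherBits [0, 2, 4, 6] 254 is 14 and gatherBits [1, 3, 5, 7] 254 is 15, matching the comments on packBits.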

This should work:
import Data.List
import Data.Function
map fst $ sortBy (compare `on` snd) $ zip yourList $ cycle [0,1]
A bit of explanation:
sortBy is stable, i.e. it preserves the original order of elements that compare equal. So we can pair each value at an even position with a 0 and each value at an odd position with a 1, and then simply sort on the second component of the pair. All values at even positions end up before the values at odd positions, and within each group the original order is kept.
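Written out as a complete function, the one-liner could look like the following sketch. The name evensThenOdds is made up here, and the input is assumed to be a list with one element per bit.

```haskell
import Data.Function (on)
import Data.List (sortBy)

-- Tag positions alternately with 0 and 1, sort by the tag
-- (sortBy is stable), and drop the tags again.
evensThenOdds :: [a] -> [a]
evensThenOdds xs =
  map fst $ sortBy (compare `on` snd) $ zip xs (cycle [0 :: Int, 1])
```

For example, evensThenOdds "abcdef" gives "acebdf".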
Chris

Unless performance is critical, I'd recommend using a bit vector representation for a project like this. As you've discovered, randomly accessing individual bits is something of a pain when they're in packed form, but Data.Vector provides a wealth of functions for these sorts of tasks.
import Data.Bits
import qualified Data.Vector as V
type BitVector = V.Vector Bool
-- finiteBitSize replaces the deprecated bitSize; zeroBits avoids needing a Num constraint
unpack :: (FiniteBits a) => a -> BitVector
unpack w = V.generate (finiteBitSize w) (testBit w)
pack :: (Bits a) => BitVector -> a
pack v = V.ifoldl' set zeroBits v
  where
    set w i True = w `setBit` i
    set w _ _ = w
mkPermutationVector :: Int -> V.Vector Int
mkPermutationVector d = V.generate (2^d) b
  where
    b i | i < 2^(d-1) = 2*i
        | otherwise = let i' = i-2^(d-1)
                      in 2*i'+1
permute :: Int -> BitVector -> BitVector
permute d v = V.backpermute v (mkPermutationVector d)
Notice how this lets you specify the permutation by closely transcribing the mathematical description. This substantially reduces the likelihood of errors, and is more pleasant to write than bit-twiddly code.
To test with your example vector (in base 10):
*Main> import Data.Word
*Main Data.Word> let permute16 = pack . permute 4 . unpack :: Word16 -> Word16
*Main Data.Word> permute16 43690
65280
Now, by moving to bit vectors as your representation, you lose a lot of what you get for free by using Haskell types, such as Num instances. However, you can always implement the Num operations for your representation; here's a start:
plus :: BitVector -> BitVector -> BitVector
plus as bs = V.tail sums
  where
    (sums, carries) = V.unzip sumsAndCarries
    sumsAndCarries = V.scanl' fullAdd (False, False) (V.zip as bs)
    fullAdd (_, cin) (a, b) = ((a /= b) /= cin
                              , (a && b) || (cin && (a /= b)))
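The same ripple-carry idea can be checked against ordinary integer addition. The sketch below uses plain lists of Bool (least significant bit first) instead of Data.Vector so that it depends only on base. The names toBits, fromBits and addBits are hypothetical helpers for this illustration.

```haskell
import Data.Bits (testBit)

-- The n lowest bits of x, least significant bit first.
toBits :: Int -> Int -> [Bool]
toBits n x = [testBit x i | i <- [0 .. n - 1]]

-- Inverse of toBits for non-negative values.
fromBits :: [Bool] -> Int
fromBits bs = sum [2 ^ i | (i, True) <- zip [0 :: Int ..] bs]

-- Ripple-carry addition; the final carry is dropped, so the
-- result wraps around modulo 2^n like a fixed-width word.
addBits :: [Bool] -> [Bool] -> [Bool]
addBits = go False
  where
    go cin (a : as') (b : bs') = ((a /= b) /= cin) : go cout as' bs'
      where
        cout = (a && b) || (cin && (a /= b))
    go _ _ _ = []
```

For 8-bit inputs, fromBits (addBits (toBits 8 100) (toBits 8 55)) is 155, and adding 200 and 100 wraps around to 44.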
You may also find Levent Erkok's sbv package useful, although I'm not sure it exposes a function as convenient as backpermute for your particular question.
Update: I thought this was a fun question to answer, so I went ahead and fleshed the code out a bit as a library: bit-vector.
