Optimizing Perlin noise in Haskell
(Dependencies for this program: vector --any and JuicyPixels >= 2. Code is available as a Gist.)
{-# LANGUAGE Haskell2010 #-}
{-# LANGUAGE BangPatterns #-}
import Control.Arrow
import Data.Bits
import Data.Vector.Unboxed ((!))
import Data.Word
import System.Environment (getArgs)
import qualified Codec.Picture as P
import qualified Data.ByteString as B
import qualified Data.Vector.Unboxed as V
I tried to port Ken Perlin's improved noise
to Haskell, but I'm not entirely sure that my method is correct. The main part
should generalize nicely to higher and lower dimensions, but that is a task
for later:
perlin3 :: (Ord a, Num a, RealFrac a, V.Unbox a) => Permutation -> (a, a, a) -> a
perlin3 p (!x', !y', !z')
  = let (!xX, !x) = actuallyProperFraction x'
        (!yY, !y) = actuallyProperFraction y'
        (!zZ, !z) = actuallyProperFraction z'
        !u = fade x
        !v = fade y
        !w = fade z
        -- Hash the coordinates of the eight corners of the unit cube.
        !h = xX
        !a = next p h + yY
        !b = next p (h+1) + yY
        !aa = next p a + zZ
        !ab = next p (a+1) + zZ
        !ba = next p b + zZ
        !bb = next p (b+1) + zZ
        !aaa = next p aa
        !aab = next p (aa+1)
        !aba = next p ab
        !abb = next p (ab+1)
        !baa = next p ba
        !bab = next p (ba+1)
        !bba = next p bb
        !bbb = next p (bb+1)
    in
      -- Trilinearly interpolate the corner gradients.
      lerp w
        (lerp v
          (lerp u
            (grad aaa (x, y, z))
            (grad baa (x-1, y, z)))
          (lerp u
            (grad aba (x, y-1, z))
            (grad bba (x-1, y-1, z))))
        (lerp v
          (lerp u
            (grad aab (x, y, z-1))
            (grad bab (x-1, y, z-1)))
          (lerp u
            (grad abb (x, y-1, z-1))
            (grad bbb (x-1, y-1, z-1))))
This is accompanied by a few helper functions used in perlin3, which I hope
are as efficient as possible:
fade :: (Ord a, Num a) => a -> a
fade !t | 0 <= t, t <= 1 = t * t * t * (t * (t * 6 - 15) + 10)
lerp :: (Ord a, Num a) => a -> a -> a -> a
lerp !t !a !b | 0 <= t, t <= 1 = a + t * (b - a)
grad :: (Bits hash, Integral hash, Num a, V.Unbox a) => hash -> (a, a, a) -> a
grad !hash (!x, !y, !z) = dot3 (vks `V.unsafeIndex` fromIntegral (hash .&. 15)) (x, y, z)
  where
    vks = V.fromList
      [ (1,1,0), (-1,1,0), (1,-1,0), (-1,-1,0)
      , (1,0,1), (-1,0,1), (1,0,-1), (-1,0,-1)
      , (0,1,1), (0,-1,1), (0,1,-1), (0,-1,-1)
      , (1,1,0), (-1,1,0), (0,-1,1), (0,-1,-1)
      ]
dot3 :: Num a => (a, a, a) -> (a, a, a) -> a
dot3 (!x0, !y0, !z0) (!x1, !y1, !z1) = x0 * x1 + y0 * y1 + z0 * z1
-- Unlike `properFraction`, `actuallyProperFraction` rounds as intended.
actuallyProperFraction :: (RealFrac a, Integral b) => a -> (b, a)
actuallyProperFraction x
  = let (ipart, fpart) = properFraction x
        r = if x >= 0 then (ipart, fpart)
                      else (ipart-1, 1+fpart)
    in r
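For illustration (my example, not part of the original post), the two functions differ on negative inputs:

-- properFraction truncates toward zero:
--   properFraction (-1.5)         == (-1, -0.5)
-- actuallyProperFraction floors instead, keeping the fraction non-negative:
--   actuallyProperFraction (-1.5) == (-2, 0.5)

The lattice indexing in perlin3 needs the second behaviour: an integral cell coordinate plus a fractional offset in [0, 1).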
For the permutation table, I just copied the one Perlin used on his website:
newtype Permutation = Permutation (V.Vector Word8)
mkPermutation :: [Word8] -> Permutation
mkPermutation xs
  | length xs >= 256
  = Permutation . V.fromList $ xs
permutation :: Permutation
permutation = mkPermutation
  [151,160,137,91,90,15,
   131,13,201,95,96,53,194,233,7,225,140,36,103,30,69,142,8,99,37,240,21,10,23,
   190, 6,148,247,120,234,75,0,26,197,62,94,252,219,203,117,35,11,32,57,177,33,
   88,237,149,56,87,174,20,125,136,171,168, 68,175,74,165,71,134,139,48,27,166,
   77,146,158,231,83,111,229,122,60,211,133,230,220,105,92,41,55,46,245,40,244,
   102,143,54, 65,25,63,161, 1,216,80,73,209,76,132,187,208, 89,18,169,200,196,
   135,130,116,188,159,86,164,100,109,198,173,186, 3,64,52,217,226,250,124,123,
   5,202,38,147,118,126,255,82,85,212,207,206,59,227,47,16,58,17,182,189,28,42,
   223,183,170,213,119,248,152, 2,44,154,163, 70,221,153,101,155,167, 43,172,9,
   129,22,39,253, 19,98,108,110,79,113,224,232,178,185, 112,104,218,246,97,228,
   251,34,242,193,238,210,144,12,191,179,162,241, 81,51,145,235,249,14,239,107,
   49,192,214, 31,181,199,106,157,184, 84,204,176,115,121,50,45,127, 4,150,254,
   138,236,205,93,222,114,67,29,24,72,243,141,128,195,78,66,215,61,156,180
  ]
next :: Permutation -> Word8 -> Word8
next (Permutation !v) !idx'
  = v `V.unsafeIndex` (fromIntegral $ idx' .&. 0xFF)
And all this is tied together with JuicyPixels:
main = do
    [target] <- getArgs
    let image = P.generateImage pixelRenderer 512 512
    P.writePng target image
  where
    pixelRenderer, pixelRenderer' :: Int -> Int -> Word8
    pixelRenderer !x !y
      = floor $ ((perlin3 permutation ((fromIntegral x - 256) / 32,
                  (fromIntegral y - 256) / 32, 0 :: Double))+1)/2 * 128
    -- This code is much more readable, but also much slower.
    pixelRenderer' x y
      = (\w -> floor $ ((w+1)/2 * 128)) -- w should be in [-1,+1]
      . perlin3 permutation
      . (\(x,y,z) -> ((x-256)/32, (y-256)/32, (z-256)/32))
      $ (fromIntegral x, fromIntegral y, 0 :: Double)
My problem is that perlin3 seems very slow to me. If I profile it, pixelRenderer
gets a lot of time as well, but I'll ignore that for now. I don't know
how to optimize perlin3. I tried hinting GHC with bang patterns, which cuts
the execution time in half, so that's nice. Explicitly specializing and inlining
barely helps with ghc -O. Is perlin3 supposed to be this slow?
UPDATE: an earlier version of this question mentioned a bug in my code. That problem has been resolved: it turned out my old version of actuallyProperFraction was buggy. It implicitly rounded the integral part of a floating-point number to Word8, and then subtracted it from the floating-point number to get the fractional part. Since Word8 can only take values between 0 and 255 inclusive, this doesn't work properly for numbers outside that range, including negative numbers.
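To make the failure mode concrete, here is a hypothetical reconstruction of that kind of bug (buggyProperFraction is my name; the actual old code is not shown here):

buggyProperFraction :: Double -> (Word8, Double)
buggyProperFraction x
  = let ipart = truncate x :: Word8    -- wraps modulo 256 outside [0, 255]
    in (ipart, x - fromIntegral ipart)

For example, truncate (-1.5) :: Word8 wraps around to 255, so the "fractional part" comes out as -1.5 - 255 = -256.5 instead of 0.5.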
This code appears to be mostly computation-bound. It can be improved a little bit, but not by much unless there's a way to use fewer array lookups and less arithmetic.
There are two useful tools for measuring performance: profiling and code dumps. I added an SCC annotation to perlin3 so that it would show up in the profile. Then I compiled with ghc -O2 -fforce-recomp -ddump-simpl -prof -auto. The -ddump-simpl flag prints the simplified code.
Profiling: On my computer, it takes 0.60 seconds to run the program, and about 20% of execution time (0.12 seconds) is spent in perlin3 according to the profile. Note that the precision of my profile info is about +/-3%.
Simplifier output: The simplifier produces fairly clean code. perlin3 gets inlined into pixelRenderer, so that's the part of the output you want to look at. Most of the code consists of unboxed array reads and unboxed arithmetic. To improve performance, we want to eliminate some of this arithmetic.
An easy change is to eliminate the run-time checks on SomeFraction (which doesn't appear in your question, but is part of the code that you uploaded). This reduces the program's execution time to 0.56 seconds.
-- someFraction t | 0 <= t, t < 1 = SomeFraction t
someFraction t = SomeFraction t
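(SomeFraction itself is not shown anywhere in this thread; presumably it is a plain smart-constructor wrapper along these lines, which is my guess, so dropping the guard removes two comparisons per call:)

newtype SomeFraction = SomeFraction Double  -- hypothetical definition, not from the posted code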
Next, there are several array lookups that show up in the simplifier like this:
case GHC.Prim.indexWord8Array#
       ipv3_s23a
       (GHC.Prim.+#
          ipv1_s21N
          (GHC.Prim.word2Int#
             (GHC.Prim.and#
                (GHC.Prim.narrow8Word#
                   (GHC.Prim.plusWord# ipv5_s256 (__word 1)))
                (__word 255))))
The primitive operation narrow8Word# truncates a full-width word to its low 8 bits; it shows up here because the index arithmetic is done at Word8. We can get rid of this coercion by using Int instead of Word8 in the definition of next.
next :: Permutation -> Int -> Int
next (Permutation !v) !idx'
  = fromIntegral $ v `V.unsafeIndex` (idx' .&. 0xFF)
This reduces the program's execution time to 0.54 seconds. Considering just the time spent in perlin3, the execution time has fallen (roughly) from 0.12 to 0.06 seconds. Although it's hard to measure where the rest of the time is going, it's most likely spread out among the remaining arithmetic and array accesses.
On my machine, the reference code with Heatsink's optimisations takes 0.19 seconds.
First, I moved from JuicyPixels to yarr and yarr-image-io, with my favourite flags: -Odph -rtsopts -threaded -fno-liberate-case -funbox-strict-fields -fexpose-all-unfoldings -funfolding-keeness-factor=1000 -fsimpl-tick-factor=500 -fllvm -optlo-O3 (they are given here):
import Data.Yarr as Y
import Data.Yarr.IO.Image as Y
...
main = do
    [target] <- getArgs
    image <- dComputeS $ fromFunction (512, 512) (return . pixelRenderer)
    Y.writeImage target (Grey image)
  where
    pixelRenderer, pixelRenderer' :: Dim2 -> Word8
    pixelRenderer (y, x)
      = floor $ ((perlin3 permutation ((fromIntegral x - 256) / 32,
                  (fromIntegral y - 256) / 32, 0 :: Double))+1)/2 * 128
    -- This code is much more readable, but also much slower.
    pixelRenderer' (y, x)
      = (\w -> floor $ ((w+1)/2 * 128)) -- w should be in [-1,+1]
      . perlin3 permutation
      . (\(x,y,z) -> ((x-256)/32, (y-256)/32, (z-256)/32))
      $ (fromIntegral x, fromIntegral y, 0 :: Double)
This makes the program about 30% faster: 0.13 seconds.
Second, I replaced uses of the standard floor with
doubleToByte :: Double -> Word8
doubleToByte f = fromIntegral (truncate f :: Int)
This is a known issue (google "haskell floor performance"). Execution time is reduced to 52 ms (0.052 seconds), an almost 3x improvement.
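The corresponding call-site change is presumably just swapping floor for doubleToByte in pixelRenderer; a sketch (mine, not shown in the original answer):

pixelRenderer (y, x)
  = doubleToByte $ ((perlin3 permutation ((fromIntegral x - 256) / 32,
                     (fromIntegral y - 256) / 32, 0 :: Double))+1)/2 * 128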
Finally, just for fun, I tried computing the noise in parallel (dComputeP instead of dComputeS, plus +RTS -N4 on the command line). The program took 36 ms, including a constant I/O cost of about 10 ms.
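Assuming nothing else changes, the parallel variant is a one-line edit in main:

    image <- dComputeP $ fromFunction (512, 512) (return . pixelRenderer)  -- was dComputeS

with the resulting binary then run under +RTS -N4.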
Related
Mutable list of mutable non-integral types in Haskell
I'm trying to parse a huge 3D data array of complex values from binary. Later this should become l matrices (n x m). Since I'm going to work on these matrices, I'm limited to matrix libraries; hmatrix seems to be promising. The data layout is not in my required format, so I have to jump around in positions (i,j,k) -> (k,i,j), where i and j are elements of n and m and k an element of l. I think the only way to read this in is by using mutables; otherwise I'll end up with several terabytes of garbage. My idea was to use boxed mutable arrays or vectors of mutable matrices (STMatrix from Numeric.LinearAlgebra.Devel), so I end up with something like:

data MVector s (STMatrix s t)

But I'm not sure how to use them correctly: I can modify one single element of the MVector with modify:

modify :: PrimMonad m => MVector (PrimState m) a -> (a -> a) -> Int -> m ()

or use modifyM (strange: vector-0.12.3.0 in stack does not have modifyM...):

modifyM :: PrimMonad m => MVector (PrimState m) a -> (a -> m a) -> Int -> m ()

so I could apply the (a -> a) function via a runST routine to modify the STMatrix. I'm not sure if I should put an ST in an IO (?). Nevertheless, I think this should work, but it is only useful when I want to modify the whole matrix: calling this (a -> a) routine n x m x l times will add a bit of overhead (maybe it will be optimized out...). So I would end up marshalling the array, modifying the contents via pointers (i,j,k) -> (k,i,j), and reading everything matrix by matrix, but this does not feel right and I wanted to avoid such dirty tricks. Do you have any ideas for a way to do this a little more... cleanly? Ty

Edit: Thanks to K. A. Buhr. His solution works so far. Now I'm only running into some performance impacts. If I compare this solution:

{-# LANGUAGE BangPatterns #-}
module Main where

import Data.List
import Numeric.LinearAlgebra
import qualified Data.Vector as V
import qualified Data.Vector.Storable as VS
import qualified Data.Vector.Storable.Mutable as VSM

-- Create an l-length list of n x m hmatrix Matrices
toMatrices :: Int -> Int -> Int -> [C] -> [Matrix C]
toMatrices l n m dats = map (reshape m) $ VS.createT $ do
    mats <- V.replicateM l $ VSM.unsafeNew (m*n)
    sequence_ $ zipWith (\(i,j,k) x -> VSM.unsafeWrite (mats V.! k) (loc i j) x)
                        idxs (dats ++ repeat 0)
    return $ V.toList mats
  where
    idxs = (,,) <$> [0..n-1] <*> [0..m-1] <*> [0..l-1]
    loc i j = i*m + j

test1 = toMatrices 1000 1000 100 (fromIntegral <$> [1..])

main = do
  let !a = test1
  print "done"

with the simplest C code:

#include <stdlib.h>
#include <stdio.h>

void main() {
    const int n = 1000;
    const int m = 1000;
    const int l = 100;

    double *src = malloc(n*m*l * sizeof(double));
    for (int i = 0; i < n*m*l; i++) {
        src[i] = (double)i;
    }

    double *dest = malloc(n*m*l * sizeof(double));
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < m; j++) {
            for (int k = 0; k < l; k++) {
                dest[k*n*m+i*m+j] = src[i*m*l+j*l+k];
            }
        }
    }
    printf("done: %f\n", dest[n*m*l - 1]); // Need to access the array, otherwise it'll get lost by -O2
    free(src);
    free(dest);
}

both compiled with -O2 give the following timings:

real 0m5,611s
user 0m14,845s
sys  0m2,759s

vs.

real 0m0,441s
user 0m0,200s
sys  0m0,240s

That is roughly two orders of magnitude in per-core performance. From profiling I learn that

VSM.unsafeWrite (mats V.! k) (loc i j) x

is the expensive function. Since I'll be running this procedure at roughly one-minute intervals, I want to keep the parsing time as low as the disk access time. I'll see if I can speed this up.

PS: This is for some tests to see whether I could move my usual DSP work from C-like languages to Haskell.

Edit 2: OK, this is what I get after some trying:

{-# LANGUAGE BangPatterns #-}
module Main where

import Data.List
import qualified Data.Vector as V
import qualified Data.Vector.Storable as VS
import qualified Data.Vector.Storable.Mutable as VSM
import Numeric.LinearAlgebra

-- Create an l-length list of n x m hmatrix Matrices
toMatrices :: Int -> Int -> Int -> VS.Vector C -> V.Vector (Matrix C)
toMatrices l n m dats = V.map (reshape m) newMat
  where
    newMat = VS.createT $ V.generateM l $ \k -> do
      curMat <- VSM.unsafeNew (m * n)
      VS.mapM_
        (\i -> VS.mapM_
                 (\j -> VSM.unsafeWrite curMat (loc i j) (dats VS.! (oldLoc i j k)))
                 idjs)
        idis
      return curMat
    loc i j = i * m + j
    oldLoc i j k = i * m * l + j * l + k
    !idis = VS.generate n (\a -> a)
    !idjs = VS.generate m (\a -> a)

test1 = toMatrices 100 1000 1000 arr
  where
    arr = VS.generate (1000 * 1000 * 100) fromIntegral :: VS.Vector C

main = do
  let !a = test1
  print "done"

It gives about:

real 0m1,816s
user 0m1,636s
sys  0m1,120s

so ~4 times slower than the C code. I think I can live with this. I guess I'm destroying all the stream functionality of the vector with this code. If there are any suggestions to get that back at comparable speed, I would be grateful!
As I understand it, you have a "huge" set of data in i-major, j-middling, k-minor order, and you want to load it into matrices indexed by k whose elements have i-indexed rows and j-indexed columns, right? So, you want a function something like:

import Numeric.LinearAlgebra

-- load into "l" matrices of size "n x m"
toMatrices :: Int -> Int -> Int -> [C] -> [Matrix C]
toMatrices l n m dats = ...

Note that you've written n x m matrices above, associating i with n and j with m. It would be more usual to flip the roles of n and m, but I've stuck with your notation, so keep an eye on that.

If the entire data list [C] could fit comfortably in memory, you could do this immutably by writing something like:

import Data.List
import Data.List.Split
import Numeric.LinearAlgebra

toMatrices :: Int -> Int -> Int -> [C] -> [Matrix C]
toMatrices l n m = map (reshape m . fromList) . transpose . chunksOf l

This breaks the input data into l-sized chunks, transposes them into l lists, and converts each list to a matrix. If there was some way to force all the Matrix C values in parallel, this could be done with one traversal through the data, without the need to hold on to the whole list. Unfortunately, the individual Matrix C values can only be forced one-by-one, and the whole list needs to be kept around until all of them can be forced.

So, if the "huge" [C] list is too big for memory, you're probably right that you need to load the data into a (partially) mutable structure. The code is somewhat challenging to write, but it's not too bad in its final form. I believe the following will work:

import Data.List
import Numeric.LinearAlgebra
import qualified Data.Vector as V
import qualified Data.Vector.Storable as VS
import qualified Data.Vector.Storable.Mutable as VSM

-- Create an l-length list of n x m hmatrix Matrices
toMatrices :: Int -> Int -> Int -> [C] -> [Matrix C]
toMatrices l n m dats = map (reshape m) $ VS.createT $ do
    mats <- V.replicateM l $ VSM.unsafeNew (m*n)
    sequence_ $ zipWith (\(i,j,k) x -> VSM.unsafeWrite (mats V.! k) (loc i j) x)
                        idxs (dats ++ repeat 0)
    return $ V.toList mats
  where
    idxs = (,,) <$> [0..n-1] <*> [0..m-1] <*> [0..l-1]
    loc i j = i*m + j

test1 = toMatrices 4 3 2 (fromIntegral <$> [1..24])
test2 = toMatrices 1000 1000 100 (fromIntegral <$> [1..])

main = do
  print $ test1
  print $ norm_Inf . foldl1' (+) $ test2

Compiled with -O2, the maximum residency is about 1.6Gigs, which matches the expected memory needed to hold 100 matrices of one million 16-byte complex values in memory, so that looks right.

Anyway, this version of toMatrices is made somewhat complicated by the use of three different vector variants. There's Vector from hmatrix, which is the same as the immutable storable VS.Vector from vector; and then there are two more types from vector: the immutable boxed V.Vector, and the mutable storable VSM.Vector. The do-block creates a V.Vector of VSM.Vectors and populates those with a sequence of monadic actions performed across index/value pairs. You can load the data in any order by modifying the definition of idxs to match the order of the data stream. The do-block returns the final VSM.Vectors in a list, the helper function VS.createT freezes them all to VS.Vectors (i.e., Vector from hmatrix), and reshape is mapped across the vectors to turn them into m-column matrices.
Note that you'll have to take care that in your actual application, the list of data items read from the file isn't kept around by code other than toMatrices, either in the original text form or the parsed numeric form. This shouldn't be too tough to get right, but you might want to test on medium-sized test input before locking up your computer on the real dataset.
How do I memoize?
I have written this function that computes Collatz sequences, and I see wildly varying times of execution depending on the spin I give it. Apparently it is related to something called "memoization", but I have a hard time understanding what it is and how it works, and, unfortunately, the relevant article on HaskellWiki, as well as the papers it links to, have all proven to not be easily surmountable. They discuss intricate details of the relative performance of highly layman-indifferentiable tree constructions, while what I miss must be some very basic, very trivial point that these sources neglect to mention.

This is the code. It is a complete program, ready to be built and executed.

module Main where

import Data.Function
import Data.List (maximumBy)

size :: (Integral a) => a
size = 10 ^ 6

-- Nail the basics.

collatz :: Integral a => a -> a
collatz n | even n = n `div` 2
          | otherwise = n * 3 + 1

recollatz :: Integral a => a -> a
recollatz = fix $ \f x -> if (x /= 1)
                          then f (collatz x)
                          else x

-- Now, I want to do the counting with a tuple monad.

mocollatz :: Integral b => b -> ([b], b)
mocollatz n = ([n], collatz n)

remocollatz :: Integral a => a -> ([a], a)
remocollatz = fix $ \f x -> if x /= 1
                            then f =<< mocollatz x
                            else return x

-- Trivialities.

collatzLength :: Integral a => a -> Int
collatzLength x = (length . fst $ (remocollatz x)) + 1

collatzPairs :: Integral a => a -> [(a, Int)]
collatzPairs n = zip [1..n] (collatzLength <$> [1..n])

longestCollatz :: Integral a => a -> (a, Int)
longestCollatz n = maximumBy order $ collatzPairs n
  where
    order :: Ord b => (a, b) -> (a, b) -> Ordering
    order x y = snd x `compare` snd y

main :: IO ()
main = print $ longestCollatz size

With ghc -O2 it takes about 17 seconds, without ghc -O2, about 22 seconds, to deliver the length and the seed of the longest Collatz sequence starting at any point below size.

Now, if I make these changes:

diff --git a/Main.hs b/Main.hs
index c78ad95..9607fe0 100644
--- a/Main.hs
+++ b/Main.hs
@@ -1,6 +1,7 @@
 module Main where
 
 import Data.Function
+import qualified Data.Map.Lazy as M
 import Data.List (maximumBy)
 
 size :: (Integral a) => a
@@ -22,10 +23,15 @@ recollatz = fix $ \f x -> if (x /= 1)
 mocollatz :: Integral b => b -> ([b], b)
 mocollatz n = ([n], collatz n)
 
-remocollatz :: Integral a => a -> ([a], a)
-remocollatz = fix $ \f x -> if x /= 1
-                            then f =<< mocollatz x
-                            else return x
+remocollatz :: (Num a, Integral b) => b -> ([b], a)
+remocollatz 1 = return 1
+remocollatz x = case M.lookup x (table mutate) of
+    Nothing -> mutate x
+    Just y  -> y
+  where mutate x = remocollatz =<< mocollatz x
+
+table :: (Ord a, Integral a) => (a -> b) -> M.Map a b
+table f = M.fromList [ (x, f x) | x <- [1..size] ]
 
 -- Trivialities.

Then it will take just about 4 seconds with ghc -O2, but I would not live long enough to see it complete without ghc -O2.

Looking at the details of cost centres with ghc -prof -fprof-auto -O2 reveals that the first version enters collatz about a hundred million times, while the patched one, just about one and a half million times. This must be the reason of the speedup, but I have a hard time understanding the inner workings of this magic. My best idea is that we replace a portion of expensive recursive calls with O(log n) map lookups, but I don't know if it's true and why it depends so much on some godforsaken compiler flags, while, as I see it, such performance swings should all follow solely from the language.

Can I haz an explanation of what happens here, and why the performance differs so vastly between ghc -O2 and plain ghc builds?

P.S. There are two requirements to the achieving of automagical memoization highlighted elsewhere on Stack Overflow:

Make a function to be memoized a top-level name.
Make a function to be memoized a monomorphic one.

In line with these requirements, I rebuilt remocollatz as follows:

remocollatz :: Int -> ([Int], Int)
remocollatz 1 = return 1
remocollatz x = mutate x

mutate :: Int -> ([Int], Int)
mutate x = remocollatz =<< mocollatz x

Now it's as top level and as monomorphic as it gets. Running time is about 11 seconds, versus the similarly monomorphized table version:

remocollatz :: Int -> ([Int], Int)
remocollatz 1 = return 1
remocollatz x = case M.lookup x (table mutate) of
    Nothing -> mutate x
    Just y  -> y

mutate :: Int -> ([Int], Int)
mutate = \x -> remocollatz =<< mocollatz x

table :: (Int -> ([Int], Int)) -> M.Map Int ([Int], Int)
table f = M.fromList [ (x, f x) | x <- [1..size] ]

which runs in less than 4 seconds. I wonder why the memoization GHC is supposedly performing in the first case here is almost 3 times slower than my dumb table.
"Can I haz an explanation of what happens here, and why the performance differs so vastly between ghc -O2 and plain ghc builds?"

Disclaimer: this is a guess, not verified by viewing GHC core output. A careful answer would do so to verify the conjectures outlined below. You can try peering through it yourself: add -ddump-simpl to your compilation line and you will get copious output detailing exactly what GHC has done to your code.

You write:

remocollatz x = {- ... -} table mutate {- ... -}
  where mutate x = remocollatz =<< mocollatz x

The expression table mutate in fact does not depend on x; but it appears on the right-hand side of an equation that takes x as an argument. Consequently, without optimizations, this table is recomputed each time remocollatz is called (presumably even from inside the computation of table mutate).

With optimizations, GHC notices that table mutate does not depend on x, and floats it to its own definition, effectively producing:

fresh_variable_name = table mutate
  where mutate x = remocollatz =<< mocollatz x

remocollatz x = case M.lookup x fresh_variable_name of {- ... -}

The table is therefore computed just once for the entire program run.

"don't know why it [the performance] depends so much on some godforsaken compiler flags, while, as I see it, such performance swings should all follow solely from the language."

Sorry, but Haskell doesn't work that way. The language definition tells clearly what the meaning of a given Haskell term is, but does not say anything about the runtime or memory performance needed to compute that meaning.
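To see the floating effect in isolation, here is a minimal example of my own (not from this thread): compiled without optimization, squares is rebuilt on every call, because it is bound under the equation that takes x; with -O, GHC's full laziness floats it to the top level, where it is computed only once.

slowLookup :: Int -> Maybe Int
slowLookup x = lookup x squares
  where squares = [ (i, i * i) | i <- [1 .. 100000] ]  -- does not depend on x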
Another approach to memoization that works in some situations, like this one, is to use a boxed vector, whose elements are computed lazily. The function used to initialize each element can use other elements of the vector in its calculation. As long as the evaluation of an element of the vector doesn't loop and refer to itself, just the elements it recursively depends on will be evaluated. Once evaluated, an element is effectively memoized, and this has the further benefit that elements of the vector that are never referenced are never evaluated.

The Collatz sequence is a nearly ideal application for this technique, but there is one complication. The next Collatz value(s) in sequence from a value under the limit may be outside the limit, which would cause a range error when indexing the vector. I solved this by just iterating through the sequence until back under the limit and counting the steps to do so.

The following program takes 0.77 seconds to run unoptimized and 0.30 when optimized:

import qualified Data.Vector as V

limit = 10 ^ 6 :: Int

-- The Collatz function, which given a value returns the next in the sequence.
nextCollatz val
  | odd val   = 3 * val + 1
  | otherwise = val `div` 2

-- Given a value, return the next Collatz value in the sequence that is less
-- than the limit and the number of steps to get there. For example, the
-- sequence starting at 13 is: [13, 40, 20, 10, 5, 16, 8, 4, 2, 1], so if
-- limit is 100, then (nextCollatzWithinLimit 13) is (40, 1), but if limit is
-- 15, then (nextCollatzWithinLimit 13) is (10, 3).
nextCollatzWithinLimit val = (firstInRange, stepsToFirstInRange)
  where
    firstInRange = head rest
    stepsToFirstInRange = 1 + (length biggerThanLimit)
    (biggerThanLimit, rest) = span (>= limit) (tail collatzSeqStartingWithVal)
    collatzSeqStartingWithVal = iterate nextCollatz val

-- A boxed vector holding Collatz length for each index. The collatzFn used
-- to generate the value for each element refers back to other elements of
-- this vector, but since the vector elements are only evaluated as needed and
-- there aren't any loops in the Collatz sequences, the values are calculated
-- only as needed.
collatzVec :: V.Vector Int
collatzVec = V.generate limit collatzFn
  where
    collatzFn :: Int -> Int
    collatzFn index
      | index <= 1 = 1
      | otherwise  = (collatzVec V.! nextWithinLimit) + stepsToGetThere
      where (nextWithinLimit, stepsToGetThere) = nextCollatzWithinLimit index

main :: IO ()
main = do
  -- Use a fold through the vector to find the longest Collatz sequence under
  -- the limit, and keep track of both the maximum length and the initial
  -- value of the sequence, which is the index.
  let (maxLength, maxIndex) = V.ifoldl' accMaxLen (0, 0) collatzVec
      accMaxLen acc@(accMaxLen, accMaxIndex) index currLen
        | currLen <= accMaxLen = acc
        | otherwise            = (currLen, index)
  putStrLn $ "Max Collatz length below " ++ show limit ++ " is "
             ++ show maxLength ++ " at index " ++ show maxIndex
How much space does ridge regression require?
In Haskell, ridge regression can be expressed as:

import Numeric.LinearAlgebra

createReadout :: Matrix Double → Matrix Double → Matrix Double
createReadout a b = oA <\> oB
  where
    μ = 1e-4
    oA = (a <> (tr a)) + (μ * (ident $ rows a))
    oB = a <> (tr b)

However, this operation is very memory expensive. Here is a minimalistic example that requires more than 2GB on my machine and takes 3 minutes to execute.

import Numeric.LinearAlgebra
import System.Random

createReadout :: Matrix Double -> Matrix Double -> Matrix Double
createReadout a b = oA <\> oB
  where
    mu = 1e-4
    oA = (a <> (tr a)) + (mu * (ident $ rows a))
    oB = a <> (tr b)

teacher :: [Int] -> Int -> Int -> Matrix Double
teacher labelsList cols' correctRow = fromBlocks $ f <$> labelsList
  where
    ones = konst 1.0 (1, cols')
    zeros = konst 0.0 (1, cols')
    rows' = length labelsList
    f i | i == correctRow = [ones]
        | otherwise = [zeros]

glue :: Element t => [Matrix t] -> Matrix t
glue xs = fromBlocks [xs]

main :: IO ()
main = do
  let n = 1500  -- <- The constant to be increased
      m = 10000
      cols' = 12
  g <- newStdGen
  -- Stub data
  let labels = take m . map (`mod` 10) . randoms $ g :: [Int]
      a = (n >< (cols' * m)) $ take (cols' * m * n) $ randoms g :: Matrix Double
      teachers = zipWith (teacher [0..9]) (repeat cols') labels
      b = glue teachers
  print $ maxElement $ createReadout a b
  return ()

$ cabal exec ghc -- -O2 Test.hs
$ time ./Test
./Test  190.16s user 5.22s system 106% cpu 3:03.93 total

The problem is to increase the constant n, at least to n = 4000, while RAM is limited to 5 GB. What is the minimal space that the matrix inversion operation requires in theory? How can this operation be optimized in terms of space? Can ridge regression be efficiently replaced with a cheaper method?
Simple Gauss-Jordan elimination only takes space to store the input and output matrices plus constant auxiliary space. If I'm reading correctly, the matrix oA you need to invert is n x n so that's not a problem. Your memory usage is completely dominated by storing the input matrix a, which uses at least 1500 * 120000 * 8 = 1.34 GB. n = 4000 would be 4000 * 120000 * 8 = 3.58 GB which is over half of your space budget. I don't know what matrix library you are using or how it stores its matrices, but if they are on the Haskell heap then GC effects could easily account for another factor of 2 in space usage.
Well, you can get away with 3*m + n*n space, but how numerically stable this will be I'm not sure.

The basis is the identity

inv(inv(Q) + A'*A) = Q - Q*A'*R*A*Q   where   R = inv(I + A*Q*A')

If A is your A matrix and

Q = inv(mu*I * mu*I) = I/(mu*mu)

then the solution to your ridge regression is

inv(inv(Q) + A'*A) * A'*b

A little more algebra shows

inv(inv(Q) + A'*A) = (I - A'*inv(mu2*I + A*A')*A)/mu2   where   mu2 = mu*mu

Note that since A is n x m, A*A' is n x n. So one algorithm would be:

1. Compute C = A*A' + mu2*I.
2. Do a Cholesky decomposition of C, i.e. find upper triangular U so that U'*U = C.
3. Compute the vector y = A'*b.
4. Compute the vector z = A*y.
5. Solve U'*u = z for u (overwriting z).
6. Solve U*v = z for v (overwriting z).
7. Compute w = A'*z.
8. Compute x = (y - w)/mu2.
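Here is a rough hmatrix transcription of those steps (my sketch, not code from this thread): for brevity it uses hmatrix's general solver <\> on C instead of an explicit Cholesky factorization with triangular solves, and it orients b as in the algebra above, with the same number of rows as a, which differs from the question's createReadout.

import Numeric.LinearAlgebra

ridgeViaSmallInverse :: Matrix Double -> Matrix Double -> Matrix Double
ridgeViaSmallInverse a b = scale (recip mu2) (y - w)
  where
    mu  = 1e-4 :: Double
    mu2 = mu * mu
    c   = (a <> tr a) + scale mu2 (ident (rows a))  -- C = A*A' + mu2*I, only n x n
    y   = tr a <> b                                 -- y = A'*b
    z   = a <> y                                    -- z = A*y
    v   = c <\> z                                   -- solve C*v = z
    w   = tr a <> v                                 -- w = A'*v

With n = 4000 the n x n matrix C occupies only 4000 * 4000 * 8 bytes = 128 MB, so the dominant memory cost remains the input matrix a itself.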
How do I optimize numerical integration performance in Haskell (with example)
How do I optimize a numerical integration routine (compared to C)? What has been done so far:

I replaced lists with unboxed vectors (obvious).
I applied profiling techniques described in the book "Real World Haskell" http://book.realworldhaskell.org/read/profiling-and-optimization.html. I have inlined some trivial functions and inserted a lot of bangs everywhere. That gave about 10x speedup.
I refactored the code (i.e. extracted the iterator function). That gave 3x speedup.
I tried to replace polymorphic signatures with Floats, as in the answer to the question "Optimizing numerical array performance in Haskell". That gave almost 2x speedup.

I compile like this:

cabal exec ghc -- Simul.hs -O2 -fforce-recomp -fllvm -Wall

UPDATE: As suggested by cchalmers, type Sample = (F, F) was replaced with

data Sample = Sample {-# UNPACK #-} !F {-# UNPACK #-} !F

The performance now is almost as good as the C code. Can we do better?

{-# LANGUAGE BangPatterns #-}
module Main where

import qualified Data.Vector.Unboxed as U
import qualified Data.Vector.Unboxed.Mutable as UM
import qualified Control.Monad.Primitive as PrimitiveM

import Dynamics.Nonlin ( birefrP )

type F = Float

type Delay = U.Vector F
type Input = U.Vector F

-- Sample can be a vector of any length (x, y, z, ...)
data Sample = Sample {-# UNPACK #-} !F {-# UNPACK #-} !F

-- Pair is used to define exactly a pair of values
data Pair = Pair {-# UNPACK #-} !F {-# UNPACK #-} !F

type ParametrizedDelayFunction = (Sample, F) -> Sample

getX :: Sample -> F
getX (Sample a _) = a
{-# INLINE getX #-}

toDelay :: [F] -> Delay
toDelay = U.fromList

stepsPerNode :: Int
stepsPerNode = 40  -- Number of integration steps per node

infixl 6 ..+..
(..+..) :: Sample -> Sample -> Sample
(..+..) (Sample x1 y1) (Sample x2 y2) = Sample (x1 + x2) (y1 + y2)
{-# INLINE (..+..) #-}

infixl 7 .*..
(.*..) :: F -> Sample -> Sample
(.*..) c (Sample x2 y2) = Sample (c * x2) (c * y2)
{-# INLINE (.*..) #-}

-- | Ikeda model (dynamical system, DDE)
ikeda_model2 :: (F -> F) -> (Sample, F) -> Sample
ikeda_model2 f (!(Sample x y), !x_h) = Sample x' y'
  where
    !x' = recip_epsilon * (-x + (f x_h))
    y' = 0
    recip_epsilon = 2^(6 :: Int)

-- | Integrate using improved Euler's method (fixed step).
--
-- hOver2 is already half of step size h
-- f is the function to integrate
-- x_i is current argument (x and y)
-- x_h is historical (delayed) value
-- x_h2 is the value after x_h
heun2 :: F -> ParametrizedDelayFunction -> Sample -> Pair -> Sample
heun2 hOver2 f !x !(Pair x_h x_h2) = x_1
  where
    !f1 = f (x, x_h)
    !x_1' = x ..+.. 2 * hOver2 .*.. f1
    !f2 = f (x_1', x_h2)
    !x_1 = x ..+.. hOver2 .*.. (f1 ..+.. f2)

initialCond :: Int -> (Sample, Delay, Int)
initialCond nodesN = (initialSampleXY, initialInterval, samplesPerDelay)
  where
    cdi = 1.1247695e-4 :: F  -- A fixed point for birefrP
    initialInterval = U.replicate samplesPerDelay cdi
    samplesPerDelay = nodesN * stepsPerNode
    initialSampleXY = Sample 0.0 0.0

integrator
  :: PrimitiveM.PrimMonad m
  => (Sample -> Pair -> Sample)
  -> Int
  -> Int
  -> (Sample, (Delay, Input))
  -> m (Sample, U.Vector F)
integrator iterate1 len total (xy0, (history0, input)) = do
    !v <- UM.new total
    go v 0 xy0
    history <- U.unsafeFreeze v
    -- Zero y value, currently not used
    let xy = Sample (history `U.unsafeIndex` (total - 1)) 0.0
    return (xy, history)
  where
    h i = history0 `U.unsafeIndex` i
    go !v !i !xy
      -- The first iteration
      | i == 0 = do
          let !r = iterate1 xy (Pair (h 0) (h 1))
          UM.unsafeWrite v i (getX r)
          go v 1 r
      | i < len - 1 = do
          let !r = iterate1 xy (Pair (h i) (h $ i + 1))
          UM.unsafeWrite v i (getX r)
          go v (i + 1) r
      | i == total = return ()
      -- Iterations after the initial history has been exhausted
      | otherwise = do
          !newX0 <- if i == len - 1
                    then return (getX xy0)
                    else UM.unsafeRead v (i - len - 1)
          !newX <- UM.unsafeRead v (i - len)
          let !r = iterate1 xy (Pair newX0 newX)
          UM.unsafeWrite v i (getX r)
          go v (i + 1) r

-- Not used in this version
zero :: Input
zero = U.fromList []

nodes :: Int
nodes = 306

main :: IO ()
main = do
  let delays = 4000
      (sample0, hist0, delayLength) = initialCond nodes
      -- Iterator implements Heun's schema
      iterator = heun2 (recip 2^(7::Int) :: F) (ikeda_model2 birefrP)
      totalComputedIterations = delayLength * delays
  -- Calculates the whole time trace
  (xy1, history1) <- integrator iterator delayLength totalComputedIterations
                       (sample0, (hist0, zero))
  putStrLn $ show $ getX xy1
  return ()

The nonlinear function (imported) can look like this:

data Parameters = Parameters
  { beta  :: Float
  , alpha :: Float
  , phi   :: Float
  } deriving Show

paramA :: Parameters
paramA = Parameters
  { beta  = 1.1
  , alpha = 1.0
  , phi   = 0.01
  }

birefr :: Parameters -> Float -> Float
birefr par !x = 0.5 * beta' * (1 - alpha' * (cos $ 2.0 * (x + phi')))
  where
    !beta'  = beta par
    !alpha' = alpha par
    !phi'   = phi par

birefrP :: Float -> Float
birefrP = birefr paramA
Problem with bit swapping in Haskell
As part of a school project I'm implementing some cryptographic algorithms in Haskell. As you probably know this involves quite a lot of low-level bit fiddling. Now I am stuck on one particular subroutine which causes me a headache. The routine, which is a permutation on 256 bits, works as follows:

Input: a 256-bit block. All the even-numbered bits (0, 2, ...) in the input block are taken to be the first 128 bits in the output block, while the odd-numbered bits are taken to be the last 128 bits in the output block. More specifically, the formula for the i'th bit in the output is given as follows (a_i is the i'th bit in the input block, and b is the output):

b_i = a_(2i)
b_(i + 2^(d-1)) = a_(2i + 1)

for i from 0 to 2^(d-1) - 1, with d = 8.

As a toy example, assume we used a reduced version of the routine which worked with 16-bit blocks instead of 256 bits. Then the following bitstring would be permuted as follows:

1010 1010 1010 1010 -> 1111 1111 0000 0000

I have not been able to come up with a clean implementation for this function. In particular I have been trying with a ByteString -> ByteString signature, but that sort of forces me to work on a Word8 kind of granularity. But each byte in the output bytestring is a function of bits in all the other bytes, which requires some really messy operations. I will be really grateful for any kind of hint or advice on how to approach this problem.
If you want an efficient implementation, I don't think you can avoid working with bytes. Here is an example solution. It assumes that there is always an even number of bytes in the ByteString. I'm not very familiar with unboxing or strictness tweaking, but I think these would be necessary if you want to be very efficient.

import Data.ByteString (pack, unpack, ByteString)
import Data.Bits
import Data.Word

-- the main attraction
packString :: ByteString -> ByteString
packString = pack . packWords . unpack

-- main attraction equivalent, in [Word8]
packWords :: [Word8] -> [Word8]
packWords ws = evenPacked ++ unevenPacked
  where
    evenBits = map packEven ws
    unevenBits = map packUneven ws
    evenPacked = consumePairs packNibbles evenBits
    unevenPacked = consumePairs packNibbles unevenBits

-- combines 2 low nibbles (first 4 bytes) into a (high nibble, low nibble) word
-- assumes that only the low nibble of both arguments can be non-zero.
packNibbles :: Word8 -> Word8 -> Word8
packNibbles w1 w2 = (shiftL w1 4) .|. w2

packEven w = packBits w [0, 2, 4, 6]

packUneven w = packBits w [1, 3, 5, 7]

-- packBits 254 [0, 2, 4, 6] = 14
-- packBits 254 [1, 3, 5, 7] = 15
packBits :: Word8 -> [Int] -> Word8
packBits w is = foldr (.|.) 0 $ map (packBit w) is

-- packBit 255 0 = 1
-- packBit 255 1 = 1
-- packBit 255 2 = 2
-- packBit 255 3 = 2
-- packBit 255 4 = 4
-- packBit 255 5 = 4
-- packBit 255 6 = 8
-- packBit 255 7 = 8
packBit :: Word8 -> Int -> Word8
packBit w i = shiftR (w .&. 2^i) ((i `div` 2) + (i `mod` 2))

-- sort of like map, but halves the list in size by consuming two elements.
-- Is there a clearer way to write this with built-in functions?
consumePairs :: (a -> a -> b) -> [a] -> [b]
consumePairs f (x : x' : xs) = f x x' : consumePairs f xs
consumePairs _ [] = []
consumePairs _ _ = error "list must contain even number of elements"
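As a quick sanity check against the toy example from the question (my computation; bits are numbered LSB-first within each byte): a 1010... pattern corresponds to bytes of 170 = 0b10101010, whose set bits are exactly the odd-numbered ones.

-- ghci> packWords [170, 170]
-- [0,255]

All even-numbered (clear) bits land in the first output byte and all odd-numbered (set) bits in the second, which matches the 1111 1111 0000 0000 shape of the toy example up to bit- and byte-ordering conventions.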
This should work:

import Data.List
import Data.Function

map fst $ sortBy (compare `on` snd) $ zip yourList $ cycle [0,1]

A bit of explanation: since sortBy is stable and preserves the original order of equal elements, we can pair each value at an even position with a 0 and each value at an odd position with a 1, then simply sort on the second value of the pair. All values at even positions will be placed before the values at odd positions, but their relative order is kept.

Chris
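For example (my run), applied to the question's toy bitstring as a list of characters:

-- ghci> map fst $ sortBy (compare `on` snd) $ zip "1010101010101010" (cycle [0,1])
-- "1111111100000000"

Here position 0 is the leftmost element, so the even positions are exactly the 1s.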
Unless performance is critical, I'd recommend using a bit vector representation for a project like this. As you've discovered, randomly accessing individual bits is something of a pain when they're in packed form, but Data.Vector provides a wealth of functions for these sorts of tasks.

import Data.Bits
import qualified Data.Vector as V

type BitVector = V.Vector Bool

unpack :: (Bits a) => a -> BitVector
unpack w = V.generate (bitSize w) (testBit w)

pack :: (Bits a) => BitVector -> a
pack v = V.ifoldl' set 0 v
  where
    set w i True = w `setBit` i
    set w _ _    = w

mkPermutationVector :: Int -> V.Vector Int
mkPermutationVector d = V.generate (2^d) b
  where
    b i | i < 2^(d-1) = 2*i
        | otherwise   = let i' = i-2^(d-1) in 2*i'+1

permute :: Int -> BitVector -> BitVector
permute d v = V.backpermute v (mkPermutationVector d)

Notice how this lets you specify the permutation by closely transcribing the mathematical description. This substantially reduces the likelihood of errors, and is more pleasant to write than bit-twiddly code.

To test with your example vector (in base 10):

*Main> import Data.Word
*Main Data.Word> let permute16 = pack . permute 4 . unpack :: Word16 -> Word16
*Main Data.Word> permute16 43690
65280

Now, by moving to bit vectors as your representation, you lose a lot of what you get for free by using Haskell types, such as Num instances. However, you can always implement the Num operations for your representation; here's a start:

plus :: BitVector -> BitVector -> BitVector
plus as bs = V.tail sums
  where
    (sums, carries) = V.unzip sumsAndCarries
    sumsAndCarries = V.scanl' fullAdd (False, False) (V.zip as bs)
    fullAdd (_, cin) (a, b) = ((a /= b) /= cin, (a && b) || (cin && (a /= b)))

You may also find Levent Erkok's sbv package useful, although I'm not sure it exposes a function as convenient as backpermute for your particular question.

Update: I thought this was a fun question to answer, so I went ahead and fleshed the code out a bit as a library: bit-vector.