How much space does ridge regression require? - haskell

In Haskell, ridge regression can be expressed as:
import Numeric.LinearAlgebra

createReadout :: Matrix Double -> Matrix Double -> Matrix Double
createReadout a b = oA <\> oB
  where
    mu = 1e-4
    oA = (a <> tr a) + (mu * ident (rows a))
    oB = a <> tr b
However, this operation is very memory-expensive. Here is a minimalistic example that requires more than 2 GB on my machine and takes 3 minutes to execute.
import Numeric.LinearAlgebra
import System.Random

createReadout :: Matrix Double -> Matrix Double -> Matrix Double
createReadout a b = oA <\> oB
  where
    mu = 1e-4
    oA = (a <> tr a) + (mu * ident (rows a))
    oB = a <> tr b
teacher :: [Int] -> Int -> Int -> Matrix Double
teacher labelsList cols' correctRow = fromBlocks $ f <$> labelsList
  where
    ones  = konst 1.0 (1, cols')
    zeros = konst 0.0 (1, cols')
    rows' = length labelsList
    f i | i == correctRow = [ones]
        | otherwise       = [zeros]
glue :: Element t => [Matrix t] -> Matrix t
glue xs = fromBlocks [xs]
main :: IO ()
main = do
  let n = 1500  -- <- The constant to be increased
      m = 10000
      cols' = 12
  g <- newStdGen
  -- Stub data
  let labels = take m . map (`mod` 10) . randoms $ g :: [Int]
      a = (n >< (cols' * m)) $ take (cols' * m * n) $ randoms g :: Matrix Double
      teachers = zipWith (teacher [0..9]) (repeat cols') labels
      b = glue teachers
  print $ maxElement $ createReadout a b
  return ()
$ cabal exec ghc -- -O2 Test.hs
$ time ./Test
./Test 190.16s user 5.22s system 106% cpu 3:03.93 total
The problem is to increase the constant n, at least to n = 4000, while keeping RAM use under 5 GB. What is the minimal space that the matrix inversion operation requires in theory? How can this operation be optimized in terms of space? Can ridge regression be efficiently replaced with a cheaper method?

Simple Gauss-Jordan elimination only takes space to store the input and output matrices plus constant auxiliary space. If I'm reading correctly, the matrix oA you need to invert is n x n so that's not a problem.
Your memory usage is completely dominated by storing the input matrix a, which uses at least 1500 * 120000 * 8 = 1.34 GB. n = 4000 would be 4000 * 120000 * 8 = 3.58 GB which is over half of your space budget. I don't know what matrix library you are using or how it stores its matrices, but if they are on the Haskell heap then GC effects could easily account for another factor of 2 in space usage.
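If the imports above are any guide, the library is hmatrix, whose matrices live in storable buffers; the GC pressure then comes mostly from the 180-million-element lazy list fed to (><). A hedged sketch of generating the stub data directly into one buffer (randomVector and reshape are exported from Numeric.LinearAlgebra in the versions I've seen; verify against yours):
import Numeric.LinearAlgebra

-- Hypothetical replacement for the stub data in main: build `a` in one
-- storable buffer instead of forcing a huge lazy list through the GHC heap.
-- The seed (42) is arbitrary; totalCols would be cols' * m = 120000 here.
mkStubA :: Int -> Int -> Matrix Double
mkStubA n totalCols = reshape totalCols $ randomVector 42 Uniform (n * totalCols)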

Well, you can get away with 3*m + n*n space, but how numerically stable this will be I'm not sure.
The basis is the identity
inv(inv(Q) + A'*A) = Q - Q*A'*R*A*Q
where R = inv(I + A*Q*A')
If A is your A matrix and
Q = inv(mu*I * mu*I) = I/(mu*mu)
then the solution to your ridge regression is
inv(inv(Q) + A'*A) * A'*b
A little more algebra shows
inv(inv(Q) + A'*A) = (I - A'*inv(mu2*I + A*A')*A)/mu2
where mu2 = mu*mu (the square of the regularisation constant)
Note that since A is n x m, A*A' is n x n.
So one algorithm would be:
1. Compute C = A*A' + mu2*I.
2. Do a Cholesky decomposition of C, i.e. find an upper triangular U so that U'*U = C.
3. Compute the vector y = A'*b.
4. Compute the vector z = A*y.
5. Solve U'*u = z for u (this can overwrite z).
6. Solve U*v = u for v (again in place).
7. Compute w = A'*v.
8. Compute x = (y - w)/mu2.
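For reference, a minimal hmatrix sketch of those steps (assuming a recent hmatrix where chol takes a Herm argument built with trustSym, and cholSolve performs both triangular solves; adjust the names to your version):
import Numeric.LinearAlgebra

-- Sketch of the algorithm above: only the n x n matrix C is ever formed.
ridgeSmallSpace :: Double -> Matrix Double -> Vector Double -> Vector Double
ridgeSmallSpace mu a b = scale (1 / mu2) (y - w)
  where
    mu2 = mu * mu
    c = (a <> tr a) + scale mu2 (ident (rows a))   -- C = A*A' + mu2*I
    y = tr a #> b                                  -- y = A'*b
    z = a #> y                                     -- z = A*y
    v = cholSolve (chol (trustSym c)) (asColumn z) -- steps 5 and 6
    w = tr a #> flatten v                          -- w = A'*v
Only C, its Cholesky factor, and a handful of length-m vectors are ever alive, which is the 3*m + n*n figure quoted above.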

Related

Mutable list of mutable non-integral types in Haskell

I'm trying to parse a huge 3D data array of complex values from binary. Later this should become l matrices (n x m). Since I'm going to work on these matrices, I'm limited to matrix libraries - hmatrix seems to be promising.
The data layout is not in my required format, so I have to jump around between positions, (i,j,k) -> (k,i,j), where i and j index into n and m, and k indexes into l.
I think the only way to read this in is by using mutable structures; otherwise I'll end up with several terabytes of garbage. My idea was to use boxed mutable arrays or vectors of mutable matrices (STMatrix from Numeric.LinearAlgebra.Devel), so I end up with something like:
data MVector s (STMatrix s t)
But I'm not sure how to use them correctly:
I can modify one single element of the MVector with modify:
modify :: PrimMonad m => MVector (PrimState m) a -> (a -> a) -> Int -> m ()
or use modifyM (strange: vector-0.12.3.0 in Stack does not have modifyM...)
modifyM :: PrimMonad m => MVector (PrimState m) a -> (a -> m a) -> Int -> m ()
so I could use the function argument (a -> a) to run an ST routine that modifies the STMatrix. I'm not sure whether I should put an ST inside an IO (?).
Nevertheless, I think this should work, but it is only useful when I want to modify the whole matrix; calling this (a -> a) routine n x m x l times will add some overhead (maybe it will be optimized away...).
So I would end up marshalling the array, modifying the contents via pointers ((i,j,k) -> (k,i,j)) and reading everything back matrix by matrix - but this does not feel right, and I wanted to avoid such dirty tricks.
Do you have any ideas for a way to do this a little more... cleanly?
Thanks
Edit:
Thanks to K. A. Buhr: his solution works so far. Now I'm only running into some performance problems. If I compare this solution:
{-# LANGUAGE BangPatterns #-}
module Main where

import Data.List
import Numeric.LinearAlgebra
import qualified Data.Vector as V
import qualified Data.Vector.Storable as VS
import qualified Data.Vector.Storable.Mutable as VSM

-- Create an l-length list of n x m hmatrix Matrices
toMatrices :: Int -> Int -> Int -> [C] -> [Matrix C]
toMatrices l n m dats = map (reshape m) $ VS.createT $ do
    mats <- V.replicateM l $ VSM.unsafeNew (m*n)
    sequence_ $ zipWith (\(i,j,k) x ->
        VSM.unsafeWrite (mats V.! k) (loc i j) x) idxs (dats ++ repeat 0)
    return $ V.toList mats
  where idxs = (,,) <$> [0..n-1] <*> [0..m-1] <*> [0..l-1]
        loc i j = i*m + j

test1 = toMatrices 1000 1000 100 (fromIntegral <$> [1..])

main = do
  let !a = test1
  print "done"
with the simplest C code:
#include <stdlib.h>
#include <stdio.h>

void main()
{
    const int n = 1000;
    const int m = 1000;
    const int l = 100;

    double *src = malloc(n*m*l * sizeof(double));
    for (int i = 0; i < n*m*l; i++) {
        src[i] = (double)i;
    }

    double *dest = malloc(n*m*l * sizeof(double));
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < m; j++) {
            for (int k = 0; k < l; k++) {
                dest[k*n*m+i*m+j] = src[i*m*l+j*l+k];
            }
        }
    }

    printf("done: %f\n", dest[n*m*l - 1]); // Need to access the array, otherwise it'll get lost by -O2
    free(src);
    free(dest);
}
both compiled with -O2, I get the following timings:
real 0m5,611s
user 0m14,845s
sys 0m2,759s
vs.
real 0m0,441s
user 0m0,200s
sys 0m0,240s
That is roughly two orders of magnitude in per-core performance. From profiling I learned that
VSM.unsafeWrite (mats V.! k) (loc i j) x
is the expensive function.
Since I'll run this procedure at roughly one-minute intervals, I want to keep the parsing time as low as the disk access time. I'll see if I can speed this up.
PS: This is part of a test to see whether I could move typical DSP work from C-like languages to Haskell.
Edit 2:
OK, this is what I get after some experimenting:
{-# LANGUAGE BangPatterns #-}
module Main where

import Data.List
import qualified Data.Vector as V
import qualified Data.Vector.Storable as VS
import qualified Data.Vector.Storable.Mutable as VSM
import Numeric.LinearAlgebra

-- Create an l-length list of n x m hmatrix Matrices
toMatrices :: Int -> Int -> Int -> VS.Vector C -> V.Vector (Matrix C)
toMatrices l n m dats =
    V.map (reshape m) newMat
  where
    newMat = VS.createT $
        V.generateM l $ \k -> do
          curMat <- VSM.unsafeNew (m * n)
          VS.mapM_
            (\i ->
               VS.mapM_
                 (\j -> VSM.unsafeWrite curMat (loc i j) (dats VS.! (oldLoc i j k)))
                 idjs)
            idis
          return curMat
    loc i j = i * m + j
    oldLoc i j k = i * m * l + j * l + k
    !idis = VS.generate n (\a -> a)
    !idjs = VS.generate m (\a -> a)

test1 = toMatrices 100 1000 1000 arr
  where
    arr = VS.generate (1000 * 1000 * 100) fromIntegral :: VS.Vector C

main = do
  let !a = test1
  print "done"
It gives roughly:
real 0m1,816s
user 0m1,636s
sys 0m1,120s
so about 4 times slower than the C code. I think I can live with this.
I guess I'm destroying all the stream fusion of vector with this code. If there are any suggestions for getting it back at comparable speed, I would be grateful!
As I understand it, you have a "huge" set of data in i-major, j-middling, k-minor order, and you want to load it into matrices indexed by k whose elements have i-indexed rows and j-indexed columns, right? So, you want a function something like:
import Numeric.LinearAlgebra
-- load into "l" matrices of size "n x m"
toMatrices :: Int -> Int -> Int -> [C] -> [Matrix C]
toMatrices l n m dats = ...
Note that you've written n x m matrices above, associating i with n and j with m. It would be more usual to flip the roles of n and m, but I've stuck with your notation, so keep an eye on that.
If the entire data list [C] could fit comfortably in memory, you could do this immutably by writing something like:
import Data.List
import Data.List.Split
import Numeric.LinearAlgebra
toMatrices :: Int -> Int -> Int -> [C] -> [Matrix C]
toMatrices l n m = map (reshape m . fromList) . transpose . chunksOf l
This breaks the input data into l-sized chunks, transposes them into l lists, and converts each list to a matrix. If there was some way to force all the Matrix C values in parallel, this could be done with one traversal through the data, without the need to hold on to the whole list. Unfortunately, the individual Matrix C values can only be forced one-by-one, and the whole list needs to be kept around until all of them can be forced.
So, if the "huge" [C] list is too big for memory, you're probably right that you need to load the data into a (partially) mutable structure. The code is somewhat challenging to write, but it's not too bad in its final form. I believe the following will work:
import Data.List
import Numeric.LinearAlgebra
import qualified Data.Vector as V
import qualified Data.Vector.Storable as VS
import qualified Data.Vector.Storable.Mutable as VSM

-- Create an l-length list of n x m hmatrix Matrices
toMatrices :: Int -> Int -> Int -> [C] -> [Matrix C]
toMatrices l n m dats = map (reshape m) $ VS.createT $ do
    mats <- V.replicateM l $ VSM.unsafeNew (m*n)
    sequence_ $ zipWith (\(i,j,k) x ->
        VSM.unsafeWrite (mats V.! k) (loc i j) x) idxs (dats ++ repeat 0)
    return $ V.toList mats
  where idxs = (,,) <$> [0..n-1] <*> [0..m-1] <*> [0..l-1]
        loc i j = i*m + j

test1 = toMatrices 4 3 2 (fromIntegral <$> [1..24])
test2 = toMatrices 1000 1000 100 (fromIntegral <$> [1..])

main = do
  print $ test1
  print $ norm_Inf . foldl1' (+) $ test2
Compiled with -O2, the maximum residency is about 1.6 GB, which matches the expected memory needed to hold 100 matrices of one million 16-byte complex values in memory, so that looks right.
Anyway, this version of toMatrices is made somewhat complicated by the use of three different vector variants. There's Vector from hmatrix, which is the same as the immutable storable VS.Vector from vector; and then there are two more types from vector: the immutable boxed V.Vector, and the mutable storable VSM.Vector.
The do-block creates a V.Vector of VSM.Vectors and populates those with a sequence of monadic actions performed across index/value pairs. You can load the data in any order by modifying the definition of idxs to match the order of the data stream. The do-block returns the final VSM.Vectors in a list, the helper function VS.createT freezes them all to VS.Vectors (i.e., Vector from hmatrix), and reshape is mapped across the vectors to turn them into m-column matrices.
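For instance (a hypothetical alternative layout, not the one in the question), if the stream were k-major with i and j nested inside, only the idxs definition would change:
where idxs = [ (i, j, k) | k <- [0..l-1], i <- [0..n-1], j <- [0..m-1] ]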
Note that you'll have to take care that in your actual application, the list of data items read from the file isn't kept around by code other than toMatrices, either in the original text form or the parsed numeric form. This shouldn't be too tough to get right, but you might want to test on medium-sized test input before locking up your computer on the real dataset.

How to write a function that calculates x^16?

I have to write functions for calculating x^16 and x^20 in Haskell that use the fewest multiplication steps. It is not allowed to use the built-in exponentiation operator ^. I've tried so many things, but nothing works... I have already written it up to x^8, I guess. What should I do next?
square :: Integer -> Integer
square x = x * x

-- square (square x) = (x^2)^2 = x^4
pow :: Integer -> Integer
pow x = square (square x)

-- pow (pow x) = (x^4)^4 = x^16
pow1 :: Integer -> Integer
pow1 x = pow (pow x)
Let's formulate this the general way. (As dfeuer comments, this is actually not quite the general way, because sub-products could also be shared in other ways, not just in squaring; but for the given examples it's good enough.)
data Exponentiation
    = Variable
    | Multiplication Exponentiation Exponentiation
    | Squaring Exponentiation
    deriving (Eq, Show)

-- The power of x that an expression computes
exponPower :: Exponentiation -> Int
exponPower Variable = 1
exponPower (Multiplication l r) = exponPower l + exponPower r
exponPower (Squaring e) = 2 * exponPower e

-- The number of multiplications an expression costs (a squaring counts as one)
nMultiplications :: Exponentiation -> Int
nMultiplications Variable = 0
nMultiplications (Multiplication l r) = 1 + nMultiplications l + nMultiplications r
nMultiplications (Squaring e) = 1 + nMultiplications e
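A quick sanity check: x^4 written as two squarings has power 4 and costs 2 multiplications:
> exponPower (Squaring (Squaring Variable))
4
> nMultiplications (Squaring (Squaring Variable))
2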
Now we can write a brute-force generator for finding the best possible strategy of multiplying to get a given power:
import Data.List (minimumBy)
import Data.Ord (comparing)
bestMulStrategy :: Int -> Exponentiation
bestMulStrategy 1 = Variable
bestMulStrategy n = minimumBy (comparing nMultiplications)
$ sqStrats ++ mulStrats
where sqStrats
| even n = [Squaring . bestMulStrategy $ n`quot`2]
| otherwise = []
mulStrats
= [ Multiplication l r
| nl <- [1..n-1]
, let l = bestMulStrategy nl
r = bestMulStrategy $ n-nl ]
That works, but it's very inefficient because we keep re-calculating the optimal strategy for the smaller sub-powers over and over, so already for 16 it takes a long time. A simple fix is to memoise the function:
import Data.MemoTrie

bestMulStrategy :: Int -> Exponentiation
bestMulStrategy = memo best
  where best 1 = Variable
        best n = minimumBy (comparing nMultiplications)
                   $ sqStrats ++ mulStrats
          where sqStrats
                  | even n    = [Squaring . bestMulStrategy $ n`quot`2]
                  | otherwise = []
                mulStrats
                  = [ Multiplication l r
                    | nl <- [1..n-1]
                    , let l = bestMulStrategy nl
                          r = bestMulStrategy $ n-nl ]
And then we learn, unsurprisingly,
> bestMulStrategy 16
Squaring (Squaring (Squaring (Squaring Variable)))
For 20, it's a bit more interesting:
> bestMulStrategy 20
Squaring (Squaring (Multiplication Variable (Squaring (Squaring Variable))))
I don't think this is the unique optimum, but it certainly is an optimum (5 multiplications).
What's left to do is implement those strategies as Haskell functions. That could be done with Template Haskell, but I think it's more instructive for you to do it yourself.
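As a nudge in that direction, here is the x^16 strategy transcribed by hand (the x^20 one follows the same pattern and is worth doing yourself, as suggested above):
-- Squaring (Squaring (Squaring (Squaring Variable))), i.e. (((x^2)^2)^2)^2,
-- costing four multiplications in total.
pow16 :: Integer -> Integer
pow16 = square . square . square . square
  where square x = x * x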

Can the backpropagation algorithm change the sign of weights?

I have a spare time project which involves training a neural network with a dynamic data set. I think I've implemented it correctly, and for some starting networks I can train them to match sample data. I've noticed that positive weights can't become negative with training and vice-versa. This leads to 2 questions:
Does the "barrier at 0" limit what behaviours a given starting
network can learn?
If yes: How can I modify the delta rule so that
weights can go between positive and negative?
This is the code I am using at the moment (It's in Haskell). There might be mistakes in it that are obvious to someone else, but it's basically working.
import Data.List
import System.Random

type Layer = [[Double]]
type NNet = [Layer]

sigmoid :: Double -> Double
sigmoid t = 1 / (1 + exp (-t))

-- This is the derivative given the sigmoid value.
-- For the derivative given x: use sigmoidDerivative . sigmoid
sigmoidDerivative :: Double -> Double
sigmoidDerivative fx = fx * (1 - fx)

feedforward = flip (foldl' feedlayer)
feedlayer i = map (sigmoid . sum . zipWith (*) (1:i))

backprop :: Double -> [Double] -> [Double] -> NNet -> NNet
backprop rate i t n = fst $ backprop' i t n where
  backprop' i t (l:n) = (nw:r,be) where
    hs = feedlayer i l
    (r,e) = case n of
      [] -> ([], zipWith subtract hs t)
      x -> backprop' hs t n
    we = zipWith (\oe w ->
        map (*oe) w
      ) e l
    nw = zipWith3 (\wl dl h -> let sdh = sigmoidDerivative h in
        zipWith3 (\w d x ->
          w + rate * d * sdh * x
        ) wl dl (1:i)
      ) l we hs
    be = map sum $ transpose we

randomNNet :: RandomGen g => g -> [Int] -> NNet
randomNNet _ [_] = []
randomNNet gen (i:r@(n:_)) = let
  frl g _ = mapAccumL (\g _ -> let
    (a,g') = randomR (-0.05,0.05) g
    in (g',a)) g $ replicate (i+1) ()
  frg g = mapAccumL frl g $ replicate n ()
  (gen',l1) = frg gen
  in l1 : randomNNet gen' r
I had made a mistake.
When an error signal reaches a node, this should happen:

      -->Multiply by weights-->Propagate to earlier nodes
     /
error
     \
      -->Update weights

What was happening in the code I posted above was this:

                           -->Propagate to earlier nodes
                          /
error-->Multiply by weights
                          \
                           -->Update weights
This is the fixed code, including the comments I inserted to help myself figure out what was going on.
backprop :: Double -> [Double] -> [Double] -> NNet -> NNet
backprop rate i t n = fst $ backprop' i t n where
  backprop' i t (l:n) = (nw:r,be) where
    -- hs: output of this layer
    hs = feedlayer i l
    -- r: the next layer updated
    -- e: the error of this layer's output
    (r,e) = case n of
      [] -> ([], zipWith subtract hs t)
      x -> backprop' hs t n
    -- we: Error divided among weights
    we = zipWith (\oe w ->
        map (*oe) w
      ) e l
    -- nw: New weights for this layer
    -- wl: weights leading to current node
    -- h: this node's output
    nw = zipWith3 (\wl d h -> let sdh = sigmoidDerivative h in
        -- w: The current weight
        -- d: The error assigned to the current node
        -- x: The input to the current synapse
        zipWith (\w x ->
          w + rate * d * sdh * x
        ) wl (1:i)
      ) l e hs
    -- be: Errors to propagate back to earlier nodes
    be = map sum $ transpose we

Optimizing Perlin noise in Haskell

(Dependencies for this program: vector --any and JuicyPixels >= 2. The code is available as a Gist.)
{-# LANGUAGE Haskell2010 #-}
{-# LANGUAGE BangPatterns #-}
import Control.Arrow
import Data.Bits
import Data.Vector.Unboxed ((!))
import Data.Word
import System.Environment (getArgs)
import qualified Codec.Picture as P
import qualified Data.ByteString as B
import qualified Data.Vector.Unboxed as V
I tried to port Ken Perlin's improved noise
to Haskell, but I'm not entirely sure that my method is correct. The main part
is something that should generalize nicely to higher and lower dimensions, but
that is something for later:
perlin3 :: (Ord a, Num a, RealFrac a, V.Unbox a) => Permutation -> (a, a, a) -> a
perlin3 p (!x', !y', !z')
  = let (!xX, !x) = actuallyProperFraction x'
        (!yY, !y) = actuallyProperFraction y'
        (!zZ, !z) = actuallyProperFraction z'
        !u = fade x
        !v = fade y
        !w = fade z
        !h = xX
        !a = next p h + yY
        !b = next p (h+1) + yY
        !aa = next p a + zZ
        !ab = next p (a+1) + zZ
        !ba = next p b + zZ
        !bb = next p (b+1) + zZ
        !aaa = next p aa
        !aab = next p (aa+1)
        !aba = next p ab
        !abb = next p (ab+1)
        !baa = next p ba
        !bab = next p (ba+1)
        !bba = next p bb
        !bbb = next p (bb+1)
    in
      lerp w
        (lerp v
          (lerp u
            (grad aaa (x,   y,   z))
            (grad baa (x-1, y,   z)))
          (lerp u
            (grad aba (x,   y-1, z))
            (grad bba (x-1, y-1, z))))
        (lerp v
          (lerp u
            (grad aab (x,   y,   z-1))
            (grad bab (x-1, y,   z-1)))
          (lerp u
            (grad abb (x,   y-1, z-1))
            (grad bbb (x-1, y-1, z-1))))
This is of course accompanied by a few functions mentioned in the perlin3 function, which I hope are as efficient as possible:
fade :: (Ord a, Num a) => a -> a
fade !t | 0 <= t, t <= 1 = t * t * t * (t * (t * 6 - 15) + 10)

lerp :: (Ord a, Num a) => a -> a -> a -> a
lerp !t !a !b | 0 <= t, t <= 1 = a + t * (b - a)

grad :: (Bits hash, Integral hash, Num a, V.Unbox a) => hash -> (a, a, a) -> a
grad !hash (!x, !y, !z) = dot3 (vks `V.unsafeIndex` fromIntegral (hash .&. 15)) (x, y, z)
  where
    vks = V.fromList
      [ (1,1,0), (-1,1,0), (1,-1,0), (-1,-1,0)
      , (1,0,1), (-1,0,1), (1,0,-1), (-1,0,-1)
      , (0,1,1), (0,-1,1), (0,1,-1), (0,-1,-1)
      , (1,1,0), (-1,1,0), (0,-1,1), (0,-1,-1)
      ]

dot3 :: Num a => (a, a, a) -> (a, a, a) -> a
dot3 (!x0, !y0, !z0) (!x1, !y1, !z1) = x0 * x1 + y0 * y1 + z0 * z1

-- Unlike `properFraction`, `actuallyProperFraction` rounds as intended.
actuallyProperFraction :: (RealFrac a, Integral b) => a -> (b, a)
actuallyProperFraction x
  = let (ipart, fpart) = properFraction x
        r = if x >= 0 then (ipart, fpart)
                      else (ipart-1, 1+fpart)
    in r
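For example, on a negative coordinate the two differ; the floored version is the one the grid-cell logic above needs:
> properFraction (-1.25) :: (Int, Double)
(-1,-0.25)
> actuallyProperFraction (-1.25) :: (Int, Double)
(-2,0.75)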
For the permutation group, I just copied the one Perlin used on his website:
newtype Permutation = Permutation (V.Vector Word8)

mkPermutation :: [Word8] -> Permutation
mkPermutation xs
  | length xs >= 256
  = Permutation . V.fromList $ xs

permutation :: Permutation
permutation = mkPermutation
[151,160,137,91,90,15,
131,13,201,95,96,53,194,233,7,225,140,36,103,30,69,142,8,99,37,240,21,10,23,
190, 6,148,247,120,234,75,0,26,197,62,94,252,219,203,117,35,11,32,57,177,33,
88,237,149,56,87,174,20,125,136,171,168, 68,175,74,165,71,134,139,48,27,166,
77,146,158,231,83,111,229,122,60,211,133,230,220,105,92,41,55,46,245,40,244,
102,143,54, 65,25,63,161, 1,216,80,73,209,76,132,187,208, 89,18,169,200,196,
135,130,116,188,159,86,164,100,109,198,173,186, 3,64,52,217,226,250,124,123,
5,202,38,147,118,126,255,82,85,212,207,206,59,227,47,16,58,17,182,189,28,42,
223,183,170,213,119,248,152, 2,44,154,163, 70,221,153,101,155,167, 43,172,9,
129,22,39,253, 19,98,108,110,79,113,224,232,178,185, 112,104,218,246,97,228,
251,34,242,193,238,210,144,12,191,179,162,241, 81,51,145,235,249,14,239,107,
49,192,214, 31,181,199,106,157,184, 84,204,176,115,121,50,45,127, 4,150,254,
138,236,205,93,222,114,67,29,24,72,243,141,128,195,78,66,215,61,156,180
]
next :: Permutation -> Word8 -> Word8
next (Permutation !v) !idx'
  = v `V.unsafeIndex` (fromIntegral $ idx' .&. 0xFF)
And all this is tied together with JuicyPixels:
main = do
    [target] <- getArgs
    let image = P.generateImage pixelRenderer 512 512
    P.writePng target image
  where
    pixelRenderer, pixelRenderer' :: Int -> Int -> Word8
    pixelRenderer !x !y
      = floor $ ((perlin3 permutation ((fromIntegral x - 256) / 32,
          (fromIntegral y - 256) / 32, 0 :: Double))+1)/2 * 128
    -- This code is much more readable, but also much slower.
    pixelRenderer' x y
      = (\w -> floor $ ((w+1)/2 * 128)) -- w should be in [-1,+1]
      . perlin3 permutation
      . (\(x,y,z) -> ((x-256)/32, (y-256)/32, (z-256)/32))
      $ (fromIntegral x, fromIntegral y, 0 :: Double)
My problem is that perlin3 seems very slow to me. If I profile it, pixelRenderer
is getting a lot of time as well, but I'll ignore that for now. I don't know
how to optimize perlin3. I tried to hint GHC with bang patterns, which cuts
the execution time in half, so that's nice. Explicitly specializing and inlining
barely helps with ghc -O. Is perlin3 supposed to be this slow?
UPDATE: an earlier version of this question mentioned a bug in my code. This problem has been resolved; it turns out my old version of actuallyProperFraction was buggy. It implicitly rounded the integral part of a floating point number to Word8, and then subtracted it from the floating point number to get the fractional part. Since Word8 can only take values between 0 and 255 inclusive, this won't work properly for numbers outside that range, including negative numbers.
This code appears to be mostly computation-bound. It can be improved a little bit, but not by much unless there's a way to use fewer array lookups and less arithmetic.
There are two useful tools for measuring performance: profiling and code dumps. I added an SCC annotation to perlin3 so that it would show up in the profile. Then I compiled with ghc -O2 -fforce-recomp -ddump-simpl -prof -auto. The -ddump-simpl flag prints the simplified code.
Profiling: On my computer, it takes 0.60 seconds to run the program, and about 20% of execution time (0.12 seconds) is spent in perlin3 according to the profile. Note that the precision of my profile info is about +/-3%.
Simplifier output: The simplifier produces fairly clean code. perlin3 gets inlined into pixelRenderer, so that's the part of the output you want to look at. Most of the code consists of unboxed array reads and unboxed arithmetic. To improve performance, we want to eliminate some of this arithmetic.
An easy change is to eliminate the run-time checks on SomeFraction (which doesn't appear in your question, but is part of the code that you uploaded). This reduces the program's execution time to 0.56 seconds.
-- someFraction t | 0 <= t, t < 1 = SomeFraction t
someFraction t = SomeFraction t
Next, there are several array lookups that show up in the simplifier like this:
case GHC.Prim.indexWord8Array#
       ipv3_s23a
       (GHC.Prim.+#
          ipv1_s21N
          (GHC.Prim.word2Int#
             (GHC.Prim.and#
                (GHC.Prim.narrow8Word#
                   (GHC.Prim.plusWord# ipv5_s256 (__word 1)))
                (__word 255))))
The primitive operation narrow8Word# is for coercing from an Int to a Word8. We can get rid of this coercion by using Int instead of Word8 in the definition of next.
next :: Permutation -> Int -> Int
next (Permutation !v) !idx'
  = fromIntegral $ v `V.unsafeIndex` (fromIntegral idx' .&. 0xFF)
This reduces the program's execution time to 0.54 seconds. Considering just the time spent in perlin3, the execution time has fallen (roughly) from 0.12 to 0.06 seconds. Although it's hard to measure where the rest of the time is going, it's most likely spread out among the remaining arithmetic and array accesses.
On my machine, the reference code with Heatsink's optimisations takes 0.19 secs.
First, I moved from JuicyPixels to yarr and yarr-image-io with my favourite flags, -Odph -rtsopts -threaded -fno-liberate-case -funbox-strict-fields -fexpose-all-unfoldings -funfolding-keeness-factor=1000 -fsimpl-tick-factor=500 -fllvm -optlo-O3 (they are given here):
import Data.Yarr as Y
import Data.Yarr.IO.Image as Y
...
main = do
    [target] <- getArgs
    image <- dComputeS $ fromFunction (512, 512) (return . pixelRenderer)
    Y.writeImage target (Grey image)
  where
    pixelRenderer, pixelRenderer' :: Dim2 -> Word8
    pixelRenderer (y, x)
      = floor $ ((perlin3 permutation ((fromIntegral x - 256) / 32,
          (fromIntegral y - 256) / 32, 0 :: Double))+1)/2 * 128
    -- This code is much more readable, but also much slower.
    pixelRenderer' (y, x)
      = (\w -> floor $ ((w+1)/2 * 128)) -- w should be in [-1,+1]
      . perlin3 permutation
      . (\(x,y,z) -> ((x-256)/32, (y-256)/32, (z-256)/32))
      $ (fromIntegral x, fromIntegral y, 0 :: Double)
This made the program 30% faster: 0.13 seconds.
Second, I replaced uses of the standard floor with:
doubleToByte :: Double -> Word8
doubleToByte f = fromIntegral (truncate f :: Int)
This is a known issue (google "haskell floor performance"). Execution time is reduced to 52 ms (0.052 secs), almost 3 times faster.
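Concretely, the change is just swapping floor for doubleToByte in pixelRenderer - a sketch of the substitution:
pixelRenderer (y, x)
  = doubleToByte $ ((perlin3 permutation ((fromIntegral x - 256) / 32,
      (fromIntegral y - 256) / 32, 0 :: Double))+1)/2 * 128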
Finally, just for fun, I tried to compute the noise in parallel (dComputeP instead of dComputeS, and +RTS -N4 on the command line). The program took 36 ms, including an I/O constant of about 10 ms.

Is it possible to use Haskell to reasonably solve large DP problems

I wrote code for solving the local alignment problem with the Smith–Waterman algorithm.
I want to do this on input strings of length 10000, with reasonable memory (under 2 GB of RAM) and reasonable time (under 5 minutes).
At first I was using the bio library's built-in function for this, and it ran way too slowly and ate up 4 GB of RAM before I killed it.
Note that the Java program jAligner, which implements the same algorithm, can solve this problem with less than 1 GB of memory and in less than 20 seconds.
When I wrote an unboxed version of this, the program gave me <<loop>>. I think it's because the array needs to access items in the array before the array gets built entirely.
So I wonder: is it even possible to write Haskell code with similar performance for this kind of larger dynamic programming problem?
module LocalAlign where

--import Data.Array.Unboxed
import Data.Tuple
import Data.Array

localAffineAlignment :: (Char -> Char -> Int)
                     -> Int
                     -> Int
                     -> String
                     -> String
                     -> (Int, (String, String, String, String))
localAffineAlignment f g e s' t' = (score, best) where
  n = length s'
  m = length t'
  s = array (0,n-1) $ zip [0..n-1] s'
  t = array (0,m-1) $ zip [0..m-1] t'
  table :: (Array (Int,Int) Int, Array (Int,Int) Int)
  table = (c,d)
    where --a :: UArray (Int,Int) Int
      a = array ((0,0),(n,m)) [((x,y),a' x y) | x<-[0..n], y<-[0..m]] -- s ends with gap
      b = array ((0,0),(n,m)) [((x,y),b' x y) | x<-[0..n], y<-[0..m]] -- t ends with gap
      c = array ((0,0),(n,m)) [((x,y),fst (c' x y)) | x<-[0..n], y<-[0..m]] -- best
      d = array ((0,0),(n,m)) [((x,y),snd (c' x y)) | x<-[0..n], y<-[0..m]] -- direction
      a' i j
        | i==0 || j==0 = inf
        | otherwise    = max (a!(i-1,j)-e) (c!(i-1,j)-g-e)
      b' i j
        | i==0 || j==0 = inf
        | otherwise    = max (b!(i,j-1)-e) (c!(i,j-1)-g-e)
      c' i j
        | min i j == 0 = (0,0)
        | otherwise    = maximum [(b!(i,j),3),(a!(i,j),2),(c!(i-1,j-1) + f u v,1),(0,0)]
        where u = s!(i-1)
              v = t!(j-1)
      inf = -1073741824
  score :: Int
  score = maximum $ elems $ fst table
  best :: (String, String, String, String)
  best = (drop si $ take ei s', drop sj $ take ej t', b1, b2)
    where (a,d') = table
          (si,sj,b1,b2) = build ei ej [] []
          (ei,ej) = snd $ maximum $ map swap $ assocs a
          build x y ss tt
            | o == 0    = (x,y,ss,tt)
            | d == 1    = build (x-1) (y-1) (u:ss) (v:tt)
            | d == 2    = build (x-1) y (u:ss) ('-':tt)
            | otherwise = build x (y-1) ('-':ss) (v:tt)
            where o = a!(x,y)
                  d = d'!(x,y)
                  u = s!(x-1)
                  v = t!(y-1)
is it even possible to write Haskell code with similar performance for this kind of larger dynamic programming problems.
Yes, of course. Use the same data structures and the same algorithms, and you will get the same (or better, or worse, by constant factors) performance.
You are using (intermediate) lists and boxed arrays heavily. Consider using the vector package instead.
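One way to see why the unboxed version hit <<loop>>: lazy boxed arrays may be defined in terms of their own elements (laziness ties the knot), but unboxed ones cannot, so the table has to be filled strictly in dependency order instead. A minimal sketch with unboxed vectors, simplified to a linear gap penalty and score only (no affine gaps, no traceback):
import Data.List (foldl')
import qualified Data.Vector.Unboxed as U

-- Smith-Waterman score, filled row by row. Each row depends only on the
-- previous row and on its own already-built prefix (via constructN), so
-- there is no self-referential array definition to trigger <<loop>>.
swScore :: (Char -> Char -> Int) -> Int -> String -> String -> Int
swScore f gap s t = fst (foldl' nextRow (0, row0) s)
  where
    tv   = U.fromList t
    n    = U.length tv
    row0 = U.replicate (n + 1) (0 :: Int)
    nextRow (best, prev) si = (max best (U.maximum cur), cur)
      where
        cur = U.constructN (n + 1) cell
        cell pre
          | j == 0    = 0
          | otherwise = maximum
              [ 0
              , prev U.! (j - 1) + f si (tv U.! (j - 1)) -- match/mismatch
              , prev U.! j - gap                         -- gap in t
              , pre U.! (j - 1) - gap                    -- gap in s
              ]
          where j = U.length pre
Only two rows of length m+1 are alive at a time. For the traceback you would additionally keep a direction table; stored flat at one byte per cell, that is about 100 MB for two 10000-character strings, which is in jAligner's ballpark.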
You might be interested in the MemoCombinators library, which makes doing dynamic programming much easier. You can basically write the algorithm without memoization, then just annotate which variables you want memoized, and the library takes it from there.
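For a flavour of that style, a sketch assuming the data-memocombinators package, with a toy 2-D recurrence standing in for the alignment table:
import qualified Data.MemoCombinators as Memo

-- Binomial coefficients as a stand-in 2-D DP: write the plain recurrence,
-- then wrap both Int arguments with a memoizer via memo2.
choose :: Int -> Int -> Integer
choose = Memo.memo2 Memo.integral Memo.integral go
  where
    go _ 0 = 1
    go n k
      | k > n     = 0
      | otherwise = choose (n - 1) (k - 1) + choose (n - 1) k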
