I'm trying to parse a huge 3d-data array of complex values from binary. Later this should become l matrices (n x m). Since I'm going to work on these matrices, I'm limited to matrix libraries - hmatrix seems to be promising.
The data layout is not in my requried format, so I have to jump around in positions (i,j,k) -> (k,i,j), where i and j are elements of n and m and k element of l.
I think the only way to read in this in is my using mutables, otherwise I'll end up with several Terrabytes of garbage. My idea was to use boxed mutual arrays or vectors of mututal matrices (STMatrix from Numeric.LinearAlgebra.Devel), so I end up with something like:
data MVector s (STMatrix s t)
But I'm not sure how to use them correctly:
I can modify one single element of the MVector with modify:
modify :: PrimMonad m => MVector (PrimState m) a -> (a -> a) -> Int -> m ()
or use modifyM (Strange: in stack vector-0.12.3.0 does not have modifyM...)
modifyM :: PrimMonad m => MVector (PrimState m) a -> (a -> m a) -> Int -> m ()
so I could use the function call (a -> a) to a runST-routine to modify the SMatrix. I'm not sure, if I should put an ST in an IO (?)
Nevertheless - I think, this should work but is only useful, when I want to modify the whole Matrix, calling this (a->a)-routine n x m x l- times will be a little bit overhead (Maybe it will be optimized out...).
So I'll end up, in marshalling the Array, modify the content via pointers (i,j,k) -> (k,i,j) and read everything Matrix by Matrix - but this does not feel right and I wanted to avoid such dirty tricks.
Do you have any ideas of a way to do this a little but more ...clean?
Ty
Edit:
Thx to K. A. Buhr. His solution works so far. Now, I'm only running into some performance impacts. If I compare the solution:
{-# LANGUAGE BangPatterns #-}
module Main where
import Data.List
import Numeric.LinearAlgebra
import qualified Data.Vector as V
import qualified Data.Vector.Storable as VS
import qualified Data.Vector.Storable.Mutable as VSM
-- Create an l-length list of n x m hmatrix Matrices
toMatrices :: Int -> Int -> Int -> [C] -> [Matrix C]
toMatrices l n m dats = map (reshape m) $ VS.createT $ do
mats <- V.replicateM l $ VSM.unsafeNew (m*n)
sequence_ $ zipWith (\(i,j,k) x ->
VSM.unsafeWrite (mats V.! k) (loc i j) x) idxs (dats ++ repeat 0)
return $ V.toList mats
where idxs = (,,) <$> [0..n-1] <*> [0..m-1] <*> [0..l-1]
loc i j = i*m + j
test1 = toMatrices 1000 1000 100 (fromIntegral <$> [1..])
main = do
let !a = test1
print "done"
With the simpliest C-code:
#include <stdlib.h>
#include <stdio.h>
void main()
{
const int n = 1000;
const int m = 1000;
const int l = 100;
double *src = malloc(n*m*l * sizeof(double));
for (int i = 0; i < n*m*l; i++) {
src[i] = (double)i;
}
double *dest = malloc(n*m*l * sizeof(double));
for (int i = 0; i < n; i++) {
for (int j = 0; j < m; j++) {
for (int k = 0; k < l; k++) {
dest[k*n*m+i*m+j] = src[i*m*l+j*l+k];
}
}
}
printf("done: %f\n", dest[n*m*l - 1]); // Need to access the array, otherwise it'll get lost by -O2
free(src);
free(dest);
}
Both compiled with -O2 give following performance guesses:
real 0m5,611s
user 0m14,845s
sys 0m2,759s
vs.
real 0m0,441s
user 0m0,200s
sys 0m0,240s
This are approx 2 magnitudes per-core performance. From profiling I learn that
VSM.unsafeWrite (mats V.! k) (loc i j) x
is the expensive function.
Since I'll use this procedure in a minute-like intervall, I want to keep the parsing time as low as the disk access time. I'll see, if I can speed this up
PS: This is for some tests, if I could move usual DSP from C-like to Haskell
Edit2 :
Ok, this is what I get after sum trying:
{-# LANGUAGE BangPatterns #-}
module Main where
import Data.List
import qualified Data.Vector as V
import qualified Data.Vector.Storable as VS
import qualified Data.Vector.Storable.Mutable as VSM
import Numeric.LinearAlgebra
-- Create an l-length list of n x m hmatrix Matrices
toMatrices :: Int -> Int -> Int -> VS.Vector C -> V.Vector (Matrix C)
toMatrices l n m dats =
V.map (reshape m) newMat
where
newMat = VS.createT $
V.generateM l $ \k -> do
curMat <- VSM.unsafeNew (m * n)
VS.mapM_
(\i ->
VS.mapM_
(\j -> VSM.unsafeWrite curMat (loc i j) (dats VS.! (oldLoc i j k)))
idjs)
idis
return curMat
loc i j = i * m + j
oldLoc i j k = i * m * l + j * l + k
!idis = VS.generate n (\a->a)
!idjs = VS.generate m (\a->a)
test1 = toMatrices 100 1000 1000 arr
where
arr = VS.generate (1000 * 1000 * 100) fromIntegral :: VS.Vector C
main = do
let !a = test1
print "done"
It gives something about:
real 0m1,816s
user 0m1,636s
sys 0m1,120s
, so ~4 times slower than C code. I think I can live with this.
I guess, I'm destroying all the stream-functionality of the vector with this code. If there are any suggestions to have them back by a comparable speed, I would be grateful!
As I understand it, you have a "huge" set of data in i-major, j-middling, k-minor order, and you want to load it into matrices indexed by k whose elements have i-indexed rows and j-indexed columns, right? So, you want a function something like:
import Numeric.LinearAlgebra
-- load into "l" matrices of size "n x m"
toMatrices :: Int -> Int -> Int -> [C] -> [Matrix C]
toMatrices l n m dats = ...
Note that you've written n x m matrices above, associating i with n and j with m. It would be more usual to flip the roles of n and m, but I've stuck with your notation, so keep an eye on that.
If the entire data list [C] could fit comfortably in memory, you could do this immutably by writing something like:
import Data.List
import Data.List.Split
import Numeric.LinearAlgebra
toMatrices :: Int -> Int -> Int -> [C] -> [Matrix C]
toMatrices l n m = map (reshape m . fromList) . transpose . chunksOf l
This breaks the input data into l-sized chunks, transposes them into l lists, and converts each list to a matrix. If there was some way to force all the Matrix C values in parallel, this could be done with one traversal through the data, without the need to hold on to the whole list. Unfortunately, the individual Matrix C values can only be forced one-by-one, and the whole list needs to be kept around until all of them can be forced.
So, if the "huge" [C] list is too big for memory, you're probably right that you need to load the data into a (partially) mutable structure. The code is somewhat challenging to write, but it's not too bad in its final form. I believe the following will work:
import Data.List
import Numeric.LinearAlgebra
import qualified Data.Vector as V
import qualified Data.Vector.Storable as VS
import qualified Data.Vector.Storable.Mutable as VSM
-- Create an l-length list of n x m hmatrix Matrices
toMatrices :: Int -> Int -> Int -> [C] -> [Matrix C]
toMatrices l n m dats = map (reshape m) $ VS.createT $ do
mats <- V.replicateM l $ VSM.unsafeNew (m*n)
sequence_ $ zipWith (\(i,j,k) x ->
VSM.unsafeWrite (mats V.! k) (loc i j) x) idxs (dats ++ repeat 0)
return $ V.toList mats
where idxs = (,,) <$> [0..n-1] <*> [0..m-1] <*> [0..l-1]
loc i j = i*m + j
test1 = toMatrices 4 3 2 (fromIntegral <$> [1..24])
test2 = toMatrices 1000 1000 100 (fromIntegral <$> [1..])
main = do
print $ test1
print $ norm_Inf . foldl1' (+) $ test2
Compiled with -O2, the maximum residency is about 1.6Gigs, which matches the expected memory needed to hold 100 matrices of one million 16-byte complex values in memory, so that looks right.
Anyway, this version of toMatrices is made somewhat complicated by the use of three different vector variants. There's Vector from hmatrix, which is the same as the immutable storable VS.Vector from vector; and then there are two more types from vector: the immutable boxed V.Vector, and the mutable storable VSM.Vector.
The do-block creates a V.Vector of VSM.Vectors and populates those with a sequence of monadic actions performed across index/value pairs. You can load the data in any order by modifying the definition of idxs to match the order of the data stream. The do-block returns the final VSM.Vectors in a list, the helper function VS.createT freezes them all to VS.Vectors (i.e., Vector from hmatrix), and reshape is mapped across the vectors to turn them into m-column matrices.
Note that you'll have to take care that in your actual application, the list of data items read from the file isn't kept around by code other than toMatrices, either in the original text form or the parsed numeric form. This shouldn't be too tough to get right, but you might want to test on medium-sized test input before locking up your computer on the real dataset.
(Dependencies for this program: vector --any and JuicyPixels >= 2. Code is available as Gist.)
{-# LANGUAGE Haskell2010 #-}
{-# LANGUAGE BangPatterns #-}
import Control.Arrow
import Data.Bits
import Data.Vector.Unboxed ((!))
import Data.Word
import System.Environment (getArgs)
import qualified Codec.Picture as P
import qualified Data.ByteString as B
import qualified Data.Vector.Unboxed as V
I tried to port Ken Perlin's improved noise
to Haskell, but I'm not entirely sure that my method is correct. The main part
is something that should generalize nicely to higher and lower dimensions, but
that is something for later:
perlin3 :: (Ord a, Num a, RealFrac a, V.Unbox a) => Permutation -> (a, a, a) -> a
perlin3 p (!x', !y', !z')
= let (!xX, !x) = actuallyProperFraction x'
(!yY, !y) = actuallyProperFraction y'
(!zZ, !z) = actuallyProperFraction z'
!u = fade x
!v = fade y
!w = fade z
!h = xX
!a = next p h + yY
!b = next p (h+1) + yY
!aa = next p a + zZ
!ab = next p (a+1) + zZ
!ba = next p b + zZ
!bb = next p (b+1) + zZ
!aaa = next p aa
!aab = next p (aa+1)
!aba = next p ab
!abb = next p (ab+1)
!baa = next p ba
!bab = next p (ba+1)
!bba = next p bb
!bbb = next p (bb+1)
in
lerp w
(lerp v
(lerp u
(grad aaa (x, y, z))
(grad baa (x-1, y, z)))
(lerp u
(grad aba (x, y-1, z))
(grad bba (x-1, y-1, z))))
(lerp v
(lerp u
(grad aab (x, y, z-1))
(grad bab (x-1, y, z-1)))
(lerp u
(grad abb (x, y-1, z-1))
(grad bbb (x-1, y-1, z-1))))
This is of course accompanied by a few functions mentioned in the perlin3
function, of which I hope they are as efficient as possible:
fade :: (Ord a, Num a) => a -> a
fade !t | 0 <= t, t <= 1 = t * t * t * (t * (t * 6 - 15) + 10)
lerp :: (Ord a, Num a) => a -> a -> a -> a
lerp !t !a !b | 0 <= t, t <= 1 = a + t * (b - a)
grad :: (Bits hash, Integral hash, Num a, V.Unbox a) => hash -> (a, a, a) -> a
grad !hash (!x, !y, !z) = dot3 (vks `V.unsafeIndex` fromIntegral (hash .&. 15)) (x, y, z)
where
vks = V.fromList
[ (1,1,0), (-1,1,0), (1,-1,0), (-1,-1,0)
, (1,0,1), (-1,0,1), (1,0,-1), (-1,0,-1)
, (0,1,1), (0,-1,1), (0,1,-1), (0,-1,-1)
, (1,1,0), (-1,1,0), (0,-1,1), (0,-1,-1)
]
dot3 :: Num a => (a, a, a) -> (a, a, a) -> a
dot3 (!x0, !y0, !z0) (!x1, !y1, !z1) = x0 * x1 + y0 * y1 + z0 * z1
-- Unlike `properFraction`, `actuallyProperFraction` rounds as intended.
actuallyProperFraction :: (RealFrac a, Integral b) => a -> (b, a)
actuallyProperFraction x
= let (ipart, fpart) = properFraction x
r = if x >= 0 then (ipart, fpart)
else (ipart-1, 1+fpart)
in r
For the permutation group, I just copied the one Perlin used on his website:
newtype Permutation = Permutation (V.Vector Word8)
mkPermutation :: [Word8] -> Permutation
mkPermutation xs
| length xs >= 256
= Permutation . V.fromList $ xs
permutation :: Permutation
permutation = mkPermutation
[151,160,137,91,90,15,
131,13,201,95,96,53,194,233,7,225,140,36,103,30,69,142,8,99,37,240,21,10,23,
190, 6,148,247,120,234,75,0,26,197,62,94,252,219,203,117,35,11,32,57,177,33,
88,237,149,56,87,174,20,125,136,171,168, 68,175,74,165,71,134,139,48,27,166,
77,146,158,231,83,111,229,122,60,211,133,230,220,105,92,41,55,46,245,40,244,
102,143,54, 65,25,63,161, 1,216,80,73,209,76,132,187,208, 89,18,169,200,196,
135,130,116,188,159,86,164,100,109,198,173,186, 3,64,52,217,226,250,124,123,
5,202,38,147,118,126,255,82,85,212,207,206,59,227,47,16,58,17,182,189,28,42,
223,183,170,213,119,248,152, 2,44,154,163, 70,221,153,101,155,167, 43,172,9,
129,22,39,253, 19,98,108,110,79,113,224,232,178,185, 112,104,218,246,97,228,
251,34,242,193,238,210,144,12,191,179,162,241, 81,51,145,235,249,14,239,107,
49,192,214, 31,181,199,106,157,184, 84,204,176,115,121,50,45,127, 4,150,254,
138,236,205,93,222,114,67,29,24,72,243,141,128,195,78,66,215,61,156,180
]
next :: Permutation -> Word8 -> Word8
next (Permutation !v) !idx'
= v `V.unsafeIndex` (fromIntegral $ idx' .&. 0xFF)
And all this is tied together with JuicyPixels:
main = do
[target] <- getArgs
let image = P.generateImage pixelRenderer 512 512
P.writePng target image
where
pixelRenderer, pixelRenderer' :: Int -> Int -> Word8
pixelRenderer !x !y
= floor $ ((perlin3 permutation ((fromIntegral x - 256) / 32,
(fromIntegral y - 256) / 32, 0 :: Double))+1)/2 * 128
-- This code is much more readable, but also much slower.
pixelRenderer' x y
= (\w -> floor $ ((w+1)/2 * 128)) -- w should be in [-1,+1]
. perlin3 permutation
. (\(x,y,z) -> ((x-256)/32, (y-256)/32, (z-256)/32))
$ (fromIntegral x, fromIntegral y, 0 :: Double)
My problem is that perlin3 seems very slow to me. If I profile it, pixelRenderer
is getting a lot of time as well, but I'll ignore that for now. I don't know
how to optimize perlin3. I tried to hint GHC with bang patterns, which cuts
the execution time in half, so that's nice. Explicitly specializing and inlining
barely helps with ghc -O. Is perlin3 supposed to be this slow?
UPDATE: an earlier version of this question mentioned a bug in my code. This problem has been resolved; it turns out my old version of actuallyProperFraction was buggy. It implicitly rounded the integral part of a floating point number to Word8, and then subtracted it from the floating point number to get the fractional part. Since Word8 can only take values between 0 and 255 inclusive, this won't work properly for numbers outside that range, including negative numbers.
This code appears to be mostly computation-bound. It can be improved a little bit, but not by much unless there's a way to use fewer array lookups and less arithmetic.
There are two useful tools for measuring performance: profiling and code dumps. I added an SCC annotation to perlin3 so that it would show up in the profile. Then I compiled with gcc -O2 -fforce-recomp -ddump-simpl -prof -auto. The -ddump-simpl flag prints the simplified code.
Profiling: On my computer, it takes 0.60 seconds to run the program, and about 20% of execution time (0.12 seconds) is spent in perlin3 according to the profile. Note that the precision of my profile info is about +/-3%.
Simplifier output: The simplifier produces fairly clean code. perlin3 gets inlined into pixelRenderer, so that's the part of the output you want to look at. Most of the code consists of unboxed array reads and unboxed arithmetic. To improve performance, we want to eliminate some of this arithmetic.
An easy change is to eliminate the run-time checks on SomeFraction (which doesn't appear in your question, but is part of the code that you uploaded). This reduces the program's execution time to 0.56 seconds.
-- someFraction t | 0 <= t, t < 1 = SomeFraction t
someFraction t = SomeFraction t
Next, there are several array lookups that show up in the simplifier like this:
case GHC.Prim.indexWord8Array#
ipv3_s23a
(GHC.Prim.+#
ipv1_s21N
(GHC.Prim.word2Int#
(GHC.Prim.and#
(GHC.Prim.narrow8Word#
(GHC.Prim.plusWord# ipv5_s256 (__word 1)))
(__word 255))))
The primitive operation narrow8Word# is for coercing from an Int to a Word8. We can get rid of this coercion by using Int instead of Word8 in the definition of next.
next :: Permutation -> Int -> Int
next (Permutation !v) !idx'
= fromIntegral $ v `V.unsafeIndex` (fromIntegral idx' .&. 0xFF)
This reduces the program's execution time to 0.54 seconds. Considering just the time spent in perlin3, the execution time has fallen (roughly) from 0.12 to 0.06 seconds. Although it's hard to measure where the rest of the time is going, it's most likely spread out among the remaining arithmetic and array accesses.
On my machine reference code with Heatsink's optimisations takes 0.19 secs.
Firstly, I has moved from JuicyPixels to yarr and yarr-image-io with my favourite flags, -Odph -rtsopts -threaded -fno-liberate-case -funbox-strict-fields -fexpose-all-unfoldings -funfolding-keeness-factor1000 -fsimpl-tick-factor=500 -fllvm -optlo-O3 (they are given here):
import Data.Yarr as Y
import Data.Yarr.IO.Image as Y
...
main = do
[target] <- getArgs
image <- dComputeS $ fromFunction (512, 512) (return . pixelRenderer)
Y.writeImage target (Grey image)
where
pixelRenderer, pixelRenderer' :: Dim2 -> Word8
pixelRenderer (y, x)
= floor $ ((perlin3 permutation ((fromIntegral x - 256) / 32,
(fromIntegral y - 256) / 32, 0 :: Double))+1)/2 * 128
-- This code is much more readable, but also much slower.
pixelRenderer' (y, x)
= (\w -> floor $ ((w+1)/2 * 128)) -- w should be in [-1,+1]
. perlin3 permutation
. (\(x,y,z) -> ((x-256)/32, (y-256)/32, (z-256)/32))
$ (fromIntegral x, fromIntegral y, 0 :: Double)
This makes the program 30% faster, 0.13 seconds.
Secondly I has replaced uses of standard floor with
doubleToByte :: Double -> Word8
doubleToByte f = fromIntegral (truncate f :: Int)
It is known issue (google "haskell floor performance"). Execution time is reduced to 52 ms (0.052 secs), in almost 3 times.
Finally, just for fun I tried to compute noise in parallel (dComputeP instead of dComputeS and +RTS -N4 in command line run). Program took 36 ms, including I/O constant of about 10 ms.
I wrote code for solving the local alignment problem with Smith–Waterman algorithm.
I want to do this with input of strings with length 10000, with reasonable memory(under 2GB ram) and reasonable time (under 5 minutes).
At first I was using bio library's built in function for this, and it runs way too slow and eat up 4GB of ram before I killed it.
Note the java program jAligner, which implements the same algorithm, can solve this problem with less than 1GB of memory and less than 20 seconds.
When I wrote an unboxed version of this, the program gives me <<loop>>. I think it's because the array need to access items in the array before the array gets built entirely.
So I wonder is it even possible to write Haskell code with similar performance for this kind of larger dynamic programming problems.
module LocalAlign where
--import Data.Array.Unboxed
import Data.Tuple
import Data.Array
localAffineAlignment :: (Char -> Char -> Int)
-> Int
-> Int
-> String
-> String
-> (Int, (String, String, String, String))
localAffineAlignment f g e s' t' = (score, best) where
n = length s'
m = length t'
s= array (0,n-1) $ zip [0..n-1] s'
t= array (0,m-1) $ zip [0..m-1] t'
table :: (Array (Int,Int) Int,Array (Int,Int) Int)
table = (c,d)
where --a :: UArray (Int,Int) Int
a = array ((0,0),(n,m)) [((x,y),a' x y)|x<-[0..n],y<-[0..m]] --s end with gap
b = array ((0,0),(n,m)) [((x,y),b' x y)|x<-[0..n],y<-[0..m]] --t end with gap
c = array ((0,0),(n,m)) [((x,y),fst (c' x y))|x<-[0..n],y<-[0..m]] -- best
d = array ((0,0),(n,m)) [((x,y),snd (c' x y))|x<-[0..n],y<-[0..m]] -- direction
a' i j
| i==0 || j==0 = inf
| otherwise = max (a!(i-1,j)-e) (c!(i-1,j)-g-e)
b' i j
| i==0 || j==0 = inf
| otherwise = max (b!(i,j-1)-e) (c!(i,j-1)-g-e)
c' i j
| min i j == 0 = (0,0)
| otherwise = maximum [(b!(i,j),3),(a!(i,j),2),(c!(i-1,j-1) + f u v,1),(0,0)]
where u = s!(i-1)
v = t!(j-1)
inf = -1073741824
score :: Int
score = maximum $ elems $ fst table
best :: (String, String, String, String)
best = (drop si $ take ei s',drop sj $ take ej t',b1,b2)
where (a,d') = table
(si,sj,b1,b2) = build ei ej [] []
(ei,ej) = snd $ maximum $ map swap $ assocs a
build x y ss tt
| o == 0 = (x,y,ss,tt)
| d == 1 = build (x-1) (y-1) (u:ss) (v:tt)
| d == 2 = build (x-1) y (u:ss) ('-':tt)
| otherwise = build x (y-1) ('-':ss) (v:tt)
where o = a!(x,y)
d = d'!(x,y)
u = s!(x-1)
v = t!(y-1)
is it even possible to write Haskell code with similar performance for this kind of larger dynamic programming problems.
Yes, of course. Use the same data structures and the same algorithms, and you will get same (or better, or worse, by constant factors) performance.
You are using (intermediate) lists and boxed arrays heavily. Consider using the vector package instead.
You might be interested in the MemoCombinators library, which makes doing dynamic programming much easier. You can basically write the algorithm without memoization, then just annotate which variables you want memoized, and the compiler takes it from there.