Parallel calls to HMatrix (or FFI in general) - multithreading

I am working with point clouds in Haskell using the repa library (version 3 as well as 4). At least I am trying to.
There are a few operations I need to apply to huge numbers of points, and parallelism helps a lot there. Most of them are simple linear algebra operations on the (metric) neighborhood of a point, for example a principal component analysis, where I need to compute an SVD of a small matrix in which each row is a point.
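For reference, here is roughly what that per-neighborhood computation looks like in hmatrix terms; this is my own sketch, not code from the question, and it assumes the neighborhood points have already been centered.
import qualified Numeric.LinearAlgebra as LA

-- pts holds one (already centered) neighborhood point per row.
pcaOfNeighborhood :: LA.Matrix Double -> (LA.Vector Double, LA.Matrix Double)
pcaOfNeighborhood pts = (sigma, v)     -- singular values and principal directions
  where (_, sigma, v) = LA.svd pts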
Now I use the linear package for the vector type
type Vec3 = V3 Float
and a 1-dimensional array of these vectors for point clouds
type Cloud = Array F DIM1 Vec3
So now I have the problem of calling the matrix decompositions of hmatrix in parallel via computeP. I tried hmatrix itself as well as the repa-linear-algebra package. No matter how I provide the data and no matter which decomposition I call (svd, eigendecomposition, QR decomposition, etc.), the application randomly crashes with a bus error or segfault.
I also haven't found any way to get a stack trace that would at least point me in the right direction; the traces I do get usually end in pthread.
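To make the setup concrete, the crashing calls have roughly the following shape; this is my own sketch rather than the asker's exact code, and neighborhoodMatrix stands for a hypothetical helper that builds the small per-point matrix.
import Data.Array.Repa                 as R
import Data.Array.Repa.Repr.Vector     (V)
import qualified Numeric.LinearAlgebra as LA

allLocalSVDs
  :: (Int -> LA.Matrix Double)         -- assumed neighborhood builder
  -> Int                               -- number of points
  -> IO (Array V DIM1 (LA.Vector Double))
allLocalSVDs neighborhoodMatrix n =
  computeP $ R.fromFunction (Z :. n) $ \(Z :. i) ->
    LA.singularValues (neighborhoodMatrix i)   -- one LAPACK call per point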
Additionally, I wrote my own C code, which I call e.g. like this:
foreign import ccall safe "pca.hpp pca"
    c_pca :: CUInt -> Ptr Float -> Ptr Float -> Ptr Float -> IO ()

{-# INLINE foreignPCA #-}
foreignPCA :: forall r . (Source r Vec3) => Array r DIM1 Vec3 -> ([Vec3], Vec3)
foreignPCA !vs = unsafePerformIO $ do
  let n = Arrays.length vs
  ps <- mallocForeignPtrArray n :: IO (ForeignPtr Vec3)   -- point matrix
  computeIntoS ps (delay vs)
  as <- mallocForeignPtrArray 3 :: IO (ForeignPtr Float)  -- singular values
  av <- mallocForeignPtrArray 3 :: IO (ForeignPtr Vec3)   -- right singular vectors
  withForeignPtr (castForeignPtr ps) $ \pps ->
    withForeignPtr as $ \pas ->
      withForeignPtr (castForeignPtr av) $ \pav -> do
        c_pca (fromIntegral n) pps pas pav
        svalues <- peekArray 3 pas :: IO [Float]
        svecs   <- peekArray 3 (castPtr pav :: Ptr Vec3) :: IO [Vec3]
        let [sx, sy, sz] = svalues
        return (svecs, V3 sx sy sz)
This works perfectly fine on a massive point cloud with 20 cores in parallel. Never crashed in any way.
My very vague suspicion is that hmatrix imports its C/Fortran code with "safe" calls, which lets the RTS run them concurrently from several OS threads even though the underlying code is not actually thread-safe.
I can't really verify this assumption, as debugging seems to be a foreign concept in the Haskell toolchain (at least to a complete newbie like me).
In conclusion I have four questions:
1. Is hmatrix known to have problems when used in parallel?
2. Is anyone working on native implementations of these fundamental algorithms?
3. How do I prevent FFI-wrapped code from spawning fork()'ed threads when I don't have access to the import declaration?
4. How do I debug things like hmatrix?
The second one is of particular interest to me, since I find hmatrix to be incredibly ugly (subhask looks very promising but is too incomplete to be usable). My goal is to switch to Haskell, but if I have to fall back to my own C++ code for trivial tasks like the one above, I might as well keep coding in C++ as I do now...
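A minimal sketch, entirely my own and not from the question, of one way to test the thread-safety hypothesis: funnel every hmatrix-backed computation through a single global lock and see whether the crashes disappear. The names lapackLock and withLapackLock are hypothetical, and evaluate only forces the result to WHNF, so lazily structured results may still leak work outside the lock.
import Control.Concurrent.MVar (MVar, newMVar, withMVar)
import Control.Exception       (evaluate)
import System.IO.Unsafe        (unsafePerformIO)

{-# NOINLINE lapackLock #-}
lapackLock :: MVar ()                   -- one global lock for all LAPACK-backed calls
lapackLock = unsafePerformIO (newMVar ())

-- Run a pure hmatrix computation while holding the lock; the result is
-- forced (to WHNF) before the lock is released.
withLapackLock :: a -> a
withLapackLock x = unsafePerformIO $ withMVar lapackLock $ \_ -> evaluate x
Inside the parallel map one would then write, say, withLapackLock (svd m) instead of svd m; if that makes the segfaults go away, the BLAS/LAPACK build underneath hmatrix is the likely culprit.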

Related

linear congruent generator in haskell

This is a very simple linear congruential pseudo-random number generator. It works fine when I seed it, but I want to make it self-seed with every produced number. The problem is that I don't know how to do that in Haskell, where mutable variables don't exist. I can feed the produced number back in recursively, but then my result would be a list of integers instead of a single number.
linCongGen :: Int -> Int
linCongGen seed = ((2*seed) + 3) `mod` 100
I'll summarize the comments a bit more meaningfully. The simplest solution is, like you observed, an infinite list of the sequence of generated elements. Then, every time you want to get a new number, pop off the head of that list.
linCongGen :: Integral a => a -> [a]
linCongGen = iterate $ \x -> ((2*x) + 3) `mod` 100
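For example, with seed 71 the head of the list is the seed itself and every later element is computed from the one before it:
ghci> take 5 (linCongGen 71)
[71,45,93,89,81]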
That said, here is a solution (which I do not agree with) that does what I think you want. For mutable state, we usually use IORef, which is sort of like a reference or pointer. Here is the code, but please read the disclaimer afterwards.
import Data.IORef
import System.IO.Unsafe

{-# NOINLINE seed #-}   -- keep GHC from inlining the IORef and creating several copies
seed :: IORef Int
seed = unsafePerformIO $ newIORef 71

linCongGen :: IO Int
linCongGen = do
  previous <- readIORef seed
  modifyIORef' seed $ \x -> ((2*x) + 3) `mod` 100
  return previous
And here is a sample usage printing out the first hundred numbers generated: main = replicateM_ 100 $ linCongGen >>= print (you'll need to import Control.Monad for replicateM_).
DISCLAIMER
This is a bit of a hacky approach described here. As the link says "Maybe the need for global mutable state is a symptom of bad design." The link also has a good description of a more intelligent workaround. Making an IORef is an inherently IO operation, and we really shouldn't be using unsafePerformIO on it. If you find yourself fighting Haskell in this way, it's because Haskell was designed to get in your way when you are doing things you shouldn't.
That said, I find comfort in knowing that this approach is also the one used in System.Random (the standard random number module) to define the initial seed (check out the source).

How to wrap unsafe FFI? (Haskell)

This is a followup question to Is there ever a good reason to use unsafePerformIO?
So we know that
double p_sin(double *p) { return sin(*p); }
is unsafe, and cannot be used with unsafePerformIO.
But the p_sin function is still a mathematical function, the fact that it was implemented in an unsafe way is an implementation detail. We don't exactly want, say, matrix multiplication to be in IO just because it involves allocating temporary memory.
How can we wrap this function in a safe way? Do we need to lock, allocate memory ourselves, etc? Is there a guide/tutorial for dealing with this?
Actually, the way p_sin is unsafe in that answer depends on it not being a mathematical function, at least not one from numbers to numbers: it gives different answers when the memory the same pointer points to is different. So, mathematically speaking, the two calls really do receive different inputs; with a formal model of pointers we can make that explicit. E.g.
type Ptr = Int
type Heap = [Double]
p_sin :: Heap -> Ptr -> Double
and then the C function would be equivalent to
p_sin h p = sin (h !! p)
The reason the results would differ is because of a different Heap argument, which is unnamed but implicit in the C definition.
If p_sin used temporary memory internally, but did not depend on the state of memory through its interface, e.g.
double p_sin(double x) {
    double* y = (double*)malloc(sizeof(double));
    *y = sin(x);
    x = *y;
    free(y);
    return x;
}
then we do have an actual mathematical function Double -> Double, and we can
foreign import ccall safe "p_sin"
    p_sin :: Double -> Double
and we'd be fine. Pointers in the interface are killing the purity here, not C functions.
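For completeness, here is a sketch of how the pointer-taking p_sin from the question could still be wrapped purely; this is my own illustration, not part of the answer. The idea is to allocate the temporary ourselves so that no pointer appears in the exposed interface.
import Foreign.Marshal.Alloc (alloca)
import Foreign.Ptr           (Ptr)
import Foreign.Storable      (poke)
import System.IO.Unsafe      (unsafePerformIO)

foreign import ccall safe "p_sin"
    c_p_sin :: Ptr Double -> IO Double

sinViaC :: Double -> Double
sinViaC x = unsafePerformIO $ alloca $ \p -> do
  poke p x       -- the pointer is private to this call,
  c_p_sin p      -- so the result depends only on x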
More practically, let's say you have a C matrix multiplication function implemented with pointers, since that's how you model arrays in C. In this case you'd probably expand the abstraction boundary, so there would be a few unsafe things going on in your program, but they would all be hidden from the module user. In this case, I recommend annotating everything unsafe with IO in your implementation, and then unsafePerformIOing right before you give it to the module user. This minimizes the surface area of impurity.
module Matrix
    -- only export things guaranteed to interact together purely
    (Matrix, makeMatrix, multMatrix)
    where

newtype Matrix = Matrix (Ptr Double)

makeMatrix :: [[Double]] -> Matrix
makeMatrix = unsafePerformIO $ ...

foreign import ccall safe "multMatrix"
    multMatrix_ :: Ptr Double -> IO (Ptr Double)

multMatrix :: Matrix -> Matrix
multMatrix (Matrix p) = unsafePerformIO $ multMatrix_ p

etc.

Pseudorandom number generators in Haskell

I'm working on solutions to the latest Programming Praxis puzzles—the first on implementing the minimal standard random number generator and the second on implementing a shuffle box to go with either that one or a different pseudorandom number generator. Implementing the math is pretty straightforward. The tricky bit for me is figuring out how to put the pieces together properly.
Conceptually, a pseudorandom number generator is a function stepRandom :: s -> (s, a) where s is the type of the internal state of the generator and a is the type of randomly chosen object produced. For a linear congruential PRNG, we could have s = a = Int64, for example, or perhaps s = Int64 and a = Double. This post on PSE does a pretty good job of showing how to use a monad to thread the PRNG state through a random computation, and finish things off with runRandom to run a computation with a certain initial state (seed).
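As a concrete instance of that shape, the Park-Miller "minimal standard" generator from the first puzzle fits stepRandom directly, with the state doubling as the output (my sketch, not the puzzle's required interface):
stepRandom :: Int -> (Int, Int)
stepRandom s = (s', s')
  where s' = (16807 * s) `mod` 2147483647   -- Lehmer step; assumes a 64-bit Int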
Conceptually, a shuffle box is a function shuffle :: box -> a -> (box, a) along with a function to initialize a new box of the desired size with values from a PRNG. In practice, however, the representation of this box is a bit trickier. For efficiency, it should be represented as a mutable array, which forces it into ST or IO. Something vaguely like this:
mkShuffle :: (Integral i, Ix i, MArray a e m) => i -> m e -> m (a i e)
mkShuffle size getRandom = do
  thelist <- replicateM (fromInteger . fromIntegral $ size) getRandom
  newListArray (0, size - 1) thelist

shuffle :: (Integral b, Ix b, MArray a b m) => a b b -> b -> m b
shuffle box n = do
  (start, end) <- getBounds box
  let index = start + n `quot` (end - start + 1)
  value <- readArray box index
  writeArray box index n
  return value
What I really want to do, however, is attach an (initialized?) shuffle box to a PRNG, so as to "pipe" the output from the PRNG into the shuffle box. I don't understand how to set up that plumbing properly.
I'm assuming that the goal is to implement an algorithm as follows: we have a random generator of some sort which we can think of as somehow producing a stream of random values
import Pipes
prng :: Monad m => Producer Int m r
-- produces Ints using the effects of m; it never stops, thus the
-- return type r is polymorphic
We would like to modify this PRNG via a shuffle box. Shuffle boxes have a mutable state Box which is an array of random integers and they modify a stream of random integers in a particular way
shuffle :: Monad m => Box -> Pipe Int Int m r
-- given a box, convert a stream of integers into a different
-- stream of integers using the effects of m without stopping
-- (polymorphic r)
shuffle works on an integer-by-integer basis by indexing into its Box by the incoming random value modulo the size of the box, storing the incoming value there, and emitting the value which was previously stored there. In some sense it's like a stochastic delay function.
So with that spec let's get to a real implementation. We want to use a mutable array so we'll use the vector library and the ST monad. ST requires that we pass around a phantom s parameter that matches throughout a particular ST monad invocation, so when we write Box it'll need to expose that parameter.
import qualified Data.Vector.Mutable as Vm
import Control.Monad.ST
data Box s = Box { sz :: Int, vc :: Vm.STVector s Int }
The sz parameter is the size of the Box's memory and the Vm.STVector s is a mutable ST Vector linked to the s ST thread. We can immediately use this to build our shuffle algorithm, now knowing that the Monad m must actually be ST s.
import Control.Monad

shuffle :: Box s -> Pipe Int Int (ST s) r
shuffle box = forever $ do                 -- this pipe runs forever
  up <- await                              -- wait for upstream
  next <- lift $ do                        -- perform the shuffle using our
    let index = up `rem` sz box            -- mutation primitives in the
    prior <- Vm.read (vc box) index        -- ST monad
    Vm.write (vc box) index up
    return prior
  yield next                               -- then yield the result
Now we'd just like to be able to attach this shuffle to some prng Producer. Since we're using vector it's nice to use the high-performance mwc-random library.
import qualified System.Random.MWC as MWC

-- | Produce a uniformly distributed positive integer
uniformPos :: MWC.GenST s -> ST s Int
uniformPos gen = liftM abs (MWC.uniform gen)

prng :: MWC.GenST s -> Producer Int (ST s) r
prng gen = forever $ do
  val <- lift (uniformPos gen)
  yield val
Notice that since we're passing the PRNG seed, MWC.GenST s, along in an ST s thread we don't need to catch modifications and thread them along as well. Instead, mwc-random uses a mutable STRef s behind the scenes. Also notice that we modify MWC.uniform to return positive indices only as this is required for our indexing scheme in shuffle.
We can also use mwc-random to generate our initial box.
mkBox :: MWC.GenST s -> Int -> ST s (Box s)
mkBox gen size = do
  vec <- Vm.replicateM size (uniformPos gen)
  return (Box size vec)
The only trick here is the very nice Vm.replicateM function which effectively has the constrained type
Vm.replicateM :: Int -> ST s Int -> ST s (Vm.STVector s Int)
where the second argument is an ST s action which generates a new element of the vector.
Finally we have all the pieces. We just need to assemble them. Fortunately, the modularity we get from using pipes makes this trivial.
import qualified Pipes.Prelude as P

run10 :: MWC.GenST s -> ST s [Int]
run10 gen = do
  box <- mkBox gen 1000
  P.toListM (prng gen >-> shuffle box >-> P.take 10)
Here we use (>->) to build a production pipeline and P.toListM to run that pipeline and produce a list. Finally we just need to execute this ST s thread in IO, which is also where we can create our initial MWC.GenST s seed and feed it to run10 using MWC.withSystemRandom, which, as the name says, generates the initial seed from the system's source of randomness.
main :: IO ()
main = do
  result <- MWC.withSystemRandom run10
  print result
And we have our pipeline.
*ShuffleBox> main
[743244324568658487,8970293000346490947,7840610233495392020,6500616573179099831,1849346693432591466,4270856297964802595,3520304355004706754,7475836204488259316,1099932102382049619,7752192194581108062]
Note that the actual operations of these pieces are not terrifically complex. Unfortunately, the types in ST, mwc-random, vector, and pipes are each individually highly generalized and thus can be quite burdensome to comprehend at first. Hopefully the above, where I've deliberately weakened and specialized nearly every type to this exact problem, will be much easier to follow and provide a little bit of intuition for how each of these wonderful libraries works individually and together.

Writing fusible O(1) update for vector

This is a continuation of this question. Since the vector library doesn't seem to have a fusible O(1) update function, I am wondering whether it is possible to write one that doesn't involve unsafeFreeze and unsafeThaw. It would use the vector stream representation, I guess; I am not familiar with how to write one using stream and unstream, hence this question. The motivation is the ability to write a cache-friendly update function on a vector where only a narrow region of the vector is modified, so that we don't have to walk through the entire vector just to process that narrow region (and this operation can happen billions of times in each function call, so the overhead must be kept really low). Transformation functions like map process the entire vector, so they would be too slow.
I have a toy example of what I want to do, but the upd function below uses unsafeThaw and unsafeFreeze - it doesn't seem to be optimized away in the core, and also breaks the promise of not using the buffer further:
module Main where

import Data.Vector.Unboxed as U
import Data.Vector.Unboxed.Mutable as MU
import Control.Monad.ST

upd :: Vector Int -> Int -> Int -> Vector Int
upd v i x = runST $ do
  v' <- U.unsafeThaw v
  MU.write v' i x
  U.unsafeFreeze v'

sum :: Vector Int -> Int
sum = U.sum . (\x -> upd x 0 73) . (\x -> upd x 1 61)

main = print $ Main.sum $ U.fromList [1..3]
I know how to implement imperative algorithms using STVector. In case you are wondering why I want this alternative approach: I want to try using pure vectors to check how GHC's transformation of a particular algorithm differs when it is written with fusible pure vector streams (with monadic operations under the hood, of course).
When the algorithm is written using STVector, it doesn't seem to come out as nicely iterative as I would like (I guess it is harder for the GHC optimizer to spot loops when there is a lot of mutability strewn around). So, I am investigating this alternative approach to see if I can get a nicer loop out of it.
The upd function you have written does not look correct, let alone fusible. Fusion is a library-level optimization and requires you to write your code in terms of certain primitives. In this case what you want is not just fusion but recycling, which can easily be achieved via the bulk update operations such as // and update. These operations will fuse, and even happen in place much of the time.
If you really want to write your own destructive-update-based code, do NOT use unsafeThaw; use modify.
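For illustration (my sketch, not the answerer's code), the question's upd could be written with either the bulk operator // or with modify, and neither needs unsafeThaw:
import qualified Data.Vector.Unboxed         as U
import qualified Data.Vector.Unboxed.Mutable as MU

-- Bulk update: both writes from the question's sum in one pass; this fuses.
updBulk :: U.Vector Int -> U.Vector Int
updBulk v = v U.// [(0, 73), (1, 61)]

-- Single update without unsafeThaw/unsafeFreeze; modify may reuse the
-- buffer in place when the original vector is not used afterwards.
upd :: U.Vector Int -> Int -> Int -> U.Vector Int
upd v i x = U.modify (\mv -> MU.write mv i x) v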
Any function is a fusible update function; you seem to be trying to escape from the programming model that the vector package wants you to use:
module Main where
import Data.Vector.Unboxed as U
change :: Int -> Int -> Int
change 0 n = 73
change 1 n = 61
change m n = n
myfun2 = U.sum . U.imap change . U.enumFromStepN 1 1
main = print $ myfun2 30000000
-- this doesn't create any vectors much less 'update' them, as you will see if you study the core.

How practical is it to embed the core of a language with an effectful function space (like ML) into Haskell?

As Moggi proposed 20 years ago, the effectful function space -> of languages like ML can be decomposed into the standard total function space => plus a strong monad T to capture effects.
A -> B decomposes to A => (T B)
Now, Haskell supports monads, including an IO monad that appears sufficient for the effects in ML, and it has a function space that contains => (but also includes partial functions). So, we should be able to translate a considerable fragment of ML into Haskell via this decomposition. In theory I think this works.
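As a tiny concrete instance of the decomposition (my own, with T = IO): an ML function unit -> int that bumps a counter becomes a total Haskell function into IO Int.
import Data.IORef

freshInt :: IORef Int -> () -> IO Int   -- the ML arrow unit -> int, decomposed
freshInt r () = do
  modifyIORef' r (+1)
  readIORef r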
My question is whether an embedding like this can be practical: is it possible to design a Haskell library that allows programming in Haskell in a style not too far from ML? And if so how will the performance be?
My criteria for "practical" is that existing ML code with extensive use of effects could be relatively easily transcribed into Haskell via the embedding, including complicated cases involving higher-order functions.
To make this concrete, my own attempt at such a transcription via the embedding is below. The main function is a transcription of some simple ML code that imperatively generates 5 distinct variable names. Rather than use the decomposition directly, my version lifts functions so that they evaluate their arguments - the definitions prior to main are a mini-library including lifted primitives. This works okay, but some aspects aren't totally satisfactory.
There's a little too much syntactic noise for the injection of values into computations via val. Having unlifted versions of functions (like rdV) would help this, at the cost of requiring these to be defined.
Non-value definitions like varNum require monadic binding via <- in a do. This then forces any definitions that depend on them to also be in the same do expression.
It seems then that the whole program might end up being in one huge do expression. This is how ML programs are often considered, but in Haskell it's not quite as well supported - e.g., you're forced to use case instead of equations.
I guess there will be some laziness despite threading the IO monad throughout. Given that the ML program would be designed for strict evaluation, the laziness should probably be removed. I'm uncertain what the best way to do this is though.
So, any advice on improving this, or on better approaches using the same decomposition, or even quite different ways of achieving the same broad goal of programming in Haskell using a style that mirrors ML?
(It's not that I dislike the style of Haskell, it's just that I'd like to be able to map existing ML code easily.)
import Data.IORef
import Control.Monad
val :: Monad m => a -> m a
val = return
ref = join . liftM newIORef
rdV = readIORef -- Unlifted, hence takes a value
(!=) r x = do { rr <- r; xx <- x; writeIORef rr xx }
(.+),(.-) :: IO Int -> IO Int -> IO Int
( (.+),(.-) ) = ( liftM2(+), liftM2(-) )
(.:) :: IO a -> IO [a] -> IO [a]
(.:) = liftM2(:)
showIO :: Show a => IO a -> IO String
showIO = liftM show
main = do
  varNum <- ref (val 0)
  let newVar = (=<<) $ \() -> val varNum != (rdV varNum .+ val 1) >>
                              val 'v' .: (showIO (rdV varNum))
  let gen = (=<<) $ \n -> case n of
              0  -> return []
              nn -> (newVar $ val ()) .: (gen (val n .- val 1))
  gen (val 5)
Here's a possible way, by sigfpe. It doesn't cover lambdas, but it seems it can be extended to them.
