Procedurally generating large list of values in Haskell -- most idiomatic approach? memory management? - haskell

I have a function that takes a series of random numbers/floats, and uses them to generate a value/structure (ie, taking a random velocity and position of the point a ball is thrown from and outputting the coordinates of where it would land). And I need to generate several thousands in succession.
The way I have everything implemented is each calculation takes in an stdGen, uses it to generate several numbers, and passes out a new stdGen to allow it to be chained to another one.
And to do this for 10000 items, I make a sort of list from generate_item n which basically outputs a (value,gen) tuple (the value being the value i'm trying to calculate), where the value of gen is the recursively outputted stdGen from the calculations involved in getting the value from generate_item n-1
However, this program seems to crawl to be impractically slow at around a thousand results or so. And seems to definitely not be scalable. Could it have to do with the fact that I am storing all of the generate_item results in memory?
Or is there a more idomatic way of approaching this problem in Haskell using Monads or something than what I have describe above?
Note that the code to generate the algorithm from the random value generates 10k within seconds even in high-level scripting languages like ruby and python; these calculations are hardly intensive.
Code
-- helper functions that take in StdGen and return (Result,new StdGen)
plum_radius :: StdGen -> (Float,StdGen)
unitpoint :: Float -> StdGen -> ((Float,Float,Float),StdGen)
plum_speed :: Float -> StdGen -> (Float,StdGen)
-- The overall calculation of the value
plum_point :: StdGen -> (((Float,Float,Float),(Float,Float,Float)),StdGen)
plum_point gen = (((px,py,pz),(vx,vy,vz)),gen_out)
where
(r, gen2) = plum_radius gen
((px,py,pz),gen3) = unitpoint r gen2
(s, gen4) = plum_speed r gen3
((vx,vy,vz),gen5) = unitpoint s gen4
gen_out = gen5
-- Turning it into some kind of list
plum_data_list :: StdGen -> Int -> (((Float,Float,Float),(Float,Float,Float)),StdGen)
plum_data_list seed_gen 0 = plum_point seed_gen
plum_data_list seed_gen i = plum_point gen2
where
(_,gen2) = plum_data_list seed_gen (i-1)
-- Getting 100 results
main = do
gen <- getStdGen
let data_list = map (plum_data_list gen) [1..100]
putStrLn List.intercalate " " (map show data_list)

Consider just using the mersenne-twister and the vector-random package , which is specifically optimized to generate large amounts of high-quality random data.
Lists are unsuitable for allocating large amounts of data -- better to use a packed representation -- unless you're streaming.

First of all, the pattern you are describing -- taking an StdGen and then returning a tuple with a value and another StdGen to be chained into the next computation -- is exactly the pattern the State monad encodes. Refactoring your code to use it might be a good way to start to become familiar with monadic patterns.
As for your performance problem, StdGen is notoriously slow. I haven't done a lot with this stuff, but I've heard mersenne twister is faster.
However, you might also want to post your code, since in cases where you are generating large lists, laziness can really work to your advantage or disadvantage depending on how you are doing it. But it is hard to give specific advice without seeing what you are doing. One rule of thumb just in case you are coming from another functional language such as Lisp -- when generating a list (or other lazy data structure -- e.g. a tree, but not a Int), avoid tail recursion. The intuition for it being faster does not transfer to lazy languages. E.g. use (written without the monadic style that I would acutally use in practice)
randoms :: Int -> StdGen -> (StdGen, [Int])
randoms 0 g = (g, [])
randoms n g = let (g', x) = next g
(g'', xs) = randoms (n-1) g'
in (g'', x : xs)
This will allow the result list to be "streamed", so you can access the earlier parts of it before generating the later parts. (In this state case, it's a little subtle because accessing the resulting StdGen will have to generate the whole list, so you'll have to carefully avoid doing that until after you have consumed the list -- I wish there was a fast random generator that supported a good split operation, then you could get around having to return a generator at all).
Oh, just in case you're having trouble getting going with the monads thing, here's the above function written with a state monad:
randomsM :: Int -> State StdGen [Int]
randomsM 0 = return []
randomsM n = do
x <- state next
xs <- randomsM (n-1)
return (x : xs)
See the correspondence?

The other posters have good points, StdGen doesn't perform very well, and you should probably try to use State instead of manually passing the generator along. But I think the biggest problem is your plum_data_list function.
It seems to be intended to be some kind of lookup, but since it's implemented recursively without any memoization, the calls you make have to recurse to the base case. That is, plum_data_list seed_gen 100 needs the random generator from plum_data_list seed_gen 99 and so on, until plum_data_list seed_gen 0. This will give you quadratic performance when you try to generate a list of these values.
Probably the more idiomatic way is to let plum_data_list seed_gen generate an infinite list of points like so:
plum_data_list :: StdGen -> [((Float,Float,Float),(Float,Float,Float))]
plum_data_list seed_gen = first_point : plum_data_list seed_gen'
where
(first_point, seed_gen') = plum_point seed_gen
Then you just need to modify the code in main to something like take 100 $ plum_data_list gen, and you are back to linear performance.

Related

QuickCheck Generator not terminating

This QuickCheck generator
notEqual :: Gen (Int, Int)
notEqual = do
(a, b) <- arbitrary
if a /= b
then return (a, b)
else notEqual
Doesn't seem to terminate when passed to forAll. I'm guessing the recursion is going into an infinite loop somehow, but why?
If you are in doubt of termination/finding results you can always try to circumvent this:
notEqual' :: Gen (Int, Int)
notEqual' = do
start <- arbitrary
delta <- oneof [pos, neg]
pure (start, start + delta)
where
pos = getPositive <$> arbitrary
neg = getNegative <$> arbitrary
Of course internally both Postitive and Negative use suchThat so as Ashesh mentioned
notEqual :: Gen (Int, Int)
notEqual = genPair `suchThat` uncurry (/=)
where genPair = arbitrary
might be easier
The suchThat combinator that #Ashesh and #Carsten pointed out is definitely what I am looking for, to succinctly and idiomatically generate a non-equal pair.
An explanation for the infinite recursion (Thanks to #oisdk):
All QuckCheck runners (quickCheck, forAll etc.) pass a size parameter to test genarators. This has no defined semantics, but early tests use a small parameter, starting at 0*, and gradually growing. Generators use this to generate samples of different 'sizes,' whatever that may mean for a specific datatype.
arbitrary for integral types (called recursively by arbitrary for (Int, Int)), uses this for the magnitude of the generated value - generate an integral between 0 and size.
This means, unfortunately, that the first test attempted by quickCheck, (or in my case forAll,) uses the size 0, which can only generate (0, 0). This always fails the test of /=, causing the action to recurse infinitely, looking for more.
* I'm assuming this, as the behaviour of the size parameter doesn't seem to be documented anywhere.

How to make a custom Attoparsec parser combinator that returns a Vector instead of a list?

{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.Text
import Control.Applicative(many)
import Data.Word
parseManyNumbers :: Parser [Int] -- I'd like many to return a Vector instead
parseManyNumbers = many (decimal <* skipSpace)
main :: IO ()
main = print $ parseOnly parseManyNumbers "131 45 68 214"
The above is just an example, but I need to parse a large amount of primitive values in Haskell and need to use arrays instead of lists. This is something that possible in the F#'s Fparsec, so I've went as far as looking at Attoparsec's source, but I can't figure out a way to do it. In fact, I can't figure out where many from Control.Applicative is defined in the base Haskell library. I thought it would be there as that is where documentation on Hackage points to, but no such luck.
Also, I am having trouble deciding what data structure to use here as I can't find something as convenient as a resizable array in Haskell, but I would rather not use inefficient tree based structures.
An option to me would be to skip Attoparsec and implement an entire parser inside the ST monad, but I would rather avoid it except as a very last resort.
There is a growable vector implementation in Haskell, which is based on the great AMT algorithm: "persistent-vector". Unfortunately, the library isn't that much known in the community so far. However to give you a clue about the performance of the algorithm, I'll say that it is the algorithm that drives the standard vector implementations in Scala and Clojure.
I suggest you implement your parser around that data-structure under the influence of the list-specialized implementations. Here the functions are, btw:
-- | One or more.
some :: f a -> f [a]
some v = some_v
where
many_v = some_v <|> pure []
some_v = (fmap (:) v) <*> many_v
-- | Zero or more.
many :: f a -> f [a]
many v = many_v
where
many_v = some_v <|> pure []
some_v = (fmap (:) v) <*> many_v
Some ideas:
Data Structures
I think the most practical data structure to use for the list of Ints is something like [Vector Int]. If each component Vector is sufficiently long (i.e. has length 1k) you'll get good space economy. You'll have
to write your own "list operations" to traverse it, but you'll avoid re-copying data that you would have to perform to return the data in a single Vector Int.
Also consider using a Dequeue instead of a list.
Stateful Parsing
Unlike Parsec, Attoparsec does not provide for user state. However, you
might be able to make use of the runScanner function (link):
runScanner :: s -> (s -> Word8 -> Maybe s) -> Parser (ByteString, s)
(It also returns the parsed ByteString which in your case may be problematic since it will be very large. Perhaps you can write an alternate version which doesn't do this.)
Using unsafeFreeze and unsafeThaw you can incrementally fill in a Vector. Your s data structure might look
something like:
data MyState = MyState
{ inNumber :: Bool -- True if seen a digit
, val :: Int -- value of int being parsed
, vecs :: [ Vector Int ] -- past parsed vectors
, v :: Vector Int -- current vector we are filling
, vsize :: Int -- number of items filled in current vector
}
Maybe instead of a [Vector Int] you use a Dequeue (Vector Int).
I imagine, however, that this approach will be slow since your parsing function will get called for every single character.
Represent the list as a single token
Parsec can be used to parse a stream of tokens, so how about writing
your own tokenizer and letting Parsec create the AST.
The key idea is to represent these large sequences of Ints as a single token. This gives you a lot more latitude in how you parse them.
Defer Conversion
Instead of converting the numbers to Ints at parse time, just have parseManyNumbers return a ByteString and defer the conversion until
you actually need the values. This much enable you to avoid reifying
the values as an actual list.
Vectors are arrays, under the hood. The tricky thing about arrays is that they are fixed-length. You pre-allocate an array of a certain length, and the only way of extending it is to copy the elements into a larger array.
This makes linked lists simply better at representing variable-length sequences. (It's also why list implementations in imperative languages amortise the cost of copying by allocating arrays with extra space and copying only when the space runs out.) If you don't know in advance how many elements there are going to be, your best bet is to use a list (and perhaps copy the list into a Vector afterwards using fromList, if you need to). That's why many returns a list: it runs the parser as many times as it can with no prior knowledge of how many that'll be.
On the other hand, if you happen to know how many numbers you're parsing, then a Vector could be more efficient. Perhaps you know a priori that there are always n numbers, or perhaps the protocol specifies before the start of the sequence how many numbers there'll be. Then you can use replicateM to allocate and populate the vector efficiently.

linear congruent generator in haskell

This is a very simple linear-congruent pseudo-random number generator. It works fine when I seed it, but I want to make it so that it self-seeds with every produced number. Problem is that I don't know how to do that in Haskell where the notion of variables does not exist. I can feed the produced number recursively, but then my result would be a list of integers instead of a single number.
linCongGen :: Int -> Int
linCongGen seed = ((2*seed) + 3) `mod` 100
I'll summarize the comments a bit more meaningfully. The simplest solution is, like you observed, an infinite list of the sequence of generated elements. Then, every time you want to get a new number, pop off the head of that list.
linCongGen :: Integral a => a -> [a]
linCongGen = iterate $ \x -> ((2*x) + 3) `mod` 100
That said, here is a solution (which I do not agree with), but which does what I think you want. For mutable state, we usually use IORef, which is sort of like a reference or pointer. Here is the code. Please read the disclaimer afterwards though.
import Data.IORef
import System.IO.Unsafe
seed :: IORef Int
seed = unsafePerformIO $ newIORef 71
linCongGen :: IO Int
linCongGen = do previous <- readIORef seed
modifyIORef' seed $ \x -> ((2*x) + 3) `mod` 100
return previous
And here is a sample usage printing out the first hundred numbers generated: main = replicateM_ 100 $ getRandom >>= print (you'll need to have Control.Monad imported too for replicateM_).
DISCLAIMER
This is a bit of a hacky approach described here. As the link says "Maybe the need for global mutable state is a symptom of bad design." The link also has a good description of a more intelligent workaround. Making an IORef is an inherently IO operation, and we really shouldn't be using unsafePerformIO on it. If you find yourself fighting Haskell in this way, it's because Haskell was designed to get in your way when you are doing things you shouldn't.
That said, I find comfort in knowing that this approach is also the one using in System.Random (the standard random number module) to define the initial seed (check out the source).

Pseudorandom number generators in Haskell

I'm working on solutions to the latest Programming Praxis puzzles—the first on implementing the minimal standard random number generator and the second on implementing a shuffle box to go with either that one or a different pseudorandom number generator. Implementing the math is pretty straightforward. The tricky bit for me is figuring out how to put the pieces together properly.
Conceptually, a pseudorandom number generator is a function stepRandom :: s -> (s, a) where s is the type of the internal state of the generator and a is the type of randomly chosen object produced. For a linear congruential PRNG, we could have s = a = Int64, for example, or perhaps s = Int64 and a = Double. This post on PSE does a pretty good job of showing how to use a monad to thread the PRNG state through a random computation, and finish things off with runRandom to run a computation with a certain initial state (seed).
Conceptually, a shuffle box is a function shuffle :: box -> a -> (box, a) along with a function to initialize a new box of the desired size with values from a PRNG. In practice, however, the representation of this box is a bit trickier. For efficiency, it should be represented as a mutable array, which forces it into ST or IO. Something vaguely like this:
mkShuffle :: (Integral i, Ix i, MArray a e m) => i -> m e -> m (a i e)
mkShuffle size getRandom = do
thelist <- replicateM (fromInteger.fromIntegral $ size) getRandom
newListArray (0,size-1) thelist
shuffle :: (Integral b, Ix b, MArray a b m) => a b b -> b -> m b
shuffle box n = do
(start,end) <- getBounds box
let index = start + n `quot` (end-start+1)
value <- readArray box index
writeArray box index n
return value
What I really want to do, however, is attach an (initialized?) shuffle box to a PRNG, so as to "pipe" the output from the PRNG into the shuffle box. I don't understand how to set up that plumbing properly.
I'm assuming that the goal is to implement an algorithm as follows: we have a random generator of some sort which we can think of as somehow producing a stream of random values
import Pipes
prng :: Monad m => Producer Int m r
-- produces Ints using the effects of m never stops, thus the
-- return type r is polymorphic
We would like to modify this PRNG via a shuffle box. Shuffle boxes have a mutable state Box which is an array of random integers and they modify a stream of random integers in a particular way
shuffle :: Monad m => Box -> Pipe Int Int m r
-- given a box, convert a stream of integers into a different
-- stream of integers using the effects of m without stopping
-- (polymorphic r)
shuffle works on an integer-by-integer basis by indexing into its Box by the incoming random value modulo the size of the box, storing the incoming value there, and emitting the value which was previously stored there. In some sense it's like a stochastic delay function.
So with that spec let's get to a real implementation. We want to use a mutable array so we'll use the vector library and the ST monad. ST requires that we pass around a phantom s parameter that matches throughout a particular ST monad invocation, so when we write Box it'll need to expose that parameter.
import qualified Data.Vector.Mutable as Vm
import Control.Monad.ST
data Box s = Box { sz :: Int, vc :: Vm.STVector s Int }
The sz parameter is the size of the Box's memory and the Vm.STVector s is a mutable ST Vector linked to the s ST thread. We can immediately use this to build our shuffle algorithm, now knowing that the Monad m must actually be ST s.
import Control.Monad
shuffle :: Box s -> Pipe Int Int (ST s) r
shuffle box = forever $ do -- this pipe runs forever
up <- await -- wait for upstream
next <- lift $ do let index = up `rem` sz box -- perform the shuffle
prior <- Vm.read (vc box) index -- using our mutation
Vm.write (vc box) index up -- primitives in the ST
return prior -- monad
yield next -- then yield the result
Now we'd just like to be able to attach this shuffle to some prng Producer. Since we're using vector it's nice to use the high-performance mwc-random library.
import qualified System.Random.MWC as MWC
-- | Produce a uniformly distributed positive integer
uniformPos :: MWC.GenST s -> ST s Int
uniformPos gen = liftM abs (MWC.uniform gen)
prng :: MWC.GenST s -> Int -> ST s (Box s)
prng gen = forever $ do
val <- lift (uniformPos gen)
yield val
Notice that since we're passing the PRNG seed, MWC.GenST s, along in an ST s thread we don't need to catch modifications and thread them along as well. Instead, mwc-random uses a mutable STRef s behind the scenes. Also notice that we modify MWC.uniform to return positive indices only as this is required for our indexing scheme in shuffle.
We can also use mwc-random to generate our initial box.
mkBox :: MWC.GenST s -> Int -> ST s (Box s)
mkBox gen size = do
vec <- Vm.replicateM size (uniformPos gen)
return (Box size vec)
The only trick here is the very nice Vm.replicateM function which effectively has the constrained type
Vm.replicateM :: Int -> ST s Int -> Vm.STVector s Int
where the second argument is an ST s action which generates a new element of the vector.
Finally we have all the pieces. We just need to assemble them. Fortunately, the modularity we get from using pipes makes this trivial.
import qualified Pipes.Prelude as P
run10 :: MWC.GenST s -> ST s [Int]
run10 gen = do
box <- mkBox gen 1000
P.toListM (prng gen >-> shuffle box >-> P.take 10)
Here we use (>->) to build a production pipeline and P.toListM to run that pipeline and produce a list. Finally we just need to execute this ST s thread in IO which is also where we can create our initial MWC.GenST s seed and feed it to run10 using MWC.withSystemRandom which generates the initial seed from, as it says, SystemRandom.
main :: IO ()
main = do
result <- MWC.withSystemRandom run10
print result
And we have our pipeline.
*ShuffleBox> main
[743244324568658487,8970293000346490947,7840610233495392020,6500616573179099831,1849346693432591466,4270856297964802595,3520304355004706754,7475836204488259316,1099932102382049619,7752192194581108062]
Note that the actual operations of these pieces is not terrifically complex. Unfortunately, the types in ST, mwc-random, vector, and pipes are all each individually highly generalized and thus can be quite burdensome to comprehend at first. Hopefully the above, where I've deliberately weakened and specialized nearly every type to this exact problem, will be much easier to follow and provide a little bit of intuition for how each of these wonderful libraries works individually and together.

SIMPLE random number generation

I'm writing this after a good while of frustrating research, and I'm hoping someone here can enlighten me about the topic.
I want to generate a simple random number in a haskell function, but alas, this seems impossible to do without all sorts of non-trivial elements, such as Monads, asignation in "do", creating generators, etc.
Ideally I was looking for an equivalent of C's "rand()". But after much searching I'm pretty convinced there is no such thing, because of how the language is designed. (If there is, please someone enlighten me). As that doesn't seem feasible, I'd like to find a way to get a random number for my particular problem, and a general explanation on how it works to get a random number.
prefixGenerator :: (Ord a, Arbitrary a) => Gen ([a],[a])
prefixGenerator = frequency [
(1, return ([],[])),
(2, do {
xs1 <- orderedListEj13 ;
xs2 <- orderedListEj13 ;
return (xs1,xs2)
}),
(2, do {
xs2 <- orderedListEj13 ;
return ((take RANDOMNUMBERHERE xs2),xs2)
})
]
I'm trying to get to grips with QuickCheck but my inability to use random numbers is making it hard. I've tried something like this (by putting an drawInt 0 (length xs2) instead of RANDOMNUMBERHERE)but I get stuck with the fact that take requires an Int and that method leaves me with a IO Int, which seems impossible to transform to an Int according to this.
As haskell is a pure functional programming language, functions are referentially transparent which means essentially that only a function's arguments determine its result. If you were able to pull a random number out of the air, you can imagine how that would cause problems.
I suppose you need something like this:
prefixGenerator :: (Ord a, Arbitrary a) => Gen ([a],[a])
prefixGenerator = do
randn <- choose (1,999) -- number in range 1-999
frequency [
(1, return ([],[])),
(2, do {
xs1 <- orderedListEj13 ;
xs2 <- orderedListEj13 ;
return (xs1,xs2)
}),
(2, do {
xs2 <- orderedListEj13 ;
return ((take randn xs2),xs2)
})
]
In general in haskell you approach random number generation by either pulling some randomness from the IO monad, or by maintaining a PRNG that is initialized with some integer seed hard-coded, or pulled from IO (gspr's comment is excellent).
Reading about how pseudo random number generators work might help you understand System.Random, and this might help as well (scroll down to section on randomness).
You're right in that nondeterministic random (by which I mean "pseudo-random") number generation is impossible without trickery. Functions in Haskell are pure which means that the same input will always produce the same output.
The good news is that you don't seem to need a nondeterministic PRNG. In fact, it would be better if your QuickCheck test used the same sequence of "random" numbers each time, to make your tests fully reproducible.
You can do this with the mkStdGen function from System.Random. Adapted from the Haskell wiki:
import System.Random
import Data.List
randomInts :: Int -> [Int]
randomInts n = take n $ unfoldr (Just . random) (mkStdGen 4)
Here, 4 is the seed that you may want to choose by a fair dice roll.
The standard library provides a monad for random-number generation. The monadic stuff is not that hard to learn, but if you want to avoid it, find a pseudo-random function next that takes an Int to an Int in a pseudorandom way, and then just create and pass an infinite list of random numbers:
next :: Int -> Int
randoms :: [Int]
randoms = iterate next 73
You can then pass this list of random numbers wherever you need it.
Here's a linear congruential next from Wikipedia:
next n = (1103515245 * n + 12345) `mod` 1073741824
And here are the first 20 pseudorandom numbers following 73:
Prelude> take 20 $ iterate next 73
[73,25988430,339353199,182384508,910120965,1051209818,737424011,14815080,325218177,1034483750,267480167,394050068,4555453,647786674,916350979,980771712,491556281,584902142,110461279,160249772]

Resources