Haskell: RandomGen drops half of values - haskell

I am writing a simple deterministic random number generator based on xorshift. The goal is not a cryptographically secure or statistically perfect (pseudo-)random number generator, but one that produces the same deterministic sequence of semi-random numbers across programming languages.
My Haskell program looks as follows:
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
module SimpleRNG where
import Data.Word (Word32)
import Data.Bits (xor, shift)
import System.Random (RandomGen(..))
import Control.Arrow
(|>) :: a -> (a -> b) -> b
(|>) x f = f x
infixl 0 |>
newtype SeedState = SeedState Word32
  deriving (Eq, Show, Enum, Bounded)
seed :: Integral a => a -> SeedState
seed = SeedState . fromIntegral
rand_r :: SeedState -> (Word32, SeedState)
rand_r (SeedState num) = (res, SeedState res)
  where
    res = num
      |> xorshift 13
      |> xorshift (-17)
      |> xorshift 5
    xorshift :: Int -> Word32 -> Word32
    xorshift amount x = x `xor` (shift x amount)
instance RandomGen SeedState where
  next seed_state = (first fromIntegral) $ rand_r seed_state
  genRange seed_state = (fromEnum (minBound `asTypeOf` seed_state),
                         fromEnum (maxBound `asTypeOf` seed_state))
  split seed_state@(SeedState num) = (seed_state', inverted_seed_state')
    where
      (_, seed_state') = next seed_state
      (_, inverted_seed_state') = next inverted_seed_state
      inverted_seed_state = SeedState (maxBound - num)
Now, for some reason, when running
take 10 $ System.Random.randoms (seed 42) :: [Word32]
it returns only the 'odd' results, compared to the output of the following Python program:
class SeedState(object):
    def __init__(self, seed = 42):
        self.data = seed
def rand_r(rng_state):
    num = rng_state.data
    num ^= (num << 13) % (2 ** 32)
    num ^= (num >> 17) % (2 ** 32)
    num ^= (num << 5) % (2 ** 32)
    rng_state.data = num
    return num
__global_rng_state = SeedState(42)
def rand():
    global __global_rng_state
    return rand_r(__global_rng_state)
def seed(seed):
    global __global_rng_state
    __global_rng_state = SeedState(seed)
if __name__ == '__main__':
    for x in range(0, 10):
        print(rand())
It seems like the internals of the System.Random module do some weird trickery with the return result of the generator
(calling
map fst $ take 10 $ iterate (\(_, rng) -> rand_r rng) (rand_r $ seed 42)
gives the result I'd expect).
This is odd, since the type returned by the generator is already a Word32, so it could/should just be passed on unaltered without any remapping happening.
What is going on here, and is there a way to plug this xorshift-generator into System.Random in a way that returns the same results?

This is to do with the behaviour of System.Random.randoms, which repeatedly applies random to a RandomGen, not next.
class Random a where
  ...
  random :: (RandomGen g) => g -> (a, g)
The Random class is what allows you to reuse RandomGen instances across different enums, and the instance for Word32 (as well as nearly all other types) is defined as
instance Random Word32 where randomR = randomIvalIntegral; random = randomBounded
randomBounded just calls randomR, so the behaviour of random is decided by randomIvalIntegral:
randomIvalIntegral (l,h) = randomIvalInteger (toInteger l, toInteger h)
randomIvalInteger is an interesting function, and its source is worth reading. It is what causes your problem: the function discards a certain number of intermediate values, depending on the range of the generator and the range being generated over.
To get the values you want, you just need to use next instead - the easiest way would just be to define
randoms' g = x : (randoms' g') where (x, g') = next g
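To see why exactly every second value survives, here is a simplified model of that loop that needs no packages beyond base. It is a sketch, not the library's code: the names xorshiftStep, rawStream, drawsPerValue and modelRandoms are made up for illustration, and the safety factor of 1000 is an assumption based on my reading of the random-1.1 source. The idea is that raw draws are accumulated as base-b digits until the accumulated magnitude is comfortably larger than the target range, and the result is then reduced modulo the target range.

```haskell
import Data.Bits (shiftL, shiftR, xor)
import Data.Word (Word32)

-- One xorshift step, mirroring rand_r from the question.
xorshiftStep :: Word32 -> Word32
xorshiftStep x0 = x3
  where
    x1 = x0 `xor` (x0 `shiftL` 13)
    x2 = x1 `xor` (x1 `shiftR` 17)
    x3 = x2 `xor` (x2 `shiftL` 5)

-- The raw stream, i.e. what repeated calls to next would yield.
rawStream :: Word32 -> [Word32]
rawStream = drop 1 . iterate xorshiftStep

-- How many raw draws get combined into one result, for a generator range
-- of size b and a target range of size k.
-- ASSUMPTION: the factor 1000 is how I remember the random-1.1 source.
drawsPerValue :: Integer -> Integer -> Int
drawsPerValue b k = length (takeWhile (< k * 1000) (iterate (* b) 1))

-- Simplified model of randoms at type Word32 on top of this generator:
-- combine the draws as base-b digits, then reduce modulo the target range.
modelRandoms :: Word32 -> [Word32]
modelRandoms = go . rawStream
  where
    b = 2 ^ (32 :: Int) :: Integer
    n = drawsPerValue b b
    go xs =
      let (chunk, rest) = splitAt n xs
          v = foldl (\acc x -> acc * b + toInteger x) 0 chunk
       in fromInteger (v `mod` b) : go rest
```

With both ranges of size 2^32, two draws are needed per result, and reducing modulo 2^32 keeps only the second draw. Under these assumptions, that is the "only the odd results" pattern from the question.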

Related

Mutable list of mutable non-integral types in Haskell

I'm trying to parse a huge 3d-data array of complex values from binary. Later this should become l matrices (n x m). Since I'm going to work on these matrices, I'm limited to matrix libraries - hmatrix seems to be promising.
The data layout is not in my required format, so I have to jump around in positions (i,j,k) -> (k,i,j), where i and j are elements of n and m and k is an element of l.
I think the only way to read this in is by using mutables, otherwise I'll end up with several terabytes of garbage. My idea was to use boxed mutable arrays or vectors of mutable matrices (STMatrix from Numeric.LinearAlgebra.Devel), so I end up with something like:
data MVector s (STMatrix s t)
But I'm not sure how to use them correctly:
I can modify one single element of the MVector with modify:
modify :: PrimMonad m => MVector (PrimState m) a -> (a -> a) -> Int -> m ()
or use modifyM (Strange: in stack vector-0.12.3.0 does not have modifyM...)
modifyM :: PrimMonad m => MVector (PrimState m) a -> (a -> m a) -> Int -> m ()
so I could use the function call (a -> a) to a runST-routine to modify the STMatrix. I'm not sure if I should put an ST in an IO (?)
Nevertheless, I think this should work, but it is only useful when I want to modify the whole matrix; calling this (a -> a)-routine n x m x l times will add some overhead (maybe it will be optimized out...).
So I'll end up marshalling the array, modifying the content via pointers (i,j,k) -> (k,i,j) and reading everything matrix by matrix, but this does not feel right and I wanted to avoid such dirty tricks.
Do you have any ideas for a way to do this a little bit more ...clean?
Ty
Edit:
Thx to K. A. Buhr. His solution works so far. Now, I'm only running into some performance impacts. If I compare the solution:
{-# LANGUAGE BangPatterns #-}
module Main where
import Data.List
import Numeric.LinearAlgebra
import qualified Data.Vector as V
import qualified Data.Vector.Storable as VS
import qualified Data.Vector.Storable.Mutable as VSM
-- Create an l-length list of n x m hmatrix Matrices
toMatrices :: Int -> Int -> Int -> [C] -> [Matrix C]
toMatrices l n m dats = map (reshape m) $ VS.createT $ do
  mats <- V.replicateM l $ VSM.unsafeNew (m*n)
  sequence_ $ zipWith (\(i,j,k) x ->
      VSM.unsafeWrite (mats V.! k) (loc i j) x) idxs (dats ++ repeat 0)
  return $ V.toList mats
  where idxs = (,,) <$> [0..n-1] <*> [0..m-1] <*> [0..l-1]
        loc i j = i*m + j
test1 = toMatrices 1000 1000 100 (fromIntegral <$> [1..])
main = do
  let !a = test1
  print "done"
With the simplest C code:
#include <stdlib.h>
#include <stdio.h>
int main(void)
{
    const int n = 1000;
    const int m = 1000;
    const int l = 100;
    double *src = malloc(n*m*l * sizeof(double));
    for (int i = 0; i < n*m*l; i++) {
        src[i] = (double)i;
    }
    double *dest = malloc(n*m*l * sizeof(double));
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < m; j++) {
            for (int k = 0; k < l; k++) {
                dest[k*n*m+i*m+j] = src[i*m*l+j*l+k];
            }
        }
    }
    printf("done: %f\n", dest[n*m*l - 1]); // Need to access the array, otherwise it'll get lost by -O2
    free(src);
    free(dest);
    return 0;
}
Both compiled with -O2 give the following timings:
real 0m5,611s
user 0m14,845s
sys 0m2,759s
vs.
real 0m0,441s
user 0m0,200s
sys 0m0,240s
That is roughly two orders of magnitude in per-core performance. From profiling I learn that
VSM.unsafeWrite (mats V.! k) (loc i j) x
is the expensive function.
Since I'll run this procedure at roughly one-minute intervals, I want to keep the parsing time as low as the disk access time. I'll see if I can speed this up.
PS: This is for some tests, if I could move usual DSP from C-like to Haskell
Edit2 :
Ok, this is what I get after some trying:
{-# LANGUAGE BangPatterns #-}
module Main where
import Data.List
import qualified Data.Vector as V
import qualified Data.Vector.Storable as VS
import qualified Data.Vector.Storable.Mutable as VSM
import Numeric.LinearAlgebra
-- Create an l-length list of n x m hmatrix Matrices
toMatrices :: Int -> Int -> Int -> VS.Vector C -> V.Vector (Matrix C)
toMatrices l n m dats =
  V.map (reshape m) newMat
  where
    newMat = VS.createT $
      V.generateM l $ \k -> do
        curMat <- VSM.unsafeNew (m * n)
        VS.mapM_
          (\i ->
             VS.mapM_
               (\j -> VSM.unsafeWrite curMat (loc i j) (dats VS.! (oldLoc i j k)))
               idjs)
          idis
        return curMat
    loc i j = i * m + j
    oldLoc i j k = i * m * l + j * l + k
    !idis = VS.generate n (\a->a)
    !idjs = VS.generate m (\a->a)
test1 = toMatrices 100 1000 1000 arr
  where
    arr = VS.generate (1000 * 1000 * 100) fromIntegral :: VS.Vector C
main = do
  let !a = test1
  print "done"
It gives something about:
real 0m1,816s
user 0m1,636s
sys 0m1,120s
, so ~4 times slower than C code. I think I can live with this.
I guess, I'm destroying all the stream-functionality of the vector with this code. If there are any suggestions to have them back by a comparable speed, I would be grateful!
As I understand it, you have a "huge" set of data in i-major, j-middling, k-minor order, and you want to load it into matrices indexed by k whose elements have i-indexed rows and j-indexed columns, right? So, you want a function something like:
import Numeric.LinearAlgebra
-- load into "l" matrices of size "n x m"
toMatrices :: Int -> Int -> Int -> [C] -> [Matrix C]
toMatrices l n m dats = ...
Note that you've written n x m matrices above, associating i with n and j with m. It would be more usual to flip the roles of n and m, but I've stuck with your notation, so keep an eye on that.
If the entire data list [C] could fit comfortably in memory, you could do this immutably by writing something like:
import Data.List
import Data.List.Split
import Numeric.LinearAlgebra
toMatrices :: Int -> Int -> Int -> [C] -> [Matrix C]
toMatrices l n m = map (reshape m . fromList) . transpose . chunksOf l
This breaks the input data into l-sized chunks, transposes them into l lists, and converts each list to a matrix. If there was some way to force all the Matrix C values in parallel, this could be done with one traversal through the data, without the need to hold on to the whole list. Unfortunately, the individual Matrix C values can only be forced one-by-one, and the whole list needs to be kept around until all of them can be forced.
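As a small illustration of the chunk-and-transpose step on plain lists (leaving out hmatrix and reshape), consider the sketch below. chunksOf' is a hand-rolled stand-in for Data.List.Split.chunksOf so that the example only needs base, and deinterleave is a hypothetical name used just for this example.

```haskell
import Data.List (transpose)

-- Hand-rolled stand-in for Data.List.Split.chunksOf (assumes n > 0).
chunksOf' :: Int -> [a] -> [[a]]
chunksOf' _ [] = []
chunksOf' n xs = let (h, t) = splitAt n xs in h : chunksOf' n t

-- Split a k-minor stream into l lists; list number k collects the
-- elements at positions k, k + l, k + 2l, ... of the input.
deinterleave :: Int -> [a] -> [[a]]
deinterleave l = transpose . chunksOf' l
```

For example, deinterleave 4 [1..8] yields [[1,5],[2,6],[3,7],[4,8]]; each inner list would then be passed through fromList and reshape m to get one matrix.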
So, if the "huge" [C] list is too big for memory, you're probably right that you need to load the data into a (partially) mutable structure. The code is somewhat challenging to write, but it's not too bad in its final form. I believe the following will work:
import Data.List
import Numeric.LinearAlgebra
import qualified Data.Vector as V
import qualified Data.Vector.Storable as VS
import qualified Data.Vector.Storable.Mutable as VSM
-- Create an l-length list of n x m hmatrix Matrices
toMatrices :: Int -> Int -> Int -> [C] -> [Matrix C]
toMatrices l n m dats = map (reshape m) $ VS.createT $ do
  mats <- V.replicateM l $ VSM.unsafeNew (m*n)
  sequence_ $ zipWith (\(i,j,k) x ->
      VSM.unsafeWrite (mats V.! k) (loc i j) x) idxs (dats ++ repeat 0)
  return $ V.toList mats
  where idxs = (,,) <$> [0..n-1] <*> [0..m-1] <*> [0..l-1]
        loc i j = i*m + j
test1 = toMatrices 4 3 2 (fromIntegral <$> [1..24])
test2 = toMatrices 1000 1000 100 (fromIntegral <$> [1..])
main = do
  print $ test1
  print $ norm_Inf . foldl1' (+) $ test2
Compiled with -O2, the maximum residency is about 1.6 GB, which matches the expected memory needed to hold 100 matrices of one million 16-byte complex values in memory, so that looks right.
Anyway, this version of toMatrices is made somewhat complicated by the use of three different vector variants. There's Vector from hmatrix, which is the same as the immutable storable VS.Vector from vector; and then there are two more types from vector: the immutable boxed V.Vector, and the mutable storable VSM.Vector.
The do-block creates a V.Vector of VSM.Vectors and populates those with a sequence of monadic actions performed across index/value pairs. You can load the data in any order by modifying the definition of idxs to match the order of the data stream. The do-block returns the final VSM.Vectors in a list, the helper function VS.createT freezes them all to VS.Vectors (i.e., Vector from hmatrix), and reshape is mapped across the vectors to turn them into m-column matrices.
Note that you'll have to take care that in your actual application, the list of data items read from the file isn't kept around by code other than toMatrices, either in the original text form or the parsed numeric form. This shouldn't be too tough to get right, but you might want to test on medium-sized test input before locking up your computer on the real dataset.
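For readers who want to try the scatter-write idea without installing vector or hmatrix, here is a minimal sketch using only the array package that ships with GHC. It is not the code above: it writes into one flat unboxed array of Doubles instead of l separate storable vectors of complex values, and the names scatter and matrixAt are made up for this example. The destination index k*n*m + i*m + j is the same one the C program in the question uses.

```haskell
import Control.Monad (forM_)
import Data.Array.ST (newArray, runSTUArray, writeArray)
import Data.Array.Unboxed (UArray, (!))

-- Scatter a stream in (i, j, k) order (k varying fastest) into a flat
-- array laid out as l consecutive row-major n x m matrices.
scatter :: Int -> Int -> Int -> [Double] -> UArray Int Double
scatter l n m dats = runSTUArray $ do
  dest <- newArray (0, l * n * m - 1) 0
  forM_ (zip idxs dats) $ \((i, j, k), x) ->
    writeArray dest (k * n * m + i * m + j) x
  return dest
  where
    idxs = [(i, j, k) | i <- [0 .. n - 1], j <- [0 .. m - 1], k <- [0 .. l - 1]]

-- The k-th n x m matrix as a row-major list of its elements.
matrixAt :: Int -> Int -> Int -> UArray Int Double -> [Double]
matrixAt n m k arr = [arr ! (k * n * m + p) | p <- [0 .. n * m - 1]]
```

With l = n = m = 2 and input [1..8], matrix 0 comes out as [1,3,5,7] and matrix 1 as [2,4,6,8]. The mutable array only exists inside runSTUArray, so no intermediate copies are left behind for the garbage collector.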

How can I detect zero denominator when reading Ratios?

I want to read a Ratio from a String, but I don't want my program to crash when the denominator is zero. How can I detect a zero denominator and avoid an error? Just using readMaybe doesn't work:
Prelude Text.Read> readMaybe "1 % 0" :: Maybe Rational
Just *** Exception: Ratio has zero denominator
I created this far from perfect solution:
readMaybeRational :: String -> Maybe Rational
readMaybeRational s =
  case ((readMaybe $ drop 1 $ dropWhile (/='%') s) :: Maybe Int)
    of Just 0 -> Nothing
       _ -> readMaybe s
But I don't know how to handle a nested Ratio nicely:
"Just (1 % 0)"
If I could override Ratio's Read instance, I could get readMaybe to return Nothing when the denominator is zero:
instance (Integral a, Read a) => Read (Ratio a) where
  readPrec =
    parens
    ( prec ratioPrec
      ( do x <- step readPrec
           expectP (L.Symbol "%")
           y <- step readPrec
           -- is y 0? If so, do something here
           return (x % y)
      )
    )
But I'm pretty sure I can't do that.
I think your best solution is a newtype wrapper around Ratio, like this:
import Control.Monad
import GHC.Read
import GHC.Real
import qualified Text.Read.Lex as L
import Text.ParserCombinators.ReadPrec
newtype SaneReadRatio a = SaneReadRatio (Ratio a)
type SaneReadRational = SaneReadRatio Integer
instance (Integral a, Read a) => Read (SaneReadRatio a) where
  readPrec =
    parens
    ( prec ratioPrec
      ( do x <- step readPrec
           expectP (L.Symbol "%")
           y <- step readPrec
           guard (y /= 0)
           return (SaneReadRatio (x % y))
      )
    )
  readListPrec = readListPrecDefault
  readList = readListDefault
Use it by reading in your data with SaneReadRational in place of Rational, then using coerce from Data.Coerce on the result, which will change it back to the underlying Rational no matter how deeply it's buried inside your type.
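A minimal sketch of that usage, repeating the wrapper so the example stands alone. The helper names readRationalSafe and readNestedSafe are hypothetical, and the import list is one arrangement that I believe works with base (Text.Read re-exports prec and step; expectP and ratioPrec come from GHC.Read and GHC.Real).

```haskell
import Control.Monad (guard)
import Data.Coerce (coerce)
import Data.Ratio (Ratio, (%))
import GHC.Read (expectP)
import GHC.Real (ratioPrec)
import Text.Read (Lexeme (Symbol), Read (..), parens, prec, readListPrecDefault, readMaybe, step)

newtype SaneReadRatio a = SaneReadRatio (Ratio a)

type SaneReadRational = SaneReadRatio Integer

instance (Integral a, Read a) => Read (SaneReadRatio a) where
  readPrec =
    parens $ prec ratioPrec $ do
      x <- step readPrec
      expectP (Symbol "%")
      y <- step readPrec
      guard (y /= 0) -- a zero denominator becomes a parse failure
      return (SaneReadRatio (x % y))
  readListPrec = readListPrecDefault

-- Read via the wrapper, then coerce back to a plain Rational.
readRationalSafe :: String -> Maybe Rational
readRationalSafe = coerce . (readMaybe :: String -> Maybe SaneReadRational)

-- The same trick one level down: the wrapper sits inside a Maybe.
readNestedSafe :: String -> Maybe (Maybe Rational)
readNestedSafe = coerce . (readMaybe :: String -> Maybe (Maybe SaneReadRational))
```

Here readRationalSafe "1 % 0" and readNestedSafe "Just (1 % 0)" should both give Nothing instead of throwing, while well-formed inputs such as "3 % 4" read as usual.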

How do I memoize?

I have written this function that computes Collatz sequences, and I see wildly varying times of execution depending on the spin I give it. Apparently it is related to something called "memoization", but I have a hard time understanding what it is and how it works, and, unfortunately, the relevant article on HaskellWiki, as well as the papers it links to, have all proven to not be easily surmountable. They discuss intricate details of the relative performance of highly layman-indifferentiable tree constructions, while what I miss must be some very basic, very trivial point that these sources neglect to mention.
This is the code. It is a complete program, ready to be built and executed.
module Main where
import Data.Function
import Data.List (maximumBy)
size :: (Integral a) => a
size = 10 ^ 6
-- Nail the basics.
collatz :: Integral a => a -> a
collatz n | even n = n `div` 2
          | otherwise = n * 3 + 1
recollatz :: Integral a => a -> a
recollatz = fix $ \f x -> if (x /= 1)
                          then f (collatz x)
                          else x
-- Now, I want to do the counting with a tuple monad.
mocollatz :: Integral b => b -> ([b], b)
mocollatz n = ([n], collatz n)
remocollatz :: Integral a => a -> ([a], a)
remocollatz = fix $ \f x -> if x /= 1
                            then f =<< mocollatz x
                            else return x
-- Trivialities.
collatzLength :: Integral a => a -> Int
collatzLength x = (length . fst $ (remocollatz x)) + 1
collatzPairs :: Integral a => a -> [(a, Int)]
collatzPairs n = zip [1..n] (collatzLength <$> [1..n])
longestCollatz :: Integral a => a -> (a, Int)
longestCollatz n = maximumBy order $ collatzPairs n
  where
    order :: Ord b => (a, b) -> (a, b) -> Ordering
    order x y = snd x `compare` snd y
main :: IO ()
main = print $ longestCollatz size
With ghc -O2 it takes about 17 seconds, without ghc -O2 -- about 22 seconds to deliver the length and the seed of the longest Collatz sequence starting at any point below size.
Now, if I make these changes:
diff --git a/Main.hs b/Main.hs
index c78ad95..9607fe0 100644
--- a/Main.hs
+++ b/Main.hs
@@ -1,6 +1,7 @@
 module Main where
 import Data.Function
+import qualified Data.Map.Lazy as M
 import Data.List (maximumBy)
 size :: (Integral a) => a
@@ -22,10 +23,15 @@ recollatz = fix $ \f x -> if (x /= 1)
 mocollatz :: Integral b => b -> ([b], b)
 mocollatz n = ([n], collatz n)
-remocollatz :: Integral a => a -> ([a], a)
-remocollatz = fix $ \f x -> if x /= 1
-                            then f =<< mocollatz x
-                            else return x
+remocollatz :: (Num a, Integral b) => b -> ([b], a)
+remocollatz 1 = return 1
+remocollatz x = case M.lookup x (table mutate) of
+    Nothing -> mutate x
+    Just y  -> y
+  where mutate x = remocollatz =<< mocollatz x
+
+table :: (Ord a, Integral a) => (a -> b) -> M.Map a b
+table f = M.fromList [ (x, f x) | x <- [1..size] ]
 -- Trivialities.
Then it will take just about 4 seconds with ghc -O2, but I would not live long enough to see it complete without ghc -O2.
Looking at the details of cost centres with ghc -prof -fprof-auto -O2 reveals that the first version enters collatz about a hundred million times, while the patched one -- just about one and a half million times. This must be the reason of the speedup, but I have a hard time understanding the inner workings of this magic. My best idea is that we replace a portion of expensive recursive calls with O(log n) map lookups, but I don't know if it's true and why it depends so much on some godforsaken compiler flags, while, as I see it, such performance swings should all follow solely from the language.
Can I haz an explanation of what happens here, and why the performance differs so vastly between ghc -O2 and plain ghc builds?
P.S. There are two requirements to the achieving of automagical memoization highlighted elsewhere on Stack Overflow:
Make a function to be memoized a top-level name.
Make a function to be memoized a monomorphic one.
In line with these requirements, I rebuilt remocollatz as follows:
remocollatz :: Int -> ([Int], Int)
remocollatz 1 = return 1
remocollatz x = mutate x
mutate :: Int -> ([Int], Int)
mutate x = remocollatz =<< mocollatz x
Now it's as top level and as monomorphic as it gets. Running time is about 11 seconds, versus the similarly monomorphized table version:
remocollatz :: Int -> ([Int], Int)
remocollatz 1 = return 1
remocollatz x = case M.lookup x (table mutate) of
    Nothing -> mutate x
    Just y  -> y
mutate :: Int -> ([Int], Int)
mutate = \x -> remocollatz =<< mocollatz x
table :: (Int -> ([Int], Int)) -> M.Map Int ([Int], Int)
table f = M.fromList [ (x, f x) | x <- [1..size] ]
-- Running in less than 4 seconds.
I wonder why the memoization ghc is supposedly performing in the first case here is almost 3 times slower than my dumb table.
Can I haz an explanation of what happens here, and why the performance differs so vastly between ghc -O2 and plain ghc builds?
Disclaimer: this is a guess, not verified by viewing GHC core output. A careful answer would do so to verify the conjectures outlined below. You can try peering through it yourself: add -ddump-simpl to your compilation line and you will get copious output detailing exactly what GHC has done to your code.
You write:
remocollatz x = {- ... -} table mutate {- ... -}
  where mutate x = remocollatz =<< mocollatz x
The expression table mutate in fact does not depend on x; but it appears on the right-hand side of an equation that takes x as an argument. Consequently, without optimizations, this table is recomputed each time remocollatz is called (presumably even from inside the computation of table mutate).
With optimizations, GHC notices that table mutate does not depend on x, and floats it to its own definition, effectively producing:
fresh_variable_name = table mutate
  where mutate x = remocollatz =<< mocollatz x
remocollatz x = case M.lookup x fresh_variable_name of
  {- ... -}
The table is therefore computed just once for the entire program run.
don't know why it [the performance] depends so much on some godforsaken compiler flags, while, as I see it, such performance swings should all follow solely from the language.
Sorry, but Haskell doesn't work that way. The language definition tells clearly what the meaning of a given Haskell term is, but does not say anything about the runtime or memory performance needed to compute that meaning.
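If you would rather not depend on the optimizer floating table mutate out, you can write the sharing down yourself by making the table a top-level constant. The following is a sketch rather than a drop-in patch for the question's program: it computes lengths directly instead of using the tuple monad, all names (tableSize, collatzStep, lengthTable, computeLength, collatzLength) are made up for the example, and the table size is reduced to 10^5 to keep it quick to try.

```haskell
import qualified Data.Map.Lazy as M

-- Hypothetical table size; the question uses 10 ^ 6.
tableSize :: Int
tableSize = 10 ^ 5

collatzStep :: Int -> Int
collatzStep n
  | even n = n `div` 2
  | otherwise = 3 * n + 1

-- A top-level constant: it does not sit under any lambda, so GHC builds it
-- at most once per run, with or without -O2. The values are lazy thunks
-- that are only forced the first time they are looked up.
lengthTable :: M.Map Int Int
lengthTable = M.fromList [(n, computeLength n) | n <- [1 .. tableSize]]

-- Length of the sequence starting at n, without looking up n itself.
computeLength :: Int -> Int
computeLength 1 = 1
computeLength n = 1 + collatzLength (collatzStep n)

-- Memoized entry point: use the table, fall back to direct computation
-- for values that lie outside it.
collatzLength :: Int -> Int
collatzLength n = M.findWithDefault (computeLength n) n lengthTable
```

For instance, collatzLength 13 is 10, matching the sequence 13, 40, 20, 10, 5, 16, 8, 4, 2, 1. Because lengthTable is defined outside of collatzLength, whether it is shared no longer depends on an optimization pass.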
Another approach to memoization that works in some situations, like this one, is to use a boxed vector, whose elements are computed lazily. The function used to initialize each element can use other elements of the vector in its calculation. As long as the evaluation of an element of the vector doesn't loop and refer to itself, just the elements it recursively depends on will be evaluated. Once evaluated, an element is effectively memoized, and this has the further benefit that elements of the vector that are never referenced are never evaluated.
The Collatz sequence is a nearly ideal application for this technique, but there is one complication. The next Collatz value(s) in sequence from a value under the limit may be outside the limit, which would cause a range error when indexing the vector. I solved this by just iterating through the sequence until back under the limit and counting the steps to do so.
The following program takes 0.77 seconds to run unoptimized and 0.30 when optimized:
import qualified Data.Vector as V
limit = 10 ^ 6 :: Int
-- The Collatz function, which given a value returns the next in the sequence.
nextCollatz val
  | odd val = 3 * val + 1
  | otherwise = val `div` 2
-- Given a value, return the next Collatz value in the sequence that is less
-- than the limit and the number of steps to get there. For example, the
-- sequence starting at 13 is: [13, 40, 20, 10, 5, 16, 8, 4, 2, 1], so if
-- limit is 100, then (nextCollatzWithinLimit 13) is (40, 1), but if limit is
-- 15, then (nextCollatzWithinLimit 13) is (10, 3).
nextCollatzWithinLimit val = (firstInRange, stepsToFirstInRange)
  where
    firstInRange = head rest
    stepsToFirstInRange = 1 + (length biggerThanLimit)
    (biggerThanLimit, rest) = span (>= limit) (tail collatzSeqStartingWithVal)
    collatzSeqStartingWithVal = iterate nextCollatz val
-- A boxed vector holding Collatz length for each index. The collatzFn used
-- to generate the value for each element refers back to other elements of
-- this vector, but since the vector elements are only evaluated as needed and
-- there aren't any loops in the Collatz sequences, the values are calculated
-- only as needed.
collatzVec :: V.Vector Int
collatzVec = V.generate limit collatzFn
  where
    collatzFn :: Int -> Int
    collatzFn index
      | index <= 1 = 1
      | otherwise = (collatzVec V.! nextWithinLimit) + stepsToGetThere
      where
        (nextWithinLimit, stepsToGetThere) = nextCollatzWithinLimit index
main :: IO ()
main = do
  -- Use a fold through the vector to find the longest Collatz sequence under
  -- the limit, and keep track of both the maximum length and the initial
  -- value of the sequence, which is the index.
  let (maxLength, maxIndex) = V.ifoldl' accMaxLen (0, 0) collatzVec
      accMaxLen acc@(accMaxLen, accMaxIndex) index currLen
        | currLen <= accMaxLen = acc
        | otherwise = (currLen, index)
  putStrLn $ "Max Collatz length below " ++ show limit ++ " is "
           ++ show maxLength ++ " at index " ++ show maxIndex

Using a custom generator vs Arbitrary instance in QuickCheck

Here's a simple function. It takes an input Int and returns a (possibly empty) list of (Int, Int) pairs, where the input Int is the sum of the cubed elements of any of the pairs.
cubeDecomposition :: Int -> [(Int, Int)]
cubeDecomposition n = [(x, y) | x <- [1..m], y <- [x..m], x^3 + y^3 == n]
  where m = truncate $ fromIntegral n ** (1/3)
-- cubeDecomposition 1729
-- [(1,12),(9,10)]
I want to test the property that the above is true; if I cube each element and sum any of the return tuples, then I get my input back:
import Control.Arrow
cubedElementsSumToN :: Int -> Bool
cubedElementsSumToN n = all (== n) d
  where d = map (uncurry (+) . ((^3) *** (^3))) (cubeDecomposition n)
For runtime considerations, I'd like to limit the input Ints to a certain size when testing this with QuickCheck. I can define an appropriate type and Arbitrary instance:
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
import Test.QuickCheck
newtype SmallInt = SmallInt Int
  deriving (Show, Eq, Enum, Ord, Num, Real, Integral)
instance Arbitrary SmallInt where
  arbitrary = fmap SmallInt (choose (-10000000, 10000000))
And then I guess I have to define versions of the function and property that use SmallInt rather than Int:
cubeDecompositionQC :: SmallInt -> [(SmallInt, SmallInt)]
cubeDecompositionQC n = [(x, y) | x <- [1..m], y <- [x..m], x^3 + y^3 == n]
  where m = truncate $ fromIntegral n ** (1/3)
cubedElementsSumToN' :: SmallInt -> Bool
cubedElementsSumToN' n = all (== n) d
  where d = map (uncurry (+) . ((^3) *** (^3))) (cubeDecompositionQC n)
-- cubeDecompositionQC 1729
-- [(SmallInt 1,SmallInt 12),(SmallInt 9,SmallInt 10)]
This works fine, and the standard 100 tests pass as expected. But it seems unnecessary to define a new type, instance, and function when all I really need is a custom generator. So I tried this:
smallInts :: Gen Int
smallInts = choose (-10000000, 10000000)
cubedElementsSumToN'' :: Int -> Property
cubedElementsSumToN'' n = forAll smallInts $ \m -> all (== n) (d m)
  where d = map (uncurry (+) . ((^3) *** (^3)))
          . cubeDecomposition
Now, the first few times I ran this, everything worked fine, and all tests pass. But on subsequent runs I observed failures. Bumping up the test size reliably finds one:
*** Failed! Falsifiable (after 674 tests and 1 shrink):
0
8205379
I'm a bit confused here due to the presence of two shrunken inputs - 0 and 8205379 - returned from QuickCheck, where I would intuitively expect one. Also, those inputs work as predicted (on my show-able property, at least):
*Main> cubedElementsSumToN 0
True
*Main> cubedElementsSumToN 8205379
True
So there is evidently a problem in the property that uses the custom Gen I defined.
What have I done wrong?
I quickly realized that the property as I've written it is obviously incorrect. Here's the proper way to do it, using the original cubedElementsSumToN property:
quickCheck (forAll smallInts cubedElementsSumToN)
which reads quite naturally.
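To spell out the mistake without QuickCheck: the broken property takes n as an ordinary argument (filled in by the default Int generator) and separately draws m from smallInts, then compares the cube sums of m against n. That is why two values were reported. A plain-Haskell sketch of both checks follows; cubeSums, brokenCheck and fixedCheck are hypothetical names introduced only for this illustration.

```haskell
cubeDecomposition :: Int -> [(Int, Int)]
cubeDecomposition n = [(x, y) | x <- [1 .. m], y <- [x .. m], x ^ 3 + y ^ 3 == n]
  where
    m = truncate $ fromIntegral n ** (1 / 3)

-- Sum of cubes for every pair returned for the given input.
cubeSums :: Int -> [Int]
cubeSums = map (\(x, y) -> x ^ 3 + y ^ 3) . cubeDecomposition

-- What the broken property effectively tested: the decomposition of m
-- compared against an unrelated n.
brokenCheck :: Int -> Int -> Bool
brokenCheck n m = all (== n) (cubeSums m)

-- What was intended: the decomposition of m compared against m itself.
fixedCheck :: Int -> Bool
fixedCheck m = all (== m) (cubeSums m)
```

brokenCheck only passes when m has no decomposition at all (so the check is vacuous) or when n happens to equal m. For example, brokenCheck 0 1729 is False because 1729 does have decompositions, while fixedCheck 1729 is True.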

Generating random integers with given probabilities

I need to generate an infinite stream of random integers, with numbers to be in range [1..n]. However the probability for each number p_i is given in advance thus the distribution is not uniform.
Is there a library function to do it in Haskell?
As people have pointed out there is a function in Control.Monad.Random, but it has pretty poor complexity. Here is some code that I by some coincidence wrote this morning. It uses the beautiful Alias algorithm.
module Data.Random.Distribution.NonUniform(randomsDist) where
import Data.Array
import Data.List
import System.Random
genTable :: (Num a, Ord a) => [a] -> (Array Int a, Array Int Int)
genTable ps =
    let n = length ps
        n' = fromIntegral n
        (small, large) = partition ((< 1) . snd) $ zip [0..] $ map (n' *) ps
        loop ((l, pl):ls) ((g, pg):gs) probs aliases =
            let prob = (l,pl)
                alias = (l,g)
                pg' = (pg + pl) - 1
                gpg = (g, pg')
            in  if pg' < 1 then loop (gpg:ls) gs (prob:probs) (alias:aliases)
                           else loop ls (gpg:gs) (prob:probs) (alias:aliases)
        loop ls gs probs aliases = loop' (ls ++ gs) probs aliases
        loop' [] probs aliases = (array (0,n-1) probs, array (0,n-1) aliases)
        loop' ((g,_):gs) probs aliases = loop' gs ((g,1):probs) ((g, -1):aliases)
    in  loop small large [] []
-- | Generate an infinite list of random values with the given distribution.
-- The probabilities are scaled so they do not have to add up to one.
--
-- Uses Vose's alias method for generating the values.
-- For /n/ values this has O(/n/) setup complexity and O(1) complexity for each
-- generated item.
randomsDist :: (RandomGen g, Random r, Fractional r, Ord r)
            => g          -- ^ random number generator
            -> [(a, r)]   -- ^ list of values with the probabilities
            -> [a]
randomsDist g xps =
    let (xs, ps) = unzip xps
        n = length xps
        axs = listArray (0, n-1) xs
        s = sum ps
        (probs, aliases) = genTable $ map (/ s) ps
        (g', g'') = split g
        is = randomRs (0, n-1) g'
        rs = randoms g''
        ks = zipWith (\ i r -> if r <= probs!i then i else aliases!i) is rs
    in  map (axs!) ks
Just to expand on dflemstr's answer, you can create an infinite list of weighted values using Control.Monad.Random like this:
import Control.Monad.Random
import System.Random
weightedList :: RandomGen g => g -> [(a, Rational)] -> [a]
weightedList gen weights = evalRand m gen
  where m = sequence . repeat . fromList $ weights
And use it like this:
> let values = weightedList (mkStdGen 123) [(1, 2), (2, 5), (3, 10)]
> take 20 values
[2,1,3,2,1,2,2,3,3,3,3,3,3,2,3,3,2,2,2,3]
This doesn't require the IO monad, but you need to provide the random number generator that's used for the stream.
Control.Monad.Random offers this function in form of fromList:: MonadRandom m => [(a, Rational)] -> m a
You can use it in the IO Monad with:
import Control.Monad.Random
-- ...
someNums <- evalRandIO . sequence . repeat . fromList $ [(1, 0.3), (2, 0.2), (3, 0.5)]
print $ take 200 someNums
There are other ways of running the Rand Monad as you can see in that package. The weights do not have to add up to 1.
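If you are curious what picking from a weighted list can look like under the hood, here is one simple way to do it: walk the list and subtract weights from a number drawn uniformly from [0, total). This is only a sketch of the general idea and is not taken from the Control.Monad.Random source; pickWeighted and totalWeight are hypothetical names, and the uniform draw u is passed in as a plain argument so that the example needs nothing beyond base.

```haskell
-- Sum of all weights; the uniform draw must lie in [0, totalWeight).
totalWeight :: [(a, Rational)] -> Rational
totalWeight = sum . map snd

-- Pick the element whose cumulative weight interval contains u.
-- Returns Nothing for an empty list or when u is not below the total.
pickWeighted :: [(a, Rational)] -> Rational -> Maybe a
pickWeighted [] _ = Nothing
pickWeighted ((x, w) : rest) u
  | u < w = Just x
  | otherwise = pickWeighted rest (u - w)
```

With weights [(1, 2), (2, 5), (3, 10)], draws in [0, 2) select 1, draws in [2, 7) select 2 and draws in [7, 17) select 3. This walk is linear in the number of choices per draw.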
EDIT: Rand is apparently lazier than I thought, so replicateM n can be replaced by sequence . repeat, as @shang suggested.
There is also System.Random.Distributions.frequency
frequency :: (Floating w, Ord w, Random w, RandomGen g) => [(w, a)] -> g -> (a, g)
See https://hackage.haskell.org/package/Euterpea-1.0.0/docs/System-Random-Distributions.html

Resources