How to efficiently perform billions of Bernoulli extractions using numpy?


I am working on a thesis about epidemiology, and I have to simulate an SI epidemic in a temporal network. At each time step, a Bernoulli(beta) extraction decides whether the infection passes between an infected and a susceptible node. I am using np.random.binomial(size=whatever, n=1, p=beta) to make the computer decide. Now, I have to simulate the epidemic in the same network by making it start from each one of the nodes. This should be repeated K times to get statistically relevant results for each node, and, since the temporal network is stochastic too, everything should be repeated NET_REALIZATION times.
So, in a network with N = 100, if K=500 and NET_REALIZATION=500, the epidemic should be repeated 25,000,000 times. If T=100, it means 2,500,000,000 extractions per set of S-I couples (which of course varies in time). If beta is small, which is often the case, this leads to a very time-consuming computation.
Considering that, on my computer, a single Bernoulli extraction takes 3.63 µs, this means I have to wait hours to get any results, which is really limiting the development of my thesis.
The problem is that more than half of the time is just spent in random extractions.
I have to use numpy since the results of the extractions interact with other data structures. I tried to use numba, but it didn't seem to improve the speed of the extractions.
Is there a faster way to get the same results? I was thinking about doing one very big extraction once and for all, something like 10^12 extractions of 0s and 1s, and just importing a part of them for each different simulation (this would have to be repeated for several values of beta), but I wonder if there's a smarter move.
Thanks for help

If you can express your betas as multiples of 2^-N (for example, multiples of 1/256 if N is 8), then extract random N-bit chunks and determine whether each chunk is less than beta * 2^N. This works better if 32 is evenly divisible by N.
Note that numpy.random.uniform produces random floating-point numbers, and is expected to be slower than producing random integers or bits. This is especially because generating random floating-point numbers depends on generating random integers — not the other way around.
The following is an example of how this idea works.
import numpy
# Fixed seed for demonstration purposes
rs = numpy.random.RandomState(777778)
# Generate 10 integers in [0, 256)
ri = rs.randint(0, 256, 10)
# Now each integer x can be expressed, say, as a Bernoulli(5/256)
# variable which is 1 if x < 5, and 0 otherwise (exactly 5 of the
# 256 equally likely values are below 5). I haven't tested
# the following, which is similar to an example you gave in a
# comment.
rbern = (ri < 5) * 1
If you can use NumPy 1.17 or later, the following alternative exists:
import numpy
rs = numpy.random.default_rng()
ri = rs.integers(0, 256, 10)
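A minimal sketch of the same chunking idea applied to a large batch, assuming beta is a multiple of 1/256 and NumPy 1.17 or later. The helper name bernoulli_from_bytes and the fixed seed are illustrative only and are not part of NumPy.

```python
import numpy as np


def bernoulli_from_bytes(rng, k, size):
    """Draw `size` Bernoulli(k/256) samples by thresholding random bytes.

    `k` is the numerator of beta = k/256 (hypothetical helper).
    """
    if not 0 <= k < 256:
        raise ValueError("k must be in [0, 256)")
    # One random byte per trial, generated in a single vectorised call
    chunks = rng.integers(0, 256, size=size, dtype=np.uint8)
    # True with probability k/256, because k of the 256 values are below k
    return chunks < k


rng = np.random.default_rng(12345)  # fixed seed for demonstration only
draws = bernoulli_from_bytes(rng, 5, 1_000_000)
frequency = draws.mean()  # should be close to 5/256
```

Whether this beats np.random.binomial on a given machine has to be measured; the sketch only shows that the thresholding needs no floating-point generation.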
Note also that NumPy 1.17 introduces a new random number generation system alongside the legacy one. Perhaps it has better performance generating Bernoulli and binomial variables than the old one, especially because its default RNG, PCG64, is lighter-weight than the legacy system's default, Mersenne Twister. The following is an example.
import numpy
beta = 5.0/256
rs = numpy.random.default_rng()
rbinom = rs.binomial(10, beta)
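If one outcome per trial is needed rather than a single count, the same Generator can return a whole array in one call. A hedged sketch, assuming beta = 5/256 as above; the batch size and seed are arbitrary choices for illustration.

```python
import numpy as np

beta = 5.0 / 256
n_draws = 1_000_000  # illustrative batch size, not from the question

rng = np.random.default_rng(2024)  # fixed seed for reproducibility only

# One 0/1 outcome per trial, drawn in a single vectorised call
per_trial = rng.binomial(1, beta, size=n_draws)

# Same distribution, obtained by thresholding uniform floats in [0, 1)
per_trial_alt = rng.random(n_draws) < beta
```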

Related

Which seed when using `pytorch.manual_seed(seed)`?

I have trained a model with ImageNet. I got a new GPU and I also want to train the same model on a different GPU.
I want to compare if the outcome is different and therefore I want to use torch.manual_seed(seed).
After reading the docs https://pytorch.org/docs/stable/generated/torch.manual_seed.html it is still unclear which number I should use for the parameter seed.
How to choose the seed parameter? Can I take 10, 100, or 1000 or even more or less and why?

The PyTorch doc page you are pointing to does not mention anything special, beyond stating that the seed is a 64-bit integer.
So yes, 1000 is OK. As you expect from a modern pseudo-random number generator, the statistical properties of the pseudo-random sequence you are relying on do NOT depend on the choice of seed.
Since for most runs you will probably reuse the seed from a previous run, a practical approach is to pass the random seed as a command line parameter. In those cases where you have to come up with a brand new seed, you can pick one off the top of your head, or just roll dice to get it.
The important thing is to have a record of the seed used for each run.
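A minimal sketch of the command-line approach, under stated assumptions: the function name resolve_seed and the --seed flag are made up for illustration, and the call to torch.manual_seed is left as a comment so the snippet runs without PyTorch installed.

```python
import argparse
import secrets


def resolve_seed(argv=None):
    """Return the seed given on the command line, or a fresh 63-bit one."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--seed", type=int, default=None,
                        help="seed to reuse from a previous run")
    args = parser.parse_args(argv)
    if args.seed is not None:
        return args.seed
    # No seed supplied: draw a non-negative value that fits in 64 bits
    return secrets.randbits(63)


seed = resolve_seed(["--seed", "1000"])
# Record the seed with the run's output so the run can be repeated
print(f"using seed {seed}")
# torch.manual_seed(seed)  # in the actual training script
```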
OK, but ...
That being said, a lot of people seem uncomfortable with the task of just “guessing” some arbitrary number. There are (at least) two common expedients to get some seed in a seemingly proper “random” fashion.
The first one is to use the operating system's official source of genuinely (not pseudo) random bits. In Python, this is typically exposed as os.urandom(). So to get a seed in Python as a 64-bit integer, you could use code like this:
import functools
import os

# returns a list of 8 random small integers between 0 and 255
def get8RandomBytesFromOS():
    r8 = os.urandom(8)  # official OS entropy source
    byteCodes = list(map(ord, r8.decode('Latin-1')))  # type conversion
    return byteCodes

# make a single long integer from a list of 8 integers between 0 and 255
def getIntFromBytes(bs):
    # force highest bit to 0 to avoid overflow
    bs2 = [bs[0] if (bs[0] < 128) else (bs[0]-128)] + bs[1:8]
    num = functools.reduce(lambda acc,n: acc*256+n, bs2)
    return num

# Main user entry point:
def getRandomSeedFromOS():
    rbs8 = get8RandomBytesFromOS()
    return (getIntFromBytes(rbs8))
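A more compact sketch of the same OS-entropy idea, assuming Python 3, where int.from_bytes can do the byte-to-integer conversion directly. The function name seed_from_os is hypothetical.

```python
import os


def seed_from_os():
    """Return a 63-bit non-negative seed taken from the OS entropy source."""
    raw = int.from_bytes(os.urandom(8), "big")  # 64 random bits
    # Clear the highest bit so the value also fits a signed 64-bit integer
    return raw & ((1 << 63) - 1)


seed = seed_from_os()
```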
A second common expedient is to hash a string containing the current date and time, possibly with some prefix. With Python, the time includes microseconds. When a human user launches a script, the microseconds number of the launch time can be said to be random. One can use code like this, using a version of SHA (Secure Hash Algorithm):
import hashlib
import datetime

def getRandomSeedFromTime():
    prefix = 'dl.cs.univ-stackoverflow.edu'
    timeString1 = str(datetime.datetime.now())
    timeString2 = prefix + ' ' + timeString1
    hash = hashlib.sha256(timeString2.encode('ascii'))
    bytes = (hash.digest())[24:32]  # 8 rightmost bytes of hash
    byteCodes = list(map(ord, bytes.decode('Latin-1')))  # type conversion
    return (getIntFromBytes(byteCodes))
But, again, 1000 is basically OK. The idea of hashing the time string, instead of just taking the number of microseconds since the Epoch as the seed, probably comes from the fact that some early random number generators used their seed as an offset into a common and not so large global sequence. Hence, if your program naïvely took two seeds in rapid sequence without hashing, there was a risk of overlap between your two sequences. Fortunately, pseudo-random number generators now are much better than what they used to be.
(Addendum - taken from comment)
Note that the seed is a peripheral thing. The important thing is the state of the automaton, which can be much larger than the seed, hence the very word “seed”. The commonly used Mersenne Twister scheme has ~20,000 bits of internal state, and you cannot ask the user to provide 20,000 bits. There are sometimes ill-behaved initial states, but it is always the responsibility of the random number library to expand somehow the user-provided arbitrary seed into a well-behaved initial state.
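As an illustration of that expansion step, NumPy (not PyTorch) exposes it explicitly through numpy.random.SeedSequence. The sketch below only shows that two adjacent small seeds are expanded into unrelated-looking state words; it makes no claim about how PyTorch does this internally.

```python
import numpy as np

# Expand two adjacent user seeds into 8 words of 32-bit state each
state_a = np.random.SeedSequence(1000).generate_state(8)
state_b = np.random.SeedSequence(1001).generate_state(8)

print(state_a)
print(state_b)
```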

Safe way for parallel random sampling in python3

I need to repeat a scientific simulation based on random sampling N times, simply:
results = [mysimulation() for i in range(N)]
Since every simulation requires minutes, I'd like to parallelize them in order to reduce the execution time. Some weeks ago I successfully analyzed some simpler cases, for which I wrote my code in C using OpenMP and functions like rand_r() to avoid seed overlapping. How could I obtain a similar effect in Python?
I tried reading more about python3 multithreading/parallelization, but I found no results concerning random generation. Likewise, the numpy.random documentation does not suggest anything in this direction (as far as I could find).

Are the odds of a cryptographically secure random number generator generating the same uuid small enough that you do not need to check for uniqueness?

I'm using this with a length of 20 for uuid. Is it common practice to not check if the uuid generated has not been used already if it's used for a persistent unique value?
Or is it best practice to verify it's not already being used by some part of your application if it's essential to retain uniqueness?

You can calculate the probability of a collision using this formula from Wikipedia:
n(p; H) \approx \sqrt{2H \ln\frac{1}{1-p}}
where n(p; H) is the smallest number of samples you have to choose in order to find a collision with a probability of at least p, given H possible outputs with equal probability.
The same article also provides Python source code that you can use to calculate this value:
from math import log1p, sqrt

def birthday(probability_exponent, bits):
    probability = 10. ** probability_exponent
    outputs = 2. ** bits
    return sqrt(2. * outputs * -log1p(-probability))
So if you're generating UUIDs with 20 bytes (160 bits) of random data, how sure can you be that there won't be any collisions? Let's suppose you want there to be a probability of less than one in a quintillion (10^-18) that a collision will occur:
>>> birthday(-18,160)
1709679290002018.5
This means that after generating about 1.7 quadrillion UUIDs with 20 bytes of random data each, there is only a one in a quintillion chance that two of these UUIDs will be the same.
Basically, 20 bytes is perfectly adequate.
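The same approximation can be turned around to ask how likely a collision is after a given number of identifiers. A hedged sketch: both function names are illustrative, and the formula is the birthday approximation used above, so it is only accurate when the number of samples is small compared with the number of possible outputs.

```python
from math import expm1, log1p, sqrt


def samples_for_collision(probability, bits):
    """Approximate number of samples giving a collision with `probability`."""
    outputs = 2.0 ** bits
    return sqrt(2.0 * outputs * -log1p(-probability))


def collision_probability(samples, bits):
    """Approximate collision probability after drawing `samples` values."""
    outputs = 2.0 ** bits
    # 1 - exp(-n^2 / (2H)), computed with expm1 to keep tiny values accurate
    return -expm1(-(samples * samples) / (2.0 * outputs))


n = samples_for_collision(1e-18, 160)
p = collision_probability(n, 160)  # round trip, should give back about 1e-18
```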

crypto.RandomBytes is safe enough for most applications. If you want it to be completely secure, use a length of 16. With a length of 16 there will likely never be a collision within the next century. And it is definitely not a good idea to check an entire database for duplicates, because the odds are so low that the performance cost outweighs the added safety.

Generating Normally distributed Random Numbers without decimal in excel

I am trying to get random numbers that are normally distributed with a mean of 20 and a standard deviation of 2 for a sample size of 225 in Excel, but I am getting numbers with decimals (like 17.5642, 16.337).
If I round them off, a normal distribution can't be achieved. I used the Excel formula =NORMINV(RAND(),20,2) to generate those numbers. Please suggest a way to get whole numbers that are normally distributed too.

As @circular-ruin has observed, what you are asking for strictly speaking doesn't make sense.
But -- perhaps you can run the Central Limit Theorem backwards. CLT is often used to approximate discrete distributions by normal distributions. You can use it to approximate a normal distribution by a discrete distribution.
If X is binomial with parameters p and n, then it is a standard result that the mean of X is np and the variance of X is np(1-p). Elementary algebra yields that such an X has mean 20 and variance 4 (hence standard deviation 2) if and only if n = 25 and p = 0.8. Thus -- if you simulate a bin(25,0.8) random variable you will get integer values which will be approximately N(20,4). This seems a little more principled than simulating N(20,4) directly and then just rounding. It still isn't normal -- but you really need to drop that requirement if you want your values to be integers.
To simulate a bin(25,0.8) random variable in Excel, just use the formula
=BINOM.INV(25,0.8,RAND())
With just 225 observations the results would probably pass a Chi-squared goodness of fit test for N(20,4) (though the right tail would be under-represented).
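For readers working outside Excel, the same bin(25, 0.8) draw can be sketched in Python with NumPy. This is an illustration of the idea, not a translation of BINOM.INV; the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(225)  # arbitrary seed, for reproducibility only

# 225 integer observations with mean n*p = 20 and variance n*p*(1-p) = 4
samples = rng.binomial(25, 0.8, size=225)

sample_mean = samples.mean()
sample_std = samples.std(ddof=1)
```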

Probability of selecting an element from a set

The expected probability of randomly selecting an element from a set of n elements is P=1.0/n .
Suppose I check P using an unbiased method sufficiently many times. What is the distribution type of P? It is clear that P is not normally distributed, since it cannot be negative. Thus, may I correctly assume that P is gamma distributed? And if yes, what are the parameters of this distribution?
Histogram of probabilities of selecting an element from 100-element set for 1000 times is shown here.
Is there any way to convert this to a standard distribution?
Now suppose that the observed probability of selecting the given element was P* (P* != P). How can I estimate whether the bias is statistically significant?
EDIT: This is not homework. I'm doing a hobby project and I need this piece of statistics for it. I did my last homework ~10 years ago:-)

With repetitions, your distribution will be binomial. So let X be the number of times you select some fixed object out of N, with M total selections:
P{ X = x } = ( M choose x ) * (1/N)^x * ((N-1)/N)^(M-x)
You may find this difficult to compute for large M. It turns out that for sufficiently large M, this is well approximated by a normal distribution (Central Limit Theorem).
In that case P{X=x} is approximately given by a normal distribution. The mean will be M/N and the variance will be M * (1/N) * (N-1) / N.
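A small sketch of the normal approximation, using the 100-element, 1000-selection case from the question. The function names and the observed count of 20 are made up for illustration.

```python
from math import sqrt


def normal_approximation(m_selections, n_elements):
    """Mean and variance of the count of times one fixed element is picked."""
    p = 1.0 / n_elements
    mean = m_selections * p                  # M / N
    variance = m_selections * p * (1.0 - p)  # M * (1/N) * (N-1)/N
    return mean, variance


def z_score(observed_count, m_selections, n_elements):
    """Distance of an observed count from the mean, in standard deviations."""
    mean, variance = normal_approximation(m_selections, n_elements)
    return (observed_count - mean) / sqrt(variance)


mean, variance = normal_approximation(1000, 100)
z = z_score(20, 1000, 100)  # hypothetical observation: picked 20 times
```

A large absolute z suggests the observed count is unlikely under unbiased selection, with the caveat that the approximation is rough when M/N is small.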

This is a clear binomial distribution with p=1/(number of elements) and n=(number of trials).
To test whether the observed result differs significantly from the expected result, you can do the binomial test.
The dice examples on the two Wikipedia pages should give you some good guidance on how to formulate your problem. In your 100-element, 1000 trial example, that would be like rolling a 100-sided die 1000 times.
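A minimal sketch of an exact two-sided binomial test using only the standard library, assuming 0 < p < 1. It uses one common convention for "two-sided": sum the probabilities of all outcomes that are no more likely than the observed one. The function names and the observed counts are illustrative.

```python
from math import exp, lgamma, log


def binom_pmf(x, m, p):
    """P{X = x} for X ~ Binomial(m, p), computed in log space."""
    log_pmf = (lgamma(m + 1) - lgamma(x + 1) - lgamma(m - x + 1)
               + x * log(p) + (m - x) * log(1.0 - p))
    return exp(log_pmf)


def binomial_test_two_sided(observed, m, p):
    """p-value: total probability of outcomes no more likely than `observed`."""
    # Small relative tolerance so floating-point ties count as equal
    threshold = binom_pmf(observed, m, p) * (1.0 + 1e-7)
    total = 0.0
    for x in range(m + 1):
        pmf = binom_pmf(x, m, p)
        if pmf <= threshold:
            total += pmf
    return min(1.0, total)


# 100-element set, 1000 selections, so p = 1/100
p_value_typical = binomial_test_two_sided(10, 1000, 0.01)  # count at the mean
p_value_extreme = binomial_test_two_sided(30, 1000, 0.01)  # three times the mean
```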

As others have noted, you want the Binomial distribution. Your question seems to imply an interest in a continuous approximation to it, though. It can actually be approximated by the normal distribution, and also by the Poisson distribution.

Is your distribution a discrete uniform distribution?
