Binomial Distribution problem using Monte Carlo - statistics

Here, I tried to compute the probability of getting a one 4 times when I toss a die 6 times.
Here is my code,
import random as rd

no = 0
for i in range(1000):
    l = [rd.randint(1, 6) for i in range(6)]
    a = l.count(1)
    if a == 4:
        no += 1
print(no / 1000)
I want to know: is this actually a correct Monte Carlo approach to this binomial problem, and is it correct?

Your code is correct! Let me make two remarks to illustrate in more detail how you can verify for yourself that the code is correct.
You already mention that the probability is related to the binomial distribution. That is, if getting a 1 is considered a success, then you need exactly 4 successes in n = 6 trials with success probability p = 1/6. This probability can be computed exactly using, for instance, binom from scipy.stats.
The outcome of a Monte Carlo experiment will always depend on the random variables you draw. Theoretically, we can only recover the true probability if we could have an infinite number of Monte Carlo replications. This is of course impossible in practice. However, it is usually a good idea to explicitly define the number of replications in the code (e.g. I took MonteCarloTrials = 1000000 in the code below). This allows you to increase the number of Monte Carlo experiments as you desire.
import random as rd
from scipy.stats import binom  # exact binomial distribution

# Monte Carlo computation
no = 0
MonteCarloTrials = 1000000
for i in range(MonteCarloTrials):
    l = [rd.randint(1, 6) for i in range(6)]
    a = l.count(1)
    if a == 4:
        no += 1

# Exact binomial computation
n, p = 6, 1/6
x = binom.pmf(4, n, p)  # probability of exactly 4 successes in n = 6 trials with success probability p = 1/6

print('Approximate probability computed with Monte Carlo:', no/MonteCarloTrials)
print('Exact probability:', x)
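As a side note, the per-trial loop can also be vectorized; the following numpy sketch (my own variant, not part of the original answer) estimates the same probability, whose exact value is C(6,4) (1/6)^4 (5/6)^2 ≈ 0.00804:

```python
import numpy as np

rng = np.random.default_rng(0)
trials = 1_000_000

# each row is one experiment of 6 die rolls; count the experiments with exactly four 1s
rolls = rng.integers(1, 7, size=(trials, 6))
estimate = np.mean((rolls == 1).sum(axis=1) == 4)
print(estimate)
```

With a million replications the estimate should agree with the exact value to about three decimal places.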

Related

If we rolled the die (6-sided) 1000 time, what is the range of times we'd expect to see a 1 rolled?

I got a question, listed below, about the confidence interval for rolling a die 1000 times. I'm assuming that the question uses the binomial distribution, but I'm not sure if that's correct. I guess that in the solution the probability 0.94 comes from 1 - 0.06, but I'm not sure whether we need the probability in this interval beyond its use for the Z-score, 1.88. Can I interpret the question this way?
Question:
Assume that we are okay with accidentally rejecting H0​ 6% of the time, assuming H0​ is true.
If we rolled the die (6-sided) 1000 times, what is the range of times we'd expect to see a 1 rolled? (H0​ is the die is fair.)
Answer:
The interval is (144.50135805579743, 188.8319752775359), with probability = 0.94, mu = 166.67, sigma = 11.785113019775793
We can treat this as a binomial distribution with a success chance p of 1/6 and number of trials n = 1000.
The mean of such a distribution is np and its variance is np(1-p); sigma (the standard deviation) is sqrt(variance).
However, finding the interval is not so trivial, since it requires an inverse CDF. The solution apparently uses the normal approximation (p is low, n is high) with a Z-score table (like https://www.math.arizona.edu/~rsims/ma464/standardnormaltable.pdf), thus range = mu +- 1.88 * sigma. Obviously, the binomial is discrete, so there cannot be '144.5 times' of rolling a 1. scipy.stats.binom.ppf(0.97, 1000, 1/6) and scipy.stats.binom.ppf(0.03, 1000, 1/6) yield a sane 145..189 range.
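Both calculations described above fit in a few lines; this sketch (assuming scipy is available) recovers the 1.88 Z-score from norm.ppf and compares the normal approximation with the exact discrete quantiles:

```python
from math import sqrt
from scipy.stats import binom, norm

n, p = 1000, 1/6
mu = n * p                     # expected number of 1s, 166.67
sigma = sqrt(n * p * (1 - p))  # standard deviation, 11.79

# z-score leaving 3% in each tail (a 94% interval), about 1.88
z = norm.ppf(0.97)

print(mu - z * sigma, mu + z * sigma)                 # normal approximation
print(binom.ppf(0.03, n, p), binom.ppf(0.97, n, p))   # exact discrete quantiles
```

The first line reproduces the (144.5, 188.8) interval from the answer; the second gives the 145..189 range.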

Calculate the improper integral Monte Carlo method

For example, I'm trying to transform the integral:
I need to transform it to an integral that goes from 0 to 1 (or from a to b) in order to apply the Monte Carlo algorithm I implemented. I already have a Monte Carlo function that calculates definite integrals, but I need to transform this improper integral into a definite one, and I really don't know how to do it.
Ideally, I want to transform this integral (from -inf to inf):
I also tried a transformation I found on the Internet (integration by substitution: x = -ln(y), dx = -1/y dy), but it doesn't work:
So how can I transform this integral?
I need to transform it to an integral that goes from 0 to 1 (or from a to b) in order to apply the algorithm of Monte Carlo I implemented.
No, you don't.
If you have an integral

∫₀^∞ w(x) g(x) dx

you can sample from w(x) and compute the mean value of g(x) at the sampled points.
Your integral

∫₀^∞ e^(-x) cos(x) dx

is pretty much perfect for this approach: you sample from e^(-x) and compute E[cos(x)].
Something along these lines (Python 3.9, Win10 x64):
import numpy as np
rng = np.random.default_rng()
N = 1000000
U = rng.random(N)
W = -np.log(1.0 - U) # sampling exp(-x)
G = np.cos(W)
ans = np.mean(G)
print(ans)
will print something like
0.5002769491719996
And concerning your second integral, see What is the issue in my array division step?
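As a sanity check on the estimate above: the analytic value of ∫₀^∞ e^(-x) cos(x) dx is 1/2, which numerical quadrature confirms (a sketch assuming scipy is available):

```python
import numpy as np
from scipy.integrate import quad

# integrate e^{-x} * cos(x) over [0, inf); the analytic value is 1/2
value, err = quad(lambda x: np.exp(-x) * np.cos(x), 0, np.inf)
print(value)
```

The Monte Carlo estimate 0.50027... agrees with this to about three decimal places, as expected for a million samples.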

Finding probability for Discrete Binomial distribution problems

Problem Description:
In each of 4 different competitions, Jin has 60% chance of winning. Assuming that the competitions are independent of each other, what is the probability that: Jin will win at least 1 race?
Given Binomial distribution Parameters:
n=4
p=0.60
Hint:
P(x>=1)=1-P(x=0)
Use the binom.pmf() function of scipy.stats package to calculate the probability.
Below is the Python code I have tried, but it is being evaluated as wrong.
from scipy import stats
n = 4
p = 0.6
p1 = 1 - p
p2 = stats.binom.pmf(1,4,p1)
print(p1)
Using the hint, all you need to do is to evaluate the PMF of the binomial distribution at x=0 and subtract the result from 1 to obtain the probability of Jin winning at least one competition:
from scipy import stats
x=0
n=4
p=0.6
p0 = stats.binom.pmf(x,n,p)
print(1-p0)
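Equivalently, you can avoid the explicit subtraction by using scipy's binomial survival function, which gives P(X > x) directly (same result, just a different call):

```python
from scipy import stats

n, p = 4, 0.6
# P(X >= 1) = P(X > 0) = 1 - P(X = 0)
print(stats.binom.sf(0, n, p))        # survival function
print(1 - stats.binom.pmf(0, n, p))   # same value via the pmf
```

Both print 1 - 0.4^4 = 0.9744.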

Bayesian Linear Regression with PyMC3 and a large dataset - bracket nesting level exceeded maximum and slow performance

I would like to use a Bayesian multivariate linear regression to estimate the strength of players in team sports (e.g. ice hockey, basketball or soccer). For that purpose, I create a matrix, X, containing the players as columns and the matches as rows. For each match the player entry is either 1 (player plays in the home team), -1 (player plays in the away team) or 0 (player does not take part in this game). The dependent variable Y is defined as the scoring differences for both teams in each match (Score_home_team - Score_away_team).
Thus, the number of parameters will be quite large for one season (e.g. X is defined by 300 rows x 450 columns; i.e. 450 player coefficients + y-intercept). When running the fit I came across a compilation error:
('Compilation failed (return status=1): /Users/me/.theano/compiledir_Darwin-17.7.0-x86_64-i386-64bit-i386-3.6.5-64/tmpdxxc2379/mod.cpp:27598:32: fatal error: bracket nesting level exceeded maximum of 256.
I tried to handle this error by setting:
theano.config.gcc.cxxflags = "-fbracket-depth=1024"
Now, the sampling is running. However, it is so slow that even if I take only 35 of 300 rows the sampling is not completed within 20 minutes.
This is my basic code:
import pymc3 as pm

basic_model = pm.Model()
with basic_model:
    # Priors for beta coefficients - these are the coefficients of the players
    dict_betas = {}
    for col in X.columns:
        dict_betas[col] = pm.Normal(col, mu=0, sd=10)
    # Priors for unknown model parameters
    alpha = pm.Normal('alpha', mu=0, sd=10)  # alpha is the y-intercept
    sigma = pm.HalfNormal('sigma', sd=1)  # standard deviation of the observations
    # Expected value of outcome
    mu = alpha
    for col in X.columns:
        mu = mu + dict_betas[col] * X[col]  # mu = alpha + beta_1 * Player_1 + beta_2 * Player_2 + ...
    # Likelihood (sampling distribution) of observations
    Y_obs = pm.Normal('Y_obs', mu=mu, sd=sigma, observed=Y)
The instantiation of the model runs within one minute for the large dataset. I do the sampling using:
with basic_model:
    # draw 500 posterior samples
    trace = pm.sample(500)
The sampling completes for small sample sizes (e.g. 9 rows, 80 columns) within 7 minutes. However, the time increases substantially with increasing sample size.
Any suggestions how I can get this Bayesian linear regression to run in a feasible amount of time? Are these kinds of problems doable with PyMC3 (remember I came across a bracket nesting error)? I saw in a recent publication that this kind of analysis is doable in R (https://arxiv.org/pdf/1810.08032.pdf). Therefore, I guess it should also somehow work with Python 3.
Any help is appreciated!
Eliminating the for loops should improve performance and might also take care of the nesting issue you are reporting. Theano TensorVariables and the PyMC3 random variables that derive from them are already multidimensional and support linear algebra operations. Try changing your code to something along the lines of
beta = pm.Normal('beta', mu=0, sd=10, shape=X.shape[1])
...
mu = alpha + pm.math.dot(X, beta)
...
If you need to specify different prior values for mu and/or sd, those arguments accept anything that theano.tensor.as_tensor_variable() accepts, so you can pass a list or numpy array.
I highly recommend getting familiar with the theano.tensor and pymc3.math operations since sometimes you must use these to properly manipulate random variables, and in general it should lead to more efficient code.
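To see why this is safe, note that the loop and the dot product compute the same mu. Here is a quick numpy check on synthetic data (the shapes match the question; the values are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.choice([-1, 0, 1], size=(300, 450)))  # stand-in for the player matrix
beta = rng.normal(size=450)
alpha = 0.5

# loop version, as in the original model
mu_loop = alpha
for j, col in enumerate(X.columns):
    mu_loop = mu_loop + beta[j] * X[col]

# vectorized version, which pm.math.dot mirrors for random variables
mu_dot = alpha + X.values @ beta

print(np.allclose(mu_loop, mu_dot))
```

The vectorized form builds one graph node instead of 450, which is what keeps the compiled Theano code small and fast.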

How to avoid impression bias when calculate the ctr?

When we train a CTR (click-through rate) model, sometimes we need to calculate the real ctr from historical data, like this:

ctr = #(clicks) / #(impressions)

We know that if the number of impressions is too small, the calculated ctr is not reliable, so we usually set a threshold and keep only items with enough impressions.
But we also know that the more impressions, the higher the confidence in the ctr. So my question is: is there an impression-normalized statistical method to calculate the ctr?
Thanks!
You probably need a representation of confidence interval for your estimated ctr. Wilson score interval is a good one to try.
You need below stats to calculate the confidence score:
p̂ (phat) is the observed ctr (the fraction #clicks / #impressions)
n is the total number of impressions
z_{α/2} is the (1 - α/2) quantile of the standard normal distribution
A simple implementation in Python is shown below; I use z_{α/2} = 1.96, which corresponds to a 95% confidence interval. Three test results are attached at the end of the code.
clicks   impressions   conf. interval
2        10            (0.07, 0.45)
20       100           (0.14, 0.27)
200      1000          (0.18, 0.22)
Now you can set up some threshold to use the calculated confidence interval.
from math import sqrt

def confidence(clicks, impressions):
    n = impressions
    if n == 0:
        return 0
    z = 1.96  # 1.96 -> 95% confidence
    phat = float(clicks) / n
    denorm = 1. + (z * z / n)
    enum1 = phat + z * z / (2 * n)
    enum2 = z * sqrt(phat * (1 - phat) / n + z * z / (4 * n * n))
    return (enum1 - enum2) / denorm, (enum1 + enum2) / denorm

def wilson(clicks, impressions):
    if impressions == 0:
        return 0
    else:
        return confidence(clicks, impressions)

if __name__ == '__main__':
    print(wilson(2, 10))
    print(wilson(20, 100))
    print(wilson(200, 1000))

"""
results:
(0.07048879557839793, 0.4518041980521754)
(0.14384999046998084, 0.27112660859398174)
(0.1805388068716823, 0.22099327100894336)
"""
If you treat this as a binomial parameter, you can do Bayesian estimation. If your prior on ctr is uniform (a Beta distribution with parameters (1,1)), then your posterior is Beta(1 + #clicks, 1 + #impressions - #clicks). The posterior mean is (#clicks + 1) / (#impressions + 2) if you want a single summary statistic of this posterior, but you probably don't, and here's why:
I don't know what your method for deciding whether the ctr is high enough is, but let's say you're interested in everything with ctr > 0.9. You can then use the cumulative distribution function of the Beta distribution to find what proportion of the probability mass lies above the 0.9 threshold (this is just 1 minus the CDF at 0.9). This way, your threshold naturally incorporates the uncertainty in the estimate due to limited sample size.
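As a concrete sketch with scipy (the counts are illustrative, taken from the Wilson table above; the 0.9 threshold is the one from the example):

```python
from scipy.stats import beta

clicks, impressions = 2, 10  # illustrative counts
# posterior under a uniform Beta(1, 1) prior
posterior = beta(1 + clicks, 1 + impressions - clicks)

print(posterior.mean())   # posterior mean, (clicks + 1) / (impressions + 2)
print(posterior.sf(0.9))  # posterior probability that ctr > 0.9, i.e. 1 - CDF(0.9)
```

The survival function sf is scipy's built-in 1 - CDF, so no explicit subtraction is needed.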
There are many ways to calculate this confidence interval. An alternative to the Wilson score is the Clopper-Pearson interval, which I found useful in spreadsheets.
Upper bound: B(1 - alpha/2; x + 1, n - x)
Lower bound: B(alpha/2; x, n - x + 1)
Where
B() is the inverse Beta distribution (the Beta quantile function)
alpha is the confidence level error (e.g for 95% confidence-level, alpha is 5%)
n is the number of samples (e.g. impressions)
x is the number of successes (e.g. clicks)
In Excel an implementation for B() is provided by the BETA.INV formula.
There is no equivalent formula for B() in Google Sheets, but a Google Apps Script custom function can be adapted from a JavaScript statistical library (e.g. search GitHub for jstat).
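For Python users, the same interval can be sketched with scipy's Beta quantile function beta.ppf (my own translation of the bounds above; the x = 0 and x = n edge cases need special handling):

```python
from scipy.stats import beta

def clopper_pearson(x, n, alpha=0.05):
    """Exact (Clopper-Pearson) confidence interval for a binomial proportion."""
    lower = beta.ppf(alpha / 2, x, n - x + 1) if x > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, x + 1, n - x) if x < n else 1.0
    return lower, upper

print(clopper_pearson(20, 100))  # slightly wider than the Wilson interval above
```

Being an "exact" interval, Clopper-Pearson is conservative: it is never narrower than the nominal coverage requires.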
