Finding probability for Discrete Binomial distribution problems - statistics

Problem Description:
In each of 4 different competitions, Jin has a 60% chance of winning. Assuming that the competitions are independent of each other, what is the probability that Jin will win at least 1 competition?
Given Binomial distribution Parameters:
n=4
p=0.60
Hint:
P(x>=1)=1-P(x=0)
Use the binom.pmf() function of scipy.stats package to calculate the probability.
Below is the Python code I have tried, but it is being evaluated as wrong.
from scipy import stats
n = 4
p = 0.6
p1 = 1 - p
p2 = stats.binom.pmf(1,4,p1)
print(p1)

Using the hint, all you need to do is evaluate the PMF of the binomial distribution at x=0 and subtract the result from 1 to obtain the probability of Jin winning at least one competition:
from scipy import stats
x=0
n=4
p=0.6
p0 = stats.binom.pmf(x,n,p)
print(1-p0)
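As a sanity check (not required by the hint), scipy.stats also exposes the survival function sf, which returns P(X > x) directly; evaluating it at x=0 gives the same probability:
from scipy import stats
# P(X >= 1) = P(X > 0) = 1 - P(X = 0) = 1 - 0.4**4 = 0.9744
print(stats.binom.sf(0, 4, 0.6))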

Related

How to set the maxfun limit of the lbfgs solver on scikit-learn LogisticRegression model?

My scikit-learn LogisticRegression model, which uses the lbfgs solver, is stopping early as shown in the logs below. The data is standardized.
(...)
At iterate13150 f= 4.05397D+03 |proj g|= 2.41194D+04
At iterate13200 f= 4.05213D+03 |proj g|= 1.36863D+04
.venv/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:444: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of f AND g EVALUATIONS EXCEEDS LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
[Parallel(n_jobs=-1)]: Done 1 out of 1 | elapsed: 5.5s finished
* * *
Tit = total number of iterations
Tnf = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip = number of BFGS updates skipped
Nact = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F = final function value
* * *
N Tit Tnf Tnint Skip Nact Projg F
62 13240 15001 1 0 0 4.800D+04 4.051D+03
F = 4051.0211050375365
sklearn uses the scipy implementation of the lbfgs solver. The function scipy/optimize/_lbfgsb_py.py:_minimize_lbfgsb has the following early-stop conditions:
if n_iterations >= maxiter:
    task[:] = 'STOP: TOTAL NO. of ITERATIONS REACHED LIMIT'
elif sf.nfev > maxfun:
    task[:] = ('STOP: TOTAL NO. of f AND g EVALUATIONS '
               'EXCEEDS LIMIT')
I am indeed hitting the sf.nfev > maxfun limit. Unfortunately, sklearn fixes the value of maxfun to 15_000 when it instantiates the scipy solver (sklearn/linear_model/_logistic.py:442).
When I hotfix the sklearn package to set maxfun to 100_000, the solver converges. But this is not a real solution (since I do not want to carry around a custom sklearn dist with a single changed constant).
Any ideas on how to set the maxfun parameter in another way?
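One possible workaround, sketched below under the assumption that sklearn resolves optimize.minimize on the scipy module at call time: monkey-patch scipy.optimize.minimize from your own code so that every L-BFGS-B call receives a larger maxfun, without editing the installed package. The helper name _minimize_with_maxfun and the 100_000 limit are illustrative, not an sklearn API:
import scipy.optimize

_original_minimize = scipy.optimize.minimize

def _minimize_with_maxfun(*args, **kwargs):
    # Inject a larger maxfun into every L-BFGS-B call (e.g. those made
    # by sklearn's LogisticRegression) before delegating to scipy.
    if kwargs.get("method") == "L-BFGS-B":
        options = dict(kwargs.get("options") or {})
        options["maxfun"] = 100_000  # illustrative limit from the question
        kwargs["options"] = options
    return _original_minimize(*args, **kwargs)

scipy.optimize.minimize = _minimize_with_maxfun
Apply the patch before fitting the model; since it changes global state, keep it close to the training code and consider restoring _original_minimize afterwards.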

Binomial Distribution problem using Monte Carlo

Here, I tried to compute the probability of getting a one exactly 4 times when I toss a die 6 times.
Here is my code:
import random as rd

no = 0
for i in range(1000):
    l = [rd.randint(1,6) for i in range(6)]
    a = l.count(1)
    if a==4:
        no += 1
print(no/1000)
I want to know: is this actually a correct Monte Carlo approach to the binomial problem, and is the result correct?
Your code is correct! Let me make two remarks to illustrate in more detail how you can verify for yourself that the code is correct.
You already mention that the probability is related to the binomial distribution. That is, if getting a 1 is considered a success, then you need exactly 4 successes in n=6 trials when the success probability p=1/6. This probability can be computed exactly using for instance binom in scipy.stats.
The outcome of a Monte Carlo experiment will always depend on the random variables you draw. Theoretically, we could only recover the true probability with an infinite number of Monte Carlo replications, which is of course impossible in practice. However, it is usually a good idea to explicitly define the number of replications in the code (e.g. I took MonteCarloTrials = 1000000 in the code below). This allows you to increase the number of Monte Carlo experiments as you desire.
import random as rd
from scipy.stats import binom  # import binomial distribution

# Monte Carlo computation
no = 0
MonteCarloTrials = 1000000
for i in range(MonteCarloTrials):
    l = [rd.randint(1,6) for i in range(6)]
    a = l.count(1)
    if a==4:
        no += 1

# Exact binomial computation
n, p = 6, 1/6
x = binom.pmf(4, n, p)  # exactly 4 successes in n=6 trials with success probability 1/6
print('Approximate probability computed with Monte Carlo:', no/MonteCarloTrials)
print('Exact probability:', x)
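If you want to push the number of replications even higher, a vectorized NumPy version of the same experiment (a sketch, equivalent in logic to the loop above) runs considerably faster:
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng()
trials = 1_000_000
rolls = rng.integers(1, 7, size=(trials, 6))  # 6 die rolls per experiment
successes = (rolls == 1).sum(axis=1)          # number of ones in each experiment
print('Monte Carlo:', (successes == 4).mean())
print('Exact:', binom.pmf(4, 6, 1/6))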

Why do we have different values for skewness and kurtosis in MATLAB and Python?

Following is the code for skewness and kurtosis in MATLAB:
clc; clear all
% Generate "N" data points
N = 1:1:2000;
% Set sampling frequency
Fs = 1000;
% Set time step value
dt = 1/Fs;
% Frequency of the signal
f = 5;
% Generate time array
t = N*dt;
% Generate sine wave
y = 10 + 5*sin(2*pi*f*t);
% Skewness
y_skew = skewness(y);
% Kurtosis
y_kurt = kurtosis(y);
The answer acquired in MATLAB is:
y_skew = 4.468686410415491e-15
y_kurt = 1.500000000000001 (Value is positive in MATLAB)
Now, below is the code in Python:
import numpy as np
from scipy.stats import skew
from scipy.stats import kurtosis
# Generate "N" data points
N = np.linspace(1,2000,2000)
# Set sampling frequency
Fs = 1000
# Set time step value
dt = 1/Fs
# Frequency of the signal
f = 5
# Generate time array
t = N*dt
# Generate sine wave
y = 10 + 5*np.sin(2*np.pi*f*t);
# Skewness
y_skew = skew(y)
# Kurtosis
y_kurt = kurtosis(y)
The answer acquired in Python is:
y_skew = -1.8521564287013977e-16
y_kurt = -1.5 (Value has turned out to be negative in Python)
Can somebody please explain why we have different answers for skewness and kurtosis in MATLAB and Python?
Specifically, in the case of kurtosis, the value has changed from positive to negative. Can somebody please help me understand this?
This is the difference between the Fisher and Pearson measures of kurtosis.
From the MATLAB docs:
Kurtosis is a measure of how outlier-prone a distribution is. The kurtosis of the normal distribution is 3. Distributions that are more outlier-prone than the normal distribution have kurtosis greater than 3; distributions that are less outlier-prone have kurtosis less than 3. Some definitions of kurtosis subtract 3 from the computed value, so that the normal distribution has kurtosis of 0. The kurtosis function does not use this convention.
From the scipy docs:
Kurtosis is the fourth central moment divided by the square of the variance. If Fisher’s definition is used, then 3.0 is subtracted from the result to give 0.0 for a normal distribution.
Note that Fisher's definition is used by default in scipy:
scipy.stats.kurtosis(a, axis=0, fisher=True, ...)
Your results would be equivalent if you used fisher=False in Python (or manually added 3), or subtracted 3 from your MATLAB result, so that both use the same definition.
So it looks like the sign is flipped, but that's just by chance, since +1.5 - 3 = -1.5.
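For instance, reusing y from the Python snippet above, either adjustment lines the two results up:
from scipy.stats import kurtosis
# Pearson definition (no -3 offset), matching MATLAB's kurtosis():
print(kurtosis(y, fisher=False))  # ~1.5
# Or keep the default Fisher definition and add the offset back:
print(kurtosis(y) + 3)            # ~1.5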
The difference in skewness appears to be due to numerical precision, since both results are basically 0. Please see Why is 24.0000 not equal to 24.0000 in MATLAB?

Why can't I fit a Poisson distribution using the chi-square test? What's wrong in the fitting? [duplicate]

I want to fit a Poisson distribution to my data points and decide, based on a chi-square test, whether I should accept or reject this proposed distribution. I only used 10 observations. Here is my code:
import numpy as np
import matplotlib.pyplot as plt
from scipy import optimize
from scipy.stats import chisquare

#Fitting function:
def Poisson_fit(x, a):
    return a*np.exp(-x)

#Code (x is the array of observations, defined elsewhere)
hist, bins = np.histogram(x, bins=10, density=True)
print("hist: ", hist)
#hist: [5.62657158e-01, 5.14254073e-01, 2.03161280e-01, 5.84898068e-02,
# 1.35995217e-02, 2.67094169e-03, 4.39345778e-04, 6.59603327e-05,
# 1.01518320e-05, 1.06301906e-06]
XX = np.arange(len(hist))
print("XX: ", XX)
#XX: [0 1 2 3 4 5 6 7 8 9]
plt.scatter(XX, hist, marker='.', color='red')
popt, pcov = optimize.curve_fit(Poisson_fit, XX, hist)
plt.plot(x_data, Poisson_fit(x_data, *popt), linestyle='--', color='red',
         label='Fit')
print("hist: ", hist)
plt.xlabel('s')
plt.ylabel('P(s)')

#Chisquare test:
f_obs = hist
#f_obs: [5.62657158e-01, 5.14254073e-01, 2.03161280e-01, 5.84898068e-02,
# 1.35995217e-02, 2.67094169e-03, 4.39345778e-04, 6.59603327e-05,
# 1.01518320e-05, 1.06301906e-06]
f_exp = Poisson_fit(XX, *popt)
#f_exp: [6.76613820e-01, 2.48912314e-01, 9.15697229e-02, 3.36866185e-02,
# 1.23926144e-02, 4.55898806e-03, 1.67715798e-03, 6.16991940e-04,
# 2.26978650e-04, 8.35007789e-05]
chi, p_value = chisquare(f_obs, f_exp)
print("chi: ", chi)
print("p_value: ", p_value)
#chi: 0.4588956658201067
#p_value: 0.9999789643475111
I am using 10 observations, so the degrees of freedom would be 9. For these degrees of freedom I can't find my p-value and chi value in the chi-square distribution table. Is there anything wrong in my code? Or are my input values too small, so that the test fails? If the p-value > 0.05, the distribution is accepted. Although the p-value is large (0.999), I can't find the chi-square value 0.4588 in the table. I think there is something wrong in my code. How do I fix this error?
Is the returned chi value the critical value of the tails? How do I check the proposed hypothesis?
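As a side note on the table lookup: the p-value that chisquare returns can be reproduced from the chi-square survival function, so no printed table is needed. A minimal sketch, using the statistic above with k - 1 = 9 degrees of freedom:
from scipy.stats import chi2
print(chi2.sf(0.4588956658201067, df=9))  # ~0.99998, matching chisquare()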

How to calculate the standard deviation from a histogram? (Python, Matplotlib)

Let's say I have a data set and used matplotlib to draw a histogram of said data set.
n, bins, patches = plt.hist(data, normed=1)
How do I calculate the standard deviation, using the n and bins values that hist() returns? I'm currently doing this to calculate the mean:
s = 0
for i in range(len(n)):
    s += n[i] * ((bins[i] + bins[i+1]) / 2)
mean = s / numpy.sum(n)
which seems to work fine as I get pretty accurate results. However, if I try to calculate the standard deviation like this:
t = 0
for i in range(len(n)):
    t += (bins[i] - mean)**2
std = np.sqrt(t / numpy.sum(n))
my results are way off from what numpy.std(data) returns. Replacing the left bin limits with the central point of each bin doesn't change this either. I have the feeling that the problem is that the n and bins values don't actually contain any information on how the individual data points are distributed within each bin, but the assignment I'm working on clearly demands that I use them to calculate the standard deviation.
You haven't weighted the contribution of each bin with n[i]. Change the increment of t to
t += n[i]*(bins[i] - mean)**2
By the way, you can simplify (and speed up) your calculation by using numpy.average with the weights argument.
Here's an example. First, generate some data to work with. We'll compute the sample mean, variance and standard deviation of the input before computing the histogram.
In [54]: x = np.random.normal(loc=10, scale=2, size=1000)
In [55]: x.mean()
Out[55]: 9.9760798903061847
In [56]: x.var()
Out[56]: 3.7673459904902025
In [57]: x.std()
Out[57]: 1.9409652213499866
I'll use numpy.histogram to compute the histogram:
In [58]: n, bins = np.histogram(x)
mids is the midpoints of the bins; it has the same length as n:
In [59]: mids = 0.5*(bins[1:] + bins[:-1])
The estimate of the mean is the weighted average of mids:
In [60]: mean = np.average(mids, weights=n)
In [61]: mean
Out[61]: 9.9763028267760312
In this case, it is pretty close to the mean of the original data.
The estimated variance is the weighted average of the squared difference from the mean:
In [62]: var = np.average((mids - mean)**2, weights=n)
In [63]: var
Out[63]: 3.8715035807387328
In [64]: np.sqrt(var)
Out[64]: 1.9676136767004677
That estimate is within 2% of the actual sample standard deviation.
The following answer is equivalent to Warren Weckesser's, but maybe more familiar to those who prefer to think of the mean as an expected value:
counts, bins = np.histogram(x)
mids = 0.5*(bins[1:] + bins[:-1])
probs = counts / np.sum(counts)
mean = np.sum(probs * mids)
sd = np.sqrt(np.sum(probs * (mids - mean)**2))
Do take note that in certain contexts you may want the unbiased sample variance, where the sum is normalized by N-1 rather than N.
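A sketch of that adjustment, reusing counts, mids, and mean from the snippet above:
import numpy as np
# Unbiased sample variance from binned counts: normalize by N-1 instead of N
N = np.sum(counts)
var_unbiased = np.sum(counts * (mids - mean)**2) / (N - 1)
sd_unbiased = np.sqrt(var_unbiased)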