How to find the probability of 3 scenarios - statistics

I have a bucket of tennis balls (2) and baseballs (22), for a total of 24 balls in the bin.
I want to know the probability for 3 scenarios.
Each time I am going to pull out a total of 12 balls at random.
After pulling out all 12 balls, what is the likelihood that:
1.) I pull out both (2) tennis balls
2.) I pull out 0 tennis balls
3.) I only pull 1 tennis ball?
Obviously the probabilities for all 3 of these questions have to add up to 1 (100%).
thank you

This is a hypergeometric distribution, since you sample without replacement. Let's use hypergeom from scipy in Python:
from scipy.stats import hypergeom
import numpy as np
import seaborn as sns
# M is the total pool, n is the number of successes (tennis balls), N is the number of draws
[M, n, N] = [24, 2, 12]
rv = hypergeom(M, n, N)
# the range of values we are interested in (0, 1 or 2 tennis balls)
x = np.arange(0, n+1)
pmf_tballs = rv.pmf(x)
# the probabilities for 0, 1, 2 tennis balls
pmf_tballs
array([0.23913043, 0.52173913, 0.23913043])
sns.barplot(x=x, y=pmf_tballs, color="b")
You can also calculate it by brute force:
import itertools
# 2 tennis balls ("1") and 22 baseballs ("0")
balls = [int(i) for i in '1'*2 + '0'*22]
draws = itertools.combinations(balls, 12)
counts = {0: 0, 1: 0, 2: 0}
for draw in draws:
    counts[sum(draw)] += 1
You get a tally of 0, 1 and 2:
counts
{0: 646646, 1: 1410864, 2: 646646}
And the probabilities are the same as with the hypergeometric above:
[i/sum(counts.values()) for i in counts.values()]
[0.2391304347826087, 0.5217391304347826, 0.2391304347826087]
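As a cross-check (not part of the original answer), the same numbers follow directly from the hypergeometric formula P(X = k) = C(2, k)·C(22, 12−k) / C(24, 12), which is easy to evaluate with Python's math.comb:
from math import comb
total, tennis, n_draws = 24, 2, 12
# P(X = k) = C(tennis, k) * C(total - tennis, n_draws - k) / C(total, n_draws)
for k in range(tennis + 1):
    p = comb(tennis, k) * comb(total - tennis, n_draws - k) / comb(total, n_draws)
    print(k, p)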


Fill NaN values in a column within a specific range of values

I want to do the following:
Fill NaN values in a single column using values within a specific range.
The range I want to use is the mean of the non-NaN values in the column +/- 1 standard deviation.
NOTE: If possible, I would like to be able to use multiples of the std dev by simply multiplying it by a constant.
I thought I had it (see full code below) but the output from print(df['C'].describe()) shows that
I am generating values well outside my desired range. In fact, I am generating numbers outside
the original min and max of the column, which is definitely not what I want.
import pandas as pd
import numpy as np
import sys
print('Python: {}'.format(sys.version))
print('NumPy: {}'.format(np.__version__))
print('Pandas: {}'.format(pd.__version__))
print('\033[1;31m' + '--------------' + '\033[0m') # Bold red
display_settings = {
    'max_columns': 15,
    'max_colwidth': 60,
    'expand_frame_repr': False,  # Wrap to multiple pages
    'max_rows': 50,
    'precision': 6,
    'show_dimensions': False
}
# pd.options.display.float_format = '{:,.2f}'.format
for op, value in display_settings.items():
    pd.set_option("display.{}".format(op), value)
df = pd.DataFrame(np.random.randint(0, 1000, size=(200, 10)), columns=list('ABCDEFGHIJ'))
# df = pd.DataFrame(np.random.randint(0, 100, size=(20, 4)), columns=list(['AA','BB','C2','D2']))
print(df, '\n')
# https://stackoverflow.com/questions/55149738/pandas-replace-values-with-nan-at-random
df['C'] = df['C'].sample(frac=0.65) # The percentage of non-NaN values.
df['H'] = df['H'].sample(frac=0.75) # The percentage of non-NaN values.
print(df, '\n')
print(df.isnull().sum(), '\n')
print(df['C'].describe(), '\n')
def fillNaN_with_unifrand(col):
    a = col.values
    m = np.isnan(a)  # mask of NaNs
    mu, sigma = col.mean(), col.std()
    a[m] = np.random.normal(mu, sigma, size=m.sum())
    return col
# https://stackoverflow.com/questions/46543060/how-to-replace-every-nan-in-a-column-with-different-random-values-using-pandas?rq=1
fillNaN_with_unifrand(df['C'])
pd.options.display.float_format = '{:.0f}'.format
print(df, '\n')
print(df.isnull().sum(), '\n')
print(df['C'].describe())
Output of print(df['C'].describe()):
Starting:
count 130.000000
mean 462.446154
std 290.760432
min 7.000000
25% 187.500000
50% 433.000000
75% 671.250000
max 992.000000
Name: C, dtype: float64
Ending:
count 200
mean 517
std 298
min -187
25% 281
50% 544
75% 763
max 1218
Name: C, dtype: float64
Note the min and max. All of my fill values (in this instance) should have been 462 +/- 290.
Well, this is not how statistics works. A Gaussian (normal) distribution has a mean and a std, but values can be drawn far away from mean +/- std; they are just less likely. By definition of a normal distribution, 68% of all values lie within +/- 1*std, 95% within +/- 2*std, and so on. The question is: what do you want to do with outliers? Set them to mean +/- std, or draw again?
Case 1: Set outliers to min/max
This is usually unwanted, as this changes your distribution and puts more weight on the lower and upper boundary.
from matplotlib import pyplot as plt
import numpy as np
mu = 100
sigma = 7
a = np.random.normal(mu, sigma, size=2000)  # a size of 2000 as an example
a[a < (mu - sigma)] = mu - sigma
a[a > (mu + sigma)] = mu + sigma
plt.hist(a, bins=12, edgecolor='black')
plt.show()
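The same clipping can also be written with numpy's clip function (an equivalent one-liner, not in the original answer):
a = np.clip(np.random.normal(mu, sigma, size=2000), mu - sigma, mu + sigma)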
Case 2: Truncated Normal Distribution
What you usually want is the truncated normal distribution: a distribution with an upper and a lower boundary. You find it in the scipy.stats module. It works a bit differently, though: you first create the distribution by standardizing the lower and upper clip points, and then you draw a number of random variates (rvs) from it, like this:
from matplotlib import pyplot as plt
import scipy.stats as stats
mu = 100
sigma = 7
lower_clip = mu-sigma
upper_clip = mu+sigma
a = stats.truncnorm((lower_clip - mu) / sigma, (upper_clip - mu) / sigma, loc=mu, scale=sigma)
plt.hist(a.rvs(2000), bins=12, edgecolor='black')
plt.show()
Multiples of sigma are easy to implement: just change your lower and upper clip points like
lower_clip = mu - x*sigma
with x being your constant.
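To tie this back to the original question, here is a minimal sketch (my own adaptation, not from the answer, assuming the df defined in the question above) that fills the NaNs in df['C'] with truncated-normal draws; the multiplier k is the constant for the std dev mentioned in the NOTE:
import scipy.stats as stats
def fillNaN_truncnorm(col, k=1):
    mu, sigma = col.mean(), col.std()  # statistics of the non-NaN values
    lower, upper = mu - k * sigma, mu + k * sigma
    dist = stats.truncnorm((lower - mu) / sigma, (upper - mu) / sigma, loc=mu, scale=sigma)
    m = col.isna()                     # mask of the NaNs
    col.loc[m] = dist.rvs(m.sum())     # one truncated-normal draw per NaN
    return col
df['C'] = fillNaN_truncnorm(df['C'], k=1)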

Binomial Distribution using scipy.stats package

In each of 4 different competitions, Jin has a 60% chance of winning. Assuming that the competitions are independent of each other, what is the probability that Jin will win at least 1 race?
Binomial Distribution Parameters:
n=4
p=0.60
Display the probability in decimal.
Hint:
P(x>=1)=1-P(x=0)
Use the binom.pmf() function of scipy.stats package to calculate the probability.
#n=4
#p=0.60
#k=1
from scipy import stats
probability=stats.binom.pmf(1,4,0.60)
print(probability)
#0.15360000000000007
What should the value of k be here? My output is not correct.
I will first explain the solution in mathematical terms:
The probability that Jin will win at least 1 race = 1 − the probability that Jin wins no race.
In each of the 4 races Jin has a 60 percent chance of winning, so he has a 40 percent chance of losing.
If the probability of success on an individual trial is p, then the binomial probability of x successes in n repeated trials is C(n, x) · p^x · (1 − p)^(n − x).
Hence, the probability that Jin wins no race out of the 4 races = C(4, 0) × 0.6^0 × 0.4^4 = 0.0256,
and the probability that Jin wins at least 1 race = 1 − 0.0256 = 0.9744.
The Code:
from scipy import stats
def binomial():
    ans = 1 - round(stats.binom.pmf(0, 4, 0.6), 2)
    return ans

if __name__ == '__main__':
    print(binomial())
#n=4
#p=0.60
#k=1
from scipy import stats
# P(x>=1) = 1 - P(x=0), so first find the probability with k=0
probability = stats.binom.pmf(0, 4, 0.60)
# then do 1 - probability
actual_probability = 1 - probability
print(actual_probability)
from scipy import stats
from scipy.stats import binom

def binomial():
    n = 4
    p = 0.6
    k = 0
    prob = binom.pmf(k, n, p)
    ans = round(1 - prob, 2)  # round off to 2 decimal places
    return ans

Alternatively, sum P(X = k) for k = 1..4 directly:

def binomial():
    li = [1, 2, 3, 4]
    lis = [stats.binom.pmf(k, 4, 0.6) for k in li]
    an = sum(lis)
    ans = round(an, 2)
    return ans

if __name__ == '__main__':
    print(binomial())
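A note not in the original answers: scipy's binom also exposes the survival function sf, which returns P(X > k) directly, so the complement does not have to be written out by hand:
from scipy import stats
# P(X >= 1) = P(X > 0) = survival function evaluated at k = 0
print(stats.binom.sf(0, 4, 0.6))  # 0.9744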

How to highlight the area with maximum number of changes in a time series plot?

I am trying to play with some time series data. I would like to highlight the area with the maximum number of changes within some interval.
I have written some sample code, but I am not able to work out how to highlight the region.
import pandas as pd
import numpy as np
import seaborn as sns
f = pd.DataFrame(np.random.randint(0,50,size=(300, 1)))
sns.tsplot(f[0])
I want to highlight the region with maximum changes say with window size 30.
Here is one approach that performs most of the operations in numpy, and then displays the region with matplotlib.axvspan:
import matplotlib.pyplot as plt

f = pd.DataFrame(np.random.randint(0, 50, size=(300, 1)))  # dataframe
y = f[0].values       # working vector in numpy
thr = 5               # criterion for counting as a change
chunk_size = 30       # window length
chunks = np.array_split(y, y.shape[0] // chunk_size)  # split into 30-element chunks
# count how many consecutive differences within each chunk exceed the threshold
diffs_by_chunk = [(np.abs(np.ediff1d(chunk)) > thr).sum() for chunk in chunks]
ix = np.argmax(diffs_by_chunk)  # chunk with the most changes
sns.tsplot(f[0])
plt.axvspan(ix * chunk_size, (ix + 1) * chunk_size, alpha=0.5)
With a baseline of uniform random data, it is difficult to relate this to a use case, but alternative criteria for what to maximise over might be useful, e.g. just looking at the sum of absolute changes, rather than the number that exceed a threshold:
diffs_by_chunk = [(np.abs(np.ediff1d(chunk))).sum() for chunk in chunks] # criterion #2
It would also be possible to show multiple regions that all have enough differences:
for i, d in enumerate(diffs_by_chunk):
    if d >= 25:
        sns.mpl.pyplot.axvspan(i * chunk_size, (i + 1) * chunk_size, alpha=0.5)
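If the window does not have to be aligned to fixed 30-sample chunks, a rolling window over the absolute differences (a pandas-based variant, not part of the original answer) locates the single most volatile stretch:
import matplotlib.pyplot as plt
window = 30
# sum of absolute consecutive changes over a sliding 30-sample window
rolling_change = f[0].diff().abs().rolling(window).sum()
end = rolling_change.idxmax()  # right edge of the most volatile window
plt.plot(f[0])
plt.axvspan(end - window, end, alpha=0.5)
plt.show()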

How to compute correlation ratio or Eta in Python?

According to the answer to this post,
The most classic "correlation" measure between a nominal and an interval ("numeric") variable is Eta, also called correlation ratio, and equal to the root R-square of the one-way ANOVA (with p-value = that of the ANOVA). Eta can be seen as a symmetric association measure, like correlation, because Eta of ANOVA (with the nominal as independent, numeric as dependent) is equal to Pillai's trace of multivariate regression (with the numeric as independent, set of dummy variables corresponding to the nominal as dependent).
I would appreciate it if you could let me know how to compute Eta in Python.
In fact, I have a dataframe with some numeric and some nominal variables.
Besides, how can I plot a heatmap-like plot for it?
The other answer below is missing the root extraction step, so as a result you will receive eta-squared; in the main article it references (used by User777), that issue has been fixed.
There is an article on Wikipedia about what the correlation ratio is and how to calculate it. I've created a simpler version of the calculation and will use the example from the wiki:
import pandas as pd
import numpy as np
data = {'subjects': ['algebra'] * 5 + ['geometry'] * 4 + ['statistics'] * 6,
'scores': [45, 70, 29, 15, 21, 40, 20, 30, 42, 65, 95, 80, 70, 85, 73]}
df = pd.DataFrame(data=data)
print(df.head(10))
>>> subjects scores
0 algebra 45
1 algebra 70
2 algebra 29
3 algebra 15
4 algebra 21
5 geometry 40
6 geometry 20
7 geometry 30
8 geometry 42
9 statistics 65
def correlation_ratio(categories, values):
    categories = np.array(categories)
    values = np.array(values)
    ssw = 0  # sum of squares within groups
    ssb = 0  # sum of squares between groups
    for category in set(categories):
        subgroup = values[np.where(categories == category)[0]]
        ssw += sum((subgroup - np.mean(subgroup))**2)
        ssb += len(subgroup) * (np.mean(subgroup) - np.mean(values))**2
    return (ssb / (ssb + ssw))**.5
coef = correlation_ratio(df['subjects'], df['scores'])
print('Eta_squared: {:.4f}\nEta: {:.4f}'.format(coef**2, coef))
>>> Eta_squared: 0.7033
Eta: 0.8386
The answer is also provided here (note that, as mentioned above, this version omits the square root and therefore returns eta-squared):
def correlation_ratio(categories, measurements):
    fcat, _ = pd.factorize(categories)
    cat_num = np.max(fcat) + 1
    y_avg_array = np.zeros(cat_num)
    n_array = np.zeros(cat_num)
    for i in range(0, cat_num):
        cat_measures = measurements[np.argwhere(fcat == i).flatten()]
        n_array[i] = len(cat_measures)
        y_avg_array[i] = np.average(cat_measures)
    y_total_avg = np.sum(np.multiply(y_avg_array, n_array)) / np.sum(n_array)
    numerator = np.sum(np.multiply(n_array, np.power(np.subtract(y_avg_array, y_total_avg), 2)))
    denominator = np.sum(np.power(np.subtract(measurements, y_total_avg), 2))
    if numerator == 0:
        eta = 0.0
    else:
        eta = numerator / denominator  # note: this is eta-squared; take the square root to get eta
    return eta
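To address the heatmap part of the question, here is a rough sketch (my own, assuming the nominal and numeric column names are known) that computes Eta for every nominal/numeric pair with the first correlation_ratio function above (the one that takes the square root and returns Eta) and feeds the resulting matrix to seaborn's heatmap:
import seaborn as sns
import matplotlib.pyplot as plt
nominal_cols = ['subjects']  # hypothetical lists of column names
numeric_cols = ['scores']
eta = pd.DataFrame(index=nominal_cols, columns=numeric_cols, dtype=float)
for cat in nominal_cols:
    for num in numeric_cols:
        eta.loc[cat, num] = correlation_ratio(df[cat], df[num])
sns.heatmap(eta, annot=True, vmin=0, vmax=1)
plt.show()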

KNN algorithm that return 2 or more nearest neighbours

For example, I have a vector x, and a is its nearest neighbour. Then, b is its next nearest neighbour. Is there any package in Python or R that outputs something like [a, b], meaning that a is its nearest neighbour (maybe by majority vote), while b is its second nearest neighbour?
This is exactly what those metric trees are built for.
Your question reads as though you are asking for something as simple as this, using sklearn's KDTree (consider BallTree depending on the metric in play):
import numpy as np
from sklearn.neighbors import KDTree
X = np.array([[1,1],[2,2], [3,3]]) # 3 points in 2 dimensions
tree = KDTree(X)
dist, ind = tree.query([[1.25, 1.35]], k=2)
print(ind) # indices of 2 closest neighbors
print(dist) # distances to 2 closest neighbors
Out:
[[0 1]]
[[ 0.43011626 0.99247166]]
And just to be clear: KNN usually refers to some pre-built algorithm based on metric trees (KDTree, BallTree) for the task of classification. Often those data structures are the only thing one is interested in.
Edit
If I interpret your comment correctly, you want to use the manhattan / taxicab / l1 metric.
Look here for the compatibility lists of those spatial-trees.
You just would use it like that:
X = np.array([[1,1],[2,2], [3,3]]) # 3 points in 2 dimensions
tree = KDTree(X, metric='l1') # !!!
dist, ind = tree.query([[1.25, 1.35]], k=2)
print(ind) # indices of 2 closest neighbors
print(dist) # distances to 2 closest neighbors
Out:
[[0 1]]
[[ 0.6 1.4]]
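If you prefer the higher-level estimator API, sklearn's NearestNeighbors wraps the same trees and returns the k nearest neighbours in one call; a small sketch along the lines of the KDTree example above:
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[1, 1], [2, 2], [3, 3]])  # 3 points in 2 dimensions
nn = NearestNeighbors(n_neighbors=2, metric='manhattan').fit(X)
dist, ind = nn.kneighbors([[1.25, 1.35]])
print(ind)   # indices of the 2 closest neighbours -> [[0 1]]
print(dist)  # their l1 distances -> [[0.6 1.4]]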
