Binomial Distribution using scipy.stats package - python-3.x

In each of 4 different competitions, Jin has a 60% chance of winning. Assuming that the competitions are independent of each other, what is the probability that Jin will win at least 1 race?
Binomial Distribution Parameters:
n=4
p=0.60
Display the probability as a decimal.
Hint:
P(x>=1)=1-P(x=0)
Use the binom.pmf() function of the scipy.stats package to calculate the probability.
#n=4
#p=0.60
#k=1
from scipy import stats
probability=stats.binom.pmf(1,4,0.60)
print(probability)
#0.15360000000000007
What should the value of k be here? My output is not correct.

I will first explain the solution in Mathematical Terms:
The probability that Jin will win at least 1 race = 1 − the probability that Jin will win no race.
In each of the 4 races Jin has a 60 percent chance of winning. That means he has a 40 percent chance of losing.
If the probability of success on an individual trial is p, then the binomial probability of x successes in n independent trials is nCx · p^x · (1−p)^(n−x).
Hence,
the probability that Jin will win no race out of the 4 races = 4C0 × 0.6^0 × 0.4^4 = 0.0256
Hence, the probability that Jin will win at least 1 race = 1 − 0.0256 = 0.9744
The Code:
from scipy import stats
def binomial():
    # P(X >= 1) = 1 - P(X = 0)
    ans = 1 - round(stats.binom.pmf(0, 4, 0.6), 2)
    return ans

if __name__ == '__main__':
    print(binomial())

#n=4
#p=0.60
#k=1
from scipy import stats
# P(x>=1) = 1 - P(x=0), so first find the probability with k = 0
probability = stats.binom.pmf(0, 4, 0.60)
# then take 1 - probability
actual_probability=1-probability
print(actual_probability)

from scipy import stats
from scipy.stats import binom
def binomial():
    n = 4
    p = 0.6
    k = 0
    prob = binom.pmf(k, n, p)
    # Round off to 2 decimal places
    ans = round(1 - prob, 2)
    return ans

def binomial():
    # P(X >= 1) = P(X = 1) + P(X = 2) + P(X = 3) + P(X = 4)
    li = [1, 2, 3, 4]
    lis = [stats.binom.pmf(k, 4, 0.6) for k in li]
    an = sum(lis)
    ans = round(an, 2)
    return ans

if __name__ == '__main__':
    print(binomial())
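For what it's worth, the same probability can also be computed without pmf at all, using the CDF or the survival function (a small sketch outside the exercise's hint, which asks for binom.pmf):
from scipy import stats

# P(X >= 1) = 1 - P(X = 0) = 1 - CDF(0); binom.sf(0, n, p) gives the same upper tail directly
print(1 - stats.binom.cdf(0, 4, 0.6))   # ≈ 0.9744
print(stats.binom.sf(0, 4, 0.6))        # ≈ 0.9744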

Related

random selection of numbers with a known log-normal distribution

I have some (unknown) numbers that follow a log-normal distribution. What I know is that the mean value is 3 and the coefficient of variation is 0.5.
This means the range of the standard deviation varies by an order of magnitude.
How can I generate 100 random variables in Python from the mean and coefficient of variation?
From the desired mean mu_d and coefficient of variation coeff_var, the desired standard deviation and variance are
std_d = mu_d * coeff_var, var_d = std_d**2
The mean and variance of a log-normal distribution are related to the mean mu_x and variance var_x of the underlying normal distribution by
mu_d = exp(mu_x + var_x/2), var_d = (exp(var_x) - 1) * exp(2*mu_x + var_x)
Solve these expressions for mu_x and var_x. With a given mean mu_x and variance var_x for the underlying normal distribution:
import numpy as np
# Mean and variance of underlying normal distribution
mu_x = 0
var_x = 1
sigma_x = np.sqrt(var_x)
# Samples from the distribution
s = np.random.lognormal(mu_x, sigma_x, 100)
Is this what you're looking for?
https://numpy.org/doc/stable/reference/random/generated/numpy.random.lognormal.html
import numpy as np
mean = 3 # mean
var_coef = 0.5 # coefficient of variation
std = var_coef * mean / 100 # standard deviation
s = np.random.lognormal(mean, std, 100)
print(s)
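For completeness, here is a small sketch of the conversion described in the first answer, with the question's numbers plugged in (desired mean 3, coefficient of variation 0.5); the variable names just mirror the answer above:
import numpy as np

mu_d = 3.0                          # desired mean of the log-normal samples
coeff_var = 0.5                     # desired coefficient of variation
var_d = (mu_d * coeff_var) ** 2     # desired variance

# Parameters of the underlying normal distribution
var_x = np.log(1 + var_d / mu_d**2)
mu_x = np.log(mu_d) - var_x / 2

s = np.random.lognormal(mu_x, np.sqrt(var_x), 100)
print(s.mean(), s.std() / s.mean())   # should be roughly 3 and 0.5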

Problem with negative numbers in sklearn.feature_selection.SelectKBest feature scoring module

I was trying auto feature engineering and selecting, so for that, I used the Boston house price dataset available in sklearn.
from sklearn.datasets import load_boston
import pandas as pd
data = load_boston()
x = data.data
y= data.target
y = pd.DataFrame(y)
Then I implemented the feature transformation library on the dataset.
import autofeat as af
clf = af.AutoFeatRegressor()
df = clf.fit_transform(x,y)
df = pd.DataFrame(df)
After this, I implemented another function to find the score of each feature in relation to the label.
from sklearn.feature_selection import SelectKBest, chi2
X_new = SelectKBest(chi2, k=20)
X_new_done = X_new.fit_transform(df,y)
dfscores = pd.DataFrame(X_new.scores_)
dfcolumns = pd.DataFrame(X_new_done.columns)
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score']
print(featureScores.nlargest(10,'Score'))
This gave the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-16-b0fa1556bdef> in <module>()
1 from sklearn.feature_selection import SelectKBest, chi2
2 X_new = SelectKBest(chi2, k=20)
----> 3 X_new_done = X_new.fit_transform(df,y)
4 dfscores = pd.DataFrame(X_new.scores_)
5 dfcolumns = pd.DataFrame(X_new_done.columns)
ValueError: Input X must be non-negative.
I had a few negative numbers in my dataset. So how can I overcome this problem?
Note: df has no transformations of y; it only contains transformations of x.
You have a feature with all negative values:
df['exp(x005)*log(x000)']
returns
0 -3630.638503
1 -2212.931477
2 -4751.790753
3 -3754.508972
4 -3395.387438
...
501 -2022.382877
502 -1407.856591
503 -2998.638158
504 -1973.273347
505 -1267.482741
Name: exp(x005)*log(x000), Length: 506, dtype: float64
Quoting another answer (https://stackoverflow.com/a/46608239/5025009):
The error message Input X must be non-negative says it all: Pearson's chi-square test (goodness of fit) does not apply to negative values. That is logical, because the chi-square test assumes a frequency distribution, and a frequency can't be a negative number. Consequently, sklearn.feature_selection.chi2 asserts that the input is non-negative.
In many cases, it may be quite safe to simply shift each feature to make it all positive, or even normalize to [0, 1] interval as suggested by EdChum.
If data transformation is for some reason not possible (e.g. a negative value is an important factor), you should pick another statistic to score your features:
sklearn.feature_selection.f_regression computes ANOVA f-value
sklearn.feature_selection.mutual_info_classif computes the mutual information
Since the whole point of this procedure is to prepare the features for another method, it's not a big deal which one you pick; the end result is usually the same or very close.
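As a sketch of those two workarounds (assuming the df and y from the question; MinMaxScaler and f_regression are standard sklearn tools, not anything the original code already used):
from sklearn.feature_selection import SelectKBest, chi2, f_regression
from sklearn.preprocessing import MinMaxScaler

y_1d = y.values.ravel()

# Option 1: rescale every feature to the [0, 1] interval so chi2 accepts it
df_scaled = MinMaxScaler().fit_transform(df)
chi2_scores = SelectKBest(chi2, k=20).fit(df_scaled, y_1d).scores_

# Option 2: keep the original values and use a score function that allows negatives
f_scores = SelectKBest(f_regression, k=20).fit(df, y_1d).scores_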

How to find the probability of 3 scenarios

I have a bucket of tennis balls (2) and baseballs (22) for a total of 24 balls in the bin.
I want to know what the probability is for 3 scenarios.
Each time I am going to pull out a total of 12 balls at random.
I want to know, after pulling out all 12 balls, what is the likelihood that:
1.) I pull out both(2) tennis balls
2.) I pull out 0 tennis balls
3.) I only pull 1 tennis ball?
Obviously the probabilities for all 3 of these scenarios have to add up to 1, or 100%.
Thank you.
It's a hypergeometric distribution when you sample without replacement. So let's use hypergeom from scipy in Python:
from scipy.stats import hypergeom
import numpy as np
import seaborn as sns

# M is the total pool (24 balls), n is the number of successes (2 tennis balls), N is the number of draws
[M, n, N] = [24, 2, 12]
rv = hypergeom(M, n, N)
# the range of values we are interested in
x = np.arange(0, n + 1)
pmf_tballs = rv.pmf(x)
The probabilities for 0, 1 and 2 tennis balls:
pmf_tballs
array([0.23913043, 0.52173913, 0.23913043])
sns.barplot(x=x,y=pmf_tballs,color="b")
You can calculate by brute force:
import itertools

# 2 tennis balls ('1') and 22 baseballs ('0') -- 24 balls in total
balls = [int(i) for i in '1' * 2 + '0' * 22]
draws = itertools.combinations(balls, 12)
counts = {0: 0, 1: 0, 2: 0}
for draw in draws:
    counts[sum(draw)] += 1
You get a tally of 0, 1 and 2:
counts
{0: 646646, 1: 1410864, 2: 646646}
And the probability is the same as above with hypergeometric:
[i/sum(counts.values()) for i in counts.values()]
[0.2391304347826087, 0.5217391304347826, 0.2391304347826087]
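As a quick sanity check (matching the note in the question that the three scenarios must add up to 1, or 100%):
print(pmf_tballs.sum())        # ~1.0
print(sum(counts.values()))    # 2704156 == comb(24, 12), every possible draw counted once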

KNN algorithm that returns 2 or more nearest neighbours

For example, I have a vector x and a is its nearest neighbour. Then, b is its next nearest neighbour. Is there any package in Python or R that outputs something like [a, b], meaning that a is its nearest neighbour (maybe by majority vote), while b is its second nearest neighbour?
This is exactly what those metric trees are built for.
Your question reads as if you are asking for something as simple as this, using sklearn's KDTree (consider BallTree depending on the metric in play):
import numpy as np
from sklearn.neighbors import KDTree
X = np.array([[1,1],[2,2], [3,3]]) # 3 points in 2 dimensions
tree = KDTree(X)
dist, ind = tree.query([[1.25, 1.35]], k=2)
print(ind) # indices of 2 closest neighbors
print(dist) # distances to 2 closest neighbors
Out:
[[0 1]]
[[ 0.43011626 0.99247166]]
And just to be clear: KNN usually refers to some pre-built algorithm based on metric trees (KDTree, BallTree) for the task of classification. Often those data structures are the only thing one is interested in.
Edit
If I interpret your comment correctly, you want to use the Manhattan / taxicab / l1 metric.
Look here for the compatibility lists of those spatial-trees.
You would just use it like this:
X = np.array([[1,1],[2,2], [3,3]]) # 3 points in 2 dimensions
tree = KDTree(X, metric='l1') # !!!
dist, ind = tree.query([[1.25, 1.35]], k=2)
print(ind) # indices of 2 closest neighbors
print(dist) # distances to 2 closest neighbors
Out:
[[0 1]]
[[ 0.6 1.4]]
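If you prefer not to query the tree yourself, sklearn's higher-level NearestNeighbors wrapper returns the same indices and distances (a small sketch reusing the toy data above):
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[1, 1], [2, 2], [3, 3]])   # same 3 points in 2 dimensions
nn = NearestNeighbors(n_neighbors=2).fit(X)
dist, ind = nn.kneighbors([[1.25, 1.35]])
print(ind)    # [[0 1]] -- nearest and second-nearest points
print(dist)   # their distances, as with KDTree above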

gradient descent cost increases at each iteration in linear regression with one feature

Hi, I am learning some machine learning algorithms, and for the sake of understanding I was trying to implement a linear regression algorithm with one feature, using the residual sum of squares as the cost function for the gradient descent method, as below:
My pseudocode:
while not converge
w <- w - step*gradient
python code
Linear.py
import math
import numpy as num

def get_regression_predictions(input_feature, intercept, slope):
    predicted_output = [intercept + xi * slope for xi in input_feature]
    return predicted_output

def rss(input_feature, output, intercept, slope):
    return sum([(output.iloc[i] - (intercept + slope * input_feature.iloc[i]))**2 for i in range(len(output))])

def train(input_feature, output, intercept, slope):
    file = open("train.csv", "w")
    file.write("ID,intercept,slope,RSS\n")
    i = 0
    while True:
        print("RSS:", rss(input_feature, output, intercept, slope))
        file.write(str(i) + "," + str(intercept) + "," + str(slope) + "," + str(rss(input_feature, output, intercept, slope)) + "\n")
        i += 1
        gradient = [derivative(input_feature, output, intercept, slope, n) for n in range(0, 2)]
        step = 0.05
        intercept -= step * gradient[0]
        slope -= step * gradient[1]
    return intercept, slope

def derivative(input_feature, output, intercept, slope, n):
    if n == 0:
        return sum([-2 * (output.iloc[i] - (intercept + slope * input_feature.iloc[i])) for i in range(0, len(output))])
    return sum([-2 * (output.iloc[i] - (intercept + slope * input_feature.iloc[i])) * input_feature.iloc[i] for i in range(0, len(output))])
With the main program:
import Linear as lin
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
df = pd.read_csv("test2.csv")
train = df
lin.train(train["X"],train["Y"], 0, 0)
The test2.csv:
X,Y
0,1
1,3
2,7
3,13
4,21
I recorded the value of rss in a file and noticed that the value of rss became worse at each iteration, as follows:
ID,intercept,slope,RSS
0,0,0,669
1,4.5,14.0,3585.25
2,-7.25,-18.5,19714.3125
3,19.375,58.25,108855.953125
Mathematically I think it doesn't make any sense. I have reviewed my own code many times and I think it is correct. Am I doing something else wrong?
If your cost isn't decreasing, that's usually a sign you're overshooting with your gradient descent approach, meaning too large of a step size.
A smaller step size can help. You can also look into methods for variable step sizes, which can change each iteration to get you nice convergence properties and speed; usually, these methods change the step size with some proportionality to the gradient. Of course, the specifics depend on each problem.
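To illustrate (a minimal numpy sketch using the same five data points from test2.csv, not the poster's original code): shrinking the step size from 0.05 to 0.01 is already enough for this dataset, and the RSS then decreases at every iteration.
import numpy as np

x = np.array([0, 1, 2, 3, 4], dtype=float)
y = np.array([1, 3, 7, 13, 21], dtype=float)

intercept, slope, step = 0.0, 0.0, 0.01   # step = 0.05 diverges here, 0.01 converges

for i in range(1000):
    resid = y - (intercept + slope * x)
    rss = np.sum(resid ** 2)
    # gradient of the RSS with respect to intercept and slope
    grad_intercept = -2 * np.sum(resid)
    grad_slope = -2 * np.sum(resid * x)
    intercept -= step * grad_intercept
    slope -= step * grad_slope

print(intercept, slope, rss)   # approaches intercept -1, slope 5, RSS 14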
