Statistical tests: how do (perception; actual results; and next) interact? - statistics

What is the interaction between perception, outcome, and outlook?
I've converted them into categorical variables to [potentially] simplify things.
import pandas as pd
import numpy as np
high, size = 100, 20
df = pd.DataFrame({'perception': np.random.randint(0, high, size),
'age': np.random.randint(0, high, size),
'smokes_cat': pd.Categorical(np.tile(['lots', 'little', 'not'],
size//3+1)[:size]),
'outcome': np.random.randint(0, high, size),
'outlook_cat': pd.Categorical(np.tile(['positive', 'neutral',
'negative'],
size//3+1)[:size])
})
df.insert(2, 'age_cat', pd.Categorical(pd.cut(df.age, range(0, high+5, size//2),
right=False, labels=[
"{0} - {1}".format(i, i + 9)
for i in range(0, high, size//2)])))
def tierify(i):
    if i <= 25:
        return 'lowest'
    elif i <= 50:
        return 'low'
    elif i <= 75:
        return 'med'
    return 'high'
df.insert(1, 'perception_cat', df['perception'].map(tierify))
df.insert(6, 'outcome_cat', df['outcome'].map(tierify))
np.random.shuffle(df['smokes_cat'])
Run online: http://ideone.com/fftuSv or https://repl.it/repls/MicroLeftSequences
This is fake data, but it should convey the idea. Each individual has a perceived view (perception), is then presented with the actual outcome, and from that decides their outlook.
Using Python (pandas, or anything open-source really), how do I show the probability—and p-value—of the interaction between these 3 dependent columns (possibly using the age, smokes_cat as potential confounders)?

You can use an interaction plot for this; it fits your case well. I tried it on the dummy data generated in the question, using the code below. Treat it as pseudo-code, though: you will need to tailor it to your data.
In its simplest form:
If the lines in the plot cross, or look likely to cross at other values, you may assume there is an interaction effect.
If the lines are parallel, or unlikely to cross, you may assume there is no interaction effect.
For a deeper understanding, I have added some links that you can check out.
Code
... # The rest of the code in the question.
# Interaction plot
import matplotlib.pyplot as plt
from statsmodels.graphics.factorplots import interaction_plot
p = interaction_plot(
x = df['perception'],
trace=df['outlook_cat'],
response= df['outcome']
)
plt.savefig('./my_interaction_plot.png') # or plt.show()
You can find the documentation of interaction_plot() here. Besides, I also suggest you run an ANOVA.
Further reading
You can check out these links:
(A paper) titled Interaction Effects in ANOVA.
(A case) in practice case.
(Another case) in practice case.

One option is a Multinomial logit model:
# Integer-encode the categorical variables (LabelEncoder assigns integer codes; this is not one-hot encoding)
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
all_enc_df = pd.DataFrame({column: enc.fit_transform(df[column])
for column in ('perception_cat', 'age_cat',
'smokes_cat', 'outlook_cat')})
# Regression
from sklearn.linear_model import LogisticRegression
X, y = (all_enc_df[['age_cat', 'smokes_cat', 'outlook_cat']],
all_enc_df[['perception_cat']])
#clf = LogisticRegression(random_state=0, solver='lbfgs',
# multi_class='multinomial').fit(X, y)
import statsmodels.api as sm
mullogit = sm.MNLogit(y,X)
mulfit = mullogit.fit(method='bfgs', maxiter=100)
print(mulfit.summary())
https://repl.it/repls/MicroLeftSequences

Related

MLE for censored distributions of the exponential family

With the scipy.stats package it is straightforward to fit a distribution to data, e.g. scipy.stats.expon.fit() can be used to fit data to an exponential distribution.
However, I am trying to fit data to a censored/conditional distribution in the exponential family. In other words, using MLE, I am trying to find the maximum of
$$\ell(\theta) = \sum_{i=1}^{n} \left[ \log f(x_i; \theta) - \log F(T; \theta) \right],$$
where $f$ is the PDF of a distribution in the exponential family, $F$ is its corresponding CDF, and $T$ is the censoring threshold (max_delays in the code below).
Mathematically, I have found that the log-likelihood function is convex in the parameter space $\Theta$, so my assumption was that it should be relatively straightforward to apply the scipy.optimize.minimize function. Notice in the above log-likelihood that by taking $T \to \infty$ we obtain the traditional/uncensored MLE problem.
However, I find that even for simple distributions that e.g. the nelder-mead simplex algorithm does not always converge, or that it does converge but the estimated parameters are far off from the true ones. I have attached my code below. Notice that one can choose a distribution, and that the code is generic enough to fit the loc and scale parameters, as well as the optional shape parameters (for e.g. a Beta or Gamma distribution).
My question is: what am I doing wrong to obtain these bad estimates, or sometimes get convergence issues? I have tried a few algorithms but there is not one that easily works, to my surprise as the problem is convex. Are there smoothness issues, and that I need to find a way to use the Jacobian and Hessian in a generic way for this problem?
Are there other methods to tackle this problem? Initially I thought to override fit() function in the scipy.stats.rv class to take care of the censoring with the CDF, but this seemed quite cumbersome. But since the problem is convex, I would guess that using the minimize function of scipy I should be able to easily get the results...
Comments and help are very welcome!
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import expon, gamma, beta, norm
from scipy.optimize import minimize
from scipy.stats import rv_continuous as rv
def log_likelihood(func: rv, delays, max_delays=10**8, **func_pars)->float:
    return np.sum(np.log(func.pdf(delays, **func_pars)+1) - np.log(func.cdf(max_delays, **func_pars)))
def minimize_log_likelihood(func: rv, delays, max_delays):
    # Determine number of parameters to estimate (always 'loc', 'scale', sometimes shape parameters)
    n_pars = 2 + func.numargs
    # Initialize guess (loc, scale, [shapes])
    x0 = np.ones(n_pars)
    def wrapper(params, *args):
        func = args[0]
        delays = args[1]
        max_delays = args[2]
        loc, scale = params[0], params[1]
        # Set 'loc' and 'scale' parameters
        kwargs = {'loc': loc, 'scale': scale}
        # Add shape parameters if existing to kwargs
        if func.shapes is not None:
            for i, s in enumerate(func.shapes.split(', ')):
                kwargs[s] = params[2+i]
        return -log_likelihood(func=func, delays=delays, max_delays=max_delays, **kwargs)
    # Find maximum of log-likelihood (thus minimum of minus log-likelihood; see wrapper function)
    return minimize(wrapper, x0, args=(func, delays, max_delays), options={'disp': True},
                    method='nelder-mead', tol=1e-8)
# Test the code on data from a known distribution, then try to recover its parameters
distribution = expon
dist_pars = {'loc': 0, 'scale': 4}
x = np.linspace(distribution.ppf(0.0001, **dist_pars), distribution.ppf(0.9999, **dist_pars), 1000)
res = minimize_log_likelihood(distribution, x, 10**8)
print(res)
I have found that the convergence is bad due to numerical inaccuracies. Best is to replace
np.log(func.pdf(x, **func_kwargs))
with
func.logpdf(x, **func_kwargs)
This leads to correct estimation of the parameters. The same holds for the CDF. The documentation of scipy also indicates that the numerical accuracy of the latter performs better.
This all works nicely with the Exponential, Normal, Gamma, and chi2 distributions. The Beta distribution still gives me issues, but I think this is again due to some (other) numerical inaccuracies, which I will analyse separately.
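To make the logpdf/logcdf replacement concrete, here is a minimal sketch for the exponential case only. It assumes loc is fixed at 0 and only scale is estimated, and it optimises over log(scale) to keep the scale positive; the function name truncated_nll, the threshold value, and the sample size are illustrative choices, not part of the original code.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import expon

def truncated_nll(log_scale, x, upper):
    """Negative log-likelihood of data observed only when x <= upper."""
    scale = np.exp(log_scale[0])  # optimise on the log scale so scale > 0
    log_f = expon.logpdf(x, loc=0.0, scale=scale)
    log_F = expon.logcdf(upper, loc=0.0, scale=scale)
    return -(np.sum(log_f) - x.size * log_F)

# Draw from a known exponential and keep only values below the threshold.
rng = np.random.default_rng(0)
true_scale, upper = 4.0, 10.0
sample = rng.exponential(true_scale, size=20_000)
sample = sample[sample <= upper]

res = minimize(truncated_nll, x0=[0.0], args=(sample, upper),
               method="Nelder-Mead")
est_scale = float(np.exp(res.x[0]))
print(est_scale)
```

With a sample of this size the estimate should land close to the true scale of 4. Extending the sketch to loc and shape parameters follows the same pattern, with the usual care about parameters that must stay positive.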

Vectorizing a for loop with a pandas dataframe

I am trying to do a project for my physics class where we are supposed to simulate motion of charged particles. We are supposed to randomly generate their positions and charges but we have to have positively charged particles in one region and negatively charged ones anywhere else. Right now, as a proof of concept, I am trying to do only 10 particles but the final project will have at least 1000.
My thought process is to create a dataframe with the first column containing the randomly generated charges and run a loop to see what value I get and place in the same dataframe as the next three columns their generated positions.
I have tried to do a simple for loop going over the rows and inputting the data as I go, but I run into an IndexingError: too many indexers. I also want this to run as efficiently as possible so that if I scale up the number of particles, it doesn't slow as much.
I also want to vectorize the operations of calculating the motion of each particle since it is based on position of every other particle which, through normal loops would take a lot of computational time.
Any vectorization optimization or offloading to GPU would be very helpful, thanks.
# In[1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
# In[2]:
num_points=10
df_position = pd.DataFrame(pd,np.empty((num_points,4)),columns=['Charge','X','Y','Z'])
# In[3]:
charge = np.array([np.random.choice(2,num_points)])
df_position.iloc[:,0]=np.where(df_position["Charge"]==0,-1,1)
# In[4]:
def positive():
    return np.random.uniform(low=0, high=5)
def negative():
    return np.random.uniform(low=5, high=10)
# In[5]:
for row in df_position.itertuples(index=True,name='Charge'):
    if(getattr(row,"Charge")==-1):
        df_position.iloc[row,1]=positive()
        df_position.iloc[row,2]=positive()
        df_position.iloc[row,3]=positive()
    else:
        df_position.iloc[row,1]=negative()
        #this is where I would get the IndexingError and would like to optimize this portion
        df_position.iloc[row,2]=negative()
        df_position.iloc[row,3]=negative()
df_position.iloc[:,0]=np.where(df_position["Charge"]==0,-1,1)
# In[6]:
ax=plt.axes(projection='3d')
ax.set_xlim(0, 10); ax.set_ylim(0, 10); ax.set_zlim(0,10);
xdata=df_position.iloc[:,1]
ydata=df_position.iloc[:,2]
zdata=df_position.iloc[:,3]
chargedata=df_position.iloc[:11,0]
colors = np.where(df_position["Charge"]==1,'r','b')
ax.scatter3D(xdata,ydata,zdata,c=colors,alpha=1)
EDIT:
The dataframe that I want the results in would be something like this
Charge X Y Z
-1
1
-1
-1
1
With the initial coordinates of each charge listed after in their respective columns. It will be a 3D dataframe, as I will need to keep track of all their new positions after each time step so that I can do animations of the motion. Each layer will be exactly the same format.
Some code for creating your dataframe:
import numpy as np
import pandas as pd
num_points = 1_000
# uniform distribution of int, not sure it is the best one for your problem
# positive_point = np.random.randint(0, num_points)
positive_point = int(num_points / 100 * np.random.randn() + num_points / 2)
negative_point = num_points - positive_point
# the charge (+1 or -1) is stored in the index of each frame
positive_df = pd.DataFrame(
    np.random.uniform(0.0, 5.0, size=[positive_point, 3]), index=[1] * positive_point, columns=['X', 'Y', 'Z']
)
negative_df = pd.DataFrame(
    np.random.uniform(5.0, 10.0, size=[negative_point, 3]), index=[-1] * negative_point, columns=['X', 'Y', 'Z']
)
df = pd.concat([positive_df, negative_df])
It is quite fast for 1,000 or 1,000,000.
Edit: my first answer totally missed a big part of the question. This new one should fit better.
Second edit: I now use a better distribution for the number of positive points than a uniform distribution of ints.
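If you would rather keep the Charge column from the question instead of storing the charge in the index, a possible variation is sketched below. It assumes the same regions as the answer above (positive charges in [0, 5), negative charges in [5, 10)) and uses NumPy broadcasting instead of a row loop; the seed and the use of default_rng are illustrative choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
num_points = 1_000

# Random charges of -1 or +1, one per particle.
charge = rng.choice([-1, 1], size=num_points)

# Column vector mask, so it broadcasts across the three coordinates.
is_positive = (charge == 1)[:, None]

# Draw candidates for both regions and pick per row with np.where.
xyz = np.where(
    is_positive,
    rng.uniform(0.0, 5.0, size=(num_points, 3)),
    rng.uniform(5.0, 10.0, size=(num_points, 3)),
)

df_position = pd.DataFrame(xyz, columns=["X", "Y", "Z"])
df_position.insert(0, "Charge", charge)
print(df_position.head())
```

This draws twice as many random numbers as strictly needed, which is usually a fair trade for avoiding the Python-level loop.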

How do I fix KeyError bug in my code while implementing K-Nearest Neighbours from scratch?

I am trying to implement K-Nearest Neighbours algorithm from scratch in Python. The code I wrote worked well for the Breast-Cancer-Wisconsin.csv dataset.
However, the same code when I try to run for Iris.csv dataset, my implementation fails and gives KeyError.
The only difference between the 2 datasets is that in Breast-Cancer-Wisconsin.csv there are only 2 classes ('2' for malignant and '4' for benign) and both labels are integers, whereas in Iris.csv there are 3 classes ('setosa', 'versicolor', 'virginica') and all 3 labels are strings.
Here is the code I wrote (for Iris.csv) :
import numpy as np
from math import sqrt
import matplotlib.pyplot as plt
from matplotlib import style
from collections import Counter
import warnings
import pandas as pd
import random
style.use('fivethirtyeight')
dataset = {'k':[[1,2],[2,3],[3,1]], 'r':[[6,5],[7,7],[8,6]]}
new_features = [5,7]
#[[plt.scatter(j[0],j[1], s=100, color=i) for j in dataset[i]] for i in dataset]
#plt.scatter(new_features[0], new_features[1], s=100)
#plt.show()
def k_nearest_neighbors(data, predict, k=3):
    if len(data) >= k:
        warnings.warn('K is set to a value less than total voting groups!')
    distances = []
    for group in data:
        for features in data[group]:
            euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
            distances.append([euclidean_distance, group])
    votes = [i[1] for i in sorted(distances)[:k]]
    vote_result = Counter(votes).most_common(1)[0][0]
    return vote_result
df = pd.read_csv('iris.csv')
df.replace('?', -99999, inplace=True)
#full_data = df.astype(float).values.tolist()
#random.shuffle(full_data)
test_size = 0.2
train_set = {'setosa':[], 'versicolor':[], 'virginica':[]}
test_set = {'setosa':[], 'versicolor':[], 'virginica':[]}
train_data = full_data[:-int(test_size*len(full_data))]
test_data = full_data[-int(test_size*len(full_data)):]
for i in train_data:
    train_set[i[-1]].append(i[:-1])
for i in test_data:
    test_set[i[-1]].append(i[:-1])
correct = 0
total = 0
for group in test_set:
    for data in test_set[group]:
        vote = k_nearest_neighbors(train_set, data, k=5)
        if group == vote:
            correct += 1
        total += 1
print('Accuracy : ', correct/total)
When I run the above code, I get a KeyError message at line number 49.
Could anyone please explain to me where I am going wrong? Also, it would be great if someone could point out how do I modify this algorithm to classify multiple classes (instead of 2 or 3) in the future?
Also, how do I handle if the classes are in string type instead of integer?
one solution I thought of was to convert all string types to integer types and try to solve but would that work?
REFERENCES
Iris.csv
Breast-Cancer-Wisconsin.csv
Let's start from your last question:
one solution I thought of was to convert all string types to integer types and try to solve but would that work?
Yes, that would work. You shouldn't have to hardcode the names of all the classes of every problem in your code. Instead, you can just write a function that reads all the different values for the class attribute, and assigns a numeric value to each different one.
Could anyone please explain to me where I am going wrong?
Most likely, the problem is that you are reading an instance whose class attribute is not 'setosa', 'versicolor', 'virginica' (something like Iris-setosa perhaps?). The idea above should fix this problem.
Also, it would be great if someone could point out how do I modify this algorithm to classify multiple classes (instead of 2 or 3) in the future?
As discussed above, you just need to avoid hard-coding the names of the classes in your code.
Also, how do I handle if the classes are in string type instead of integer?
def get_class_values(data):
    classes_seen = {}
    for i in data:
        _class = i[-1]  # the class label is the last element of each instance
        if _class not in classes_seen:
            classes_seen[_class] = len(classes_seen)
    return classes_seen
A function like this one would return a mapping between all your classes (no matter the type) and numeric codes (from 0 to N-1). Using this mapping would also solve all the problems mentioned before.
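Along the same lines, the hard-coded train_set and test_set dictionaries can be built from whatever labels appear in the data. The helper below, group_by_label, is a hypothetical name and only a sketch; it assumes each row is a list whose last element is the class label, as in the question's code.

```python
from collections import defaultdict

def group_by_label(rows):
    """Group feature vectors by their class label (last element of each row)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[-1]].append(list(row[:-1]))
    return dict(groups)

# Tiny made-up sample; labels can be strings, integers, or any hashable type.
rows = [
    [5.1, 3.5, "setosa"],
    [7.0, 3.2, "versicolor"],
    [6.3, 3.3, "virginica"],
    [4.9, 3.0, "setosa"],
]
train_set = group_by_label(rows)
print(train_set)
```

Because the keys come from the data itself, a label such as Iris-setosa can no longer trigger a KeyError, and the same code works for any number of classes.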
Convert String Labels In CSV Files To Integer Labels
After going through some GitHub repos I came across a very simple yet elegant piece of code that solves the above problem. Hope it helps those who have faced this problem before (beginners especially!)
# read the csv file
df = pd.read_csv('iris.csv')
# clean the data file
df.replace('?', -99999, inplace=True)
# convert the string classes into integer types.
# integers are assigned from 0 to N-1.
# species is the name of the column which has class labels.
df['species'] = df['species'].astype('category')
df['species_value'] = df['species'].cat.codes
df.drop(['species'], axis=1, inplace=True)
# convert the data frame to list
full_data = df.astype(float).values.tolist()
random.shuffle(full_data)
Post Debugging
Turns out that we need not use the above piece of code also, i.e I can get the answer without explicitly converting the string labels into integer labels (using the above code).
I have posted the original code after some minor changes (below) and the key error is now fixed. Also, I am now getting an accuracy of 97% to 100% (only on IRIS dataset).
test_size = 0.2
train_set = {0:[], 1:[], 2:[]}
test_set = {0:[], 1:[], 2:[]}
That is the only change you need to make to the original code I posted in order to make it work!! Simple!
However, please note that the numbers have to be given as integers and not string (otherwise it would lead to key error!).
Wrap-Up
There are some commented lines in the original code which I thought would be good to explain in case somebody ran into some issues. Here's one snippet with the comments removed (compare with original code in the question).
df = pd.read_csv('iris.csv')
df.replace('?', -99999, inplace=True)
full_data = df.astype(float).values.tolist()
random.shuffle(full_data)
Here's the output you get:
ValueError: could not convert string to float: 'virginica'
What went wrong?
Note that here we did not convert the string labels into integer labels. Therefore, when we tried to convert the data in the CSV to float values, the kernel threw an error because a string cannot be converted to float!
So one way around it is not to convert the data to floating point values at all, and then you won't get this error. However, in many cases you do need to convert all the data to floating point (e.g. for normalisation, accuracy, long mathematical calculations, or to prevent loss of precision).
Hence after heavy debugging and going through a lot of articles I finally came up with a simple version of the original code (below):
import numpy as np
from math import sqrt
import matplotlib.pyplot as plt
from matplotlib import style
from collections import Counter
import warnings
import pandas as pd
import random
def k_nearest_neighbors(data, predict, k=3):
    if len(data) >= k:
        warnings.warn('K is set to a value less than total voting groups!')
    distances = []
    for group in data:
        for features in data[group]:
            euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
            distances.append([euclidean_distance, group])
    votes = [i[1] for i in sorted(distances)[:k]]
    vote_result = Counter(votes).most_common(1)[0][0]
    return vote_result
df = pd.read_csv('iris.csv')
df.replace('?', -99999, inplace=True)
# convert the string labels to integer codes 0..N-1
df['species'] = df['species'].astype('category')
df['species_value'] = df['species'].cat.codes
df.drop(['species'], axis=1, inplace=True)
full_data = df.astype(float).values.tolist()
random.shuffle(full_data)
test_size = 0.2
train_set = {0:[], 1:[], 2:[]}
test_set = {0:[], 1:[], 2:[]}
train_data = full_data[:-int(test_size*len(full_data))]
test_data = full_data[-int(test_size*len(full_data)):]
for i in train_data:
    train_set[i[-1]].append(i[:-1])
for i in test_data:
    test_set[i[-1]].append(i[:-1])
correct = 0
total = 0
for group in test_set:
    for data in test_set[group]:
        vote = k_nearest_neighbors(train_set, data, k=5)
        if group == vote:
            correct += 1
        total += 1
print('Accuracy : ', (correct/total)*100,'%')
Hope this helps!

Gradient descent cost increases with each iteration in linear regression with one feature

Hi, I am learning some machine learning algorithms and, for the sake of understanding, I was trying to implement linear regression with one feature, using the residual sum of squares as the cost function for gradient descent, as below:
My pseudocode:
while not converge
w <- w - step*gradient
python code
Linear.py
import math
import numpy as num
def get_regression_predictions(input_feature, intercept, slope):
    predicted_output = [intercept + xi*slope for xi in input_feature]
    return(predicted_output)
def rss(input_feature, output, intercept,slope):
    return sum( [ ( output.iloc[i] - (intercept + slope*input_feature.iloc[i]) )**2 for i in range(len(output))])
def train(input_feature,output,intercept,slope):
    file = open("train.csv","w")
    file.write("ID,intercept,slope,RSS\n")
    i =0
    while True:
        print("RSS:",rss(input_feature, output, intercept,slope))
        file.write(str(i)+","+str(intercept)+","+str(slope)+","+str(rss(input_feature, output, intercept,slope))+"\n")
        i+=1
        gradient = [derivative(input_feature, output, intercept,slope,n) for n in range(0,2) ]
        step = 0.05
        intercept -= step*gradient[0]
        slope-= step*gradient[1]
    return intercept,slope
def derivative(input_feature, output, intercept,slope,n):
    if n==0:
        return sum( [ -2*(output.iloc[i] - (intercept + slope*input_feature.iloc[i])) for i in range(0,len(output))] )
    return sum( [ -2*(output.iloc[i] - (intercept + slope*input_feature.iloc[i]))*input_feature.iloc[i] for i in range(0,len(output))] )
With the main program:
import Linear as lin
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
df = pd.read_csv("test2.csv")
train = df
lin.train(train["X"],train["Y"], 0, 0)
The test2.csv:
X,Y
0,1
1,3
2,7
3,13
4,21
I recorded the value of RSS in a file and noticed that it became worse at each iteration, as follows:
ID,intercept,slope,RSS
0,0,0,669
1,4.5,14.0,3585.25
2,-7.25,-18.5,19714.3125
3,19.375,58.25,108855.953125
Mathematically this doesn't make sense to me. I have reviewed my code many times and I think it is correct. Am I doing something else wrong?
If your cost isn't decreasing, that's usually a sign you're overshooting with your gradient descent approach, meaning too large of a step size.
A smaller step size can help. You can also look into methods for variable step sizes, which can change each iteration to get you nice convergence properties and speed; usually, these methods change the step size with some proportionality to the gradient. Of course, the specifics depend on each problem.
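To illustrate the step-size point on the data from the question, here is a minimal NumPy sketch. It is a rewrite for illustration, not the asker's code: the fixed iteration count and the step of 0.01 are assumptions chosen for this small dataset. For this particular RSS, steps above roughly 0.03 overshoot, which is why 0.05 diverges.

```python
import numpy as np

# Data from test2.csv in the question.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 7.0, 13.0, 21.0])

def rss(intercept, slope):
    return float(np.sum((y - (intercept + slope * x)) ** 2))

def gradient(intercept, slope):
    residual = y - (intercept + slope * x)
    return -2.0 * residual.sum(), -2.0 * (residual * x).sum()

def fit(step, n_iter):
    """Plain gradient descent from (0, 0); returns parameters and RSS history."""
    intercept, slope = 0.0, 0.0
    history = [rss(intercept, slope)]
    for _ in range(n_iter):
        g_intercept, g_slope = gradient(intercept, slope)
        intercept -= step * g_intercept
        slope -= step * g_slope
        history.append(rss(intercept, slope))
    return intercept, slope, history

# Smaller step: the cost decreases and the parameters settle.
intercept, slope, history = fit(step=0.01, n_iter=2000)
print(intercept, slope, history[-1])

# The step from the question: the cost grows at every iteration.
_, _, diverging = fit(step=0.05, n_iter=3)
print(diverging)
```

The second call reproduces the growing RSS values listed in the question, while the first converges towards the least-squares line for this data.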

Apply power fit to data by using levenberg-marquardt algorithm in python

Hi everybody!
I am a beginner in Python and data analysis, and I ran into a problem while fitting a power function to my data.
Here I plotted my dataset as a scatterplot
I want to plot a power function with an exponent around -1, but after applying the Levenberg-Marquardt method using the lmfit library in Python, I get the following faulty image. I tried modifying the initial parameters, but it didn't help.
Here is my code:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from lmfit import minimize, Parameters, Parameter, report_fit
be = pd.read_table('...',
skipinitialspace=True,
names = ["CoM", "slope", "slope2"])
x=be["CoM"]
data=be["slope"]
def fcn2min(params, x, data):
    n2 = params['n2'].value
    n1 = params['n1'].value
    model = n1 * x ** n2
    return model - data #that's what you want to minimize
# create a set of Parameters
# 'value' is the initial condition
params = Parameters()
params.add('n2', value= -1.00)
params.add('n1',value= 23.0)
# do fit, here with leastsq model
result = minimize(fcn2min, params, args=(be["CoM"],be["slope"]))
#calculate final result
final = data + result.residual
resid = result.residual
# write error report
report_fit(result)
#plot results
xplot = x
yplot = result.params['n1'].value * x ** result.params['n2'].value
plt.figure(figsize=(15,6))
plt.ylabel('OD-slope',fontsize=18, color='blue')
plt.xlabel('CoM height_Sz [m]',fontsize=18, color='blue')
plt.plot(be["CoM"],be["slope"],"o", label="slope_flat")
plt.plot(be["CoM"],be["slope2"],"+",color='r', label="slope_curv")
plt.plot(xplot,yplot)
plt.legend()
plt.savefig('plot2')
plt.show()
I don't quite understand what is the problem with this, so if you have any observations, thank you very much.
It's a little hard to tell what the question is. It looks to me like the fit completed and gave a reasonably good fit, but you don't provide the fit statistics or report of the parameters.
If you're asking about all the green lines for the "COM" array (the best fit?), this is almost certainly because the starting x axis "height_Sz" data was not sorted to be strictly increasing. That's OK for the fit, but plotting an X-Y trace with a line expects the data to be in order.
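A small sketch of the sorting fix: reorder x before evaluating and plotting the fitted curve. The x values and the parameters n1 and n2 below are made-up placeholders, not the asker's data or fit results.

```python
import numpy as np

# Placeholder x values in arbitrary (unsorted) order.
x = np.array([3.0, 1.0, 4.0, 2.0])

# Placeholder fit parameters; in the question these would come from
# result.params['n1'].value and result.params['n2'].value.
n1, n2 = 23.0, -1.0

# Sort x, then evaluate the model on the sorted values.
order = np.argsort(x)
x_sorted = x[order]
y_sorted = n1 * x_sorted ** n2
print(x_sorted, y_sorted)
```

Passing x_sorted and y_sorted to plt.plot then draws a single curve instead of lines zig-zagging between unordered points. The scatter of the raw data does not need sorting.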
