Perform Custom GLM using sklearn/Scikit-Learn - python-3.x

I was looking to implement a custom GLM using sklearn/Scikit-learn. The same is possible with statsmodels; for example, with statsmodels we could use the code below:
import pandas as pd
import statsmodels.api as sm
data = [(300,1),(200,0),(170,1),(420,1),(240,1),(133,0),(323,1),(150,0),(230,0),(499,0)]
Labels = ['datapoint','value']
df = pd.DataFrame.from_records(data, columns=Labels)
glm_linear = sm.GLM(df.value, df.datapoint, family=sm.families.Gaussian(sm.families.links.identity()))
res = glm_linear.fit()
print(res.summary())
Here, as we can see, we can pass any link function and distribution using the family argument of sm.GLM.
I was looking for something similar in sklearn.
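For instance (an illustrative sketch, not from the original question, reusing the same df), the same family argument accepts other distribution/link pairs, such as a Binomial family with a logit link:
# Hypothetical example of another family/link choice with statsmodels
glm_logit = sm.GLM(df.value, df.datapoint,
                   family=sm.families.Binomial(sm.families.links.Logit()))
print(glm_logit.fit().summary())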

You can use sklearn's TweedieRegressor with the parameter power=0 to specify the normal (Gaussian) distribution:
from sklearn.linear_model import TweedieRegressor
import pandas as pd
data = [(300,1), (200,0), (170,1), (420,1), (240,1), (133,0), (323,1), (150,0), (230,0), (499,0)]
Labels = ['datapoint','value']
df = pd.DataFrame.from_records(data, columns=Labels)
X, y = df.datapoint, df.value
glm_gaussian = TweedieRegressor(power=0, fit_intercept=False)
glm_gaussian.fit(X.to_numpy()[:, None], y)
print(glm_gaussian.coef_)
array([0.00173114])
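TweedieRegressor also covers other members of the exponential-dispersion family through its power and link parameters, so the statsmodels family/link choice maps onto those two arguments. A minimal sketch (the Poisson choice here is illustrative, reusing X and y from above):
# power=1 corresponds to a Poisson GLM, power=2 to Gamma; link may be 'auto', 'identity' or 'log'
glm_poisson = TweedieRegressor(power=1, link='log', fit_intercept=False)
glm_poisson.fit(X.to_numpy()[:, None], y)
print(glm_poisson.coef_)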

Related

Drop the features that have low correlation with the target variable

I have loaded a dataset and tried to find the correlation coefficients with respect to the target variable.
Below is the code:
from sklearn.datasets import load_boston
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
#Loading the dataset
x = load_boston()
df = pd.DataFrame(x.data, columns = x.feature_names)
df["MEDV"] = x.target
X = df.drop("MEDV",1) #Feature Matrix
y = df["MEDV"] #Target Variable
df.head()
#Using Pearson Correlation
plt.figure(figsize=(12,10))
cor = df.corr()
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
plt.show()
#Correlation with output variable
cor_target = abs(cor["MEDV"])
#Selecting highly correlated features
relevant_features = cor_target[cor_target>0.4]
print(relevant_features)
How do I drop the features that have correlation coefficient < 0.4?
Try this:
#Selecting the least correlated features
irrelevant_features = cor_target[cor_target < 0.4]
#List of column names to drop
cols = list(irrelevant_features.index)
#Dropping the irrelevant features
df = df.drop(cols, axis=1)
#The remaining, highly correlated features
relevant_features = cor_target[cor_target > 0.4]
print(relevant_features)
Alternatively, drop the low-correlation columns explicitly by name:
X = df.drop(['MEDV', 'CRIM', 'ZN', 'CHAS', 'AGE', 'DIS', 'RAD', 'B'], axis=1)
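More generally, the same idea works with an adjustable threshold. A small sketch, assuming df is the original Boston DataFrame (before any columns were dropped) and cor_target is the absolute-correlation Series from the question:
threshold = 0.4
# Keep features whose |correlation| with MEDV meets the threshold; MEDV itself correlates 1.0
keep = cor_target[cor_target >= threshold].index.drop("MEDV")
X = df[keep]
y = df["MEDV"]
print(list(keep))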

Python - Need help in solving "Load the R data set mtcars as a pandas dataframe." problem

I am working on this problem and unsure how to proceed.
Load the R data set mtcars as a pandas dataframe.
Build a linear regression model by considering the log of independent variable wt, and log of dependent variable mpg.
Fit the model with data.
Perform ANOVA on the linear model obtained in the previous step. (Hint: use anova.anova_lm.)
Display the F-statistic value.
I see that the solution below was provided in another post, but it doesn't seem to work.
import statsmodels.api as sm
import numpy as np
mtcars = sm.datasets.get_rdataset('mtcars')
mtcars_data = mtcars.data
liner_model = sm.formula.ols('np.log(wt) ~ np.log(mpg)',mtcars_data)
liner_result = liner_model.fit()
print(liner_result.rsquared)
Fixed it:
import statsmodels.api as sm
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats import anova
mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
df = pd.DataFrame(mtcars)
model = smf.ols(formula='np.log(mpg) ~ np.log(wt)', data=mtcars).fit()
print(anova.anova_lm(model))
print(anova.anova_lm(model).F["np.log(wt)"])
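For reference, anova_lm returns a DataFrame, so the corresponding p-value can be read the same way as the F value (the "PR(>F)" column name is how statsmodels labels it, to the best of my knowledge):
print(anova.anova_lm(model)["PR(>F)"]["np.log(wt)"])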

How do I fix the "If using all scalar values, you must pass an index" error?

I am manually building a linear regression model for learning purposes, without using the built-in function. I am getting the error while plotting the regression line. Kindly help me fix it.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sb
data = {'X': list(np.arange(0,10,1)), 'Y': [1,3,2,5,7,8,8,9,10,12]}
df = pd.DataFrame(data)
df2 = pd.DataFrame(np.ones(10), columns = ['ones'])
df_new = pd.concat([df2,df], axis = 1)
X = df_new.loc[:, ['ones', 'X']].values
Y = df_new['Y'].values.reshape(-1,1)
theta = np.array([0.5, 0.2]).reshape(-1,1)
Y_pred = X.dot(theta)
sb.lineplot(df['X'].values.reshape(-1,1),Y_pred)
plt.show()
Error message:
If using all scalar values, you must pass an index
You are passing a 2d array, while seaborn's lineplot expects a 1d array (or a pandas column, which is essentially the same). So change it to:
sb.lineplot(df['X'].values,Y_pred.reshape(-1))
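Note that recent seaborn releases expect the data to be passed as keyword arguments, so on a newer version the equivalent call would look roughly like this (same arrays as above):
sb.lineplot(x=df['X'].values, y=Y_pred.reshape(-1))
plt.show()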

Unable to use "from sklearn.preprocessing import Imputer" , it shows the exception " Data must be 1-dimensional"

I have made a model for an artificial neural network (ANN). I want to preprocess the data before training the model.
I have tried the code given below.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('Update-Detaset with hacking1.csv')
y=[]
X = dataset.iloc[:,2:7]
y = dataset.iloc[:,8]
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
Y = np.reshape(y,(-1,1))
imputer = imputer.fit(Y)
Y= imputer.transform(Y)
Exception: Data must be 1-dimensional
Here, Update-Detaset with hacking1.csv is the .csv file; a sample of the dataset was linked in the original question. Running the code raises the error shown above.
How can I solve this?
This has nothing to do with Imputer; you can tell from the line number that raised the exception. The error comes from trying to reshape a pandas Series with np.reshape. Change
y = dataset.iloc[:,8]
to
y = dataset.iloc[:,8].values
and it should work.
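Also note that Imputer has been removed from sklearn.preprocessing in recent scikit-learn releases; the replacement is SimpleImputer from sklearn.impute. A rough equivalent of the snippet above (assuming the same dataset and column index):
from sklearn.impute import SimpleImputer
# Impute missing values in the target column with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
Y = imputer.fit_transform(dataset.iloc[:, 8].values.reshape(-1, 1))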

How to loop through items in pandas col and run and plot a scikit model?

I got some interesting user data from races. I know when the respective athletes planned to finish a race and I know when they actually finished (among other things). The goal is to find out when the athletes come in late. I want to run a support vector machine for each athlete and plot the decision boundaries.
Here is what I do:
import numpy as np
import pandas as pd
from sklearn import svm
from mlxtend.plotting import plot_decision_regions
import matplotlib.pyplot as plt
# Create arbitrary dataset for example
df = pd.DataFrame({'User': np.random.random_integers(low=1, high=4, size=50),
                   'Planned_End': np.random.uniform(low=-5, high=5, size=50),
                   'Actual_End': np.random.uniform(low=-1, high=1, size=50),
                   'Late': np.random.random_integers(low=0, high=2, size=50)}
                  )
# Fit Support Vector Machine Classifier
X = df[['Planned_End', 'Actual_End']]
y = df['Late']
clf = svm.SVC(decision_function_shape='ovo')
for i, y in df['User']:
    clf.fit(X, y)
    ax = plt.subplot()
    fig = plot_decision_regions(X=X, y=y, clf=clf, legend=2)
    plt.title(lab)
    plt.show()
I get the following error: TypeError: 'numpy.int64' object is not iterable - that is, I somehow can't loop through the column.
I guess it comes down to the numpy data format? How can I solve that?
Try iteritems():
for i, y in df['User'].iteritems():
Your User Series contains numpy.int64 objects so you can only use:
for y in df['User']:
And you don't use i anywhere.
As for the rest of the code, the following produces a working example; please adapt it as needed:
import numpy as np
import pandas as pd
from sklearn import svm
from mlxtend.plotting import plot_decision_regions
import matplotlib.pyplot as plt
# Create arbitrary dataset for example
df = pd.DataFrame({'User': np.random.random_integers(low=1, high=4, size=50),
                   'Planned_End': np.random.uniform(low=-5, high=5, size=50),
                   'Actual_End': np.random.uniform(low=-1, high=1, size=50),
                   'Late': np.random.random_integers(low=0, high=2, size=50)}
                  )
# Fit Support Vector Machine Classifier
X = df[['Planned_End', 'Actual_End']].to_numpy()  # .as_matrix() was removed in newer pandas
y = df['Late']
clf = svm.SVC(decision_function_shape='ovo')
y = df['User'].values
clf.fit(X, y)
ax = plt.subplot()
fig = plot_decision_regions(X=X, y=y, clf=clf, legend=2)
plt.title('lab')
plt.show()
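If the goal really is one classifier per athlete, a hedged sketch of that loop could use groupby on the same df and imports as above (note that each athlete's subset needs at least two distinct Late classes for SVC to fit):
# Fit and plot one SVM per user
for user, group in df.groupby('User'):
    X_u = group[['Planned_End', 'Actual_End']].to_numpy()
    y_u = group['Late'].to_numpy()
    clf = svm.SVC(decision_function_shape='ovo')
    clf.fit(X_u, y_u)
    plot_decision_regions(X=X_u, y=y_u, clf=clf, legend=2)
    plt.title(f'User {user}')
    plt.show()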
