Standardize Regressors in sklearn - python-3.x

I'm working with sklearn and I'm wondering how StandardScaler() is used appropriately. I built a function that switches between Ridge and Lasso regression and takes the alpha value, the regressors X, and the predicted variable y. All regressors should be standardized.
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()  # Standardize regressors by removing the mean and scaling to unit variance

def do_penalized_regression(X, y, penalty, type):
    if type == "ridge":
        lm = Ridge(alpha=penalty, normalize=False)
    elif type == "lasso":
        lm = Lasso(alpha=penalty, normalize=False)
    # fit the model on the standardized regressors
    lm.fit(scaler.fit_transform(X), y)
    return lm
Is this the way to go or should I standardize the regressors in advance?

You can use sklearn.pipeline.make_pipeline:
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
model = make_pipeline(StandardScaler(), lm)
model.fit(X, y)
...
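For instance, the original function could be rewritten so the scaler and the model travel together in one estimator. A minimal sketch, assuming the same Ridge/Lasso setup as above (the type argument is renamed to kind here only to avoid shadowing the built-in):
from sklearn.linear_model import Ridge, Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def do_penalized_regression(X, y, penalty, kind):
    # pick the penalized linear model; alpha is the regularization strength
    if kind == "ridge":
        lm = Ridge(alpha=penalty)
    elif kind == "lasso":
        lm = Lasso(alpha=penalty)
    # the pipeline fits the scaler on X, then fits the model on the scaled regressors
    model = make_pipeline(StandardScaler(), lm)
    model.fit(X, y)
    return model
Calling model.predict(X_new) then applies the fitted scaling before predicting, so there is no need to standardize the regressors in advance.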

Related

How to return the features used in a decision tree created by DecisionTreeClassifier in sklearn

I want to do feature selection on my data set with CART and C4.5 decision trees: apply a decision tree to the data set and then extract the features the decision tree algorithm uses to create the tree. So I need to return the features used in the created tree. I use DecisionTreeClassifier from the sklearn.tree module, and I need a method or function that gives me (returns) the features used in the created tree, so I can use them as the more important features in my main modulation algorithm.
You can approach the problem as shown below:
I assume you have the train (x_train, y_train) and test (x_test, y_test) sets.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
tree_clf1 = DecisionTreeClassifier().fit(x_train, y_train)
y_pred = tree_clf1.predict(x_test)
print(confusion_matrix(y_test, y_pred))
print("\n\nAccuracy:{:,.2f}%".format(accuracy_score(y_test, y_pred)*100))
print("Precision:{:,.2f}%".format(precision_score(y_test, y_pred)*100))
print("Recall:{:,.2f}%".format(recall_score(y_test, y_pred)*100))
print("F1-Score:{:,.2f}%".format(f1_score(y_test, y_pred)*100))
feature_importances = pd.DataFrame(tree_clf1.feature_importances_,
                                   index=x_train.columns,
                                   columns=['importance']).sort_values('importance',
                                                                       ascending=False)
print(feature_importances)
The printed feature_importances table shows which features are important for your classification.
Below is an example of getting the top 5 most important features from a decision tree.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

decisionTreeModel = DecisionTreeClassifier(random_state=0)
decisionTreeModel.fit(X_train2, y_train2)

most_features_frame = pd.DataFrame(
    data=decisionTreeModel.feature_importances_,
    columns=["importance"],
    index=X_train2.columns,
).sort_values(by=["importance"], ascending=False)
# print(most_features_frame)

top_5_feature = most_features_frame.index[:5]
top_5_feature_list = list(top_5_feature)
print(top_5_feature_list)
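If you need the features that actually appear in the tree's splits (rather than a ranking of all importances), one option is to read the per-node feature indices from the fitted tree_ attribute and drop the leaf nodes. A rough sketch, assuming the fitted tree_clf1 and the DataFrame x_train from the first snippet:
import numpy as np

# tree_.feature holds the split feature index for every node; leaf nodes are marked with -2
node_features = tree_clf1.tree_.feature
used_idx = np.unique(node_features[node_features >= 0])
used_features = [x_train.columns[i] for i in used_idx]
print(used_features)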

Why is my sklearn linear regression model producing perfect predictions?

I'm trying to do multiple linear regression with sklearn and I have performed the following steps. However, when it comes to predicting y_pred using the trained model I am getting a perfect r^2 = 1.0. Does anyone know why this is the case/what's going wrong with my code?
Also sorry I'm new to this site so I'm not fully up to speed with the formatting/etiquette of questions!
import numpy as np
import pandas as pd
# Import and subset data
ml_data_all = pd.read_excel('C:/Users/User/Documents/RSEM/STADM/Coursework/Crime_SF/Machine_learning_collated_data.xlsx')
ml_data_1218 = ml_data_all[ml_data_all['Year'] >= 2012]
ml_data_1218.drop(columns=['Pop_MOE',
                           'Pop_density_MOE',
                           'Age_median_MOE',
                           'Sex_ratio_MOE',
                           'Income_median_household_MOE',
                           'Pop_total_pov_status_determ_MOE',
                           'Pop_total_50percent_pov_MOE',
                           'Pop_total_125percent_pov_MOE',
                           'Poverty_percent_below_MOE',
                           'Total_labourforceMOE',
                           'Unemployed_total_MOE',
                           'Unemployed_total_male_MOE'], inplace=True)
# Taking care of missing data
# Delete rows containing any NaNs
ml_data_1218.dropna(axis=0,
                    how='any',
                    inplace=True)
# DATA PREPROCESSING
# Defining X and y
X = ml_data_1218.drop(columns=['Year']).values
y = ml_data_1218['Burglaries '].values
# Encoding categorical data
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
transformer = ColumnTransformer(transformers=[("cat", OneHotEncoder(), [0])], remainder='passthrough')
X = transformer.fit_transform(X)
X.toarray()
X = pd.DataFrame.sparse.from_spmatrix(X)
# Split into Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# Feature scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train.iloc[:,149:] = sc_X.fit_transform(X_train.iloc[:,149:])
X_test.iloc[:,149:] = sc_X.transform(X_test.iloc[:,149:])
# Fitting multiple linear regression to training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# Predicting test set results
y_pred = regressor.predict(X_test)
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)
So it turns out it was a stupid mistake in the end: I forgot to drop the dependent variable (Burglaries) from the X columns, hence why the linear regression model was making perfect predictions. Now it's working (r2 = 0.56). Thanks everyone!
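For reference, a minimal sketch of the fix, assuming the column names used above (note the trailing space in 'Burglaries '):
# drop both the year and the dependent variable from the regressors
X = ml_data_1218.drop(columns=['Year', 'Burglaries ']).values
y = ml_data_1218['Burglaries '].values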
With regression, it's often a good idea to run a correlation matrix against all of your variables (IVs and the DV). Regression likes parsimony, so removing IVs that are functionally the same (and just leaving one in the model) is better for R^2 value (aka model fit). Also, if something is correlated at .97 or higher with the DV, it is basically a substitute for the DV and all the other data is most likely superfluous.
When reading your issue (before I saw your "Answer") I was thinking "either this person has outrageous correlation issues or the DV is also in the prediction data."
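For example, with pandas you could inspect the correlations with the dependent variable before fitting. A rough sketch, assuming the cleaned ml_data_1218 frame from the question:
# Pearson correlations between the variables (newer pandas may need corr(numeric_only=True))
corr = ml_data_1218.corr()
# correlation of every variable with the dependent variable, strongest first
print(corr['Burglaries '].abs().sort_values(ascending=False))
Any regressor that correlates near 1.0 with the DV (or with another regressor) is a candidate for removal.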

Scaling in scikit-learn permutation_test_score

I'm using the scikit-learn "permutation_test_score" method to evaluate the significance of my estimator's performance. Unfortunately, I cannot tell from the scikit-learn documentation whether the method applies any scaling to the data. I usually standardise my data with a StandardScaler, applying the training-set standardisation to the test set.
The function itself does not apply any scaling.
Here is an example from the documentation:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import permutation_test_score
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target
n_classes = np.unique(y).size
# Some noisy data not correlated
random = np.random.RandomState(seed=0)
E = random.normal(size=(len(X), 2200))
# Add noisy data to the informative features to make the task harder
X = np.c_[X, E]
svm = SVC(kernel='linear')
cv = StratifiedKFold(2)
score, permutation_scores, pvalue = permutation_test_score(
    svm, X, y, scoring="accuracy", cv=cv, n_permutations=100, n_jobs=1)
However, what you may want to do is pass permutation_test_score a pipeline that applies the scaling. That way the scaler is fit only on the training portion of each cross-validation split, so no information from the held-out data leaks into the standardisation.
Example:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
pipe = Pipeline([('scaler', StandardScaler()), ('clf', SVC(kernel='linear'))])
score, permutation_scores, pvalue = permutation_test_score(
    pipe, X, y, scoring="accuracy", cv=cv, n_permutations=100, n_jobs=1)

Sklearn elastic-net logistic regression (SGDClassifier) does not return a probability

In sklearn, when using SGDClassifier for elastic-net logistic regression, the predict_proba function returns the same thing as the predict function.
AKA the code below (with X and y the predictors and binary label respectively) returns True:
EN = sklearn.linear_model.SGDClassifier(loss='log', penalty='elasticnet',
                                        alpha=0.0001, l1_ratio=0.15)
EN.fit(X[train], y[train])
numpy.all(EN.predict(X[test]) == EN.predict_proba(X[test])[:,1])
How to obtain probability values?
It seems that the sklearn version is the problem. You need to upgrade to 0.18.2.
Example using iris data:
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
import numpy
import sklearn
data = load_iris()
x = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 42)
EN = SGDClassifier(loss='log', penalty='elasticnet', alpha=0.0001, l1_ratio=0.15)
EN.fit(X_train, y_train)
numpy.all(EN.predict(X_test) == EN.predict_proba(X_test)[:,1])
sklearn.__version__
Result
False
'0.18.2'
So with sklearn 0.18.2 it works fine.
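As a quick sanity check once the version issue is resolved, predict_proba should return one probability per class, with each row summing to 1. A small sketch, reusing EN and X_test from above:
proba = EN.predict_proba(X_test)
print(proba.shape)                           # (n_samples, n_classes)
print(numpy.allclose(proba.sum(axis=1), 1))  # each row is a valid probability distribution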

trying a custom computation of grid.best_score_ (obtained with GridSearchCV)

I'm trying to recompute the grid.best_score_ I obtained on my own data, without success... So I tried it on a conventional dataset, but with no more success. Here is the code:
from sklearn import datasets
from sklearn import linear_model
from sklearn.cross_validation import ShuffleSplit
from sklearn import grid_search
from sklearn.metrics import r2_score
import numpy as np
lr = linear_model.LinearRegression()
boston = datasets.load_boston()
target = boston.target
param_grid = {'fit_intercept':[False]}
cv = ShuffleSplit(target.size, n_iter=5, test_size=0.30, random_state=0)
grid = grid_search.GridSearchCV(lr, param_grid, cv=cv)
grid.fit(boston.data, target)
# got cv score computed by gridSearchCV :
print grid.best_score_
0.677708680059
# now try a custom computation of cv score
cv_scores = []
for (train, test) in cv:
    y_true = target[test]
    y_pred = grid.best_estimator_.predict(boston.data[test,:])
    cv_scores.append(r2_score(y_true, y_pred))
print np.mean(cv_scores)
0.703865991851
I can't see why it's different; GridSearchCV is supposed to use the scorer from LinearRegression, which is the r2 score. Maybe the way I compute the cv score is not the way best_score_ is computed... I'm asking here before going through the GridSearchCV code.
Unless refit=False in the GridSearchCV constructor, the winning estimator is refit on the entire dataset at the end of fit. best_score_ is the estimator's average score using the cross-validation splits, while best_estimator_ is an estimator of the winning configuration fit on all the data.
lr2 = linear_model.LinearRegression(fit_intercept=False)
scores2 = [lr2.fit(boston.data[train,:], target[train]).score(boston.data[test,:], target[test])
           for train, test in cv]
print np.mean(scores2)
Will print 0.67770868005943297.
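Equivalently, a rough cross-check under the same old API, reusing lr2 and cv from above and assuming cross_val_score from sklearn.cross_validation (which falls back to the estimator's default R2 scorer for regression):
from sklearn.cross_validation import cross_val_score

# refit lr2 on each training split and average the R2 on the corresponding test splits;
# the mean should match grid.best_score_
print(np.mean(cross_val_score(lr2, boston.data, target, cv=cv)))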

Resources