Scaling in scikit-learn permutation_test_score

I'm using the scikit-learn permutation_test_score method to evaluate the significance of my estimator's performance. Unfortunately, I cannot tell from the scikit-learn documentation whether the method applies any scaling to the data. I usually standardise my data with a StandardScaler, fitting it on the training set and applying that same transformation to the test set.

The function itself does not apply any scaling.
Here is an example from the documentation:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import permutation_test_score
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target
n_classes = np.unique(y).size
# Some noisy, uncorrelated features
random = np.random.RandomState(seed=0)
E = random.normal(size=(len(X), 2200))
# Add the noisy features to the informative ones to make the task harder
X = np.c_[X, E]
svm = SVC(kernel='linear')
cv = StratifiedKFold(2)
score, permutation_scores, pvalue = permutation_test_score(
    svm, X, y, scoring="accuracy", cv=cv, n_permutations=100, n_jobs=1)
However, what you may want to do is pass permutation_test_score a pipeline that applies the scaling.
Example:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
pipe = Pipeline([('scaler', StandardScaler()), ('clf', SVC(kernel='linear'))])
score, permutation_scores, pvalue = permutation_test_score(
    pipe, X, y, scoring="accuracy", cv=cv, n_permutations=100, n_jobs=1)
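Inside permutation_test_score the whole pipeline is refit on each training fold, so the StandardScaler is fit on training data only and each test fold is transformed with the training-set statistics, exactly as you do manually.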

Related

xgboost feature importance high but doesn't produce a better model

I am using XGBoost for a binary prediction problem. I tested my model with several features and got some good results.
After adding one feature to the model and recalculating the feature importance, this feature's importance turned out to be very high, far higher than that of the other features. However, when testing the model, the test score dropped considerably.
Is there an explanation for this kind of behaviour?
There are at least a few ways to run feature importance experiments.
# Let's load the packages
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
import shap
from matplotlib import pyplot as plt
plt.rcParams.update({'figure.figsize': (12.0, 8.0)})
plt.rcParams.update({'font.size': 14})
# Note: load_boston was removed in scikit-learn 1.2; on newer versions
# substitute another dataset such as fetch_california_housing
boston = load_boston()
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = boston.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=12)
rf = RandomForestRegressor(n_estimators=100)
rf.fit(X_train, y_train)
# 1: impurity-based feature importances from the fitted forest
sorted_idx = rf.feature_importances_.argsort()
plt.barh(boston.feature_names[sorted_idx], rf.feature_importances_[sorted_idx])
plt.xlabel("Random Forest Feature Importance")
plt.show()
# 2: permutation importance on the held-out test set
perm_importance = permutation_importance(rf, X_test, y_test)
sorted_idx = perm_importance.importances_mean.argsort()
plt.barh(boston.feature_names[sorted_idx], perm_importance.importances_mean[sorted_idx])
plt.xlabel("Permutation Importance")
plt.show()
# 3: mean absolute SHAP values
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, plot_type="bar")
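If a feature ranks high on the impurity-based importances (method 1) but low, or even negative, on the permutation importances computed on the held-out test set (method 2), that is a strong hint the model is overfitting to that feature, which would match the drop in test score you describe.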
Also, you can certainly add more data to your model; models almost always produce more accurate results when they 'see' more data. Finally, you can always test other models on your dataset and see how they perform. Today at work I tested an XGBoost model against a RandomForestRegressor. I expected the former to perform better, but the latter actually performed much better. It's almost impossible to guess which model will perform best on a given dataset: try multiple models, check the predictive capability of each, and pick the one (or maybe two) that performs best. Having said that, you can try something like this.
import time
import numpy as np
import matplotlib.pyplot as plt
from sklearn import cluster, datasets
from sklearn.neighbors import kneighbors_graph
from sklearn.preprocessing import StandardScaler
np.random.seed(0)
# The original answer read its own CSV; that part is left commented out:
# import pandas as pd
# df = pd.read_csv('C:\\your_path_here\\test.csv')
# df = df[:10000].fillna(0)
# X = df[['RatingScore', 'Par', 'Term', 'TimeToMaturity',
#         'LRMScore', 'Coupon', 'Price']]
# y = df[['Spread']]
# A self-contained toy dataset so the example runs as-is
n_samples = 1500
X, y = datasets.make_blobs(n_samples=n_samples, random_state=8)
colors = np.array([x for x in 'bgrcmykbgrcmykbgrcmykbgrcmyk'])
colors = np.hstack([colors] * 20)
clustering_names = [
    'MiniBatchKMeans', 'AffinityPropagation', 'MeanShift',
    'SpectralClustering', 'Ward', 'AgglomerativeClustering',
    'DBSCAN', 'Birch']
plt.figure(figsize=(len(clustering_names) * 2 + 3, 9.5))
plt.subplots_adjust(left=.02, right=.98, bottom=.001, top=.96,
                    wspace=.05, hspace=.01)
plot_num = 1
# normalize dataset for easier parameter selection
X = StandardScaler().fit_transform(X)
# estimate bandwidth for mean shift
bandwidth = cluster.estimate_bandwidth(X, quantile=0.3)
# connectivity matrix for structured Ward
connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)
# make connectivity symmetric
connectivity = 0.5 * (connectivity + connectivity.T)
# create clustering estimators
ms = cluster.MeanShift(bandwidth=bandwidth, bin_seeding=True)
two_means = cluster.MiniBatchKMeans(n_clusters=2)
ward = cluster.AgglomerativeClustering(n_clusters=2, linkage='ward',
                                       connectivity=connectivity)
spectral = cluster.SpectralClustering(n_clusters=2,
                                      eigen_solver='arpack',
                                      affinity="nearest_neighbors")
dbscan = cluster.DBSCAN(eps=.2)
affinity_propagation = cluster.AffinityPropagation(damping=.9,
                                                   preference=-200)
average_linkage = cluster.AgglomerativeClustering(
    linkage="average", affinity="cityblock", n_clusters=2,
    connectivity=connectivity)  # affinity= was renamed to metric= in scikit-learn 1.2
birch = cluster.Birch(n_clusters=2)
clustering_algorithms = [
    two_means, affinity_propagation, ms, spectral, ward, average_linkage,
    dbscan, birch]
for name, algorithm in zip(clustering_names, clustering_algorithms):
    # fit and predict cluster memberships
    t0 = time.time()
    algorithm.fit(X)
    t1 = time.time()
    if hasattr(algorithm, 'labels_'):
        y_pred = algorithm.labels_.astype(int)
    else:
        y_pred = algorithm.predict(X)
    # plot each algorithm in a 2 x 4 grid
    plt.subplot(2, len(clustering_algorithms) // 2, plot_num)
    plt.title(name, size=18)
    plt.scatter(X[:, 0], X[:, 1], color=colors[y_pred].tolist(), s=10)
    if hasattr(algorithm, 'cluster_centers_'):
        centers = algorithm.cluster_centers_
        center_colors = colors[:len(centers)]
        plt.scatter(centers[:, 0], centers[:, 1], s=100, c=center_colors)
    plt.xlim(-2, 2)
    plt.ylim(-2, 2)
    plt.xticks(())
    plt.yticks(())
    plt.text(.99, .01, ('%.2fs' % (t1 - t0)).lstrip('0'),
             transform=plt.gca().transAxes, size=15,
             horizontalalignment='right')
    plot_num += 1
plt.show()
Finally, consider looping through several regression or classification models in one go and collecting the results for each.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import SGDRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import TweedieRegressor
from sklearn.linear_model import PoissonRegressor
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
import numpy as np
import pandas as pd
import statsmodels.api as sm
# A plain OLS fit from statsmodels for reference
model = sm.OLS(y, X).fit()  # note the (y, X) argument order
predictions = model.predict(X)
print(model.summary())
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
regressors = [
    LinearRegression(),
    SGDRegressor(),
    KNeighborsRegressor(),
    DecisionTreeRegressor(),
    RandomForestRegressor(),
    GradientBoostingRegressor(),
    TweedieRegressor(),
    PoissonRegressor(),
    Ridge(),
    Lasso()
]
# Logging for visual comparison
log_cols = ["Regressor", "RMSE", "MAE"]
log_entries = []
for reg in regressors:
    reg.fit(X_train, y_train)
    name = reg.__class__.__name__
    print(name, 'R^2: %.4f' % reg.score(X_test, y_test))
    y_pred = reg.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    print(name + ' RMSE: %.4f' % rmse)
    mae = mean_absolute_error(y_test, y_pred)
    print(name + ' MAE: %.4f' % mae)
    log_entries.append([name, rmse, mae])
    print("=" * 30)
log = pd.DataFrame(log_entries, columns=log_cols)  # DataFrame.append was removed in pandas 2.0
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_color_codes("muted")
sns.barplot(x='RMSE', y='Regressor', data=log, color="b")
plt.show()
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.datasets import load_iris
# Step 1: load the data
iris = load_iris()
# Step 2: separate the data into independent and dependent variables
X = iris.data[:, :2]  # we only take the first two features
y = iris.target
# Step 3: split the dataset into the training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
classifiers = [
    GaussianNB(),
    MLPClassifier(),
    KNeighborsClassifier(),
    GaussianProcessClassifier(),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    GradientBoostingClassifier(),
    QuadraticDiscriminantAnalysis()]
# Logging for visual comparison
log_cols = ["Classifier", "Accuracy"]
log_entries = []
for clf in classifiers:
    clf.fit(X_train, y_train)
    name = clf.__class__.__name__
    print("=" * 30)
    print(name)
    print('****Results****')
    test_predictions = clf.predict(X_test)
    acc = accuracy_score(y_test, test_predictions)
    print("Accuracy: {:.4%}".format(acc))
    log_entries.append([name, acc * 100])
print("=" * 30)
log = pd.DataFrame(log_entries, columns=log_cols)
sns.set_color_codes("muted")
sns.barplot(x='Accuracy', y='Classifier', data=log, color="b")
plt.show()

Why do `scoring="neg_log_loss"` and `scoring=make_scorer(log_loss)` give such different validation scores?

Sklearn's documentation seems to imply that the neg_log_loss scoring uses log_loss as the scorer. This question tries to clarify what is happening under the hood and the accepted answer says neg_log_loss is simply equal to - log_loss. However, the attached example shows that this is not the case.
What is the relationship between scoring="neg_log_loss" and scoring=make_scorer(log_loss)? The apparent discontinuities make me think that neg_log_loss is using probabilities rather than predicted labels in the loss. How can I alter my code below so that each method returns the same results?
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss, make_scorer, get_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
X, y = make_classification(random_state=0)
cv = KFold(10)
nll = lambda y, ypred: -1 * log_loss(y, ypred)
param_grid = {'C': 1 / np.logspace(-5, 2, base=np.exp(1))}
model = LogisticRegression(penalty='l1', solver='liblinear', max_iter=10_000)
gscv_scoring = GridSearchCV(model, param_grid=param_grid, cv=cv,
                            scoring='neg_log_loss').fit(X, y)
gscv_make_scoring = GridSearchCV(model, param_grid=param_grid, cv=cv,
                                 scoring=make_scorer(nll)).fit(X, y)
fig, ax = plt.subplots(dpi=120)
r1 = pd.DataFrame(gscv_scoring.cv_results_)
r2 = pd.DataFrame(gscv_make_scoring.cv_results_)
plt.plot(r1.param_C, r1.mean_test_score)
plt.plot(r2.param_C, r2.mean_test_score)
If you use:
cross_val_score(model, X_train, y_train, scoring='neg_log_loss', cv=2)
you will get back the negated log loss (the score is negative so that greater is better). If you pass metrics.log_loss in via make_scorer instead, you need to set greater_is_better=False and needs_proba=True to get the same behaviour.
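For example, a scorer built this way should match the built-in 'neg_log_loss' (a minimal sketch; note that in newer scikit-learn releases needs_proba=True has been replaced by response_method='predict_proba'):
from sklearn.metrics import log_loss, make_scorer
from sklearn.model_selection import cross_val_score
# score the predicted probabilities and flip the sign so that greater is better
nll_scorer = make_scorer(log_loss, greater_is_better=False, needs_proba=True)
cross_val_score(model, X_train, y_train, scoring=nll_scorer, cv=2)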
To get the equivalent by using metrics.log_loss directly, you need to pass it the outcome of predict_proba from the model, not the predicted labels.
y_probs = model.predict_proba(X_train)
log_loss(y_train, y_probs)
This gives you back the positive log loss. Then, as you said, the two only differ by a sign.
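A quick sanity check of that sign relationship (assuming model has been fit on X_train, y_train):
from sklearn.metrics import get_scorer
import numpy as np
scorer = get_scorer('neg_log_loss')
built_in = scorer(model, X_train, y_train)
by_hand = -log_loss(y_train, model.predict_proba(X_train))
assert np.isclose(built_in, by_hand)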

Support vector regression

After executing this code, y_pred comes out way too high. Here is the code I tried:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_y = StandardScaler()
X = sc_X.fit_transform(X)
y = sc_y.fit_transform(y.reshape(-1, 1))
# Fitting SVR to the dataset
from sklearn.svm import SVR
regressor = SVR(kernel='rbf')
regressor.fit(X, y)
# Predicting a new result
y_pred = regressor.predict([[6.5]])
y_pred = sc_y.inverse_transform(y_pred)
Why is the value of y_pred so high? Is there some mistake in my code?
I found the solution: the new input has to be scaled with sc_X before predicting, and the scaled prediction has to be mapped back through sc_y. Instead of the last two lines above, I need to use
y_pred = sc_y.inverse_transform(
    regressor.predict(sc_X.transform(np.array([[6.5]]))).reshape(-1, 1))
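Broken into steps (same objects as above), the fix reads:
x_new = sc_X.transform(np.array([[6.5]]))  # put the raw input on the training scale
y_scaled = regressor.predict(x_new)  # the prediction lives in the scaled target space
y_pred = sc_y.inverse_transform(y_scaled.reshape(-1, 1))  # back to the original salary units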

Pipe-lining Standardscaler, Recursive feature selection, and Classifier

I have a given dataset, X and Y.
I want to implement the following steps using pipeline:
- Standardscaler
- Recursive feature selection
- RandomForestClassifier
- cross-validation predict
I implemented it as follows:
import numpy as np
from sklearn.feature_selection import RFE, RFECV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_iris
data = load_iris()
X = data.data
Y = data.target
print(X.shape)
print(Y.shape)
clf = RandomForestClassifier(n_estimators=50, max_features=None, n_jobs=-1, random_state=0)
kf = KFold(n_splits=2, shuffle=True, random_state=0)
pipeline = Pipeline([('standardscaler', StandardScaler()),
                     ('rfecv', RFECV(estimator=clf, step=1, cv=kf, scoring='accuracy', n_jobs=7)),
                     ('clf', clf)])
pipeline.fit(X,Y)
ypredict = cross_val_predict(pipeline, X, Y, cv=kf)
accuracy = accuracy_score(Y, ypredict)
print(accuracy)
Please look closely at my implementation and let me know what is wrong with my code. Thank you.
This works. The final estimator in a pipeline only needs to implement fit, which RFECV does; RFECV also exposes predict through the estimator it selects, so the separate ('clf', clf) step isn't needed. Here's the code:
from sklearn.feature_selection import RFECV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_iris
data = load_iris()
X = data.data
Y = data.target
clf = RandomForestClassifier()
# create pipeline
estimators = [('standardize', StandardScaler()),
              ('rfecv', RFECV(estimator=clf, scoring='accuracy'))]
# build the pipeline
pipeline = Pipeline(estimators)
# run the pipeline
kf = KFold(n_splits=2, shuffle=True, random_state=0)
ypredict = cross_val_predict(pipeline, X, Y, cv=kf)
accuracy = accuracy_score(Y, ypredict)
print(accuracy)
Output:
0.96
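Once fitted, you can also inspect which features RFECV kept, for example:
pipeline.fit(X, Y)
rfecv = pipeline.named_steps['rfecv']
print(rfecv.n_features_)  # number of features selected
print(rfecv.support_)     # boolean mask over the original features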

Standardize Regressors in sklearn

I'm working with sklearn and I'm wondering how StandardScaler() is used appropriately. I built a function that allows switching between Ridge and Lasso regression; it takes the alpha value, the regressors X, and the predicted variable y. All regressors should be standardized.
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()  # standardize regressors by removing the mean and scaling to unit variance
def do_penalized_regression(X, y, penalty, type):
    if type == "ridge":
        lm = Ridge(alpha=penalty, normalize=False)
    elif type == "lasso":
        lm = Lasso(alpha=penalty, normalize=False)
    lm.scaler.fit(X, y)
    return lm
Is this the way to go or should I standardize the regressors in advance?
You can use sklearn.pipeline.make_pipeline:
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
lm = Ridge(alpha=penalty)  # or Lasso(alpha=penalty)
model = make_pipeline(StandardScaler(), lm)
model.fit(X, y)
...
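A minimal sketch of the original function rebuilt around such a pipeline (kind replaces type, which shadows a Python builtin; normalize= is dropped because the scaler now does that job and the parameter was removed from Ridge/Lasso in scikit-learn 1.2):
from sklearn.linear_model import Ridge, Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def do_penalized_regression(X, y, penalty, kind):
    # the pipeline standardizes X and refits the scaler whenever the model is refit
    if kind == "ridge":
        lm = Ridge(alpha=penalty)
    elif kind == "lasso":
        lm = Lasso(alpha=penalty)
    model = make_pipeline(StandardScaler(), lm)
    model.fit(X, y)
    return model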
