Combining Recursive Feature Elimination and Grid Search in scikit-learn

I am trying to combine recursive feature elimination and grid search in scikit-learn. As you can see from the code below (which works), I am able to get the best estimator from a grid search and then pass that estimator to RFECV. However, I would rather do the RFECV first, then the grid search. The problem is that when I pass the selector from RFECV to the grid search, it does not accept it:
ValueError: Invalid parameter bootstrap for estimator RFECV
Is it possible to get the selector from RFECV and pass it directly to RandomizedSearchCV, or is this procedurally not the right thing to do?
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint as sp_randint
# Build a classification task using 5 informative features
X, y = make_classification(n_samples=1000, n_features=25, n_informative=5, n_redundant=2, n_repeated=0, n_classes=8, n_clusters_per_class=1, random_state=0)
grid = {"max_depth": [3, None],
"min_samples_split": sp_randint(1, 11),
"min_samples_leaf": sp_randint(1, 11),
"bootstrap": [True, False],
"criterion": ["gini", "entropy"]}
estimator = RandomForestClassifierCoef()
clf = RandomizedSearchCV(estimator, param_distributions=grid, cv=7)
clf.fit(X, y)
estimator = clf.best_estimator_
selector = RFECV(estimator, step=1, cv=4)
selector.fit(X, y)
selector.grid_scores_

The best way to do this would be to nest the RFECV inside the random search, using the method from this SO answer.
Some example code, based on the question code and the SO answer mentioned above:
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint as sp_randint
# Build a classification task using 5 informative features
X, y = make_classification(n_samples=1000, n_features=25, n_informative=5, n_redundant=2, n_repeated=0, n_classes=8, n_clusters_per_class=1, random_state=0)
grid = {"estimator__max_depth": [3, None],
"estimator__min_samples_split": sp_randint(1, 11),
"estimator__min_samples_leaf": sp_randint(1, 11),
"estimator__bootstrap": [True, False],
"estimator__criterion": ["gini", "entropy"]}
estimator = RandomForestClassifier()
selector = RFECV(estimator, step=1, cv=4)
clf = RandomizedSearchCV(selector, param_distributions=grid, cv=7)
clf.fit(X, y)
print(clf.grid_scores_)
print(clf.best_estimator_.n_features_)
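If you also want to see which features the nested RFECV kept, the refitted winner is available on the search object. A minimal sketch, assuming the clf fitted above:
best_rfecv = clf.best_estimator_        # the refitted RFECV from the best parameter draw
print(best_rfecv.support_)              # boolean mask of the selected features
print(best_rfecv.ranking_)              # feature ranking; 1 marks selected features
print(best_rfecv.estimator_)            # the RandomForestClassifier fitted on the kept features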

Related

Error in Grid search CV - RidgeClassifierCV as the constructor either does not set or modifies parameter alphas

I am performing GridSearchCV on RidgeClassifierCV to obtain hyper-parameters for my model.
So I imported the libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn import linear_model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
import warnings
warnings.filterwarnings('ignore')
np.random.seed(27)
Then I imported the dataset and split it, scaled it, and label-encoded the target variable:
!wget -O ChurnData.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/ChurnData.csv
churn = pd.read_csv("ChurnData.csv")
X = churn.drop(['churn'], axis='columns')
y1 = churn[['churn']]
y1['churn']=y1['churn'].astype('int')
scaler=StandardScaler()
X_scaled=scaler.fit_transform(X)
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(churn['churn'].unique())
y = le.transform(y1)
# split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.2)
Then I performed the grid search:
alphas = [(0.1, 1, 2, 5 , 10)]
solver_churn = ['auto', 'svd','cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']
fit_intercept = [True, False]
class_weight = [{0:0.5,1:0.5},{0:0.6,1:0.4}]
param_grid_churn = dict(alphas=alphas, fit_intercept=fit_intercept,class_weight=class_weight)
ridgecv = linear_model.RidgeClassifierCV()
grids_churn = GridSearchCV(estimator=ridgecv, param_grid=param_grid_churn, scoring='roc_auc', verbose=1, n_jobs=-1)
grid_result_churn = grids_churn.fit(X_train, y_train)
alphas is given in the docs as a parameter, yet I still get:
Error in Grid search CV - RidgeClassifierCV as the constructor either does not set or modifies parameter alphas
How can I resolve this?
The error comes from the cloning step of the grid search: RidgeClassifierCV does not store alphas unchanged in its constructor (as the message says, "the constructor either does not set or modifies parameter alphas"), so GridSearchCV cannot treat it as a searchable parameter. RidgeClassifierCV also already cross-validates over its alphas internally, so pass alphas to the constructor and drop it from the param grid. Adjust your code like this:
alphas = (0.1, 1, 2, 5, 10)
solver_churn = ['auto', 'svd','cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']
fit_intercept = [True, False]
class_weight = [{0:0.5,1:0.5},{0:0.6,1:0.4}]
param_grid_churn = dict(fit_intercept=fit_intercept,class_weight=class_weight)
ridgecv = linear_model.RidgeClassifierCV(alphas=alphas)
grids_churn = GridSearchCV(estimator=ridgecv, param_grid=param_grid_churn, scoring='roc_auc', verbose=1, n_jobs=-1)
grid_result_churn = grids_churn.fit(X_train, y_train)
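If you do want the outer search itself to tune the regularisation strength, one option (a sketch, not from the original answer) is the plain RidgeClassifier, whose scalar alpha is an ordinary constructor parameter and therefore grid-searchable:
from sklearn.linear_model import RidgeClassifier

# alpha replaces the internal alphas grid of RidgeClassifierCV
param_grid_ridge = dict(alpha=[0.1, 1, 2, 5, 10],
                        fit_intercept=fit_intercept,
                        class_weight=class_weight)
grids_ridge = GridSearchCV(estimator=RidgeClassifier(), param_grid=param_grid_ridge,
                           scoring='roc_auc', verbose=1, n_jobs=-1)
print(grids_ridge.fit(X_train, y_train).best_params_)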

SVR hyperparameter selection and visualisation

I am just a beginner in data analysis. I want to use the "cross-validation grid search" method to determine the parameters gamma and C of the Radial Basis Function (RBF) kernel SVM. I don't know where I should put my data in this code, or which data (training or target) I should use.
For SVR
import numpy as np
import pandas as pd
from math import sqrt
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error,explained_variance_score
from TwoStageTrAdaBoostR2 import TwoStageTrAdaBoostR2 # import the two-stage algorithm
from sklearn import preprocessing
from sklearn import svm
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from matplotlib.colors import Normalize
from sklearn.svm import SVC
# Data import (source)
source= pd.read_csv(sourcedata)
# Data import (target)
data= pd.read_csv(targetdata)
# Sample Size
datatrain = data.sample(n=60, random_state=1)
datatest = data[~data.index.isin(datatrain.index)]
# Merge training set data (source and target)
train = pd.concat([source, datatrain], sort=False)
train.reset_index(inplace=True, drop=True)
datatest.reset_index(inplace=True, drop=True)
# Variable input
X_train, y_train = train[['x1', 'x2']].values, train['y'].values
X_test, y_test = datatest[['x1', 'x2']].values, datatest['y'].values
# Parameter setting
#sample_size = [n_source1+n_source2+n_source3+n_source4+n_source5, n_target_train]
n_estimators = 100
steps = 8
fold = 5
random_state = np.random.RandomState(1)
sample_size = [350, 60]
#1 twostage tradaboost.r2
regr_1 = TwoStageTrAdaBoostR2(SVR(C=50, gamma='auto'),
                              n_estimators=n_estimators, sample_size=sample_size,
                              steps=steps, fold=fold,
                              random_state=random_state)
regr_1.fit(X_train, y_train)
y_pred1 = regr_1.predict(X_test)
print("MSE of regular two stage trAdaboostR2--model1:",sqrt(mean_squared_error(y_test, y_pred1)))
#Plot the results
plt.figure()
plt.scatter(y_test, y_test-y_pred1, c="black", label="TwoStageTrAdaBoostR2_model1", s=10)
plt.xlabel("CAR")
plt.ylabel("Err")
plt.title("Two-stage Transfer Learning Boosted Decision Tree Regression", loc='left', fontsize=12, fontweight=0, color="orange")
plt.legend()
plt.show()
For the cross-validation grid search (best parameters):
# Cross validation grid search (best parameters)
parameter_candidates = [
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
svr = svm.SVR()
clf = GridSearchCV(svr, parameter_candidates, cv=5, n_jobs=-1)
clf.fit(X_train, y_train)
print('Best score for data:', clf.best_score_)
print('Best C:',clf.best_estimator_.C)
print('Best Kernel:',clf.best_estimator_.kernel)
print('Best Gamma:',clf.best_estimator_.gamma)
For visualization of the parameter effects:
c_range = np.logspace(-2, 2, 4)
gamma_range = np.logspace(-2, 2, 5)
tuned_parameters = [{'kernel': ['rbf'], 'C': c_range, 'gamma': gamma_range},
                    {'kernel': ['linear'], 'C': c_range, 'gamma': gamma_range}]
svr = svm.SVR()
clf = GridSearchCV(svr, param_grid=tuned_parameters, verbose=2, n_jobs=-1,
                   scoring='explained_variance')
clf.fit(X_train, y_train)
print('Best score for data:', clf.best_score_)
print('Best C:', clf.best_estimator_.C)
print('Best Kernel:', clf.best_estimator_.kernel)
print('Best Gamma:', clf.best_estimator_.gamma)
# scores for rbf kernel
n = len(gamma_range)*len(c_range)
scores_rbf = clf.cv_results_['mean_test_score'][:n].reshape(len(gamma_range),
                                                            len(c_range))
# scores for linear kernel
scores_linear = clf.cv_results_['mean_test_score'][n:].reshape(len(gamma_range),
                                                               len(c_range))

class MidpointNormalize(Normalize):
    def __init__(self, vmin=None, vmax=None, midpoint=None, clip=False):
        self.midpoint = midpoint
        Normalize.__init__(self, vmin, vmax, clip)
    def __call__(self, value, clip=None):
        x, y = [self.vmin, self.midpoint, self.vmax], [0, 0.5, 1]
        return np.ma.masked_array(np.interp(value, x, y))

plt.figure(figsize=(8, 6))
plt.subplots_adjust(left=.2, right=0.95, bottom=0.15, top=0.95)
plt.imshow(scores_rbf, interpolation='nearest', cmap=plt.cm.hot,
           norm=MidpointNormalize(vmin=0.2, midpoint=0.92))
plt.xlabel('gamma')
plt.ylabel('C')
plt.colorbar()
plt.xticks(np.arange(len(gamma_range)), gamma_range, rotation=45)
plt.yticks(np.arange(len(c_range)), c_range)
plt.title('Validation accuracy')
plt.show()
When I used this code, I got the following heatmap plot.
But I am trying to get a heatmap like this one:
The following code with some typical regression data should work all the way through:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from matplotlib.colors import Normalize

class MidpointNormalize(Normalize):
    def __init__(self, vmin=None, vmax=None, midpoint=None, clip=False):
        self.midpoint = midpoint
        Normalize.__init__(self, vmin, vmax, clip)
    def __call__(self, value, clip=None):
        x, y = [self.vmin, self.midpoint, self.vmax], [0, 0.5, 1]
        return np.ma.masked_array(np.interp(value, x, y))

X, y = datasets.load_boston(return_X_y=True)  # note: load_boston is only available in scikit-learn < 1.2
X_train, X_test, y_train, y_test = train_test_split(X, y)
# Cross validation grid search (best parameters)
c_range = np.logspace(0, 4, 8)
gamma_range = np.logspace(-4, 0, 8)
tuned_parameters = [{'kernel': ['rbf'], 'C': c_range, 'gamma': gamma_range},
                    {'kernel': ['linear'], 'C': c_range, 'gamma': gamma_range}]
svr = svm.SVR()
clf = GridSearchCV(svr, param_grid=tuned_parameters, verbose=20, n_jobs=-4, cv=4,
                   scoring='explained_variance')
clf.fit(X_train, y_train)
print('Best score for data:', clf.best_score_)
print('Best C:', clf.best_estimator_.C)
print('Best Kernel:', clf.best_estimator_.kernel)
print('Best Gamma:', clf.best_estimator_.gamma)
# scores for rbf kernel
n = len(gamma_range)*len(c_range)
scores_rbf = clf.cv_results_['mean_test_score'][:n].reshape(len(gamma_range),
                                                            len(c_range))
# scores for linear kernel
scores_linear = clf.cv_results_['mean_test_score'][n:].reshape(len(gamma_range),
                                                               len(c_range))

plt.figure(figsize=(8, 6))
plt.subplots_adjust(left=.2, right=0.95, bottom=0.15, top=0.95)
plt.imshow(scores_rbf, interpolation='nearest', cmap=plt.cm.hot,
           norm=MidpointNormalize(vmin=-.2, midpoint=0.5))
plt.xlabel('gamma')
plt.ylabel('C')
plt.colorbar()
plt.xticks(np.arange(len(gamma_range)),
           [np.format_float_scientific(i, 1) for i in gamma_range], rotation=45)
plt.yticks(np.arange(len(c_range)),
           [np.format_float_scientific(i) for i in c_range])
plt.title('Validation accuracy')
plt.show()
The granularity of the grid is very low, but otherwise it takes some time to run. Also, the limits of the grid will need to be chosen more carefully than the ones I picked.
I'm not sure why you get the error you get, but I kept things simple and instantiated the SVR once in my snippet so you can see how it works. I've also used different lengths for the C and gamma arrays; that's just to show how these parameters are carried through. Sometimes I find that when everything has the same length, it is difficult to see which parameter is responsible for what.
The final plot looks like this, but the result depends heavily on the range of the grid, its granularity, and the dataset you are working with. Also note that I changed the parameters of the MidpointNormalize class you provided.
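One more detail: both snippets compute scores_linear but never display it. A sketch, reusing the objects from the answer code above, that plots it the same way as the RBF scores:
plt.figure(figsize=(8, 6))
plt.imshow(scores_linear, interpolation='nearest', cmap=plt.cm.hot,
           norm=MidpointNormalize(vmin=-.2, midpoint=0.5))
plt.xlabel('gamma')
plt.ylabel('C')
plt.colorbar()
# the linear kernel ignores gamma, so the scores should only vary along the C axis
plt.title('Validation score (linear kernel)')
plt.show()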

tuning models with grid search

I ran across an example on parameters tuning with Grid search and text data using TfidfVectorizer() in the pipeline.
As far as I've understood, when we call grid_search.fit(X_train, y_train) it transforms the data and then fits the model, as described in the dictionary. However, during the evaluation I'm a bit confused about the test dataset: when we call grid_search.predict(X_test), I don't know whether (or how) the TfidfVectorizer() is applied to this test chunk.
Thanks
David
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model.logistic import LogisticRegression
from sklearn.grid_search import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.cross_validation import train_test_split
from sklearn.metrics import precision_score, recall_score, accuracy_score

pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression())
])
parameters = {
    'vect__max_df': (0.25, 0.5, 0.75),
    'vect__stop_words': ('english', None),
    'vect__max_features': (2500, 5000, 10000, None),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'vect__norm': ('l1', 'l2'),
    'clf__penalty': ('l1', 'l2'),
    'clf__C': (0.01, 0.1, 1, 10),
}

if __name__ == "__main__":
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1,
                               verbose=1, scoring='accuracy', cv=3)
    df = pd.read_csv('data/sms.csv')
    X, y = df['message'], df['label']
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    grid_search.fit(X_train, y_train)
    print 'Best score: %0.3f' % grid_search.best_score_
    print 'Best parameters set:'
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print '\t%s: %r' % (param_name, best_parameters[param_name])
    predictions = grid_search.predict(X_test)
    print 'Accuracy:', accuracy_score(y_test, predictions)
    print 'Precision:', precision_score(y_test, predictions)
    print 'Recall:', recall_score(y_test, predictions)
This is an example of scikit-learn pipeline magic. It works like this:
First, you define the elements of a pipeline with the Pipeline constructor. All data, whether at the train or test (predict) stage, is processed through every defined step: in this case it is vectorised by TfidfVectorizer, and the output is passed to the LogisticRegression model. Crucially, at predict time the vectorizer only transforms the test data, using the vocabulary and idf weights it learned during fitting; it is never refitted on the test set.
Passing the defined pipeline to the GridSearchCV constructor lets you use the fit method, which not only performs the grid search but also internally refits both TfidfVectorizer and LogisticRegression with the best parameters found, so running predict afterwards uses the best-found model.
You can find more info on creating pipelines in scikit-learn documentation.
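To make the mechanics concrete, here is a minimal sketch (assuming the fitted grid_search and X_test from the question code) showing that predicting through the search object is equivalent to transforming with the already-fitted vectorizer and then predicting:
best = grid_search.best_estimator_       # the Pipeline refitted with the best parameters
tfidf = best.named_steps['vect']         # TfidfVectorizer fitted on the training data only
X_test_tfidf = tfidf.transform(X_test)   # transform only; the vectorizer is not refitted
manual_predictions = best.named_steps['clf'].predict(X_test_tfidf)
# manual_predictions is identical to grid_search.predict(X_test)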

Scaling in scikit-learn permutation_test_score

I'm using the scikit-learn permutation_test_score method to evaluate the significance of my estimator's performance. Unfortunately, I cannot tell from the scikit-learn documentation whether the method applies any scaling to the data. I usually standardise my data through a StandardScaler, so that the training-set standardisation is applied to the testing set.
The function itself does not apply any scaling.
Here is an example from the documentation:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import permutation_test_score
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target
n_classes = np.unique(y).size
# Some noisy data not correlated
random = np.random.RandomState(seed=0)
E = random.normal(size=(len(X), 2200))
# Add noisy data to the informative features to make the task harder
X = np.c_[X, E]
svm = SVC(kernel='linear')
cv = StratifiedKFold(2)
score, permutation_scores, pvalue = permutation_test_score(
    svm, X, y, scoring="accuracy", cv=cv, n_permutations=100, n_jobs=1)
However, what you may want to do is to pass in the permutation_test_score a pipeline where you apply the scaling.
Example:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
pipe = Pipeline([('scaler', StandardScaler()), ('clf', SVC(kernel='linear'))])
score, permutation_scores, pvalue = permutation_test_score(
    pipe, X, y, scoring="accuracy", cv=cv, n_permutations=100, n_jobs=1)
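With the pipeline, the scaler is refit on each training fold and then applied to the corresponding held-out fold, so no information leaks from test to train. Reading the result (a small usage sketch with the variables above):
# score: accuracy of the real model; permutation_scores: scores with shuffled labels
print("Score: %.3f, p-value: %.4f" % (score, pvalue))
# a small p-value indicates the score is unlikely to arise by chance alone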

Grid search with LightGBM example

I am trying to find the best parameters for a lightgbm model using GridSearchCV from sklearn.model_selection. I have not been able to find a solution that actually works.
I have managed to set up a partly working code:
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
np.random.seed(1)
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
y = pd.read_csv('y.csv')
y = y.values.ravel()
print(train.shape, test.shape, y.shape)
categoricals = ['COL_A','COL_B']
indexes_of_categories = [train.columns.get_loc(col) for col in categoricals]
gkf = KFold(n_splits=5, shuffle=True, random_state=42).split(X=train, y=y)
param_grid = {
    'num_leaves': [31, 127],
    'reg_alpha': [0.1, 0.5],
    'min_data_in_leaf': [30, 50, 100, 300, 400],
    'lambda_l1': [0, 1, 1.5],
    'lambda_l2': [0, 1]
}
lgb_estimator = lgb.LGBMClassifier(boosting_type='gbdt', objective='binary',
                                   num_boost_round=2000, learning_rate=0.01,
                                   metric='auc',
                                   categorical_feature=indexes_of_categories)
gsearch = GridSearchCV(estimator=lgb_estimator, param_grid=param_grid, cv=gkf)
lgb_model = gsearch.fit(X=train, y=y)
print(lgb_model.best_params_, lgb_model.best_score_)
This seems to be working but with a UserWarning:
categorical_feature keyword has been found in params and will be ignored. Please use categorical_feature argument of the Dataset constructor to pass this parameter.
I am looking for a working solution, or perhaps a suggestion on how to ensure that lightgbm accepts the categorical arguments in the above code.
As the warning states, categorical_feature is not one of the LGBMModel arguments. It belongs to lgb.Dataset instantiation, which in the sklearn API happens inside the fit() method (see the docs). Thus, to pass it through the GridSearchCV optimisation, provide it as an argument of the GridSearchCV.fit() method (sklearn v0.19.1 and later) or as an additional fit_params argument when instantiating GridSearchCV in older sklearn versions.
In case you are struggling with how to pass the fit_params, which happened to me as well, this is how you do it:
fit_params = {'categorical_feature':indexes_of_categories}
clf = GridSearchCV(model, param_grid, cv=n_folds)
clf.fit(x_train, y_train, **fit_params)
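An alternative that sidesteps fit_params entirely (a sketch, not from the original answer): cast the categorical columns to the pandas 'category' dtype. LightGBM's sklearn wrapper picks such columns up automatically, since its fit() defaults to categorical_feature='auto':
# assumes the train DataFrame, y, categoricals and param_grid from the question code
for col in categoricals:
    train[col] = train[col].astype('category')

lgb_estimator = lgb.LGBMClassifier(boosting_type='gbdt', objective='binary',
                                   learning_rate=0.01)  # no categorical_feature argument
gsearch = GridSearchCV(estimator=lgb_estimator, param_grid=param_grid, cv=5)
lgb_model = gsearch.fit(X=train, y=y)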
