Stacking StandardScaler() with RFECV and GridSearchCV - scikit-learn

So I found out that StandardScaler() can make my RFECV inside my GridSearchCV with each on a nested 3-fold cross validation run faster. Without StandardScaler(), my code was running for more than 2 days, so I canceled and decided to inject StandardScaler into the process. But now it is has been running for more than 4 hours and I am not sure if I have done it right. Here is my code:
# Choose Linear SVM as classifier
LSVM = SVC(kernel='linear')
selector = RFECV(LSVM, step=1, cv=3, scoring='f1')
param_grid = [{'estimator__C': [0.001, 0.01, 0.1, 1, 10, 100]}]
clf = make_pipeline(StandardScaler(),
GridSearchCV(selector,
param_grid,
cv=3,
refit=True,
scoring='f1'))
clf.fit(X, Y)
I think I haven't gotten it right to be honest because I think the StandardScaler() should be put inside the GridSearchCV() function for it to normalize the data each fold, not only just once (?). Please correct me if I am wrong or if my pipeline is incorrect and hence why it is still running for a long time.
I have 8,000 rows of 145 features to be pruned by RFECV, and 6 C-Values to be pruned by GridSearchCV. So for each C-Value, the best feature set is determined by the RFECV.
Thanks!
Update:
So I put the StandardScaler inside the RFECV like this:
clf = SVC(kernel='linear')
kf = KFold(n_splits=3, shuffle=True, random_state=0)
estimators = [('standardize' , StandardScaler()),
('clf', clf)]
class Mypipeline(Pipeline):
#property
def coef_(self):
return self._final_estimator.coef_
#property
def feature_importances_(self):
return self._final_estimator.feature_importances_
pipeline = Mypipeline(estimators)
rfecv = RFECV(estimator=pipeline, cv=kf, scoring='f1', verbose=10)
param_grid = [{'estimator__svc__C': [0.001, 0.01, 0.1, 1, 10, 100]}]
clf = GridSearchCV(rfecv, param_grid, cv=3, scoring='f1', verbose=10)
But it still throws out the following error:
ValueError: Invalid parameter C for estimator Pipeline(memory=None,
steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, >with_std=True)), ('svc', SVC(C=1.0, cache_size=200, class_weight=None, >coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False))]). Check the list of available parameters with >estimator.get_params().keys().

Kumar is right. Also, what You might want to do, turn on verbose in the GridSearchCV. Also, You could add a limit to the number of iterations of the SVC, starting from a very small number, like 5, just to make sure that the problem is not with the convergence.

Related

Pipeline for more than 2 classifiers

I am trying to build an ensemble using Knn and random forest classifiers.
steps = [('scaler', StandardScaler()),
('regressor', VotingClassifier(estimators=[
('knn', KNeighborsClassifier()),
('clf', RandomForestClassifier())]))]
pipeline = Pipeline(steps)
parameters = [{'knn__n_neighbors': np.arange(1, 50)}, {
'clf__n_estimators': [10, 20, 30],
'clf__criterion': ['gini', 'entropy'],
'clf__max_features': [5, 10, 15],
'clf__max_depth': ['auto', 'log2', 'sqrt', None]}]
X_train, X_test, y_train, y_test = train_test_split(X, y.values.ravel(),
test_size=0.3, random_state=65)
cv = GridSearchCV(pipeline, param_grid=parameters)
cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)
I have encourntered the following error while running the above code:
Invalid parameter knn for estimator Pipeline(steps=[('scaler', StandardScaler()),
('regressor',VotingClassifier(estimators=[('knn', KNeighborsClassifier()),('clf', RandomForestClassifier())]))]). Check the list of available parameters with estimator.get_params().keys()
Since I am new to machine learning I having difficulty in understanding the error.
I agree that the error message is not clear, but the error is raised because of the knn estimator in your VotingClassifier. Please check the VotingClassifier documentation:
voting : str, {'hard', 'soft'}, default='hard'
If 'hard', uses predicted class labels for majority rule voting. Else if 'soft', predicts the class label based on the argmax of the sums of the predicted probabilities, which is recommended for an ensemble of well-calibrated classifiers.
(...)
attributes : dict, default=None
Dictionary mapping each estimator to a list of attributes to be extracted as predictors.
If None, all public estimator attributes are used.
Concretely, KNeighborsClassifier has no predict_proba method, thus it cannot work with the soft voting classifier.
If you want to keep both estimators in your VotingClassifier, you should set voting='hard':
VotingClassifier(estimators=[('knn', KNeighborsClassifier()), ('clf', RandomForestClassifier())],voting='hard')
Let me know if it helped.

tuning models with grid search

I ran across an example on parameters tuning with Grid search and text data using TfidfVectorizer() in the pipeline.
As far as I've understood is that when we call grid_search.fit(X_train, y_train) it will transform the data then fit the model as it is described in a dictionary. However during the evaluation, I'm a bit confused with the test dataset, since when we call grid_search.predict(X_test) I don't know whether/(how) the TfidfVectorizer() is applied on this test chunk.
Thanks
David
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model.logistic import LogisticRegression
from sklearn.grid_search import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.cross_validation import train_test_split
from sklearn.metrics import precision_score, recall_score, accuracy_
score
pipeline = Pipeline([
('vect', TfidfVectorizer(stop_words='english')),
('clf', LogisticRegression())
])
parameters = {
'vect__max_df': (0.25, 0.5, 0.75),
'vect__stop_words': ('english', None),
'vect__max_features': (2500, 5000, 10000, None),
'vect__ngram_range': ((1, 1), (1, 2)),
'vect__use_idf': (True, False),
'vect__norm': ('l1', 'l2'),
'clf__penalty': ('l1', 'l2'),
'clf__C': (0.01, 0.1, 1, 10),
}
if __name__ == "__main__":
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1,
verbose=1, scoring='accuracy', cv=3)
df = pd.read_csv('data/sms.csv')
X, y, = df['message'], df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y)
grid_search.fit(X_train, y_train)
print 'Best score: %0.3f' % grid_search.best_score_
print 'Best parameters set:'
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
print '\t%s: %r' % (param_name, best_parameters[param_name])
predictions = grid_search.predict(X_test)
print 'Accuracy:', accuracy_score(y_test, predictions)
print 'Precision:', precision_score(y_test, predictions)
print 'Recall:', recall_score(y_test, predictions)
This is an example of scikit-learn pipelines magic. It works like this:
First, you define elements of a pipeline with Pipeline constructor - all data, whether on train or test (predict) stage, will be processed through all the defined steps - in this case by TfidfVectorizer and then the output will be passed to LogisticRegression model.
Passing defined pipeline to GridSearchCV constructor allows you to use the method fit, that not only performs grid search but also internally sets both TfidfVectorizer and LogisticRegression to best found parameters, so later running predict does so on best-found models.
You can find more info on creating pipelines in scikit-learn documentation.

Adaboost in Pipeline with Gridsearch SKLEARN

I would like to use the AdaBoostClassifier with LinearSVC as base estimator. I want to do a gridsearch on some of the parameters in LinearSVC. Also I have to scale my features.
p_grid = {'base_estimator__C': np.logspace(-5, 3, 10)}
n_splits = 5
inner_cv = StratifiedKFold(n_splits=n_splits,
shuffle=True, random_state=5)
SVC_Kernel=LinearSVC(multi_class ='crammer_singer',tol=10e-3,max_iter=10000,class_weight='balanced')
ABC = AdaBoostClassifier(base_estimator=SVC_Kernel,n_estimators=600,learning_rate=1.5,algorithm="SAMME")
for train_index, test_index in kk.split(input):
X_train, X_test = input[train_index], input[test_index]
y_train, y_test = target[train_index], target[test_index]
pipe_SVC = Pipeline([('scaler', RobustScaler()),('AdaBoostClassifier', ABC)])
clfSearch = GridSearchCV(estimator=pipe_SVC, param_grid=p_grid,
cv=inner_cv, scoring='f1_macro', iid=False, n_jobs=-1)
clfSearch.fit(X_train, y_train)
The following error occurs:
ValueError: Invalid parameter base_estimator for estimator Pipeline(memory=None,
steps=[('scaler',
RobustScaler(copy=True, quantile_range=(25.0, 75.0),
with_centering=True, with_scaling=True)),
('AdaBoostClassifier',
AdaBoostClassifier(algorithm='SAMME',
base_estimator=LinearSVC(C=1.0,
class_weight='balanced',
dual=True,
fit_intercept=True,
intercept_scaling=1,
loss='squared_hinge',
max_iter=10000,
multi_class='crammer_singer',
penalty='l2',
random_state=None,
tol=0.01,
verbose=0),
learning_rate=1.5, n_estimators=600,
random_state=None))],
verbose=False). Check the list of available parameters with `estimator.get_params().keys()`.
Without the AdaBoostClassifier the pipeline is working, so I think there is the problem.
I think your p_grid should be defined as follows,
p_grid = {'AdaBoostClassifier__base_estimator__C': np.logspace(-5, 3, 10)}
Try pipe_SVC.get_params(), if you are not sure about the name of your parameter.

GridSearchCV - XGBoost - Early Stopping

i am trying to do hyperparemeter search with using scikit-learn's GridSearchCV on XGBoost. During gridsearch i'd like it to early stop, since it reduce search time drastically and (expecting to) have better results on my prediction/regression task. I am using XGBoost via its Scikit-Learn API.
model = xgb.XGBRegressor()
GridSearchCV(model, paramGrid, verbose=verbose ,fit_params={'early_stopping_rounds':42}, cv=TimeSeriesSplit(n_splits=cv).get_n_splits([trainX, trainY]), n_jobs=n_jobs, iid=iid).fit(trainX,trainY)
I tried to give early stopping parameters with using fit_params, but then it throws this error which is basically because of lack of validation set which is required for early stopping:
/opt/anaconda/anaconda3/lib/python3.5/site-packages/xgboost/callback.py in callback(env=XGBoostCallbackEnv(model=<xgboost.core.Booster o...teration=4000, rank=0, evaluation_result_list=[]))
187 else:
188 assert env.cvfolds is not None
189
190 def callback(env):
191 """internal function"""
--> 192 score = env.evaluation_result_list[-1][1]
score = undefined
env.evaluation_result_list = []
193 if len(state) == 0:
194 init(env)
195 best_score = state['best_score']
196 best_iteration = state['best_iteration']
How can i apply GridSearch on XGBoost with using early_stopping_rounds?
note: model is working without gridsearch, also GridSearch works without 'fit_params={'early_stopping_rounds':42}
When using early_stopping_rounds you also have to give eval_metric and eval_set as input parameter for the fit method. Early stopping is done via calculating the error on an evaluation set. The error has to decrease every early_stopping_rounds otherwise the generation of additional trees is stopped early.
See the documentation of xgboosts fit method for details.
Here you see a minimal fully working example:
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import TimeSeriesSplit
cv = 2
trainX= [[1], [2], [3], [4], [5]]
trainY = [1, 2, 3, 4, 5]
# these are the evaluation sets
testX = trainX
testY = trainY
paramGrid = {"subsample" : [0.5, 0.8]}
fit_params={"early_stopping_rounds":42,
"eval_metric" : "mae",
"eval_set" : [[testX, testY]]}
model = xgb.XGBRegressor()
gridsearch = GridSearchCV(model, paramGrid, verbose=1 ,
fit_params=fit_params,
cv=TimeSeriesSplit(n_splits=cv).get_n_splits([trainX,trainY]))
gridsearch.fit(trainX,trainY)
An update to #glao's answer and a response to #Vasim's comment/question, as of sklearn 0.21.3 (note that fit_params has been moved out of the instantiation of GridSearchCV and been moved into the fit() method; also, the import specifically pulls in the sklearn wrapper module from xgboost):
import xgboost.sklearn as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import TimeSeriesSplit
cv = 2
trainX= [[1], [2], [3], [4], [5]]
trainY = [1, 2, 3, 4, 5]
# these are the evaluation sets
testX = trainX
testY = trainY
paramGrid = {"subsample" : [0.5, 0.8]}
fit_params={"early_stopping_rounds":42,
"eval_metric" : "mae",
"eval_set" : [[testX, testY]]}
model = xgb.XGBRegressor()
gridsearch = GridSearchCV(model, paramGrid, verbose=1,
cv=TimeSeriesSplit(n_splits=cv).get_n_splits([trainX, trainY]))
gridsearch.fit(trainX, trainY, **fit_params)
Here's a solution that works in a Pipeline with GridSearchCV. The challenge occurs when you have a pipeline that is required to pre-process your training data. For example, when X is a text document and you need TFTDFVectorizer to vectorize it.
Over-ride the XGBRegressor or XGBClssifier.fit() Function
This step uses train_test_split() to select the specified number of
validation records from X for the eval_set and then passes the
remaining records along to fit().
A new parameter eval_test_size is added to .fit() to control the number of validation records. (see train_test_split test_size documenation)
**kwargs passes along any other parameters added by the user for the XGBRegressor.fit() function.
from xgboost.sklearn import XGBRegressor
from sklearn.model_selection import train_test_split
class XGBRegressor_ES(XGBRegressor):
def fit(self, X, y, *, eval_test_size=None, **kwargs):
if eval_test_size is not None:
params = super(XGBRegressor, self).get_xgb_params()
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=eval_test_size, random_state=params['random_state'])
eval_set = [(X_test, y_test)]
# Could add (X_train, y_train) to eval_set
# to get .eval_results() for both train and test
#eval_set = [(X_train, y_train),(X_test, y_test)]
kwargs['eval_set'] = eval_set
return super(XGBRegressor_ES, self).fit(X_train, y_train, **kwargs)
Example Usage
Below is a multistep pipeline that includes multiple transformations to X. The pipeline's fit() function passes the new evaluation parameter to the XGBRegressor_ES class above as xgbr__eval_test_size=200. In this example:
X_train contains text documents passed to the pipeline.
XGBRegressor_ES.fit() uses train_test_split() to select 200 records from X_train for the validation set and early stopping. (This could also be a percentage such as xgbr__eval_test_size=0.2)
The remaining records in X_train are passed along to XGBRegressor.fit() for the actual fit().
Early stopping may now occur after 75 rounds of unchanged boosting for each cv fold in a gridsearch.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectPercentile, f_regression
xgbr_pipe = Pipeline(steps=[('tfidf', TfidfVectorizer()),
('vt',VarianceThreshold()),
('scaler', StandardScaler()),
('Sp', SelectPercentile()),
('xgbr',XGBRegressor_ES(n_estimators=2000,
objective='reg:squarederror',
eval_metric='mae',
learning_rate=0.0001,
random_state=7)) ])
X_train = train_idxs['f_text'].values
y_train = train_idxs['Pct_Change_20'].values
Example Fitting the Pipeline:
%time xgbr_pipe.fit(X_train, y_train,
xgbr__eval_test_size=200,
xgbr__eval_metric='mae',
xgbr__early_stopping_rounds=75)
Example Fitting GridSearchCV:
learning_rate = [0.0001, 0.001, 0.01, 0.05, 0.1, 0.2, 0.3]
param_grid = dict(xgbr__learning_rate=learning_rate)
grid_search = GridSearchCV(xgbr_pipe, param_grid, scoring="neg_mean_absolute_error", n_jobs=-1, cv=10)
grid_result = grid_search.fit(X_train, y_train,
xgbr__eval_test_size=200,
xgbr__eval_metric='mae',
xgbr__early_stopping_rounds=75)

Gridsearchcv with skflow/TF learn runs forever even if grid is just one point

I am trying to do a gridsearch over steps, learning_rate and batch_size of a DNN regression. I've tried to do this with the simple example, the Boston dataset shown here boston example however, I can't get it to work. It does not throw any error, it just runs and runs and runs. It does this even if I set up a grid of a single point.
Do you see any error in the below? Do I miss something obvious?
I am new to both sklearn and skflow (I know, skflow has been merged into Tensorflow Learn, but I think the example should be the same), but I just combined the examples I found.
from sklearn import datasets, cross_validation, metrics
from sklearn import preprocessing, grid_search
import skflow
# Load dataset
boston = datasets.load_boston()
X, y = boston.data, boston.target
# Split dataset into train / test
X_train, X_test, y_train, y_test=cross_validation.train_test_split(X, y,test_size=0.2, random_state=42)
# scale data (training set) to 0 mean and unit Std. dev
scaler = preprocessing.StandardScaler()
X_train = scaler.fit_transform(X_train)
regressor = skflow.TensorFlowDNNRegressor(hidden_units=[10, 10],
steps=5000, learning_rate=0.1, batch_size=10)
# use a full grid over all parameters
param_grid = {"steps": [200,400],
"learning_rate": [0.1,0.2],
"batch_size": [10,32]}
# run grid search
gs = grid_search.GridSearchCV(regressor, param_grid=param_grid, scoring = 'accuracy', verbose=10, n_jobs=-1,cv=2)
gs.fit(X_train, y_train)
# summarize the results of the grid search
print(gs.best_score_)
print(gs.best_params_)
Thanks for any help!!
Add fit_params to gird_search else TensorFlowDNNRegressor runs forever.
gs = grid_search.GridSearchCV(
regressor, param_grid=param_grid,
scoring = 'accuracy', verbose=10, n_jobs=-1,cv=2
)
to
gs = grid_search.GridSearchCV(
regressor, param_grid=param_grid, scoring = 'accuracy',
verbose=10, n_jobs=-1,cv=2, fit_params={'steps': [200,400]}
)

Resources