Pipeline for more than 2 classifiers - python-3.x

I am trying to build an ensemble using KNN and random forest classifiers.
steps = [('scaler', StandardScaler()),
         ('regressor', VotingClassifier(estimators=[
             ('knn', KNeighborsClassifier()),
             ('clf', RandomForestClassifier())]))]
pipeline = Pipeline(steps)

parameters = [{'knn__n_neighbors': np.arange(1, 50)},
              {'clf__n_estimators': [10, 20, 30],
               'clf__criterion': ['gini', 'entropy'],
               'clf__max_features': [5, 10, 15],
               'clf__max_depth': ['auto', 'log2', 'sqrt', None]}]

X_train, X_test, y_train, y_test = train_test_split(X, y.values.ravel(),
                                                    test_size=0.3, random_state=65)
cv = GridSearchCV(pipeline, param_grid=parameters)
cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)
I have encountered the following error while running the above code:
Invalid parameter knn for estimator Pipeline(steps=[('scaler', StandardScaler()),
('regressor',VotingClassifier(estimators=[('knn', KNeighborsClassifier()),('clf', RandomForestClassifier())]))]). Check the list of available parameters with estimator.get_params().keys()
Since I am new to machine learning, I am having difficulty understanding the error.

I agree that the error message is not very clear, but it is raised by the keys in your parameters grid, not by the VotingClassifier itself. Your pipeline has two steps, 'scaler' and 'regressor', and the VotingClassifier (with its 'knn' and 'clf' estimators) is nested inside the 'regressor' step, so every hyperparameter has to be addressed through that step name. GridSearchCV therefore expects keys such as regressor__knn__n_neighbors instead of knn__n_neighbors, which is exactly what "Invalid parameter knn for estimator Pipeline(...)" is complaining about. You can list all the valid parameter names with pipeline.get_params().keys().
As a side note, 'auto', 'log2' and 'sqrt' are values for max_features, not max_depth, so you probably want to swap those two lists as well.
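A corrected grid, keeping the values from your post (and, as an assumption on my part, swapping the max_features / max_depth lists), could look like this:
parameters = [{'regressor__knn__n_neighbors': np.arange(1, 50)},
              {'regressor__clf__n_estimators': [10, 20, 30],
               'regressor__clf__criterion': ['gini', 'entropy'],
               'regressor__clf__max_features': ['auto', 'log2', 'sqrt', None],
               'regressor__clf__max_depth': [5, 10, 15]}]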
Let me know if it helped.

Related

SKLearn Error with Pipeline and Gridsearch

I would like to first split my data into a test and a train set. Then I want to use GridSearchCV on my training set (internally split into train/validation sets). In the end I want to collect all the test data and do some other things (not in the scope of the question).
I have to scale my data. So I want to handle this problem in a pipeline. Some things in my SVC should be fixed (kernel='rbf', class_weight=...).
When I run the code the following occurs:
"ValueError: Invalid parameter estimator for estimator Pipeline"
I don't understand what I'm doing wrong. I tried to follow this thread: StandardScaler with Pipelines and GridSearchCV
The only difference is that I fix some parameters in my SVC. How can I handle this?
target = np.array(target).ravel()
loo = LeaveOneOut()
loo.get_n_splits(input)

# Outer Loop
for train_index, test_index in loo.split(input):
    X_train, X_test = input[train_index], input[test_index]
    y_train, y_test = target[train_index], target[test_index]

    p_grid = {'estimator__C': np.logspace(-5, 2, 20),
              'estimator__gamma': np.logspace(-5, 3, 20)}

    SVC_Kernel = SVC(kernel='rbf', class_weight='balanced', tol=10e-4,
                     max_iter=200000, probability=False)
    pipe_SVC = Pipeline([('scaler', RobustScaler()), ('SVC', SVC_Kernel)])

    n_splits = 5
    scoring = "f1_micro"
    inner_cv = StratifiedKFold(n_splits=n_splits,
                               shuffle=True, random_state=5)

    clfSearch = GridSearchCV(estimator=pipe_SVC, param_grid=p_grid,
                             cv=inner_cv, scoring='f1_micro', iid=False, n_jobs=-1)
    clfSearch.fit(X_train, y_train)

    print("Best parameters set found on validation set for Support Vector Machine:")
    print()
    print(clfSearch.best_params_)
    print()
    print(clfSearch.best_score_)
    print("Grid scores on validation set:")
    print()
I also tried it this way:
p_grid = {'estimator__C': np.logspace(-5, 2, 20),
          'estimator__gamma': np.logspace(-5, 3, 20),
          'estimator__tol': [10e-4],
          'estimator__kernel': ['rbf'],
          'estimator__class_weight': ['balanced'],
          'estimator__max_iter': [200000],
          'estimator__probability': [False]}
SVC_Kernel = SVC()
This also doesn't work.
The problem is in your p_grid. You are grid searching on your Pipeline, and that doesn't have anything called estimator. It does have something called SVC, so if you want to set that SVC's parameters, you should prefix your keys with SVC__ instead of estimator__. So replace p_grid with:
p_grid = {'SVC__C': np.logspace(-5, 2, 20),
          'SVC__gamma': np.logspace(-5, 3, 20)}
Also, you can replace your outer for loop with the cross_validate function.
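For reference, here is a minimal, untested sketch of that nested-CV pattern, reusing pipe_SVC, the corrected p_grid and inner_cv from the code above (names and scoring are taken from the question; adapt as needed):
from sklearn.model_selection import GridSearchCV, LeaveOneOut, cross_validate

# Inner loop: hyperparameter search on the pipeline (note the SVC__ prefix in p_grid).
clfSearch = GridSearchCV(estimator=pipe_SVC, param_grid=p_grid,
                         cv=inner_cv, scoring='f1_micro', n_jobs=-1)

# Outer loop: leave-one-out evaluation of the tuned model,
# replacing the explicit "for train_index, test_index" loop.
outer_scores = cross_validate(clfSearch, input, target,
                              cv=LeaveOneOut(), scoring='f1_micro')
print(outer_scores['test_score'].mean())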

How to retrieve the training history from LinearSVC model?

I am trying to retrieve the training history of my SVM model to plot its learning curve. Something like:
history = model.fit(X, Y, validation_split=0.33, epochs=150, batch_size=10)
I have already looked at GridSearchCV best model CV history and https://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html to do that, but the first approach did not work for the LinearSVC model, and the second approach is not quite what I would like to do (as far as I understood, if I use the learning curve method I will have to train my model again after the grid search).
model = GridSearchCV(LinearSVC(verbose=0),
                     {'C': [1, 10, 100, 1000]}, cv=5,
                     iid=False, scoring='recall_macro')
model.fit(x_train, y_train)
_, loss, val_loss = learning_curve(model.best_estimator_.fit(X, Y, cv=5))
How can I get this history? I am using sklearn 0.20.3.
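For context, a rough sketch of what the learning_curve call from the linked scikit-learn example would look like when applied to the best estimator from the grid search (as noted above, this retrains the model for every training-set size rather than reusing any history from the search; x_train, y_train and the scoring are the names used in the question):
from sklearn.model_selection import learning_curve

# Sketch only: learning_curve refits model.best_estimator_ repeatedly,
# it does not return a per-epoch loss history the way Keras' fit() does.
train_sizes, train_scores, test_scores = learning_curve(
    model.best_estimator_, x_train, y_train, cv=5, scoring='recall_macro')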

Stacking StandardScaler() with RFECV and GridSearchCV

So I found out that StandardScaler() can make my RFECV inside my GridSearchCV, each with a nested 3-fold cross-validation, run faster. Without StandardScaler(), my code ran for more than 2 days, so I canceled it and decided to inject StandardScaler() into the process. But now it has been running for more than 4 hours and I am not sure if I have done it right. Here is my code:
# Choose Linear SVM as classifier
LSVM = SVC(kernel='linear')
selector = RFECV(LSVM, step=1, cv=3, scoring='f1')

param_grid = [{'estimator__C': [0.001, 0.01, 0.1, 1, 10, 100]}]

clf = make_pipeline(StandardScaler(),
                    GridSearchCV(selector,
                                 param_grid,
                                 cv=3,
                                 refit=True,
                                 scoring='f1'))
clf.fit(X, Y)
I don't think I have gotten it right, to be honest, because I think the StandardScaler() should be put inside the GridSearchCV() so that the data is normalized in each fold, not just once (?). Please correct me if I am wrong or if my pipeline is incorrect, and explain why it is still running for such a long time.
I have 8,000 rows of 145 features to be pruned by RFECV, and 6 C-Values to be pruned by GridSearchCV. So for each C-Value, the best feature set is determined by the RFECV.
Thanks!
Update:
So I put the StandardScaler inside the RFECV like this:
clf = SVC(kernel='linear')
kf = KFold(n_splits=3, shuffle=True, random_state=0)

estimators = [('standardize', StandardScaler()),
              ('clf', clf)]

class Mypipeline(Pipeline):
    @property
    def coef_(self):
        return self._final_estimator.coef_

    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_

pipeline = Mypipeline(estimators)

rfecv = RFECV(estimator=pipeline, cv=kf, scoring='f1', verbose=10)
param_grid = [{'estimator__svc__C': [0.001, 0.01, 0.1, 1, 10, 100]}]
clf = GridSearchCV(rfecv, param_grid, cv=3, scoring='f1', verbose=10)
But it still throws out the following error:
ValueError: Invalid parameter C for estimator Pipeline(memory=None,
    steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)),
           ('svc', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
                       decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
                       max_iter=-1, probability=False, random_state=None, shrinking=True,
                       tol=0.001, verbose=False))]).
Check the list of available parameters with estimator.get_params().keys().
Kumar is right. Also, you might want to turn on verbose in the GridSearchCV. You could also put a limit on the number of iterations of the SVC, starting from a very small number, like 5, just to make sure that the problem is not with convergence.
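For illustration, a small, untested sketch of those two suggestions applied to the original code from the question (max_iter=5 is an arbitrary, deliberately tiny cap used only to rule out convergence as the bottleneck):
# Cap the SVC iterations and turn on verbose output in both RFECV and GridSearchCV.
LSVM = SVC(kernel='linear', max_iter=5)
selector = RFECV(LSVM, step=1, cv=3, scoring='f1', verbose=10)
param_grid = [{'estimator__C': [0.001, 0.01, 0.1, 1, 10, 100]}]
clf = GridSearchCV(selector, param_grid, cv=3, refit=True,
                   scoring='f1', verbose=10)
clf.fit(X, Y)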

Error message on attempting to fit training data using GridSearch function

from sklearn.preprocessing import PolynomialFeatures
polyreg = PolynomialFeatures(degree = 4)
param_grid = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
grid_search_polyreg = GridSearchCV(polyreg, param_grid, cv = 5)
grid_search_polyreg.fit(x_train, y_train)
grid_search_polyreg.score(x_test, y_test)
print("Best Parameters for polynomial regression:
{}".format(grid_search_polyreg.best_params_))
print("Best Score for polynomial regression:
{:.2f}".format(grid_search_polyreg.best_score_))
TypeError: If no scoring is specified, the estimator passed should
have a 'score' method. The estimator PolynomialFeatures(degree=4,
include_bias=True, interaction_only=False) does not.
1) I understand that alpha is not a parameter of PolynomialFeatures. But when I tried to remove alpha and fit the data, it did not work.
2) Does that mean I am not supposed to use grid search for getting the scores of a KNN regressor, or of linear and kernel SVMs?
I am new to python and any suggestion is much appreciated. Thanks in advance.
sklearn.preprocessing.PolynomialFeatures() doesn't have a scoring function. It's not actually an estimator or machine learning model, it just transforms a matrix. You can have it as part of your pipeline and test its parameters, but you have to pass an actual estimator with a scoring function to GridSearchCV.
Fitting to data has a different meaning when you're dealing with transformers vs. estimators; only in the latter case does it mean "train".
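For example, here is a minimal, untested sketch that puts PolynomialFeatures in a pipeline with an estimator that actually has an alpha parameter and a score method (Ridge is only an assumption about what you meant to tune; x_train and y_train are the names from your code):
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# The transformer's and the estimator's parameters are tuned together,
# using the step-name prefixes poly__ and ridge__.
pipe = Pipeline([('poly', PolynomialFeatures(degree=4)),
                 ('ridge', Ridge())])
param_grid = {'poly__degree': [2, 3, 4],
              'ridge__alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
grid_search_polyreg = GridSearchCV(pipe, param_grid, cv=5)
grid_search_polyreg.fit(x_train, y_train)
print("Best Parameters for polynomial regression: {}".format(grid_search_polyreg.best_params_))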

GridSearchCV - XGBoost - Early Stopping

I am trying to do a hyperparameter search with scikit-learn's GridSearchCV on XGBoost. During the grid search I'd like it to stop early, since that reduces the search time drastically and (I expect) gives better results on my prediction/regression task. I am using XGBoost via its scikit-learn API.
model = xgb.XGBRegressor()
GridSearchCV(model, paramGrid, verbose=verbose,
             fit_params={'early_stopping_rounds': 42},
             cv=TimeSeriesSplit(n_splits=cv).get_n_splits([trainX, trainY]),
             n_jobs=n_jobs, iid=iid).fit(trainX, trainY)
I tried to pass the early stopping parameters via fit_params, but then it throws this error, which is basically because of the lack of a validation set, which early stopping requires:
/opt/anaconda/anaconda3/lib/python3.5/site-packages/xgboost/callback.py in callback(env=XGBoostCallbackEnv(model=<xgboost.core.Booster o...teration=4000, rank=0, evaluation_result_list=[]))
187 else:
188 assert env.cvfolds is not None
189
190 def callback(env):
191 """internal function"""
--> 192 score = env.evaluation_result_list[-1][1]
score = undefined
env.evaluation_result_list = []
193 if len(state) == 0:
194 init(env)
195 best_score = state['best_score']
196 best_iteration = state['best_iteration']
How can I apply GridSearchCV on XGBoost while using early_stopping_rounds?
Note: the model works without grid search, and GridSearchCV works without fit_params={'early_stopping_rounds': 42}.
When using early_stopping_rounds you also have to pass eval_metric and eval_set as input parameters to the fit method. Early stopping works by computing the error on an evaluation set; the error has to improve at least once every early_stopping_rounds rounds, otherwise the generation of additional trees is stopped early.
See the documentation of xgboost's fit method for details.
Here is a minimal, fully working example:
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import TimeSeriesSplit
cv = 2
trainX = [[1], [2], [3], [4], [5]]
trainY = [1, 2, 3, 4, 5]

# these are the evaluation sets
testX = trainX
testY = trainY

paramGrid = {"subsample": [0.5, 0.8]}

fit_params = {"early_stopping_rounds": 42,
              "eval_metric": "mae",
              "eval_set": [[testX, testY]]}

model = xgb.XGBRegressor()

gridsearch = GridSearchCV(model, paramGrid, verbose=1,
                          fit_params=fit_params,
                          cv=TimeSeriesSplit(n_splits=cv).get_n_splits([trainX, trainY]))
gridsearch.fit(trainX,trainY)
An update to @glao's answer and a response to @Vasim's comment/question, as of sklearn 0.21.3 (note that fit_params has been moved out of the instantiation of GridSearchCV and into the fit() method; also, the import specifically pulls in the sklearn wrapper module from xgboost):
import xgboost.sklearn as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import TimeSeriesSplit
cv = 2
trainX = [[1], [2], [3], [4], [5]]
trainY = [1, 2, 3, 4, 5]

# these are the evaluation sets
testX = trainX
testY = trainY

paramGrid = {"subsample": [0.5, 0.8]}

fit_params = {"early_stopping_rounds": 42,
              "eval_metric": "mae",
              "eval_set": [[testX, testY]]}

model = xgb.XGBRegressor()

gridsearch = GridSearchCV(model, paramGrid, verbose=1,
                          cv=TimeSeriesSplit(n_splits=cv).get_n_splits([trainX, trainY]))
gridsearch.fit(trainX, trainY, **fit_params)
Here's a solution that works in a Pipeline with GridSearchCV. The challenge occurs when you have a pipeline that is required to pre-process your training data, for example, when X is a text document and you need a TfidfVectorizer to vectorize it.
Override the XGBRegressor or XGBClassifier .fit() Function
This step uses train_test_split() to select the specified number of
validation records from X for the eval_set and then passes the
remaining records along to fit().
A new parameter eval_test_size is added to .fit() to control the number of validation records. (See the train_test_split test_size documentation.)
**kwargs passes along any other parameters added by the user for the XGBRegressor.fit() function.
from xgboost.sklearn import XGBRegressor
from sklearn.model_selection import train_test_split

class XGBRegressor_ES(XGBRegressor):

    def fit(self, X, y, *, eval_test_size=None, **kwargs):
        # By default, fit on all of X, y (no early-stopping split requested).
        X_train, y_train = X, y
        if eval_test_size is not None:
            params = super(XGBRegressor, self).get_xgb_params()
            X_train, X_test, y_train, y_test = train_test_split(
                X, y, test_size=eval_test_size, random_state=params['random_state'])
            eval_set = [(X_test, y_test)]
            # Could add (X_train, y_train) to eval_set
            # to get .eval_results() for both train and test
            # eval_set = [(X_train, y_train), (X_test, y_test)]
            kwargs['eval_set'] = eval_set
        return super(XGBRegressor_ES, self).fit(X_train, y_train, **kwargs)
Example Usage
Below is a multistep pipeline that includes multiple transformations to X. The pipeline's fit() function passes the new evaluation parameter to the XGBRegressor_ES class above as xgbr__eval_test_size=200. In this example:
X_train contains text documents passed to the pipeline.
XGBRegressor_ES.fit() uses train_test_split() to select 200 records from X_train for the validation set and early stopping. (This could also be a proportion such as xgbr__eval_test_size=0.2.)
The remaining records in X_train are passed along to XGBRegressor.fit() for the actual fit().
Early stopping may now occur after 75 rounds without improvement in the evaluation metric, for each CV fold in a grid search.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectPercentile, f_regression

xgbr_pipe = Pipeline(steps=[('tfidf', TfidfVectorizer()),
                            ('vt', VarianceThreshold()),
                            ('scaler', StandardScaler()),
                            ('Sp', SelectPercentile()),
                            ('xgbr', XGBRegressor_ES(n_estimators=2000,
                                                     objective='reg:squarederror',
                                                     eval_metric='mae',
                                                     learning_rate=0.0001,
                                                     random_state=7))])
X_train = train_idxs['f_text'].values
y_train = train_idxs['Pct_Change_20'].values
Example Fitting the Pipeline:
%time xgbr_pipe.fit(X_train, y_train,
                    xgbr__eval_test_size=200,
                    xgbr__eval_metric='mae',
                    xgbr__early_stopping_rounds=75)
Example Fitting GridSearchCV:
learning_rate = [0.0001, 0.001, 0.01, 0.05, 0.1, 0.2, 0.3]
param_grid = dict(xgbr__learning_rate=learning_rate)

grid_search = GridSearchCV(xgbr_pipe, param_grid,
                           scoring="neg_mean_absolute_error", n_jobs=-1, cv=10)
grid_result = grid_search.fit(X_train, y_train,
                              xgbr__eval_test_size=200,
                              xgbr__eval_metric='mae',
                              xgbr__early_stopping_rounds=75)
