AdaBoost in Pipeline with GridSearchCV (scikit-learn)

I would like to use the AdaBoostClassifier with LinearSVC as base estimator. I want to do a gridsearch on some of the parameters in LinearSVC. Also I have to scale my features.
p_grid = {'base_estimator__C': np.logspace(-5, 3, 10)}
n_splits = 5
inner_cv = StratifiedKFold(n_splits=n_splits,
                           shuffle=True, random_state=5)
SVC_Kernel = LinearSVC(multi_class='crammer_singer', tol=10e-3,
                       max_iter=10000, class_weight='balanced')
ABC = AdaBoostClassifier(base_estimator=SVC_Kernel, n_estimators=600,
                         learning_rate=1.5, algorithm="SAMME")

# kk is the outer cross-validation splitter (its definition is not shown in the question)
for train_index, test_index in kk.split(input):
    X_train, X_test = input[train_index], input[test_index]
    y_train, y_test = target[train_index], target[test_index]

    pipe_SVC = Pipeline([('scaler', RobustScaler()),
                         ('AdaBoostClassifier', ABC)])
    clfSearch = GridSearchCV(estimator=pipe_SVC, param_grid=p_grid,
                             cv=inner_cv, scoring='f1_macro', iid=False, n_jobs=-1)
    clfSearch.fit(X_train, y_train)
The following error occurs:
ValueError: Invalid parameter base_estimator for estimator Pipeline(memory=None,
steps=[('scaler',
RobustScaler(copy=True, quantile_range=(25.0, 75.0),
with_centering=True, with_scaling=True)),
('AdaBoostClassifier',
AdaBoostClassifier(algorithm='SAMME',
base_estimator=LinearSVC(C=1.0,
class_weight='balanced',
dual=True,
fit_intercept=True,
intercept_scaling=1,
loss='squared_hinge',
max_iter=10000,
multi_class='crammer_singer',
penalty='l2',
random_state=None,
tol=0.01,
verbose=0),
learning_rate=1.5, n_estimators=600,
random_state=None))],
verbose=False). Check the list of available parameters with `estimator.get_params().keys()`.
Without the AdaBoostClassifier the pipeline works, so I think that is where the problem lies.

I think your p_grid should be defined as follows:
p_grid = {'AdaBoostClassifier__base_estimator__C': np.logspace(-5, 3, 10)}
Try pipe_SVC.get_params() if you are not sure about the parameter name.
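For reference, here is a minimal sketch of the whole corrected setup, reusing the estimators from the question (iid is omitted since newer scikit-learn versions removed it); each level of nesting contributes its name, joined by double underscores:

import numpy as np
from sklearn.preprocessing import RobustScaler
from sklearn.svm import LinearSVC
from sklearn.ensemble import AdaBoostClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, StratifiedKFold

pipe_SVC = Pipeline([
    ('scaler', RobustScaler()),
    ('AdaBoostClassifier', AdaBoostClassifier(
        base_estimator=LinearSVC(multi_class='crammer_singer',
                                 class_weight='balanced', max_iter=10000),
        n_estimators=600, learning_rate=1.5, algorithm='SAMME')),
])

# <pipeline step>__<AdaBoost param>__<LinearSVC param>
p_grid = {'AdaBoostClassifier__base_estimator__C': np.logspace(-5, 3, 10)}

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=5)
clfSearch = GridSearchCV(pipe_SVC, param_grid=p_grid, cv=inner_cv,
                         scoring='f1_macro', n_jobs=-1)
# clfSearch.fit(X_train, y_train)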

Related

Pipeline for more than 2 classifiers

I am trying to build an ensemble using KNN and random forest classifiers.
steps = [('scaler', StandardScaler()),
         ('regressor', VotingClassifier(estimators=[
             ('knn', KNeighborsClassifier()),
             ('clf', RandomForestClassifier())]))]
pipeline = Pipeline(steps)

parameters = [{'knn__n_neighbors': np.arange(1, 50)},
              {'clf__n_estimators': [10, 20, 30],
               'clf__criterion': ['gini', 'entropy'],
               'clf__max_features': [5, 10, 15],
               'clf__max_depth': ['auto', 'log2', 'sqrt', None]}]

X_train, X_test, y_train, y_test = train_test_split(X, y.values.ravel(),
                                                    test_size=0.3, random_state=65)
cv = GridSearchCV(pipeline, param_grid=parameters)
cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)
I have encountered the following error while running the above code:
Invalid parameter knn for estimator Pipeline(steps=[('scaler', StandardScaler()),
('regressor',VotingClassifier(estimators=[('knn', KNeighborsClassifier()),('clf', RandomForestClassifier())]))]). Check the list of available parameters with estimator.get_params().keys()
Since I am new to machine learning, I am having difficulty understanding the error.
I agree that the error message is not clear at first, but it tells you exactly what is wrong: GridSearchCV cannot find a parameter named knn on the estimator it was given, which is the Pipeline, not the VotingClassifier. Your VotingClassifier sits inside a pipeline step named 'regressor', so every key in the grid must carry the full path of step names joined by double underscores: 'regressor__knn__n_neighbors' instead of 'knn__n_neighbors', 'regressor__clf__n_estimators' instead of 'clf__n_estimators', and so on. You can list all valid names with pipeline.get_params().keys(), as the error message suggests.
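A minimal sketch of the corrected grid, reusing the step name 'regressor' from the question (note also that max_depth expects integers or None; the string values in your grid would trigger a separate error once the names are fixed, so plausible integer depths are substituted here for illustration):

import numpy as np

parameters = [
    # prefix every key with the pipeline step name 'regressor'
    {'regressor__knn__n_neighbors': np.arange(1, 50)},
    {'regressor__clf__n_estimators': [10, 20, 30],
     'regressor__clf__criterion': ['gini', 'entropy'],
     'regressor__clf__max_features': [5, 10, 15],
     'regressor__clf__max_depth': [2, 4, 8, None]},  # integers or None, not strings
]
cv = GridSearchCV(pipeline, param_grid=parameters)

Let me know if it helped.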

Google Colab syntax error using Keras Hyperas

I am trying to optimize hyperparameters for the IMDB dataset using Keras Hyperas, but I am getting an error. I used this code (https://www.kaggle.com/kt66nf/hyperparameter-optimization-using-keras-hyperas) as a reference.
CODE
def vectorize_sequences(sequences, dimension=10000):
    # Create an all-zero matrix of shape (len(sequences), dimension)
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.  # set specific indices of results[i] to 1s
    return results

def data():
    (train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
    x_train = vectorize_sequences(train_data)  # vectorized training data
    x_test = vectorize_sequences(test_data)
    y_train = np.asarray(train_labels).astype('float32')  # vectorized labels
    y_test = np.asarray(test_labels).astype('float32')
    return x_train, y_train, x_test, y_test

def create_model(x_train, y_train, x_test, y_test):
    model = models.Sequential()
    model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
    model.add(layers.Dense(16, activation='relu'))
    model.add(Dropout({{uniform(0.5, 1)}}))
    model.add(layers.Dense(64, activation='relu'))
    model.add(layers.Dense(64, activation='relu'))
    model.add(Dropout({{uniform(0.5, 1)}}))
    model.add(layers.Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', metrics=['accuracy'],
                  optimizer={{choice(['rmsprop', 'adam', 'sgd'])}})
    model.fit(x_train, y_train,
              batch_size={{choice([16, 32, 64])}},
              epochs={{choice([25, 50, 75, 100])}},
              validation_data=(x_test, y_test))
    score, acc = model.evaluate(x_test, y_test, verbose=0)
    print('Test accuracy:', acc)
    return {'loss': -acc, 'status': STATUS_OK, 'model': model}
Then I mounted Google Drive and set the path to the notebook:
best_run, best_model = optim.minimize(model=create_model,
                                      data=data,
                                      max_evals=15,
                                      algo=tpe.suggest,
                                      notebook_name='Copy of imbd',  # name of the notebook
                                      trials=Trials())
This last call is where I got the error:
File "", line 182
    ]
    ^
SyntaxError: invalid syntax

GridSearchCV with Invalid parameter gamma for estimator LogisticRegression

I need to perform a grid search on the parameters listed below for a Logistic Regression classifier, using recall for scoring and three-fold cross-validation.
The data is in a csv file (11.1 MB); the download link is: https://drive.google.com/file/d/1cQFp7HteaaL37CefsbMNuHqPzkINCVzs/view?usp=sharing
I have grid_values = {'gamma':[0.01, 0.1, 1, 10, 100]}
I need to apply L1 and L2 penalties in a Logistic Regression.
I couldn't verify whether the scoring runs, because I get the following error:
Invalid parameter gamma for estimator LogisticRegression. Check the list of available parameters with estimator.get_params().keys().
This is my code:
from sklearn.model_selection import train_test_split

df = pd.read_csv('fraud_data.csv')
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def LogisticR_penalty():
    from sklearn.model_selection import GridSearchCV
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    grid_values = {'gamma': [0.01, 0.1, 1, 10, 100]}

    # train the model with many parameters for "C" and penalty='l1'
    lr_l1 = LogisticRegression(penalty='l1')
    grid_lr_l1 = GridSearchCV(lr_l1, param_grid=grid_values, cv=3, scoring='recall')
    grid_lr_l1.fit(X_train, y_train)
    y_decision_fn_scores_recall = grid_lr_l1.decision_function(X_test)

    lr_l2 = LogisticRegression(penalty='l2')
    grid_lr_l2 = GridSearchCV(lr_l2, param_grid=grid_values, cv=3, scoring='recall')
    grid_lr_l2.fit(X_train, y_train)
    y_decision_fn_scores_recall = grid_lr_l2.decision_function(X_test)

    # The precision, recall, and accuracy scores for every combination
    # of the parameters in param_grid are stored in cv_results_
    results = pd.DataFrame()
    results['l1_results'] = pd.DataFrame(grid_lr_l1.cv_results_)
    results['l1_results'] = results['l2_results'].sort_values(by='mean_test_precision_score', ascending=False)
    results['l2_results'] = pd.DataFrame(grid_lr_l2.cv_results_)
    results['l2_results'] = results['l2_results'].sort_values(by='mean_test_precision_score', ascending=False)
    return results

LogisticR_penalty()
I expected cv_results_ to contain the average test score for each parameter combination, for example under mean_test_precision_score, but I am not sure.
The output is: ValueError: Invalid parameter gamma for estimator LogisticRegression. Check the list of available parameters with estimator.get_params().keys().
The error message contains the answer to your question. You can call estimator.get_params().keys() to see all available parameters for your estimator:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
print(lr.get_params().keys())
Output:
dict_keys(['C', 'class_weight', 'dual', 'fit_intercept', 'intercept_scaling', 'l1_ratio', 'max_iter', 'multi_class', 'n_jobs', 'penalty', 'random_state', 'solver', 'tol', 'verbose', 'warm_start'])
According to scikit-learn's documentation, LogisticRegression has no gamma parameter, but it has a C parameter that controls the inverse of the regularization strength.
If you change grid_values = {'gamma':[0.01, 0.1, 1, 10, 100]} to grid_values = {'C':[0.01, 0.1, 1, 10, 100]}, your code should work.
My code contained some errors; the main one was using param_grid incorrectly. I had to apply the L1 and L2 penalties with C values of 0.01, 0.1, 1, 10, 100. The right way to do this is:
grid_values = {'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10, 100]}
Then I had to correct the way I was training my logistic regression, and the way I retrieved the scores from cv_results_ and averaged them.
Here is my code:
from sklearn.model_selection import train_test_split

df = pd.read_csv('fraud_data.csv')
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def LogisticR_penalty():
    from sklearn.model_selection import GridSearchCV
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    grid_values = {'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10, 100]}

    # train the model with many parameters for "C" and both penalties
    lr = LogisticRegression()
    # We use GridSearchCV to find the value of the range that optimizes a given measurement metric.
    grid_lr_recall = GridSearchCV(lr, param_grid=grid_values, cv=3, scoring='recall')
    grid_lr_recall.fit(X_train, y_train)
    y_decision_fn_scores_recall = grid_lr_recall.decision_function(X_test)

    # The precision, recall, and accuracy scores for every combination
    # of the parameters in param_grid are stored in cv_results_
    CVresults = pd.DataFrame(grid_lr_recall.cv_results_)

    # per-split test scores and their mean
    split_test_scores = np.vstack((CVresults['split0_test_score'],
                                   CVresults['split1_test_score'],
                                   CVresults['split2_test_score']))
    mean_scores = split_test_scores.mean(axis=0).reshape(5, 2)
    return mean_scores

LogisticR_penalty()
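As a side note, cv_results_ also exposes the per-combination average directly under the key 'mean_test_score', so the manual stacking and averaging above can be collapsed to:

# 'mean_test_score' is precomputed by GridSearchCV; the reshape mirrors
# the manual averaging above (5 C values x 2 penalties)
CVresults = pd.DataFrame(grid_lr_recall.cv_results_)
mean_scores = CVresults['mean_test_score'].values.reshape(5, 2)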

Stacking StandardScaler() with RFECV and GridSearchCV

So I found out that StandardScaler() can make my RFECV inside my GridSearchCV, each with a nested 3-fold cross-validation, run faster. Without StandardScaler(), my code ran for more than 2 days, so I canceled it and decided to inject StandardScaler() into the process. But now it has been running for more than 4 hours and I am not sure if I have done it right. Here is my code:
# Choose linear SVM as the classifier
LSVM = SVC(kernel='linear')
selector = RFECV(LSVM, step=1, cv=3, scoring='f1')
param_grid = [{'estimator__C': [0.001, 0.01, 0.1, 1, 10, 100]}]
clf = make_pipeline(StandardScaler(),
                    GridSearchCV(selector,
                                 param_grid,
                                 cv=3,
                                 refit=True,
                                 scoring='f1'))
clf.fit(X, Y)
To be honest, I don't think I have gotten it right, because I believe the StandardScaler() should be put inside the GridSearchCV() so that the data is normalized in each fold, not just once. Please correct me if I am wrong, or if my pipeline is incorrect and that is why it is still running for so long.
I have 8,000 rows of 145 features to be pruned by RFECV, and 6 C values to be searched by GridSearchCV. So for each C value, the best feature set is determined by the RFECV.
Thanks!
Update:
So I put the StandardScaler inside the RFECV like this:
clf = SVC(kernel='linear')
kf = KFold(n_splits=3, shuffle=True, random_state=0)
estimators = [('standardize', StandardScaler()),
              ('clf', clf)]

class Mypipeline(Pipeline):
    @property
    def coef_(self):
        return self._final_estimator.coef_

    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_

pipeline = Mypipeline(estimators)

rfecv = RFECV(estimator=pipeline, cv=kf, scoring='f1', verbose=10)
param_grid = [{'estimator__svc__C': [0.001, 0.01, 0.1, 1, 10, 100]}]
clf = GridSearchCV(rfecv, param_grid, cv=3, scoring='f1', verbose=10)
But it still throws the following error:
ValueError: Invalid parameter C for estimator Pipeline(memory=None,
    steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)),
           ('svc', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
                       decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
                       max_iter=-1, probability=False, random_state=None, shrinking=True,
                       tol=0.001, verbose=False))]). Check the list of available parameters with estimator.get_params().keys().
Kumar is right. Also, you might want to turn on verbose in the GridSearchCV. You could also add a limit to the number of iterations of the SVC, starting from a very small number like 5, just to make sure that the problem is not with convergence.
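For what it's worth, here is a minimal sketch of an arrangement that keeps the scaler inside the cross-validation loop, so it is refit on every training fold; the step names are chosen here for illustration, and the C values are taken from the question:

from sklearn.svm import SVC
from sklearn.feature_selection import RFECV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# scaler and selector both live inside the pipeline, so each CV fold
# fits its own scaler before RFECV prunes features
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('rfecv', RFECV(SVC(kernel='linear'), step=1, cv=3, scoring='f1')),
])

# <pipeline step>__<RFECV param>__<SVC param>
param_grid = {'rfecv__estimator__C': [0.001, 0.01, 0.1, 1, 10, 100]}

clf = GridSearchCV(pipe, param_grid, cv=3, scoring='f1')
# clf.fit(X, Y)

With this layout, GridSearchCV clones the whole pipeline for each fold, so scaling, feature elimination, and the SVC are all fit on the training portion only.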

GridSearch / make_scorer strange results with xgboost model

I am trying to use sklearn's grid search with a model created by xgboost. To do this, I am creating a custom scorer based on NDCG evaluation. I can successfully use Snippet 1, but it is too messy / hacky, and I would prefer to use good old sklearn to simplify the code. I tried to implement GridSearchCV, and the result is completely off: for the same X and y sets I get NDCG@k = 0.8 with Snippet 1 versus 0.5 with Snippet 2. Obviously there is something I am not doing right here ...
The following pieces of code return very different results:
Snippet 1:
kf = StratifiedKFold(y, n_folds=5, shuffle=True, random_state=42)
max_depth = [6]
learning_rate = [0.22]
n_estimators = [43]
reg_alpha = [0.1]
reg_lambda = [10]

for md in max_depth:
    for lr in learning_rate:
        for ne in n_estimators:
            for ra in reg_alpha:
                for rl in reg_lambda:
                    xgb = XGBClassifier(objective='multi:softprob',
                                        max_depth=md,
                                        learning_rate=lr,
                                        n_estimators=ne,
                                        reg_alpha=ra,
                                        reg_lambda=rl,
                                        subsample=0.6, colsample_bytree=0.6, seed=0)
                    print([md, lr, ne])
                    score = []
                    for train_index, test_index in kf:
                        X_train, X_test = X[train_index], X[test_index]
                        y_train, y_test = y[train_index], y[test_index]
                        xgb.fit(X_train, y_train)
                        y_pred = xgb.predict_proba(X_test)
                        score.append(ndcg_scorer(y_test, y_pred))
                    print('all scores: %s' % score)
                    print('average score: %s' % np.mean(score))
Snippet 2:
from sklearn.grid_search import GridSearchCV

params = {
    'max_depth': [6],
    'learning_rate': [0.22],
    'n_estimators': [43],
    'reg_alpha': [0.1],
    'reg_lambda': [10],
    'subsample': [0.6],
    'colsample_bytree': [0.6]
}
xgb = XGBClassifier(objective='multi:softprob', seed=0)
scorer = make_scorer(ndcg_scorer, needs_proba=True)
gs = GridSearchCV(xgb, params, cv=5, scoring=scorer, verbose=10, refit=False)
gs.fit(X, y)
gs.best_score_
While Snippet 1 gives me the result I expect, the score returned by Snippet 2 is not consistent with ndcg_scorer.
The problem is with cv in GridSearchCV(xgb, params, cv=5, scoring=scorer, verbose=10, refit=False). It can receive a KFold / StratifiedKFold object instead of an int. Contrary to what the documentation says, it seems that passing an int does not invoke StratifiedKFold by default, but another splitter, maybe KFold.
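If that is the cause, one way to make the two snippets comparable is to hand GridSearchCV the exact StratifiedKFold object from Snippet 1; a minimal sketch against the old sklearn.grid_search / sklearn.cross_validation API that the question already imports:

from sklearn.cross_validation import StratifiedKFold
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import make_scorer

# reuse the exact splitter from Snippet 1 so both snippets see the same folds
kf = StratifiedKFold(y, n_folds=5, shuffle=True, random_state=42)
scorer = make_scorer(ndcg_scorer, needs_proba=True)
gs = GridSearchCV(xgb, params, cv=kf, scoring=scorer, verbose=10, refit=False)
gs.fit(X, y)
print(gs.best_score_)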
