make custom scorer with GridSearchCV - python-3.x

I have the code below, where I'm trying to use a custom scorer I defined, "custom_loss_five", with GridSearchCV to tune hyperparameters. Sample data is included as well. I'm getting the error 'numpy.dtype' object has no attribute 'base_dtype'. I think this is because I'm mixing Keras code with sklearn. I also use the same "custom_loss_five" function to train a neural network, which is why I wrote it with Keras. If anyone could point out the issue and show me how to adapt the function for use with GridSearchCV, I would appreciate it.
sample data:
print(x_train_scld[:5])
[[ 0.37773519 2.0109691 0.49644224 0.21679945 0.538941 1.99144889
2.15011467 1.20312084 0.86114816 0.79507318 -0.45602028 0.07146743
-0.19524294 -0.33405545 -0.60264522 1.26724727 1.44991588 0.74630967
0.16529837 0.89613455 0.3253014 2.19166429 0.64865429 0.12894674
0.46995314 3.41479052 4.44308499 1.83182458 1.54348561 2.50155582]
[ 0.32029317 0.1214269 0.28824456 0.13510828 -0.0851059 -0.0057386
-0.31671716 0.0303454 0.32754165 -0.15354084 -0.36310852 -0.34419771
-0.28347519 -0.28927174 -0.39507256 -0.2039463 -0.49919802 0.12281647
-0.56756272 -0.30637335 0.10701249 0.21461633 0.17531634 -0.04414507
0.19574444 0.36354262 -1.23318869 0.59029124 0.28936372 0.19248437]
[ 0.25843254 0.29037034 0.21339798 0.12738073 0.28185716 -0.47995085
-0.13321816 0.14228058 -3.69915162 -0.10246162 0.26193423 0.12807553
0.18956053 0.12487671 -0.28174435 -0.71770499 -0.34455425 0.00729992
-0.70102685 -0.57022389 0.59171701 0.77319193 0.52065985 -1.37655715
0.59387438 -1.52826854 0.18054306 0.76212977 0.3639211 0.08726502]
[-0.70482588 -0.32963569 -0.74849491 -0.86505667 0.10026287 -0.87877366
-1.06584707 -1.19559926 0.34039964 0.10112554 -0.62427503 -0.3134676
-0.65996358 -0.52932857 0.11989554 -0.95345177 -0.67459484 -0.82130922
-0.52228025 -0.38191412 -0.75239269 -0.31180246 -0.7418967 -0.7432583
0.12191902 -0.97620932 -1.02049823 -1.20098216 -0.02333216 -0.24853266]
[-0.36680171 -0.14757043 -0.41413663 -0.56754624 -0.34512544 -0.76162172
-0.72684687 -0.61557149 0.31896966 -0.25351016 -0.6357623 0.12484078
-0.71632135 -0.51097128 0.26933611 -0.53549047 -0.54070413 -0.36472263
-0.24581883 -0.67901706 -0.44128802 0.16221265 -0.42239358 -0.52459003
0.34339528 -0.43064345 -1.23318869 -0.23310168 0.44404246 -0.40964978]]
print(x_test_scld[:5])
[[ 2.60641850e-01 -7.18369636e-01 3.27138629e-01 -1.76172773e+00
4.67645320e-01 1.53766591e+00 7.62837058e-01 4.07109050e-01
7.71142242e-01 9.80417766e-01 5.10262027e-01 5.66383900e-01
9.28678845e-01 2.06576727e-01 9.68389151e-01 1.48288576e+00
7.53349504e-01 7.04842193e-01 7.80186706e-01 6.43850055e-01
1.43107505e-01 -7.20312971e-01 2.96065817e-01 -4.51322867e-02
1.93107816e-01 7.41280492e-01 3.28514299e-01 4.47039330e-02
1.39136160e-01 4.94989991e-01]
[-7.51730115e-02 4.92568820e-02 -7.29146850e-02 -2.86318841e-01
1.00026599e+00 4.43886212e-01 4.80336890e-01 6.71683119e-01
8.61148159e-01 5.21434522e-01 -3.65135682e-01 -4.32021118e-01
-4.10049198e-01 -3.01778906e-01 -4.27568719e-02 -1.34413479e+00
-4.09570872e-02 1.64283954e-01 -3.04209384e-01 -7.10176931e-03
7.32148655e-03 -2.90459367e+00 2.31719950e-02 -1.37655715e+00
1.44286672e+00 1.07281572e+00 1.19548020e+00 1.44805187e+00
1.33316704e+00 1.55622575e+00]
[-1.23777794e-01 -3.83763205e-01 -1.65737513e-01 -3.43999436e-01
3.58604868e-01 -3.45623859e-01 -2.89602186e-01 -3.38277511e-01
8.23494778e-03 2.97415674e-01 -6.27653637e-01 -6.42441486e-01
-7.17707195e-01 -4.34516210e-01 6.01100047e-01 -2.64325075e-01
-2.31751338e-01 4.13624916e-02 7.46820672e-01 3.84336779e-01
-3.24408912e-01 -5.30945125e-01 -3.14685046e-01 -4.13363730e-01
6.43970206e-01 -2.37091815e-01 -1.45963962e-01 -2.97594271e-02
7.54512744e-01 6.49530907e-01]
[ 1.06041146e+00 3.61350612e-02 9.93240469e-01 1.11126264e+00
-2.54537983e-01 -2.50709092e-01 -3.56042668e-02 -1.19559926e+00
-2.25351836e-01 -4.65124054e-01 -4.64466800e-01 -1.10808348e+00
-4.47005113e-01 -2.07571731e-01 -1.11908130e+00 -8.49190558e-01
-5.40704133e-01 -6.40037086e-01 -1.10737748e+00 -9.30940117e-01
9.76730527e-01 2.34863210e-01 9.02228200e-01 9.43399666e-01
-1.25487123e-02 -1.70804996e-03 4.83277659e-01 7.07714236e-01
5.60886115e-01 -4.38009686e-01]
[ 3.57851416e-01 1.87811066e+00 2.77785646e-01 2.23975029e-01
-3.66933526e-01 -9.49100986e-01 -4.74866806e-01 -4.98802740e-01
2.69680706e-01 -5.60715159e-01 2.46392629e-01 7.53999293e-01
1.19344293e-01 1.24473258e-01 4.50284535e-02 -5.74844494e-01
-1.80203418e-01 -2.89340672e-01 1.37362545e+00 -6.91305992e-01
2.80612333e-01 1.49136056e+00 1.99466234e-01 1.55930637e-01
-2.39298218e-01 -9.12274848e-01 -4.82659170e-01 -6.00406523e-01
5.90931626e-01 -7.55722792e-01]]
print(y_train[:5])
562 1
291 0
16 1
546 0
293 0
Name: diagnosis, dtype: int64
print(y_test[:5])
421 0
47 1
292 0
186 1
414 1
Name: diagnosis, dtype: int64
Code:
# custom loss function
# importing libraries
import io
import os
import time
import pandas as pd
import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Dense
import keras.backend as K
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, roc_curve, roc_auc_score, precision_recall_fscore_support, accuracy_score
import matplotlib.pyplot as plt
from IPython.core.display import display, HTML
# from sklearn.grid_search import GridSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
# custom loss function
def custom_loss_wrapper(fn_cost=1, fp_cost=1):
    def custom_loss(y_true, y_pred, fn_cost=fn_cost, fp_cost=fp_cost):
        h = K.ones_like(y_pred)
        fn_value = fn_cost * h
        fp_value = fp_cost * h
        weighted_values = y_true * K.abs(1-y_pred)*fn_value + (1-y_true) * K.abs(y_pred)*fp_value
        loss = K.mean(weighted_values)
        return loss
    return custom_loss
custom_loss_five = custom_loss_wrapper(fn_cost=5, fp_cost=1)
# TODO: Initialize the classifier
clf = AdaBoostClassifier(random_state=0)
# TODO: Create the parameters list you wish to tune
parameters = {'n_estimators':[100,200,300],'learning_rate':[1.0,2.0,4.0]}
# TODO: Make an fbeta_score scoring object
# scorer = make_scorer(fbeta_score, beta=0.5)
scorer2 = make_scorer(custom_loss_five)
# TODO: Perform grid search on the classifier using 'scorer' as the scoring method
grid_obj2 = GridSearchCV(clf,parameters,scoring=scorer2)
# TODO: Fit the grid search object to the training data and find the optimal parameters
grid_fit2 = grid_obj2.fit(x_train_scld,y_train)
# Get the estimator
best_clf2 = grid_fit2.best_estimator_
# Make predictions using the unoptimized and optimized models
predictions = (clf.fit(x_train_scld, y_train)).predict(x_test_scld)
best_predictions = best_clf2.predict(x_test_scld)
# Report the before-and-after scores
print("Unoptimized model\n------")
print("Accuracy score on testing data: {:.4f}".format(accuracy_score(y_test, predictions)))
# print("F-score on testing data: {:.4f}".format(fbeta_score(y_test, predictions, beta = 0.5)))
print("F-score on testing data: {:.4f}".format(fbeta_score(y_test, predictions, beta = 0.5)))
print("\nOptimized Model\n------")
print("Final accuracy score on the testing data: {:.4f}".format(accuracy_score(y_test, best_predictions)))
# print("Final F-score on the testing data: {:.4f}".format(fbeta_score(y_test, best_predictions, beta = 0.5)))
print("Final F-score on the testing data: {:.4f}".format(fbeta_score(y_test, best_predictions, beta = 0.5)))
error:
/Users/sshields/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/model_selection/_split.py:2053: FutureWarning: You should specify a value for 'cv' instead of relying on the default value. The default value will change from 3 to 5 in version 0.22.
warnings.warn(CV_WARNING, FutureWarning)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-34-b87eab01e7ec> in <module>()
24
25 # TODO: Fit the grid search object to the training data and find the optimal parameters
---> 26 grid_fit2 = grid_obj2.fit(x_train_scld,y_train)
27
28 # Get the estimator
~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups, **fit_params)
720 return results_container[0]
721
--> 722 self._run_search(evaluate_candidates)
723
724 results = results_container[0]
~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/model_selection/_search.py in _run_search(self, evaluate_candidates)
1189 def _run_search(self, evaluate_candidates):
1190 """Search all candidates in param_grid"""
-> 1191 evaluate_candidates(ParameterGrid(self.param_grid))
1192
1193
~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/model_selection/_search.py in evaluate_candidates(candidate_params)
709 for parameters, (train, test)
710 in product(candidate_params,
--> 711 cv.split(X, y, groups)))
712
713 all_candidate_params.extend(candidate_params)
~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
915 # remaining jobs.
916 self._iterating = False
--> 917 if self.dispatch_one_batch(iterator):
918 self._iterating = self._original_iterator is not None
919
~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
757 return False
758 else:
--> 759 self._dispatch(tasks)
760 return True
761
~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self, batch)
714 with self._lock:
715 job_idx = len(self._jobs)
--> 716 job = self._backend.apply_async(batch, callback=cb)
717 # A job can complete so quickly than its callback is
718 # called before we get here, causing self._jobs to
~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py in apply_async(self, func, callback)
180 def apply_async(self, func, callback=None):
181 """Schedule a func to be run"""
--> 182 result = ImmediateResult(func)
183 if callback:
184 callback(result)
~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py in __init__(self, batch)
547 # Don't delay the application, to avoid keeping the input
548 # arguments in memory
--> 549 self.results = batch()
550
551 def get(self):
~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self)
223 with parallel_backend(self._backend, n_jobs=self._n_jobs):
224 return [func(*args, **kwargs)
--> 225 for func, args, kwargs in self.items]
226
227 def __len__(self):
~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0)
223 with parallel_backend(self._backend, n_jobs=self._n_jobs):
224 return [func(*args, **kwargs)
--> 225 for func, args, kwargs in self.items]
226
227 def __len__(self):
~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/model_selection/_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, return_estimator, error_score)
566 fit_time = time.time() - start_time
567 # _score will return dict if is_multimetric is True
--> 568 test_scores = _score(estimator, X_test, y_test, scorer, is_multimetric)
569 score_time = time.time() - start_time - fit_time
570 if return_train_score:
~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/model_selection/_validation.py in _score(estimator, X_test, y_test, scorer, is_multimetric)
603 """
604 if is_multimetric:
--> 605 return _multimetric_score(estimator, X_test, y_test, scorer)
606 else:
607 if y_test is None:
~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/model_selection/_validation.py in _multimetric_score(estimator, X_test, y_test, scorers)
633 score = scorer(estimator, X_test)
634 else:
--> 635 score = scorer(estimator, X_test, y_test)
636
637 if hasattr(score, 'item'):
~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/metrics/scorer.py in __call__(self, estimator, X, y_true, sample_weight)
96 else:
97 return self._sign * self._score_func(y_true, y_pred,
---> 98 **self._kwargs)
99
100
<ipython-input-4-afa574df52f0> in custom_loss(y_true, y_pred, fn_cost, fp_cost)
11 weighted_values = y_true * K.abs(1-y_pred)*fn_value + (1-y_true) * K.abs(y_pred)*fp_value
12
---> 13 loss = K.mean(weighted_values)
14 return loss
15
~/anaconda2/envs/py36/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py in mean(x, axis, keepdims)
1377 A tensor with the mean of elements of `x`.
1378 """
-> 1379 if x.dtype.base_dtype == tf.bool:
1380 x = tf.cast(x, floatx())
1381 return tf.reduce_mean(x, axis, keepdims)
AttributeError: 'numpy.dtype' object has no attribute 'base_dtype'

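The custom scoring function does not have to be a Keras function. The scorer calls it with plain NumPy arrays (y_true, y_pred), which is why the Keras backend call K.mean fails on them; writing the same loss with NumPy avoids the problem.
Here is a working example.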
from sklearn import svm, datasets
import numpy as np
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
def custom_loss(y_true, y_pred):
    fn_cost, fp_cost = 5, 1
    h = np.ones(len(y_pred))
    fn_value = fn_cost * h
    fp_value = fp_cost * h
    weighted_values = y_true * np.abs(1-y_pred)*fn_value + (1-y_true) * np.abs(y_pred)*fp_value
    loss = np.mean(weighted_values)
    return loss
svc = svm.SVC()
# custom_loss is a cost (lower is better), so greater_is_better=False lets GridSearchCV minimize it
clf = GridSearchCV(svc, parameters, cv=5, scoring=make_scorer(custom_loss, greater_is_better=False))
clf.fit(iris.data, iris.target)
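For the binary setting in the question, the same idea might look like the sketch below. It is only an illustration under some assumptions: `weighted_error_cost` is a hypothetical name, `load_breast_cancer` stands in for the question's data, and the grid values are kept small so it runs quickly. By default a scorer built with `make_scorer` receives the hard labels from `predict`, and `greater_is_better=False` makes it return the negated cost so that the search, which maximizes, ends up minimizing it.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV


def weighted_error_cost(y_true, y_pred, fn_cost=5.0, fp_cost=1.0):
    """Mean misclassification cost for 0/1 labels, written with NumPy only."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    fn_term = y_true * np.abs(1.0 - y_pred) * fn_cost    # missed positives
    fp_term = (1.0 - y_true) * np.abs(y_pred) * fp_cost  # false alarms
    return float(np.mean(fn_term + fp_term))


# Extra keyword arguments to make_scorer are forwarded to the cost function.
cost_scorer = make_scorer(weighted_error_cost, greater_is_better=False,
                          fn_cost=5.0, fp_cost=1.0)

# Stand-in binary dataset (assumption; not the question's actual data).
X, y = load_breast_cancer(return_X_y=True)

search = GridSearchCV(
    AdaBoostClassifier(random_state=0),
    param_grid={"n_estimators": [10, 20], "learning_rate": [0.5, 1.0]},
    scoring=cost_scorer,
    cv=3,
)
search.fit(X, y)

# best_score_ is the negated cost, so flip the sign to read it as a cost.
print(search.best_params_, -search.best_score_)
```

Because the score is negated, `search.best_score_` will be zero or negative; the candidate with the smallest average cost wins.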

Related

Scikit learn custom scoring function - Specificity

I'm trying to do a random grid search on randomforestclassifier.
# Instantiate a RandomForestClassifier
RFC = RandomForestClassifier()
# Instantiate the RandomizedSearchCV object: RFC
rand_search3 = RandomizedSearchCV(RFC, param_grid, n_iter=10, cv=5,n_jobs=-1, verbose=1, scoring = "f1_macro")
# Fit it to the data
rand_search3.fit(X_train_transformed,y_train)
I'm trying to select the best model by specificity.
I went through the documentation for custom scoring and looked at many related posts. I have come up with 2 ways to compute specificity.
1:
from sklearn.metrics import confusion_matrix, make_scorer
def my_custom_func(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    # specificity = TN / (TN + FP), i.e. row 0 of the confusion matrix
    return cm[0][0] / (cm[0][0] + cm[0][1])
Specificity_score = make_scorer(my_custom_func, greater_is_better=True)
2:
from sklearn.metrics import recall_score
from sklearn.metrics import make_scorer
specificity = make_scorer(recall_score, pos_label=0, greater_is_better=True)
specificity
When I try using the custom function for the scoring,
rand_search3 = RandomizedSearchCV(RFC, param_grid, n_iter=10, cv=5,n_jobs=-1, verbose=1, scoring = Specificity_score)
rand_search3 = RandomizedSearchCV(RFC, param_grid, n_iter=10, cv=5,n_jobs=-1, verbose=1, scoring = specificity)
both failed with the same error message.
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_10548/1204393696.py in <module>
20
21 # Fit it to the data
---> 22 rand_search3.fit(X_train_transformed,y_train)
23
~\anaconda3\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
61 #wraps(f)
62 def inner_f(*args, **kwargs):
---> 63 extra_args = len(args) - len(all_args)
64 if extra_args <= 0:
65 return f(*args, **kwargs)
~\anaconda3\lib\site-packages\sklearn\model_selection\_search.py in fit(self, X, y, groups, **fit_params)
839 delayed(_fit_and_score)(
840 clone(base_estimator),
--> 841 X,
842 y,
843 train=train,
~\anaconda3\lib\site-packages\sklearn\model_selection\_search.py in _run_search(self, evaluate_candidates)
1631 Mean cross-validated score of the best_estimator.
1632
-> 1633 For multi-metric evaluation, this is not available if ``refit`` is
1634 ``False``. See ``refit`` parameter for more information.
1635
~\anaconda3\lib\site-packages\sklearn\model_selection\_search.py in evaluate_candidates(candidate_params, cv, more_results)
825 def evaluate_candidates(candidate_params, cv=None, more_results=None):
826 cv = cv or cv_orig
--> 827 candidate_params = list(candidate_params)
828 n_candidates = len(candidate_params)
829
~\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in _insert_error_scores(results, error_score)
293
294 results = _aggregate_score_dicts(results)
--> 295
296 ret = {}
297 ret["fit_time"] = results["fit_time"]
KeyError: 'fit_failed'
Any solutions?
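For reference, here is a small sketch of what a specificity scorer could look like, assuming binary labels where 0 is the negative class. The function name `specificity` and the toy arrays are made up for illustration. It does not by itself explain the KeyError above; it only shows that the confusion-matrix route and the `recall_score(..., pos_label=0)` route should agree.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, make_scorer, recall_score


def specificity(y_true, y_pred):
    """TN / (TN + FP) for binary labels, with 0 as the negative class."""
    # labels=[0, 1] fixes the row/column order even if a fold lacks a class.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tn / (tn + fp) if (tn + fp) > 0 else 0.0


# Higher specificity is better, so the default greater_is_better=True applies.
specificity_scorer = make_scorer(specificity)

# Toy labels (illustrative only): 4 negatives, 3 of them predicted correctly.
y_true = np.array([0, 0, 0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 1, 0, 0])

print(specificity(y_true, y_pred))
print(recall_score(y_true, y_pred, pos_label=0))
```

`specificity_scorer` can then be passed as `scoring=` to RandomizedSearchCV in the same way as the two scorers in the question.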

Error in the Random Forest model fit with GridSearch while using a pipeline()

I am trying to implement ColumnTransformer() and Pipeline() following this document:
https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html
I have encountered a ValueError using the prediction pipeline in a grid search.
My data is of mixed types with a binomial target (the 'left' column has values 'yes' or 'no').
department category
promoted int64
review float64
projects int64
salary object
tenure float64
satisfaction float64
bonus int64
avg_hrs_month float64
left object
dtype: object
Here is my code:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.compose import make_column_selector as selector
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
department promoted review projects salary ... left
0 operations 0 0.577569 3 low no
1 operations 0 0.751900 3 medium no
2 support 0 0.722548 3 medium yes
3 logistics 0 0.675158 4 high no
4 sales 0 0.676203 3 high no
ord_features = ["salary","left"]
ordinal_transformer = OrdinalEncoder()
cat_features = ["department"]
categorical_transformer = OneHotEncoder(handle_unknown="ignore")
preprocessor = ColumnTransformer(
    transformers=[
        ("num", ordinal_transformer, selector(dtype_include="object")),
        ("cat", categorical_transformer, selector(dtype_include="category")),
    ]
)
clf = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", RandomForestClassifier())]
)
X_train, X_test, y_train, y_test = train_test_split(X_over, y_over, train_size=0.80, test_size=0.20, random_state=32)
clf.fit(X_train, y_train)
Out:
Pipeline(steps=[('preprocessor',
ColumnTransformer(transformers=[('num', OrdinalEncoder(),
<sklearn.compose._column_transformer.make_column_selector object at 0x0000028FD9C38CA0>),
('cat',
OneHotEncoder(handle_unknown='ignore'),
<sklearn.compose._column_transformer.make_column_selector object at 0x0000028FD9D863A0>)])),
('classifier', RandomForestClassifier())])
param_grid = {
    'max_depth':[30, 40, 50],
    'n_estimators':[ 80, 100, 120, 130, 150]
}
grid_search = GridSearchCV(clf, param_grid,scoring='roc_auc', cv=6)
Out:
GridSearchCV(cv=6,
estimator=Pipeline(steps=[('preprocessor',
ColumnTransformer(transformers=[('num',
OrdinalEncoder(),
<sklearn.compose._column_transformer.make_column_selector object at 0x0000028FD9C38CA0>),
('cat',
OneHotEncoder(handle_unknown='ignore'),
<sklearn.compose._column_transformer.make_column_selector object at 0x0000028FD9D863A0>)])),
('classifier',
RandomForestClassifier())]),
param_grid={'max_depth': [30, 40, 50],
'n_estimators': [80, 100, 120, 130, 150]},
scoring='roc_auc')
grid_search.fit(X_train, y_train)
Out:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_36048/20025057.py in <module>
----> 1 grid_search.fit(X_train, y_train)
~\.conda\envs\tf-gpu\lib\site-packages\sklearn\model_selection\_search.py in fit(self, X, y, groups, **fit_params)
889 return results
890
--> 891 self._run_search(evaluate_candidates)
892
893 # multimetric is determined here because in the case of a callable
~\.conda\envs\tf-gpu\lib\site-packages\sklearn\model_selection\_search.py in _run_search(self, evaluate_candidates)
1390 def _run_search(self, evaluate_candidates):
1391 """Search all candidates in param_grid"""
-> 1392 evaluate_candidates(ParameterGrid(self.param_grid))
1393
1394
~\.conda\envs\tf-gpu\lib\site-packages\sklearn\model_selection\_search.py in evaluate_candidates(candidate_params, cv, more_results)
836 )
837
--> 838 out = parallel(
839 delayed(_fit_and_score)(
840 clone(base_estimator),
~\.conda\envs\tf-gpu\lib\site-packages\joblib\parallel.py in __call__(self, iterable)
1046 # remaining jobs.
1047 self._iterating = False
-> 1048 if self.dispatch_one_batch(iterator):
1049 self._iterating = self._original_iterator is not None
1050
~\.conda\envs\tf-gpu\lib\site-packages\joblib\parallel.py in dispatch_one_batch(self, iterator)
864 return False
865 else:
--> 866 self._dispatch(tasks)
867 return True
868
~\.conda\envs\tf-gpu\lib\site-packages\joblib\parallel.py in _dispatch(self, batch)
782 with self._lock:
783 job_idx = len(self._jobs)
--> 784 job = self._backend.apply_async(batch, callback=cb)
785 # A job can complete so quickly than its callback is
786 # called before we get here, causing self._jobs to
~\.conda\envs\tf-gpu\lib\site-packages\joblib\_parallel_backends.py in apply_async(self, func, callback)
206 def apply_async(self, func, callback=None):
207 """Schedule a func to be run"""
--> 208 result = ImmediateResult(func)
209 if callback:
210 callback(result)
~\.conda\envs\tf-gpu\lib\site-packages\joblib\_parallel_backends.py in __init__(self, batch)
570 # Don't delay the application, to avoid keeping the input
571 # arguments in memory
--> 572 self.results = batch()
573
574 def get(self):
~\.conda\envs\tf-gpu\lib\site-packages\joblib\parallel.py in __call__(self)
260 # change the default number of processes to -1
261 with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 262 return [func(*args, **kwargs)
263 for func, args, kwargs in self.items]
264
~\.conda\envs\tf-gpu\lib\site-packages\joblib\parallel.py in <listcomp>(.0)
260 # change the default number of processes to -1
261 with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 262 return [func(*args, **kwargs)
263 for func, args, kwargs in self.items]
264
~\.conda\envs\tf-gpu\lib\site-packages\sklearn\utils\fixes.py in __call__(self, *args, **kwargs)
209 def __call__(self, *args, **kwargs):
210 with config_context(**self.config):
--> 211 return self.function(*args, **kwargs)
212
213
~\.conda\envs\tf-gpu\lib\site-packages\sklearn\model_selection\_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, return_estimator, split_progress, candidate_progress, error_score)
667 cloned_parameters[k] = clone(v, safe=False)
668
--> 669 estimator = estimator.set_params(**cloned_parameters)
670
671 start_time = time.time()
~\.conda\envs\tf-gpu\lib\site-packages\sklearn\pipeline.py in set_params(self, **kwargs)
186 Pipeline class instance.
187 """
--> 188 self._set_params("steps", **kwargs)
189 return self
190
~\.conda\envs\tf-gpu\lib\site-packages\sklearn\utils\metaestimators.py in _set_params(self, attr, **params)
52 self._replace_estimator(attr, name, params.pop(name))
53 # 3. Step parameters and other initialisation arguments
---> 54 super().set_params(**params)
55 return self
56
~\.conda\envs\tf-gpu\lib\site-packages\sklearn\base.py in set_params(self, **params)
238 key, delim, sub_key = key.partition("__")
239 if key not in valid_params:
--> 240 raise ValueError(
241 "Invalid parameter %s for estimator %s. "
242 "Check the list of available parameters "
ValueError: Invalid parameter max_depth for estimator Pipeline(steps=[('preprocessor',
ColumnTransformer(transformers=[('num', OrdinalEncoder(),
<sklearn.compose._column_transformer.make_column_selector object at 0x0000028FD9E1CE80>),
('cat',
OneHotEncoder(handle_unknown='ignore'),
<sklearn.compose._column_transformer.make_column_selector object at 0x0000028FD9EFEEB0>)])),
('classifier', RandomForestClassifier())]). Check the list of available parameters with `estimator.get_params().keys()`.
So it gives me an error after grid_search.fit(X_train, y_train), ValueError: Invalid parameter max_depth for estimator Pipeline.
I have printed out the estimator's list of parameters:
clf.get_params().keys()
dict_keys(['memory', 'steps', 'verbose', 'preprocessor', 'classifier', 'preprocessor__n_jobs', 'preprocessor__remainder', 'preprocessor__sparse_threshold', 'preprocessor__transformer_weights', 'preprocessor__transformers', 'preprocessor__verbose', 'preprocessor__verbose_feature_names_out', 'preprocessor__num', 'preprocessor__cat', 'preprocessor__num__categories', 'preprocessor__num__dtype', 'preprocessor__num__handle_unknown', 'preprocessor__num__unknown_value', 'preprocessor__cat__categories', 'preprocessor__cat__drop', 'preprocessor__cat__dtype', 'preprocessor__cat__handle_unknown', 'preprocessor__cat__sparse', 'classifier__bootstrap', 'classifier__ccp_alpha', 'classifier__class_weight', 'classifier__criterion', 'classifier__max_depth', 'classifier__max_features', 'classifier__max_leaf_nodes', 'classifier__max_samples', 'classifier__min_impurity_decrease', 'classifier__min_samples_leaf', 'classifier__min_samples_split', 'classifier__min_weight_fraction_leaf', 'classifier__n_estimators', 'classifier__n_jobs', 'classifier__oob_score', 'classifier__random_state', 'classifier__verbose', 'classifier__warm_start'])
If I run the grid search without a pipeline, I get the best estimator parameters as follows:
n_estimators = 150, max_depth=30.
What could be wrong with my pipeline?
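The parameter list printed above already hints at the naming: inside a Pipeline, a step's parameters are addressed as `<step name>__<parameter>`, so the grid keys would be `classifier__max_depth` and `classifier__n_estimators` rather than the bare names. A minimal sketch follows, with a synthetic dataset from `make_classification` and a `StandardScaler` standing in for the question's ColumnTransformer (both are assumptions made only to keep it self-contained), and smaller grid values so it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary data (stand-in for the question's X_train / y_train).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

pipe = Pipeline(steps=[
    ("preprocessor", StandardScaler()),  # placeholder for the ColumnTransformer
    ("classifier", RandomForestClassifier(random_state=0)),
])

# Keys use the step name, a double underscore, then the parameter name.
param_grid = {
    "classifier__max_depth": [3, 5],
    "classifier__n_estimators": [10, 20],
}

grid_search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=3)
grid_search.fit(X, y)
print(grid_search.best_params_)
```

The same prefixing works one level deeper, for example `preprocessor__cat__handle_unknown` in the key list shown in the question.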

Problem in GridSearching a LSTM network - Batch_size issue

I wrote code to apply grid search to an LSTM network built with Keras. Everything seems to work fine, but I have a problem passing the batch_size.
I tried changing the format of batch_size but, as I understand it, it must be a tuple.
#LSTM ok
from Methods.LSTM_1HL import LSTM_method
Yhat_train_LSTM, Yhat_test_LSTM = LSTM_method(X_train, X_test, Y_train,
Y_test)
def create_model(optimizer, hl1_nodes, input_shape):
    # creation of the NN - Electric Load
    # LSTM layers followed by other LSTM layer must have the parameter "return_sequences" set at True
    model = Sequential()
    model.add(LSTM(units = hl1_nodes , input_shape=input_shape, return_sequences=False))
    model.add(Dense(1, activation="linear")) # output layer
    model.compile(optimizer=optimizer, loss='mean_squared_error', metrics=['accuracy'])
    model.summary()
    return model
def LSTM_method(X_train, X_test, Y_train, Y_test):
    # normalize X and Y data
    mmsx = MinMaxScaler()
    mmsy = MinMaxScaler()
    X_train = mmsx.fit_transform(X_train)
    X_test = mmsx.transform(X_test)
    Y_train = mmsy.fit_transform(Y_train)
    Y_test = mmsy.transform(Y_test)
    X_train = X_train.reshape(X_train.shape[0], 1, X_train.shape[1])
    # NN for Electric Load
    # LSTM Input Shape
    time_steps = 1 # number of time-steps you are feeding a sequence (?)
    inputs_numb = X_train.shape[1] # number of inputs
    input_shape=(time_steps, inputs_numb)
    model = KerasRegressor(build_fn=create_model,verbose=1)
    #GridSearch code
    start=time()
    optimizers = ['rmsprop', 'adam']
    epochs = np.array([100, 500, 1000])
    hl1_nodes = np.array([1, 10, 50])
    btcsz = np.array([1,X_train.shape[0]])
    param_grid = dict(optimizer=optimizers, hl1_nodes=hl1_nodes, input_shape=input_shape, nb_epoch=epochs,batch_size=btcsz)
    scoring = make_scorer(accuracy_score) #in order to use a metric as a scorer
    grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring = scoring)
    grid_result = grid.fit(X_train, Y_train)
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
    for params, mean_score, scores in grid_result.grid_scores_:
        print("%f (%f) with: %r" % (scores.mean(), scores.std(), params))
    print("total time:",time()-start)
    # Predictions - Electric Load
    Yhat_train = grid_result.predict(X_train, verbose=0)
    X_test = X_test.reshape(X_test.shape[0], 1, X_test.shape[1])
    Yhat_test = grid_result.predict(X_test, verbose=0)
    # Denormalization - Electric Load
    Yhat_train = mmsy.inverse_transform(Yhat_train)
    Yhat_test = mmsy.inverse_transform(Yhat_test)
    Y_train = mmsy.inverse_transform(Y_train)
    Y_test = mmsy.inverse_transform(Y_test)
    return Yhat_train, Yhat_test
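A side note on the param_grid in this code, offered as a possible lead rather than a confirmed diagnosis: each value in a parameter grid is treated as a list of candidates, so a bare tuple such as input_shape=(time_steps, inputs_numb) is expanded element by element, and create_model may end up receiving a single integer instead of the tuple. The sketch below uses scikit-learn's `ParameterGrid` with a made-up shape `(1, 30)` to show the difference between passing the tuple directly and wrapping it in a one-element list.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical (time_steps, n_features) shape, used only for illustration.
input_shape = (1, 30)

# A bare tuple is read as two separate candidate values: 1 and 30.
bare = list(ParameterGrid({"input_shape": input_shape}))

# Wrapping the tuple in a list makes the whole tuple a single candidate.
wrapped = list(ParameterGrid({"input_shape": [input_shape]}))

print(bare)
print(wrapped)
```

If that is what happens here, writing input_shape=[input_shape] in the grid would pass the full tuple through to the model-building function.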
Below the error I get:
TypeError Traceback (most recent call last)
in
10 #from Methods.LSTM_1HL import create_model
11
---> 12 Yhat_train_LSTM, Yhat_test_LSTM = LSTM_method(X_train, X_test, Y_train, Y_test)
c:\Users\ER180124\Code\LoadForecasting\Methods\LSTM_1HL.py in LSTM_method(X_train, X_test, Y_train, Y_test)
62 scoring = make_scorer(accuracy_score) #in order to use a metric as a scorer
63 grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring = scoring)
---> 64 grid_result = grid.fit(X_train, Y_train)
65
66 print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
~\.conda\envs\PierEnv\lib\site-packages\sklearn\model_selection\_search.py in fit(self, X, y, groups, **fit_params)
720 return results_container[0]
721
--> 722 self._run_search(evaluate_candidates)
723
724 results = results_container[0]
~\.conda\envs\PierEnv\lib\site-packages\sklearn\model_selection\_search.py in _run_search(self, evaluate_candidates)
1189 def _run_search(self, evaluate_candidates):
1190 """Search all candidates in param_grid"""
-> 1191 evaluate_candidates(ParameterGrid(self.param_grid))
1192
1193
~\.conda\envs\PierEnv\lib\site-packages\sklearn\model_selection\_search.py in evaluate_candidates(candidate_params)
709 for parameters, (train, test)
710 in product(candidate_params,
--> 711 cv.split(X, y, groups)))
712
713 all_candidate_params.extend(candidate_params)
~\.conda\envs\PierEnv\lib\site-packages\sklearn\externals\joblib\parallel.py in __call__(self, iterable)
915 # remaining jobs.
916 self._iterating = False
--> 917 if self.dispatch_one_batch(iterator):
918 self._iterating = self._original_iterator is not None
919
~\.conda\envs\PierEnv\lib\site-packages\sklearn\externals\joblib\parallel.py in dispatch_one_batch(self, iterator)
757 return False
758 else:
--> 759 self._dispatch(tasks)
760 return True
761
~\.conda\envs\PierEnv\lib\site-packages\sklearn\externals\joblib\parallel.py in _dispatch(self, batch)
714 with self._lock:
715 job_idx = len(self._jobs)
--> 716 job = self._backend.apply_async(batch, callback=cb)
717 # A job can complete so quickly than its callback is
718 # called before we get here, causing self._jobs to
~\.conda\envs\PierEnv\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py in apply_async(self, func, callback)
180 def apply_async(self, func, callback=None):
181 """Schedule a func to be run"""
--> 182 result = ImmediateResult(func)
183 if callback:
184 callback(result)
~\.conda\envs\PierEnv\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py in __init__(self, batch)
547 # Don't delay the application, to avoid keeping the input
548 # arguments in memory
--> 549 self.results = batch()
550
551 def get(self):
~\.conda\envs\PierEnv\lib\site-packages\sklearn\externals\joblib\parallel.py in __call__(self)
223 with parallel_backend(self._backend, n_jobs=self._n_jobs):
224 return [func(*args, **kwargs)
--> 225 for func, args, kwargs in self.items]
226
227 def __len__(self):
~\.conda\envs\PierEnv\lib\site-packages\sklearn\externals\joblib\parallel.py in (.0)
223 with parallel_backend(self._backend, n_jobs=self._n_jobs):
224 return [func(*args, **kwargs)
--> 225 for func, args, kwargs in self.items]
226
227 def __len__(self):
~\.conda\envs\PierEnv\lib\site-packages\sklearn\model_selection\_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, return_estimator, error_score)
526 estimator.fit(X_train, **fit_params)
527 else:
--> 528 estimator.fit(X_train, y_train, **fit_params)
529
530 except Exception as e:
~\.conda\envs\PierEnv\lib\site-packages\keras\wrappers\scikit_learn.py in fit(self, x, y, **kwargs)
139 **self.filter_sk_params(self.build_fn.__call__))
140 else:
--> 141 self.model = self.build_fn(**self.filter_sk_params(self.build_fn))
142
143 loss_name = self.model.loss
c:\Users\ER180124\Code\LoadForecasting\Methods\LSTM_1HL.py in create_model(optimizer, hl1_nodes, input_shape)
19 # LSTM layers followed by other LSTM layer must have the parameter "return_sequences" set at True
20 model = Sequential()
---> 21 model.add(LSTM(units = hl1_nodes , input_shape=input_shape, return_sequences=False))
22 model.add(Dense(1, activation="linear")) # output layer
23 model.compile(optimizer=optimizer, loss='mean_squared_error', metrics=['accuracy'])
~\.conda\envs\PierEnv\lib\site-packages\keras\legacy\interfaces.py in wrapper(*args, **kwargs)
89 warnings.warn('Update your `' + object_name + '` call to the ' +
90 'Keras 2 API: ' + signature, stacklevel=2)
---> 91 return func(*args, **kwargs)
92 wrapper._original_function = func
93 return wrapper
~\.conda\envs\PierEnv\lib\site-packages\keras\layers\recurrent.py in __init__(self, units, activation, recurrent_activation, use_bias, kernel_initializer, recurrent_initializer, bias_initializer, unit_forget_bias, kernel_regularizer, recurrent_regularizer, bias_regularizer, activity_regularizer, kernel_constraint, recurrent_constraint, bias_constraint, dropout, recurrent_dropout, implementation, return_sequences, return_state, go_backwards, stateful, unroll, **kwargs)
2183 stateful=stateful,
2184 unroll=unroll,
-> 2185 **kwargs)
2186 self.activity_regularizer = regularizers.get(activity_regularizer)
2187
~\.conda\envs\PierEnv\lib\site-packages\keras\layers\recurrent.py in __init__(self, cell, return_sequences, return_state, go_backwards, stateful, unroll, **kwargs)
406 '(tuple of integers, '
407 'one integer per RNN state).')
--> 408 super(RNN, self).__init__(**kwargs)
409 self.cell = cell
410 self.return_sequences = return_sequences
~\.conda\envs\PierEnv\lib\site-packages\keras\engine\base_layer.py in __init__(self, **kwargs)
145 batch_size = None
146 batch_input_shape = (
--> 147 batch_size,) + tuple(kwargs['input_shape'])
148 self.batch_input_shape = batch_input_shape
149
TypeError: 'int' object is not iterable
I do not understand why the last part of the error message shows "batch_size = None" when I define a batch size that is a tuple.
I think I see your problem.
When you run a CV search, a parameter grid is generated from your parameter dictionary, most likely as the cross product of the values listed for each parameter. Your dictionary sets input_shape to (time_steps, inputs_numb), which is itself a sequence of two integers, so the search treats it as two candidate values: input_shape becomes either time_steps or inputs_numb. In the final frame of the stack trace this turns into (None,) + tuple(time_steps) or (None,) + tuple(inputs_numb), and calling tuple() on a single int is what raises "'int' object is not iterable". Instead, you want your configuration space to contain exactly one possible input_shape.
What you should do is change this line
input_shape=(time_steps, inputs_numb)
to this:
input_shape=[(time_steps, inputs_numb)]
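To make the difference concrete, here is a minimal sketch of what happens to the two spellings. The helper expand_grid is a hypothetical stand-in for the grid expansion (it is not sklearn's code), the values 24 and 30 are made up, and batch_input_shape just repeats the expression from the last frame of the traceback.

```python
from itertools import product


def expand_grid(param_grid):
    """Hypothetical stand-in for the search's grid expansion:
    one candidate per combination of the values listed for each key."""
    keys = sorted(param_grid)
    return [dict(zip(keys, combo))
            for combo in product(*(param_grid[k] for k in keys))]


def batch_input_shape(input_shape):
    # Same expression as the last frame of the traceback (base_layer.py).
    return (None,) + tuple(input_shape)


time_steps, inputs_numb = 24, 30  # illustrative values only

bad = expand_grid({"input_shape": (time_steps, inputs_numb)})
good = expand_grid({"input_shape": [(time_steps, inputs_numb)]})

print(bad)   # two candidates, each holding a bare int
print(good)  # one candidate holding the whole tuple
print(batch_input_shape(good[0]["input_shape"]))

try:
    batch_input_shape(bad[0]["input_shape"])
except TypeError as exc:
    print("TypeError:", exc)
```

With the tuple wrapped in a list, the grid has a single input_shape candidate and the layer receives the full (time_steps, inputs_numb) tuple.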

How to predict target on a validation set with a Pipeline containing OneHotEncode and LightGBM?

I am trying to use sklearn and LightGBM with both numerical and categorical features. I created a Pipeline with:
1 step for data preprocessing relying on ColumnTransformer (categorical variables are encoded with OneHotEncoder).
1 step for actual model training with LightGBM.
It trains my model just fine but I have an error message when I want to use my model for prediction on a test dataset. It looks like the preprocessing is not applied to this test dataset but I don't get why. In the tutorials I've found online, it seems to work, though with sklearn classifiers.
Here is my code:
from lightgbm import LGBMClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import roc_auc_score
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
# Numerical features
numerical_features = ['Distance']
numerical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())])
# Categorical features
categorical_features = ['Travel', 'Month', 'DayofMonth', 'DayOfWeek', 'UniqueCarrier']
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('onehot', OneHotEncoder())])
# Build the preprocessor with ColumnTransformer
preprocess = ColumnTransformer(transformers=[
('num', numerical_transformer, numerical_features),
('cat', categorical_transformer, categorical_features)
]
)
# Build a pipeline
clf = Pipeline(steps=[('preprocess', preprocess),
('classifier', LGBMClassifier(random_state=17))])
# Fit
clf.fit(X_build, y_build)
# Scores
print("model training score (clf internal scoring function with standards parameters): {0}".format(clf.score(X_build, y_build))) # returns a score
print("Score: %f" % clf.score(X_valid, y_valid)) # Here is the problem
And here is the error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-116-70bf0e236540> in <module>()
----> 1 print("Score: %f" % clf.predict(X_valid))
~/anaconda3/lib/python3.6/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
116
117 # lambda, but not partial, allows help() to work with update_wrapper
--> 118 out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
119 # update the docstring of the returned function
120 update_wrapper(out, self.fn)
~/anaconda3/lib/python3.6/site-packages/sklearn/pipeline.py in predict(self, X, **predict_params)
329 for name, transform in self.steps[:-1]:
330 if transform is not None:
--> 331 Xt = transform.transform(Xt)
332 return self.steps[-1][-1].predict(Xt, **predict_params)
333
~/anaconda3/lib/python3.6/site-packages/sklearn/compose/_column_transformer.py in transform(self, X)
491
492 X = _check_X(X)
--> 493 Xs = self._fit_transform(X, None, _transform_one, fitted=True)
494 self._validate_output(Xs)
495
~/anaconda3/lib/python3.6/site-packages/sklearn/compose/_column_transformer.py in _fit_transform(self, X, y, func, fitted)
391 _get_column(X, column), y, weight)
392 for _, trans, column, weight in self._iter(
--> 393 fitted=fitted, replace_strings=True))
394 except ValueError as e:
395 if "Expected 2D array, got 1D array instead" in str(e):
~/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
918 self._iterating = self._original_iterator is not None
919
--> 920 while self.dispatch_one_batch(iterator):
921 pass
922
~/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
757 return False
758 else:
--> 759 self._dispatch(tasks)
760 return True
761
~/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self, batch)
714 with self._lock:
715 job_idx = len(self._jobs)
--> 716 job = self._backend.apply_async(batch, callback=cb)
717 # A job can complete so quickly than its callback is
718 # called before we get here, causing self._jobs to
~/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py in apply_async(self, func, callback)
180 def apply_async(self, func, callback=None):
181 """Schedule a func to be run"""
--> 182 result = ImmediateResult(func)
183 if callback:
184 callback(result)
~/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py in __init__(self, batch)
547 # Don't delay the application, to avoid keeping the input
548 # arguments in memory
--> 549 self.results = batch()
550
551 def get(self):
~/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self)
223 with parallel_backend(self._backend, n_jobs=self._n_jobs):
224 return [func(*args, **kwargs)
--> 225 for func, args, kwargs in self.items]
226
227 def __len__(self):
~/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0)
223 with parallel_backend(self._backend, n_jobs=self._n_jobs):
224 return [func(*args, **kwargs)
--> 225 for func, args, kwargs in self.items]
226
227 def __len__(self):
~/anaconda3/lib/python3.6/site-packages/sklearn/pipeline.py in _transform_one(transformer, X, y, weight, **fit_params)
603
604 def _transform_one(transformer, X, y, weight, **fit_params):
--> 605 res = transformer.transform(X)
606 # if we have a weight for this transformer, multiply output
607 if weight is None:
~/anaconda3/lib/python3.6/site-packages/sklearn/pipeline.py in _transform(self, X)
449 for name, transform in self.steps:
450 if transform is not None:
--> 451 Xt = transform.transform(Xt)
452 return Xt
453
~/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py in transform(self, X)
611 copy=True)
612 else:
--> 613 return self._transform_new(X)
614
615 def inverse_transform(self, X):
~/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py in _transform_new(self, X)
572 n_samples, n_features = X.shape
573
--> 574 X_int, X_mask = self._transform(X, handle_unknown=self.handle_unknown)
575
576 mask = X_mask.ravel()
~/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py in _transform(self, X, handle_unknown)
105 msg = ("Found unknown categories {0} in column {1}"
106 " during transform".format(diff, i))
--> 107 raise ValueError(msg)
108 else:
109 # Set the problematic rows to an acceptable value and
ValueError: Found unknown categories ['BOS-CHS', 'ORD-JAC', 'LAS-OKC', 'VCT-IAH', 'CVG-EGE', 'PIT-PVD', 'BDL-SLC', 'TEX-PHX', 'LAX-LGA', 'LEX-LGA', 'CLE-SLC', 'KOA-SNA', 'SNA-HNL', 'MDW-SNA', 'MIA-SEA', 'MEM-RDU', 'YUM-IPL', 'SLC-KOA', 'EGE-EWR', 'MTJ-DFW', 'TPA-CHS', 'FLL-OAK', 'PVD-MCI', 'SLC-DSM', 'RSW-DEN', 'ORD-JAN', 'ATL-FSD', 'CHS-JAX', 'MCO-MLI', 'FSD-SLC', 'SLC-LGA', 'GRB-DFW', 'PNS-JAX', 'BDL-LAX', 'ATL-SOP', 'MSP-FAI', 'CLT-CAE', 'PIT-SEA', 'SRQ-IND', 'PHF-CLT', 'MIA-CMH', 'FAR-SLC', 'TUL-LAS', 'EWR-TUS', 'ORD-STT', 'CLT-TRI', 'BHM-CLE', 'ORD-PWM', 'SRQ-IAH', 'BOI-ORD', 'ATL-EGE', 'ATL-CID', 'IND-MSY', 'EGE-LAX', 'BUR-PDX', 'BTR-LGA', 'MIA-SLC', 'ONT-PDX', 'CLE-SBN', 'MSP-JAC', 'CMH-FLL', 'MEM-AUS', 'PHX-MFR', 'SJU-STL', 'ASE-SLC', 'CID-ATL', 'DFW-MLI', 'SCC-BRW', 'LGA-MSN', 'MCO-PFN', 'MDW-SJU', 'SEA-SIT', 'DTW-OMA', 'GRR-TPA', 'EGE-SFO', 'DFW-RST', 'GRR-LAS', 'TPA-TLH', 'PWM-CLT', 'TLH-MIA', 'PHF-FLL', 'SFO-EGE', 'SAT-STL', 'RSW-MKE', 'DTW-MSY', 'IAH-TXK', 'TLH-JFK', 'ATL-GUC', 'IAH-VCT', 'DEN-GRR', 'IND-SEA', 'PIE-MDW', 'BHM-IAD', 'IAD-BHM', 'BUR-MCO', 'MTJ-EWR', 'CLE-HOU', 'MSY-STL', 'DFW-SYR', 'BUF-LAS', 'LEX-EWR'] in column 0 during transform
Do you know what the problem is?
Thanks
The problem is that the OneHotEncoder finds categories in the validation sample that were not present in the training sample.
With its default setting (handle_unknown='error'), sklearn's OneHotEncoder raises an error in this situation, so you have to decide how categories that appear only in new data should be treated. Setting handle_unknown='ignore' makes the encoder output all zeros for such categories instead of raising. Other strategies are worth experimenting with too. For example: make the encoder aware of all possible categories, including those in the new data (using the categories argument in the constructor), or drop the new categories from the new data (by comparing the new data against the automatically learned categories_ attribute). Of course, the first option does not make sense in production, since future categories are not known in advance, but the second can always be implemented.
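As a rough illustration of these options, the sketch below fits an encoder on a tiny made-up column of routes and then transforms data containing an unseen route. The route names and variable names are invented for the example; handle_unknown and categories are the OneHotEncoder constructor arguments.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up data: 'BOS-CHS' appears only in the validation sample.
train = np.array([["BOS-JFK"], ["LAX-SFO"]], dtype=object)
valid = np.array([["LAX-SFO"], ["BOS-CHS"]], dtype=object)

# Default behaviour: unseen categories raise ValueError at transform time.
strict = OneHotEncoder().fit(train)
try:
    strict.transform(valid)
    strict_raised = False
except ValueError:
    strict_raised = True

# Option A: encode unseen categories as an all-zero row.
lenient = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded_ignore = lenient.transform(valid).toarray()

# Option B: declare every possible category up front
# (only realistic when the full set is known in advance).
all_routes = [sorted(set(train[:, 0]) | set(valid[:, 0]))]
declared = OneHotEncoder(categories=all_routes).fit(train)
encoded_declared = declared.transform(valid).toarray()

print(strict_raised)
print(encoded_ignore)
print(encoded_declared)
```

In the pipeline from the question, option A would amount to replacing the 'onehot' step with OneHotEncoder(handle_unknown='ignore').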

scoring "roc_auc" value is not working with GridSearchCV applying RandomForestClassifier

I keep getting this error when I run GridSearchCV with scoring set to 'roc_auc' ('f1', 'precision' and 'recall' work fine).
# Construct a pipeline
pipe = Pipeline([
('reduce_dim',PCA()),
('rf',RandomForestClassifier(min_samples_leaf=5,random_state=123))
])
N_FEATURES_OPTIONS = [2] # for PCA [2, 4, 8]
# the parameters below are for RandomForestClassifier
N_ESTIMATORS = [10,50] # 10,50,100
MAX_DEPTH = [5,6] # 5,6,7,8,9
MIN_SAMPLE_LEAF = 5
param_grid = [
{
'reduce_dim': [PCA()],
'reduce_dim__n_components': N_FEATURES_OPTIONS,
'rf__n_estimators' : N_ESTIMATORS,
'rf__max_depth': MAX_DEPTH
},
{
'reduce_dim': [SelectKBest(f_classif)],
'reduce_dim__k': N_FEATURES_OPTIONS,
'rf__n_estimators' : N_ESTIMATORS,
'rf__max_depth': MAX_DEPTH
},
]
grid = GridSearchCV(pipe, param_grid= param_grid, cv =10,n_jobs=1,scoring = 'roc_auc')
grid.fit(X_train_s,y_train_s)
And I get this error
AttributeError Traceback (most recent call last)
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/metrics/scorer.py in __call__(self, clf, X, y, sample_weight)
186 try:
--> 187 y_pred = clf.decision_function(X)
188
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/utils/metaestimators.py in __get__(self, obj, type)
108 else:
--> 109 getattr(delegate, self.attribute_name)
110 break
AttributeError: 'RandomForestClassifier' object has no attribute 'decision_function'
During handling of the above exception, another exception occurred:
IndexError Traceback (most recent call last)
<ipython-input-16-86491f3b6aa7> in <module>()
----> 1 grid.fit(X_train_s,y_train_s)
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups, **fit_params)
637 error_score=self.error_score)
638 for parameters, (train, test) in product(candidate_params,
--> 639 cv.split(X, y, groups)))
640
641 # if one choose to see train score, "out" will contain train score info
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
777 # was dispatched. In particular this covers the edge
778 # case of Parallel used with an exhausted iterator.
--> 779 while self.dispatch_one_batch(iterator):
780 self._iterating = True
781 else:
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
623 return False
624 else:
--> 625 self._dispatch(tasks)
626 return True
627
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self, batch)
586 dispatch_timestamp = time.time()
587 cb = BatchCompletionCallBack(dispatch_timestamp, len(batch), self)
--> 588 job = self._backend.apply_async(batch, callback=cb)
589 self._jobs.append(job)
590
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py in apply_async(self, func, callback)
109 def apply_async(self, func, callback=None):
110 """Schedule a func to be run"""
--> 111 result = ImmediateResult(func)
112 if callback:
113 callback(result)
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py in __init__(self, batch)
330 # Don't delay the application, to avoid keeping the input
331 # arguments in memory
--> 332 self.results = batch()
333
334 def get(self):
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self)
129
130 def __call__(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
132
133 def __len__(self):
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0)
129
130 def __call__(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
132
133 def __len__(self):
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/model_selection/_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, error_score)
486 fit_time = time.time() - start_time
487 # _score will return dict if is_multimetric is True
--> 488 test_scores = _score(estimator, X_test, y_test, scorer, is_multimetric)
489 score_time = time.time() - start_time - fit_time
490 if return_train_score:
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/model_selection/_validation.py in _score(estimator, X_test, y_test, scorer, is_multimetric)
521 """
522 if is_multimetric:
--> 523 return _multimetric_score(estimator, X_test, y_test, scorer)
524 else:
525 if y_test is None:
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/model_selection/_validation.py in _multimetric_score(estimator, X_test, y_test, scorers)
551 score = scorer(estimator, X_test)
552 else:
--> 553 score = scorer(estimator, X_test, y_test)
554
555 if hasattr(score, 'item'):
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/metrics/scorer.py in __call__(self, clf, X, y, sample_weight)
195
196 if y_type == "binary":
--> 197 y_pred = y_pred[:, 1]
198 elif isinstance(y_pred, list):
199 y_pred = np.vstack([p[:, -1] for p in y_pred]).T
IndexError: index 1 is out of bounds for axis 1 with size 1
I have searched for this error and found a similar problem with KerasClassifier here, but I have no idea how to fix it:
Keras Wrappers for Scikit Learn - AUC scorer is not working
Can anyone explain to me what is wrong?
The error can have several causes:
If you have only one target class: it fails.
If you have 3 or more target classes: it fails.
Maybe you have 2 classes, but in one fold of the CV the test labels come from only one class.
When sklearn computes the AUC metric it needs exactly 2 classes, because the method for computing the AUC (deriving TPR and FPR at every threshold) is defined for two classes only.
Example of errors:
grid.fit(np.random.rand(100,2), np.random.randint(1, size=100)) #one class labels
grid.fit(np.random.rand(100,2), np.random.randint(3, size=100)) #3 class labels
# both calls throw the same error when computing the AUC
Example that should not throw an error, although it still can, depending on the folds of the CV:
grid.fit(np.random.rand(100,2), np.random.randint(2, size=100)) #two class labels
# this should not throw an error
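To find out which of these causes applies to your data, one option is to count the distinct classes in every train and test fold before running the grid search. The sketch below is only a diagnostic under that assumption: classes_per_fold is a hypothetical helper, the labels are random, and StratifiedKFold is used as the splitter.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def classes_per_fold(y, cv):
    """Return (n_train_classes, n_test_classes) for each split produced by cv."""
    y = np.asarray(y)
    X_dummy = np.zeros((len(y), 1))  # splitters only need the number of rows
    return [(len(np.unique(y[train_idx])), len(np.unique(y[test_idx])))
            for train_idx, test_idx in cv.split(X_dummy, y)]


rng = np.random.RandomState(0)
y_one = np.zeros(100, dtype=int)   # a single class: AUC cannot be computed
y_two = rng.randint(2, size=100)   # two classes

cv = StratifiedKFold(n_splits=5)
counts_one = classes_per_fold(y_one, cv)
counts_two = classes_per_fold(y_two, cv)
print(counts_one)
print(counts_two)
```

Any fold that reports fewer than 2 classes on either side is a fold where the 'roc_auc' scorer is likely to fail.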
SOLUTION
If you have more than 2 classes: you have to compute the AUC manually (there may be libraries for this, but I don't know of any), either as one-vs-all, where you compute a two-class AUC for each class against all the others, or as all-vs-all (pairwise AUC), where you compute the AUC for each pair of classes. In both cases you then take the mean.
If you have 2 classes:
grid = GridSearchCV(pipe, param_grid= param_grid, cv = StratifiedKFold(), n_jobs=1, scoring = 'roc_auc')
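For the case with more than 2 classes, the one-vs-all computation described above can be sketched in plain Python. This is an illustration, not sklearn's implementation: binary_auc and one_vs_rest_auc are hypothetical helpers, the AUC is computed as the fraction of positive/negative pairs ranked correctly (ties count as half), and the labels and scores are made up.

```python
def binary_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t]
    neg = [s for t, s in zip(y_true, scores) if not t]
    if not pos or not neg:
        raise ValueError("AUC needs both positive and negative samples")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(y_true, proba, classes):
    """Mean of the per-class AUCs; proba[i][k] is the score of classes[k] for sample i."""
    aucs = []
    for k, c in enumerate(classes):
        is_c = [label == c for label in y_true]   # class c vs all the others
        aucs.append(binary_auc(is_c, [row[k] for row in proba]))
    return sum(aucs) / len(aucs)


# Made-up labels and class scores for three classes.
y_true = [0, 1, 2, 2]
proba = [[0.7, 0.2, 0.1],
         [0.2, 0.5, 0.3],
         [0.1, 0.3, 0.6],
         [0.3, 0.4, 0.3]]

macro_auc = one_vs_rest_auc(y_true, proba, classes=[0, 1, 2])
print(macro_auc)
```

The pairwise loop is quadratic in the number of samples, which is fine for a quick check but slow on large validation sets.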
