Error when using GridSearchCV and RandomizedSearchCV - python-3.x

When attempting to fit my training data using either GridSearchCV or RandomizedSearchCV, I keep getting the following error:
TypeError: no supported conversion for types: (dtype('O'), dtype('O'))
Here's a sample of the relevant code:
from xgboost.sklearn import XGBRegressor as XGR
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
xgbRegModel = XGR()
params = {'max_depth':[3, 6, 9], 'learning_rate':[.05, .1, .5], 'n_estimators': [50, 100, 200]}
rscv = RandomizedSearchCV(xgbRegModel, params)
rscv.fit(X, y)
rscv.best_model_
where X is a (39942, 112577) scipy.sparse.csr.csr_matrix and y is a (39942,) numpy.ndarray.
The dtypes are all either int64 or float64, and I've tried running it both with np.nan values and after filling the np.nan values with 0... (I thought that might be the problem, but no.)
Can anyone tell me what's going on here? It works just fine when I train the model without using GridSearchCV or RandomizedSearchCV.
Any ideas would be appreciated - thanks!
P.S. The traceback for the error is really long, but here it is in case it helps:
TypeError Traceback (most recent call last)
<ipython-input-54-63d54d4cd03e> in <module>()
3 xgbRegModel = XGR()
4 rscv = RandomizedSearchCV(xgbRegModel, params)
----> 5 rscv.fit(X, y)
6 rscv.best_model_
~\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py in fit(self, X, y, groups, **fit_params)
636 error_score=self.error_score)
637 for parameters, (train, test) in product(candidate_params,
--> 638 cv.split(X, y, groups)))
639
640 # if one choose to see train score, "out" will contain train score info
~\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in __call__(self, iterable)
777 # was dispatched. In particular this covers the edge
778 # case of Parallel used with an exhausted iterator.
--> 779 while self.dispatch_one_batch(iterator):
780 self._iterating = True
781 else:
~\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in dispatch_one_batch(self, iterator)
623 return False
624 else:
--> 625 self._dispatch(tasks)
626 return True
627
~\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in _dispatch(self, batch)
586 dispatch_timestamp = time.time()
587 cb = BatchCompletionCallBack(dispatch_timestamp, len(batch), self)
--> 588 job = self._backend.apply_async(batch, callback=cb)
589 self._jobs.append(job)
590
~\Anaconda3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py in apply_async(self, func, callback)
109 def apply_async(self, func, callback=None):
110 """Schedule a func to be run"""
--> 111 result = ImmediateResult(func)
112 if callback:
113 callback(result)
~\Anaconda3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py in __init__(self, batch)
330 # Don't delay the application, to avoid keeping the input
331 # arguments in memory
--> 332 self.results = batch()
333
334 def get(self):
~\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in __call__(self)
129
130 def __call__(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
132
133 def __len__(self):
~\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in <listcomp>(.0)
129
130 def __call__(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
132
133 def __len__(self):
~\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, error_score)
425 start_time = time.time()
426
--> 427 X_train, y_train = _safe_split(estimator, X, y, train)
428 X_test, y_test = _safe_split(estimator, X, y, test, train)
429
~\Anaconda3\lib\site-packages\sklearn\utils\metaestimators.py in _safe_split(estimator, X, y, indices, train_indices)
198 X_subset = X[np.ix_(indices, train_indices)]
199 else:
--> 200 X_subset = safe_indexing(X, indices)
201
202 if y is not None:
~\Anaconda3\lib\site-packages\sklearn\utils\__init__.py in safe_indexing(X, indices)
160 return X.take(indices, axis=0)
161 else:
--> 162 return X[indices]
163 else:
164 return [X[idx] for idx in indices]
~\Anaconda3\lib\site-packages\scipy\sparse\csr.py in __getitem__(self, key)
315 if isintlike(col) or isinstance(col,slice):
316 P = extractor(row, self.shape[0]) # [[1,2],j] or [[1,2],1:2]
--> 317 extracted = P * self
318 if col == slice(None, None, None):
319 return extracted
~\Anaconda3\lib\site-packages\scipy\sparse\base.py in __mul__(self, other)
367 if self.shape[1] != other.shape[0]:
368 raise ValueError('dimension mismatch')
--> 369 return self._mul_sparse_matrix(other)
370
371 # If it's a list or whatever, treat it like a matrix
~\Anaconda3\lib\site-packages\scipy\sparse\compressed.py in _mul_sparse_matrix(self, other)
539 indptr = np.asarray(indptr, dtype=idx_dtype)
540 indices = np.empty(nnz, dtype=idx_dtype)
--> 541 data = np.empty(nnz, dtype=upcast(self.dtype, other.dtype))
542
543 fn = getattr(_sparsetools, self.format + '_matmat_pass2')
~\Anaconda3\lib\site-packages\scipy\sparse\sputils.py in upcast(*args)
49 return t
50
---> 51 raise TypeError('no supported conversion for types: %r' % (args,))
52
53
TypeError: no supported conversion for types: (dtype('O'), dtype('O'))

That's because GridSearchCV doesn't support sparse matrices in its fit() method. Have a look at the signature of fit here:
Parameters:
X : array-like, shape = [n_samples, n_features]
As you can see, only array-like inputs are documented as supported.
As for why it works without grid search: XGBRegressor itself supports sparse matrices.
The actual error arises during cross-validation, when X is split into train and test folds; that indexing doesn't work for sparse matrices the same way it does for regular arrays.
Also, make sure the sparse matrix you pass to XGBRegressor is of type CSC rather than the CSR you have now, because CSR can give you wrong results. It's described here: https://github.com/dmlc/xgboost/issues/1238
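Following that advice, a minimal sketch (assuming the X, y, xgbRegModel and params from the question) would force a numeric dtype onto the sparse matrix and convert it from CSR to CSC before fitting; the dtype('O') in the traceback suggests the matrix is currently storing Python objects, which scipy's upcast cannot handle:
import numpy as np

# hypothetical fix-up of the matrix from the question: force a numeric dtype,
# then convert CSR -> CSC as recommended for XGBoost
X_fixed = X.astype(np.float64).tocsc()

rscv = RandomizedSearchCV(xgbRegModel, params)
rscv.fit(X_fixed, y)
# the fitted search exposes best_estimator_ (there is no best_model_ attribute)
print(rscv.best_estimator_)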

Related

I was trying to fit and score a logistic regression model but keep getting an error. Can anyone help me with this error?

I am experimenting with logistic regression models, but I don't know why I am getting this error.
models = {"Logistic Regression":LogisticRegression(),}
def fit_and_score(models,x_train,x_test,y_train,y_test):
np.random.seed(42)
model_scores = {}
#loop through model
for name, model in models.items():
model.fit(x_train,y_train)
model_scores[name] = model.score(x_test,y_test)
return model_scores
model_scores = fit_and_score(models=models,
x_train=x_train,
x_test=x_test,
y_train=y_train,
y_test=y_test)
model_scores
AttributeError Traceback (most recent call last)
<ipython-input-33-9c05affc041a> in <module>
----> 1 model_score = fit_and_score(models=models,
2 x_train=x_train,
3 x_test=x_test,
4 y_train=y_train,
5 y_test=y_test)
<ipython-input-32-b7a75c9edc31> in fit_and_score(models, x_train, x_test, y_train, y_test)
21 for name , model in models.items():
22 # fit the model to the data
---> 23 model.fit(x_train,y_train)
24 # Evaluate the model and append it's score to model scores
25 model_scores[name] = model.score(x_test,y_test)
~\Desktop\heart_disease_project\env\lib\site-packages\sklearn\linear_model\_logistic.py
in fit(self, X, y, sample_weight)
1405 else:
1406 prefer = 'processes'
-> 1407 fold_coefs_ = Parallel(n_jobs=self.n_jobs, verbose=self.verbose,
1408 **joblib_parallel_args(prefer=prefer))(
1409 path_func(X, y, pos_class=class_, Cs=[C_],
~\Desktop\heart_disease_project\env\lib\site-packages\joblib\parallel.py
in __call__(self, iterable)
1039 # remaining jobs.
1040 self._iterating = False
-> 1041 if self.dispatch_one_batch(iterator):
1042 self._iterating = self._original_iterator is not None
1043
~\Desktop\heart_disease_project\env\lib\site-packages\joblib\parallel.py
in dispatch_one_batch(self, iterator)
857 return False
858 else:
--> 859 self._dispatch(tasks)
860 return True
861
~\Desktop\heart_disease_project\env\lib\site-packages\joblib\parallel.py
in _dispatch(self, batch)
775 with self._lock:
776 job_idx = len(self._jobs)
--> 777 job = self._backend.apply_async(batch, callback=cb)
778 # A job can complete so quickly than its callback is
779 # called before we get here, causing self._jobs to
~\Desktop\heart_disease_project\env\lib\site-packages\joblib\_parallel_backends.py
in apply_async(self, func, callback)
206 def apply_async(self, func, callback=None):
207 """Schedule a func to be run"""
--> 208 result = ImmediateResult(func)
209 if callback:
210 callback(result)
~\Desktop\heart_disease_project\env\lib\site-packages\joblib\_parallel_backends.py
in __init__(self, batch)
570 # Don't delay the application, to avoid keeping the input
571 # arguments in memory
--> 572 self.results = batch()
573
574 def get(self):
~\Desktop\heart_disease_project\env\lib\site-packages\joblib\parallel.py
in __call__(self)
260 # change the default number of processes to -1
261 with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 262 return [func(*args, **kwargs)
263 for func, args, kwargs in self.items]
264
~\Desktop\heart_disease_project\env\lib\site-packages\joblib\parallel.py
in <listcomp>(.0)
260 # change the default number of processes to -1
261 with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 262 return [func(*args, **kwargs)
263 for func, args, kwargs in self.items]
~\Desktop\heart_disease_project\env\lib\site-packages\sklearn\linear_model\_logistic.py
in _logistic_regression_path(X, y, pos_class, Cs, fit_intercept,
max_iter, tol, verbose, solver, coef, class_weight, dual, penalty,
intercept_scaling, multi_class, random_state, check_input,
max_squared_sum, sample_weight, l1_ratio)
760 options={"iprint": iprint, "gtol": tol, "maxiter": max_iter}
761 )
--> 762 n_iter_i = _check_optimize_result(
763 solver, opt_res, max_iter,
764 extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG
~\Desktop\heart_disease_project\env\lib\site-packages\sklearn\utils\optimize.py
in _check_optimize_result(solver, result, max_iter, extra_warning_msg)
241 " https://scikit-learn.org/stable/modules/"
242 "preprocessing.html"
--> 243 ).format(solver, result.status, result.message.decode("latin1"))
244 if extra_warning_msg is not None:
245 warning_msg += "\n" + extra_warning_msg
AttributeError: 'str' object has no attribute 'decode'
I tried a different solver parameter and it solved the issue:
models = {"Logistic Regression" : LogisticRegression(solver='liblinear'),
"KNN" : KNeighborsClassifier(),
"Random Forest" : RandomForestClassifier()}

can't apply sklearn.compose.ColumnTransformer on only one column of pandas dataframe

I have defined a custom transformer that takes a pandas DataFrame, applies a function to only one column, and leaves all the remaining columns untouched. The transformer works fine during testing, but not when I include it as part of a Pipeline.
Here's the transformer:
import re
from sklearn.base import BaseEstimator, TransformerMixin

class SynopsisCleaner(BaseEstimator, TransformerMixin):
    def __init__(self):
        return None

    def fit(self, X, y=None, **fit_params):
        # nothing to learn from data.
        return self

    def clean_text(self, text):
        text = text.lower()
        text = re.sub(r'#[a-zA-Z0-9_]+', '', text)
        text = re.sub(r'https?://[A-Za-z0-9./]+', '', text)
        text = re.sub(r'www.[^ ]+', '', text)
        text = re.sub(r'[a-zA-Z0-9]*www[a-zA-Z0-9]*com[a-zA-Z0-9]*', '', text)
        text = re.sub(r'[^a-zA-Z]', ' ', text)
        text = [token for token in text.split() if len(token) > 2]
        text = ' '.join(text)
        return text

    def transform(self, X, y=None, **fit_params):
        for i in range(X.shape[0]):
            X[i] = self.clean_text(X[i])
        return X
When I test it manually like this, it works just as expected.
train_synopsis = SynopsisCleaner().transform(train_data['Synopsis'])
But when I include it as part of an sklearn pipeline:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# part 1: defining a column transformer that learns on only one column and transforms it
synopsis_clean_col_tran = ColumnTransformer(
    transformers=[('synopsis_clean_col_tran', SynopsisCleaner(), ['Synopsis'])],
    # set remainder to passthrough to pass along all the un-specified columns untouched to the next steps
    remainder='passthrough')

# make a pipeline now with all the steps
pipe_1 = Pipeline(steps=[('synopsis_cleaning', synopsis_clean_col_tran)])
pipe_1.fit(train_data)
I get KeyError, like shown below:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2890 try:
-> 2891 return self._engine.get_loc(casted_key)
2892 except KeyError as err:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 0
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
16 frames
<ipython-input-10-3396fa5d6092> in <module>()
6 # make a pipeline now with all the steps
7 pipe_1 = Pipeline(steps=[('synopsis_cleaning', synopsis_clean_col_tran)])
----> 8 pipe_1.fit(train_data)
/usr/local/lib/python3.6/dist-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
352 self._log_message(len(self.steps) - 1)):
353 if self._final_estimator != 'passthrough':
--> 354 self._final_estimator.fit(Xt, y, **fit_params)
355 return self
356
/usr/local/lib/python3.6/dist-packages/sklearn/compose/_column_transformer.py in fit(self, X, y)
482 # we use fit_transform to make sure to set sparse_output_ (for which we
483 # need the transformed data) to have consistent output type in predict
--> 484 self.fit_transform(X, y=y)
485 return self
486
/usr/local/lib/python3.6/dist-packages/sklearn/compose/_column_transformer.py in fit_transform(self, X, y)
516 self._validate_remainder(X)
517
--> 518 result = self._fit_transform(X, y, _fit_transform_one)
519
520 if not result:
/usr/local/lib/python3.6/dist-packages/sklearn/compose/_column_transformer.py in _fit_transform(self, X, y, func, fitted)
455 message=self._log_message(name, idx, len(transformers)))
456 for idx, (name, trans, column, weight) in enumerate(
--> 457 self._iter(fitted=fitted, replace_strings=True), 1))
458 except ValueError as e:
459 if "Expected 2D array, got 1D array instead" in str(e):
/usr/local/lib/python3.6/dist-packages/joblib/parallel.py in __call__(self, iterable)
1027 # remaining jobs.
1028 self._iterating = False
-> 1029 if self.dispatch_one_batch(iterator):
1030 self._iterating = self._original_iterator is not None
1031
/usr/local/lib/python3.6/dist-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
845 return False
846 else:
--> 847 self._dispatch(tasks)
848 return True
849
/usr/local/lib/python3.6/dist-packages/joblib/parallel.py in _dispatch(self, batch)
763 with self._lock:
764 job_idx = len(self._jobs)
--> 765 job = self._backend.apply_async(batch, callback=cb)
766 # A job can complete so quickly than its callback is
767 # called before we get here, causing self._jobs to
/usr/local/lib/python3.6/dist-packages/joblib/_parallel_backends.py in apply_async(self, func, callback)
206 def apply_async(self, func, callback=None):
207 """Schedule a func to be run"""
--> 208 result = ImmediateResult(func)
209 if callback:
210 callback(result)
/usr/local/lib/python3.6/dist-packages/joblib/_parallel_backends.py in __init__(self, batch)
570 # Don't delay the application, to avoid keeping the input
571 # arguments in memory
--> 572 self.results = batch()
573
574 def get(self):
/usr/local/lib/python3.6/dist-packages/joblib/parallel.py in __call__(self)
251 with parallel_backend(self._backend, n_jobs=self._n_jobs):
252 return [func(*args, **kwargs)
--> 253 for func, args, kwargs in self.items]
254
255 def __reduce__(self):
/usr/local/lib/python3.6/dist-packages/joblib/parallel.py in <listcomp>(.0)
251 with parallel_backend(self._backend, n_jobs=self._n_jobs):
252 return [func(*args, **kwargs)
--> 253 for func, args, kwargs in self.items]
254
255 def __reduce__(self):
/usr/local/lib/python3.6/dist-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
726 with _print_elapsed_time(message_clsname, message):
727 if hasattr(transformer, 'fit_transform'):
--> 728 res = transformer.fit_transform(X, y, **fit_params)
729 else:
730 res = transformer.fit(X, y, **fit_params).transform(X)
/usr/local/lib/python3.6/dist-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
569 if y is None:
570 # fit method of arity 1 (unsupervised transformation)
--> 571 return self.fit(X, **fit_params).transform(X)
572 else:
573 # fit method of arity 2 (supervised transformation)
<ipython-input-6-004ee595d544> in transform(self, X, y, **fit_params)
20 def transform(self, X, y=None, **fit_params):
21 for i in range(X.shape[0]):
---> 22 X[i] = self.clean_text(X[i])
23 return X
/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py in __getitem__(self, key)
2900 if self.columns.nlevels > 1:
2901 return self._getitem_multilevel(key)
-> 2902 indexer = self.columns.get_loc(key)
2903 if is_integer(indexer):
2904 indexer = [indexer]
/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2891 return self._engine.get_loc(casted_key)
2892 except KeyError as err:
-> 2893 raise KeyError(key) from err
2894
2895 if tolerance is not None:
KeyError: 0
What am I doing wrong here?
EDIT 1: with the brackets removed and the column name specified as a string, this is the error I see:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-11-bdd42b09e2af> in <module>()
6 # make a pipeline now with all the steps
7 pipe_1 = Pipeline(steps=[('synopsis_cleaning', synopsis_clean_col_tran)])
----> 8 pipe_1.fit(train_data)
3 frames
/usr/local/lib/python3.6/dist-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
352 self._log_message(len(self.steps) - 1)):
353 if self._final_estimator != 'passthrough':
--> 354 self._final_estimator.fit(Xt, y, **fit_params)
355 return self
356
/usr/local/lib/python3.6/dist-packages/sklearn/compose/_column_transformer.py in fit(self, X, y)
482 # we use fit_transform to make sure to set sparse_output_ (for which we
483 # need the transformed data) to have consistent output type in predict
--> 484 self.fit_transform(X, y=y)
485 return self
486
/usr/local/lib/python3.6/dist-packages/sklearn/compose/_column_transformer.py in fit_transform(self, X, y)
536
537 self._update_fitted_transformers(transformers)
--> 538 self._validate_output(Xs)
539
540 return self._hstack(list(Xs))
/usr/local/lib/python3.6/dist-packages/sklearn/compose/_column_transformer.py in _validate_output(self, result)
400 raise ValueError(
401 "The output of the '{0}' transformer should be 2D (scipy "
--> 402 "matrix, array, or pandas DataFrame).".format(name))
403
404 def _validate_features(self, n_features, feature_names):
ValueError: The output of the 'synopsis_clean_col_tran' transformer should be 2D (scipy matrix, array, or pandas DataFrame).
In your manual test, you are passing the Series train_data['Synopsis'], but the column transformer is passing the Frame train_data[['Synopsis']]. (So, to clarify the error: X[i] is trying to get the column named 0, which indeed does not exist.) You should be able to fix this as easily as dropping the brackets around 'Synopsis' in the column specification of the transformer. From the docs:
...A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer. ...
That is,
synopsis_clean_col_tran = ColumnTransformer(
    transformers=[('synopsis_clean_col_tran', SynopsisCleaner(), 'Synopsis')],
    # set remainder to passthrough to pass along all the un-specified columns untouched to the next steps
    remainder='passthrough',
)
Ah, but then ColumnTransformer complains that the output of your transformer is one-dimensional; that's unfortunate. I think the cleanest thing then is to switch your transform to expect both input and output as 2D. If you'll only ever need dataframes as input (no other sklearn transformers converting to numpy arrays), then this can be relatively simple using a FunctionTransformer instead of your custom class.
from sklearn.preprocessing import FunctionTransformer

def clean_text_frame(X):
    return X.applymap(clean_text)  # the function "clean_text" currently in your class.

synopsis_clean_col_tran = ColumnTransformer(
    transformers=[('synopsis_clean_col_tran', FunctionTransformer(clean_text_frame), ['Synopsis'])],
    # set remainder to passthrough to pass along all the un-specified columns untouched to the next steps
    remainder='passthrough',
)
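A short usage sketch (assuming the train_data DataFrame from the question and a module-level clean_text function extracted from the class) would then be:
from sklearn.pipeline import Pipeline

# the FunctionTransformer version returns a DataFrame, so ColumnTransformer's
# 2D output check is satisfied
pipe_1 = Pipeline(steps=[('synopsis_cleaning', synopsis_clean_col_tran)])
pipe_1.fit(train_data)
cleaned = pipe_1.transform(train_data)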

Using class weights with sklearn VotingClassifier

I have an imbalanced dataset for a classification problem. My target variable is binary, with two categories.
I implemented Random Forest and Logistic Regression, passing class_weight as a parameter.
When I fit the data to the random forest and the logistic regression separately, it works fine. But when I use VotingClassifier from sklearn.ensemble on top of them, fitting gives the error "Class label no_payment not present." I need an ensemble of 3 or more models. I have checked that this error is not caused by the naive_bayes model in the code.
My code:
from sklearn import ensemble, linear_model, naive_bayes

rf_param = {'class_weight': {'no_payment': 1, 'payment': 3}, 'criterion': 'gini', 'max_depth': 5,
            'min_samples_leaf': 30, 'min_samples_split': 15, 'n_estimators': 100}
lr_param = {'C': 0.1, 'class_weight': {'no_payment': 1, 'payment': 3}, 'fit_intercept': False, 'penalty': 'l2'}
rf = ensemble.RandomForestClassifier(**rf_param)
lr = linear_model.LogisticRegression(**lr_param)
nb = naive_bayes.MultinomialNB(alpha=0.0, class_prior=None, fit_prior=False)
rf.fit(train_x, train_y)
lr.fit(train_x, train_y)
nb.fit(train_x, train_y)
model = ensemble.VotingClassifier(estimators=[('rf', rf), ('lr', lr), ('nb', nb)],
                                  voting='hard', weights=[2, 2, 1])
model.fit(train_x, train_y)
predictions = model.predict(valid_x)
This code runs perfectly if I remove class_weight from the parameter list.
Below is the complete error message.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-35-e05cd516f347> in <module>()
15 )
16
---> 17 model.fit(train_x, train_y)
18
19 predictions = model.predict(valid_x)
/home/.local/lib/python3.6/site-packages/sklearn/ensemble/_voting.py in fit(self, X, y, sample_weight)
220 transformed_y = self.le_.transform(y)
221
--> 222 return super().fit(X, transformed_y, sample_weight)
223
224 def predict(self, X):
/home/.local/lib/python3.6/site-packages/sklearn/ensemble/_voting.py in fit(self, X, y, sample_weight)
66 delayed(_parallel_fit_estimator)(clone(clf), X, y,
67 sample_weight=sample_weight)
---> 68 for clf in clfs if clf not in (None, 'drop')
69 )
70
/home/.local/lib/python3.6/site-packages/joblib/parallel.py in __call__(self, iterable)
1002 # remaining jobs.
1003 self._iterating = False
-> 1004 if self.dispatch_one_batch(iterator):
1005 self._iterating = self._original_iterator is not None
1006
/home/.local/lib/python3.6/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
833 return False
834 else:
--> 835 self._dispatch(tasks)
836 return True
837
/home/.local/lib/python3.6/site-packages/joblib/parallel.py in _dispatch(self, batch)
752 with self._lock:
753 job_idx = len(self._jobs)
--> 754 job = self._backend.apply_async(batch, callback=cb)
755 # A job can complete so quickly than its callback is
756 # called before we get here, causing self._jobs to
/home/.local/lib/python3.6/site-packages/joblib/_parallel_backends.py in apply_async(self, func, callback)
207 def apply_async(self, func, callback=None):
208 """Schedule a func to be run"""
--> 209 result = ImmediateResult(func)
210 if callback:
211 callback(result)
/home/.local/lib/python3.6/site-packages/joblib/_parallel_backends.py in __init__(self, batch)
588 # Don't delay the application, to avoid keeping the input
589 # arguments in memory
--> 590 self.results = batch()
591
592 def get(self):
/home/.local/lib/python3.6/site-packages/joblib/parallel.py in __call__(self)
254 with parallel_backend(self._backend, n_jobs=self._n_jobs):
255 return [func(*args, **kwargs)
--> 256 for func, args, kwargs in self.items]
257
258 def __len__(self):
/home/.local/lib/python3.6/site-packages/joblib/parallel.py in <listcomp>(.0)
254 with parallel_backend(self._backend, n_jobs=self._n_jobs):
255 return [func(*args, **kwargs)
--> 256 for func, args, kwargs in self.items]
257
258 def __len__(self):
/home/.local/lib/python3.6/site-packages/sklearn/ensemble/_base.py in _parallel_fit_estimator(estimator, X, y, sample_weight)
34 raise
35 else:
---> 36 estimator.fit(X, y)
37 return estimator
38
/home/.local/lib/python3.6/site-packages/sklearn/ensemble/_forest.py in fit(self, X, y, sample_weight)
319 self.n_outputs_ = y.shape[1]
320
--> 321 y, expanded_class_weight = self._validate_y_class_weight(y)
322
323 if getattr(y, "dtype", None) != DOUBLE or not y.flags.contiguous:
/home/.local/lib/python3.6/site-packages/sklearn/ensemble/_forest.py in _validate_y_class_weight(self, y)
585 class_weight = self.class_weight
586 expanded_class_weight = compute_sample_weight(class_weight,
--> 587 y_original)
588
589 return y, expanded_class_weight
/home/.local/lib/python3.6/site-packages/sklearn/utils/class_weight.py in compute_sample_weight(class_weight, y, indices)
161 weight_k = compute_class_weight(class_weight_k,
162 classes_full,
--> 163 y_full)
164
165 weight_k = weight_k[np.searchsorted(classes_full, y_full)]
/home/.local/lib/python3.6/site-packages/sklearn/utils/class_weight.py in compute_class_weight(class_weight, classes, y)
63 i = np.searchsorted(classes, c)
64 if i >= len(classes) or classes[i] != c:
---> 65 raise ValueError("Class label {} not present.".format(c))
66 else:
67 weight[i] = class_weight[c]
ValueError: Class label no_payment not present.

How to predict the target on a validation set with a Pipeline containing OneHotEncoder and LightGBM?

I am trying to use sklearn and LightGBM with both numerical and categorical features. I created a Pipeline with:
one step for data preprocessing relying on ColumnTransformer (categorical variables are encoded with OneHotEncoder);
one step for the actual model training with LightGBM.
It trains my model just fine, but I get an error message when I want to use the model for prediction on a test dataset. It looks like the preprocessing is not applied to this test dataset, but I don't understand why. In the tutorials I've found online, it seems to work, though with sklearn classifiers.
Here is my code:
from lightgbm import LGBMClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import roc_auc_score
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Numerical features
numerical_features = ['Distance']
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

# Categorical features
categorical_features = ['Travel', 'Month', 'DayofMonth', 'DayOfWeek', 'UniqueCarrier']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder())])

# Build the preprocessor with ColumnTransformer
preprocess = ColumnTransformer(transformers=[
    ('num', numerical_transformer, numerical_features),
    ('cat', categorical_transformer, categorical_features)
])

# Build a pipeline
clf = Pipeline(steps=[('preprocess', preprocess),
                      ('classifier', LGBMClassifier(random_state=17))])

# Fit
clf.fit(X_build, y_build)

# Scores
print("model training score (clf internal scoring function with standard parameters): {0}".format(clf.score(X_build, y_build)))  # returns a score
print("Score: %f" % clf.score(X_valid, y_valid))  # Here is the problem
And here is the error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-116-70bf0e236540> in <module>()
----> 1 print("Score: %f" % clf.predict(X_valid))
~/anaconda3/lib/python3.6/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
116
117 # lambda, but not partial, allows help() to work with update_wrapper
--> 118 out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
119 # update the docstring of the returned function
120 update_wrapper(out, self.fn)
~/anaconda3/lib/python3.6/site-packages/sklearn/pipeline.py in predict(self, X, **predict_params)
329 for name, transform in self.steps[:-1]:
330 if transform is not None:
--> 331 Xt = transform.transform(Xt)
332 return self.steps[-1][-1].predict(Xt, **predict_params)
333
~/anaconda3/lib/python3.6/site-packages/sklearn/compose/_column_transformer.py in transform(self, X)
491
492 X = _check_X(X)
--> 493 Xs = self._fit_transform(X, None, _transform_one, fitted=True)
494 self._validate_output(Xs)
495
~/anaconda3/lib/python3.6/site-packages/sklearn/compose/_column_transformer.py in _fit_transform(self, X, y, func, fitted)
391 _get_column(X, column), y, weight)
392 for _, trans, column, weight in self._iter(
--> 393 fitted=fitted, replace_strings=True))
394 except ValueError as e:
395 if "Expected 2D array, got 1D array instead" in str(e):
~/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
918 self._iterating = self._original_iterator is not None
919
--> 920 while self.dispatch_one_batch(iterator):
921 pass
922
~/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
757 return False
758 else:
--> 759 self._dispatch(tasks)
760 return True
761
~/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self, batch)
714 with self._lock:
715 job_idx = len(self._jobs)
--> 716 job = self._backend.apply_async(batch, callback=cb)
717 # A job can complete so quickly than its callback is
718 # called before we get here, causing self._jobs to
~/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py in apply_async(self, func, callback)
180 def apply_async(self, func, callback=None):
181 """Schedule a func to be run"""
--> 182 result = ImmediateResult(func)
183 if callback:
184 callback(result)
~/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py in __init__(self, batch)
547 # Don't delay the application, to avoid keeping the input
548 # arguments in memory
--> 549 self.results = batch()
550
551 def get(self):
~/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self)
223 with parallel_backend(self._backend, n_jobs=self._n_jobs):
224 return [func(*args, **kwargs)
--> 225 for func, args, kwargs in self.items]
226
227 def __len__(self):
~/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0)
223 with parallel_backend(self._backend, n_jobs=self._n_jobs):
224 return [func(*args, **kwargs)
--> 225 for func, args, kwargs in self.items]
226
227 def __len__(self):
~/anaconda3/lib/python3.6/site-packages/sklearn/pipeline.py in _transform_one(transformer, X, y, weight, **fit_params)
603
604 def _transform_one(transformer, X, y, weight, **fit_params):
--> 605 res = transformer.transform(X)
606 # if we have a weight for this transformer, multiply output
607 if weight is None:
~/anaconda3/lib/python3.6/site-packages/sklearn/pipeline.py in _transform(self, X)
449 for name, transform in self.steps:
450 if transform is not None:
--> 451 Xt = transform.transform(Xt)
452 return Xt
453
~/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py in transform(self, X)
611 copy=True)
612 else:
--> 613 return self._transform_new(X)
614
615 def inverse_transform(self, X):
~/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py in _transform_new(self, X)
572 n_samples, n_features = X.shape
573
--> 574 X_int, X_mask = self._transform(X, handle_unknown=self.handle_unknown)
575
576 mask = X_mask.ravel()
~/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py in _transform(self, X, handle_unknown)
105 msg = ("Found unknown categories {0} in column {1}"
106 " during transform".format(diff, i))
--> 107 raise ValueError(msg)
108 else:
109 # Set the problematic rows to an acceptable value and
ValueError: Found unknown categories ['BOS-CHS', 'ORD-JAC', 'LAS-OKC', 'VCT-IAH', 'CVG-EGE', 'PIT-PVD', 'BDL-SLC', 'TEX-PHX', 'LAX-LGA', 'LEX-LGA', 'CLE-SLC', 'KOA-SNA', 'SNA-HNL', 'MDW-SNA', 'MIA-SEA', 'MEM-RDU', 'YUM-IPL', 'SLC-KOA', 'EGE-EWR', 'MTJ-DFW', 'TPA-CHS', 'FLL-OAK', 'PVD-MCI', 'SLC-DSM', 'RSW-DEN', 'ORD-JAN', 'ATL-FSD', 'CHS-JAX', 'MCO-MLI', 'FSD-SLC', 'SLC-LGA', 'GRB-DFW', 'PNS-JAX', 'BDL-LAX', 'ATL-SOP', 'MSP-FAI', 'CLT-CAE', 'PIT-SEA', 'SRQ-IND', 'PHF-CLT', 'MIA-CMH', 'FAR-SLC', 'TUL-LAS', 'EWR-TUS', 'ORD-STT', 'CLT-TRI', 'BHM-CLE', 'ORD-PWM', 'SRQ-IAH', 'BOI-ORD', 'ATL-EGE', 'ATL-CID', 'IND-MSY', 'EGE-LAX', 'BUR-PDX', 'BTR-LGA', 'MIA-SLC', 'ONT-PDX', 'CLE-SBN', 'MSP-JAC', 'CMH-FLL', 'MEM-AUS', 'PHX-MFR', 'SJU-STL', 'ASE-SLC', 'CID-ATL', 'DFW-MLI', 'SCC-BRW', 'LGA-MSN', 'MCO-PFN', 'MDW-SJU', 'SEA-SIT', 'DTW-OMA', 'GRR-TPA', 'EGE-SFO', 'DFW-RST', 'GRR-LAS', 'TPA-TLH', 'PWM-CLT', 'TLH-MIA', 'PHF-FLL', 'SFO-EGE', 'SAT-STL', 'RSW-MKE', 'DTW-MSY', 'IAH-TXK', 'TLH-JFK', 'ATL-GUC', 'IAH-VCT', 'DEN-GRR', 'IND-SEA', 'PIE-MDW', 'BHM-IAD', 'IAD-BHM', 'BUR-MCO', 'MTJ-EWR', 'CLE-HOU', 'MSY-STL', 'DFW-SYR', 'BUF-LAS', 'LEX-EWR'] in column 0 during transform
Do you know what the problem is?
Thanks!
The problem seems to be that the OneHotEncoder finds new categories in the validation sample that were not there in the training sample.
Unfortunately, sklearn's implementation cannot handle such a situation out of the box, so you have to make sure the categories in the new data match those in the training set. There are different strategies for treating new categories; it is worth searching around and experimenting. For example: make the encoder aware of all possible categories, including those in the new data (using the categories argument in the constructor), or drop the new categories from the new data (compare the new data against the automatically learned categories_ attribute). Of course, the first option does not make sense in production, but the second can always be implemented.
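As an illustration of the first option (my own sketch, not from the original answer), assuming a full_data DataFrame that contains every value of the categorical columns — a hypothetical name, since the question only shows X_build and X_valid — you could pre-compute the category lists and hand them to the encoder:
# collect the full set of levels for each categorical column up front,
# so OneHotEncoder never meets an unseen category at predict time
all_categories = [sorted(full_data[col].dropna().unique()) for col in categorical_features]

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(categories=all_categories))])
# note: if the imputer actually inserts 'missing', that value must appear
# in the corresponding category list as well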

scoring "roc_auc" value is not working with gridsearchCV appling RandomForestclassifer

I keep getting this error when I run GridSearchCV with the scoring value 'roc_auc' ('f1', 'precision', and 'recall' work fine):
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Construct a pipeline
pipe = Pipeline([
    ('reduce_dim', PCA()),
    ('rf', RandomForestClassifier(min_samples_leaf=5, random_state=123))
])
N_FEATURES_OPTIONS = [2]  # for PCA [2, 4, 8]
# the parameters below are for RandomForestClassifier
N_ESTIMATORS = [10, 50]  # 10, 50, 100
MAX_DEPTH = [5, 6]  # 5, 6, 7, 8, 9
MIN_SAMPLE_LEAF = 5
param_grid = [
    {
        'reduce_dim': [PCA()],
        'reduce_dim__n_components': N_FEATURES_OPTIONS,
        'rf__n_estimators': N_ESTIMATORS,
        'rf__max_depth': MAX_DEPTH
    },
    {
        'reduce_dim': [SelectKBest(f_classif)],
        'reduce_dim__k': N_FEATURES_OPTIONS,
        'rf__n_estimators': N_ESTIMATORS,
        'rf__max_depth': MAX_DEPTH
    },
]
grid = GridSearchCV(pipe, param_grid=param_grid, cv=10, n_jobs=1, scoring='roc_auc')
grid.fit(X_train_s, y_train_s)
And I get this error
AttributeError Traceback (most recent call last)
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/metrics/scorer.py in __call__(self, clf, X, y, sample_weight)
186 try:
--> 187 y_pred = clf.decision_function(X)
188
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/utils/metaestimators.py in __get__(self, obj, type)
108 else:
--> 109 getattr(delegate, self.attribute_name)
110 break
AttributeError: 'RandomForestClassifier' object has no attribute 'decision_function'
During handling of the above exception, another exception occurred:
IndexError Traceback (most recent call last)
<ipython-input-16-86491f3b6aa7> in <module>()
----> 1 grid.fit(X_train_s,y_train_s)
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups, **fit_params)
637 error_score=self.error_score)
638 for parameters, (train, test) in product(candidate_params,
--> 639 cv.split(X, y, groups)))
640
641 # if one choose to see train score, "out" will contain train score info
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
777 # was dispatched. In particular this covers the edge
778 # case of Parallel used with an exhausted iterator.
--> 779 while self.dispatch_one_batch(iterator):
780 self._iterating = True
781 else:
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
623 return False
624 else:
--> 625 self._dispatch(tasks)
626 return True
627
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self, batch)
586 dispatch_timestamp = time.time()
587 cb = BatchCompletionCallBack(dispatch_timestamp, len(batch), self)
--> 588 job = self._backend.apply_async(batch, callback=cb)
589 self._jobs.append(job)
590
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py in apply_async(self, func, callback)
109 def apply_async(self, func, callback=None):
110 """Schedule a func to be run"""
--> 111 result = ImmediateResult(func)
112 if callback:
113 callback(result)
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py in __init__(self, batch)
330 # Don't delay the application, to avoid keeping the input
331 # arguments in memory
--> 332 self.results = batch()
333
334 def get(self):
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self)
129
130 def __call__(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
132
133 def __len__(self):
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0)
129
130 def __call__(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
132
133 def __len__(self):
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/model_selection/_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, error_score)
486 fit_time = time.time() - start_time
487 # _score will return dict if is_multimetric is True
--> 488 test_scores = _score(estimator, X_test, y_test, scorer, is_multimetric)
489 score_time = time.time() - start_time - fit_time
490 if return_train_score:
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/model_selection/_validation.py in _score(estimator, X_test, y_test, scorer, is_multimetric)
521 """
522 if is_multimetric:
--> 523 return _multimetric_score(estimator, X_test, y_test, scorer)
524 else:
525 if y_test is None:
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/model_selection/_validation.py in _multimetric_score(estimator, X_test, y_test, scorers)
551 score = scorer(estimator, X_test)
552 else:
--> 553 score = scorer(estimator, X_test, y_test)
554
555 if hasattr(score, 'item'):
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/metrics/scorer.py in __call__(self, clf, X, y, sample_weight)
195
196 if y_type == "binary":
--> 197 y_pred = y_pred[:, 1]
198 elif isinstance(y_pred, list):
199 y_pred = np.vstack([p[:, -1] for p in y_pred]).T
IndexError: index 1 is out of bounds for axis 1 with size 1
I have looked up this error and found a similar problem with KerasClassifier, but I have no idea how to fix it:
Keras Wrappers for Scikit Learn - AUC scorer is not working
Can anyone explain to me what is wrong?
The error can have a few causes:
If you have only one target class: it fails.
If you have >= 3 target classes: it fails.
Maybe you have 2 classes, and in one fold of the CV the test labels happen to come from only one class.
When sklearn computes the AUC metric, it must see exactly 2 classes, because the method for getting the AUC works only for the binary case (computing tpr and fpr across all thresholds).
Examples that error:
grid.fit(np.random.rand(100, 2), np.random.randint(1, size=100))  # one-class labels
grid.fit(np.random.rand(100, 2), np.random.randint(3, size=100))  # 3-class labels
# BOTH throw the same error when computing AUC
Example that should not throw an error, although it still could depending on how the CV folds split the classes:
grid.fit(np.random.rand(100, 2), np.random.randint(2, size=100))  # two-class labels
# This shouldn't throw an error
SOLUTION
If you have more than 2 classes: you have to compute the AUC manually (or maybe there are libraries for it, but I don't know of any), either one-vs-all, where you compute the AUC for each class against all the others and average the results, or all-vs-all (pairwise AUC over every pair of classes, again averaged) — see the sketch after the code below.
If you have 2 classes, use stratified folds so every fold contains both classes:
from sklearn.model_selection import StratifiedKFold
grid = GridSearchCV(pipe, param_grid=param_grid, cv=StratifiedKFold(), n_jobs=1, scoring='roc_auc')
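For the multi-class case, here is a minimal one-vs-rest sketch (my own illustration, not part of the original answer), assuming a fitted classifier clf with predict_proba plus held-out X_test and y_test:
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

def one_vs_rest_auc(clf, X_test, y_test):
    # assumes a fitted classifier with predict_proba and 3 or more classes
    proba = clf.predict_proba(X_test)                      # shape (n_samples, n_classes)
    y_bin = label_binarize(y_test, classes=clf.classes_)   # one indicator column per class
    # AUC of each class against the rest, then the macro average
    aucs = [roc_auc_score(y_bin[:, i], proba[:, i]) for i in range(len(clf.classes_))]
    return np.mean(aucs)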
