How to use SHAP with a linear SVC model from sklearn using Pipeline?

I am doing text classification using a linear SVC model from sklearn. Now I want to visualize which words/tokens have the highest impact on the classification decision by using SHAP (https://github.com/slundberg/shap).
Right now this does not work: I am getting an error that seems to originate from the vectorizer step in the pipeline I have defined. What's wrong here?
Is my general approach on how to use SHAP in this case correct?
x_Train, x_Test, y_Train, y_Test = train_test_split(df_all['PDFText'], df_all['class'], test_size = 0.2, random_state = 1234)
pipeline = Pipeline([
    ('tfidv', TfidfVectorizer(
        ngram_range=(1, 3),
        analyzer='word',
        strip_accents='ascii',
        use_idf=True,
        sublinear_tf=True,
        max_features=6000,
        min_df=2,
        max_df=1.0
    )),
    ('lin_svc', svm.SVC(
        C=1.0,
        probability=True,
        kernel='linear'
    ))
])
pipeline.fit(x_Train, y_Train)
shap.initjs()
explainer = shap.KernelExplainer(pipeline.predict_proba, x_Train)
shap_values = explainer.shap_values(x_Test, nsamples=100)
shap.force_plot(explainer.expected_value[0], shap_values[0][0,:], x_Test.iloc[0,:])
This is the error message I get:
Provided model function fails when applied to the provided data set.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-81-4bca63616b3b> in <module>
3
4 # use Kernel SHAP to explain test set predictions
----> 5 explainer = shap.KernelExplainer(pipeline.predict_proba, x_Train)
6 shap_values = explainer.shap_values(x_Test, nsamples=100)
7
c:\users\s.p\appdata\local\programs\python\python37\lib\site-packages\shap\explainers\kernel.py in __init__(self, model, data, link, **kwargs)
95 self.keep_index_ordered = kwargs.get("keep_index_ordered", False)
96 self.data = convert_to_data(data, keep_index=self.keep_index)
---> 97 model_null = match_model_to_data(self.model, self.data)
98
99 # enforce our current input type limitations
c:\users\s.p\appdata\local\programs\python\python37\lib\site-packages\shap\common.py in match_model_to_data(model, data)
80 out_val = model.f(data.convert_to_df())
81 else:
---> 82 out_val = model.f(data.data)
83 except:
84 print("Provided model function fails when applied to the provided data set.")
c:\users\s.p\appdata\local\programs\python\python37\lib\site-packages\sklearn\utils\metaestimators.py in <lambda>(*args, **kwargs)
116
117 # lambda, but not partial, allows help() to work with update_wrapper
--> 118 out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
119 # update the docstring of the returned function
120 update_wrapper(out, self.fn)
c:\users\s.p\appdata\local\programs\python\python37\lib\site-packages\sklearn\pipeline.py in predict_proba(self, X)
379 for name, transform in self.steps[:-1]:
380 if transform is not None:
--> 381 Xt = transform.transform(Xt)
382 return self.steps[-1][-1].predict_proba(Xt)
383
c:\users\s.p\appdata\local\programs\python\python37\lib\site-packages\sklearn\feature_extraction\text.py in transform(self, raw_documents, copy)
1631 check_is_fitted(self, '_tfidf', 'The tfidf vector is not fitted')
1632
-> 1633 X = super(TfidfVectorizer, self).transform(raw_documents)
1634 return self._tfidf.transform(X, copy=False)
c:\users\s.p\appdata\local\programs\python\python37\lib\site-packages\sklearn\feature_extraction\text.py in transform(self, raw_documents)
1084
1085 # use the same matrix-building strategy as fit_transform
-> 1086 _, X = self._count_vocab(raw_documents, fixed_vocab=True)
1087 if self.binary:
1088 X.data.fill(1)
c:\users\s.p\appdata\local\programs\python\python37\lib\site-packages\sklearn\feature_extraction\text.py in _count_vocab(self, raw_documents, fixed_vocab)
940 for doc in raw_documents:
941 feature_counter = {}
--> 942 for feature in analyze(doc):
943 try:
944 feature_idx = vocabulary[feature]
c:\users\s.p\appdata\local\programs\python\python37\lib\site-packages\sklearn\feature_extraction\text.py in <lambda>(doc)
326 tokenize)
327 return lambda doc: self._word_ngrams(
--> 328 tokenize(preprocess(self.decode(doc))), stop_words)
329
330 else:
c:\users\s.p\appdata\local\programs\python\python37\lib\site-packages\sklearn\feature_extraction\text.py in <lambda>(x)
254
255 if self.lowercase:
--> 256 return lambda x: strip_accents(x.lower())
257 else:
258 return strip_accents
AttributeError: 'numpy.ndarray' object has no attribute 'lower'

KernelExplainer expects to receive a model function that operates on numeric feature arrays. When you pass the whole pipeline's predict_proba, SHAP converts the background data to a NumPy array and feeds perturbed rows into the TfidfVectorizer, which expects raw strings, hence the AttributeError: 'numpy.ndarray' object has no attribute 'lower'.
In your case, you can split the Pipeline apart and explain only the classifier on already-vectorized data (the vectorizer is already fitted by pipeline.fit, so transform is enough):
x_Train_tfidf = pipeline.named_steps['tfidv'].transform(x_Train)
explainer = shap.KernelExplainer(pipeline.named_steps['lin_svc'].predict_proba, x_Train_tfidf)
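A fuller standalone sketch of that approach (the shap.kmeans background summary, the dense conversion, and the feature-name mapping are practical assumptions added here, not part of the original answer):
import shap

tfidf = pipeline.named_steps['tfidv']
svc = pipeline.named_steps['lin_svc']

# The explainer only ever sees numeric TF-IDF features; densify because
# KernelExplainer and force_plot work most reliably with dense arrays.
x_train_tfidf = tfidf.transform(x_Train).toarray()
x_test_tfidf = tfidf.transform(x_Test).toarray()

# Summarize the background set; KernelExplainer is very slow otherwise.
background = shap.kmeans(x_train_tfidf, 10)
explainer = shap.KernelExplainer(svc.predict_proba, background)
shap_values = explainer.shap_values(x_test_tfidf[:1], nsamples=100)

# Map columns back to tokens so the force plot shows words, not indices.
# (get_feature_names() matches sklearn of this vintage; newer releases
# renamed it to get_feature_names_out().)
shap.initjs()
shap.force_plot(explainer.expected_value[0], shap_values[0][0, :],
                feature_names=tfidf.get_feature_names())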

Related

XGBoost: While using the `eval_set` in .fit causing Error

I'm trying to train the model using XGBoost. The code does the split using KFold, and for each fold it runs the XGBoost model using fit. Within the fit function, I'm trying to evaluate both train and valid data to check the errors, and then do the prediction on the test set.
I'm running the following code using Xgboost.
kf = GroupKFold(n_splits=4)
for trn_idx, test_idx in kf.split(X, groups=X.year):
    x_train, x_valid = X.iloc[trn_idx], X.iloc[test_idx]
    y_train, y_valid = y.iloc[trn_idx], y.iloc[test_idx]
    xgb_model = xgb.XGBRegressor(
        booster='dart',
        eta=0.1,
        gamma=0,
        colsample_bytree=0.7,
        n_estimators=1200,
        max_depth=1,
        reg_alpha=1.1,
        reg_lambda=1.1,
        subsample=0.03,
        eval_metric=my_smape)
    xgb_model.fit(x_train, y_train,
                  eval_set=[(x_train, y_train), (x_valid, y_valid)],
                  early_stopping_rounds=20,
                  verbose=True)
But I'm getting the following error. I checked the documentation, and my code follows it. Can someone please help me find the solution?
AttributeError Traceback (most recent call last)
<ipython-input-38-81b11a21472c> in <module>
23 eval_metric=my_smape)
24
---> 25 xgb_model.fit(x_train, y_train,
26 eval_set=[(x_valid,y_valid)], early_stopping_rounds=20,
27 verbose=True)
D:\Anaconda\lib\site-packages\xgboost\core.py in inner_f(*args, **kwargs)
573 for k, arg in zip(sig.parameters, args):
574 kwargs[k] = arg
--> 575 return f(**kwargs)
576
577 return inner_f
D:\Anaconda\lib\site-packages\xgboost\sklearn.py in fit(self, X, y, sample_weight, base_margin, eval_set, eval_metric, early_stopping_rounds, verbose, xgb_model, sample_weight_eval_set, base_margin_eval_set, feature_weights, callbacks)
959 xgb_model, eval_metric, params, early_stopping_rounds, callbacks
960 )
--> 961 self._Booster = train(
962 params,
963 train_dmatrix,
D:\Anaconda\lib\site-packages\xgboost\core.py in inner_f(*args, **kwargs)
573 for k, arg in zip(sig.parameters, args):
574 kwargs[k] = arg
--> 575 return f(**kwargs)
576
577 return inner_f
D:\Anaconda\lib\site-packages\xgboost\training.py in train(params, dtrain, num_boost_round, evals, obj, feval, maximize, early_stopping_rounds, evals_result, verbose_eval, xgb_model, callbacks, custom_metric)
180 break
181 bst.update(dtrain, i, obj)
--> 182 if cb_container.after_iteration(bst, i, dtrain, evals):
183 break
184
D:\Anaconda\lib\site-packages\xgboost\callback.py in after_iteration(self, model, epoch, dtrain, evals)
237 for _, name in evals:
238 assert name.find('-') == -1, 'Dataset name should not contain `-`'
--> 239 score: str = model.eval_set(evals, epoch, self.metric, self._output_margin)
240 splited = score.split()[1:] # into datasets
241 # split up `test-error:0.1234`
D:\Anaconda\lib\site-packages\xgboost\core.py in eval_set(self, evals, iteration, feval, output_margin)
1860 if feval is not None:
1861 for dmat, evname in evals:
-> 1862 feval_ret = feval(
1863 self.predict(dmat, training=False, output_margin=output_margin), dmat
1864 )
D:\Anaconda\lib\site-packages\xgboost\sklearn.py in inner(y_score, dmatrix)
99 def inner(y_score: np.ndarray, dmatrix: DMatrix) -> Tuple[str, float]:
100 y_true = dmatrix.get_label()
--> 101 return func.__name__, func(y_true, y_score)
102 return inner
103
AttributeError: '_PredictScorer' object has no attribute '__name__'
It looks like you've run make_scorer() on your custom metric. make_scorer() returns a scorer object, while XGBoost expects a plain (y_true, y_pred) function, which is why the lookup of __name__ on the '_PredictScorer' object fails. Try supplying the original function as eval_metric instead; this should fix the issue.
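For example (a minimal sketch; the SMAPE formula below is an assumption, since the question doesn't show how my_smape was defined):
import numpy as np
import xgboost as xgb

def my_smape(y_true, y_pred):
    # XGBoost's sklearn wrapper expects a plain (y_true, y_pred) -> float
    # callable for eval_metric; this is one common SMAPE formulation.
    return 100 * np.mean(2 * np.abs(y_pred - y_true) / (np.abs(y_true) + np.abs(y_pred)))

xgb_model = xgb.XGBRegressor(booster='dart', n_estimators=1200,
                             eval_metric=my_smape)  # the raw function, not make_scorer(my_smape)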

Scikit learn custom scoring function - Specificity

I'm trying to do a randomized grid search on RandomForestClassifier.
# Instantiate a RandomForestClassifier
RFC = RandomForestClassifier()
# Instantiate the RandomizedSearchCV object: RFC
rand_search3 = RandomizedSearchCV(RFC, param_grid, n_iter=10, cv=5,n_jobs=-1, verbose=1, scoring = "f1_macro")
# Fit it to the data
rand_search3.fit(X_train_transformed,y_train)
I'm trying to get the best model by assessing specificity.
I went through the documentation for custom scoring and looked at lots of related posts. I have come up with two ways to compute the specificity.
1:
from sklearn.metrics import confusion_matrix, make_scorer

def my_custom_func(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    return cm[1][1] / (cm[1][0] + cm[1][1])

Specificity_score = make_scorer(my_custom_func, greater_is_better=True)
2:
from sklearn.metrics import recall_score
from sklearn.metrics import make_scorer
specificity = make_scorer(recall_score, pos_label=0, greater_is_better=True)
specificity
When I try using the custom function for the scoring,
rand_search3 = RandomizedSearchCV(RFC, param_grid, n_iter=10, cv=5,n_jobs=-1, verbose=1, scoring = Specificity_score)
rand_search3 = RandomizedSearchCV(RFC, param_grid, n_iter=10, cv=5,n_jobs=-1, verbose=1, scoring = specificity)
both failed with the same error message.
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_10548/1204393696.py in <module>
20
21 # Fit it to the data
---> 22 rand_search3.fit(X_train_transformed,y_train)
23
~\anaconda3\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
61 #wraps(f)
62 def inner_f(*args, **kwargs):
---> 63 extra_args = len(args) - len(all_args)
64 if extra_args <= 0:
65 return f(*args, **kwargs)
~\anaconda3\lib\site-packages\sklearn\model_selection\_search.py in fit(self, X, y, groups, **fit_params)
839 delayed(_fit_and_score)(
840 clone(base_estimator),
--> 841 X,
842 y,
843 train=train,
~\anaconda3\lib\site-packages\sklearn\model_selection\_search.py in _run_search(self, evaluate_candidates)
1631 Mean cross-validated score of the best_estimator.
1632
-> 1633 For multi-metric evaluation, this is not available if ``refit`` is
1634 ``False``. See ``refit`` parameter for more information.
1635
~\anaconda3\lib\site-packages\sklearn\model_selection\_search.py in evaluate_candidates(candidate_params, cv, more_results)
825 def evaluate_candidates(candidate_params, cv=None, more_results=None):
826 cv = cv or cv_orig
--> 827 candidate_params = list(candidate_params)
828 n_candidates = len(candidate_params)
829
~\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in _insert_error_scores(results, error_score)
293
294 results = _aggregate_score_dicts(results)
--> 295
296 ret = {}
297 ret["fit_time"] = results["fit_time"]
KeyError: 'fit_failed'
Any solutions?

ValueError: Unable to create tensor, you should probably activate padding with 'padding=True'

I am trying to evaluate the facebook/hubert-base-ls9601 Hugging Face pre-trained model after fine-tuning it on a private dataset.
I am using the facebook/hubert-base-ls9601 pre-trained model, the Wav2Vec2 feature extractor, and pooling mode set to mean.
Here's the evaluation code:
test_dataset = load_dataset("csv", data_files={"test": "/content/drive/MyDrive/freelancing/test.csv"}, delimiter="\t")["test"]

def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    resampler = torchaudio.transforms.Resample(sampling_rate, target_sampling_rate)
    speech = resampler(speech_array).squeeze().numpy()
    batch["speech"] = speech_array
    return batch

def predict(batch):
    features = feature_extractor(batch["speech"], sampling_rate=feature_extractor.sampling_rate, return_tensors="pt", padding=True)
    input_values = features.input_values.to(device)
    with torch.no_grad():
        logits = model(input_values).logits
    pred_ids = torch.argmax(logits, dim=-1).detach().cpu().numpy()
    batch["predicted"] = pred_ids
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
result = test_dataset.map(predict, batched=True, batch_size=2)
On the last line of code, I encounter the following error block:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/transformers/feature_extraction_utils.py in convert_to_tensors(self, tensor_type)
168 if not is_tensor(value):
--> 169 tensor = as_tensor(value)
170
ValueError: could not broadcast input array from shape (2,220683) into shape (2,)
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
12 frames
<ipython-input-73-7bd88adad349> in <module>()
----> 1 result = test_dataset.map(predict, batched=True, batch_size=2)
/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py in map(self, function, with_indices, with_rank, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, num_proc, suffix_template, new_fingerprint, desc)
1970 new_fingerprint=new_fingerprint,
1971 disable_tqdm=disable_tqdm,
-> 1972 desc=desc,
1973 )
1974 else:
/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py in wrapper(*args, **kwargs)
517 self: "Dataset" = kwargs.pop("self")
518 # apply actual function
--> 519 out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
520 datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [out]
521 for dataset in datasets:
/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py in wrapper(*args, **kwargs)
484 }
485 # apply actual function
--> 486 out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
487 datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [out]
488 # re-apply format to the output
/usr/local/lib/python3.7/dist-packages/datasets/fingerprint.py in wrapper(*args, **kwargs)
456 # Call actual function
457
--> 458 out = func(self, *args, **kwargs)
459
460 # Update fingerprint of in-place transforms + update in-place history of transforms
/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py in _map_single(self, function, with_indices, with_rank, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, new_fingerprint, rank, offset, disable_tqdm, desc, cache_only)
2340 indices,
2341 check_same_num_examples=len(input_dataset.list_indexes()) > 0,
-> 2342 offset=offset,
2343 )
2344 except NumExamplesMismatchError:
/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py in apply_function_on_filtered_inputs(inputs, indices, check_same_num_examples, offset)
2217 if with_rank:
2218 additional_args += (rank,)
-> 2219 processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
2220 if update_data is None:
2221 # Check if the function returns updated examples
/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py in decorated(item, *args, **kwargs)
1912 )
1913 # Use the LazyDict internally, while mapping the function
-> 1914 result = f(decorated_item, *args, **kwargs)
1915 # Return a standard dict
1916 return result.data if isinstance(result, LazyDict) else result
<ipython-input-71-6f845da29c00> in predict(batch)
11
12 def predict(batch):
---> 13 features = feature_extractor(batch["speech"], sampling_rate=feature_extractor.sampling_rate, return_tensors="pt", padding=True)
14
15 input_values = features.input_values.to(device)
/usr/local/lib/python3.7/dist-packages/transformers/models/wav2vec2/feature_extraction_wav2vec2.py in __call__(self, raw_speech, padding, max_length, truncation, pad_to_multiple_of, return_attention_mask, return_tensors, sampling_rate, **kwargs)
200 truncation=truncation,
201 pad_to_multiple_of=pad_to_multiple_of,
--> 202 return_attention_mask=return_attention_mask,
203 )
204
/usr/local/lib/python3.7/dist-packages/transformers/feature_extraction_sequence_utils.py in pad(self, processed_features, padding, max_length, truncation, pad_to_multiple_of, return_attention_mask, return_tensors)
230 batch_outputs[key].append(value)
231
--> 232 return BatchFeature(batch_outputs, tensor_type=return_tensors)
233
234 def _pad(
/usr/local/lib/python3.7/dist-packages/transformers/feature_extraction_utils.py in __init__(self, data, tensor_type)
78 def __init__(self, data: Optional[Dict[str, Any]] = None, tensor_type: Union[None, str, TensorType] = None):
79 super().__init__(data)
---> 80 self.convert_to_tensors(tensor_type=tensor_type)
81
82 def __getitem__(self, item: str) -> Union[Any]:
/usr/local/lib/python3.7/dist-packages/transformers/feature_extraction_utils.py in convert_to_tensors(self, tensor_type)
174 raise ValueError("Unable to create tensor returning overflowing values of different lengths. ")
175 raise ValueError(
--> 176 "Unable to create tensor, you should probably activate padding "
177 "with 'padding=True' to have batched tensors with the same length."
178 )
ValueError: Unable to create tensor, you should probably activate padding with 'padding=True' to have batched tensors with the same length.
I am working on Google Colab. Those are the environment variables:
%env LC_ALL=C.UTF-8
%env LANG=C.UTF-8
%env TRANSFORMERS_CACHE=/content/cache
%env HF_DATASETS_CACHE=/content/cache
%env CUDA_LAUNCH_BLOCKING=1
The padding is already activated in the predict function.
Can you please help me fix it?

xgb.train(): TypeError: float() argument must be a string or a number, not 'DMatrix'

When I look at the documentation, the argument is supposed to be a 'DMatrix' (xgboost version 1.5.0).
https://xgboost.readthedocs.io/en/latest/python/python_api.html#:~:text=Customized%20objective%20function.-,Learning%20API,num_boost_round%20(int)%20%E2%80%93%20Number%20of%20boosting%20iterations,-.
It indicates pretty much the same thing for the version I'm using (go to subheading '1.2.2 Python' in the document linked below):
https://xgboost.readthedocs.io/_/downloads/en/release_1.3.0/pdf/
I don't understand why it is asking for a float argument when it is supposed to be a DMatrix.
I've looked at all the Stack posts that contain the string 'TypeError: float() argument must be a string or a number, not...', but none of them include 'DMatrix', and I have not been able to find a solution that I could adapt to this particular issue.
The following is the bit of code that elicits this error (see clf = xgb.train(...)):
def grid_search(timeout_seconds, cv_splits, num_boost_round):
    # Read input data
    X, y = preprocessing()
    y.replace({1:0, 2:1, 3:2, 4:3, 5:4, 6:5, 7:6, 8:7, 9:8, 10:9, 11:10, 12:11, 13:12, 14:13,
               15:14, 16:15, 17:16, 18:17, 19:18, 20:19, 21:20, 22:21}, inplace=True)
    # Create dataframe to collect the results
    tests_columns = ["test_nr", "cv_mean", "cv_min", "cv_max", "cv_median", "params"]
    test_id = 0
    tests = pd.DataFrame(columns=tests_columns)
    # Cross validation number of splits
    kf = KFold(n_splits=cv_splits)
    # Execute until timeout occurs
    with timeout(timeout_seconds, exception=RuntimeError):
        # Get the grid
        grid_iter, keys, length = get_grid_iterable()
        try:
            # For every element of the grid
            for df_grid in grid_iter:
                # Prepare a list to collect the scores
                score = []
                params = dict(zip(keys, df_grid))
                # The objective function
                params["objective"] = "multi:softprob"
                params['num_class'] = 22
                print('X.reason_action_converted: ', X.reason_action_converted)
                # For each fold, train XGBoost and spit out the results
                for train_index, test_index in kf.split(X.values):
                    # Get X train and X test
                    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
                    # Get y train and y test
                    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
                    # Convert into DMatrix
                    d_train = xgb.DMatrix(X_train, label=y_train, missing=np.NaN)
                    d_valid = xgb.DMatrix(X_test, label=y_test, missing=np.NaN)
                    d_test = xgb.DMatrix(X_test, missing=np.NaN)
                    watchlist = [(d_train, 'train'), (d_valid, 'valid')]
                    # Create the classifier using the current grid params. Apply early stopping of 50 rounds
                    '''clf = xgb.train(params, d_train, boosting_rounds, watchlist, early_stopping_rounds=50, feval=log_loss, maximize=True, verbose_eval=10)'''
                    clf = xgb.train(params, d_train, num_boost_round, watchlist,
                                    early_stopping_rounds=50, feval=log_loss,
                                    maximize=True, verbose_eval=10)
                    y_hat = clf.predict(d_test)
                    # Append scores on the fold kept out
                    score.append(r2_score(y_test, y_hat))
                # Store the result into a dataframe
                score_df = pd.DataFrame(columns=tests_columns, data=[
                    [test_id, np.mean(score), np.min(score), np.max(score), np.median(score),
                     json.dumps(dict(zip(keys, [str(g) for g in df_grid])))]])
                test_id += 1
                tests = pd.concat([tests, score_df])
        except RuntimeError:
            # When timeout occurs an exception is raised and the main cycle is broken
            pass
    # Spit out the results
    tests.to_csv("grid-search.csv", index=False)
    print(tests)

if __name__ == "__main__":
    grid_search(timeout_seconds=3600, cv_splits=4, num_boost_round=500)
The error message:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<command-3902447645915365> in <module>
106
107 if __name__ == "__main__":
--> 108 grid_search(timeout_seconds=3600, cv_splits=4, num_boost_round=500)
<command-3902447645915365> in grid_search(timeout_seconds, cv_splits, num_boost_round)
84 # Create the classifier using the current grid params. Apply early stopping of 50 rounds
85 '''clf = xgb.train(params, d_train, boosting_rounds, watchlist, early_stopping_rounds=50, feval=log_loss, maximize=True, verbose_eval=10)'''
---> 86 clf = xgb.train(params, d_train, num_boost_round, watchlist, early_stopping_rounds=50, feval=log_loss, maximize=True, verbose_eval=10)
87 y_hat = clf.predict(d_test)
88
/databricks/python/lib/python3.8/site-packages/xgboost/training.py in train(params, dtrain, num_boost_round, evals, obj, feval, maximize, early_stopping_rounds, evals_result, verbose_eval, xgb_model, callbacks)
204 Booster : a trained booster model
205 """
--> 206 bst = _train_internal(params, dtrain,
207 num_boost_round=num_boost_round,
208 evals=evals,
/databricks/python/lib/python3.8/site-packages/xgboost/training.py in _train_internal(params, dtrain, num_boost_round, evals, obj, feval, xgb_model, callbacks, evals_result, maximize, verbose_eval, early_stopping_rounds)
107 nboost += 1
108 # check evaluation result.
--> 109 if callbacks.after_iteration(bst, i, dtrain, evals):
110 break
111 # do checkpoint after evaluation, in case evaluation also updates
/databricks/python/lib/python3.8/site-packages/xgboost/callback.py in after_iteration(self, model, epoch, dtrain, evals)
421 for _, name in evals:
422 assert name.find('-') == -1, 'Dataset name should not contain `-`'
--> 423 score = model.eval_set(evals, epoch, self.metric)
424 score = score.split()[1:] # into datasets
425 # split up `test-error:0.1234`
/databricks/python/lib/python3.8/site-packages/xgboost/core.py in eval_set(self, evals, iteration, feval)
1350 if feval is not None:
1351 for dmat, evname in evals:
-> 1352 feval_ret = feval(self.predict(dmat, training=False,
1353 output_margin=True), dmat)
1354 if isinstance(feval_ret, list):
/databricks/python/lib/python3.8/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
70 FutureWarning)
71 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72 return f(**kwargs)
73 return inner_f
74
/databricks/python/lib/python3.8/site-packages/sklearn/metrics/_classification.py in log_loss(y_true, y_pred, eps, normalize, sample_weight, labels)
2184 The logarithm used is the natural logarithm (base-e).
2185 """
-> 2186 y_pred = check_array(y_pred, ensure_2d=False)
2187 check_consistent_length(y_pred, y_true, sample_weight)
2188
/databricks/python/lib/python3.8/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
70 FutureWarning)
71 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72 return f(**kwargs)
73 return inner_f
74
/databricks/python/lib/python3.8/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
636 # make sure we actually converted to numeric:
637 if dtype_numeric and array.dtype.kind == "O":
--> 638 array = array.astype(np.float64)
639 if not allow_nd and array.ndim >= 3:
640 raise ValueError("Found array with dim %d. %s expected <= 2."
TypeError: float() argument must be a string or a number, not 'DMatrix'
I'm using Databricks, Python 3.8.8, and xgboost 1.3.1.
I am trying to adapt code from the following tutorial: Effortless Hyperparameters Tuning with Apache Spark.

scoring "roc_auc" value is not working with gridsearchCV appling RandomForestclassifer

I keep getting this error when performing GridSearchCV with scoring='roc_auc' ('f1', 'precision', and 'recall' work fine):
# Construct a pipeline
pipe = Pipeline([
    ('reduce_dim', PCA()),
    ('rf', RandomForestClassifier(min_samples_leaf=5, random_state=123))
])
N_FEATURES_OPTIONS = [2]  # for PCA [2, 4, 8]
# these params below are for RandomForestClassifier
N_ESTIMATORS = [10, 50]  # 10, 50, 100
MAX_DEPTH = [5, 6]  # 5, 6, 7, 8, 9
MIN_SAMPLE_LEAF = 5
param_grid = [
    {
        'reduce_dim': [PCA()],
        'reduce_dim__n_components': N_FEATURES_OPTIONS,
        'rf__n_estimators': N_ESTIMATORS,
        'rf__max_depth': MAX_DEPTH
    },
    {
        'reduce_dim': [SelectKBest(f_classif)],
        'reduce_dim__k': N_FEATURES_OPTIONS,
        'rf__n_estimators': N_ESTIMATORS,
        'rf__max_depth': MAX_DEPTH
    },
]
grid = GridSearchCV(pipe, param_grid=param_grid, cv=10, n_jobs=1, scoring='roc_auc')
grid.fit(X_train_s, y_train_s)
And I get this error
AttributeError Traceback (most recent call last)
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/metrics/scorer.py in __call__(self, clf, X, y, sample_weight)
186 try:
--> 187 y_pred = clf.decision_function(X)
188
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/utils/metaestimators.py in __get__(self, obj, type)
108 else:
--> 109 getattr(delegate, self.attribute_name)
110 break
AttributeError: 'RandomForestClassifier' object has no attribute 'decision_function'
During handling of the above exception, another exception occurred:
IndexError Traceback (most recent call last)
<ipython-input-16-86491f3b6aa7> in <module>()
----> 1 grid.fit(X_train_s,y_train_s)
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups, **fit_params)
637 error_score=self.error_score)
638 for parameters, (train, test) in product(candidate_params,
--> 639 cv.split(X, y, groups)))
640
641 # if one choose to see train score, "out" will contain train score info
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
777 # was dispatched. In particular this covers the edge
778 # case of Parallel used with an exhausted iterator.
--> 779 while self.dispatch_one_batch(iterator):
780 self._iterating = True
781 else:
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
623 return False
624 else:
--> 625 self._dispatch(tasks)
626 return True
627
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self, batch)
586 dispatch_timestamp = time.time()
587 cb = BatchCompletionCallBack(dispatch_timestamp, len(batch), self)
--> 588 job = self._backend.apply_async(batch, callback=cb)
589 self._jobs.append(job)
590
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py in apply_async(self, func, callback)
109 def apply_async(self, func, callback=None):
110 """Schedule a func to be run"""
--> 111 result = ImmediateResult(func)
112 if callback:
113 callback(result)
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py in __init__(self, batch)
330 # Don't delay the application, to avoid keeping the input
331 # arguments in memory
--> 332 self.results = batch()
333
334 def get(self):
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self)
129
130 def __call__(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
132
133 def __len__(self):
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0)
129
130 def __call__(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
132
133 def __len__(self):
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/model_selection/_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, error_score)
486 fit_time = time.time() - start_time
487 # _score will return dict if is_multimetric is True
--> 488 test_scores = _score(estimator, X_test, y_test, scorer, is_multimetric)
489 score_time = time.time() - start_time - fit_time
490 if return_train_score:
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/model_selection/_validation.py in _score(estimator, X_test, y_test, scorer, is_multimetric)
521 """
522 if is_multimetric:
--> 523 return _multimetric_score(estimator, X_test, y_test, scorer)
524 else:
525 if y_test is None:
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/model_selection/_validation.py in _multimetric_score(estimator, X_test, y_test, scorers)
551 score = scorer(estimator, X_test)
552 else:
--> 553 score = scorer(estimator, X_test, y_test)
554
555 if hasattr(score, 'item'):
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/metrics/scorer.py in __call__(self, clf, X, y, sample_weight)
195
196 if y_type == "binary":
--> 197 y_pred = y_pred[:, 1]
198 elif isinstance(y_pred, list):
199 y_pred = np.vstack([p[:, -1] for p in y_pred]).T
IndexError: index 1 is out of bounds for axis 1 with size 1
I have looked up this error and found a somewhat similar problem with KerasClassifier, but I have no idea how to fix it:
Keras Wrappers for Scikit Learn - AUC scorer is not working
Can anyone explain to me what is wrong?
The error can have a few causes:
If you have only one target class: it fails.
If you have >= 3 target classes: it fails.
Maybe you have 2 classes, and in one fold of the CV the test labels happen to come from only one class.
When sklearn computes the AUC metric, it must see exactly two classes, because the method for getting the AUC only works with two classes (it computes tpr and fpr over all thresholds).
Examples that produce the error:
grid.fit(np.random.rand(100, 2), np.random.randint(1, size=100))  # one class label
grid.fit(np.random.rand(100, 2), np.random.randint(3, size=100))  # 3 class labels
# BOTH throw the same error when computing AUC
Example that should not throw an error, although it still can depending on how the CV folds split the labels:
grid.fit(np.random.rand(100, 2), np.random.randint(2, size=100))  # two class labels
# This shouldn't throw an error
SOLUTION
If you have more than 2 classes: you have to compute the AUC manually (or maybe with some library, though I don't know of one offhand), either one-vs-all, where you compute the AUC of each class against all the others and take the mean, or all-vs-all (pairwise AUC), where you compute the AUC of every pair of classes and take the mean.
If you have 2 classes, use stratified folds so every test fold contains both classes:
grid = GridSearchCV(pipe, param_grid=param_grid, cv=StratifiedKFold(), n_jobs=1, scoring='roc_auc')
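For the multiclass case, a minimal sketch reusing pipe and param_grid from the question, assuming scikit-learn >= 0.22, whose built-in 'roc_auc_ovr' and 'roc_auc_ovo' scorers do the one-vs-rest / pairwise averaging described above:
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# 'roc_auc_ovr' averages the AUC of each class against the rest;
# 'roc_auc_ovo' averages the AUC over all pairs of classes.
grid = GridSearchCV(pipe, param_grid=param_grid, cv=StratifiedKFold(n_splits=10),
                    n_jobs=1, scoring='roc_auc_ovr')
grid.fit(X_train_s, y_train_s)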
