HuggingFace | PipelineException: No mask_token (<mask>) found on the input - python-3.x

Goal: loop over multiple models and print() the elapsed time for each.
Processing one Model works fine:
i=0
start = time.time()
unmasker = pipeline('fill-mask', model=models[i])
unmasker("Hello I'm a [MASK] model.", top_k=1)
end = time.time()
df = df.append({'Model': models[i], 'Time': end-start}, ignore_index=True)
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
However, iterating over many model names causes the titled error.
Code:
from transformers import pipeline
import time
models = ['bert-base-uncased', 'roberta-base', 'distilbert-base-uncased', 'bert-base-cased', 'albert-base-v2', 'roberta-large', 'bert-large-uncased', 'albert-large-v2', 'albert-base-v2', 'bert-large-cased', 'albert-base-v1', 'bert-large-cased-whole-word-masking', 'bert-large-uncased-whole-word-masking', 'albert-xxlarge-v2', 'google/bigbird-roberta-large', 'albert-xlarge-v2', 'albert-xxlarge-v1', 'facebook/muppet-roberta-large', 'facebook/muppet-roberta-base', 'albert-large-v1', 'albert-xlarge-v1']
for _model in models:
    start = time.time()
    unmasker = pipeline('fill-mask', model=_model)
    unmasker("Hello I'm a [MASK] model.", top_k=1) # default: top_k=5
    end = time.time()
    print(end-start)
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
---------------------------------------------------------------------------
PipelineException Traceback (most recent call last)
<ipython-input-19-13b5f651657e> in <module>
3 start = time.time()
4 unmasker = pipeline('fill-mask', model=_model)
----> 5 unmasker("Hello I'm a [MASK] model.", top_k=1) # default: top_k=5
6 end = time.time()
7
~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/pipelines/fill_mask.py in __call__(self, inputs, *args, **kwargs)
224 - **token** (`str`) -- The predicted token (to replace the masked one).
225 """
--> 226 outputs = super().__call__(inputs, **kwargs)
227 if isinstance(inputs, list) and len(inputs) == 1:
228 return outputs[0]
~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/pipelines/base.py in __call__(self, inputs, num_workers, batch_size, *args, **kwargs)
1099 return self.iterate(inputs, preprocess_params, forward_params, postprocess_params)
1100 else:
-> 1101 return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
1102
1103 def run_multi(self, inputs, preprocess_params, forward_params, postprocess_params):
~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/pipelines/base.py in run_single(self, inputs, preprocess_params, forward_params, postprocess_params)
1105
1106 def run_single(self, inputs, preprocess_params, forward_params, postprocess_params):
-> 1107 model_inputs = self.preprocess(inputs, **preprocess_params)
1108 model_outputs = self.forward(model_inputs, **forward_params)
1109 outputs = self.postprocess(model_outputs, **postprocess_params)
~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/pipelines/fill_mask.py in preprocess(self, inputs, return_tensors, **preprocess_parameters)
82 return_tensors = self.framework
83 model_inputs = self.tokenizer(inputs, return_tensors=return_tensors)
---> 84 self.ensure_exactly_one_mask_token(model_inputs)
85 return model_inputs
86
~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/pipelines/fill_mask.py in ensure_exactly_one_mask_token(self, model_inputs)
76 else:
77 for input_ids in model_inputs["input_ids"]:
---> 78 self._ensure_exactly_one_mask_token(input_ids)
79
80 def preprocess(self, inputs, return_tensors=None, **preprocess_parameters) -> Dict[str, GenericTensor]:
~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/pipelines/fill_mask.py in _ensure_exactly_one_mask_token(self, input_ids)
67 "fill-mask",
68 self.model.base_model_prefix,
---> 69 f"No mask_token ({self.tokenizer.mask_token}) found on the input",
70 )
71
PipelineException: No mask_token (<mask>) found on the input
Please let me know if there's anything else I can add to post to clarify.

Only certain models throw that error: those whose tokenizer uses a mask token other than [MASK], such as the <mask> reported in the exception.
Since I am only experimenting with runtimes across models, the workaround below suffices. It ran successfully for the majority of models.
I applied try/except logic. Note that handling exceptions without naming the specific error in the except statement is considered bad practice.
for _model in models:
    for i in range(10):
        start = time.time()
        try:
            unmasker = pipeline('fill-mask', model=_model)
            unmasker("Hello I'm a [MASK] model.", top_k=1) # default: top_k=5
            print(_model)
        except: continue
        end = time.time()
        df = df.append({'Model': _model, 'Time': end-start}, ignore_index=True)
print(df)
df.to_csv('model_performance.csv', index=False)
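An alternative to skipping the failing models is to build the prompt from each tokenizer's own mask token, since the exception message reports <mask> where the input contains [MASK]. The following is a minimal sketch under that assumption; build_prompt and time_fill_mask are hypothetical helper names, not part of transformers.

```python
import time


def build_prompt(mask_token: str) -> str:
    """Insert the model-specific mask token into the example sentence."""
    return f"Hello I'm a {mask_token} model."


def time_fill_mask(model_name: str) -> float:
    """Load a fill-mask pipeline, run one prediction, and return elapsed seconds."""
    # Imported lazily so build_prompt can be used without transformers installed.
    from transformers import pipeline

    start = time.perf_counter()
    unmasker = pipeline("fill-mask", model=model_name)
    # Use whatever mask token this tokenizer defines instead of hard-coding [MASK].
    prompt = build_prompt(unmasker.tokenizer.mask_token)
    unmasker(prompt, top_k=1)
    return time.perf_counter() - start
```

Calling time_fill_mask(_model) inside the loop over models would then time each model with a prompt its tokenizer can parse. Note that the measured time includes loading (and possibly downloading) the model, as in the original code.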

Related

Scikit learn custom scoring function - Specificity

I'm trying to do a randomized grid search on a RandomForestClassifier.
# Instantiate a RandomForestClassifier
RFC = RandomForestClassifier()
# Instantiate the RandomizedSearchCV object: RFC
rand_search3 = RandomizedSearchCV(RFC, param_grid, n_iter=10, cv=5,n_jobs=-1, verbose=1, scoring = "f1_macro")
# Fit it to the data
rand_search3.fit(X_train_transformed,y_train)
I'm trying to select the best model by assessing specificity.
I went through the documentation for custom scoring and also looked at lots of related posts. I came up with 2 ways to compute the specificity.
1 :
from sklearn.metrics import confusion_matrix, make_scorer
def my_custom_func(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    # specificity = TN / (TN + FP), i.e. row 0 of the binary confusion matrix
    return cm[0][0] / (cm[0][0] + cm[0][1])
Specificity_score = make_scorer(my_custom_func, greater_is_better=True)
2:
from sklearn.metrics import recall_score
from sklearn.metrics import make_scorer
specificity = make_scorer(recall_score, pos_label=0, greater_is_better=True)
specificity
When I try using the custom function for the scoring,
rand_search3 = RandomizedSearchCV(RFC, param_grid, n_iter=10, cv=5,n_jobs=-1, verbose=1, scoring = Specificity_score)
rand_search3 = RandomizedSearchCV(RFC, param_grid, n_iter=10, cv=5,n_jobs=-1, verbose=1, scoring = specificity)
both failed with the same error message.
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_10548/1204393696.py in <module>
20
21 # Fit it to the data
---> 22 rand_search3.fit(X_train_transformed,y_train)
23
~\anaconda3\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
61 #wraps(f)
62 def inner_f(*args, **kwargs):
---> 63 extra_args = len(args) - len(all_args)
64 if extra_args <= 0:
65 return f(*args, **kwargs)
~\anaconda3\lib\site-packages\sklearn\model_selection\_search.py in fit(self, X, y, groups, **fit_params)
839 delayed(_fit_and_score)(
840 clone(base_estimator),
--> 841 X,
842 y,
843 train=train,
~\anaconda3\lib\site-packages\sklearn\model_selection\_search.py in _run_search(self, evaluate_candidates)
1631 Mean cross-validated score of the best_estimator.
1632
-> 1633 For multi-metric evaluation, this is not available if ``refit`` is
1634 ``False``. See ``refit`` parameter for more information.
1635
~\anaconda3\lib\site-packages\sklearn\model_selection\_search.py in evaluate_candidates(candidate_params, cv, more_results)
825 def evaluate_candidates(candidate_params, cv=None, more_results=None):
826 cv = cv or cv_orig
--> 827 candidate_params = list(candidate_params)
828 n_candidates = len(candidate_params)
829
~\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in _insert_error_scores(results, error_score)
293
294 results = _aggregate_score_dicts(results)
--> 295
296 ret = {}
297 ret["fit_time"] = results["fit_time"]
KeyError: 'fit_failed'
Any solutions?
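For reference, here is a minimal sketch of a specificity scorer plugged into RandomizedSearchCV. It assumes a binary problem with labels 0 (negative) and 1 (positive). The synthetic data and the small parameter grid are hypothetical stand-ins, used only to make the sketch runnable. Passing error_score="raise" is one way to surface the underlying fit or scoring error directly; whether that explains the KeyError above is not certain.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import RandomizedSearchCV


def specificity_score(y_true, y_pred):
    """Specificity = TN / (TN + FP) for binary labels 0 (negative) and 1 (positive)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tn / (tn + fp) if (tn + fp) else 0.0


specificity = make_scorer(specificity_score, greater_is_better=True)

# Hypothetical stand-in data and grid, only to make the sketch runnable.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
param_grid = {"n_estimators": [10, 20], "max_depth": [2, 4, None]}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    n_iter=3,
    cv=3,
    scoring=specificity,
    error_score="raise",  # re-raise fit/score errors instead of recording a failed score
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Fixing labels=[0, 1] in confusion_matrix keeps the matrix 2x2 even when a fold happens to contain a single class in the predictions.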

keras-rl2 DQNAgent fails with a PQC model when cloning it for the target network

I have some issues using keras-rl2 with tensorflow_quantum and VQC (using identical architecture as https://www.tensorflow.org/quantum/tutorials/quantum_reinforcement_learning)
After the creation of the model and DqnAgent, in dqn.compile:
############################################################
def generate_model_Qlearning(qubits, n_layers, n_actions, observables, target):
    # Note: the qubits and observables arguments are overwritten below,
    # and n_qubits is expected to be defined in the enclosing scope.
    qubits = cirq.GridQubit.rect(1, n_qubits)
    ops = [cirq.Z(q) for q in qubits]
    observables = [ops[0]*ops[1], ops[2]*ops[3]]  # Z_0*Z_1 for action 0 and Z_2*Z_3 for action 1
    input_tensor = tf.keras.Input(shape=(len(qubits), ), dtype=tf.dtypes.float32, name='input')
    re_uploading_pqc = ReUploadingPQC(qubits, n_layers, observables, activation='tanh')([input_tensor])
    process = tf.keras.Sequential(
        [Rescaling(len(observables))],
        name=target*"Target"+"Q-values"
    )
    Q_values = process(re_uploading_pqc)
    model = tf.keras.Model(inputs=[input_tensor], outputs=Q_values)
    return model
############################################################
model = generate_model_Qlearning(qubits, n_layers, n_actions, observables, False)
model_target = generate_model_Qlearning(qubits, n_layers, n_actions, observables, True)
model_target.set_weights(model.get_weights())
dqn = DQNAgent(model=model, enable_double_dqn=True, nb_actions=num_actions)
dqn.compile(Adam(learning_rate=1e-3), metrics=['mae'])
history = dqn.fit(env, nb_steps=50000, visualize=False, verbose=2)
The following exception appears:
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
Input In [119], in <module>
----> 1 dqn.compile(Adam(learning_rate=1e-3), metrics=['mae'])
3 history = dqn.fit(env, nb_steps=50000,
4 visualize=False,
5 verbose=2)
File ~.local/lib/python3.8/site-packages/rl/agents/dqn.py:167, in DQNAgent.compile(self, optimizer, metrics)
164 metrics += [mean_q] # register default metrics
166 # We never train the target model, hence we can set the optimizer and loss arbitrarily.
--> 167 self.target_model = clone_model(self.model, self.custom_model_objects)
168 self.target_model.compile(optimizer='sgd', loss='mse')
169 self.model.compile(optimizer='sgd', loss='mse')
File ~.local/lib/python3.8/site-packages/rl/util.py:13, in clone_model(model, custom_objects)
9 def clone_model(model, custom_objects={}):
10 # Requires Keras 1.0.7 since get_config has breaking changes.
11 config = {
12 'class_name': model.__class__.__name__,
---> 13 'config': model.get_config(),
14 }
15 clone = model_from_config(config, custom_objects=custom_objects)
16 clone.set_weights(model.get_weights())
File ~.local/lib/python3.8/site-packages/keras/engine/functional.py:685, in Functional.get_config(self)
684 def get_config(self):
--> 685 return copy.deepcopy(get_network_config(self))
File ~.local/lib/python3.8/site-packages/keras/engine/functional.py:1410, in get_network_config(network, serialize_layer_fn)
1407 node_data = node.serialize(_make_node_key, node_conversion_map)
1408 filtered_inbound_nodes.append(node_data)
-> 1410 layer_config = serialize_layer_fn(layer)
1411 layer_config['name'] = layer.name
1412 layer_config['inbound_nodes'] = filtered_inbound_nodes
File ~.local/lib/python3.8/site-packages/keras/utils/generic_utils.py:510, in serialize_keras_object(instance)
507 if _SKIP_FAILED_SERIALIZATION:
508 return serialize_keras_class_and_config(
509 name, {_LAYER_UNDEFINED_CONFIG_KEY: True})
--> 510 raise e
511 serialization_config = {}
512 for key, item in config.items():
File ~.local/lib/python3.8/site-packages/keras/utils/generic_utils.py:505, in serialize_keras_object(instance)
503 name = get_registered_name(instance.__class__)
504 try:
--> 505 config = instance.get_config()
506 except NotImplementedError as e:
507 if _SKIP_FAILED_SERIALIZATION:
File ~.local/lib/python3.8/site-packages/keras/engine/base_layer_v1.py:497, in Layer.get_config(self)
494 # Check that either the only argument in the `__init__` is `self`,
495 # or that `get_config` has been overridden:
496 if len(extra_args) > 1 and hasattr(self.get_config, '_is_default'):
--> 497 raise NotImplementedError('Layers with arguments in `__init__` must '
498 'override `get_config`.')
499 return config
NotImplementedError: Layers with arguments in `__init__` must override `get_config`.
the topology:
It would be great if this library let us specify the target model directly instead of cloning it.
With hybrid neural networks that hold a parameterized circuit inside a layer, the model is difficult to serialize, so the call to model.get_config() fails.
Any ideas on how to solve it?
Thanks!
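The exception message points at the usual Keras requirement: a layer whose __init__ takes arguments has to override get_config so the model can be rebuilt from its config. A minimal sketch of that pattern follows, using a hypothetical ScaledOutput layer rather than ReUploadingPQC. For the PQC layer, the qubits and observables would also need some serializable representation, which this sketch does not attempt.

```python
import tensorflow as tf


class ScaledOutput(tf.keras.layers.Layer):
    """Hypothetical layer with a constructor argument, used only for illustration."""

    def __init__(self, factor=1.0, **kwargs):
        super().__init__(**kwargs)
        self.factor = factor

    def call(self, inputs):
        return inputs * self.factor

    def get_config(self):
        # Start from the base config (name, dtype, ...) and add every __init__ argument.
        config = super().get_config()
        config.update({"factor": self.factor})
        return config


inputs = tf.keras.Input(shape=(4,), name="input")
outputs = ScaledOutput(factor=2.0, name="scaled_output")(inputs)
model = tf.keras.Model(inputs=inputs, outputs=outputs)

# The same config round trip that cloning a model through its config performs.
config = model.get_config()
clone = tf.keras.Model.from_config(config, custom_objects={"ScaledOutput": ScaledOutput})
clone.set_weights(model.get_weights())
print(clone.get_layer("scaled_output").factor)
```

The traceback shows clone_model receiving self.custom_model_objects, so the custom layer classes would presumably also have to be handed to the agent through that argument for the clone to be rebuilt.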

xgb.train(): TypeError: float() argument must be a string or a number, not 'DMatrix'

When I look at the documentation, the argument is supposed to be a 'DMatrix' (xgboost version 1.5.0).
https://xgboost.readthedocs.io/en/latest/python/python_api.html#:~:text=Customized%20objective%20function.-,Learning%20API,num_boost_round%20(int)%20%E2%80%93%20Number%20of%20boosting%20iterations,-.
Indicates pretty much the same thing for the version I'm using (goto subheading '1.2.2 Python' in document link below):
https://xgboost.readthedocs.io/_/downloads/en/release_1.3.0/pdf/
I don't understand why it is asking for a float argument when it is supposed to be a DMatrix.
I've looked at all the Stack posts that contain the string 'TypeError: float() argument must be a string or a number, not...', but none of them involve 'DMatrix', and I have not been able to find a solution that I could adapt to this particular issue.
The following is the bit of code that elicits this error (go to 'clf = xgb.train(...)'):
def grid_search(timeout_seconds, cv_splits, num_boost_round):
    # Read input data
    X, y = preprocessing()
    y.replace({1:0,2:1,3:2,4:3,5:4,6:5,7:6,8:7,9:8,10:9,11:10,12:11,13:12,14:13,
               15:14,16:15,17:16,18:17,19:18,20:19,21:20,22:21}, inplace = True)
    # Create dataframe to collect the results
    tests_columns = ["test_nr", "cv_mean", "cv_min", "cv_max", "cv_median", "params"]
    test_id = 0
    tests = pd.DataFrame(columns=tests_columns)
    # Cross validation number of splits
    kf = KFold(n_splits=cv_splits)
    # Execute until timeout occurs
    with timeout(timeout_seconds, exception=RuntimeError):
        # Get the grid
        grid_iter, keys, length = get_grid_iterable()
        try:
            # For every element of the grid
            for df_grid in grid_iter:
                # Prepare a list to collect the scores
                score = []
                params = dict(zip(keys, df_grid))
                # The objective function
                params["objective"] = "multi:softprob"
                params['num_class'] = 22
                print('X.reason_action_converted: ', X.reason_action_converted)
                # For each fold, train XGBoost and spit out the results
                for train_index, test_index in kf.split(X.values):
                    # Get X train and X test
                    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
                    # Get y train and y test
                    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
                    # Convert into DMatrix
                    d_train = xgb.DMatrix(X_train, label=y_train, missing=np.NaN)
                    d_valid = xgb.DMatrix(X_test, label=y_test, missing=np.NaN)
                    d_test = xgb.DMatrix(X_test, missing=np.NaN)
                    watchlist = [(d_train, 'train'), (d_valid, 'valid')]
                    # Create the classifier using the current grid params. Apply early stopping of 50 rounds
                    '''clf = xgb.train(params, d_train, boosting_rounds, watchlist, early_stopping_rounds=50, feval=log_loss, maximize=True, verbose_eval=10)'''
                    clf = xgb.train(params, d_train, num_boost_round, watchlist, early_stopping_rounds=50, feval=log_loss, maximize=True, verbose_eval=10)
                    y_hat = clf.predict(d_test)
                    # Append Scores on the fold kept out
                    score.append(r2_score(y_test, y_hat))
                # Store the result into a dataframe
                score_df = pd.DataFrame(columns=tests_columns, data=[
                    [test_id, np.mean(score), np.min(score), np.max(score), np.median(score),
                     json.dumps(dict(zip(keys, [str(g) for g in df_grid])))]])
                test_id += 1
                tests = pd.concat([tests, score_df])
        except RuntimeError:
            # When timeout occurs an exception is raised and the main cycle is broken
            pass
    # Spit out the results
    tests.to_csv("grid-search.csv", index=False)
    print(tests)
if __name__ == "__main__":
    grid_search(timeout_seconds=3600, cv_splits=4, num_boost_round=500)
The error message:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<command-3902447645915365> in <module>
106
107 if __name__ == "__main__":
--> 108 grid_search(timeout_seconds=3600, cv_splits=4, num_boost_round=500)
<command-3902447645915365> in grid_search(timeout_seconds, cv_splits, num_boost_round)
84 # Create the classifier using the current grid params. Apply early stopping of 50 rounds
85 '''clf = xgb.train(params, d_train, boosting_rounds, watchlist, early_stopping_rounds=50, feval=log_loss, maximize=True, verbose_eval=10)'''
---> 86 clf = xgb.train(params, d_train, num_boost_round, watchlist, early_stopping_rounds=50, feval=log_loss, maximize=True, verbose_eval=10)
87 y_hat = clf.predict(d_test)
88
/databricks/python/lib/python3.8/site-packages/xgboost/training.py in train(params, dtrain, num_boost_round, evals, obj, feval, maximize, early_stopping_rounds, evals_result, verbose_eval, xgb_model, callbacks)
204 Booster : a trained booster model
205 """
--> 206 bst = _train_internal(params, dtrain,
207 num_boost_round=num_boost_round,
208 evals=evals,
/databricks/python/lib/python3.8/site-packages/xgboost/training.py in _train_internal(params, dtrain, num_boost_round, evals, obj, feval, xgb_model, callbacks, evals_result, maximize, verbose_eval, early_stopping_rounds)
107 nboost += 1
108 # check evaluation result.
--> 109 if callbacks.after_iteration(bst, i, dtrain, evals):
110 break
111 # do checkpoint after evaluation, in case evaluation also updates
/databricks/python/lib/python3.8/site-packages/xgboost/callback.py in after_iteration(self, model, epoch, dtrain, evals)
421 for _, name in evals:
422 assert name.find('-') == -1, 'Dataset name should not contain `-`'
--> 423 score = model.eval_set(evals, epoch, self.metric)
424 score = score.split()[1:] # into datasets
425 # split up `test-error:0.1234`
/databricks/python/lib/python3.8/site-packages/xgboost/core.py in eval_set(self, evals, iteration, feval)
1350 if feval is not None:
1351 for dmat, evname in evals:
-> 1352 feval_ret = feval(self.predict(dmat, training=False,
1353 output_margin=True), dmat)
1354 if isinstance(feval_ret, list):
/databricks/python/lib/python3.8/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
70 FutureWarning)
71 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72 return f(**kwargs)
73 return inner_f
74
/databricks/python/lib/python3.8/site-packages/sklearn/metrics/_classification.py in log_loss(y_true, y_pred, eps, normalize, sample_weight, labels)
2184 The logarithm used is the natural logarithm (base-e).
2185 """
-> 2186 y_pred = check_array(y_pred, ensure_2d=False)
2187 check_consistent_length(y_pred, y_true, sample_weight)
2188
/databricks/python/lib/python3.8/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
70 FutureWarning)
71 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72 return f(**kwargs)
73 return inner_f
74
/databricks/python/lib/python3.8/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
636 # make sure we actually converted to numeric:
637 if dtype_numeric and array.dtype.kind == "O":
--> 638 array = array.astype(np.float64)
639 if not allow_nd and array.ndim >= 3:
640 raise ValueError("Found array with dim %d. %s expected <= 2."
TypeError: float() argument must be a string or a number, not 'DMatrix'
I'm using Databricks, Python 3.8.8, and xgboost 1.3.1.
I am trying to adapt code from the following tutorial: Effortless Hyperparameters Tuning with Apache Spark.
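The traceback suggests that xgboost calls feval as feval(predictions, dmat), so passing sklearn's log_loss directly puts the predictions in the y_true slot and the DMatrix in the y_pred slot. A minimal sketch of a wrapper with the expected shape follows. It assumes the predictions arrive as raw margins (output_margin=True in the traceback), which may differ in other xgboost versions. The names softmax and logloss_feval are hypothetical.

```python
import numpy as np
from sklearn.metrics import log_loss


def softmax(margins):
    """Row-wise softmax that turns raw margins into class probabilities."""
    shifted = margins - margins.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def logloss_feval(preds, dmatrix):
    """Custom evaluation function: takes (predictions, DMatrix), returns (name, value)."""
    # The true labels live inside the DMatrix, not in a separate argument.
    y_true = np.asarray(dmatrix.get_label()).astype(int)
    # Assumption: preds are raw margins, one row per sample and one column per class.
    margins = np.asarray(preds, dtype=float).reshape(len(y_true), -1)
    proba = softmax(margins)
    value = log_loss(y_true, proba, labels=np.arange(proba.shape[1]))
    return "custom_logloss", float(value)
```

The call would then become xgb.train(params, d_train, num_boost_round, watchlist, early_stopping_rounds=50, feval=logloss_feval, maximize=False, verbose_eval=10). Log loss is a lower-is-better metric, so maximize=True would likely make early stopping track the wrong direction.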

Removing last 2 layers from a BERT classifier results in " 'tuple' object has no attribute 'dim' " error. Why?

I fine tuned a huggingface transformer using Keras (with ktrain) and then reloaded the model in Pytorch.
I want to access the third to last layer (pre_classifier), so I removed the two last layers:
BERT2 = torch.nn.Sequential(*(list(BERT.children())[:-2]))
Running an encoded sentence through this yields the following error message:
AttributeError Traceback (most recent call last)
<ipython-input-38-640702475573> in <module>
----> 1 ans2=BERT2(torch.tensor([e1]))
2 print (ans2)
C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
539 result = self._slow_forward(*input, **kwargs)
540 else:
--> 541 result = self.forward(*input, **kwargs)
542 for hook in self._forward_hooks.values():
543 hook_result = hook(self, input, result)
C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\container.py in forward(self, input)
90 def forward(self, input):
91 for module in self._modules.values():
---> 92 input = module(input)
93 return input
94
C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
539 result = self._slow_forward(*input, **kwargs)
540 else:
--> 541 result = self.forward(*input, **kwargs)
542 for hook in self._forward_hooks.values():
543 hook_result = hook(self, input, result)
C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\linear.py in forward(self, input)
85
86 def forward(self, input):
---> 87 return F.linear(input, self.weight, self.bias)
88
89 def extra_repr(self):
C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\functional.py in linear(input, weight, bias)
1366 - Output: :math:`(N, *, out\_features)`
1367 """
-> 1368 if input.dim() == 2 and bias is not None:
1369 # fused op is marginally faster
1370 ret = torch.addmm(bias, input, weight.t())
AttributeError: 'tuple' object has no attribute 'dim'
Meanwhile deleting the classifier entirely (all three layers)
BERT3 = torch.nn.Sequential(*(list(BERT.children())[:-3]))
Yields the expected tensor (within a size 1 tuple) with the expected shape ([sentence_num,token_num,768]).
Why does the removal of two (but not three) layers break the model?
And how can I access the pre_classifier results?
It is not accessible by setting config with output_hidden_states=True as this flag returns the hidden values of the BERT transformer stack, not those of the classifier layers downstream to it.
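One likely explanation is that nn.Sequential feeds each child's output straight into the next child, while the DistilBERT encoder returns a tuple (or tuple-like output object) rather than a tensor, so the Linear pre_classifier receives something without .dim(). A forward hook is one way to capture the pre_classifier output without rebuilding the model. The sketch below uses a tiny, randomly initialised DistilBertForSequenceClassification with hypothetical config values so that nothing is downloaded. The fine-tuned BERT object would take its place.

```python
import torch
from transformers import DistilBertConfig, DistilBertForSequenceClassification

# Hypothetical tiny configuration: random weights, nothing is downloaded.
config = DistilBertConfig(vocab_size=100, dim=32, n_layers=1, n_heads=2,
                          hidden_dim=64, num_labels=20)
model = DistilBertForSequenceClassification(config)
model.eval()

captured = {}


def save_pre_classifier(module, inputs, output):
    # Runs right after pre_classifier is called inside the model's own forward().
    captured["pre_classifier"] = output.detach()


handle = model.pre_classifier.register_forward_hook(save_pre_classifier)

input_ids = torch.tensor([[1, 5, 7, 2]])  # stand-in for an encoded sentence
with torch.no_grad():
    model(input_ids)
handle.remove()  # detach the hook once the activation has been captured

print(captured["pre_classifier"].shape)
```

The hook sees the raw output of the Linear layer, before the activation and dropout that the model applies afterwards in its forward pass.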
--
PS
The code used to initialize the BERT model:
def collect_data_for_FT():
    from sklearn.datasets import fetch_20newsgroups
    train_data = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)
    test_data = fetch_20newsgroups(subset='test', shuffle=True, random_state=42)
    print('size of training set: %s' % (len(train_data['data'])))
    print('size of validation set: %s' % (len(test_data['data'])))
    print('classes: %s' % (train_data.target_names))
    x_train = train_data.data
    y_train = train_data.target
    x_test = test_data.data
    y_test = test_data.target
    return(x_train,y_train,x_test,y_test)
bert_name = 'distilbert-base-uncased'
from transformers import DistilBertForSequenceClassification,AutoConfig,AutoTokenizer
import os
dir_path = os.getcwd()
dir_path=os.path.join(dir_path,'models')
config = AutoConfig.from_pretrained(bert_name,num_labels=20) # change model configuration to access hidden values.
try:
    BERT = DistilBertForSequenceClassification.from_pretrained(dir_path,config=config)
    print ("Finetuned predictor loaded")
except:
    import tensorflow.keras as keras
    print ("No finetuned predictor found.\nTraining.")
    (x_train,y_train,x_test,y_test)=collect_data_for_FT()
    ####
    # prework:
    import ktrain
    from ktrain import text
    # NOTE: train_b is not defined in this scope; the 20 newsgroups target names are expected here.
    t = text.Transformer(bert_name, maxlen=500, classes=train_b.target_names)
    trn = t.preprocess_train(x_train, y_train)
    val = t.preprocess_test(x_test, y_test)
    pre_trained_model = t.get_classifier()
    learner = ktrain.get_learner(pre_trained_model, train_data=trn, val_data=val, batch_size=6)
    ####
    ####
    # Find best learning rate
    learner.lr_find()
    learner.lr_plot()
    ####
    learner.fit_onecycle(2e-4, 4) # chosen based on the learning rate/loss plot.
    ####
    # prepare and save:
    predictor = ktrain.get_predictor(learner.model, preproc=t)
    predictor.save('my_distilbertbase_predictor')
    predictor.model.save_pretrained(dir_path)
    ####
    BERT = DistilBertForSequenceClassification.from_pretrained(os.path.join(dir_path), from_tf=True,config=config) # re-load tensorflow to pytorch
    BERT.save_pretrained(dir_path) # save as a "full blooded" pytorch model
    BERT = DistilBertForSequenceClassification.from_pretrained(dir_path,config=config) # re-load
    from tensorflow.keras import backend as K
    K.clear_session() # loading from tensorflow takes up space and the GPU. This releases it.

How to use SHAP with a linear SVC model from sklearn using Pipeline?

I am doing text classification using a linear SVC model from sklearn. Now I want to visualize which words/tokens have the highest impact on the classification decision by using SHAP (https://github.com/slundberg/shap).
Right now this does not work because I am getting an error that seems to originate from the vectorizer step in the pipeline I have defined. What's wrong here?
Is my general approach on how to use SHAP in this case correct?
x_Train, x_Test, y_Train, y_Test = train_test_split(df_all['PDFText'], df_all['class'], test_size = 0.2, random_state = 1234)
pipeline = Pipeline([
    (
        'tfidv',
        TfidfVectorizer(
            ngram_range=(1,3),
            analyzer='word',
            strip_accents='ascii',
            use_idf=True,
            sublinear_tf=True,
            max_features=6000,
            min_df=2,
            max_df=1.0
        )
    ),
    (
        'lin_svc',
        svm.SVC(
            C=1.0,
            probability=True,
            kernel='linear'
        )
    )
])
pipeline.fit(x_Train, y_Train)
shap.initjs()
explainer = shap.KernelExplainer(pipeline.predict_proba, x_Train)
shap_values = explainer.shap_values(x_Test, nsamples=100)
shap.force_plot(explainer.expected_value[0], shap_values[0][0,:], x_Test.iloc[0,:])
This is the error message I get:
Provided model function fails when applied to the provided data set.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-81-4bca63616b3b> in <module>
3
4 # use Kernel SHAP to explain test set predictions
----> 5 explainer = shap.KernelExplainer(pipeline.predict_proba, x_Train)
6 shap_values = explainer.shap_values(x_Test, nsamples=100)
7
c:\users\s.p\appdata\local\programs\python\python37\lib\site-packages\shap\explainers\kernel.py in __init__(self, model, data, link, **kwargs)
95 self.keep_index_ordered = kwargs.get("keep_index_ordered", False)
96 self.data = convert_to_data(data, keep_index=self.keep_index)
---> 97 model_null = match_model_to_data(self.model, self.data)
98
99 # enforce our current input type limitations
c:\users\s.p\appdata\local\programs\python\python37\lib\site-packages\shap\common.py in match_model_to_data(model, data)
80 out_val = model.f(data.convert_to_df())
81 else:
---> 82 out_val = model.f(data.data)
83 except:
84 print("Provided model function fails when applied to the provided data set.")
c:\users\s.p\appdata\local\programs\python\python37\lib\site-packages\sklearn\utils\metaestimators.py in <lambda>(*args, **kwargs)
116
117 # lambda, but not partial, allows help() to work with update_wrapper
--> 118 out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
119 # update the docstring of the returned function
120 update_wrapper(out, self.fn)
c:\users\s.p\appdata\local\programs\python\python37\lib\site-packages\sklearn\pipeline.py in predict_proba(self, X)
379 for name, transform in self.steps[:-1]:
380 if transform is not None:
--> 381 Xt = transform.transform(Xt)
382 return self.steps[-1][-1].predict_proba(Xt)
383
c:\users\s.p\appdata\local\programs\python\python37\lib\site-packages\sklearn\feature_extraction\text.py in transform(self, raw_documents, copy)
1631 check_is_fitted(self, '_tfidf', 'The tfidf vector is not fitted')
1632
-> 1633 X = super(TfidfVectorizer, self).transform(raw_documents)
1634 return self._tfidf.transform(X, copy=False)
c:\users\s.p\appdata\local\programs\python\python37\lib\site-packages\sklearn\feature_extraction\text.py in transform(self, raw_documents)
1084
1085 # use the same matrix-building strategy as fit_transform
-> 1086 _, X = self._count_vocab(raw_documents, fixed_vocab=True)
1087 if self.binary:
1088 X.data.fill(1)
c:\users\s.p\appdata\local\programs\python\python37\lib\site-packages\sklearn\feature_extraction\text.py in _count_vocab(self, raw_documents, fixed_vocab)
940 for doc in raw_documents:
941 feature_counter = {}
--> 942 for feature in analyze(doc):
943 try:
944 feature_idx = vocabulary[feature]
c:\users\s.p\appdata\local\programs\python\python37\lib\site-packages\sklearn\feature_extraction\text.py in <lambda>(doc)
326 tokenize)
327 return lambda doc: self._word_ngrams(
--> 328 tokenize(preprocess(self.decode(doc))), stop_words)
329
330 else:
c:\users\s.p\appdata\local\programs\python\python37\lib\site-packages\sklearn\feature_extraction\text.py in <lambda>(x)
254
255 if self.lowercase:
--> 256 return lambda x: strip_accents(x.lower())
257 else:
258 return strip_accents
AttributeError: 'numpy.ndarray' object has no attribute 'lower'
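KernelExplainer expects to receive a classification model as the first argument, and it calls that model on array data. As the traceback shows, the TfidfVectorizer step therefore receives a numpy.ndarray instead of raw strings and fails on .lower(). Please check the use of Pipeline with SHAP following the link.
In your case, you can vectorize the text with the fitted pipeline step first and pass only the classifier step to the explainer:
x_Train = pipeline.named_steps['tfidv'].transform(x_Train)
explainer = shap.KernelExplainer(pipeline.named_steps['lin_svc'].predict_proba, x_Train)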
