How to find Top features from Naive Bayes using sklearn pipeline - scikit-learn

How to find Top features from Naive Bayes using sklearn pipeline
Hi all,
I am trying to apply Naive Bayes(MultinomialNB ) using pipelines and i came up with the code. However I am interested in finding top 10 positve and negative words , but not able to succeed. when I searched , I got the code for finding top features which i mentioned below. However when i tried using the code using pipeline i am getting the error which i mentioned below. I tried searching exhaustively , but got the code without using pipeline.But when i use the code with my output from pipeline, it is not working. COuld you please help me on how to find feature importance from pipeline output.
# Pipeline dictionary
pipelines = {
'bow_MultinomialNB' : make_pipeline(
CountVectorizer(),
preprocessing.Normalizer(),
MultinomialNB()
)
}
# List tuneable hyperparameters of our pipeline
pipelines['bow_MultinomialNB'].get_params()
# BOW - MultinomialNB hyperparameters
bow_MultinomialNB_hyperparameters = {
'multinomialnb__alpha' : [1000,500,100,50,10,5,1,0.5,0.1,0.05,0.01,0.005,0.001,0.0005,0.0001]
}
# Create hyperparameters dictionary
hyperparameters = {
'bow_MultinomialNB' : bow_MultinomialNB_hyperparameters
}
tscv = TimeSeriesSplit(n_splits=3) #For time based splitting
for name, pipeline in pipelines.items():
print("NAME:",name)
print("PIPELINE:",pipeline)
%time
# Create empty dictionary called fitted_models
fitted_models = {}
# Loop through model pipelines, tuning each one and saving it to fitted_models
for name, pipeline in pipelines.items():
# Create cross-validation object from pipeline and hyperparameters
model = GridSearchCV(pipeline, hyperparameters[name], cv=tscv, n_jobs=1,verbose=1)
# Fit model on X_train, y_train
model.fit(X_train, y_train)
# Store model in fitted_models[name]
fitted_models[name] = model
# Print '{name} has been fitted'
print(name, 'has been fitted.')
FEAURE IMPORTANCE:-
pipelines['bow_MultinomialNB'].steps[2][1].classes__
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-125-7d45b007e86b> in <module>()
----> 1 pipelines['bow_MultinomialNB'].steps[2][1].classes_
AttributeError: 'MultinomialNB' object has no attribute 'classes_'
pipelines['bow_MultinomialNB'].steps[0][1].get_feature_names()
---------------------------------------------------------------------------
NotFittedError Traceback (most recent call last)
<ipython-input-126-2883929221d1> in <module>()
----> 1 pipelines['bow_MultinomialNB'].steps[0][1].get_feature_names()
~\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in get_feature_names(self)
958 def get_feature_names(self):
959 """Array mapping from feature integer indices to feature name"""
--> 960 self._check_vocabulary()
961
962 return [t for t, i in sorted(six.iteritems(self.vocabulary_),
~\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in _check_vocabulary(self)
301 """Check if vocabulary is empty or missing (not fit-ed)"""
302 msg = "%(name)s - Vocabulary wasn't fitted."
--> 303 check_is_fitted(self, 'vocabulary_', msg=msg),
304
305 if len(self.vocabulary_) == 0:
~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_is_fitted(estimator, attributes, msg, all_or_any)
766
767 if not all_or_any([hasattr(estimator, attr) for attr in attributes]):
--> 768 raise NotFittedError(msg % {'name': type(estimator).__name__})
769
770
NotFittedError: CountVectorizer - Vocabulary wasn't fitted.
x=pipelines['bow_MultinomialNB'].steps[0][1]._validate_vocabulary()
x.get_feature_names()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-120-f620c754a34e> in <module>()
----> 1 x.get_feature_names()
AttributeError: 'NoneType' object has no attribute 'get_feature_names'
Regards,
Shree

Related

Cannot interpret SVM model using Shapash

Currently, I'm exploring machine learning interpretability tools for one of my project. I found Shapash quite a new tool and many people suggesting to use it to create a few easily interpretable charts for ML model. When I tried it with RandomForestClassifier it worked fine and generate a webpage full of different charts but the same I cannot achieve while using SVM(just exploring this library, not focusing on the perfect ML model for a problem).
Note - using Shapash link here
#Fit blackbox model
svc = svm.SVC()
svc.fit(X_train_smote, y_train_smote)
y_pred = svc.predict(X_test)
print(f"F1 Score {f1_score(y_test, y_pred, average='macro')}")
print(f"Accuracy {accuracy_score(y_test, y_pred)}")
from shapash import SmartExplainer
xpl = SmartExplainer(model=svc)
error which I'm getting -
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
/tmp/ipykernel_13648/1233939729.py in <module>
----> 1 xpl = SmartExplainer(model=svc)
~/Python_AI/ai_env/lib/python3.8/site-packages/shapash/explainer/smart_explainer.py in __init__(self, model, backend, preprocessing, postprocessing, features_groups, features_dict, label_dict, title_story, palette_name, colors_dict, **kwargs)
194 if isinstance(backend, str):
195 backend_cls = get_backend_cls_from_name(backend)
--> 196 self.backend = backend_cls(
197 model=self.model, preprocessing=preprocessing, **kwargs)
198 elif isinstance(backend, BaseBackend):
~/Python_AI/ai_env/lib/python3.8/site-packages/shapash/backend/shap_backend.py in __init__(self, model, preprocessing, explainer_args, explainer_compute_args)
16 self.explainer_args = explainer_args if explainer_args else {}
17 self.explainer_compute_args = explainer_compute_args if explainer_compute_args else {}
---> 18 self.explainer = shap.Explainer(model=model, **self.explainer_args)
19
20 def run_explainer(self, x: pd.DataFrame) -> dict:
~/Python_AI/ai_env/lib/python3.8/site-packages/shap/explainers/_explainer.py in __init__(self, model, masker, link, algorithm, output_names, feature_names, **kwargs)
166 # if we get here then we don't know how to handle what was given to us
167 else:
--> 168 raise Exception("The passed model is not callable and cannot be analyzed directly with the given masker! Model: " + str(model))
169
170 # build the right subclass
Exception: The passed model is not callable and cannot be analyzed directly with the given masker! Model: SVC()

Setting `remove_unused_columns=False` causes error in HuggingFace Trainer class

I am training a model using HuggingFace Trainer class. The following code does a decent job:
!pip install datasets
!pip install transformers
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer, AutoTokenizer
dataset = load_dataset('glue', 'mnli')
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', use_fast=True)
def preprocess_function(examples):
return tokenizer(examples["premise"], examples["hypothesis"], truncation=True, padding=True)
encoded_dataset = dataset.map(preprocess_function, batched=True)
args = TrainingArguments(
"test-glue",
learning_rate=3e-5,
per_device_train_batch_size=8,
num_train_epochs=3,
remove_unused_columns=True
)
trainer = Trainer(
model,
args,
train_dataset=encoded_dataset["train"],
tokenizer=tokenizer
)
trainer.train()
However, setting remove_unused_columns=False results in the following error:
ValueError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
704 if not is_tensor(value):
--> 705 tensor = as_tensor(value)
706
ValueError: too many dimensions 'str'
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
8 frames
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
720 )
721 raise ValueError(
--> 722 "Unable to create tensor, you should probably activate truncation and/or padding "
723 "with 'padding=True' 'truncation=True' to have batched tensors with the same length."
724 )
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.
Any suggestions are highly appreciated.
It fails because the value in line 705 is a list of str, which points to hypothesis. And hypothesis is one of the ignored_columns in trainer.py.
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
704 if not is_tensor(value):
--> 705 tensor = as_tensor(value)
See the below snippet from trainer.py for the remove_unused_columns flag:
def _remove_unused_columns(self, dataset: "datasets.Dataset", description: Optional[str] = None):
if not self.args.remove_unused_columns:
return dataset
if self._signature_columns is None:
# Inspect model forward signature to keep only the arguments it accepts.
signature = inspect.signature(self.model.forward)
self._signature_columns = list(signature.parameters.keys())
# Labels may be named label or label_ids, the default data collator handles that.
self._signature_columns += ["label", "label_ids"]
columns = [k for k in self._signature_columns if k in dataset.column_names]
ignored_columns = list(set(dataset.column_names) - set(self._signature_columns))
There could be a potential pull request on HuggingFace to provide a fallback option in case the flag is False. But in general, it looks like that the flag implementation is not complete for e.g. it can't be used with Tensorflow.
On the contrary, it doesn't hurt to keep it True, unless there is some special need.

huggingface's ReformerForMaskedLM configuration issue

I'm trying to pass the all of the huggingface's ...ForMaskedLM to the FitBert model for fill-in-the-blank task and see which pretrained yields the best result on the data I've prepared. But in the Reformer module I have this error says that I need to do 'config.is_decoder=False' but I don't really get what this means (This is my first time using huggingface). I tried to pass a ReformerConfig(is_decoder=False) to the model but still get the same error. How can I fix this?
My code:
pretrained_weights = ['google/reformer-crime-and-punishment',
'google/reformer-enwik8']
configurations = ReformerConfig(is_decoder=False)
for weight in pretrained_weights:
print(weight)
model = ReformerForMaskedLM(configurations).from_pretrained(weight)
tokenizer = ReformerTokenizer.from_pretrained(weight)
fb = FitBert(model=model, tokenizer=tokenizer)
predicts = []
for _, row in df.iterrows():
predicts.append(fb.rank(row['question'], options=[row['1'], row['2'], row['3'], row['4']])[0])
print(weight,':', np.sum(df.anwser==predicts) / df.shape[0])
Error:
AssertionError Traceback (most recent call last)
<ipython-input-5-a6016e0015ba> in <module>()
4 for weight in pretrained_weights:
5 print(weight)
----> 6 model = ReformerForMaskedLM(configurations).from_pretrained(weight)
7 tokenizer = ReformerTokenizer.from_pretrained(weight)
8 fb = FitBert(model=model, tokenizer=tokenizer)
/usr/local/lib/python3.7/dist-packages/transformers/modeling_utils.py in from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
1032
1033 # Instantiate model.
-> 1034 model = cls(config, *model_args, **model_kwargs)
1035
1036 if state_dict is None and not from_tf:
/usr/local/lib/python3.7/dist-packages/transformers/models/reformer/modeling_reformer.py in __init__(self, config)
2304 assert (
2305 not config.is_decoder
-> 2306 ), "If you want to use `ReformerForMaskedLM` make sure `config.is_decoder=False` for bi-directional self-attention."
2307 self.reformer = ReformerModel(config)
2308 self.lm_head = ReformerOnlyLMHead(config)
AssertionError: If you want to use `ReformerForMaskedLM` make sure `config.is_decoder=False` for bi-directional self-attention.
You can override certain model configurations by loading the model config separately and providing it as parameter for the from_pretrained() method. This will assure that you are using the proper model configuration with the changes you have made:
from transformers import ReformerConfig, ReformerForMaskedLM
config = ReformerConfig.from_pretrained('google/reformer-crime-and-punishment')
print(config.is_decoder)
config.is_decoder=False
print(config.is_decoder)
model = ReformerForMaskedLM.from_pretrained('google/reformer-crime-and-punishment', config=config)
Output:
True
False

not able to use tf.metrics.recall

I am very new to tensorflow.
I am just trying to understand how to use tf.metrics.recall
I am doing the following
true = tf.zeros([64, 1])
pred = tf.random_uniform([64,1], -1.0,1.0)
with tf.Session() as sess:
t,p = sess.run([true,pred])
# print(t)
# print(p)
rec, rec_op = tf.metrics.recall(labels=t, predictions=p)
sess.run(rec_op,feed_dict={t: t,p: p})
print(recall)
And that is giving me the following error:
TypeError Traceback (most recent call last)
<ipython-input-43-7245c92d724d> in <module>
25 # print(p)
26 rec, rec_op = tf.metrics.recall(labels=t, predictions=p)
---> 27 sess.run(rec_op,feed_dict={t: t,p: p})
28 print(recall)
TypeError: unhashable type: 'numpy.ndarray'
Please help me to understand this better.
Thank you in advance
labels and predictions in your code return tensor outputs, which are numpy array. You can you numpy or your own implementation to calculate recall over them, if you wish. The benefit of using metrics is that you can run everything uniformly in one go with just tensorflow.
with tf.Session() as sess:
rec, rec_op = tf.metrics.recall(labels=true, predictions=pred)
batch_recall, _ = sess.run([rec, rec_op],feed_dict={t: t,p: p})
print(recall)
Note, that you use tensors constructing tf.metrics.recall.

Problems with com cross validation and pool on catboost

I am using catboost (version = catboost-0.10.4.1 enum34-1.1.6) in a mac (Python 3.6.3 | Anaconda) and I got some problems with the module cv (with the cat_features parameter of the pool). Here is the exemple:
1) Using catboost in a single fit:
from catboost import CatBoostClassifier, cv,
model = CatBoostClassifier(early_stopping_rounds=30,cat_features = [0,1,2,3,4,5,6,7,8],iterations=2000,eval_metric='AUC',learning_rate=0.1,)
model.fit(X_train, y_train,eval_set=(X_test,y_test))
Everything works just fine and I got my outputs!
2-) When I try to use basically the same thing with the cv module the problem appears:
cv_data = cv(pool=Pool(X_train, y_train,cat_features = [0,1,2,3,4,5,6,7,8]),params=model.get_params())
Output :
---------------------------------------------------------------------------
CatboostError Traceback (most recent call last)
<ipython-input-242-675cf317e148> in <module>()
----> 1 cv_data = cv(pool=Pool(X_train, y_train,cat_features=[0,1,2,3,4,5,6,7,8]),params=model.get_params())
~/anaconda3/lib/python3.6/site-packages/catboost/core.py in cv(pool, params, dtrain, iterations, num_boost_round, fold_count, nfold, inverted, partition_random_seed, seed, shuffle, logging_level, stratified, as_pandas, metric_period, verbose, verbose_eval, plot, early_stopping_rounds, save_snapshot, snapshot_file, snapshot_interval)
2900
2901 with log_fixup(), plot_wrapper(plot, params):
-> 2902 return _cv(params, pool, fold_count, inverted, partition_random_seed, shuffle, stratified, as_pandas)
2903
2904
_catboost.pyx in _catboost._cv()
_catboost.pyx in _catboost._cv()
CatboostError: catboost/libs/options/plain_options_helper.cpp:272: Error: unknown option cat_features with value [0,1,2,3,4,5,6,7,8]
The indexes are right for the categorical variables (and worked for a single fit).
I can't understand what is happening and couldn't find related content.
Can somebody help?

Resources