Informative Features Code not Working - python-3.x

I want to implement a most informative features function for binary NB in SciKit Learn. I am using Python3.
First off, I understand that the question of implementing some sort of 'informative features' function for scikit-learn's multinomial NB has been asked before. However, I have tried the responses and have had no luck, so I think either scikit-learn has been updated or I am doing something very wrong. I am using tobigue's answer here for the function.
from nltk.corpus import stopwords
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
#Array contains a list of (headline, source) tuples where there are two sources.
#I want to classify each headline as belonging to a given source.
array = [('toyota showcases humanoid that mirrors user', 'drudge'), ('virginia again delays vote certification after error in ballot distribution', 'npr'), ("do doctors need to use computers? one physician's case highlights the quandary", 'npr'), ('office sex summons', 'drudge'), ('launch calibrated to avoid military response?', 'drudge'), ('snl skewers alum al franken, trump sons', 'npr'), ('mulvaney shows up for work at consumer watchdog group, as leadership feud deepens', 'npr'), ('indonesia tries to evacuate 100,000 people away from erupting volcano on bali', 'npr'), ('downing street blasts', 'drudge'), ('stocks soar more; records smashed', 'drudge'), ('aid begins to filter back into yemen, as saudi-led blockade eases', 'npr'), ('just look at these fancy port-a-potties', 'npr'), ('nyt turns to twitter activism to thwart', 'drudge'), ('uncertainty reigns in battle for virginia house of delegates', 'npr'), ('u.s. reverses its decision to close palestinian office in d.c.', 'npr'), ("'i don't believe in science,' says flat-earther set to launch himself in own rocket", 'npr'), ("bosnian war chief 'dies' after being filmed 'drinking poison' at the hague", 'drudge'), ('federal judge blocks new texas anti-abortion law', 'npr'), ('gm unveils driverless cars, aiming to lead pack', 'drudge'), ('in japan, a growing scandal over companies faking product-quality data', 'npr')]
def scikit_naivebayes(data_array):
    headlines = [element[0] for element in data_array]
    sources = [element[1] for element in data_array]
    text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                         ('tfidf', TfidfTransformer()),
                         ('clf', MultinomialNB())])
    cf1 = text_clf.fit(headlines, sources)
    train(cf1, headlines, sources)
    #Call most_informative_features function on CountVectorizer and classifier
    show_most_informative_features(CountVectorizer, cf1)

def train(classifier, X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=33)
    classifier.fit(X_train, y_train)
    print("Accuracy: {}".format(classifier.score(X_test, y_test)))
#tobigue's code:
def show_most_informative_features(vectorizer, clf, n=20):
    feature_names = vectorizer.get_feature_names()
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))
def main():
    scikit_naivebayes(array)

main()
#ERROR:
# File "file_path_here", line 34, in program_name
# feature_names = vectorizer.get_feature_names()
# TypeError: get_feature_names() missing 1 required positional argument: 'self'

You need to fit the CountVectorizer before calling vectorizer.get_feature_names(). In your code, you call the function with the CountVectorizer class itself, which won't lead anywhere.
You should try, independently from your pipeline, to create a vectorizer with CountVectorizer, then call fit on your text, and finally use the function already provided, though you will need to adapt it to your problem yourself.
You should see that the function needs an instantiated object, not a class. Tell me if you don't.
Edit
coef_ is an attribute only available on an estimator, i.e. a classifier (and not all of them). Pipeline is a sklearn object used to combine different steps in order to feed a classifier. Typically, a bag-of-words pipeline is constituted by a feature extractor and a classifier (here logistic regression):
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(args)),
    ('classifier', LogisticRegression())
])
So, in your case, you should either avoid using a pipeline (which is what I recommend to begin with), or use the pipeline's get_params() method to access the classifier.
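For illustration, here is a minimal sketch of pulling the fitted steps back out of your pipeline; named_steps is standard sklearn, the step names 'vect' and 'clf' come from your own Pipeline definition, and note that recent scikit-learn versions replace get_feature_names() with get_feature_names_out():
cf1 = text_clf.fit(headlines, sources)

# named_steps returns the *fitted* step objects, i.e. real instances
fitted_vectorizer = cf1.named_steps['vect']   # fitted CountVectorizer
fitted_clf = cf1.named_steps['clf']           # fitted MultinomialNB

# Now an instance (not the class) is passed to the helper
show_most_informative_features(fitted_vectorizer, fitted_clf)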
I suggest you fit_transform the text, then feed the transformed result to a logistic regression or naive Bayes classifier, and then call the function you have:
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(headlines)  # fit on the raw text only; no y is needed here
naive_bayes = MultinomialNB()
naive_bayes.fit(X, sources)
show_most_informative_features(vectorizer, naive_bayes)
First try that, and if it works you'll understand better how to then use a pipeline. Note that your Pipeline should not work if you combine two feature extractors: the last step should be an estimator. If you want to stack two feature extractors, you need to look into FeatureUnion, as sketched below.
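For reference, a minimal sketch of stacking two feature extractors with FeatureUnion; the char-n-gram settings here are invented purely for illustration:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# FeatureUnion concatenates the outputs of several transformers,
# so the pipeline still ends in a single estimator.
union_clf = Pipeline([
    ('features', FeatureUnion([
        ('word_counts', CountVectorizer(stop_words='english')),
        ('char_tfidf', TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 4))),
    ])),
    ('clf', MultinomialNB()),
])
union_clf.fit(headlines, sources)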

Related

Can sklearn gridsearchcv rank candidates on median instead of mean_test_score?

Main question:
In a certain model scenario it seems more robust to me to judge candidates tested in a sklearn.model_selection.GridSearchCV by their median performance instead of the mean. Is there a way to do this?
Some more context:
Especially for small datasets, or when using a CV scheme with a low sample count in the test folds (e.g. LeaveOneOut), it can happen that certain folds achieve extremely low test scores while the bulk of the folds perform quite well. Selecting on the mean of all test scores may then prefer a different candidate, for instance one where all folds score moderately low but none performs outrageously badly.
My current workaround has some problems:
I can tell GridSearchCV to write the best_* attributes with respect to a custom callable passed as the refit argument, so I am using the function below to select the model that achieved the best median score among the CV folds:
import numpy as np

def best_median_score(cv_results):
    """
    Find the best median score from a cross-validation result dictionary.
    :param cv_results: dictionary of cross-validation results
    :return: index of best median score
    """
    inner_test_scores = np.array([
        scores for key, scores in cv_results.items()
        if key.startswith('split')
        and f'test_{Config.refit_scorer}' in key
    ])
    median_inner_test_scores = np.median(inner_test_scores, axis=0)
    return median_inner_test_scores.argmax()
and pass it as:
grid = GridSearchCV(
    pipe,              # pipeline object of model steps
    params,            # parameter grid
    scoring=scorer,    # dict of multiple scorers
    refit=best_median_score,
    cv=10,
    verbose=1,
    n_jobs=-1,
)
However, GridSearchCV still calculates the mean_test_score in grid.cv_results_, where I would prefer a "median_test_score" instead. Also, this way I lose the grid.best_score_ attribute and get an error when trying to score manually:
grid.score(model_X, model_y)
KeyError                                  Traceback (most recent call last)
File ~/.local/share/virtualenvs/my_env/lib/python3.9/site-packages/sklearn/model_selection/_search.py:446, in BaseSearchCV.score(self, X, y)
    444 if isinstance(self.scorer_, dict):
    445     if self.multimetric_:
--> 446         scorer = self.scorer_[self.refit]
    447     else:
    448         scorer = self.scorer_

KeyError: <function best_median_score at 0x7f4b840beca0>
Median test performance can be calculated outside of GridSearchCV, and the model then refit with the best hyper-parameter combination according to the median score.
import pandas as pd
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
svc = svm.SVC()

# refit=False: we only want cv_results_; the refit is done manually below
clf = GridSearchCV(svc, parameters, refit=False)
clf.fit(iris.data, iris.target)

# Compute the median over the per-split test scores and rank candidates by it
results_df = pd.DataFrame(clf.cv_results_)
results_df['median_test_score'] = results_df.filter(regex='^split').median(axis=1)
results_df['rank_test_score'] = results_df['median_test_score'].rank(ascending=False).astype(int)

# Refit on the full data with the best-by-median parameter combination
svc.set_params(**results_df.query('rank_test_score == 1')['params'].values[0])
svc.fit(iris.data, iris.target)

55 Hour Long Single Epoch Colab GPU Training Time for a Small RoBERTa-like Transformer

I am following this guide: https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb
to train a transformer for Pali, a dead language that still has applicability for Buddhist texts. I've got the Pali Canon as training material (110+ MB of text). However, training the transformer for a single epoch on Colab takes 55 hours. The transformer has 55M parameters, so this doesn't seem reasonable to me.
One problem could be that this is one continuous piece of text. I don't know whether the model interprets each carriage return (\n) as a new text, or the whole of the Pali Canon as a single document; I can see either causing the ridiculous training time. From the tutorial, I gathered that <s> and </s> are the beginning-of-document and end-of-document tokens in the corpora. However, I still don't know what in the document sets off the <s> markers. At first I thought it might be \n, but it doesn't look like that triggers it in the OSCAR dataset the tutorial uses, and I haven't added anything to my text of the Pali Canon to denote end-of-document flags.
My question: right now the Pali Canon text file is not formatted correctly for the transformer to train in optimal time, so how should it be formatted?
Here is the link to the Pali Canon text: https://drive.google.com/file/d/1d1J1ib8LYnvapqY9FAWiMTHEA76XxC7b/view?usp=sharing
Here is the link to the folder containing the tokenizer files: https://drive.google.com/drive/folders/1-2la-kfV_az0NdhCya5fPyleqTc4RJox?usp=sharing
Here is my code:
from transformers import RobertaTokenizerFast
from transformers import RobertaConfig
from transformers import RobertaForMaskedLM
from transformers import LineByLineTextDataset
from transformers import DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments
import os

tokenizer = RobertaTokenizerFast(
    "/content/drive/MyDrive/PaliBert/vocab.json",
    "/content/drive/MyDrive/PaliBert/merges.txt",
    max_len=512,
)

config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=512,
    num_attention_heads=8,
    num_hidden_layers=2,
    type_vocab_size=1,
)

model = RobertaForMaskedLM(config=config)

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="/content/drive/MyDrive/Pali Cannon Analysis/Pali Cannon Text.txt",
    block_size=128,
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="./EsperBERTo",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_gpu_train_batch_size=64,
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

trainer.train()
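Not a definitive fix, but a sketch of the kind of preprocessing the question is asking about: LineByLineTextDataset treats each non-empty line of the file as a separate training example, so one option is to pre-split the raw Pali Canon text into short lines first. The output path and the naive period-based splitting rule below are assumptions for illustration only:
# Sketch: pre-split the corpus so each line is one short training example.
# The input path is from the question; the output path and the naive
# period-based splitting are assumptions, not a vetted Pali sentence splitter.
in_path = "/content/drive/MyDrive/Pali Cannon Analysis/Pali Cannon Text.txt"
out_path = "/content/drive/MyDrive/Pali Cannon Analysis/Pali Cannon Lines.txt"

with open(in_path, encoding="utf-8") as src, \
     open(out_path, "w", encoding="utf-8") as dst:
    for raw in src:
        for segment in raw.strip().split(". "):
            segment = segment.strip()
            if segment:  # LineByLineTextDataset skips blank lines anyway
                dst.write(segment + "\n")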

Can we save the result of the Hyperopt Trials with Sparktrials

I am currently trying to optimize the hyperparameters of a gradient boosting method with the hyperopt library. When I was working on my own computer, I used the Trials class and was able to save and reload my results with pickle. This gave me a record of every set of parameters I had tested. My code looked like this:
import os
import pickle as pkl
import xgboost as xgb
from hyperopt import Trials, SparkTrials, STATUS_OK, tpe, fmin
from LearningUtils.LearningUtils import build_train_test, get_train_test, mean_error, rmse, mae
from LearningUtils.constants import MAX_EVALS, CV, XGBOOST_OPTIM_SPACE, PARALELISM
from sklearn.model_selection import cross_val_score

if os.path.isfile(PATH_TO_TRIALS):  # we reload the past results
    with open(PATH_TO_TRIALS, 'rb') as trials_file:
        trials = pkl.load(trials_file)
else:  # we create the trials file
    trials = Trials()

# classic hyperparameter optimization
# (X_train and Y_train are assumed to come from the LearningUtils helpers above)
def objective(space):
    regressor = xgb.XGBRegressor(n_estimators=space['n_estimators'],
                                 max_depth=int(space['max_depth']),
                                 learning_rate=space['learning_rate'],
                                 gamma=space['gamma'],
                                 min_child_weight=space['min_child_weight'],
                                 subsample=space['subsample'],
                                 colsample_bytree=space['colsample_bytree'],
                                 verbosity=0)
    regressor.fit(X_train, Y_train)
    # Applying k-fold cross-validation
    accuracies = cross_val_score(estimator=regressor, X=X_train, y=Y_train, cv=5)
    CrossValMean = accuracies.mean()
    return {'loss': 1 - CrossValMean, 'status': STATUS_OK}

best = fmin(fn=objective,
            space=XGBOOST_OPTIM_SPACE,
            algo=tpe.suggest,
            max_evals=MAX_EVALS,
            trials=trials,
            return_argmin=False)

# Save the trials
pkl.dump(trials, open(PATH_TO_TRIALS, "wb"))
Now I would like to make this code work on a remote server with more CPUs, in order to allow parallelization and save time.
I saw that I can do that simply by using hyperopt's SparkTrials class instead of Trials. But SparkTrials objects cannot be saved with pickle. Do you have any idea how I could save and reload the trials results stored in a SparkTrials object?
So this might be a bit late, but after messing around a bit, I found a kind of hacky solution:
import pickle

spark_trials = SparkTrials()
pickling_trials = dict()

for k, v in spark_trials.__dict__.items():
    if k not in ['_spark_context', '_spark']:
        pickling_trials[k] = v

pickle.dump(pickling_trials, open('pickling_trials.hyperopt', 'wb'))
The _spark_context and _spark attributes of the SparkTrials instance are the culprits that make the object unserializable. It turns out you don't need them if you want to reuse the object: if you re-run the optimization, a new Spark context is created anyway, so you can reuse the trials as:
new_sparktrials = SparkTrials()

for att, v in pickling_trials.items():
    setattr(new_sparktrials, att, v)

best = fmin(loss_func,
            space=search_space,
            algo=tpe.suggest,
            max_evals=1000,
            trials=new_sparktrials)
voilà :)
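If that works for you, the same idea can be wrapped into a pair of helpers; this is a sketch under the same assumption that _spark_context and _spark are the only unpicklable attributes:
import pickle
from hyperopt import SparkTrials

def save_spark_trials(trials, path):
    # Drop the live Spark handles before serializing the rest of the state
    state = {k: v for k, v in trials.__dict__.items()
             if k not in ('_spark_context', '_spark')}
    with open(path, 'wb') as f:
        pickle.dump(state, f)

def load_spark_trials(path):
    # A fresh SparkTrials recreates its Spark context; restore everything else
    trials = SparkTrials()
    with open(path, 'rb') as f:
        for attr, value in pickle.load(f).items():
            setattr(trials, attr, value)
    return trials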

Building vocabulary using document vector

I am not able to build the vocabulary, and I am getting an error:
TypeError: 'int' object is not iterable
Here is my code, which is based on this Medium article:
https://towardsdatascience.com/implementing-multi-class-text-classification-with-doc2vec-df7c3812824d
I have tried providing a pandas Series and a list to the build_vocab function.
import pandas as pd
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.model_selection import train_test_split
import multiprocessing
import nltk
from nltk.corpus import stopwords

def tokenize_text(text):
    tokens = []
    for sent in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sent):
            if len(word) < 2:
                continue
            tokens.append(word.lower())
    return tokens

df = pd.read_csv("https://raw.githubusercontent.com/RaRe-Technologies/movie-plots-by-genre/master/data/tagged_plots_movielens.csv")

tags_index = {
    "sci-fi": 1,
    "action": 2,
    "comedy": 3,
    "fantasy": 4,
    "animation": 5,
    "romance": 6,
}
df["tindex"] = df.tag.replace(tags_index)
df = df[["plot", "tindex"]]

mylist = list()
for i, q in df.iterrows():
    mylist.append(
        TaggedDocument(tokenize_text(str(q["plot"])), tags=q["tindex"])
    )

df["tdoc"] = mylist
X = df[["tdoc"]]
y = df["tindex"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

cores = multiprocessing.cpu_count()
model_doc2vec = Doc2Vec(
    dm=1,
    vector_size=300,
    negative=5,
    hs=0,
    min_count=2,
    sample=0,
    workers=cores,
)

model_doc2vec.build_vocab([x for x in X_train["tdoc"]])
The documentation is very confusing for this method.
Doc2Vec needs an iterable sequence of TaggedDocument-like objects for its corpus (as fed to build_vocab() or train()).
(When showing an error, you should also show the full stack that accompanied it, so that it is clear which line of code, and which surrounding call frames, are involved.)
But it's unclear whether what you've fed into the dataframe, then pulled back out via dataframe bracket-access, then passed through train_test_split(), is actually that. So I'd suggest assigning things to descriptive interim variables and verifying that they contain the right sort of objects at each step.
Is X_train["tdoc"][0] a proper TaggedDocument, with a words property that is a list of strings and a tags property that is a list of tags? (Each tag is probably a string, but could perhaps be a plain int counting upward from 0.)
Is mylist[0] a proper TaggedDocument?
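As a point of comparison, here is a minimal sketch of a well-formed corpus, reusing the question's tokenize_text and df. Note that tags must be a list; the question passes the bare int q["tindex"], which is exactly the kind of thing that produces "TypeError: 'int' object is not iterable":
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each document: words = a list of strings, tags = a list (one tag per doc here)
corpus = [
    TaggedDocument(words=tokenize_text(str(q["plot"])), tags=[int(q["tindex"])])
    for _, q in df.iterrows()
]

model = Doc2Vec(dm=1, vector_size=300, min_count=2, workers=4)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)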
Separately: many online examples of Doc2Vec use have egregious errors, and the Medium article you link is no exception. Its practice of calling train() multiple times in a loop is usually unneeded and very error-prone, and in that article it in fact results in severe mismanagement of the alpha learning rate. (For example, deducting 0.002 from the default starting alpha of 0.025 thirty times yields a negative effective alpha, which is never justified and means the model is making itself worse with every example. This may be a factor contributing to the awful reported classifier accuracy.)
I would disregard that article entirely and seek better examples elsewhere.

k means cluster method score negative

Guys, I am yet a beginner trying to learn ML, so do forgive me for such a simple question. I had a dataset from the UCI ML Repository and started applying all kinds of unsupervised algorithms to it, among them the K-means clustering algorithm. When I printed out the accuracy score it was negative, not just once but many times. As far as I know, scores aren't negative. Could you please help me understand why it's negative?
Any help is appreciated.
import pandas as pd
import numpy as np

a = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data',
                names=["a", "b", "c", "d", "e", "f", "g", "h", "i"])
b = a
c = b.filter(a.columns[[8]], axis=1)
a.drop(a.columns[[8]], axis=1, inplace=True)

from sklearn.preprocessing import LabelEncoder
le1 = LabelEncoder()
le1.fit(a.a)
a.a = le1.transform(a.a)

from sklearn.preprocessing import OneHotEncoder
x = np.array(a)
y = np.array(c)
ohe = OneHotEncoder(categorical_features=[0])
ohe.fit(x)
x = ohe.transform(x).toarray()

from sklearn.model_selection import train_test_split
xtr, xts, ytr, yts = train_test_split(x, y, test_size=0.2)

from sklearn import cluster
kmean = cluster.KMeans(n_clusters=2, init='k-means++', max_iter=100, n_init=10)
kmean.fit(xtr, ytr)
print(kmean.score(xts, yts))
Thank you!!
The k-means score is an indication of how far the points are from the centroids.
In scikit-learn, the score is better the closer to zero it is.
Bad scores return a large negative number, whereas good scores return values close to zero. Generally, you will want to take the absolute value of the output of the score() method for better visualization.
Clustering is not classification.
Note that the 'y' argument of fit is ignored. KMeans will always predict 0, 1, ..., k-1, so it will never produce correct labels on this data set: it doesn't even know what a label is supposed to look like. What you did for classification really doesn't transfer to clustering; you need to relearn this from scratch. Different workflow, different evaluation.
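If you do want to compare cluster assignments against known labels, label-agnostic metrics exist. Here is a small sketch using the adjusted Rand index (my choice of metric, not something stated in the answer above), reusing the question's variables:
from sklearn.metrics import adjusted_rand_score

labels_pred = kmean.predict(xts)
# ARI compares the two partitions while ignoring the arbitrary 0..k-1 cluster ids;
# 1.0 means perfect agreement, values near 0.0 mean chance-level agreement.
print(adjusted_rand_score(yts.ravel(), labels_pred))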
This is explained in the book "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" by Aurélien Géron.
On page 243 of the book (Chapter 9), it says: "The score() method returns the negative inertia. Why negative? Because a predictor's score() method must always respect Scikit-Learn's 'greater is better' rule: if a predictor is better than another, its score() method should return a greater score."
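A quick way to see this relationship for yourself; a sketch reusing the variables from the question's code:
# On the training data, score() is just the negative of the fitted inertia_
kmean.fit(xtr)
print(kmean.inertia_)    # sum of squared distances to the closest centroid (>= 0)
print(kmean.score(xtr))  # same magnitude, negative sign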
Hope this helped!
