Training RoBERTa model on IMDB movie reviews dataset giving this error? - python-3.x

def convert_data_to_examples(train, test, review, sentiment):
    train_InputExamples = train.apply(
        lambda x: InputExample(guid=None,  # Globally unique ID for bookkeeping, unused in this case
                               text_a=x[review],
                               label=x[sentiment]),
        axis=1)
    validation_InputExamples = test.apply(
        lambda x: InputExample(guid=None,  # Globally unique ID for bookkeeping, unused in this case
                               text_a=x[review],
                               label=x[sentiment]),
        axis=1)
    return train_InputExamples, validation_InputExamples

train_InputExamples, validation_InputExamples = convert_data_to_examples(train, test, 'review', 'sentiment')
NameError: name 'InputExample' is not defined
After running this part of the code, it gives the error above. Please tell me how to solve it.

from transformers import InputExample
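InputExample is exposed at the top level of the transformers package, so adding that import before calling convert_data_to_examples resolves the NameError. A minimal sanity check (the sample text and label here are made up):

from transformers import InputExample

# Construct one example by hand to confirm the class is importable
example = InputExample(guid=None, text_a="A wonderful film!", label=1)
print(example)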

Related

AllenNLP DatasetReader.read returns generator instead of AllennlpDataset

While studying the AllenNLP framework (version 2.0.1), I tried to implement the example code from https://guide.allennlp.org/training-and-prediction#1.
While reading the data from a Parquet file I got:
TypeError: unsupported operand type(s) for +: 'generator' and 'generator'
for the next line:
vocab = build_vocab(train_data + dev_data)
I suspect the return value should be AllennlpDataset but maybe I got it mixed up.
What did I do wrong?
Full code:
from typing import Dict, Iterable, Tuple

import pandas as pd

from allennlp.data import DatasetReader, Instance, Vocabulary
from allennlp.data.fields import LabelField, TextField
from allennlp.data.token_indexers import SingleIdTokenIndexer, TokenIndexer
from allennlp.data.tokenizers import Tokenizer, WhitespaceTokenizer

train_path = <some_path>
test_path = <some_other_path>

class ClassificationJobReader(DatasetReader):
    def __init__(self,
                 lazy: bool = False,
                 tokenizer: Tokenizer = None,
                 token_indexers: Dict[str, TokenIndexer] = None,
                 max_tokens: int = None):
        super().__init__(lazy)
        self.tokenizer = tokenizer or WhitespaceTokenizer()
        self.token_indexers = token_indexers or {'tokens': SingleIdTokenIndexer()}
        self.max_tokens = max_tokens

    def _read(self, file_path: str) -> Iterable[Instance]:
        df = pd.read_parquet(file_path)
        for idx in df.index:
            text = df['title'][idx] + ' ' + df['description'][idx]
            print(f'text : {text}')
            label = df['class_id'][idx]
            print(f'label : {label}')
            tokens = self.tokenizer.tokenize(text)
            if self.max_tokens:
                tokens = tokens[:self.max_tokens]
            text_field = TextField(tokens, self.token_indexers)
            label_field = LabelField(label)
            fields = {'text': text_field, 'label': label_field}
            yield Instance(fields)

def build_dataset_reader() -> DatasetReader:
    return ClassificationJobReader()

def read_data(reader: DatasetReader) -> Tuple[Iterable[Instance], Iterable[Instance]]:
    print("Reading data")
    training_data = reader.read(train_path)
    validation_data = reader.read(test_path)
    return training_data, validation_data

def build_vocab(instances: Iterable[Instance]) -> Vocabulary:
    print("Building the vocabulary")
    return Vocabulary.from_instances(instances)

dataset_reader = build_dataset_reader()
train_data, dev_data = read_data(dataset_reader)
vocab = build_vocab(train_data + dev_data)
Thanks for your help
Please find below the code fix first and the explanation afterwards.
Code Fix
# extend_from_instances expands your vocabulary with the instances passed as an arg,
# so the two calls below are equivalent to the previous
# Vocabulary.from_instances(train_data + dev_data)
vocabulary = Vocabulary()
vocabulary.extend_from_instances(train_data)
vocabulary.extend_from_instances(dev_data)
Explanation
This is because the AllenNLP API has had a couple of breaking changes in allennlp==2.0.1. You can find the changelog here and the upgrade guide here. The guide is outdated as per my understanding (it reflects allennlp<=1.4).
The DatasetReader now returns a generator, as opposed to a List previously. DatasetReader used to have a parameter called "lazy", which was for lazy loading of data. It was False by default, so dataset_reader.read would previously return a List. However, as of v2.0 (if I remember correctly), lazy loading is applied by default, so read returns a generator by default. As you know, the "+" operator is not defined for generator objects, so you cannot simply add two generators.
So, you can simply use vocab.extend_from_instances to achieve the same behavior as before. Hope this helped you. If you need a full code snippet, please leave a comment below and I can post a related gist and share it with you.
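As an aside, if you would rather keep the one-liner, chaining the two generators also works, since Vocabulary.from_instances only needs an iterable. A small sketch, assuming train_data and dev_data come fresh from read_data above and have not been consumed yet:

import itertools

# chain() yields from each generator in turn, so the vocabulary
# sees every instance from both splits exactly once
vocab = Vocabulary.from_instances(itertools.chain(train_data, dev_data))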
Good day!

Simple prediction from frozen .pb saved model

I have been trying for days to use a TF-exported .pb model file for prediction. The model was generated with the BestExporter function as follows:
features_specs = tf.feature_column.make_parse_example_spec(serving_features)
serving_input_receiver_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(
    feature_spec=features_specs, default_batch_size=None)
exporter[n] = tf.estimator.BestExporter(
    name="best_exporter",
    serving_input_receiver_fn=serving_input_receiver_fn,
    event_file_pattern='eval/*.tfevents.*',
    exports_to_keep=1)

if train_params["use_early_stop"] == True:
    hookModel[n] = tf.estimator.experimental.stop_if_no_decrease_hook(
        model[n],
        metric_name='average_loss',
        max_steps_without_decrease=train_params["early_stop_max_steps_without_decrease"],
        min_steps=train_params["early_stop_min_steps"],
        run_every_secs=train_params["early_stop_run_every_secs"],
        run_every_steps=train_params["early_stop_run_every_steps"])
else:
    hookModel[n] = None

train_spec[n] = tf.estimator.TrainSpec(input_fn=input_fn_["train" + m], hooks=[hookModel[n]])
eval_spec[n] = tf.estimator.EvalSpec(
    input_fn=input_fn_["test" + m],
    start_delay_secs=train_params["eval_specs_start_delay_secs"],
    throttle_secs=train_params["eval_specs_throttle_secs"],
    exporters=[exporter[n]])

tf.estimator.train_and_evaluate(model[n], train_spec[n], eval_spec[n])
I think this is how the input dict names get referenced...
I successfully load the model with :
model_[model_stage+"_"+model_type] = tf.saved_model.load(model_path)
but I don't know how to correctly pass my features dictionary to the model_XX['prediction'](example) wrapped function.
I saw this topic but it didn't help: TensorFlow v2: Replacement for tf.contrib.predictor.from_saved_model
There's no equivalent of the old tf.contrib.predictor.from_saved_model I used before...
Thanks for your answer.
I found the solution for passing a dict to the wrapped model. This is a slightly modified synthesis of the following solutions, with modifications for TF 2.4/Python 3.7:
TensorFlow v2: Replacement for tf.contrib.predictor.from_saved_model
https://www.programcreek.com/python/example/90440/tensorflow.Example
The second one is particularly complete and shows a lot of cases.
So:
my_dict = {"feature_1": str(something), "feature_2": int(an_int), "feature_3": float(a_float), ...}

# Load the model
my_model = tf.saved_model.load(model_path)

# Creates a serialized tf.train.Example from a dict
def create_serialized_example(name_to_values):
    example = tf.train.Example()
    for name, values in name_to_values.items():
        feature = example.features.feature[name]
        if isinstance(values, str):
            values = values.encode()  # Modified because in new tf versions strings have to be encoded
            add = feature.bytes_list.value.extend
        elif isinstance(values, float):
            add = feature.float_list.value.extend  # Modified: float_list instead of float_32 in TF 2
        elif isinstance(values, int):
            add = feature.int64_list.value.extend
        else:
            raise AssertionError('Unsupported type: %s' % type(values))
        add([values])  # Modified: has to be a list, not a bare variable
    return example.SerializeToString()

# Predict function
pred = my_model.signatures["predict"](examples=tf.constant([create_serialized_example(my_dict)]))
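When adapting this to another model, it can help to first inspect which signatures and outputs the SavedModel actually exposes, since the signature key ("predict" here) depends on how the model was exported. A short sketch:

loaded = tf.saved_model.load(model_path)
print(list(loaded.signatures.keys()))  # e.g. ['predict', 'serving_default'], depends on the exporter

infer = loaded.signatures["predict"]
print(infer.structured_outputs)        # names and shapes of the output tensors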

How do I extract the x co-ordinate of a point using Python

I'm trying to build an NMF model for topic extraction. To re-train the model, I have to pass a parameter to the NMF function, for which I need the x co-ordinate from a point that the algorithm returns. Here is the code for reference:
no_features = 1000
no_topics = 9
print('Old number of topics: ', no_topics)
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(documents)
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
no_topics = tfidf.shape
print('New number of topics :', no_topics)
# nmf = NMF(n_components=no_topics, random_state=1, alpha=.1, l1_ratio=.5, init='nndsvd').fit(tfidf)
On the third-to-last line, tfidf.shape returns the tuple (3, 1000) into the variable no_topics; however, I want that variable to be set to only the x co-ordinate, i.e. 3.
How can I extract just the x co-ordinate from the tuple?
You can select the first value with no_topics[0]:
print('New number of topics : {}'.format(no_topics[0]))
You can do slicing on your array tfidf with
topics = tfidf[0,:]
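Since shape is an ordinary Python tuple, a common idiom is to unpack both dimensions in one assignment. A small illustration (the variable names are chosen here for readability):

n_docs, n_features = tfidf.shape  # e.g. (3, 1000)
print('New number of topics:', n_docs)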

How do I get the coefficients/intercept for each group/model so I can plot the fitted line for each group?

I've written a custom class to group elements of a dataset, fit each group, and then run predictions for each group based on the fitted model. I want to be able to return the coefficients of each fit (presumably in a dictionary), so that I can refer back to them and plot the line of best fit for each group.
Calling the standard .coef_ or .get_params methods does not work, because the items these methods attempt to retrieve are groupby objects. Alternatively, I tried to introduce the following:
def get_coefs():
    coefs_dict = {}
    for name, values in dataframe.groupby(self.groupby_column):
        coefs_dict[name] = self.drugs_dict[name].coefs_
    return coefs_dict
But I get the following:
<bound method GroupbyEstimator.get_coefs of GroupbyEstimator(groupby_column='ndc',
pipeline_factory=<function pipeline_factory at 0x0000018DAD207268>)>
Here's the class I've written:
from sklearn import base
import numpy as np
import pandas as pd

class GroupbyEstimator(base.BaseEstimator, base.RegressorMixin):
    def __init__(self, groupby_column, pipeline_factory):
        self.groupby_column = groupby_column
        self.pipeline_factory = pipeline_factory

    def fit(self, dataframe, label):
        self.drugs_dict = {}
        self.label = label
        dataframe = pd.get_dummies(dataframe)
        for name, values in dataframe.groupby(self.groupby_column):
            y = values[label]
            X = values.drop(columns=[label, self.groupby_column], axis=1)
            self.drugs_dict[name] = self.pipeline_factory().fit(X, y)
        return self

    def get_coefs():
        self.coefs_dict = {}
        self.coefs_dict[name] = self.drugs_dict[name].named_steps["lin_reg"].coef_
        return self.coefs_dict

    def predict(self, test_data):
        price_pred_list = []
        for idx, row in test_data.iterrows():
            name = row[self.groupby_column]
            regression_coefs = self.drugs_dict[name]
            row = pd.DataFrame(row).T
            X = row.drop(columns=[self.label, self.groupby_column], axis=1).values.reshape(1, -1)
            drug_price_pred = regression_coefs.predict(X)
            price_pred_list.append([name, drug_price_pred])
        return price_pred_list
Expected result is a dictionary of the format:
{drug_a: [coefficient_1, coefficient_2,...coefficient_n],
drug_b: [coefficient_1, coefficient_2,...coefficient_n],
drug_c: [coefficient_1, coefficient_2,...coefficient_n]}
The pipeline factory is like this. I'll be building this out with alternative regressors, PCA, GridSearchCV, etc. at a later time (so long as I can get the parameters out of the groupby objects for the individual regressions).
def pipeline_factory():
    from sklearn.pipeline import Pipeline
    from sklearn.linear_model import LinearRegression
    return Pipeline([
        ('lin_reg', LinearRegression())
    ])
EDIT: Added the get_coefs method as suggested. Unfortunately, as displayed above, it is still returning the same error.
The problem is with self.drugs_dict, which is a dictionary of Pipeline objects, so you can't use coef_ on them directly. coef_ is an attribute of the estimator object, which in your case is a LinearRegression object. So the correct way of accessing the coefficients is self.drugs_dict[name].named_steps["lin_reg"].coef_ instead of self.drugs_dict[name].coefs_ in your get_coefs() method.
While @Parthasarathy Subburaj led me to the right answer, here's the completed code for anyone who may be looking for a similar solution:
from sklearn import base
import numpy as np
import pandas as pd

class GroupbyEstimator(base.BaseEstimator, base.RegressorMixin):
    def __init__(self, groupby_column, pipeline_factory):
        # column is the value to group by; estimator_factory can be called to produce estimators
        self.groupby_column = groupby_column
        self.pipeline_factory = pipeline_factory

    def fit(self, dataframe, label):
        # Create an estimator and fit it with the portion in each group (create and fit a model per group)
        self.drugs_dict = {}
        self.label = label
        self.coefs_dict = {}
        dataframe = pd.get_dummies(dataframe)  # onehot encoder had problems with the data, so I'm getting the dummies with pandas here
        for name, values in dataframe.groupby(self.groupby_column):
            y = values[label]
            X = values.drop(columns=[label, self.groupby_column], axis=1)
            self.drugs_dict[name] = self.pipeline_factory().fit(X, y)
            self.coefs_dict[name] = self.drugs_dict[name].named_steps["lin_reg"].coef_
        return self

    def get_coefs(self):
        return self.coefs_dict

    def predict(self, test_data):
        price_pred_list = []
        for idx, row in test_data.iterrows():
            name = row[self.groupby_column]  # get drug name from drug column
            regression_coefs = self.drugs_dict[name]  # get the fitted pipeline for this drug from drugs_dict
            row = pd.DataFrame(row).T
            X = row.drop(columns=[self.label, self.groupby_column], axis=1).values.reshape(1, -1)
            drug_price_pred = regression_coefs.predict(X)  # use the fitted model (key = drug name) to predict
            price_pred_list.append([name, drug_price_pred])
        return price_pred_list
The TL;DR of the comments is that the dictionary holding model names and coefficients needs to be created under the fit method using sklearn's .named_steps on the desired portion of the pipeline, and then returned in a separate method (in this case get_coefs).
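To illustrate the flow end to end, here is a minimal usage sketch; the toy data and column names (ndc, dose, price) are made up for the example, and the group keys are numeric so that pd.get_dummies leaves the groupby column intact:

import pandas as pd

toy = pd.DataFrame({
    'ndc':   [111, 111, 222, 222],     # group key (drug identifier)
    'dose':  [1.0, 2.0, 1.0, 2.0],     # single numeric feature
    'price': [10.0, 20.0, 5.0, 10.0],  # label
})

est = GroupbyEstimator(groupby_column='ndc', pipeline_factory=pipeline_factory)
est.fit(toy, label='price')
print(est.get_coefs())  # e.g. {111: array([10.]), 222: array([5.])}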

'Word2Vec' object has no attribute 'index2word'

I'm getting the error "AttributeError: 'Word2Vec' object has no attribute 'index2word'" in the following Python code. Does anyone know how I can solve it?
Actually, tfidf_weighted_averaged_word_vectorizer is what throws the error. obli.csv contains lines of sentences.
Thank you.
from feature_extractors import tfidf_weighted_averaged_word_vectorizer

dataset = get_data2()
corpus, labels = dataset.data, dataset.target
corpus, labels = remove_empty_docs(corpus, labels)
# print('Actual class label:', dataset.target_names[labels[10]])
train_corpus, test_corpus, train_labels, test_labels = prepare_datasets(corpus,
                                                                        labels,
                                                                        test_data_proportion=0.3)
tfidf_vectorizer, tfidf_train_features = tfidf_extractor(train_corpus)
vocab = tfidf_vectorizer.vocabulary_
tfidf_wv_train_features = tfidf_weighted_averaged_word_vectorizer(corpus=tokenized_train,
                                                                  tfidf_vectors=tfidf_train_features,
                                                                  tfidf_vocabulary=vocab,
                                                                  model=model,
                                                                  num_features=100)

def get_data2():
    obli = pd.read_csv('db/obli.csv').values.ravel().tolist()
    cl0 = [0 for x in range(len(obli))]
    nonObli = pd.read_csv('db/nonObli.csv').values.ravel().tolist()
    cl1 = [1 for x in range(len(nonObli))]
    all = obli + nonObli
    db = Db(all, cl0 + cl1)
    db.data = all
    db.target = cl0 + cl1
    return db
This is code from chapter 4 of Text Analytics for Python by Dipanjan Sarkar.
index2word in gensim has been moved since that text was published.
Instead of model.index2word you should use model.wv.index2word.
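In code, the one-line change looks like this (a sketch; model is the trained gensim Word2Vec instance from the book's code):

# Old attribute access from the book, which now raises AttributeError:
# vocab_words = model.index2word

# The vocabulary now lives on the KeyedVectors object under .wv:
vocab_words = model.wv.index2word

Note that in gensim 4.0 and later this list was renamed again, to model.wv.index_to_key.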
