What does the appearance of BERT's special tokens in SQuAD QA answers mean? - nlp-question-answering

I'm running fine-tuned BERT and ALBERT models for Question Answering, and I'm evaluating their performance on a subset of questions from SQuAD v2.0. I use SQuAD's official evaluation script.
I use Huggingface transformers; below is the actual code and an example I'm running (it might also be helpful for folks who are trying to run a fine-tuned ALBERT model on SQuAD v2.0):
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("ktrapeznikov/albert-xlarge-v2-squad-v2")
model = AutoModelForQuestionAnswering.from_pretrained("ktrapeznikov/albert-xlarge-v2-squad-v2")
question = "Why aren't the examples of bouregois architecture visible today?"
text = """Exceptional examples of the bourgeois architecture of the later periods were not restored by the communist authorities after the war (like mentioned Kronenberg Palace and Insurance Company Rosja building) or they were rebuilt in socialist realism style (like Warsaw Philharmony edifice originally inspired by Palais Garnier in Paris). Despite that the Warsaw University of Technology building (1899\u20131902) is the most interesting of the late 19th-century architecture. Some 19th-century buildings in the Praga district (the Vistula\u2019s right bank) have been restored although many have been poorly maintained. Warsaw\u2019s municipal government authorities have decided to rebuild the Saxon Palace and the Br\u00fchl Palace, the most distinctive buildings in prewar Warsaw."""
input_dict = tokenizer.encode_plus(question, text, return_tensors="pt")
input_ids = input_dict["input_ids"].tolist()
start_scores, end_scores = model(**input_dict)
all_tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
answer = ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1]).replace('▁', '')
print(answer)
And the output is like the following:
[CLS] why aren ' t the examples of bour ego is architecture visible today ? [SEP] exceptional examples of the bourgeois architecture of the later periods were not restored by the communist authorities after the war
As you can see, there are BERT's special tokens in the answer, including [CLS] and [SEP].
I understand that in cases where the answer is just [CLS] (i.e., both start_scores and end_scores argmax to 0), it basically means the model thinks there is no answer to the question in the context, which makes sense. In these cases I simply set the answer to that question to an empty string when running the evaluation script.
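For reference, here is a minimal sketch of that handling (my own illustration built on the snippet above, assuming start_scores, end_scores, and all_tokens as computed there):
start_idx = torch.argmax(start_scores)
end_idx = torch.argmax(end_scores)
if start_idx == 0 and end_idx == 0:
    # both point at [CLS]: treat as "no answer" for the SQuAD v2.0 evaluation script
    predicted_answer = ""
else:
    predicted_answer = ' '.join(all_tokens[start_idx : end_idx + 1]).replace('▁', '')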
But I wonder about cases like the example above: should I again assume the model could not find an answer and set the answer to an empty string, or should I leave the answer as it is when evaluating the model's performance?
I'm asking because, as far as I understand, the performance calculated by the evaluation script can change (correct me if I'm wrong) if I keep such strings as answers, and I may not get a realistic sense of the models' performance.

You should simply treat them as invalid, because you are trying to predict a proper answer span from the variable text; everything else should be invalid. This is also how Huggingface treats these predictions:
We could hypothetically create invalid predictions, e.g., predict that the start of the span is in the question. We throw out all invalid predictions.
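In other words, a candidate (start, end) pair only counts if the span falls inside the context; a rough sketch of such checks (my own illustration, not Hugging Face's exact code) looks like this:
def is_valid_span(start, end, context_start, context_end, max_answer_length=30):
    # the span must lie entirely within the context tokens,
    # not in the question or among the special tokens
    if start < context_start or end > context_end:
        return False
    # the end must not come before the start
    if end < start:
        return False
    # overly long spans are discarded as well
    if end - start + 1 > max_answer_length:
        return False
    return True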
You should also note that they use a more sophisticated method to get the predictions for each question (don't ask me why they show torch.argmax in their example). Please have a look at the example below:
from transformers.data.processors.squad import SquadResult, SquadExample, SquadFeatures, SquadV2Processor, squad_convert_examples_to_features
from transformers.data.metrics.squad_metrics import compute_predictions_logits, squad_evaluate
###
#your example code
###
outputs = model(**input_dict)
def to_list(tensor):
    return tensor.detach().cpu().tolist()
output = [to_list(output[0]) for output in outputs]
start_logits, end_logits = output
all_results = []
all_results.append(SquadResult(1000000000, start_logits, end_logits))  #SquadResult(unique_id, start_logits, end_logits)
#this is the answers section from the evaluation dataset
answers = [{'text':'not restored by the communist authorities', 'answer_start':77}, {'text':'were not restored', 'answer_start':72}, {'text':'not restored by the communist authorities after the war', 'answer_start':77}]
examples = [SquadExample('0', question, text, 'not restored by the communist authorities', 75, 'Warsaw', answers,False)]
#this does basically the same as tokenizer.encode_plus() but stores the result in SquadFeatures objects and splits the context if necessary
#positional arguments: (examples, tokenizer, max_seq_length, doc_stride, max_query_length, is_training)
features = squad_convert_examples_to_features(examples, tokenizer, 512, 100, 64, True)
predictions = compute_predictions_logits(
    examples,
    features,
    all_results,
    20,                     # n_best_size
    30,                     # max_answer_length
    True,                   # do_lower_case
    'pred.file',            # output_prediction_file
    'nbest_file',           # output_nbest_file
    'null_log_odds_file',   # output_null_log_odds_file
    False,                  # verbose_logging
    True,                   # version_2_with_negative
    0.0,                    # null_score_diff_threshold
    tokenizer
)
result = squad_evaluate(examples, predictions)
print(predictions)
for x in result.items():
    print(x)
Output:
OrderedDict([('0', 'communist authorities after the war')])
('exact', 0.0)
('f1', 72.72727272727273)
('total', 1)
('HasAns_exact', 0.0)
('HasAns_f1', 72.72727272727273)
('HasAns_total', 1)
('best_exact', 0.0)
('best_exact_thresh', 0.0)
('best_f1', 72.72727272727273)
('best_f1_thresh', 0.0)

Related

NLP: Get opinionated terms that correspond to aspect terms

I want to extract the sentiment expression that goes along with an aspect term in a sentence. I have the following code:
import spacy
nlp = spacy.load("en_core_web_lg")
def find_sentiment(doc):
    # find roots of all entities in the text
    ner_heads = {ent.root.idx: ent for ent in doc.ents}
    rule3_pairs = []
    for token in doc:
        children = token.children
        A = "999999"
        M = "999999"
        add_neg_pfx = False
        for child in children:
            if(child.dep_ in ["nsubj"] and not child.is_stop): # nsubj is nominal subject
                if child.idx in ner_heads:
                    A = ner_heads[child.idx].text
                else:
                    A = child.text
            if(child.dep_ in ["acomp", "advcl"] and not child.is_stop): # acomp is adjectival complement
                M = child.text
            # example - 'this could have been better' -> (this, not better)
            if(child.dep_ == "aux" and child.tag_ == "MD"): # MD is modal auxiliary
                neg_prefix = "not"
                add_neg_pfx = True
            if(child.dep_ == "neg"): # neg is negation
                neg_prefix = child.text
                add_neg_pfx = True
            # print(child, child.dep_)
        if (add_neg_pfx and M != "999999"):
            M = neg_prefix + " " + M
        if(A != "999999" and M != "999999"):
            rule3_pairs.append((A, M))
    return rule3_pairs
print(find_sentiment(nlp('NEW DELHI Refined soya oil remained weak for the second day and prices shed 0.56 per cent to Rs 682.50 per 10 kg in futures market today as speculators reduced positions following sluggish demand in the spot market against adequate stocks position.')))
Which gets me the output: [('oil', 'weak'), ('prices', 'reduced')]
But this captures too little of the text's content.
I want to know if it is possible to get an output like: [('oil', 'weak'), ('prices', 'shed 0.56 percent'), ('demand', 'sluggish')]
Is there any approach you recommend trying?
I tried the code given above. I also tried another library, Stanza, which gave only similar results.
Unfortunately, if your task is to extract all expressive words from a text (all the words that carry sentimental significance), it is not possible with the current state of affairs. Language is highly variable, and the same word can change its sentiment and meaning from sentence to sentence. While words like "awful" are easy to classify as negative, "demand" from your text is not as obvious, not to mention edge cases where a seemingly positive word like "incredible" reverses its sentiment when used as an intensifier: "incredibly stupid" should be classified as very negative, but machines can normally only output two opposite labels for such words.
This is why, for the purposes of sentiment analysis, the only reliable way is to build a machine learning model that classifies texts as a whole, which means you should adapt your software to accept the final verdict and process it one way or another.
Naive Bayes Classifier
The simplest way to classify text by sentiment is the Naive Bayes classifier algorithm (which, among other things, can classify more than just sentiment), implemented in NLTK:
from nltk import NaiveBayesClassifier, classify
#The training data is a list of (feature dict, label) pairs.
train_data = dataset[:7000]
test_data = dataset[7000:]
#Train method returns the trained model.
classifier = NaiveBayesClassifier.train(train_data)
#To get accuracy, use classify.accuracy method:
print("Accuracy is:", classify.accuracy(classifier, test_data))
In order to make a prediction, we need to pass a list of words. It's preferable to remove any words that carry no sentimental significance, such as stop words and punctuation, so that they don't disturb the model:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# may require nltk.download('stopwords') and nltk.download('punkt') first

def clearLexemes(words):
    # drop stop words and bare punctuation so they don't disturb the model
    return [word for word in words
            if word not in stopwords.words("english")
            and word not in "!?<>:;.&*%^"]

text = "What a terrible day!"
tokens = clearLexemes(word_tokenize(text))
print("Text sentiment is " + str(classifier.classify(dict([token, True] for token in tokens))))
The output will be the sentiment of the text.
Important notes: the classifier
requires minimal parameters to train and trains relatively fast;
is highly efficient for working with natural languages (it is also used for gender identification and named entity recognition);
is unlikely to properly classify edge cases where words shift their sentiment in creatively styled or rare utterances. For example, "Sweetheart, I wish all of your fears would come true and you will be happy to live in such a world!" is negative and uses irony to mask a negative attitude behind positive expressions, and the model may not be able to detect this.
Logistic Regression
Another related method is to use a logistic regression classifier from your favourite machine learning framework. In this notebook I used the Amazon food review dataset
to measure how fast model accuracy increases as you feed it more and more data. The data you need to feed the model is the raw text and its score label (which in your case could be the sentiment).
import numpy as np #used below to coerce the Summary column values to strings
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix, classification_report
#Preparing the data
ys: pd.DataFrame = reviews.head(170536) #30% of the dataframe is test data
xs: pd.DataFrame = reviews[170537:] #70% of the dataframe is training data
#Training the model
lr = LogisticRegression(max_iter=1000)
cv = CountVectorizer(token_pattern=r'\b\w+\b')
train = cv.fit_transform(xs["Summary"].apply(lambda x: np.str_(x)))
test = cv.transform(ys["Summary"].apply(lambda x: np.str_(x)))
lr.fit(train, xs["Score"])
#Measuring accuracy:
predictions = lr.predict(test)
labels = ["x1", "x2", "x3", "x4", "x5"]
report = classification_report(predictions, ys["Score"],
                               target_names=labels, output_dict=True)
accuracy = [report[label]["precision"] for label in labels]
print(accuracy)
Conclusion
Sentiment analysis is a worthwhile area of academic and industrial research that relies entirely on machine learning and is bound by its limitations. It is a powerful topic that deserves a place in the classical NLP toolkit. Unfortunately, understanding meaning well enough to extract situational sentiment is currently a feat close to inventing Artificial General Intelligence; however, technology is rapidly moving in that direction.

find bigram using gensim

This is my code:
from gensim.models import Phrases
documents = ["the mayor of new york was there the hill have eyes","the_hill have_eyes new york mayor was present"]
sentence_stream = [doc.split(" ") for doc in documents]
bigram = Phrases(sentence_stream, min_count=1)
sent = ['the', 'mayor', 'of', 'new_york', 'was', 'there', 'the_hill', 'have_eyes']
print(bigram[sent])
I want it to detect "the_hill_have_eyes", but the output is
['the', 'mayor', 'of', 'new_york', 'was', 'there', 'the_hill', 'have_eyes']
Phrases is a purely-statistical method for combining some unigram-token-pairs to new bigram-tokens. If it's not combining two unigrams you think should be combined, it's because the training data and/or chosen parameters (like threshold or min_count) don't imply that pairing should be combined.
Note especially that:
even when Phrases-combinations prove beneficial for downstream classification or info-retrieval steps, they may not intuitively/aesthetically match the "phrases" we as human readers would like to see
since Phrases requires bulk statistics for good results, it requires a lot of training data – you are unlikely to see impressive or representative results from tiny toy-sized training data
In particular, with regard to that last point and your example: the way min_count enters the Phrases default scoring means that even min_count=1 isn't low enough for a bigram that has only a single occurrence in the training corpus to be created.
So, if you expand your training corpus a bit, you may be able to create the results you want. But you should still be aware that this method's only value comes from training on larger, realistic corpora, so anything you see in tiny contrived examples may not generalize to real uses.
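As a toy illustration of that last point (my own sketch, with a deliberately lowered threshold because the corpus is still tiny): once the pair appears several times in the training stream, the bigram does get formed.
from gensim.models import Phrases

documents = ["the mayor of new york was there the hill have eyes",
             "the_hill have_eyes new york mayor was present",
             "the_hill have_eyes is a film",
             "the_hill have_eyes was shot in the desert"]
sentence_stream = [doc.split(" ") for doc in documents]
bigram = Phrases(sentence_stream, min_count=1, threshold=1)

sent = ['the', 'mayor', 'of', 'new_york', 'was', 'there', 'the_hill', 'have_eyes']
print(bigram[sent])
# with these toy counts the pair scores above the threshold, so the output
# ends with 'the_hill_have_eyes'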
What you want is not actually bigrams but "fourgrams".
This can be achieved by doing something like this (an old piece of code I wrote some months ago):
from gensim.models.phrases import Phrases, Phraser
from gensim.models.word2vec import Text8Corpus
from gensim.test.utils import datapath

# read the txt file
sentences = Text8Corpus(datapath('testcorpus.txt'))
phrases = Phrases(sentences, min_count=1, threshold=1)
bigram = Phraser(phrases)
sent = [u'trees', u'graph', u'minors']
# look for words in "sent"
print(bigram[sent])
# output: [u'trees_graph', u'minors']

# to create the bigrams
bigram_model = Phrases(unigram_sentences)
# apply the trained model to a sentence
for unigram_sentence in unigram_sentences:
    bigram_sentence = u' '.join(bigram_model[unigram_sentence])
# get a trigram model out of the bigram
trigram_model = Phrases(bigram_sentences)
So here you have a trigram model (detecting three words together), and you get the idea of how to implement fourgrams; a toy sketch of that follows below.
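A minimal sketch of that idea (my own toy example, assuming a tiny corpus and a deliberately low threshold): run one Phrases pass to merge unigrams into bigrams, then a second pass over the bigrammed stream to merge adjacent bigrams, which is how 'the_hill' + 'have_eyes' can become 'the_hill_have_eyes'.
from gensim.models.phrases import Phrases, Phraser

documents = ["the mayor of new york was there the hill have eyes",
             "the hill have eyes is a horror film",
             "the hill have eyes was shot in the desert"]
sentence_stream = [doc.split(" ") for doc in documents]

# first pass: unigrams -> bigrams ('the_hill', 'have_eyes')
bigram = Phraser(Phrases(sentence_stream, min_count=1, threshold=1))
bigram_stream = [bigram[sent] for sent in sentence_stream]

# second pass: merge adjacent bigrams -> 'the_hill_have_eyes'
fourgram = Phraser(Phrases(bigram_stream, min_count=1, threshold=1))
print(fourgram[bigram[sentence_stream[0]]])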
Hope this helps. Good luck.

Is this text training with skip-gram correct?

I am still a beginner with neural networks and NLP.
In this code I'm training cleaned text (some tweets) with skip-gram.
But I do not know if I do it correctly.
Can anyone inform me about the correctness of this skip-gram text training?
Any help is appreciated.
This is my code:
import numpy as np
import pandas as pd
from nltk import word_tokenize
from gensim.models.phrases import Phrases, Phraser

sent = [row.split() for row in X['clean_text']]
phrases = Phrases(sent, max_vocab_size = 50, progress_per=10000)
bigram = Phraser(phrases)
sentences = bigram[sent]
from gensim.models import Word2Vec
w2v_model = Word2Vec(window=5,
                     size=300,
                     sg=1)
w2v_model.build_vocab(sentences)
w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=25)
del sentences #to reduce memory usage
def get_mat(model, corpus, size):
    vecs = np.zeros((len(corpus), size))
    n = 0
    for i in corpus.index:
        vecs[i] = np.zeros(size).reshape((1, size))
        for word in str(corpus.iloc[i,0]).split():
            try:
                vecs[i] += model[word]
                #n += 1
            except KeyError:
                continue
    return vecs
X_sg = get_mat(w2v_model, X, 300)
del X
X_sg=pd.DataFrame(X_sg)
X_sg.head()
from sklearn import preprocessing
scale = preprocessing.normalize
X_sg = scale(X_sg)
for i in range(len(X_sg)):
    X_sg[i] += 1  #I did this because some weights were negative, so I could not
                  #apply LSTM on them later
You haven't mentioned if you've received any errors, or unsatisfactory results, so it's hard to know what kind of help you might need.
Your specific lines of code involving the Word2Vec model are roughly correct: plausibly-useful parameters (if you have a dataset large enough to train 300-dimensional vectors), and the proper steps. So the real proof would be whether your results are acceptable.
Regarding your attempted use of Phrases bigram-creation beforehand:
You should get things generally working and with promising results before adding this extra pre-processing complexity.
The parameter max_vocab_size=50 is seriously misguided and may make the phrases step pointless. The max_vocab_size is a hard cap on how many words/bigrams are tallied by the class, as a way to cap its memory usage. (Whenever the number of known words/bigrams hits this cap, many lower-frequency words/bigrams are pruned – in practice a majority of all known words/bigrams are pruned at each pruning, giving up a lot of accuracy in return for capped memory usage.) The max_vocab_size default in gensim is 40,000,000 – but the default in the Google word2phrase.c source on which gensim's method is based was 500,000,000. By using just 50, it's not really going to learn anything useful from whatever 50 words/bigrams survive the many prunings.
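A minimal sketch of the alternative (my own suggestion, with assumed parameter values): leave max_vocab_size at its very large default so nothing is pruned away, and tune min_count/threshold instead.
from gensim.models.phrases import Phrases, Phraser

# min_count=5 and threshold=10 are the gensim defaults, written out explicitly;
# max_vocab_size is left untouched so no words/bigrams are pruned
phrases = Phrases(sent, min_count=5, threshold=10)
bigram = Phraser(phrases)
sentences = bigram[sent]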
Regarding your get_mat() function and the later DataFrame code, I have no idea what you're trying to do with it, so I can't offer any opinion on it.

An NLP Model that Suggest a List of Words in an Incomplete Sentence

I have read a bunch of papers that talk about predicting missing words in a sentence. What I really want is to create a model that suggests a word from an incomplete sentence.
Example:
Incomplete Sentence :
I bought an ___________ because its rainy.
Suggested Words:
umbrella
soup
jacket
From the journal I have read, they have utilized Microsoft Sentence Completion Dataset for predicting missing words from a sentence.
Example :
Incomplete Sentence :
Im sad because you are __________
Missing Word Options:
a) crying
b) happy
c) pretty
d) sad
e) bad
I don't want to predict a missing word from a list of options; I want to suggest a list of words given an incomplete sentence. Is it feasible? Please enlighten me, because I'm really confused. What is a state-of-the-art model I can use for suggesting a list of (semantically coherent) words given an incomplete sentence?
Is it necessary that the list of suggested words be included in the training dataset?
This is exactly how the BERT model was trained: mask some random words in the sentence, and make your network predict these words. So yes, it is feasible. And no, it is not necessary to have the list of suggested words as a training input. However, these suggested words should be part of the overall vocabulary with which this BERT was trained.
I adapted this answer to show how the completion function may work.
# install this package to obtain the pretrained model
# ! pip install -U pytorch-pretrained-bert
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForMaskedLM
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval(); # turning off the dropout
def fill_the_gaps(text):
    text = '[CLS] ' + text + ' [SEP]'
    tokenized_text = tokenizer.tokenize(text)
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
    segments_ids = [0] * len(tokenized_text)
    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensors = torch.tensor([segments_ids])
    with torch.no_grad():
        predictions = model(tokens_tensor, segments_tensors)
    results = []
    for i, t in enumerate(tokenized_text):
        if t == '[MASK]':
            predicted_index = torch.argmax(predictions[0, i]).item()
            predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
            results.append(predicted_token)
    return results
print(fill_the_gaps(text = 'I bought an [MASK] because its rainy .'))
print(fill_the_gaps(text = 'Im sad because you are [MASK] .'))
print(fill_the_gaps(text = 'Im worried because you are [MASK] .'))
print(fill_the_gaps(text = 'Im [MASK] because you are [MASK] .'))
The [MASK] symbol indicates the missing words (there can be any number of them). [CLS] and [SEP] are BERT-specific special tokens. The outputs for these particular prints are
['umbrella']
['here']
['worried']
['here', 'here']
The duplication is not surprising: transformer NNs are generally good at copying words. And from a semantic point of view, these symmetric continuations indeed look very likely.
Moreover, if it is not a random word that is missing but exactly the last word (or the last several words), you can use any language model (e.g. another famous SOTA language model, GPT-2) to complete the sentence.
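For completeness, a minimal sketch of that last suggestion (my own example using the transformers text-generation pipeline with the small gpt2 checkpoint; the prompt and parameters are assumptions, not part of the answer above):
from transformers import pipeline

generator = pipeline('text-generation', model='gpt2')
# let GPT-2 propose several completions of an unfinished sentence
completions = generator("I bought an umbrella because it",
                        max_length=20, num_return_sequences=3, do_sample=True)
for c in completions:
    print(c['generated_text'])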

How to omit empty Model errors in tune.svm()?

I'm trying to use the tune.svm function, and since I don't really know which parameters will produce a good model (the training data will be picked by a user), I need to cover a wide range of values. Currently I get this behavior:
tune(svm, value ~ . , data= data_l, ranges=list(cost = 10^(0:5), epsilon = 10^(-1:0)))
Parameter tuning of ‘svm’:
- sampling method: 10-fold cross validation
- best parameters:
cost epsilon
100 0.1
- best performance: 277.5491
and
tune(svm, value ~ . , data= data_l, ranges=list(cost = 10^(0:5), epsilon = 10^(-1:1)))
Error in predict.svm(ret, xhold, decision.values = TRUE) : Model is empty!
(The only difference is the upper bound of epsilon.)
I know that svm won't work with epsilon = 10, but my intuition for this tuning function was that it could handle parameter values that don't produce a model.
Why wouldn't it pick the models that could be generated?
Is there an “easy” way to avoid this error behavior? (I tried tryCatch(tune()) and a lot of other things I found, but I guess I would have to dig deep into the tune/svm/predict code, which doesn't sound “easy” anymore.)
My understanding of the "Model is empty!" error is that it indicates a singular training matrix was fed into the SVM. See Salvy's answer to this related post and Oldrich Kruza's related message in the R mailing list.
