BERT embeddings in batches - pytorch

I am following this post to extract embeddings for sentences. For a single sentence, the steps are described as follows:
text = "After stealing money from the bank vault, the bank robber was seen " \
"fishing on the Mississippi river bank."
# Add the special tokens.
marked_text = "[CLS] " + text + " [SEP]"
# Split the sentence into tokens.
tokenized_text = tokenizer.tokenize(marked_text)
# Mark each of the 22 tokens as belonging to sentence "1".
segments_ids = [1] * len(tokenized_text)
# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased',
output_hidden_states = True,
)
# Put the model in "evaluation" mode, meaning feed-forward operation.
model.eval()
with torch.no_grad():
outputs = model(tokens_tensor, segments_tensors)
hidden_states = outputs[2]
And I want to do this for a batch of sequences. Here is my example code:
seql = ['this is an example', 'today was sunny and', 'today was']
encoded = [tokenizer.encode(seq, max_length=5, pad_to_max_length=True) for seq in seql]
encoded
[[2, 2511, 1840, 3251, 3],
[2, 1663, 2541, 1957, 3],
[2, 1663, 2541, 3, 0]]
But since I'm working with batches, the sequences need to have the same length, so I introduce a padding token (see the 3rd sentence), which raises several points of confusion:
What should the segment id for the pad token (0) be?
Should I use attention masking when feeding the tensors to the model so that padding is ignored? In the example, only the token and segment tensors are used:
outputs = model(tokens_tensor, segments_tensors)
If I work with individual sentences instead of batches, I might not need a padding token at all. Would that be better than batching?

You could do all the work you need (padding, truncation) with one function:
encode_plus
Check the parameters in the docs.
You can do the same for a list of sequences with
batch_encode_plus
(docs).
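For example (a minimal sketch, assuming the same transformers BertTokenizer/BertModel setup as in the question; newer library versions spell pad_to_max_length as padding='max_length'): the attention mask returned by the tokenizer is what tells the model to ignore the padding, and the token type (segment) ids for pad positions can simply stay 0.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
model.eval()

seql = ['this is an example', 'today was sunny and', 'today was']
# Tokenize, add special tokens, pad and truncate the whole batch in one call.
encoded = tokenizer.batch_encode_plus(seql, max_length=7,
                                      pad_to_max_length=True,
                                      return_tensors='pt')

with torch.no_grad():
    outputs = model(input_ids=encoded['input_ids'],
                    attention_mask=encoded['attention_mask'],
                    token_type_ids=encoded['token_type_ids'])
    hidden_states = outputs[2]  # tuple of per-layer hidden states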

Related

Attention weighted aggregation

Let the tensor shown below be the representation of two sentences (batch_size = 2), each composed of 3 words (max_length = 3), with each word represented by a vector of dimension 5 (hidden_size = 5), obtained as the output of a neural network:
net_output
# tensor([[[0.7718, 0.3856, 0.2545, 0.7502, 0.5844],
# [0.4400, 0.3753, 0.4840, 0.2483, 0.4751],
# [0.4927, 0.7380, 0.1502, 0.5222, 0.0093]],
# [[0.5859, 0.0010, 0.2261, 0.6318, 0.5636],
# [0.0996, 0.2178, 0.9003, 0.4708, 0.7501],
# [0.4244, 0.7947, 0.5711, 0.0720, 0.1106]]])
Also consider the following attention scores:
att_scores
# tensor([[0.2425, 0.5279, 0.2295],
# [0.2461, 0.4789, 0.2751]])
What is an efficient way to aggregate the vectors in net_output, weighted by att_scores, into a tensor of shape (2, 5)?
This should work:
weighted = (net_output * att_scores[..., None]).sum(axis = 1)
This uses broadcasting to multiply each word vector elementwise by its attention weight, and then sums over the sequence dimension to aggregate the vectors for each sentence in the batch.
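An equivalent formulation (a sketch using torch.einsum on the same net_output and att_scores tensors) makes the contraction over the sequence dimension explicit:
import torch

# 'bsh,bs->bh': for every batch element b, weight each hidden vector (size h)
# by its attention score and sum over the sequence dimension s.
weighted_einsum = torch.einsum('bsh,bs->bh', net_output, att_scores)

# Same result and shape (2, 5) as the broadcasting version:
# torch.allclose(weighted, weighted_einsum) -> True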

How to get words from output of XLNet using Transformers library

I am using Hugging Face's Transformers library to work with different NLP models. The following code does masked prediction with XLNet. It outputs a tensor of numbers. How do I convert the output back to words?
import torch
from transformers import XLNetModel, XLNetTokenizer, XLNetLMHeadModel
tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
model = XLNetLMHeadModel.from_pretrained('xlnet-base-cased')
# We show how to setup inputs to predict a next token using a bi-directional context.
input_ids = torch.tensor(tokenizer.encode("I went to <mask> York and saw the <mask> <mask> building.")).unsqueeze(0) # We will predict the masked token
print(input_ids)
perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]), dtype=torch.float)
perm_mask[:, :, -1] = 1.0 # Previous tokens don't see last token
target_mapping = torch.zeros((1, 1, input_ids.shape[1]), dtype=torch.float) # Shape [1, 1, seq_length] => let's predict one token
target_mapping[0, 0, -1] = 1.0 # Our first (and only) prediction will be the last token of the sequence (the masked token)
outputs = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping)
next_token_logits = outputs[0] # Output has shape [target_mapping.size(0), target_mapping.size(1), config.vocab_size]
The current output I get is:
tensor([[[ -5.1466, -17.3758, -17.3392, ..., -12.2839, -12.6421, -12.4505]]],
grad_fn=AddBackward0)
The output you have is a tensor of size 1 by 1 by vocabulary size. The nth number in this tensor is the predicted logit (unnormalized log-probability) of the nth vocabulary item. So, if you want the word that the model predicts as most likely in the final position (the position you specified with target_mapping), all you need to do is find the vocabulary item with the maximum predicted logit.
Just add the following to the code you have:
predicted_index = torch.argmax(next_token_logits[0][0]).item()
predicted_token = tokenizer.convert_ids_to_tokens(predicted_index)
So predicted_token is the token the model predicts as most likely in that position.
Note that, by default, XLNetTokenizer.encode() adds the special tokens <sep> and <cls> to the end of the token sequence when it encodes a string. The code you have given masks and predicts the final token, which, after running through tokenizer.encode(), is the special token '<cls>', which is probably not what you want.
That is, when you run
tokenizer.encode("I went to <mask> York and saw the <mask> <mask> building.")
the result is a list of token ids,
[35, 388, 22, 6, 313, 21, 685, 18, 6, 6, 540, 9, 4, 3]
which, if you convert back to tokens (by calling tokenizer.convert_ids_to_tokens() on the above id list), you will see has two extra tokens added at the end,
['▁I', '▁went', '▁to', '<mask>', '▁York', '▁and', '▁saw', '▁the', '<mask>', '<mask>', '▁building', '.', '<sep>', '<cls>']
So, if the word you are meaning to predict is 'building', you should use perm_mask[:, :, -4] = 1.0 and target_mapping[0, 0, -4] = 1.0.
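Putting that together (a minimal sketch reusing the model, tokenizer, and input_ids from the question), predicting the token at the '▁building' position would look like this:
import torch

seq_len = input_ids.shape[1]

# Previous tokens must not attend to the token at position -4 ('▁building').
perm_mask = torch.zeros((1, seq_len, seq_len), dtype=torch.float)
perm_mask[:, :, -4] = 1.0

# Predict exactly one token: the one at position -4.
target_mapping = torch.zeros((1, 1, seq_len), dtype=torch.float)
target_mapping[0, 0, -4] = 1.0

with torch.no_grad():
    outputs = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping)

predicted_index = torch.argmax(outputs[0][0][0]).item()
predicted_token = tokenizer.convert_ids_to_tokens(predicted_index)
print(predicted_token)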

How can I calculate perplexity using nltk

I am trying to do some processing on a text. This is part of my code:
import nltk
from nltk.util import ngrams

fp = open(train_file)
raw = fp.read()
fp.seek(0)  # rewind, otherwise readlines() returns nothing after read()
sents = fp.readlines()
words = nltk.tokenize.word_tokenize(raw)
bigrams = ngrams(words, 2, pad_left=True, pad_right=True,
                 left_pad_symbol='<s>', right_pad_symbol='</s>')
fdist = nltk.FreqDist(words)
For older versions of nltk, I found this code on StackOverflow for computing perplexity:
estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
lm = NgramModel(5, train, estimator=estimator)
print("len(corpus) = %s, len(vocabulary) = %s, len(train) = %s, len(test) = %s" % ( len(corpus), len(vocabulary), len(train), len(test) ))
print("perplexity(test) =", lm.perplexity(test))
However, this code is no longer valid, and I didn't find any other package or function in nltk for this purpose. Should I implement it?
Perplexity
Let's assume we have a model which takes an English sentence as input and outputs a probability score corresponding to how likely it is to be a valid English sentence. We want to determine how good this model is. A good model should give a high score to valid English sentences and a low score to invalid English sentences. Perplexity is a popular measure to quantify how "good" such a model is. If a sentence $s$ contains $n$ words, then its perplexity is
$PP(s) = p(w_1, w_2, \ldots, w_n)^{-1/n}$
Modeling the probability distribution $p$ (building the model)
$p(s)$ can be expanded using the chain rule of probability:
$p(s) = p(w_1)\,p(w_2 \mid w_1) \cdots p(w_n \mid w_1, \ldots, w_{n-1})$
So given some data (called train data) we can calculate the above conditional probabilities. In practice, however, this is not feasible, as it would require a huge amount of training data. We therefore make assumptions in order to calculate them:
Assumption: all words are independent (unigram)
$p(s) = \prod_{i=1}^{n} p(w_i)$
Assumption: first-order Markov assumption (bigram)
The next word depends only on the previous word:
$p(s) = \prod_{i=1}^{n} p(w_i \mid w_{i-1})$
Assumption: k-th order Markov assumption (n-gram)
The next word depends only on the previous k words:
$p(s) = \prod_{i=1}^{n} p(w_i \mid w_{i-k}, \ldots, w_{i-1})$
MLE to estimate probabilities
Maximum Likelihood Estimation (MLE) is one way to estimate the individual probabilities.
Unigram
$p(w) = \frac{count(w)}{count(vocab)}$
where
count(w) is the number of times the word w appears in the train data, and
count(vocab) is the number of unique words (the vocabulary) in the train data.
Bigram
$p(w_i \mid w_{i-1}) = \frac{count(w_{i-1},\, w_i)}{count(w_{i-1})}$
where
count(w_{i-1}, w_i) is the number of times the words w_{i-1}, w_i appear together in sequence (as a bigram) in the train data, and
count(w_{i-1}) is the number of times the word w_{i-1} appears in the train data. w_{i-1} is called the context.
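As a small illustration of these MLE estimates (a minimal sketch using collections.Counter; the toy corpus is flattened into a single word stream just for this example):
from collections import Counter

train_words = "an apple an orange".split()

unigram_counts = Counter(train_words)
bigram_counts = Counter(zip(train_words, train_words[1:]))

def p_bigram(w_prev, w):
    """MLE estimate p(w | w_prev) = count(w_prev, w) / count(w_prev)."""
    if unigram_counts[w_prev] == 0:
        return 0.0
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

print(p_bigram("an", "apple"))  # 1 / 2 = 0.5
print(p_bigram("an", "ant"))    # 0 / 2 = 0.0 (unseen bigram)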
Calculating Perplexity
As we have seen above, $p(s)$ is calculated by multiplying lots of small numbers, so it is not numerically stable given the limited precision of floating point numbers on a computer. Let's use the nice properties of log to simplify it. We know
$PP(s) = p(w_1, \ldots, w_n)^{-1/n} = 2^{-l}$, where $l = \frac{1}{n}\sum_{i=1}^{n} \log_2 p(w_i \mid \text{context})$
Example: Unigram model
Train Data: ["an apple", "an orange"]
Vocabulary: [an, apple, orange, UNK]
MLE estimates: p(an) = 2/4 = 0.5, p(apple) = p(orange) = 1/4 = 0.25, p(UNK) = 0/4 = 0
For the test sentence "an apple":
l = (np.log2(0.5) + np.log2(0.25))/2 = -1.5
np.power(2, -l) = 2.8284271247461903
For the test sentence "an ant":
l = (np.log2(0.5) + np.log2(0))/2 = -inf, so np.power(2, -l) = inf
Code
import nltk
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE

train_sentences = ['an apple', 'an orange']
tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent)))
                  for sent in train_sentences]
n = 1
train_data, padded_vocab = padded_everygram_pipeline(n, tokenized_text)
model = MLE(n)
model.fit(train_data, padded_vocab)

test_sentences = ['an apple', 'an ant']
tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent)))
                  for sent in test_sentences]

test_data, _ = padded_everygram_pipeline(n, tokenized_text)
for test in test_data:
    print("MLE Estimates:", [((ngram[-1], ngram[:-1]), model.score(ngram[-1], ngram[:-1])) for ngram in test])

test_data, _ = padded_everygram_pipeline(n, tokenized_text)
for i, test in enumerate(test_data):
    print("PP({0}):{1}".format(test_sentences[i], model.perplexity(test)))
Example: Bigram model
Train Data: "an apple", "an orange"
Padded Train Data: "<s> an apple </s>", "<s> an orange </s>"
Vocabulary: [<s>, </s>, an, apple, orange, UNK]
MLE estimates: p(an|<s>) = 2/2 = 1, p(apple|an) = p(orange|an) = 1/2 = 0.5, p(</s>|apple) = p(</s>|orange) = 1/1 = 1
For the test sentence "an apple", padded: "<s> an apple </s>"
l = (np.log2(p(an|<s>)) + np.log2(p(apple|an)) + np.log2(p(</s>|apple)))/3
  = (np.log2(1) + np.log2(0.5) + np.log2(1))/3 = -0.3333
np.power(2, -l) = 1.2599
For the test sentence "an ant", padded: "<s> an ant </s>"
l = (np.log2(p(an|<s>)) + np.log2(p(ant|an)) + np.log2(p(</s>|ant)))/3 = -inf, so np.power(2, -l) = inf
Code
import nltk
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE
from nltk.lm import Vocabulary

train_sentences = ['an apple', 'an orange']
tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent))) for sent in train_sentences]

n = 2
train_data = [nltk.bigrams(t, pad_right=True, pad_left=True, left_pad_symbol="<s>", right_pad_symbol="</s>") for t in tokenized_text]
words = [word for sent in tokenized_text for word in sent]
words.extend(["<s>", "</s>"])
padded_vocab = Vocabulary(words)
model = MLE(n)
model.fit(train_data, padded_vocab)

test_sentences = ['an apple', 'an ant']
tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent))) for sent in test_sentences]

test_data = [nltk.bigrams(t, pad_right=True, pad_left=True, left_pad_symbol="<s>", right_pad_symbol="</s>") for t in tokenized_text]
for test in test_data:
    print("MLE Estimates:", [((ngram[-1], ngram[:-1]), model.score(ngram[-1], ngram[:-1])) for ngram in test])

test_data = [nltk.bigrams(t, pad_right=True, pad_left=True, left_pad_symbol="<s>", right_pad_symbol="</s>") for t in tokenized_text]
for i, test in enumerate(test_data):
    print("PP({0}):{1}".format(test_sentences[i], model.perplexity(test)))

How to do Text classification using word2vec

I want to perform text classification using word2vec.
I have obtained vectors for the words:
ls = []
sentences = lines.split(".")
for i in sentences:
    ls.append(i.split())

model = Word2Vec(ls, min_count=1, size=4)
words = list(model.wv.vocab)
print(words)

vectors = []
for word in words:
    vectors.append(model[word].tolist())

data = np.array(vectors)
data
output:
array([[ 0.00933912, 0.07960335, -0.04559333, 0.10600036],
[ 0.10576613, 0.07267512, -0.10718666, -0.00804013],
[ 0.09459028, -0.09901826, -0.07074171, -0.12022413],
[-0.09893986, 0.01500741, -0.04796079, -0.04447284],
[ 0.04403428, -0.07966098, -0.06460238, -0.07369237],
[ 0.09352681, -0.03864434, -0.01743148, 0.11251986],.....])
How can I perform classification (product & non-product)?
You already have the array of word vectors in model.wv.syn0. If you print it, you will see an array containing the vector for each word.
You can see an example here using Python3:
import pandas as pd
import os
import gensim
import nltk as nl
from sklearn.linear_model import LogisticRegression
# Reading a csv file with text data
dbFilepandas = pd.read_csv('machine learning\\Python\\dbSubset.csv').apply(lambda x: x.astype(str).str.lower())

train = []
# Getting only the first 4 columns of the file
for sentences in dbFilepandas[dbFilepandas.columns[0:4]].values:
    train.extend(sentences)

# Create an array of tokens using nltk
tokens = [nl.word_tokenize(sentences) for sentences in train]
Now it's time to use the vector model; in this example we will fit a LogisticRegression.
# method 1 - pass the tokens to the Word2Vec class itself, so you don't need to call train() separately
model = gensim.models.Word2Vec(tokens, size=300, min_count=1, workers=4)

# method 2 - create a Word2Vec object and build the vocabulary before training
model = gensim.models.Word2Vec(size=300, min_count=1, workers=4)
# building vocabulary for training
model.build_vocab(tokens)
print("\n Training the word2vec model...\n")
# reducing the epochs will decrease the computation time
model.train(tokens, total_examples=len(tokens), epochs=4000)
# You can save your model if you want....

# The two datasets must be the same size
max_dataset_size = len(model.wv.syn0)

Y_dataset = []
# Get the last character of each line. In this case it is the department number,
# which will be the 0/1 (or other) class label. (To use words as labels you would
# need to extract them differently; this approach works for numbers.)
with open("dbSubset.csv", "r") as f:
    for line in f:
        lastchar = line.strip()[-1]
        if lastchar.isdigit():
            result = int(lastchar)
            Y_dataset.append(result)
        else:
            result = 40
clf = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial').fit(model.wv.syn0, Y_dataset[:max_dataset_size])
# Prediction of the first 15 samples of all features
predict = clf.predict(model.wv.syn0[:15, :])
# Calculating the score of the predictions
score = clf.score(model.wv.syn0, Y_dataset[:max_dataset_size])
print("\nPrediction word2vec : \n", predict)
print("Score word2vec : \n", score)
You can also calculate the similarity of words belonging to your created model dictionary:
print("\n\nSimilarity value : ",model.wv.similarity('women','men'))
You can find more functions to use here.
Your question is rather broad, but I will try to give you a first approach to classifying text documents.
First of all, I would decide how I want to represent each document as one vector. So you need a method that takes a list of vectors (of words) and returns one single vector. You want to avoid having the length of the document influence what this vector represents. You could, for example, take the mean:
def document_vector(array_of_word_vectors):
    return array_of_word_vectors.mean(axis=0)
where array_of_word_vectors is, for example, data in your code.
Now you can either play around a bit with distances (for example, cosine distance would be a nice first choice) and see how far certain documents are from each other, or, and that's probably the approach that brings faster results, you can use the document vectors to build a training set for a classification algorithm of your choice from scikit-learn, for example Logistic Regression.
The document vectors will become your matrix X, and your vector y is an array of 1s and 0s, depending on the binary category that you want the documents to be classified into.
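A minimal sketch of that second approach (the toy documents and labels below are made up purely for illustration; note that gensim 4 renamed the size parameter to vector_size):
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

# Hypothetical tokenized documents and binary labels (1 = product, 0 = non-product).
documents_tokens = [['this', 'phone', 'has', 'a', 'great', 'camera'],
                    ['the', 'weather', 'was', 'sunny', 'today'],
                    ['cheap', 'laptop', 'with', 'fast', 'delivery'],
                    ['we', 'went', 'hiking', 'last', 'weekend']]
y = np.array([1, 0, 1, 0])

# Train word vectors (in practice, reuse the model you already trained);
# use vector_size=4 instead of size=4 on gensim >= 4.
model = Word2Vec(documents_tokens, min_count=1, size=4)

def document_vector(array_of_word_vectors):
    # One vector per document: the mean of its word vectors.
    return array_of_word_vectors.mean(axis=0)

X = np.array([document_vector(np.array([model.wv[w] for w in doc])) for doc in documents_tokens])

clf = LogisticRegression().fit(X, y)
print(clf.predict(X))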

Using predict on new text with kmeans (sklearn)?

I have a very small list of short strings which I want to (1) cluster and (2) use that model to predict which cluster a new string belongs to.
Running the first part works fine; getting a prediction for the new string does not.
First Part
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
# List of short strings (definitions) to cluster
documents_lst = ['a small, narrow river',
'a continuous flow of liquid, air, or gas',
'a continuous flow of data or instructions, typically one having a constant or predictable rate.',
'a group in which schoolchildren of the same age and ability are taught',
'(of liquid, air, gas, etc.) run or flow in a continuous current in a specified direction',
'transmit or receive (data, especially video and audio material) over the Internet as a steady, continuous flow.',
'put (schoolchildren) in groups of the same age and ability to be taught together',
'a natural body of running water flowing on or under the earth']
# 1. Vectorize the text
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(documents_lst)
print('tfidf_matrix.shape: ', tfidf_matrix.shape)
# 2. Get the number of clusters to make .. (find a better way than random)
num_clusters = 3
# 3. Cluster the definitions
km = KMeans(n_clusters=num_clusters, init='k-means++').fit(tfidf_matrix)
clusters = km.labels_.tolist()
print(clusters)
Which returns:
tfidf_matrix.shape: (8, 39)
[0, 1, 0, 2, 1, 0, 2, 0]
Second Part
The failing part:
predict_doc = ['A stream is a body of water with a current, confined within a bed and banks.']
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(predict_doc)
print('tfidf_matrix.shape: ', tfidf_matrix.shape)
km.predict(tfidf_matrix)
The error:
ValueError: Incorrect number of features. Got 7 features, expected 39
FWIW: I somewhat understand that the training and prediction data have a different number of features after vectorizing ...
I am open to any solution including changing from kmeans to an algorithm more suitable for short text clustering.
Thanks in advance
For completeness, I will answer my own question with an answer from here, which doesn't answer that question but does answer mine.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

list1 = ["My name is xyz", "My name is pqr", "I work in abc"]
list2 = ["My name is xyz", "I work in abc"]

vectorizer = TfidfVectorizer(min_df=1, max_df=0.5, stop_words="english",
                             decode_error="ignore", ngram_range=(1, 3))
vec = vectorizer.fit(list1)         # train vec using list1
vectorized = vec.transform(list1)   # transform list1 using vec

km = KMeans(n_clusters=2, init='k-means++', n_init=10, max_iter=1000,
            tol=0.0001, verbose=0, random_state=None)
km.fit(vectorized)

list2Vec = vec.transform(list2)     # transform list2 using vec
km.predict(list2Vec)
Credit goes to IrshadBhat.
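Applied to the code in the question (reusing documents_lst and predict_doc from above), the key is to fit the vectorizer once on the training documents and only transform the new text:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Fit the vectorizer once on the training documents.
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(documents_lst)
km = KMeans(n_clusters=3, init='k-means++').fit(tfidf_matrix)

# Only transform (never fit_transform) the new text, so it lands
# in the same 39-dimensional feature space the model was trained on.
predict_doc = ['A stream is a body of water with a current, confined within a bed and banks.']
new_matrix = tfidf_vectorizer.transform(predict_doc)
print(km.predict(new_matrix))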
