Forward outputs on multiple sequences is wrong - nlp

I am using T5 to summarize multiple sequences as a batch. Here I want to generate the output of model.generate(input_ids) by calling forward function (model(**inputs)). I know that forward() and generate() work completely different see this. To make them working the same way. I take some sequences and call model.generate() on them to generate the corresponding outputs and get pairs of (text, summary). Now, Calling the forward function on these pairs one each time generates the same outputs. However, when calling the forward function on batch of sequences, the output is not the same ? What I missed ?
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
model.resize_token_embeddings(len(tokenizer))
model.to("cuda")
model.eval()
# sequences
seq1 = "summarize: Calling the model (which means the forward method) uses the labels for teacher forcing. This means inputs to the decoder are the labels shifted by one"
output1 = "calling the model uses the labels for teacher forcing. inputs to the decoder"
seq2 = "summarize: When you call the generate method, the model is used in the autoregressive fashion"
output2 = "the model is used in the auto-aggressive fashion."
seq3 = "summarize: However, selecting the token is a hard decision, and the gradient cannot be propagated through this decision"
output3 = "the token is a hard decision, and the gradient cannot be propagated through this decision"
input_sequences = [seq1, seq2, seq3]
output_seq = [output1, output2, output3]
# encoding input and attention mask
encoding = tokenizer(
input_sequences,
padding="longest",
max_length=128,
truncation=True,
return_tensors="pt",
)
input_ids, attention_mask = encoding.input_ids.to("cuda"), encoding.attention_mask.to("cuda")
# labels
target_encoding = tokenizer(
output_seq, padding="longest", max_length=128, truncation=True
)
labels = target_encoding.input_ids
labels = torch.tensor(labels).to("cuda")
labels[labels == tokenizer.pad_token_id] = -100
# Call the models
logits = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels).logits
# Apply softamx() and batch_decode()
X = logits
X = F.softmax(X, dim=-1)
ids = X.argmax(dim=-1)
y = tokenizer.batch_decode(sequences=ids, skip_special_tokens=True)
# results: batch_size=3
['call the model uses the labels for teacher forcing inputs to the decoder are',
'the model is used in the auto-aggressive fashion the the the',
'the token is a hard decision, and the gradient cannot be propagated through this decision ']
# results: batch_size =1 i.e. consider 1 seq each time
['call the model uses the labels for teacher forcing inputs to the decoder are']
['the model is used in the auto-aggressive fashion ']
['the token is a hard decision, and the gradient cannot be propagated through this decision ']

Related

How to extract Sentence Embedding Using BERT model from [CLS] token

I am following this link:
BERT document embedding
I want to extract sentence-embedding using BERT model using CLS token. Here is the code:
import torch
from keras.preprocessing.sequence import pad_sequences
import tensorflow as tf
def text_to_embedding(tokenizer, model, in_text):
'''
Uses the provided BERT 'model' and 'tokenizer' to generate a vector
representation of the input string, 'in_text'.
Returns the vector stored as a numpy ndarray.
'''
# ===========================
# STEP 1: Tokenization
# ===========================
MAX_LEN = 510
# 'encode' will:
# (1) Tokenize the sentence
# (2) Prepend the '[CLS]' token to the start.
# (3) Append the '[SEP]' token to the end.
# (4) Map tokens to their IDs.
input_ids = tokenizer.encode(
in_text, # sentence to encode.
add_special_tokens = True, # Add '[CLS]' and '[SEP]'
max_length = MAX_LEN, # Truncate all sentences.
#return_tensors = 'pt' # Return pytorch tensors.
)
print(input_ids)
print(tokenizer.decode(input_ids))
# Pad our input tokens. Truncation was handled above by the 'encode'
# function, which also makes sure that the '[SEP]' token is placed at the
# end *after* truncating.
# Note: 'pad_sequences' expects a list of lists, but we only have one
# piece of text, so we surround 'input_ids' with an extra set of brackets.
results = tokenizer(in_text, max_length=MAX_LEN, truncation=True)
input_ids = results.input_ids
attn_mask = results.attention_mask
print(results)
# Cast to tensors.
input_ids = torch.tensor(input_ids)
attn_mask = torch.tensor(attn_mask)
# Add an extra dimension for the "batch" (even though there is only one
# input in this batch)
input_ids = input_ids.unsqueeze(0)
attn_mask = attn_mask.unsqueeze(0)
# ===========================
# STEP 1: Tokenization
# ===========================
# Put the model in evaluation mode--the dropout layers behave differently
# during evaluation.
#model.eval()
# Copy the inputs to the GPU
#input_ids = input_ids.to(device)
#attn_mask = attn_mask.to(device)
# telling the model not to build the backward graph will make this
# a little quicker.
with torch.no_grad():
# Forward pass, returns hidden states and predictions
# This will return the logits rather than the loss because we have
# not provided labels.
outputs = model(input_ids = input_ids,token_type_ids = None,attention_mask = attn_mask)
hidden_states = outputs[2]
#Sentence Vectors
#To get a single vector for our entire sentence we have multiple
#application-dependent strategies, but a simple approach is to
#average the second to last hiden layer of each token producing
#a single 768 length vector.
# `hidden_states` has shape [13 x 1 x ? x 768]
# `token_vecs` is a tensor with shape [? x 768]
token_vecs = hidden_states[-2][0]
# Calculate the average of all ? token vectors.
sentence_embedding = torch.mean(token_vecs, dim=0)
# Move to the CPU and convert to numpy ndarray.
sentence_embedding = sentence_embedding.detach().cpu().numpy()
return(sentence_embedding)
from transformers import BertTokenizer, BertModel
# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased',output_hidden_states = True), # Whether the model returns all hidden-states.
#model.cuda()
from transformers import BertTokenizer
# Load the BERT tokenizer.
print('Loadin BERT tokenizer...')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
k=text_to_embedding(tokenizer, model, "I like to play cricket")
Output:
<ipython-input-14-f03410b60544> in text_to_embedding(tokenizer, model, in_text)
77 # This will return the logits rather than the loss because we have
78 # not provided labels.
---> 79 outputs = model(input_ids = input_ids,token_type_ids = None,attention_mask = attn_mask)
80
81
TypeError: 'tuple' object is not callable
I get an error in this line outputs = model(input_ids = input_ids,token_type_ids = None,attention_mask = attn_mask)
Instead of using average of hidden layer, I want to modify code to get embedding for input sentence using CLS token.
There are 3 ways you can approach your problem-
There exists a very cool tool called bert-as-service. It maps a sentence to a fixed length word embeddings based on the model you choose to use. The documentation is very well written.
Install
pip install bert-serving-server # server
pip install bert-serving-client # client, independent of bert-serving-server
Download one of the pre-trained models available at official BERT repo- link
Start the server
bert-serving-start -model_dir /model_directory/ -num_worker=4
Generate embedding
from bert_serving.client import BertClient
bc = BertClient()
vectors=bc.encode(your_list_of_sentences)
There exist an academic paper by name of Sentence-BERT and their github repo
You are doing a lot of manual work- padding attn mask etc.
Toeknizer does it for you automatically, check the documentation. And, if you see the implementation of the forward() call of the model, it returns-
return (sequence_output, pooled_output) + encoder_outputs[1:]
For the bert base (768 hidden states), sequence output is the embedding of all token of the sequence, so if your input size[ max_len] is 510, then each token is embedded is a space of 768 dimension making sequence output of size- 768 * 510 *1
Pooled output is the one where all the embeddings are squished into a space of 768*1 dimension.
So I think you will want to use Pooled output for simple embeddings.

BERT document embedding

I am trying to do document embedding using BERT. The code I use is a combination of two sources. I use BERT Document Classification Tutorial with Code, and BERT Word Embeddings Tutorial. Below is the code, I feed the first 510 tokens of each document to the BERT model. Finally, I apply K-means clustering to these embeddings, but the members of each cluster are TOTALLY irrelevant. I am wondering how this is possible. Maybe something is wrong with my code. I would appreciate if you take a look at my code and tell if there is something wrong with it. I use Google colab to run this code.
# text_to_embedding function
import torch
from keras.preprocessing.sequence import pad_sequences
def text_to_embedding(tokenizer, model, in_text):
'''
Uses the provided BERT 'model' and 'tokenizer' to generate a vector
representation of the input string, 'in_text'.
Returns the vector stored as a numpy ndarray.
'''
# ===========================
# STEP 1: Tokenization
# ===========================
MAX_LEN = 510
# 'encode' will:
# (1) Tokenize the sentence
# (2) Prepend the '[CLS]' token to the start.
# (3) Append the '[SEP]' token to the end.
# (4) Map tokens to their IDs.
input_ids = tokenizer.encode(
in_text, # sentence to encode.
add_special_tokens = True, # Add '[CLS]' and '[SEP]'
max_length = MAX_LEN, # Truncate all sentences.
#return_tensors = 'pt' # Return pytorch tensors.
)
# Pad our input tokens. Truncation was handled above by the 'encode'
# function, which also makes sure that the '[SEP]' token is placed at the
# end *after* truncating.
# Note: 'pad_sequences' expects a list of lists, but we only have one
# piece of text, so we surround 'input_ids' with an extra set of brackets.
results = pad_sequences([input_ids], maxlen=MAX_LEN, dtype="long",
value=0, truncating="post", padding="post")
# Remove the outer list.
input_ids = results[0]
# Create attention masks.
attn_mask = [int(i > 0) for i in input_ids]
# Cast to tensors.
input_ids = torch.tensor(input_ids)
attn_mask = torch.tensor(attn_mask)
# Add an extra dimension for the "batch" (even though there is only one
# input in this batch)
input_ids = input_ids.unsqueeze(0)
attn_mask = attn_mask.unsqueeze(0)
# ===========================
# STEP 1: Tokenization
# ===========================
# Put the model in evaluation mode--the dropout layers behave differently
# during evaluation.
model.eval()
# Copy the inputs to the GPU
input_ids = input_ids.to(device)
attn_mask = attn_mask.to(device)
# telling the model not to build the backward graph will make this
# a little quicker.
with torch.no_grad():
# Forward pass, returns hidden states and predictions
# This will return the logits rather than the loss because we have
# not provided labels.
outputs = model(
input_ids = input_ids,
token_type_ids = None,
attention_mask = attn_mask)
hidden_states = outputs[2]
#Sentence Vectors
#To get a single vector for our entire sentence we have multiple
#application-dependent strategies, but a simple approach is to
#average the second to last hiden layer of each token producing
#a single 768 length vector.
# `hidden_states` has shape [13 x 1 x ? x 768]
# `token_vecs` is a tensor with shape [? x 768]
token_vecs = hidden_states[-2][0]
# Calculate the average of all ? token vectors.
sentence_embedding = torch.mean(token_vecs, dim=0)
# Move to the CPU and convert to numpy ndarray.
sentence_embedding = sentence_embedding.detach().cpu().numpy()
return(sentence_embedding)
from transformers import BertTokenizer, BertModel
# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased',
output_hidden_states = True, # Whether the model returns all hidden-states.
)
model.cuda()
from transformers import BertTokenizer
# Load the BERT tokenizer.
print('Loadin BERT tokenizer...')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
I don't know if it solves your problem but here's my 2 cent:
You don't have to calculate the attention mask and do the padding manually. Have a look at the documentation. Just call the tokenizer itself:
results = tokenizer(in_text, max_length=MAX_LEN, truncation=True)
input_ids = results.input_ids
attn_mask = results.attention_mask
# Cast to tensors
...
Instead of using the average of the second to last hidden layer, you can try the same thing with the last hidden layer; or you can use the vector represents [CLS] from the last layer

Get feature vectors from BertForSequenceClassification

I have successfully build a sentiment analysis tool with BertForSequenceClassification from huggingface/transformers to classify $tsla tweets as positive or negative.
However, I can't find out how I can obtain the feature vectors per tweet (more specifically the embedding of [CLS]) from my finetuned model.
more info of used model:
model = BertForSequenceClassification.from_pretrained(OUTPUT_DIR, num_labels=num_labels)
model.config.output_hidden_states = True
tokenizer = BertTokenizer(OUTPUT_DIR+'vocab.txt')
However, when I run the code below the output variable only consists of the logits.
model.eval()
eval_loss = 0
nb_eval_steps = 0
preds = []
for input_ids, input_mask, segment_ids, label_ids in tqdm_notebook(eval_dataloader, desc="Evaluating"):
input_ids = input_ids.to(device)
input_mask = input_mask.to(device)
segment_ids = segment_ids.to(device)
label_ids = label_ids.to(device)
with torch.no_grad():
output = model(input_ids,token_type_ids= segment_ids,attention_mask= input_mask)
I also have this problem after fine-tuning BertForSequenceClassification. I know your purpose is to get the hidden state of [CLS] as the representation of each tweet. Right? As the instruction of API document, I think the code is:
model = BertForSequenceClassification.from_pretrained(OUTPUT_DIR, output_hidden_states=True)
logits, hidden_states = model(input_ids, attn_masks)
cls_hidden_state = hidden_states[-1][:, 0, :] # the first hidden state in last layer
or
model = BertForSequenceClassification.from_pretrained(OUTPUT_DIR, output_hidden_states=True)
last_hidden_states = model.bert(input_ids, attn_masks)[0]
cls_hidden_state = last_hidden_states[:, 0, :]
BertForSequenceClassification is a wrapper that consists of two parts: BERT model (attribute bert) and a classifier (attribute classifier).
You can call directly the underling BERT model. If you pass your input directly to it, you will get the hidden states. It returns a tuple: the first member of the tuple are all hidden states, the second one is the [CLS] vector.

How to enforce LSTM to learn monotonic sequences?

I'm using LSTM with Keras to forecast a set of sequences. Here my basic model:
inputs = Input(shape=(1,seq_dim)) #seq_dim = 2
# shape = (timesteps, featdim) = (1,2) since my input sequences are pair of values
# I want to predict the sequence of the fist values in the pairs
se = LSTM(lstm_size)(inputs)
out = Dense(1)(se) # I want to forecast one value
model = Model(inputs=inputs, outputs=out)
I know for sure that the sequences start from 0 and are monotonic (not-decreasing).
I tried with the Maximum() layer
max_out = Maximum()([output_seq,input_seq])
Here the model
inputs = Input(shape=(1,seq_dim))
# shape = (timesteps, featdim) = (1,2) since my input sequences are pair of values
# I want to predict the sequence of the fist values in the pairs
se = LSTM(lstm_size)(inputs)
out = Dense(1)(se) # I want to forecast one value
# max between the output and the previous value of the sequence (current input)
max_out = Maximum()([out,inputs[:,:,0]])
model = Model(inputs=inputs, outputs=max_out)
however at compiling the model an error is raised:
"AttributeError: 'Tensor' object has no attribute '_keras_history'"
I've also tried with a Lambda layer but it raises the same error.
max_out = Lambda(lambda x: K_BACKEND.max(x))([out,inputs[:,:,0]])
How can I add this constrain to my model? Is it possible to do in the architecture definitio (as I'm trying to do), or by editing the loss function?
thanks in advance
Try this
max_out = Lambda( lambda oi: K_BACKEND.maximum( oi[0], oi[1][:,:,0], axis=-1)),output_shape=lambda oi : oi[0] )([out,inputs]).

Keras Implementation of Customized Loss Function that need internal layer output as label

in keras, I want to customize my loss function which not only takes (y_true, y_pred) as input but also need to use the output from the internal layer of the network as the label for an output layer.This picture shows the Network Layout
Here, the internal output is xn, which is a 1D feature vector. in the upper right corner, the output is xn', which is the prediction of xn. In other words, xn is the label for xn'.
While [Ax, Ay] is traditionally known as y_true, and [Ax',Ay'] is y_pred.
I want to combine these two loss components into one and train the network jointly.
Any ideas or thoughts are much appreciated!
I have figured out a way out, in case anyone is searching for the same, I posted here (based on the network given in this post):
The idea is to define the customized loss function and use it as the output of the network. (Notation: A is the true label of variable A, and A' is the predicted value of variable A)
def customized_loss(args):
#A is from the training data
#S is the internal state
A, A', S, S' = args
#customize your own loss components
loss1 = K.mean(K.square(A - A'), axis=-1)
loss2 = K.mean(K.square(S - S'), axis=-1)
#adjust the weight between loss components
return 0.5 * loss1 + 0.5 * loss2
def model():
#define other inputs
A = Input(...) # define input A
#construct your model
cnn_model = Sequential()
...
# get true internal state
S = cnn_model(prev_layer_output0)
# get predicted internal state output
S' = Dense(...)(prev_layer_output1)
# get predicted A output
A' = Dense(...)(prev_layer_output2)
# customized loss function
loss_out = Lambda(customized_loss, output_shape=(1,), name='joint_loss')([A, A', S, S'])
model = Model(input=[...], output=[loss_out])
return model
def train():
m = model()
opt = 'adam'
model.compile(loss={'joint_loss': lambda y_true, y_pred:y_pred}, optimizer = opt)
# train the model
....
First of all you should be using the Functional API. Then you should define the network output as the output plus the result from the internal layer, merge them into a single output (by concatenating), and then make a custom loss function that then splits the merged output into two parts and does the loss computations on its own.
Something like:
def customLoss(y_true, y_pred):
#loss here
internalLayer = Convolution2D()(inputs) #or other layers
internalModel = Model(input=inputs, output=internalLayer)
tmpOut = Dense(...)(internalModel)
mergedOut = merge([tmpOut, mergedOut], mode = "concat", axis = -1)
fullModel = Model(input=inputs, output=mergedOut)
fullModel.compile(loss = customLoss, optimizer = "whatever")
I have my reservations regarding this implementation. The loss computed at the merged layer is propagated back to both the merged branches. Generally you would like to propagate it through just one layer.

Resources