Get feature vectors from BertForSequenceClassification - pytorch

I have successfully built a sentiment analysis tool with BertForSequenceClassification from huggingface/transformers to classify $tsla tweets as positive or negative.
However, I can't find out how to obtain the feature vector for each tweet (more specifically, the embedding of [CLS]) from my fine-tuned model.
More information about the model used:
model = BertForSequenceClassification.from_pretrained(OUTPUT_DIR, num_labels=num_labels)
model.config.output_hidden_states = True
tokenizer = BertTokenizer(OUTPUT_DIR+'vocab.txt')
However, when I run the code below the output variable only consists of the logits.
model.eval()
eval_loss = 0
nb_eval_steps = 0
preds = []
for input_ids, input_mask, segment_ids, label_ids in tqdm_notebook(eval_dataloader, desc="Evaluating"):
    input_ids = input_ids.to(device)
    input_mask = input_mask.to(device)
    segment_ids = segment_ids.to(device)
    label_ids = label_ids.to(device)
    with torch.no_grad():
        output = model(input_ids, token_type_ids=segment_ids, attention_mask=input_mask)

I ran into the same problem after fine-tuning BertForSequenceClassification. I understand that you want the hidden state of [CLS] as the representation of each tweet, right? Following the API documentation, I think the code is:
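model = BertForSequenceClassification.from_pretrained(OUTPUT_DIR, output_hidden_states=True)
outputs = model(input_ids, attention_mask=attn_masks)
logits, hidden_states = outputs[0], outputs[1]  # integer indexing works for both tuple and ModelOutput returns
cls_hidden_state = hidden_states[-1][:, 0, :]  # hidden state of the first token ([CLS]) in the last layer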
or
model = BertForSequenceClassification.from_pretrained(OUTPUT_DIR, output_hidden_states=True)
last_hidden_states = model.bert(input_ids, attn_masks)[0]
cls_hidden_state = last_hidden_states[:, 0, :]

BertForSequenceClassification is a wrapper that consists of two parts: the BERT model (attribute bert) and a classifier (attribute classifier).
You can call the underlying BERT model directly. If you pass your input to it, you get the hidden states back. It returns a tuple: the first member holds the last-layer hidden states for every token, and the second is the pooled [CLS] vector (the [CLS] hidden state passed through an extra dense layer with tanh).
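Both routes described in the answers can be compared in a small sketch. It uses a tiny, randomly initialised BertConfig so nothing has to be downloaded; every size below is an illustrative assumption and has nothing to do with the asker's fine-tuned model, for which from_pretrained(OUTPUT_DIR, output_hidden_states=True) would be used instead.

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

# Tiny random model so the sketch runs offline (all sizes are assumptions).
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=2,
    output_hidden_states=True,
)
model = BertForSequenceClassification(config)
model.eval()  # disable dropout so both routes give the same values

input_ids = torch.randint(0, config.vocab_size, (3, 8))  # 3 "tweets", 8 tokens each
attention_mask = torch.ones_like(input_ids)

with torch.no_grad():
    # Route 1: the full classifier. Index 0 is the logits,
    # index 1 the tuple of per-layer hidden states.
    outputs = model(input_ids, attention_mask=attention_mask)
    logits, hidden_states = outputs[0], outputs[1]
    cls_from_classifier = hidden_states[-1][:, 0, :]

    # Route 2: the wrapped encoder. Index 0 is the last layer for every token,
    # index 1 the pooled [CLS] vector (after an extra dense + tanh layer).
    bert_outputs = model.bert(input_ids, attention_mask=attention_mask)
    cls_from_encoder = bert_outputs[0][:, 0, :]
    pooled_cls = bert_outputs[1]

print(logits.shape, cls_from_classifier.shape, pooled_cls.shape)
```

The raw [CLS] hidden state is the same on both routes. The pooled vector is a different tensor because of the extra dense layer, so it is worth deciding deliberately which of the two to use as the feature vector.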

Related

Forward outputs on multiple sequences is wrong

I am using T5 to summarize multiple sequences as a batch. I want to reproduce the output of model.generate(input_ids) by calling the forward function (model(**inputs)). I know that forward() and generate() work completely differently (see this). To make them behave the same way, I take some sequences and call model.generate() on them to produce the corresponding outputs, which gives me (text, summary) pairs. Calling the forward function on these pairs one at a time reproduces the same outputs. However, when I call the forward function on a batch of sequences, the output is not the same. What am I missing?
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
model.resize_token_embeddings(len(tokenizer))
model.to("cuda")
model.eval()
# sequences
seq1 = "summarize: Calling the model (which means the forward method) uses the labels for teacher forcing. This means inputs to the decoder are the labels shifted by one"
output1 = "calling the model uses the labels for teacher forcing. inputs to the decoder"
seq2 = "summarize: When you call the generate method, the model is used in the autoregressive fashion"
output2 = "the model is used in the auto-aggressive fashion."
seq3 = "summarize: However, selecting the token is a hard decision, and the gradient cannot be propagated through this decision"
output3 = "the token is a hard decision, and the gradient cannot be propagated through this decision"
input_sequences = [seq1, seq2, seq3]
output_seq = [output1, output2, output3]
# encoding input and attention mask
encoding = tokenizer(
    input_sequences,
    padding="longest",
    max_length=128,
    truncation=True,
    return_tensors="pt",
)
input_ids, attention_mask = encoding.input_ids.to("cuda"), encoding.attention_mask.to("cuda")
# labels
target_encoding = tokenizer(
    output_seq, padding="longest", max_length=128, truncation=True
)
labels = target_encoding.input_ids
labels = torch.tensor(labels).to("cuda")
labels[labels == tokenizer.pad_token_id] = -100
# Call the models
logits = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels).logits
# Apply softmax() and batch_decode()
X = logits
X = F.softmax(X, dim=-1)
ids = X.argmax(dim=-1)
y = tokenizer.batch_decode(sequences=ids, skip_special_tokens=True)
# results: batch_size=3
['call the model uses the labels for teacher forcing inputs to the decoder are',
'the model is used in the auto-aggressive fashion the the the',
'the token is a hard decision, and the gradient cannot be propagated through this decision ']
# results: batch_size =1 i.e. consider 1 seq each time
['call the model uses the labels for teacher forcing inputs to the decoder are']
['the model is used in the auto-aggressive fashion ']
['the token is a hard decision, and the gradient cannot be propagated through this decision ']
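One possible explanation, offered as an assumption and not a confirmed diagnosis: in a batch, the shorter targets are padded, and the logits at those padded label positions (the ones set to -100) are not constrained by anything, so their argmax can decode to stray tokens such as the trailing "the the the". A minimal sketch of discarding those positions before decoding, with made-up token ids standing in for the real T5 output:

```python
import torch

pad_token_id = 0  # T5's pad id; hard-coded so the sketch needs no tokenizer

# Hypothetical argmax ids for a batch of two targets (the second is shorter).
# The 99s stand for arbitrary predictions at padded label positions.
ids = torch.tensor([[11, 12, 13, 14, 1],
                    [21, 22, 1, 99, 99]])

# Matching labels, with padding already replaced by -100 as in the question.
labels = torch.tensor([[11, 12, 13, 14, 1],
                       [21, 22, 1, -100, -100]])

# Overwrite predictions at ignored positions with the pad id so that
# decoding with skip_special_tokens=True drops them.
masked_ids = ids.masked_fill(labels == -100, pad_token_id)
print(masked_ids.tolist())
```

With the real model, tokenizer.batch_decode(masked_ids, skip_special_tokens=True) would be called on the masked tensor in place of the raw argmax ids.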

Sentiment Analysis using LSTM (model does not generate good output)

I made a sentiment analysis model using an LSTM, but it gives very bad predictions.
Here is the complete code
Dataset for amazon review
My LSTM model looks like this:
def ltsm_model(input_shape, word_to_vec_map, word_to_index):
    """
    Function creating the ltsm_model model's graph.
    Arguments:
    input_shape -- shape of the input, usually (max_len,)
    word_to_vec_map -- dictionary mapping every word in a vocabulary into its 50-dimensional vector representation
    word_to_index -- dictionary mapping from words to their indices in the vocabulary (400,001 words)
    Returns:
    model -- a model instance in Keras
    """
    ### START CODE HERE ###
    # Define sentence_indices as the input of the graph, it should be of shape input_shape and dtype 'int32' (as it contains indices).
    sentence_indices = Input(shape=input_shape, dtype='int32')
    # Create the embedding layer pretrained with GloVe Vectors (≈1 line)
    embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index)
    # Propagate sentence_indices through your embedding layer, you get back the embeddings
    embeddings = embedding_layer(sentence_indices)
    # Propagate the embeddings through an LSTM layer with 128-dimensional hidden state
    # Be careful, the returned output should be a batch of sequences.
    X = LSTM(128, return_sequences=True)(embeddings)
    # Add dropout with a probability of 0.5
    X = Dropout(0.5)(X)
    # Propagate X through another LSTM layer with 128-dimensional hidden state
    # Be careful, the returned output should be a single hidden state, not a batch of sequences.
    X = LSTM(128, return_sequences=False)(X)
    # Add dropout with a probability of 0.5
    X = Dropout(0.5)(X)
    # Propagate X through a Dense layer with softmax activation to get back a batch of 5-dimensional vectors.
    X = Dense(2, activation='relu')(X)
    # Add a softmax activation
    X = Activation('softmax')(X)
    # Create Model instance which converts sentence_indices into X.
    model = Model(inputs=[sentence_indices], outputs=X)
    ### END CODE HERE ###
    return model
Here is what my training dataset looks like:
This is my testing data:
x_test = np.array(['amazing!: this soundtrack is my favorite music..'])
X_test_indices = sentences_to_indices(x_test, word_to_index, maxLen)
print(x_test[0] +' '+ str(np.argmax(model.predict(X_test_indices))))
I got the following output for this:
amazing!: this soundtrack is my favorite music.. 0
But the sentiment is positive, so the prediction should be 1.
This is also my model's fit output:
How can I improve my model's performance? It is a pretty bad model, I suppose.
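One detail in the model above worth checking, offered as a possibility and not a diagnosis of the poor predictions: the final Dense(2, activation='relu') clamps negative values to zero before the softmax. Whenever both pre-activations are negative, the softmax then sees two zeros and returns exactly 0.5 for each class. A minimal pure-Python sketch of that effect, with made-up numbers:

```python
import math

def softmax(logits):
    # Numerically stable softmax over a small list
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def relu(logits):
    return [max(0.0, z) for z in logits]

raw = [-3.0, -0.5]  # hypothetical pre-activation outputs of the last Dense layer

plain = softmax(raw)          # softmax applied directly to the raw values
clamped = softmax(relu(raw))  # ReLU first, as in the model above

print(plain)
print(clamped)
```

If this turns out to matter, one option would be a plain Dense(2) with no activation followed by the softmax, so the raw scores reach the softmax unchanged.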

How to stack same RNN for every layer?

I would like to know how to stack many RNN layers where every layer is the same RNN, that is, every layer shares the same weights. I have read about stacked LSTMs and RNNs, but in those examples each layer has its own weights.
1 layer code:
inputs = keras.Input(shape=(maxlen,), batch_size = batch_size)
Emb_layer = layers.Embedding(max_features,word_dim)
Emb_output = Emb_layer(inputs)
first_layer = layers.SimpleRNN(n_hidden,use_bias=True,return_sequences=False,stateful =False)
first_layer_output = first_layer(Emb_output)
dense_layer = layers.Dense(1, activation='sigmoid')
dense_output = dense_layer(first_layer_output )
model = keras.Model(inputs=inputs, outputs=dense_output)
model.summary()
[Image: RNN 1 layer]
inputs = keras.Input(shape=(maxlen,), batch_size = batch_size)
Emb_layer = layers.Embedding(max_features,word_dim)
Emb_output = Emb_layer(inputs)
first_layer = layers.SimpleRNN(n_hidden,use_bias=True,return_sequences=True,stateful =True)
first_layer_output = first_layer(Emb_output)
first_layer_state = first_layer.states
second_layer = layers.SimpleRNN(n_hidden,use_bias=True,return_sequences=False,stateful =False)
second_layer_set_state = second_layer(first_layer_output, initial_state=first_layer_state)
dense_layer = layers.Dense(1, activation='sigmoid')
dense_output = dense_layer(second_layer_set_state )
model = keras.Model(inputs=inputs, outputs=dense_output)
model.summary()
[Image: Stack RNN 2 layer]
For example, I want to build a two-layer RNN in which the first and second layers have the same weights, so that updating the weights of the first layer also updates the second and both always share the same values. As far as I know, TF has RNN.state, which returns the state from the previous layer. However, when I use it, each layer still seems to be treated independently. The 2-layer RNN I want should have the same number of trainable parameters as the 1-layer one, since the weights are shared, but this did not work.
You can view the layer object as a container for the weights that knows how to apply the weights. You can use the layer object as many times as you want. Assuming the embedding and the RNN dimension are the same, you can do:
states = Emb_layer(inputs)
first_layer = layers.SimpleRNN(n_hidden, use_bias=True, return_sequences=True)
for _ in range(10):
    states = first_layer(states)
There is no reason to set stateful to true. That option is used when you split long sequences into multiple batches and want the RNN to remember the state between batches, so that you do not have to set the initial states manually. You can get the final state of the RNN (the one you want to use for classification) by simply indexing the last position of states.
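A minimal sketch of the idea above, checking that reusing one layer object does not add parameters. The sizes are illustrative assumptions, and word_dim is set equal to n_hidden so the same RNN can consume its own output.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Illustrative sizes (assumptions, not taken from the question)
maxlen, max_features, n_hidden = 20, 1000, 16
word_dim = n_hidden  # required so the shared RNN can be applied repeatedly

def build(num_passes):
    inputs = keras.Input(shape=(maxlen,))
    states = layers.Embedding(max_features, word_dim)(inputs)
    shared_rnn = layers.SimpleRNN(n_hidden, use_bias=True, return_sequences=True)
    for _ in range(num_passes):
        states = shared_rnn(states)  # same layer object, hence the same weights
    last_state = states[:, -1, :]    # final time step, used for classification
    outputs = layers.Dense(1, activation="sigmoid")(last_state)
    return keras.Model(inputs=inputs, outputs=outputs)

one_pass = build(1)
two_pass = build(2)
print(one_pass.count_params(), two_pass.count_params())
```

Both models report the same parameter count, which is the behaviour the question asks for: the second pass reuses the weights of the first.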

How to sample Logits and Probabilities from a transformer seq2seq model for reinforcement learning?

Skipping the formalities:
I am trying to apply reinforcement learning to a transformer-based seq2seq model (for abstractive summarization purposes) in PyTorch.
My current setup looks something like this:
I am getting a greedy distribution (summary) from the model by inferring one token at a time in a loop:
import torch
import torch.nn.functional as F
def get_greedy_distribution(model, batch):
    src, (shift_tgt, lbl_tgt), segs, clss, mask_src, mask_tgt, mask_cls = batch
    # the mock targets are just torch.zeros tensors to store inferred tokens
    mock_tgt = get_mock_tgt(shift_tgt)
    mock_return = get_mock_tgt(shift_tgt)
    max_length = shift_tgt.shape[1]
    with torch.no_grad():
        for i in range(0, max_length-1):
            prediction = model(src, mock_tgt, segs, clss, mask_src, mask_tgt, mask_cls)
            prediction = F.softmax(prediction, dim=2)
            val, ix = prediction.data.topk(1)
            mock_tgt[:, i+1] = ix.squeeze()[:, i].detach()
            mock_return[:, i] = ix.squeeze()[:, i].detach()
    return mock_return
I am getting a sample distribution, with probabilities, from the model in a similar way:
from torch.distributions import Categorical
def get_distribution(model, batch):
    src, (shift_tgt, lbl_tgt), segs, clss, mask_src, mask_tgt, mask_cls = batch
    mock_tgt = get_mock_tgt(shift_tgt)
    mock_return = get_mock_tgt(shift_tgt)
    max_length = shift_tgt.shape[1]
    log_probs = []
    for i in range(0, max_length-1):
        prediction = model(src, mock_tgt, segs, clss, mask_src, mask_tgt, mask_cls)
        prediction = F.softmax(prediction, dim=2)
        multi_dist = Categorical(prediction[:, i])
        x_t = multi_dist.sample()
        log_prob = multi_dist.log_prob(x_t)
        mock_tgt[:, i+1] = x_t
        mock_return[:, i] = x_t
        log_probs.append(log_prob)
    return mock_return, log_probs
However, I am a bit unsure if I am inferring the sample distribution correctly. This would work well in an RNN context where I can sample logits and probabilities during the typical RNN loop, but it feels slightly wrong when using a Transformer.
How would you suggest to approach the Transformer for a typical baseline-sampled reinforcement learning setup (I am guessing it is a policy gradient)?
PyTorch code is preferred, but if you have TensorFlow examples I am sure I can figure it out.
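Assuming the per-step log-probabilities are collected as in get_distribution above, one common way to combine them with a greedy baseline is a self-critical REINFORCE loss. The sketch below is only an illustration under that assumption: self_critical_loss is a hypothetical helper, the rewards are made-up numbers standing in for a summary metric, and a toy uniform distribution replaces the real model.

```python
import torch

def self_critical_loss(log_probs, sample_reward, greedy_reward):
    """REINFORCE loss with the greedy rollout's reward as the baseline.

    log_probs     -- list of (batch,) tensors, one per sampled time step
    sample_reward -- (batch,) reward of the sampled summaries
    greedy_reward -- (batch,) reward of the greedy summaries (the baseline)
    """
    seq_log_prob = torch.stack(log_probs, dim=1).sum(dim=1)  # (batch,)
    advantage = (sample_reward - greedy_reward).detach()     # no gradient through rewards
    return -(advantage * seq_log_prob).mean()

# Toy stand-in for the model output: (batch, steps, vocab) logits
logits = torch.zeros(2, 3, 5, requires_grad=True)
dist = torch.distributions.Categorical(logits=logits)
tokens = dist.sample()                                      # (batch, steps)
step_log_probs = list(dist.log_prob(tokens).unbind(dim=1))  # 3 tensors of shape (batch,)

loss = self_critical_loss(
    step_log_probs,
    sample_reward=torch.tensor([0.6, 0.2]),  # made-up rewards
    greedy_reward=torch.tensor([0.4, 0.4]),
)
loss.backward()  # gradients flow only through the sampled log-probabilities
```

The greedy rollout can stay under torch.no_grad(), as in get_greedy_distribution, because it only supplies the baseline reward; the sampled rollout must keep its graph so the log-probabilities can be differentiated.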

Attribute error: None type has no attribute summary in keras

I have been trying to deepen my understanding of word embeddings and NLP in Keras by implementing (and partly copying) code that creates a Keras model with the functional API. When I call model.summary() I get an AttributeError: 'NoneType' object has no attribute 'summary'.
I made many attempts, reducing the number of layers and the dimension of the word embedding matrix, but unfortunately nothing changed. I don't know what to do.
def pretrained_embedding_layer(word_to_vec, word_to_index):
    vocab_len = len(word_to_index) + 1
    emb_dim = word_to_vec["sole"].shape[0]
    emb_matrix = np.zeros((vocab_len, emb_dim))
    for word, index in word_to_index.items():
        emb_matrix[index, :] = word_to_vec[word]
    print(emb_matrix.shape)
    embedding_layer = Embedding(vocab_len, emb_dim, trainable=False)
    embedding_layer.build((None,))
    embedding_layer.set_weights([emb_matrix])
    return embedding_layer
def Chatbot_V1(input_shape, word_to_vec, word_to_index):
    # Define sentence_indices as the input of the graph, it should be of shape input_shape and dtype 'int32' (as it contains indices).
    sentence_indices = Input(input_shape, dtype='int32')
    # Create the embedding layer pretrained with GloVe Vectors (≈1 line)
    embedding_layer = pretrained_embedding_layer(word_to_vec, word_to_index)
    embeddings = embedding_layer(sentence_indices)
    # Propagate the embeddings through an LSTM layer with 128-dimensional hidden state
    X = LSTM(128, return_sequences=True)(embeddings)
    # Add dropout with a probability of 0.5
    X = Dropout(0.5)(X)
    # Propagate X through another LSTM layer with 128-dimensional hidden state
    # Be careful, the returned output should be a single hidden state, not a batch of sequences.
    X = LSTM(128, return_sequences=True)(X)
    # Add dropout with a probability of 0.5
    X = Dropout(0.5)(X)
    # Propagate X through a Dense layer with softmax activation to get back a batch of vocab_dim dimensional vectors.
    X = Dense(vocab_dim)(X)
    # Add a softmax activation
    preds = Activation('softmax')(X)
    # Create Model instance which converts sentence_indices into X.
    model = Model(sentence_indices, preds)
model = Chatbot_V1((maxLen,), word_to_vec, word_to_index)
model.summary()
Launching model.summary:
AttributeError: 'NoneType' object has no attribute 'summary'
Why? What is wrong in the layer definitions?
The function Chatbot_V1 does not return anything. In Python, a function without a return statement returns None, so None is what gets assigned to model. Just use the return keyword to return the model at the end of Chatbot_V1.
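The behaviour described above can be reproduced without Keras. In this small sketch a dictionary stands in for the model, and both function names are hypothetical:

```python
def build_without_return():
    model = {"name": "toy model"}  # stand-in for a Keras Model
    # no return statement, so the caller receives None

def build_with_return():
    model = {"name": "toy model"}
    return model

missing = build_without_return()
present = build_with_return()

print(missing)
print(present)
```

Calling a method on missing fails the same way as in the question, because the name is bound to None and not to the object built inside the function.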

Resources