I'm new enough to TensorFlow and Keras that I might be missing something obvious, but this is driving me nuts. I inherited an app that trains a custom convolutional LSTM, and I just spent the last two months or so straightening out some really atrocious data wrangling, only to discover I can't get the model to train properly.
Here's the model definition (with hard values substituted for the variables in my actual code):
from tensorflow.keras import layers

inputs = layers.Input(shape = (2, 23, 23, 10))
outputs = layers.ConvLSTM2D(filters = 32,
                            kernel_size = (5, 5),
                            padding = "same",
                            return_sequences = True,
                            stateful = False,
                            activation = "relu")(inputs)
outputs = layers.BatchNormalization()(outputs)
outputs = layers.ConvLSTM2D(filters = 32,
                            kernel_size = (3, 3),
                            padding = "same",
                            return_sequences = True,
                            stateful = False,
                            activation = "relu")(outputs)
outputs = layers.ConvLSTM2D(filters = 32,
                            kernel_size = (3, 3),
                            padding = "same",
                            return_sequences = True,
                            stateful = False,
                            activation = "relu")(outputs)
outputs = layers.Conv3D(filters = 1,
                        kernel_size = (3, 3, 3),
                        padding = "same",
                        activation = "sigmoid")(outputs)
If something looks squirrely there, let me know--I didn't create this model (though I did change the input grid from 100 x 100 to 23 x 23, if that makes a difference). The idea is to predict the intensity of a particular weather phenomenon (that's a single number for each grid point at each time).
Here's the model summary produced after the model is defined:
Model: "model"
_________________________________________________________________
Layer (type)                  Output Shape              Param #
=================================================================
input_1 (InputLayer)          [(None, 2, 23, 23, 10)]   0
conv_lstm2d (ConvLSTM2D)      (None, 2, 23, 23, 32)     134528
batch_normalization           (None, 2, 23, 23, 32)     128
(BatchNormalization)
conv_lstm2d_1 (ConvLSTM2D)    (None, 2, 23, 23, 32)     73856
conv_lstm2d_2 (ConvLSTM2D)    (None, 2, 23, 23, 32)     73856
conv3d (Conv3D)               (None, 2, 23, 23, 1)      865
=================================================================
Total params: 283,233
Trainable params: 283,169
Non-trainable params: 64
_________________________________________________________________
The model is fit using data from a custom Sequence subclass, which produces X input of shape ([batch size], 2, 23, 23, 10) and Y input of shape ([batch size], 2, 23, 23, 1). The batch size is usually 8, but because of the way the data is lazily loaded, the last batch in a particular block of files may be smaller, which is why I don't specify a batch size in the model definition. For the record, the original coders used a constant batch size, though, as with mine, it wasn't specified in the model definition.
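For context, here is a stripped-down sketch of what the generator does (file handling elided; the class and variable names are placeholders, not my actual code):

import math
import numpy as np
from tensorflow.keras.utils import Sequence

class LazyWeatherSequence(Sequence):
    """Sketch of the data generator: yields (X, Y) batches of shape
    (batch, 2, 23, 23, 10) and (batch, 2, 23, 23, 1)."""

    def __init__(self, n_samples, batch_size=8):
        self.n_samples = n_samples
        self.batch_size = batch_size

    def __len__(self):
        # The last batch may be smaller than batch_size.
        return math.ceil(self.n_samples / self.batch_size)

    def __getitem__(self, idx):
        start = idx * self.batch_size
        stop = min(start + self.batch_size, self.n_samples)
        n = stop - start
        # In the real code these come from lazily loaded files.
        X = np.zeros((n, 2, 23, 23, 10), dtype="float32")
        Y = np.zeros((n, 2, 23, 23, 1), dtype="float32")
        return X, Y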
When I try to fit the model, I get a crash pretty quickly, with this traceback:
Traceback (most recent call last):
  File "C:/code/Python/edapts/ConvLSTM2D.py", line 175, in <module>
    history = model.fit(training_data,
  File "C:\Python\lib\site-packages\keras\utils\traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "C:\Python\lib\site-packages\tensorflow\python\eager\execute.py", line 54, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error:
Detected at node 'gradient_tape/model/conv_lstm2d/transpose_1/transpose' defined at (most recent call last):
  File "C:/code/Python/edapts/ConvLSTM2D.py", line 175, in <module>
    history = model.fit(training_data,
  File "C:\Python\lib\site-packages\keras\utils\traceback_utils.py", line 64, in error_handler
    return fn(*args, **kwargs)
  File "C:\Python\lib\site-packages\keras\engine\training.py", line 1409, in fit
    tmp_logs = self.train_function(iterator)
  File "C:\Python\lib\site-packages\keras\engine\training.py", line 1051, in train_function
    return step_function(self, iterator)
  File "C:\Python\lib\site-packages\keras\engine\training.py", line 1040, in step_function
    outputs = model.distribute_strategy.run(run_step, args=(data,))
  File "C:\Python\lib\site-packages\keras\engine\training.py", line 1030, in run_step
    outputs = model.train_step(data)
  File "C:\Python\lib\site-packages\keras\engine\training.py", line 893, in train_step
    self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
  File "C:\Python\lib\site-packages\keras\optimizers\optimizer_v2\optimizer_v2.py", line 537, in minimize
    grads_and_vars = self._compute_gradients(
  File "C:\Python\lib\site-packages\keras\optimizers\optimizer_v2\optimizer_v2.py", line 590, in _compute_gradients
    grads_and_vars = self._get_gradients(tape, loss, var_list, grad_loss)
  File "C:\Python\lib\site-packages\keras\optimizers\optimizer_v2\optimizer_v2.py", line 471, in _get_gradients
    grads = tape.gradient(loss, var_list, grad_loss)
Node: 'gradient_tape/model/conv_lstm2d/transpose_1/transpose'
transpose expects a vector of size 4. But input(1) is a vector of size 5
  [[{{node gradient_tape/model/conv_lstm2d/transpose_1/transpose}}]] [Op:__inference_train_function_9880]
I'm so lost. The only thing I can think of is that the original coders made the X and Y features the same, whereas I'm only trying to predict 1 of the 10 (there's not actually a reason to predict the others--not sure why they were trying to). If that's the problem, how do I redefine the model to take account of the different output shape?
EDIT: Well, that's interesting. I just compared the model definition from the original team with the model definition used by the co-worker I inherited the code from. Said co-worker seems to have inserted an extra layer in the outputs, duplicating the ConvLSTM2D with 32 filters and kernel size (3,3). When I remove one of those duplicates, everything runs just fine....
Great, so it's "fixed". But can someone explain why it wasn't working in the first place? My level of understanding of the issue at this point is to cross myself and throw salt over my shoulder.
EDIT #2: Does the problem result from having such a small grid (23 x 23)? So either a bigger grid or smaller kernels would solve the problem without deleting a layer? That seems intuitively likely, but I'd like to know the math for calculating how the outputs from each layer match up with the definition of the next layer.
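For my own bookkeeping, here is the shape and parameter math as I understand it (a back-of-the-envelope check against the summary above, not an authoritative answer): with padding = "same" and return_sequences = True, each ConvLSTM2D keeps the (time, height, width) dimensions and only changes the channel count to its number of filters, and its parameter count should be 4 * filters * (kernel_h * kernel_w * (input_channels + filters) + 1). Those formulas do reproduce the numbers in the summary:

def conv_lstm2d_params(filters, kernel, in_channels):
    # 4 gates, each a convolution over (input channels + recurrent channels) plus a bias
    kh, kw = kernel
    return 4 * filters * (kh * kw * (in_channels + filters) + 1)

print(conv_lstm2d_params(32, (5, 5), 10))   # 134528 -> first ConvLSTM2D
print(conv_lstm2d_params(32, (3, 3), 32))   # 73856  -> second/third ConvLSTM2D
print(4 * 32)                               # 128    -> BatchNormalization (gamma, beta, moving mean/var)
print(3 * 3 * 3 * 32 * 1 + 1)               # 865    -> Conv3D

By that math the 23 x 23 grid never shrinks anywhere in the model, so I still don't see where the size-4 vs. size-5 mismatch comes from.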
I have a list of tokens. Using the Hugging Face BERT tokenizer, I can get their numerical representation:
X = ['[CLS]', '[MASK]', 'love', 'this', '[SEP]']
tokens = tokenizer.convert_tokens_to_ids(X)
tokens: [101, 103, 2293, 2023, 102]
Is there a function that converts tokens = [101, 103, 2293, 2023, 102] back to the words ['[CLS]', '[MASK]', 'love', 'this', '[SEP]']?
One possible way is to build the mapping myself, but is there a predefined function that does it easily?
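To be clear, this is the kind of manual mapping I mean (a rough sketch that inverts the tokenizer's vocabulary; I'm assuming the standard bert-base-uncased checkpoint, which matches the ids above):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

X = ['[CLS]', '[MASK]', 'love', 'this', '[SEP]']
tokens = tokenizer.convert_tokens_to_ids(X)   # [101, 103, 2293, 2023, 102]

# Manual reverse mapping: invert the vocab dict (token -> id) into id -> token
id_to_token = {i: t for t, i in tokenizer.get_vocab().items()}
words = [id_to_token[i] for i in tokens]
print(words)  # ['[CLS]', '[MASK]', 'love', 'this', '[SEP]']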
I am following the PyTorch tutorial on speech command recognition and trying to implement my own recognition of 22 sentences in German. In the tutorial they use padding for the audio tensors, but for the labels they use only torch.stack. Because of that, I get an error as soon as I start training the network:
RuntimeError: stack expects each tensor to be equal size, but got [456] at entry 0 and [470] at entry 1.
I do understand what this says, but since I am new to PyTorch I unfortunately can't implement a padding function for the sentences from scratch. I would therefore be happy if you could give me some hints and tips for this.
Here is the code for collate_fn and pad_sequence functions:
def pad_sequence(batch):
    # Make all tensors in a batch the same length by padding with zeros
    batch = [item.t() for item in batch]
    batch = torch.nn.utils.rnn.pad_sequence(batch, batch_first=True, padding_value=0.)
    return batch.permute(0, 2, 1)

def collate_fn(batch):
    # A data tuple has the form:
    # waveform, label
    tensors, targets = [], []

    # Gather in lists, and encode labels as indices
    for waveform, label in batch:
        tensors += [waveform]
        targets += [label]

    # Group the list of tensors into a batched tensor
    tensors = pad_sequence(tensors)
    targets = torch.stack(targets)

    return tensors, targets
As I started working directly with pad_sequence, I understood how simple it is: in my case I only needed to pass it the batch of encoded sentences, and PyTorch automatically extends each one to the length of the longest sequence in the batch.
My code looks now like this:
def pad_AudioSequence(batch):
    # Make all tensors in a batch the same length by padding with zeros
    batch = [item.t() for item in batch]
    batch = torch.nn.utils.rnn.pad_sequence(batch, batch_first=True, padding_value=0.)
    return batch.permute(0, 2, 1)

def pad_TextSequence(batch):
    return torch.nn.utils.rnn.pad_sequence(batch, batch_first=True, padding_value=0)

def collate_fn(batch):
    # A data tuple has the form:
    # waveform, label
    tensors, targets = [], []

    # Gather in lists, and encode labels as indices
    for waveform, label in batch:
        tensors += [waveform]
        targets += [label]

    # Group the list of tensors into a batched tensor
    tensors = pad_AudioSequence(tensors)
    targets = pad_TextSequence(targets)

    return tensors, targets
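To hook this into training, the collate_fn is simply handed to the DataLoader; a minimal sketch (the dataset name and batch size are placeholders for my setup, not part of the tutorial):

from torch.utils.data import DataLoader

# train_dataset is assumed to yield (waveform, encoded_sentence) pairs
train_loader = DataLoader(
    train_dataset,
    batch_size=8,
    shuffle=True,
    collate_fn=collate_fn,   # pads audio and label tensors per batch
)

for waveforms, targets in train_loader:
    # waveforms: (batch, channels, max_audio_len), targets: (batch, max_label_len)
    pass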
For those who still don't understand how that works, here is a little example:
encDecClass2 = dummyEncoderDecoder()
sent1 = audioWorkerClass.sentences[4] # wie viel Prozent hat der Akku noch? ("what percentage does the battery still have?")
sent2 = audioWorkerClass.sentences[5] # Wie spät ist es? ("what time is it?")
sent3 = audioWorkerClass.sentences[6] # Mach einen Timer für 5 Sekunden. ("set a timer for 5 seconds")
# encode sentences into tensor of numbers, representing words, using my own enc-dec class
sent1 = encDecClass2.encode(sent1) # tensor([11, 94, 21, 94, 22, 94, 23, 94, 24, 94, 25, 94, 26, 94, 15, 94])
sent2 = encDecClass2.encode(sent2) # tensor([27, 94, 28, 94, 12, 94, 29, 94, 15, 94])
sent3 = encDecClass2.encode(sent3) # tensor([30, 94, 31, 94, 32, 94, 33, 94, 34, 94, 35, 94, 19, 94])
print(sent1.shape) # torch.Size([16])
print(sent2.shape) # torch.Size([10])
print(sent3.shape) # torch.Size([14])
batch = []
# add sentences to the batch as separate arrays
batch +=[sent1]
batch +=[sent2]
batch +=[sent3]
output = pad_sequence(batch, batch_first=True, padding_value=0)  # torch.nn.utils.rnn.pad_sequence
print(f"{output}\n{output.shape}")
#############################################################################
# output:
# tensor([[11, 94, 21, 94, 22, 94, 23, 94, 24, 94, 25, 94, 26, 94, 15, 94],
# [27, 94, 28, 94, 12, 94, 29, 94, 15, 94, 0, 0, 0, 0, 0, 0],
# [30, 94, 31, 94, 32, 94, 33, 94, 34, 94, 35, 94, 19, 94, 0, 0]])
# torch.Size([3, 16])
#############################################################################
As you can see, all arrays were padded with zeros up to the maximum length of the three. The shape of the output is 3x16 because we had three sentences and the longest sequence in the batch had length 16.
I am trying to get the following pretrained huggingface model to work: https://huggingface.co/mmoradi/Robust-Biomed-RoBERTa-RelationClassification
I use the following code:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("mmoradi/Robust-Biomed-RoBERTa-RelationClassification")
model = AutoModel.from_pretrained("mmoradi/Robust-Biomed-RoBERTa-RelationClassification")
inputs = tokenizer("""The colorectal cancer was caused by mutations in angina""")
outputs = model(**inputs)
For some reason, I get the following error when trying to produce the outputs, i.e. in the last line of my code:
--> 796 input_shape = input_ids.size()
797 elif inputs_embeds is not None:
798 input_shape = inputs_embeds.size()[:-1]
AttributeError: 'list' object has no attribute 'size'
The inputs look like this:
{'input_ids': [0, 133, 11311, 1688, 3894, 337, 1668, 21, 1726, 30, 28513, 11, 1480, 347, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
I have no idea how to go about debugging this, so any help or hints are welcomed!
You have to specify the type of tensors that you want the tokenizer to return. If you don't, it returns a dictionary with two lists (input_ids and attention_mask):
inputs = tokenizer("""The colorectal cancer was caused by mutations in angina""", return_tensors="pt")
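Put together with your original code, that would look roughly like this (a sketch of your snippet with return_tensors added):

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("mmoradi/Robust-Biomed-RoBERTa-RelationClassification")
model = AutoModel.from_pretrained("mmoradi/Robust-Biomed-RoBERTa-RelationClassification")

# return_tensors="pt" makes the tokenizer return PyTorch tensors instead of plain lists
inputs = tokenizer("The colorectal cancer was caused by mutations in angina", return_tensors="pt")
outputs = model(**inputs)

print(inputs["input_ids"].shape)        # now a torch.Tensor, e.g. torch.Size([1, 15]) for your sentence
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size)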
I am using the Hugging Face pre-trained LongformerModel to extract embeddings for a sentence. I want to change the token length / max sentence length parameter, but I am not able to do so. Here is the code.
model = LongformerModel.from_pretrained('allenai/longformer-base-4096',output_hidden_states = True)
tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
model.eval()
text=[" I like to play cricket"]
input_ids = torch.tensor(tokenizer.encode(text,max_length=20,padding=True,add_special_tokens=True)).unsqueeze(0)
print(tokenizer.encode(text,max_length=20,padding=True,add_special_tokens=True))
# [0, 38, 101, 7, 310, 5630, 2]
I expected the encoder to give me a list of size 20 with padding, since I passed the parameter max_length=20. But it returned a list of size 7 only?
attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device)
attention_mask[:, [0,-1]] = 2
outputs = model(input_ids, attention_mask=attention_mask, return_dict=True)
hidden_states = outputs[2]
print ("Number of layers:", len(hidden_states), " (initial embeddings + 12 BERT layers)")
layer_i = 0
print ("Number of batches:", len(hidden_states[layer_i]))
batch_i = 0
print ("Number of tokens:", len(hidden_states[layer_i][batch_i]))
token_i = 0
print ("Number of hidden units:", len(hidden_states[layer_i][batch_i][token_i]))
Output:
Number of layers: 13 (initial embeddings + 12 BERT layers)
Number of batches: 1
Number of tokens: 512 # How can I change this parameter to pick up my sentence length during run-time
Number of hidden units: 768
How can I reduce the number of tokens to the sentence length instead of 512? Every time I input a new sentence, it should pick up that length.
Question regarding padding
padding=True pads your input to the longest sequence in the batch. padding='max_length' pads your input to the specified max_length (documentation):
from transformers import LongformerTokenizer
tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
text=[" I like to play cricket"]
print(tokenizer.encode(text[0],max_length=20,padding='max_length',add_special_tokens=True))
Output:
[0, 38, 101, 7, 310, 5630, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Question regarding the number of tokens of the hidden states
The Longformer implementation applies padding to your sequence to match the attention window sizes. You can see the size of the attention windows in your model config:
model.config.attention_window
Output:
[512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512]
This is the corresponding code line: link.
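If you only need the embeddings for your actual tokens, one option is to slice the hidden states back to your input length (a sketch continuing your snippet; it assumes the internal padding is appended after your real tokens, which is what the padding-to-window-size step does):

# input_ids has shape (1, seq_len) before Longformer's internal padding to a multiple of 512
seq_len = input_ids.shape[1]

outputs = model(input_ids, attention_mask=attention_mask, return_dict=True)
hidden_states = outputs.hidden_states          # tuple of 13 tensors, each (1, 512, 768) in your run

# Keep only the positions that correspond to your original tokens
last_layer_for_my_tokens = outputs.last_hidden_state[:, :seq_len, :]
print(last_layer_for_my_tokens.shape)          # torch.Size([1, seq_len, 768])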