I have been working with pretrained embeddings (GloVe) and would like to allow them to be fine-tuned. I currently use the embeddings like this:
word_embeddingsA = nn.Embedding(vocab_size, embedding_length)
word_embeddingsA.weight = nn.Parameter(TEXT.vocab.vectors, requires_grad=False)
Should I simply set requires_grad=True to allow the embeddings to be trained? Or should I do something like this:
word_embeddingsA = nn.Embedding.from_pretrained(TEXT.vocab.vectors, freeze=False)
Are these equivalent, and do I have a way to check that the embeddings are getting trained?
Yes, they are equivalent, as stated in the documentation of nn.Embedding.from_pretrained:
freeze (boolean, optional) – If True, the tensor does not get updated in the learning process. Equivalent to embedding.weight.requires_grad = False. Default: True
If word_embeddingsA.weight.requires_grad is True, the embedding can be trained (provided its weight is also passed to the optimizer); otherwise it is frozen.
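To check that the embeddings are really being updated, one option is to snapshot the weight matrix, run a training step, and compare. Here is a minimal sketch; the random `pretrained` tensor is only a stand-in for TEXT.vocab.vectors, and the dummy loss and the SGD optimizer are assumptions made for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for TEXT.vocab.vectors: 6 "words" with 4 dimensions each (hypothetical data)
pretrained = torch.randn(6, 4)

emb = nn.Embedding.from_pretrained(pretrained, freeze=False)
print(emb.weight.requires_grad)  # True, because freeze=False

optimizer = torch.optim.SGD(emb.parameters(), lr=0.1)

# Clone the weights first: the parameter may share memory with `pretrained`
before = emb.weight.detach().clone()

# One dummy training step that only looks up rows 1 and 3
token_ids = torch.tensor([1, 3])
loss = emb(token_ids).pow(2).sum()
loss.backward()
optimizer.step()

# Only the rows that took part in the forward pass receive a non-zero gradient
changed_rows = (emb.weight.detach() != before).any(dim=1)
print(changed_rows)
```

In a real model you would do the same comparison around a few iterations of your own training loop. If no row changes, the embedding is either frozen or missing from the optimizer's parameter list.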
I am trying to understand how to do paraphrase generation using BERT/GPT-2, but I cannot figure out how to approach it. Could you please point me to any resources that would help me build a paraphrase generation model?
"The input would be a sentence and the output would be a paraphrase of the sentence"
Here is my recipe for training a paraphraser:
Instead of BERT (encoder only) or GPT (decoder only) use a seq2seq model with both encoder and decoder, such as T5, BART, or Pegasus. I suggest using the multilingual T5 model that was pretrained for 101 languages. If you want to load embeddings for your own language (instead of using all 101), you can follow this recipe.
Find a corpus of paraphrases for your language and domain. For English, ParaNMT, PAWS, and QQP are good candidates. A corpus called Tapaco, extracted from Tatoeba, is a paraphrasing corpus that covers 73 languages, so it is a good starting point if you cannot find a paraphrase corpus for your language.
Fine-tune your model on this corpus. The code can be something like this:
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer
# use here a backbone model of your choice, e.g. google/mt5-base
backbone_model = 'cointegrated/rut5-base-multitask'
model = T5ForConditionalGeneration.from_pretrained(backbone_model)
tokenizer = T5Tokenizer.from_pretrained(backbone_model)
model.cuda()
optimizer = torch.optim.Adam(params=[p for p in model.parameters() if p.requires_grad], lr=1e-5)
# todo: load the paraphrasing corpus and define the get_batch function (and the mult value passed to it)
for i in range(100500):
    # xx: list of source sentences, yy: list of their paraphrases
    xx, yy = get_batch(mult=mult)
    x = tokenizer(xx, return_tensors='pt', padding=True).to(model.device)
    y = tokenizer(yy, return_tensors='pt', padding=True).to(model.device)
    # do not force the model to predict pad tokens (0 is the pad token id in T5)
    y.input_ids[y.input_ids == 0] = -100
    loss = model(
        input_ids=x.input_ids,
        attention_mask=x.attention_mask,
        labels=y.input_ids,
        decoder_attention_mask=y.attention_mask,
        return_dict=True
    ).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
# save the fine-tuned model and its tokenizer once training is finished
model.save_pretrained('my_paraphraser')
tokenizer.save_pretrained('my_paraphraser')
A more complete version of this code can be found in this notebook.
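The get_batch function is left as a todo in the training code, so here is one way it could look. This is only a sketch under assumptions: the toy `pairs` list, the `BASE_BATCH_SIZE` constant, and the reading of `mult` as a batch-size multiplier are all hypothetical and not taken from the original answer.

```python
import random

# Hypothetical toy corpus of (source, paraphrase) pairs; replace with a real one
pairs = [
    ("How old are you?", "What is your age?"),
    ("Where do you live?", "What is your place of residence?"),
    ("I like this movie.", "I enjoy this film."),
    ("The weather is nice today.", "Today the weather is pleasant."),
    ("He arrived late.", "He did not come on time."),
]

BASE_BATCH_SIZE = 2  # assumed value, tune it to your GPU memory


def get_batch(mult=1):
    """Sample mult * BASE_BATCH_SIZE random pairs (assumed meaning of `mult`)."""
    size = min(len(pairs), mult * BASE_BATCH_SIZE)
    batch = random.sample(pairs, size)
    xx = [source for source, _ in batch]  # model inputs
    yy = [target for _, target in batch]  # expected paraphrases
    return xx, yy


xx, yy = get_batch(mult=2)
print(len(xx), len(yy))
```

Both returned values are plain lists of strings, which is the format the tokenizer call in the training loop accepts.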
After the training, the model can be used in the following way:
from transformers import pipeline
pipe = pipeline(task='text2text-generation', model='my_paraphraser')
print(pipe('Here is your text'))
# [{'generated_text': 'Here is the paraphrase of your text.'}]
If you want your paraphrases to be more diverse, you can control the generation process using arguments like these:
print(pipe(
    'Here is your text',
    encoder_no_repeat_ngram_size=3,  # make output different from input
    do_sample=True,  # randomize
    num_beams=5,  # try more options
    max_length=128,  # longer texts
))
Enjoy!
You can use T5 paraphrasing for generating paraphrases.
I am trying to do a seq2seq prediction. For this, I have an LSTM layer followed by a fully connected layer. I employ teacher forcing during the training phase and would like to skip it (I may be wrong here) during the testing phase. I have not found a direct way of doing this, so I have taken the approach shown below.
def forward(self, inputs, future=0, teacher_force_ratio=0.2, target=None):
    outputs = []
    for idx in range(future):
        rnn_out, _ = self.rnn(inputs)
        output = self.fc1(rnn_out)
        if self.teacher_training:
            new_input = output if np.random.random() >= teacher_force_ratio else target[idx]
        else:
            new_input = output
        inputs = new_input
I use a bool variable teacher_training to check whether teacher forcing is needed or not. Is this correct? If yes, is there a better way to do this? Thanks.
In PyTorch, all classes that extend nn.Module have a boolean attribute called training. So instead of teacher_training you can simply check self.training. This attribute is set automatically when you switch the model mode: model.train() sets it to True and model.eval() sets it to False.
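Here is a minimal sketch of how that could look. The class name Seq2SeqSketch, the layer sizes, and the (batch, time, features) layout are assumptions made for illustration; only the use of self.training comes from the answer above. Note that it also carries the LSTM state across steps and collects the outputs, which the snippet in the question does not do.

```python
import random

import torch
import torch.nn as nn


class Seq2SeqSketch(nn.Module):
    """Hypothetical model: an LSTM followed by a fully connected layer."""

    def __init__(self, n_features=3, hidden_size=8):
        super().__init__()
        self.rnn = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.fc1 = nn.Linear(hidden_size, n_features)

    def forward(self, inputs, future=1, teacher_force_ratio=0.2, target=None):
        # inputs: (batch, seq_len, n_features); target: (batch, future, n_features)
        rnn_out, state = self.rnn(inputs)
        last = rnn_out[:, -1:, :]  # hidden output of the last observed step
        outputs = []
        for idx in range(future):
            output = self.fc1(last)  # prediction for step idx, (batch, 1, n_features)
            outputs.append(output)
            # self.training is True after model.train() and False after model.eval()
            use_teacher = (
                self.training
                and target is not None
                and random.random() < teacher_force_ratio
            )
            next_input = target[:, idx:idx + 1, :] if use_teacher else output
            last, state = self.rnn(next_input, state)
        return torch.cat(outputs, dim=1)


model = Seq2SeqSketch()
x = torch.randn(2, 5, 3)
y = torch.randn(2, 4, 3)

model.train()  # teacher forcing may be applied
train_out = model(x, future=4, teacher_force_ratio=0.5, target=y)

model.eval()  # teacher forcing is skipped, even if a target is passed
with torch.no_grad():
    eval_out = model(x, future=4, target=y)

print(train_out.shape, eval_out.shape)
```

With this pattern there is no extra flag to keep in sync: calling model.eval() before testing is enough to turn teacher forcing off.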
I defined a simple generative adversarial network that consists of a generator and a discriminator. The generator is compiled two times: The first time for non-adversarial training (without the discriminator extension), and the second one for adversarial training.
After I have built and compiled everything, I can ask my compiled models for their losses and metrics. This is what I get:
net.generator.loss -> 'mean_absolute_error'
net.generator.metrics -> []
net.combined.loss -> ['mean_absolute_error', 'binary_crossentropy']
net.combined.metrics -> []
So far so good, this seems to be plausible. But when I then use the train_on_batch method on net.generator or net.combined, the format of the returned loss does not match my expectations. I found out that I can check this by using model.metrics_names:
net.generator.metrics_names -> ['loss']
net.combined.metrics_names -> ['loss', 'sequential_15_loss', 'discriminator_loss']
My question is: why does my net.combined loss contain three elements instead of just the two I defined (loss=[generator_loss_fct, 'binary_crossentropy'])? I don't want it to be three elements long. Additionally, the first two are almost always the same, or at least very similar.
Does someone understand this? If yes, please explain to me why this happens and whether I did something wrong. :)
Thanks in advance!
# build and compile the generator
self.encoder = self._build_encoder(input_shape, encoder_filters, kernel_size, latent_size)
self.decoder = self._build_decoder(self.encoder.layers[-1].output_shape[1:], decoder_filters, kernel_size)
self.generator = Sequential([self.encoder, self.decoder])
# compile for non-adversarial training
self.generator.compile(loss=generator_loss_fct, optimizer=self.optimizer)
# get the inputs
masked_img= Input(self.input_shape, name='masked-image')
filled_img = self.generator(masked_img)
# build and compile the (global) discriminator
self.discriminator = self._build_discriminator(input_shape, discriminator_filters, kernel_size)
self.discriminator.compile(loss='binary_crossentropy', optimizer=self.optimizer, metrics=['accuracy'])
# let the discriminator judge the validity of the reconstruction
valid = self.discriminator(filled_img)
# we freeze the discriminator when training the generator
self.discriminator.trainable = False
# build and compile the combined adversarial model
self.combined = Model(masked_img, [filled_img, valid])
self.combined.compile(loss=[generator_loss_fct, 'binary_crossentropy'], loss_weights=[self.alpha, self.beta], optimizer=self.optimizer)
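When you have a multi-output model, Keras reports the total loss together with the loss corresponding to each output. That is why metrics_names has three entries: 'loss' is the total, 'sequential_15_loss' is the loss for the generator output, and 'discriminator_loss' is the loss for the discriminator output.
Besides, if the first two losses are as close as you say, the last loss probably contributes almost nothing to the total, for example because its weight (self.beta) is small compared with self.alpha.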
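To make the arithmetic concrete, here is a small sketch in plain Python. It assumes that Keras reports the per-output losses unweighted and the total as their sum weighted by loss_weights (ignoring any regularization terms). All numbers are hypothetical and chosen only for illustration:

```python
# Hypothetical values, not taken from the question
alpha, beta = 0.999, 0.001       # loss_weights=[alpha, beta] passed to compile()
generator_loss = 0.2500          # mean_absolute_error on the first output
discriminator_loss = 0.6931      # binary_crossentropy on the second output

# Assumed formula for the first reported element
total_loss = alpha * generator_loss + beta * discriminator_loss

# Same order as metrics_names: ['loss', 'sequential_15_loss', 'discriminator_loss']
reported = [total_loss, generator_loss, discriminator_loss]
print(reported)

# Share of the total that comes from the adversarial term
print(beta * discriminator_loss / total_loss)
```

With weights like these, the first two reported values are nearly identical, which would match what you observe. If you want the adversarial term to matter more, beta is the knob to look at.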
If you want to train a GAN model, you can take a look at this Keras example.
I am building a recommendation system where I predict the best item for each user given their purchase history. I have userIDs and itemIDs, and how much of each itemID was purchased by each userID. I have millions of users and thousands of products. Not all products have been purchased (there are some products that no one has bought yet). Since the numbers of users and items are large, I don't want to use one-hot vectors. I am using PyTorch and I want to create and train the embeddings so that I can make predictions for each user-item pair. I followed this tutorial: https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html. If it's an accurate assumption that the embedding layer is being trained, do I retrieve the learned weights through the model.parameters() method, or should I use the embedding.data.weight option?
model.parameters() returns all the parameters of your model, including the embeddings.
So all the parameters of your model are handed over to the optimizer (see the line below) and are updated when optimizer.step() is called. So yes, your embeddings are trained along with all other parameters of the network. (You can also freeze certain layers, e.g. by setting embedding.weight.requires_grad = False, but this is not the case here.)
# summing it up:
# this line specifies which parameters are trained with the optimizer
# model.parameters() just returns all parameters
# embedding class weights are also parameters and will thus be trained
optimizer = optim.SGD(model.parameters(), lr=0.001)
You can see that your embedding weights are also of type Parameter like this:
import torch
embedding_matrix = torch.nn.Embedding(10, 10)
print(type(embedding_matrix.weight))
This will output the type of the weights, which is Parameter:
<class 'torch.nn.parameter.Parameter'>
I'm not entirely sure what you mean by retrieve. Do you mean getting a single vector, or do you want the whole matrix to save it, or something else?
embedding_matrix = torch.nn.Embedding(5, 5)
# this will get you a single embedding vector
print('Getting a single vector:\n', embedding_matrix(torch.LongTensor([0])))
# of course you can do the same for a sequence
print('Getting vectors for a sequence:\n', embedding_matrix(torch.LongTensor([1, 2, 3])))
# this will give you the whole embedding matrix
print('Getting weights:\n', embedding_matrix.weight.data)
Output:
Getting a single vector:
tensor([[-0.0144, -0.6245, 1.3611, -1.0753, 0.5020]], grad_fn=<EmbeddingBackward>)
Getting vectors for a sequence:
tensor([[ 0.9277, -0.1879, -1.4999, 0.2895, 0.8367],
[-0.1167, -2.2139, 1.6918, -0.3483, 0.3508],
[ 2.3763, -1.3408, -0.9531, 2.2081, -1.5502]],
grad_fn=<EmbeddingBackward>)
Getting weights:
tensor([[-0.0144, -0.6245, 1.3611, -1.0753, 0.5020],
[ 0.9277, -0.1879, -1.4999, 0.2895, 0.8367],
[-0.1167, -2.2139, 1.6918, -0.3483, 0.3508],
[ 2.3763, -1.3408, -0.9531, 2.2081, -1.5502],
[-0.5829, -0.1918, -0.8079, 0.6922, -0.2627]])
I hope this answers your question. You can also take a look at the documentation, where you can find some useful examples as well.
https://pytorch.org/docs/stable/nn.html#torch.nn.Embedding
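For the recommendation use case, a sketch of how the learned matrices could be pulled out of a model after training might look like this. The ToyRecommender class, its attribute names, and the sizes are hypothetical; the dot-product scoring is just one common choice and is not taken from the tutorial:

```python
import torch
import torch.nn as nn


class ToyRecommender(nn.Module):
    """Hypothetical model: score = dot product of user and item embeddings."""

    def __init__(self, n_users, n_items, dim):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)
        self.item_emb = nn.Embedding(n_items, dim)

    def forward(self, user_ids, item_ids):
        return (self.user_emb(user_ids) * self.item_emb(item_ids)).sum(dim=1)


model = ToyRecommender(n_users=100, n_items=20, dim=8)

# named_parameters() shows that both embedding matrices are trainable parameters
names = []
for name, param in model.named_parameters():
    names.append(name)
    print(name, tuple(param.shape), param.requires_grad)

# The whole item matrix as a NumPy array, detached from the autograd graph
item_vectors = model.item_emb.weight.detach().cpu().numpy()
print(item_vectors.shape)  # (20, 8)

# Scores for the user-item pairs (0, 3) and (1, 4)
scores = model(torch.tensor([0, 1]), torch.tensor([3, 4]))
print(scores.shape)
```

Accessing the layer's weight directly is usually more convenient than iterating over model.parameters(), because you know which matrix you are getting.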
I am new to Keras and I created my own tf_idf sentence embeddings with shape (no_sentences, embedding_dim). I am trying to add this matrix as input to an LSTM layer. My network looks something like this:
q1_tfidf = Input(name='q1_tfidf', shape=(max_sent, 300))
q2_tfidf = Input(name='q2_tfidf', shape=(max_sent, 300))
q1_tfidf = LSTM(100)(q1_tfidf)
q2_tfidf = LSTM(100)(q2_tfidf)
distance2 = Lambda(preprocessing.exponent_neg_manhattan_distance, output_shape=preprocessing.get_shape)(
[q1_tfidf, q2_tfidf])
I'm struggling with how the matrix should be shaped. I am getting this error:
ValueError: Error when checking input: expected q1_tfidf to have 3 dimensions, but got array with shape (384348, 300)
I already checked this post: Sentence Embedding Keras but still can't figure it out. It seems like I'm missing something obvious.
Any idea how to do this?
OK, as far as I understand, you want to predict the difference between two sentences.
What about reusing the LSTM layer (the language model should be the same for both inputs), so that you learn a single sentence encoder and apply it twice:
from keras.layers import Input, LSTM, concatenate
from keras.models import Model
q1_tfidf = Input(name='q1_tfidf', shape=(max_sent, 300))
q2_tfidf = Input(name='q2_tfidf', shape=(max_sent, 300))
# one LSTM layer shared by both inputs, so both sentences use the same weights
lstm = LSTM(100)
lstm_out_q1 = lstm(q1_tfidf)
lstm_out_q2 = lstm(q2_tfidf)
predict = concatenate([lstm_out_q1, lstm_out_q2])
# the model takes both inputs (q1 and q2)
model = Model(inputs=[q1_tfidf, q2_tfidf], outputs=predict)
You could also compute your custom distance in an additional Lambda layer, but in that case you would pass the two LSTM outputs to that layer instead of concatenating them.
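As for the ValueError itself: an LSTM expects 3-dimensional input of shape (samples, timesteps, features), while the array in the question is 2-dimensional, (384348, 300). If each row is one whole sentence vector, one possible fix (a sketch under that assumption, with a smaller hypothetical array) is to add a time axis of length 1 and declare the Input with shape=(1, 300):

```python
import numpy as np

embedding_dim = 300
n_sentences = 1000  # hypothetical; the question has 384348 rows

# One tf-idf-weighted vector per sentence: (samples, features)
q1_vectors = np.random.rand(n_sentences, embedding_dim).astype("float32")
print(q1_vectors.shape)  # (1000, 300)

# Add a time axis so that every sentence becomes a sequence of length 1
q1_sequences = np.expand_dims(q1_vectors, axis=1)
print(q1_sequences.shape)  # (1000, 1, 300)
```

Keep in mind that with a single timestep the LSTM has no sequence to model, so a Dense layer on the 2-dimensional input may serve just as well. If you have per-word vectors instead, you would pad every sentence to max_sent words and feed an array of shape (samples, max_sent, 300).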