I'm trying to implement an attention mechanism for abstractive text summarization in Keras, drawing heavily on this GitHub thread, which contains a lot of informative discussion about the implementation. I'm struggling to understand some very basic parts of the code and what I will need to modify to get the output I want. I understand that attention produces a context vector as a weighted sum of the hidden states from all previous timesteps, and that is what we are trying to do below.
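To double-check my understanding, this is the computation I have in mind, as a toy numpy sketch (the values are made up, only the shapes matter):
import numpy as np

# toy encoder hidden states: 4 timesteps, hidden size 3
hidden_states = np.random.rand(4, 3)
# one unnormalized importance score per timestep
scores = np.random.rand(4)
# softmax over timesteps gives the attention weights
weights = np.exp(scores) / np.exp(scores).sum()
# the context vector is the weighted sum of the hidden states
context = (weights[:, None] * hidden_states).sum(axis=0)
print(context.shape)  # (3,)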
Data:
I am using the BBC news dataset, which consists of news text and headlines for various categories such as Politics, Entertainment, and Sports.
Parameters:
n_embeddings = 64
vocab_size = len(vocabulary) + 1
max_len = 200
rnn_size = 64
Code:
from keras.layers import Input, Embedding, LSTM, Dense, Flatten, Activation, RepeatVector, Permute, Lambda, multiply
from keras.models import Model
from keras import backend as K

_input = Input(shape=(max_len,), dtype='int32')
embedding = Embedding(input_dim=vocab_size, output_dim=n_embeddings, input_length=max_len)(_input)
activations = LSTM(rnn_size, return_sequences=True)(embedding)
# compute an importance score for each timestep
attention = Dense(1, activation='tanh')(activations)
attention = Flatten()(attention)
attention = Activation('softmax')(attention)
attention = RepeatVector(rnn_size)(attention)
attention = Permute([2, 1])(attention)
# apply the attention weights to the LSTM activations
sent_representation = multiply([activations, attention])
sent_representation = Lambda(lambda xin: K.sum(xin, axis=1))(sent_representation)
probabilities = Dense(max_len, activation='softmax')(sent_representation)
model = Model(inputs=_input, outputs=probabilities)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=[])
print(model.summary())
My Questions:
The linked thread is using attention for classification, whereas I want to generate a text sequence (the summary), so how should I use the output probabilities and decode them to generate the summary?
What is RepeatVector used for here? Is it for getting the activation and attention probability of each word at timestep T?
I couldn't find much explanation of what the Permute layer does. What is it for here?
What is Lambda(lambda xin: K.sum(xin, axis=1))(sent_representation) for?
What should model.fit() look like? I have created padded sequences of a fixed length for X and y.
I would really appreciate any help you could provide. Thanks a lot in advance.
I have found that most research tends to use TensorFlow for this task, as it is much easier to implement a seq2seq model in TensorFlow than with a Keras model.
This blog goes into detail on how to build a text summarizer in a highly optimized manner; it explains dongjun-Lee's implementation, one of the simplest and most efficient ones.
The code can be found here, implemented on Google Colab.
If you liked the idea, you can check out the whole blog series.
Hope this helps.
Related
In deep learning with Keras, I have usually come across model.fit looking something like this:
model.fit(x_train, y_train, epochs=50, callbacks=[es], batch_size=512, validation_data=(x_val, y_val))
Whereas in NLP tasks, I have seen some articles on text summarization using an LSTM encoder-decoder with attention, and I usually come across the following code for fitting the model, which I am not able to comprehend:
model.fit([x_tr,y_tr[:,:-1]], y_tr.reshape(y_tr.shape[0],y_tr.shape[1], 1)[:,1:] ,epochs=50,callbacks=[es],batch_size=512, validation_data=([x_val,y_val[:,:-1]], y_val.reshape(y_val.shape[0],y_val.shape[1], 1)[:,1:]))
And I have found no explanation of why it is done this way. Can someone explain the above code? It comes from https://www.analyticsvidhya.com/blog/2019/06/comprehensive-guide-text-summarization-using-deep-learning-python/
Please note: I have contacted the person who wrote the article, but have received no response.
Just saw your question. Anyway, in case anyone has a similar question, here is an explanation.
The model.fit() method fits the training data, and you can set the batch size there, e.g. 512 in your case. The inputs are the source text plus the summary with its last word removed (the decoder input), and the target is the summary reshaped to 3D and shifted by one position so that it starts from the second word. In other words, at every timestep the model is taught to predict the next word of the summary given the previous words (teacher forcing). The validation data is passed in the same format so that validation can run during training.
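As a concrete illustration of the slicing and reshaping in that call (toy array with made-up token ids, just to show the shapes):
import numpy as np

# a toy batch of 2 summaries, each padded to length 5 (made-up token ids)
y_tr = np.array([[1, 7, 4, 9, 2],
                 [1, 3, 8, 5, 2]])

decoder_input = y_tr[:, :-1]                 # summary without its last token
decoder_target = y_tr.reshape(y_tr.shape[0], y_tr.shape[1], 1)[:, 1:]  # shifted by one, with a trailing feature axis

print(decoder_input.shape)    # (2, 4)
print(decoder_target.shape)   # (2, 4, 1)
# at timestep t the target is the token at position t+1 of the original summary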
I have some questions about fine-tuning a causal language model using transformers and PyTorch.
My main goal is to fine-tune XLNet. However, I found that most posts online target text classification, like this post. I was wondering whether there is any way to fine-tune the model without using run_language_model.py from the transformers GitHub repository.
Here is a piece of my code trying to fine-tune XLNet:
import torch
from transformers import XLNetLMHeadModel, XLNetTokenizer

model = XLNetLMHeadModel.from_pretrained("xlnet-base-cased")
tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased", do_lower_case=True)
LOSS = torch.nn.CrossEntropyLoss()
batch_texts = ["this is sentence 1", "i have another sentence like this", "the final sentence"]
encodings = tokenizer.batch_encode_plus(batch_texts, add_special_tokens=True, padding=True,
                                        return_tensors="pt", return_attention_mask=True)
outputs = model(encodings["input_ids"], attention_mask=encodings["attention_mask"])
loss = LOSS(outputs[0], target_ids)
loss.backward()
# ignoring the rest of the code...
I got stuck at the last two lines. First, when using this LM model, it seems I don't have any labels, as supervised learning usually does; second, since the language model is trained to minimize the loss (cross-entropy here), I need target_ids to compute the loss and perplexity against the input_ids.
Here are my follow-up questions:
1. How should I deal with these labels during model fitting?
2. Should I set something like target_ids = encodings["input_ids"].clone() to compute the cross-entropy loss and perplexity? If not, how should I set target_ids?
3. From the perplexity page of the transformers documentation, how should I adapt its method to input text of non-fixed length?
4. I saw another post in the documentation saying that causal language modeling requires padding the text. However, there is no sign of padding in the link in 3). Which one should I follow?
Any suggestions and advice will be appreciated!
When fine-tuning a model with a language-model head, the labels are the next tokens themselves (you predict the next words). Hugging Face's library makes a lot of things very easy by hiding most of the complexity of the process inside its methods, which is very nice when you want to do something standard. But if you want to do something special, or if you want to learn and understand the details, I suggest implementing the training loop directly in PyTorch; coding the low-level stuff is the best way to learn.
For this case, here is a draft to get you started; the training loop is far from complete, but it has to be adapted to each specific case anyway, so I hope these few lines help you start...
import torch
import transformers
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained('distilgpt2')
tokenizer = GPT2Tokenizer.from_pretrained('distilgpt2')
# our input:
s = tokenizer.encode('In winter, the weather is',return_tensors='pt')
# we want to fine-tune to force a fake output as follows:
ss = tokenizer.encode('warm and hot',return_tensors='pt')
# forward pass:
outputs = model(s)
# check that the output logits are given for every input token:
print(outputs.logits.size())
# we're gonna train on the token that follows the last input one
# so we extract just the last logit:
lasty = outputs.logits[0,-1].view(1,-1)
# prepare backprop:
lossfct = torch.nn.CrossEntropyLoss()
optimizer = transformers.AdamW(model.parameters(), lr=5e-5)
# just take the first next token (you should repeat this for the next ones)
labels = ss[0][0].view(1)
loss = lossfct(lasty,labels)
loss.backward()
optimizer.step()
optimizer.zero_grad()
# fine-tuning done: you may check that the answer is already different:
y = model.generate(s)
sy = tokenizer.decode(y[0])
print(sy)
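The snippet above only trains on the first token of the fake output; as the comment says, you would repeat the step for the next ones. A naive way to do that, staying close to the snippet above (same variables, no batching, just a sketch):
# continue from the code above: loop over every token of ss
current = s
for i in range(ss.size(1)):
    out = model(current)
    last_logits = out.logits[0, -1].view(1, -1)   # logits for the next position
    label = ss[0][i].view(1)                       # the gold next token
    loss = lossfct(last_logits, label)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    current = torch.cat([current, ss[:, i:i+1]], dim=1)  # append the gold token and move on
In practice it is more efficient to concatenate s and ss, run a single forward pass, and compute the loss of every position against the following token (the labels argument of the model does exactly that for you), but the loop stays closest to the original snippet.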
I'm trying to calculate the probability, or some kind of score, for words in a sentence using NLP. I tried this approach with a GPT-2 model using the Hugging Face Transformers library, but I couldn't get satisfactory results: because of the model's unidirectional nature, it didn't seem to predict within context for my purposes. So I was wondering whether there is a way to calculate the above using BERT, since it's bidirectional.
I found this related post, which I randomly saw the other day, but I didn't see any answer there that would be useful for me either.
Hope I will be able to receive ideas or a solution for this. Any help is appreciated. Thank you.
BERT is trained as a masked language model, i.e., it is trained to predict tokens that were replaced by a [MASK] token.
import torch
from transformers import AutoTokenizer, BertForMaskedLM
tok = AutoTokenizer.from_pretrained("bert-base-cased")
bert = BertForMaskedLM.from_pretrained("bert-base-cased")
input_idx = tok.encode(f"The {tok.mask_token} were the best rock band ever.")
logits = bert(torch.tensor([input_idx]))[0]
prediction = logits[0].argmax(dim=1)
# position 2 is where the [MASK] token sits in this sentence
print(tok.convert_ids_to_tokens(prediction[2].numpy().tolist()))
It prints token no. 11581 which is:
Beatles
To get a normalized probability distribution over BERT's vocabulary, you can normalize the logits with the softmax function, i.e., F.softmax(logits, dim=-1) (assuming the standard import torch.nn.functional as F).
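For example, continuing the snippet above, the probability assigned to "Beatles" at the masked position can be read off like this (a small sketch):
import torch.nn.functional as F

probs = F.softmax(logits[0], dim=-1)              # (sequence_length, vocab_size) probabilities
beatles_id = tok.convert_tokens_to_ids("Beatles")
print(probs[2, beatles_id].item())                # probability of "Beatles" at the [MASK] position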
The tricky thing is that words may be split into multiple subwords. You can simulate that by adding multiple [MASK] tokens, but then you have the problem of how to reliably compare the scores of predictions of different lengths. I would probably average the probabilities, but maybe there is a better way.
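A rough sketch of that averaging idea, continuing from the code above (this is my own approximation, not an established scoring method):
def word_score(sentence_template, word):
    # split the candidate word into subwords and put one [MASK] per subword
    sub_ids = tok.encode(word, add_special_tokens=False)
    masks = " ".join([tok.mask_token] * len(sub_ids))
    ids = tok.encode(sentence_template.format(masks))
    mask_positions = [i for i, t in enumerate(ids) if t == tok.mask_token_id]
    with torch.no_grad():
        logits = bert(torch.tensor([ids]))[0]
    probs = F.softmax(logits[0], dim=-1)
    # average the probability of each gold subword at its masked position
    return sum(probs[p, s].item() for p, s in zip(mask_positions, sub_ids)) / len(sub_ids)

print(word_score("The {} were the best rock band ever.", "Beatles"))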
I have been attempting to classify an author using multiple texts written by this author, which I would then use to find similarities in other texts to identify that author in the test group.
I have been successful with some of the predictions, however I am still getting results where it fails to predict the author.
I have pre-processed the texts beforehand with stemming, tokenizing, stop-word and punctuation removal, etc., in an attempt to make the model more accurate.
I am unfamiliar with how exactly the OneClassSVM parameters work. What parameters would best suit my problem, and how could I make my model more accurate in its predictions?
Here is what I have so far:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

vectorizer = TfidfVectorizer()
author_corpus = self.pre_process(author_corpus)
test_corpus = self.pre_process(test_corpus)
train = author_corpus
test = test_corpus
train_vectors = vectorizer.fit_transform(train)
test_vectors = vectorizer.transform(test)
model = OneClassSVM(kernel='linear', gamma='auto', nu=0.01)
model.fit(train_vectors)
test_predictions = model.predict(test_vectors)
print(test_predictions[:10])
print(model.score_samples(test_vectors)[:10])
You can use an SVM, but deep learning is really well suited for this. I did a Kaggle competition on classifying documents where it worked amazingly well.
If you don't think you have a big enough dataset, you might want to take an existing text classifier model, re-train its last layer on your author, and then fine-tune the rest of the model, as in the sketch below.
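A minimal sketch of that freeze-then-fine-tune pattern in Keras (base_model, the file path, x_train/y_train and the layer sizes here are all stand-ins for whatever pretrained classifier and data you actually use):
from tensorflow import keras

# hypothetical pretrained text classifier whose head we replace
base_model = keras.models.load_model("pretrained_text_classifier.h5")  # placeholder path
base_model.trainable = False                              # freeze everything first

features = base_model.layers[-2].output                   # representation just before the old head
new_head = keras.layers.Dense(1, activation="sigmoid", name="author_head")(features)
model = keras.Model(base_model.input, new_head)

# step 1: train only the new head on your author data
model.compile(optimizer=keras.optimizers.Adam(1e-3), loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, validation_data=(x_val, y_val))

# step 2: unfreeze and fine-tune the whole model with a much lower learning rate
base_model.trainable = True
model.compile(optimizer=keras.optimizers.Adam(1e-5), loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=3, validation_data=(x_val, y_val))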
I've heard positive things about Andrew Ng's deep learning class on Coursera. I learned all I know about AI from the Microsoft Professional Certification in AI on edX.
There are two sets of very similar code below, with a very simple input, as an illustrative example for my question. I think an explanation of the following observation may somehow answer my question. Thanks!
When I run the following code, the model trains quickly and gives good predictions.
import tensorflow as tf
import numpy as np
from tensorflow import keras
model = tf.keras.Sequential([keras.layers.Dense(units=1, input_shape=[1])])
model.compile(optimizer='sgd', loss='mean_squared_error')
# (xs and ys are defined as a simple toy dataset; their definition is not shown here)
model.fit(xs, ys, epochs=1000)
print(model.predict([7.0]))
However, when I run the following code, which is very similar to the one above, the model trains very slowly and is often not trained well, giving bad predictions (i.e. the loss easily drops below 1 with the code above, but stays at around 20000 with the code below).
model = keras.Sequential()
model.add(keras.layers.Dense(2, activation='relu', input_shape=(1,)))
model.add(keras.layers.Dense(1))
model.compile(optimizer=tf.train.AdamOptimizer(1), loss='mean_squared_error')
xs = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
ys = np.array([100, 150, 200, 250, 300, 350, 400, 450, 500, 550], dtype=float)
model.fit(xs, ys, epochs=1000)
print(model.predict([7.0]))
One more note: when I train my model with the second set of code, the model may be well trained occasionally (~8 out of 10 times it is not well trained, and loss remains >10000 after 1000 epochs).
I don't think there is any direct way to choose the best deep architecture other than running multiple experiments, varying the hyper-parameters and changing the architecture. Compare the performance of every experiment and choose the best one. The articles listed below may be helpful for you.
link-1, link-2, link-3
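As a bare-bones illustration of that experiment-and-compare loop in Keras (the candidate architectures and learning rates below are made up; plug your own data in place of xs and ys):
import itertools
from tensorflow import keras

layer_sizes = [[8], [16, 8], [32, 16]]     # candidate architectures
learning_rates = [1e-2, 1e-3]              # candidate learning rates

results = []
for sizes, lr in itertools.product(layer_sizes, learning_rates):
    model = keras.Sequential()
    model.add(keras.layers.Dense(sizes[0], activation='relu', input_shape=(1,)))
    for n in sizes[1:]:
        model.add(keras.layers.Dense(n, activation='relu'))
    model.add(keras.layers.Dense(1))
    model.compile(optimizer=keras.optimizers.Adam(lr), loss='mean_squared_error')
    history = model.fit(xs, ys, epochs=200, verbose=0, validation_split=0.2)
    results.append((sizes, lr, history.history['val_loss'][-1]))

# pick the configuration with the lowest validation loss
print(sorted(results, key=lambda r: r[2])[0])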