I am looking to perform a binary classification on each frame of a video. I have ground truth labels and the feature vectors for each frame of several videos, but am not sure how I could approach this as a time series problem to train an RNN.
How could I use the predictions on previous frames in a video to help inform future predictions? I think this could be accomplished with a one-to-one RNN, though I can't find any relevant examples.
Your task is quite similar to the named-entity recognition (NER) task in NLP. In NER you have a sequence of words, and for each word you want to determine whether it is a named entity (say, a person or company name) or not.
A many-to-many RNN architecture should be suitable for this task.
Please check, for example, the following article:
https://towardsdatascience.com/named-entity-recognition-ner-using-keras-bidirectional-lstm-28cd3f301f54
Instead of words you have video-frame features, so you would need to replace the first Embedding layer with a Dense layer. Also note that the above article uses N classes as labels rather than just a binary label. So, as a starting point, you can try something like:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, LSTM, TimeDistributed

model = Sequential()
model.add(Input(shape=(None, n_features)))  # n_features = size of one frame's feature vector
model.add(TimeDistributed(Dense(16, activation="relu")))
model.add(LSTM(32, return_sequences=True))  # emit an output for every frame
model.add(TimeDistributed(Dense(1, activation="sigmoid")))  # sigmoid for binary labels
model.compile(loss='binary_crossentropy', optimizer='nadam', metrics=['accuracy'])
The model takes a sequence of video-frame features as input and the corresponding sequence of ground-truth labels as output for each video.
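For instance, with hypothetical shapes (10 videos of 200 frames each, 512 features per frame, i.e. n_features = 512 above), fitting would look like:

import numpy as np

X = np.random.rand(10, 200, 512).astype('float32')  # (videos, frames, features)
y = np.random.randint(0, 2, size=(10, 200, 1))      # one binary label per frame
model.fit(X, y, epochs=10, batch_size=2)

Videos of different lengths would need padding plus masking, or per-video batches.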
I have collected a dataset of paragraph-summary pairs, where the summary may or may not correspond to the paragraph it is paired with. I also have labels indicating whether a summary corresponds to its paragraph (1 if it is a corresponding pair, 0 if it is not).
I would like to use the pretrained Pegasus_large model in Huggingface (off-the-shelf) and train it on this downstream classification task.
Since Pegasus does not have a CLS token, I have been thinking about possible ways of doing this.
I want to concatenate the paragraph and summary together, pass them through the pretrained Pegasus encoder only, and then pool over the final hidden-layer outputs of the encoder. If I use the Huggingface PegasusModel (the one without a summary-generation head), it expects me to provide decoder_input_ids, which I assume are the true tokens (labels) when Pegasus is trained as a seq2seq model for summary generation. However, since I am not training my model to generate summaries and only want the encoder representation, I am not sure what to put as my decoder_input_ids.
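For concreteness, here is a rough sketch of the encoder-only approach I have in mind (the plain string concatenation and the mean-pooling are my own guesses, not anything prescribed by the Pegasus docs):

import torch
from transformers import PegasusTokenizer, PegasusModel

tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-large")
model = PegasusModel.from_pretrained("google/pegasus-large")
encoder = model.get_encoder()  # use the encoder alone; no decoder_input_ids needed

text = paragraph + " " + summary  # naive concatenation of the pair (placeholders)
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    enc_out = encoder(input_ids=inputs["input_ids"],
                      attention_mask=inputs["attention_mask"])
hidden = enc_out.last_hidden_state               # (1, seq_len, d_model)
mask = inputs["attention_mask"].unsqueeze(-1)    # exclude padding from the pool
pooled = (hidden * mask).sum(1) / mask.sum(1)    # (1, d_model) pair representation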
My questions are: 1. Am I right in assuming that decoder_input_ids are only used when training the model for sequence generation, and 2. How can I get the last hidden-layer outputs of the encoder without providing any decoder_input_ids?
In deep learning with Keras, I have usually come across model.fit looking something like this:
model.fit(x_train, y_train, epochs=50, callbacks=[es], batch_size=512, validation_data=(x_val, y_val))
Whereas in NLP tasks, I have seen some articles on text summarization using an LSTM encoder-decoder with attention, and I usually come across this code for fitting the model, which I'm not able to comprehend:
model.fit([x_tr, y_tr[:, :-1]],
          y_tr.reshape(y_tr.shape[0], y_tr.shape[1], 1)[:, 1:],
          epochs=50, callbacks=[es], batch_size=512,
          validation_data=([x_val, y_val[:, :-1]],
                           y_val.reshape(y_val.shape[0], y_val.shape[1], 1)[:, 1:]))
And I have found no explanation of why it is done this way. Can someone explain the above code? It comes from https://www.analyticsvidhya.com/blog/2019/06/comprehensive-guide-text-summarization-using-deep-learning-python/
Please note: I have contacted the person who wrote the article but got no response from him.
Just saw your question. Anyway, if anyone has a similar question, here is an explanation.
The model.fit() call fits the training data, with the batch size set to 512 in your case. The input is the text together with the summary excluding its last word, and the target is the summary reshaped to 3D and shifted by one word (starting from the second word). This shifting is what teaches the model to predict each word given the previous word (teacher forcing). The validation data is passed in the same shifted form to enable validation during the training phase.
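A toy illustration of the shifting (my own example, not from the article):

import numpy as np

y_tr = np.array([[1, 5, 9, 2]])                 # e.g. <start> w1 w2 <end>, shape (1, 4)
decoder_input = y_tr[:, :-1]                    # [[1, 5, 9]]  -> feed <start> w1 w2
decoder_target = y_tr.reshape(1, 4, 1)[:, 1:]   # [[[5], [9], [2]]] -> predict w1 w2 <end>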
I'm trying to build a Keras model to classify text into 45 different classes. I'm a little confused about preparing my data for the input as required by Google's BERT model.
Some blog posts feed the data in as a tf.data dataset with input_ids, segment ids, and mask ids, as in this guide, but some use only input_ids and masks, as in this guide.
Also, the second guide notes that the segment-mask and attention-mask inputs are optional.
Can anyone explain whether or not those two are required for a multiclass classification task?
If it helps, each row of my data can consist of any number of sentences within a reasonably sized paragraph. I want to be able to classify each paragraph/input to a single label.
I can't seem to find many guides/blogs about using BERT with Keras (TensorFlow 2) for a multiclass problem; indeed, many of them cover multi-label problems instead.
I guess it is too late to answer, but I had the same question. I went through the Huggingface code and found that if attention_mask and token_type_ids are None, then by default the model pays attention to all tokens and all segments are given id 0.
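A quick way to see this in practice (my own sketch; it assumes a single segment and no padding, where the defaults coincide with the explicit inputs):

import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = TFBertModel.from_pretrained("bert-base-uncased")

enc = tokenizer("a short paragraph to classify", return_tensors="tf")
out_full = model(input_ids=enc["input_ids"],
                 attention_mask=enc["attention_mask"],
                 token_type_ids=enc["token_type_ids"])
out_min = model(input_ids=enc["input_ids"])  # mask and segment ids default internally

# with one segment and no padding, the two runs should match
print(tf.reduce_max(tf.abs(out_full.last_hidden_state - out_min.last_hidden_state)))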
If you want to check it out, you can find the code here
Let me know if this clarifies it or you think otherwise.
I have a set of sentences and their scores, and I would like to train a marking system that can predict the score for a given sentence. One example looks like this:
(X =Tomorrow is a good day, Y = 0.9)
I would like to use an LSTM to build this marking system, and also to consider the sequential relationship between the words in the sentence, so the training example shown above is transformed as follows:
(x1=Tomorrow, y1=is) (x2=is, y2=a) (x3=a, y3=good) (x4=good, y4=day) (x5=day, y5=0.9)
When training this LSTM, I would like the first time steps to use a softmax classifier and the final step to use MSE. So the loss for this LSTM is composed of two different loss functions, and it seems Keras does not provide a way to address my problem directly. In addition, I am not sure whether my way of building the marking system is correct.
Keras supports multiple loss functions as well:
# `lang_model` and `sent_model` are the two output tensors of the network:
# a next-word (softmax) head and a score-regression head
model = Model(inputs=inputs, outputs=[lang_model, sent_model])
model.compile(optimizer='sgd',
              loss=['categorical_crossentropy', 'mse'],  # one loss per output
              metrics=['accuracy'],
              loss_weights=[1., 1.])
Based on your explanation, I think you need a model that first predicts a token based on the previous tokens (in the NLP domain this is usually called a language model) and then computes a score, which I assume is a sentiment score (the same idea applies to other domains).
To do so, you can train your language model with an LSTM and take the LSTM's last output for the ranking task. For this you need to define two loss functions: categorical_crossentropy for the language model and MSE for the ranking task.
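A minimal sketch of how the two output tensors used above could be wired up (my own construction; vocab_size, maxlen, and the layer sizes are placeholders):

from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, TimeDistributed, Lambda

inputs = Input(shape=(maxlen,))
h = LSTM(64, return_sequences=True)(Embedding(vocab_size, 64)(inputs))

# language-model head: next-word softmax at every time step
lang_model = TimeDistributed(Dense(vocab_size, activation='softmax'))(h)
# scoring head: regress the sentence score from the last time step's output
sent_model = Dense(1)(Lambda(lambda t: t[:, -1, :])(h))
# `model` is then built and compiled exactly as in the snippet above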
This tutorial would be helpful: https://www.pyimagesearch.com/2018/06/04/keras-multiple-outputs-and-multiple-losses/
I was trying to use Keras to do an entity relation extraction task.
My model looks like the example code in Keras's imdb_bidirectional_lstm.py:
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, Dropout, Dense

model = Sequential()
model.add(Embedding(max_features, 128, input_length=maxlen))
model.add(Bidirectional(LSTM(64)))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
However, unlike the IMDB classification task, a relation is tied to specific entities in the sentence, and there may be several relations in one sentence. So I want to get the BiLSTM-layer outputs for specific entity words and then concatenate them.
For example, in the sentence "In Baghdad, a cameraman died when an American tank fired on the Palestine Hotel." there are several relations. If I want to get the relation between "cameraman" and "tank", I need to take the outputs for "cameraman" and "tank" from the BiLSTM layer and send them into an MLP. So what should I do to get the outputs for "cameraman" and "tank" from the BiLSTM layer? I have tried the output attribute of the layer, but it seems infeasible.
It may sound confusing. To be brief: how can I get the output of a specific time step in an LSTM layer?
Any suggestion will be appreciated. Thank you very much!
The return_sequences=True parameter gives you the outputs of all time steps. Then you need to write a custom layer to extract the outputs of the particular steps you need. AFAIK there is no direct way in Keras to achieve that, but it should not be too hard to write such a custom layer.
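A minimal sketch of that idea (my own; the entity positions are passed in as a second input, the sizes mirror the model above, and the single sigmoid output stands in for whatever relation classifier you actually need):

import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, Bidirectional, LSTM, Lambda, Dense
from tensorflow.keras.models import Model

tokens = Input(shape=(maxlen,), dtype='int32')     # token ids
positions = Input(shape=(2,), dtype='int32')       # indices of the two entity words
h = Bidirectional(LSTM(64, return_sequences=True))(
        Embedding(max_features, 128)(tokens))      # (batch, maxlen, 128)

# gather the BiLSTM outputs at the two entity positions, per example
picked = Lambda(lambda args: tf.gather(args[0], args[1], batch_dims=1))([h, positions])
flat = Lambda(lambda t: tf.reshape(t, (-1, 2 * 128)))(picked)
relation = Dense(1, activation='sigmoid')(Dense(64, activation='relu')(flat))  # MLP head
model = Model([tokens, positions], relation)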