In a Keras example on using an LSTM to model IMDB sequence data (https://github.com/fchollet/keras/blob/master/examples/imdb_lstm.py), there is an embedding layer before the input goes into an LSTM layer:
model.add(Embedding(max_features, 128))  # max_features = 20000
model.add(LSTM(128))
What does the embedding layer really do? In this case, does that mean the length of the input sequence into the LSTM layer is 128? If so, can I write the LSTM layer as:
model.add(LSTM(128, input_shape=(128, 1)))
But it is also noted that the input X_train has been subjected to pad_sequences processing:
print('Pad sequences (samples x time)')
X_train = sequence.pad_sequences(X_train, maxlen=maxlen) #maxlen=80
X_test = sequence.pad_sequences(X_test, maxlen=maxlen) #maxlen=80
It seems the input sequence length is 80?
To quote the documentation:
Turns positive integers (indexes) into dense vectors of fixed size.
eg. [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]
Basically this transforms indexes (that represent which words your IMDB review contained) to a vector with the given size (in your case 128).
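To make this concrete, here is a minimal sketch of what the Embedding layer does to a batch of integer indexes (the vocabulary and embedding sizes match the example; the input indexes are just illustrative):
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

# Vocabulary of 20000 words, each index mapped to a 128-dimensional vector.
model = Sequential()
model.add(Embedding(20000, 128))
model.compile('rmsprop', 'mse')  # compiled only so we can call predict

# A batch of one "review" containing the word indexes 4 and 20.
word_indexes = np.array([[4, 20]])
vectors = model.predict(word_indexes)
print(vectors.shape)  # (1, 2, 128): 1 sample, 2 timesteps, 128 features per timestep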
If you don't know what embeddings are in general, here is the wikipedia definition:
Word embedding is the collective name for a set of language modeling
and feature learning techniques in natural language processing (NLP)
where words or phrases from the vocabulary are mapped to vectors of
real numbers in a low-dimensional space relative to the vocabulary
size ("continuous space").
Coming back to the other question you've asked:
In this case, does that mean the length of the input sequence into
the LSTM layer is 128?
Not quite. For recurrent nets you'll have a time dimension and a feature dimension. 128 is your feature dimension, i.e. how many dimensions each embedding vector should have. The time dimension in your example is what is stored in maxlen, which is used to generate the training sequences.
Whatever you supply as 128 to the LSTM layer is the actual number of output units of the LSTM.
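To make the shapes concrete, here is a minimal sketch using the same numbers as the example (maxlen=80, max_features=20000, embedding size 128); the final Dense layer is just illustrative:
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

max_features = 20000   # vocabulary size
maxlen = 80            # time dimension: padded sequence length
embedding_dim = 128    # feature dimension of each embedded word
lstm_units = 128       # number of LSTM output units (independent of embedding_dim)

model = Sequential()
# The model's input is (batch, maxlen) integer indexes.
model.add(Embedding(max_features, embedding_dim, input_length=maxlen))
# The Embedding output is (batch, maxlen, embedding_dim) = (batch, 80, 128),
# so the LSTM sees 80 timesteps of 128 features each.
model.add(LSTM(lstm_units))
model.add(Dense(1, activation='sigmoid'))
model.summary()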
Hi everyone! I was reading about BERT and wanted to do text classification with its word embeddings. I came across this line of code:
pooled_output, sequence_output = self.bert_layer([input_word_ids, input_mask, segment_ids])
and then:
clf_output = sequence_output[:, 0, :]
out = Dense(1, activation='sigmoid')(clf_output)
But I can't understand the use of the pooled output. Doesn't the sequence output contain all the information, including the embedding of the [CLS] token? If so, why do we have the pooled output?
Thanks in advance!
Sequence output is the sequence of hidden states (embeddings) at the output of the last layer of the BERT model. It includes the embedding of the [CLS] token. Hence, for the sentence "You are on Stackoverflow", it gives one embedding per token: one for each of the four words (assuming the word "Stackoverflow" was tokenized into a single token), plus embeddings for the special [CLS] and [SEP] tokens.
Pooled output is the embedding of the [CLS] token (from Sequence output), further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pretraining. For further details, please refer to the BERT original paper.
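To illustrate the relationship, here is a simplified sketch (not the actual BERT implementation; the real pooler weights are learned during pretraining, whereas these are randomly initialized): pooled output is roughly a dense layer with tanh applied to the [CLS] position of the sequence output.
import tensorflow as tf

hidden_size = 768

# Dense + tanh, mimicking BERT's pooler layer (weights here are random,
# unlike the pretrained pooler weights inside BERT).
pooler = tf.keras.layers.Dense(hidden_size, activation='tanh')

def pooled_from_sequence(sequence_output):
    # sequence_output: (batch, seq_len, hidden_size) from the last BERT layer
    cls_embedding = sequence_output[:, 0, :]   # the [CLS] position
    return pooler(cls_embedding)               # (batch, hidden_size)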
If you feed in the sequence "You are on StackOverflow", the sequence_output will give you a 768-dimensional embedding for each of its tokens, whereas the pooled_output gives you a single 768-dimensional vector summarizing the whole sequence (derived from the [CLS] token as described above).
As a number of other answers have pointed out, sequence_output is token-level, with two dimensions per example: the first dimension corresponds to the number of tokens in the input text.
pooled_output is one-dimensional and seems to be some sort of a higher-order context embedding for the input text.
I initially felt they should contain practically the same information (or that sequence_output should contain more given the additional n_token dimension), but right now I'm training a semantic similarity model based on Bert and am seeing definitively better results using both sequence_output and pooled_output in the model, compared to using just sequence_output.
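For what it's worth, here is a hedged sketch of one way to feed both outputs into a downstream model (the mean pooling over tokens is just one reasonable choice, not a prescribed recipe):
import tensorflow as tf

def combine_bert_outputs(sequence_output, pooled_output):
    # sequence_output: (batch, seq_len, 768), pooled_output: (batch, 768)
    # Mean-pool the token embeddings into one vector per example.
    mean_tokens = tf.reduce_mean(sequence_output, axis=1)    # (batch, 768)
    # Concatenate with the [CLS]-based pooled vector.
    return tf.concat([mean_tokens, pooled_output], axis=-1)  # (batch, 1536)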
I have a problem where I need to predict the number of clicks based on Date, CPC(cost per click), Market(US or UK) and keywords(average length 3-4 words, max 6). So the features that I decided to input the model were:
Day (1-31), WeekDay (0-6), Month (1-12), US-Market (0 or 1), UK-Market (0 or 1), CPC (continuous numeric)
And the output is the continuous numeric Number of Clicks.
For the keywords I used the Keras Tokenizer to convert them to sequences and then padded those sequences. Then I used the GloVe word embeddings, created an embedding matrix, and fed it to the neural network model as described here in the pretrained GloVe embeddings section.
The model that I used is as follows. The last Dense layer has linear activation. The model has two inputs: nlp_input for the text data and meta_input for the numerical/categorical data. The two branches are concatenated after the Bidirectional LSTM that processes nlp_input.
The loss is:
model.compile(loss="mean_absolute_percentage_error", optimizer=opt,metrics=['acc'])
where opt = Adam(lr=1e-3, decay=1e-3 / 200)
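Since the architecture is only described in prose above, here is a minimal sketch of what such a two-input model might look like (the layer sizes, vocabulary size, GloVe dimension, and the placeholder embedding matrix are assumptions, not the exact values used):
import numpy as np
from keras.models import Model
from keras.layers import Input, Embedding, Bidirectional, LSTM, Dense, concatenate

max_seq_len = 6        # keywords are at most 6 words
vocab_size = 10000     # assumed vocabulary size
embedding_dim = 100    # assumed GloVe vector size
num_meta_features = 6  # Day, WeekDay, Month, US-Market, UK-Market, CPC

# Placeholder embedding matrix; replace with the real GloVe matrix.
embedding_matrix = np.zeros((vocab_size, embedding_dim))

# Text branch: GloVe-initialized Embedding followed by a Bidirectional LSTM.
nlp_input = Input(shape=(max_seq_len,), name='nlp_input')
x = Embedding(vocab_size, embedding_dim, weights=[embedding_matrix],
              trainable=False)(nlp_input)
x = Bidirectional(LSTM(64))(x)

# Meta branch: the numerical / categorical features.
meta_input = Input(shape=(num_meta_features,), name='meta_input')

# Concatenate both branches and regress the number of clicks.
merged = concatenate([x, meta_input])
merged = Dense(64, activation='relu')(merged)
output = Dense(1, activation='linear')(merged)

model = Model(inputs=[nlp_input, meta_input], outputs=output)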
I trained the model for 100 epochs and the loss was close to 8000 at that point.
But when I run predictions on the test set, the model outputs the same number for every test input, and that number is even negative (about -4.5e-5). Could someone guide me on how I should approach this problem and what improvements I could make to the model?
I am training an encoder-decoder LSTM in Keras for text summarization on the CNN dataset, with the following architecture:
Picture of bidirectional encoder-decoder LSTM
I am pretraining the word embeddings (of size 256) using skip-gram.
I then pad the input sequences with zeros so all articles are of equal length.
I put a vector of 1's in each summary to act as the "start" token.
I use MSE loss, the RMSProp optimizer, and tanh activation in the decoder output layer.
Training: 20 epochs, batch_size=100, clip_norm=1, dropout=0.3, hidden_units=256, LR=0.001, training examples=10000, validation_split=0.2
The network trains and training and validation MSE go down to 0.005, however during inference, the decoder keeps producing a repetition of a few words that make no sense and are nowhere near the real summary.
My question is, is there anything fundamentally wrong in my training approach, the padding, loss function, data size, training time so that the network fails to generalize?
Your model looks ok, except for the loss function. I can't figure out how MSE is applicable to word prediction. Cross-entropy loss looks like a natural choice here.
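For example, assuming your decoder ends in a softmax over the vocabulary and the target summaries are integer word indexes, the compile step for your existing model might look like this (a sketch, not your exact code):
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])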
Generated word repetition can be caused by the way the decoder works at inference time: you should not simply select the most probable word from the distribution, but rather sample from it. This will give more variance to the generated text. Start looking at beam search.
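As a concrete illustration, here is a small sketch of temperature sampling from the decoder's softmax output instead of taking the argmax (the temperature value is just an example):
import numpy as np

def sample_word(probabilities, temperature=0.8):
    # Sample a word index from a softmax distribution instead of taking argmax.
    probs = np.asarray(probabilities, dtype=np.float64)
    # Reweight the distribution: lower temperature means closer to argmax.
    probs = np.log(probs + 1e-10) / temperature
    probs = np.exp(probs) / np.sum(np.exp(probs))
    return np.random.choice(len(probs), p=probs)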
If I were to pick a single technique to boost sequence-to-sequence model performance, it would certainly be the attention mechanism. There are lots of posts about it; you can start with this one, for example.
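If you want to experiment with attention without implementing it from scratch, recent versions of TensorFlow/Keras ship a built-in dot-product attention layer; a minimal sketch with illustrative shapes:
import tensorflow as tf

# encoder_outputs: (batch, enc_steps, units), decoder_outputs: (batch, dec_steps, units)
encoder_outputs = tf.random.normal((2, 80, 256))
decoder_outputs = tf.random.normal((2, 20, 256))

# Each decoder step attends over all encoder steps.
context = tf.keras.layers.Attention()([decoder_outputs, encoder_outputs])
print(context.shape)  # (2, 20, 256)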
How can I set up a keras model such that the final LSTM layer outputs a prediction for each time step while having variable sequence lengths as input?
I'd then like to provide labels for each of the timesteps after a dense layer with linear activation.
When I try to add a reshape or a dense layer to the LSTM model that is returning the full sequence and has a masking layer to take care of variable sequence lengths, it says:
The reshape and the dense layers do not support masking.
Would this be possible to do?
You can use the TimeDistributed layer wrapper for this. This applies the layer you want to each timestep. In your case, you could also just use TimeDistributedDense.
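A minimal sketch of what that could look like with masking and a per-timestep linear output (newer Keras versions drop TimeDistributedDense in favour of wrapping Dense in TimeDistributed, which is what is shown here; the sizes are illustrative):
from keras.models import Sequential
from keras.layers import Masking, LSTM, TimeDistributed, Dense

max_timesteps = 50   # pad all sequences to this length
num_features = 10

model = Sequential()
# Timesteps that are all zeros are masked out and ignored by the LSTM.
model.add(Masking(mask_value=0.0, input_shape=(max_timesteps, num_features)))
model.add(LSTM(32, return_sequences=True))
# One linear prediction per timestep.
model.add(TimeDistributed(Dense(1, activation='linear')))
model.compile(loss='mse', optimizer='adam')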
How do you write a simple sequence copy task in Keras using the LSTM architecture without an Embedding layer? I already have the word vectors.
If you say that you have word vectors, I guess you have a dictionary to map a word to its vector representation (calculated from word2vec, GloVe...).
Using this dictionary, you replace all words in your sequence by their corresponding vectors. You will also need to make all your sequences the same length, since LSTMs need all the input sequences to be of constant length. So you will need to determine a max_length value and trim all sequences that are longer and pad all sequences that are shorter with zeros (see Keras pad_sequences function).
Then you can pass the vectorized sequences directly to the LSTM layer of your neural network. Since the LSTM layer is the first layer of the network, you will need to define the input shape, which in your case is (max_length, embedding_dim).
This way you skip the Embedding layer and use your own precomputed word vectors instead.
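A small sketch under those assumptions (the word-vector dictionary and sizes here are placeholders for your own precomputed vectors):
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.preprocessing.sequence import pad_sequences

max_length = 20      # chosen maximum sequence length
embedding_dim = 300  # dimensionality of your precomputed word vectors

# Placeholder word -> vector dictionary; use your own word2vec/GloVe vectors.
word_vectors = {'hello': np.random.rand(embedding_dim),
                'world': np.random.rand(embedding_dim)}

# Replace each word in a sequence by its vector, then pad with zero vectors.
sentences = [['hello', 'world'], ['hello']]
vectorized = [[word_vectors[w] for w in s] for s in sentences]
X = pad_sequences(vectorized, maxlen=max_length, dtype='float32',
                  padding='post', value=0.0)  # (2, max_length, embedding_dim)

model = Sequential()
model.add(LSTM(64, input_shape=(max_length, embedding_dim)))
model.add(Dense(1, activation='sigmoid'))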
I had the same problem. After searching the Keras documentation, in the "Stacked LSTM for sequence classification" section I found the following code, which might be useful:
model = Sequential()
model.add(LSTM(NumberOfLSTMUnits, return_sequences=True,
               input_shape=(YourSequenceLength, YourWord2VecLength)))