I am implementing a custom loss function in keras. The model is an autoencoder. The first layer is an Embedding layer, which embed an input of size (batch_size, sentence_length) into (batch_size, sentence_length, embedding_dimension). Then the model compresses the embedding into a vector of a certain dimension, and finaly must reconstruct the embedding (batch_size, sentence_lenght, embedding_dimension).
But the embedding layer is trainable, and the loss must use the weights of the embedding layer (I have to sum over all word embeddings of my vocabulary).
For exemple, if I want to train on the toy exemple : "the cat". The sentence_length is 2 and suppose embedding_dimension is 10 and the vocabulary size is 50, so the embedding matrix has shape (50,10). The Embedding layer's output X is of shape (1,2,10). Then it passes in the model and the output X_hat, is also of shape (1,2,10). The model must be trained to maximize the probability that the vector X_hat[0] representing 'the' is the most similar to the vector X[0] representing 'the' in the Embedding layer, and same thing for 'cat'. But the loss is such that I have to compute the cosine similarity between X and X_hat, normalized by the sum of cosine similarity of X_hat and every embedding (50, since the vocabulary size is 50) in the embedding matrix, which are the columns of the weights of the embedding layer.
But How can I access the weights in the embedding layer at each iteration of the training process?
Thank you !
It seems a bit crazy but it seems to work : instead of creating a custom loss function that I would pass in model.compile, the network computes the loss (Eq. 1 from arxiv.org/pdf/1708.04729.pdf) in a function that I call with Lambda :
loss = Lambda(lambda x: similarity(x[0], x[1], x[2]))([X_hat, X, embedding_matrix])
And the network has two outputs: X_hat and loss, but I weight X_hat to have 0 weight and loss to have all the weight :
model = Model(input_sequence, [X_hat, loss])
model.compile(loss=mean_squared_error,
optimizer=optimizer,
loss_weights=[0., 1.])
When I train the model :
for i in range(epochs):
for j in range(num_data):
input_embedding = model.layers[1].get_weights()[0][[data[j:j+1]]]
y = [input_embedding, 0] #The embedding of the input
model.fit(data[j:j+1], y, batch_size=1, ...)
That way, the model is trained to tend loss toward 0, and when I want to use the trained model's prediction I use the first output which is the reconstruction X_hat
Related
I am using a GPT2 model that outputs logits (before softmax) in the shape (batch_size, num_input_ids, vocab_size) and I need to compare it with the labels that are of shape (batch_size, num_input_ids) to calculate BCELoss. How do I calculate it?
logits = output.logits #--of shape (32, 56, 592)
logits = torch.nn.Softmax()(logits)
labels = labels #---------of shape (32, 56)
torch.nn.BCELoss()(logits, labels)
but the dimensions do not match, so how do I contract logits to labels shape or expand labels to logits shape?
Binary cross-entropy is used when the final classification layer is a sigmoid layer, i.e., for each output dimension, only a true/false output is possible. You can imagine it as assigning some tags to the input. This also means that the labels need to have the same dimension as the logits, having 0/1 for each logit. Statistically speaking, for 592 output dimensions, you predict 592 Bernoulli (= binary) distributions. The expected shape is 32 × 56 × 592.
When using the softmax layer, you assume only one target class is possible; you predict a single categorical distribution over 592 possible output classes. However, in this case, the correct loss function is not binary cross-entropy but categorical cross-entropy, implemented by the CrossEntropyLoss class in PyTorch. Note that it takes the logits directly before the softmax normalization and does the normalization internally. The expected shape is 32 × 56, as in the code snippet.
Suppose you have a timeseries classification task with n_classes possible classes, and you want to output the probability of each class for every timestep (like seq2seq). How can we achieve mult-step multi-class classification with a Conv1D network?
With a RNN it would be something like:
# input shape (n_samples, n_timesteps, n_features)
layer = LSTM(n_neurons, return_sequences=True, input_shape=(n_timesteps n_features))
layer = Dense(n_classes, activation="softmax")(layer)
# objective output shape (n_samples, n_timesteps, n_classes)
keras would know that since the LSTM layer is outputting sequences then it should wrap the Dense layer around a TimeDistributed layer.
How would we achieve the same but with a Conv1D layer? I imagine we need to manually wrap the last Dense layer with a time distributed, but I don't know how to build the network before to output sequences
# input shape (n_samples, n_timesteps, n_features)
layer = Conv1D(n_neurons, 3, input_shape=(n_timesteps n_features))
# ... ?
# can we use a Flatten layer in this case?
layer = TimeDistributed(Dense(n_classes, activation="softmax"))(layer)
# objective output shape (n_samples, n_timesteps, n_classes)
I Make a sentiment analysis model using LSTM but my model gives very bad prediction.
Here is the complete code
Dataset for amazon review
My LSTM model looks like this:
def ltsm_model(input_shape, word_to_vec_map, word_to_index):
"""
Function creating the ltsm_model model's graph.
Arguments:
input_shape -- shape of the input, usually (max_len,)
word_to_vec_map -- dictionary mapping every word in a vocabulary into its 50-dimensional vector representation
word_to_index -- dictionary mapping from words to their indices in the vocabulary (400,001 words)
Returns:
model -- a model instance in Keras
"""
### START CODE HERE ###
# Define sentence_indices as the input of the graph, it should be of shape input_shape and dtype 'int32' (as it contains indices).
sentence_indices = Input(shape=input_shape, dtype='int32')
# Create the embedding layer pretrained with GloVe Vectors (≈1 line)
embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index)
# Propagate sentence_indices through your embedding layer, you get back the embeddings
embeddings = embedding_layer(sentence_indices)
# Propagate the embeddings through an LSTM layer with 128-dimensional hidden state
# Be careful, the returned output should be a batch of sequences.
X = LSTM(128, return_sequences=True)(embeddings)
# Add dropout with a probability of 0.5
X = Dropout(0.5)(X)
# Propagate X trough another LSTM layer with 128-dimensional hidden state
# Be careful, the returned output should be a single hidden state, not a batch of sequences.
X = LSTM(128, return_sequences=False)(X)
# Add dropout with a probability of 0.5
X = Dropout(0.5)(X)
# Propagate X through a Dense layer with softmax activation to get back a batch of 5-dimensional vectors.
X = Dense(2, activation='relu')(X)
# Add a softmax activation
X = Activation('softmax')(X)
# Create Model instance which converts sentence_indices into X.
model = Model(inputs=[sentence_indices], outputs=X)
### END CODE HERE ###
return model
Here is what my training dataset looks like:
This is my testing data:
x_test = np.array(['amazing!: this soundtrack is my favorite music..'])
X_test_indices = sentences_to_indices(x_test, word_to_index, maxLen)
print(x_test[0] +' '+ str(np.argmax(model.predict(X_test_indices))))
I got following out for this:
amazing!: this soundtrack is my favorite music.. 0
But it should be positive sentiment and should be 1
Also this my fit model output:
How can I improve my model performance? This pretty bad model I suppose.
I want to set Euclidean distance as a loss function for LSTM or RNN.
What output should such function have: float, (batch_size) or (batch_size, timesteps)?
Model input X_train is (n_samples, timesteps, data_dim).
Y_train has the same dimensions.
Example code:
def euc_dist_keras(x, y):
return K.sqrt(K.sum(K.square(x - y), axis=-1, keepdims=True))
model = Sequential()
model.add(SimpleRNN(n_units, activation='relu', input_shape=(timesteps, data_dim), return_sequences=True))
model.add(Dense(n_output, activation='linear'))
model.compile(loss=euc_dist_keras, optimizer='adagrad')
model.fit(y_train, y_train, batch_size=512, epochs=10)
So, should I average loss over timesteps dimension and/or batch_size?
A loss functions will take predicted and true labels and will output a scalar, in Keras:
from keras import backend as K
def euc_dist_keras(y_true, y_pred):
return K.sqrt(K.sum(K.square(y_true - y_pred), axis=-1, keepdims=True))
Note, that it will not take X_train as an input. The loss calculation follows the forward propagation step, and it's value provides the goodness of predicted labels compared to true labels.
What output should such function have: float, (batch_size) or
(batch_size, timesteps)?
The loss function should have scalar output.
So, should I average loss over timesteps dimension and/or batch_size?
This would not be required to be able to use Euclidean distance as a loss function.
Side note: In your case, I think the problem might be with the neural network architecture, not the loss. Given (batch_size, timesteps, data_dim) the output of the SimpleRNN will be (batch_size, timesteps, n_units), and the output of Dense layer will be (batch_size, n_output). Thus, given your Y_train has the shape (batch_size, timesteps, data_dim) you would likely need to use TimeDistributed wrapper applying Dense per every temporal slice, and to adjust the number of hidden units in the fully connected layer.
I'm trying to use custom word-embeddings from Spacy for training a sequence -> label RNN query classifier. Here's my code:
word_vector_length = 300
dictionary_size = v.num_tokens + 1
word_vectors = v.get_word_vector_dictionary()
embedding_weights = np.zeros((dictionary_size, word_vector_length))
max_length = 186
for word, index in dictionary._get_raw_id_to_token().items():
if word in word_vectors:
embedding_weights[index,:] = word_vectors[word]
model = Sequential()
model.add(Embedding(input_dim=dictionary_size, output_dim=word_vector_length,
input_length= max_length, mask_zero=True, weights=[embedding_weights]))
model.add(Bidirectional(LSTM(128, activation= 'relu', return_sequences=False)))
model.add(Dense(v.num_labels, activation= 'sigmoid'))
model.compile(loss = 'binary_crossentropy',
optimizer = 'adam',
metrics = ['accuracy'])
model.fit(X_train, Y_train, batch_size=200, nb_epoch=20)
here the word_vectors are stripped from spacy.vectors and have length 300, the input is an np_array which looks like [0,0,12,15,0...] of dimension 186, where the integers are the token ids in the input, and I've constructed the embedded weight matrix accordingly. The output layer is [0,0,1,0,...0] of length 26 for each training sample, indicating the label that should go with this piece of vectorized text.
This looks like it should work, but during the first epoch the training accuracy is continually decreasing... and by the end of the first epoch/for the rest of training, it's exactly 0 and I'm not sure why this is happening. I've trained plenty of models with keras/TF before and never encountered this issue.
Any idea what might be happening here?
Are the labels always one-hot? Meaning only one of the elements of the label vector is one and the rest zero.
If so, then maybe try using a softmax activation with a categorical crossentropy loss like in the following official example:
https://github.com/fchollet/keras/blob/master/examples/babi_memnn.py#L202
This will help constraint the network to output probability distributions on the last layer (i.e. the softmax layer outputs sum up to 1).