Implementing variational autoencoder in keras with reconstruction probability - keras

I'm trying to implement the variational autoencoder in keras and use reconstruction probability instead of reconstruction error for anomaly detection. There is an example in deep learning 4j and someone has already asked the same question here: Variational autoencoder and reconstruction Log Probability vs Reconstruction error
Thanks for your help

Depends on your use case. In the example below, you can take the trace of the inner product of the reconstruction matrix and the input matrix (provided it makes sense to case the reconstruction matrix as a probability). Then edit your custom loss function to return that value instead (or in addition to) of standard VAE loss. Adam doesn't care what is being optimized, however, the nice benefits of using a VAE if you are not using its loss might vanish.
From here:
def compute_log_probability(one_hot_inp,pwm_output):
prod_mat=np.matmul(one_hot_inp.T,pwm_output)
log_prod_mat=np.log(prod_mat)
sum_diag=np.trace(log_prod_mat)
return sum_diag
output = x_decoded.reshape(dim1,dim2)
output = normalize(output,axis=0, norm='l1') #column-wise normalization in this case
prob=compute_log_probability(input,output)
In the protein input case it makes sense to normalize per column because each column can only really have one value. In other cases you might want

Related

Do I need to apply the Softmax Function ANYWHERE in my multi-class classification Model?

I am currently turning my Binary Classification Model to a multi-class classification Model. Bare with me.. I am very knew to pytorch and Machine Learning.
Most of what I state here, I know from the following video.
https://www.youtube.com/watch?v=7q7E91pHoW4&t=654s
What I read / know is that the CrossEntropyLoss already has the Softmax function implemented, thus my output layer is linear.
What I then read / saw is that I can just choose my Model prediction by taking the torch.max() of my model output (Which comes from my last linear output. This feels weird because I Have some negative outputs and i thought I need to apply the SOftmax function first, but It seems to work right without it.
So know the big confusing question I have is, when would I use the Softmax function? Would I only use it when my loss doesnt have it implemented? BUT then I would choose my prediction based on the outputs of the SOftmax layer which wouldnt be the same as with the linear output layer.
Thank you guys for every answer this gets.
For calculating the loss using CrossEntropy you do not need softmax because CrossEntropy already includes it. However to turn model outputs to probabilities you still need to apply softmax to turn them into probabilities.
Lets say you didnt apply softmax at the end of you model. And trained it with crossentropy. And then you want to evaluate your model with new data and get outputs and use these outputs for classification. At this point you can manually apply softmax to your outputs. And there will be no problem. This is how it is usually done.
Traning()
MODEL ----> FC LAYER --->raw outputs ---> Crossentropy Loss
Eval()
MODEL ----> FC LAYER --->raw outputs --> Softmax -> Probabilites
Yes you need to apply softmax on the output layer. When you are doing binary classification you are free to use relu, sigmoid,tanh etc activation function. But when you are doing multi class classification softmax is required because softmax activation function distributes the probability throughout each output node. So that you can easily conclude that the output node which has the highest probability belongs to a particular class. Thank you. Hope this is useful!

Picking up the anomalies using autoencoder

Other than mean square error, are there other quantities that we can use to detect anomalies using autoencoder in keras?
Generally, the idea is to measure the reconstruction and classify anomalies as those datapoints that cause a significant deviation from the input. Thus, one can other other norms such as mae. However, the results will probably be very similar.
I would suggest different flavors of the auto encoder. First of all, if your are not already using it, the variational autoencoder is better than a standard auto encoder in all aspects.
Second, the performance of a variational autoencoder can be significantly improved by using the reconstruction probability. The idea is to output the parameters for probability distributions not only for the latent space but also for the feature space. This means that the decoder would output a mean and a variance to parameterize a normal distribution when used with continuous data. Then the reconstruction probability is basically the negative log likehood of the normal distribution N(x; decoder_mu, decoder_var). Using the 2-sigma rule, the variance can be interpreted as confidence intervall and thus even small errors can lead to an high error.
Other than that, there are other flavors like vae-gan, which combines a vae and gan uses a combined anomaly score with the reconstruction error and the discriminator prediction. Also depending on your problem type, you can also go into the route of a vae-sl that adds an additional classifier in the bottleneck. The model is then trained on mixed data which can be fully or sparsed labelled. Then the classifier can be used for anomaly detection.

Understanding choice of loss and activation in deep autoencoder?

I am following this keras tutorial to create an autoencoder using the MNIST dataset. Here is the tutorial: https://blog.keras.io/building-autoencoders-in-keras.html.
However, I am confused with the choice of activation and loss for the simple one-layer autoencoder (which is the first example in the link). Is there a specific reason sigmoid activation was used for the decoder part as opposed to something such as relu? I am trying to understand whether this is a choice I can play around with, or if it should indeed be sigmoid, and if so why? Similarily, I understand the loss is taken by comparing each of the original and predicted digits on a pixel-by-pixel level, but I am unsure why the loss is binary_crossentropy as opposed to something like mean squared error.
I would love clarification on this to help me move forward! Thank you!
MNIST images are generally normalized in the range [0, 1], so the autoencoder should output images in the same range, for easier learning. This is why a sigmoid activation is used at the output.
The mean squared error loss has a non-linear penalty, with big errors having a larger penalty than smaller errors, which generally leads to converging to the mean of the solution, instead of a more accurace solution. The binary cross-entropy does not have this problem, and thus it is preferred. It works because the output of the model and the labels are in the [0, 1] range, and the loss is applied to all pixels.

Seq2seq LSTM fails to produce sensible summaries

I am training an encoder-decoder LSTM in keras for text summarization and the CNN dataset with the following architecture
Picture of bidirectional encoder-decoder LSTM
I am pretraining the word embedding (of size 256) using skip-gram and
I then pad the input sequences with zeros so all articles are of equal length
I put a vector of 1's in each summary to act as the "start" token
Use MSE, RMSProp, tanh activation in the decoder output later
Training: 20 epochs, batch_size=100, clip_norm=1,dropout=0.3, hidden_units=256, LR=0.001, training examples=10000, validation_split=0.2
The network trains and training and validation MSE go down to 0.005, however during inference, the decoder keeps producing a repetition of a few words that make no sense and are nowhere near the real summary.
My question is, is there anything fundamentally wrong in my training approach, the padding, loss function, data size, training time so that the network fails to generalize?
Your model looks ok, except for the loss function. I can't figure out how MSE is applicable to word prediction. Cross-entropy loss looks like a natural choice here.
Generated word repetition can be caused by the way the decoder works at inference time: you should not simply select the most probable word from the distribution, but rather sample from it. This will give more variance to the generated text. Start looking at beam search.
If I were to pick a single technique to boost sequence to sequence model performance, it's certainly attention mechanism. There are lots of post about it, you can start with this one, for example.

predict() returns image similarities with SVM in scikit learn

A silly question: after i train my SVM in scikit-learn i have to use predict function: predict(X) for predicting at which class belongs? (http://scikit-learn.org/dev/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC.predict)
X parameter is the image feature vector?
In case i give an image not trained (not trained because SVM ask at least 3 samples for class), what returns?
First remark: "predict() returns image similarities with SVM in scikit learn" is not a question. Please put a question in the header of Stack Overflow entries.
Second remark: the predict method of the SVC class in sklearn does not return "image similarities" but a class assignment prediction. Read the http://scikit-learn.org documentation and tutorials to understand what we mean by classification and prediction in machine learning.
X parameter is the image feature vector?
No, X is not "the image" feature vector: it is a set of image feature vectors with shape (n_samples, n_features) as explained in the documentation you refer to. In your case a sample is an image hence the expected shape would be (n_images, n_features). The predict API was design to compute many predictions at once for efficiency reason. If you want to compute a single prediction, you will have to wrap your single feature vector in an array with shape (1, n_features).
For instance if you have a single feature vector (1D) called my_single_image_features with shape (n_features,) you can call predict with:
predictions = clf.predict([my_single_image_features])
my_single_prediction = predictions[0]
Please note the [] signs around the my_single_image_features variable to turn it into a 2D array.
my_single_prediction will be an integer whose meaning depends on the integer values provided by you when calling the clf.fit(X_train, y_train) method in the first place.
In case i give an image not trained (not trained because SVM ask at least 3 samples for class), what returns?
An image is not "trained". Only the model is trained. Of course you can pass samples / images that are not part of the training set to the predict method. This is the whole purpose of machine learning: making predictions on new unseen data based on what you learn from the statistical regularities seen in the past training data.

Resources