Why does my variational autoencoder only produce positive values? - pytorch

I copied this example to build a variational autoencoder (VAE). The example uses images, but I use it for a signal that contains negative values. After training, the autoencoder only reconstructs the positive part of the signal, it does not produce negative values. Can anyone spot where the problem is or explain why this is the case?

If you used the exact code from the example you linked, then at the end of the decoder you have x = torch.sigmoid(self.decConv2(x)), which maps any real number to the range [0, 1]. This is why the network is unable to output negative numbers.
If you want to change the model to output negative numbers as well, remove the sigmoid function.
This means of course that you also have to change the loss function with which you train your model since the BCE loss is only good for outputs in the range of [0, 1].
As a recommendation, I would suggest using the BCE-with-logits loss (nn.BCEWithLogitsLoss) and dropping the sigmoid from the decoder, since it combines the sigmoid and the BCE loss in a more numerically stable way.
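A minimal sketch of that change, assuming a decoder whose final layer is decConv2 as in the linked example (the function name and shapes here are illustrative, not the example's exact code):

import torch.nn as nn

# Last decoder step: return the raw layer output instead of squashing it with a
# sigmoid, so the reconstruction can take any real value (including negatives).
# `decConv2` stands in for the example's final decoder layer.
def decode_output(decConv2, x):
    return decConv2(x)  # no torch.sigmoid here

# If the targets are still in [0, 1], BCEWithLogitsLoss applies the sigmoid
# internally in a numerically stable way, replacing the sigmoid + nn.BCELoss pair.
criterion = nn.BCEWithLogitsLoss(reduction='sum')
# For a signal with genuinely negative target values, a loss on the raw outputs
# (e.g. nn.MSELoss) would be needed instead.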

Related

Why can't I use softmax in a regression task for probabilities?

I have a supervised learning task f(X) = y where X is a 2-dimensional np.array of np.int8 and y is a 1-dimensional array of np.float64 containing probabilities (so numbers between 0 and 1). I want to build a neural network model that performs regression in order to predict said probabilities y given X.
As the output of my Network is one real value (i.e. the output layer has one neuron) and is a probability (so in the range [0, 1]), I believe I should use softmax as the activation function of the output layer (i.e. output neuron) in order to squash the network's output to [0, 1].
As it is a regression task, I opted for using the mean_squared_error loss (instead of cross_entropy_loss that is typically used in classification tasks and often paired with softmax).
However, as I am trying to fit(X, y) the loss does not change at all between epochs and remains constant. Any ideas why? Is the combination of softmax and mean_squared_error loss wrong for some reason and why?
If I remove the softmax it does work, but then my model would also predict non-probabilities, which I do not want. Yes, I could squash the output myself later, but that doesn't seem right.
My code is basically the following (after removing some irrelevant additional callbacks for EarlyStopping and learning-rate scheduling):
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

model = Sequential()
model.add(Dense(W1_size, input_shape=(input_dims,), activation='relu'))
model.add(Dense(1, activation='softmax'))
# compile model
model.compile(optimizer=Adam(), loss='mse')  # mse is the standard loss for regression
# fit
model.fit(X, y, batch_size=batch_size, epochs=MAX_EPOCHS)
Edit: Turns out I needed the sigmoid function to squash one real value to [0, 1] as the accepted answer suggests. The softmax function for a vector of size 1 is always 1.
As you stated, you want to perform a regression task (which means finding a continuous mapping between your input and desired output).
The softmax function creates a pseudo-probability distribution over a multi-dimensional output (all values sum up to 1). This is why softmax fits classification tasks well (predicting probabilities for the different classes).
Since you want to perform a regression task and your output is one-dimensional, softmax will not work properly: for a one-dimensional input it is always 1.
A function which maps a one-dimensional input continuously to [0, 1] works fine here (e.g. the sigmoid).
Note that you can interpret the output of both the sigmoid and the softmax function as probabilities. But be careful: these are only pseudo-probabilities, and they do not represent the certainty or uncertainty of your model in making its predictions.
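A short sketch of both points, with hypothetical layer sizes: softmax over a single value is always 1, while a sigmoid output layer gives a usable value in (0, 1).

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

# Softmax over a length-1 vector is always 1, so the loss never changes.
x = np.array([2.7])
print(np.exp(x) / np.exp(x).sum())  # -> [1.]

# A sigmoid output squashes one real value into (0, 1), which is what is wanted here.
model = Sequential()
model.add(Dense(64, input_shape=(10,), activation='relu'))  # hypothetical sizes
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer=Adam(), loss='mse')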

Why does almost every Activation Function Saturate at Negative Input Values in a Neural Network

This may be a very basic/trivial question.
For Negative Inputs,
Output of ReLu Activation Function is Zero
Output of Sigmoid Activation Function is Zero
Output of Tanh Activation Function is -1
Below Mentioned are my questions:
Why is it that all of the above activation functions saturate for negative input values?
Is there any activation function we can use if we want to predict a negative target value?
Thank you.
True - ReLU is designed to result in zero for negative values. (It can be dangerous with big learning rates, bad initialization or with very few units - all neurons can get stuck in zero and the model freezes)
False - Sigmoid results in zero for "very negative" inputs, not for "negative" inputs. If your inputs are between -3 and +3, you will see a very pleasant result between 0 and 1.
False - The same comment as Sigmoid. If your inputs are between -2 and 2, you will see nice results between -1 and 1.
So, the saturation problem only exists for inputs whose absolute values are too big.
By definition, the outputs are:
ReLU: 0 <= y < inf (centered at 0)
Sigmoid: 0 < y < 1 (centered at 0.5)
TanH: -1 < y < 1 (centered at 0)
You might want to use a BatchNormalization layer before these activations to avoid having big values and avoid saturation.
For predicting negative outputs, tanh is the only one of the three that is capable of doing that.
You could invent a negative sigmoid, though; it's pretty easy:
from keras import backend as K
from keras.layers import Activation

def neg_sigmoid(x):
    return -K.sigmoid(x)

# use the layer:
Activation(neg_sigmoid)
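As a sketch of the BatchNormalization suggestion a few lines above (layer sizes are made up):

from keras.models import Sequential
from keras.layers import Dense, BatchNormalization, Activation

# BatchNormalization before the activation keeps the activation's inputs from
# growing too large in magnitude, which is what pushes sigmoid/tanh into saturation.
model = Sequential()
model.add(Dense(32, input_shape=(8,)))
model.add(BatchNormalization())
model.add(Activation('tanh'))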
In short, negative/positive doesn't matter for these activation functions.
Sigmoid and tanh both saturate for very positive and very negative values. As stated in the comments, they are symmetric around the input 0. ReLU only saturates for negative values, but I'll explain why that doesn't matter in the next question.
The answer is that an activation function doesn't need to 'predict' a negative value. The point of the activation function is not to give an equation for predicting your final value, but to give the middle layers of your neural network a non-linearity. You then use an appropriate function at the last layer to get the wanted output values, e.g. softmax for classification or just a linear layer for regression.
So, because these activation functions sit in the middle, it really doesn't matter if an activation function only outputs positive values even when your 'wanted' values are negative, since the model can make the weights of the next layers negative (hence the phrase 'wanted values are negative' doesn't mean anything here).
So ReLU being saturated on the negative side is no different from it being saturated on the positive side. There are activation functions that don't saturate, such as Leaky ReLU (see the sketch below), so you may want to check it out. But for activation functions, the positive/negative point doesn't matter.
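A minimal sketch of the Leaky ReLU idea mentioned above (layer sizes are hypothetical):

from keras.models import Sequential
from keras.layers import Dense, LeakyReLU

# LeakyReLU keeps a small non-zero slope for negative inputs, so it does not
# saturate on the negative side the way plain ReLU does.
model = Sequential()
model.add(Dense(32, input_shape=(8,)))
model.add(LeakyReLU(alpha=0.1))
model.add(Dense(1))  # linear output layer, so negative targets can still be predicted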
The key idea behind introducing the ReLU activation function was to address the issue of vanishing gradients in deeper networks. However, depending on the initialization, when the weights go above 1 this can lead to exploding gradient values and cause the network to saturate. Another key idea behind ReLU was to introduce sparsity into the network: put simply, it prunes the connections deemed unimportant (the negative values are zeroed out). We do have to be careful with the distribution of the weights at initialization, or the network can end up too sparse and unable to learn anything more.
Sigmoid - The key problem with sigmoid for gradient-based learning rules is that its derivative goes to 0 for inputs of very large magnitude, causing vanishing gradients. Note that sigmoid is not a problem specifically for negative values; it also saturates for large positive input values.
Tanh - The idea behind tanh is to avoid the sparsity enforced by ReLU and to exploit richer network dynamics for learning, similar to the sigmoid function. Put more simply, tanh tries to use the entire network's capacity to learn, while addressing the vanishing gradient problem in a way similar to ReLU. Having a negative range also acts as a dynamic regularizer (negative values are pulled strongly towards -1 and values near 0 go towards 0), which is useful for binary classification or problems with few classes.
This link has some good information that would be helpful for you.

Understanding choice of loss and activation in deep autoencoder?

I am following this keras tutorial to create an autoencoder using the MNIST dataset. Here is the tutorial: https://blog.keras.io/building-autoencoders-in-keras.html.
However, I am confused by the choice of activation and loss for the simple one-layer autoencoder (which is the first example in the link). Is there a specific reason sigmoid activation was used for the decoder part as opposed to something such as relu? I am trying to understand whether this is a choice I can play around with, or if it should indeed be sigmoid, and if so why? Similarly, I understand the loss is taken by comparing each of the original and predicted digits on a pixel-by-pixel level, but I am unsure why the loss is binary_crossentropy as opposed to something like mean squared error.
I would love clarification on this to help me move forward! Thank you!
MNIST images are generally normalized in the range [0, 1], so the autoencoder should output images in the same range, for easier learning. This is why a sigmoid activation is used at the output.
The mean squared error loss has a non-linear penalty, with big errors penalized more heavily than smaller ones, which generally leads to converging to the mean of the solutions rather than to a more accurate one. Binary cross-entropy does not have this problem and is therefore preferred. It works because the output of the model and the labels are both in the [0, 1] range, and the loss is applied to every pixel.
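For reference, a minimal sketch in the spirit of the tutorial's single-layer autoencoder (the encoding size and optimizer are assumptions, not necessarily the tutorial's exact choices):

from keras.models import Model
from keras.layers import Input, Dense

# 784 = flattened 28x28 MNIST pixels; 32 is an assumed encoding size.
inp = Input(shape=(784,))
encoded = Dense(32, activation='relu')(inp)
# sigmoid keeps the reconstruction in [0, 1], matching the normalized pixel values
decoded = Dense(784, activation='sigmoid')(encoded)

autoencoder = Model(inp, decoded)
# binary_crossentropy compares each output pixel in [0, 1] against the input pixel
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')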

When to use bias in Keras model?

I am new to modeling with Keras. I am trying to evaluate appropriate parameters for setting up the model. How do I know when you use bias vs when to turn it off?
The short answer is: when your model is small, always use bias variables; and in general it is still recommended to keep bias terms in all neural network architectures.
That is because each neuron behaves like a simple logistic regression: the input values are multiplied by the weights, and the bias shifts the operating point of the sigmoid, which produces the desired offset on top of the non-linearity.
For example, if you have an all-zero input in your training data, like X = [[0,0,...], [0,0,...], ...] with Y = 1, then without a bias a sigmoid unit will always output exactly 0.5, since X*W is zero. However, in large networks, each node can effectively build a bias out of the average activation of all of its inputs.
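A tiny sketch of that zero-input case (shapes are hypothetical); use_bias is the flag Keras layers use to toggle the bias term:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Without a bias, a zero input always gives sigmoid(0) = 0.5, whatever the weights are.
model = Sequential()
model.add(Dense(1, input_shape=(4,), activation='sigmoid', use_bias=False))
print(model.predict(np.zeros((1, 4))))  # -> [[0.5]]
# With use_bias=True (the default), the layer can shift its output away from 0.5.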

Keras softmax activation, category_crossentropy loss. But output is not 0, 1

I trained a CNN model for just one epoch with very little data. I use Keras 2.05.
Here are the CNN model's (partial) last layers, with number_outputs = 201. The training data outputs are one-hot encoded over the 201 classes.
model.add(Dense(200, activation='relu', name='full_2'))
model.add(Dense(40, activation='relu', name='full_3'))
model.add(Dense(number_outputs, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])
The model is saved to an h5 file. Then the saved model is loaded back into the same architecture as above. batch_image is an image file.
prediction = loaded_model.predict(batch_image, batch_size=1)
I get prediction like this:
ndarray: [[ 0.00498065 0.00497852 0.00498095 0.00496987 0.00497506 0.00496112
0.00497585 0.00496474 0.00496769 0.0049708 0.00497027 0.00496049
0.00496767 0.00498348 0.00497927 0.00497842 0.00497095 0.00496493
0.00498282 0.00497441 0.00497477 0.00498019 0.00497417 0.00497654
0.00498381 0.00497481 0.00497533 0.00497961 0.00498793 0.00496556
0.0049665 0.00498809 0.00498689 0.00497886 0.00498933 0.00498056
Questions:
Shouldn't the prediction array be 1s and 0s? Why do I get output that looks as if the output activation were sigmoid and the loss binary_crossentropy? What is wrong? I want to emphasize again that the model is not really trained well on the data; it is almost just initialized with random weights.
If I don't train the network well (it has not converged yet), e.g. the weights are just initialized with random numbers, should the prediction still be 1s and 0s?
If I want the probability of each prediction, and then decide myself how to interpret it, how do I get the probability output after the CNN is trained?
Your number of outputs is 201, which is why your output has shape (1, 201) instead of a single 0/1 value. You can easily get the class with the highest value using np.argmax; that class is your model's output for the given input.
Even though you have trained for only one epoch, your model has learned something (it may be very poor, but it is still something), and it predicted the output based on that.
You used softmax as the activation in the last layer. It normalizes your output in a non-linear fashion so that the outputs for all classes sum to 1. So the value you get for each class can be interpreted as the probability of that class given the input (for more clarity, you can look into how the softmax function works).
Lastly, every class has a value around 0.0049 because the model is not sure which class your input belongs to; it calculates a value for each class and softmax normalizes them, which is why your output values are in the range 0 to 1.
For example, say I have four classes; then one probable output could be [0.223 0.344 0.122 0.311], which in the end we read as confidence scores for each class. Looking at these confidence scores, the predicted class is 2, as it has the highest confidence score of 0.344.
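As a quick sketch of the np.argmax step mentioned above, reusing loaded_model and batch_image from the question's snippet:

import numpy as np

prediction = loaded_model.predict(batch_image, batch_size=1)  # shape (1, 201), as in the question
class_index = np.argmax(prediction, axis=-1)[0]  # index of the class with the highest softmax value
confidence = prediction[0, class_index]          # that class's softmax value, read as a confidence score
print(class_index, confidence)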
The output of a softmax layer is not 0 or 1. It is actually a normalized layer whose values add up to 1: if you sum all of your coefficients, you will get 1. To get the prediction, take the one with the highest value. You can interpret them as probabilities even if, technically, they are not; see https://en.wikipedia.org/wiki/Softmax_function for the definition.
This layer is used in the training process so that the prediction for a categorical classification can be compared with the true label.
It is required for the optimization because optimization works on differentiable functions (ones that have a gradient), and a hard 0/1 output would not be differentiable (not even continuous). The optimization is then done over all these values.
An interesting example is the following: if your true target is [0 0 1 0] and your prediction output is [0.1 0.1 0.6 0.2], then even though the prediction is correct, the model can still learn, because it still gives a non-zero probability to the other classes, on which you can compute a gradient.
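For concreteness, the categorical cross-entropy for that example can be worked out by hand:

import numpy as np

# True one-hot target vs softmax prediction from the example above.
y_true = np.array([0, 0, 1, 0])
y_pred = np.array([0.1, 0.1, 0.6, 0.2])
loss = -np.sum(y_true * np.log(y_pred))  # = -log(0.6), roughly 0.51, so a gradient still exists
print(loss)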
To get the prediction output as a class instead of probabilities, use:
model.predict_classes(x_train, batch_size)
My understanding is that softmax gives the likelihood of the value landing in each of the 201 buckets. With complete certainty about the first bucket, you would get [1, 0, 0, 0, 0, ...]. Since very little training/learning/weight adjustment has occurred, the 201 values are all about 0.00497 and together sum to 1.
There is a decent description of softmax on developers.Google here.
The output was specified as 'number_outputs', so you get 201 outputs, each of which tells you the likelihood (as a value between 0 and 1) of your prediction being THAT output.
