I want to create a custom loss function that takes into account only some of the output features, not all of them. I am training a sequence-regression LSTM neural network on data shaped as follows: the input is (number_of_samples, 200 timesteps, 4 features) and the output is (number_of_samples, 200 timesteps, 6 features). My first, basic model looks like this:
inputs1=Input(shape=(None,num_input_features))
lstm1=LSTM(10,return_sequences=True)(inputs1)
lstm2=LSTM(10,return_sequences=True)(lstm1)
outputs1=TimeDistributed(Dense(num_output_features))(lstm2)
model_proba=Model(inputs=inputs1,outputs=outputs1)
I want to train a model only on the first 4 features of my output (the last 2 features are not relevant for training, I just want to predict them without training on that data). I have tried creating a custom loss function that looks like this:
def custom_loss(y_true, y_pred):  # Loss that doesn't take into account last 2 output features
    y_true_r = y_true[:, :, :4]
    y_pred_r = y_pred[:, :, :4]
    mse = MeanSquaredError()
    return mse(y_true_r, y_pred_r)
But the problem is that during training the weights and biases connected to the 2 output features that are not in the loss function are not trained; after training they still have their initial values.
As far as I understand, the loss doesn't depend on these weights and biases, so their gradient is 0 and they are never updated. So I want to know: is it possible to train these weights and biases in relation to the global loss value?
p.s. I am using Adam optimizer.
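The zero-gradient reasoning above can be checked numerically without Keras. This is a minimal pure-Python sketch, not the Keras implementation: masked_mse and numeric_grad are hypothetical helper names, the values are made-up toy data with the batch and timestep axes flattened into rows, and the gradient is taken with respect to the predictions rather than the layer weights. Since the weights feeding outputs 5 and 6 affect the loss only through those predictions, a zero gradient here implies a zero gradient for those weights too.

```python
def masked_mse(y_true, y_pred, n_used=4):
    """Mean squared error over the first n_used features of every row."""
    total, count = 0.0, 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row[:n_used], p_row[:n_used]):
            total += (t - p) ** 2
            count += 1
    return total / count


def numeric_grad(y_true, y_pred, eps=1e-6):
    """Central finite-difference gradient of masked_mse w.r.t. each prediction."""
    grad = []
    for i, row in enumerate(y_pred):
        grad_row = []
        for j in range(len(row)):
            plus = [r[:] for r in y_pred]
            minus = [r[:] for r in y_pred]
            plus[i][j] += eps
            minus[i][j] -= eps
            diff = masked_mse(y_true, plus) - masked_mse(y_true, minus)
            grad_row.append(diff / (2 * eps))
        grad.append(grad_row)
    return grad


# Toy data: 2 rows, 6 output features each (assumed values, for illustration only)
y_true = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]]
y_pred = [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]]

grad = numeric_grad(y_true, y_pred)
for row in grad:
    # The last two entries of each row are zero: the loss never sees them.
    print([round(g, 4) for g in row])
```

Any loss built only from the first 4 features behaves this way regardless of the optimizer, so Adam has nothing to update for the last two outputs.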
I am trying to train a classifier based on the InceptionV3 architecture in Keras.
For this I loaded the pre-trained InceptionV3 model, without top, and added a final fully connected layer for the classes of my classification problem. In the first training I froze the InceptionV3 base model and only trained the final fully connected layer.
In the second step I want to "fine tune" the network by unfreezing a part of the InceptionV3 model.
Now I know that the InceptionV3 model makes extensive use of BatchNorm layers. When BatchNorm layers are "unfrozen" for fine-tuning during transfer learning, it is recommended (link to documentation) to keep the means and variances computed by the BatchNorm layers fixed. This should be done by setting the BatchNorm layers to inference mode instead of training mode.
Please also see: What's the difference between the training argument in call() and the trainable attribute?
Now my main question is: how to set ONLY the BatchNorm layers of the InceptionV3 model to inference mode?
Currently I set the whole InceptionV3 base model to inference mode by setting the "training" argument when assembling the network:
inputs = keras.Input(shape=input_shape)
# Scale the 0-255 RGB values to 0.0-1.0 RGB values
x = layers.experimental.preprocessing.Rescaling(1./255)(inputs)
# Set include_top to False so that the final fully connected (with pre-loaded weights) layer is not included.
# We will add our own fully connected layer for our own set of classes to the network.
base_model = keras.applications.InceptionV3(input_shape=input_shape, weights='imagenet', include_top=False)
x = base_model(x, training=False)
# Classification block
x = layers.GlobalAveragePooling2D(name='avg_pool')(x)
x = layers.Dense(num_classes, activation='softmax', name='predictions')(x)
model = keras.Model(inputs=inputs, outputs=x)
What I don't like about this is that it puts the whole base model into inference mode, which may affect layers that should not be in inference mode.
Here is the part of the code that loads the weights from the initial training that I did and the code that freezes the first 150 layers and unfreezes the remaining layers of the InceptionV3 part:
model.load_weights(load_model_weight_file_name)
for layer in base_model.layers[:150]:
    layer.trainable = False
for layer in base_model.layers[150:]:
    layer.trainable = True
The rest of my code (not shown here) are the usual compile and fit calls.
Running this code seems to result in a network that doesn't really learn (loss and accuracy remain approximately the same). I tried different orders of magnitude for the optimization step size, but that doesn't seem to help.
Another thing I observed is that when I make the whole InceptionV3 part trainable
base_model.trainable = True
the training starts with an accuracy several orders of magnitude smaller than where my first training round finished (and, of course, a much higher loss). Can someone explain this to me? I would at least expect the training to continue where it left off in terms of accuracy and loss.
You could do something like:
for layer in base_model.layers:
    if isinstance(layer, tf.keras.layers.BatchNormalization):
        layer.trainable = False
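This iterates over the layers of the base model and checks each layer's type. In TensorFlow 2.x, setting trainable = False on a BatchNormalization layer also makes it run in inference mode, so its moving mean and variance stay fixed. Run this loop after any loop that sets trainable = True (such as the one over base_model.layers[150:] in the question), otherwise the later assignment re-enables the BatchNorm layers.
As for the low starting accuracy during transfer learning, you're only loading the weights and not the optimiser state (as would occur when loading a full saved model with keras.models.load_model(), which restores architecture, weights, optimiser state etc.).
This doesn't mean there's an error, but if you must load weights only, just let it train; the optimiser state will build up again and you should see progress. Also, as you're potentially overwriting the pre-trained weights in your second run, make sure you use a lower learning rate so the updates are small in comparison, i.e. fine-tune the weights rather than blast them to pieces.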
I have a supervised learning task f(X)=y where X is a 2-dimensional np.array of np.int8 and y is a 1-dimensional array of np.float64 containing probabilities (so numbers between 0 and 1). I want to build a neural network model that performs regression in order to predict said probabilities y given X.
As the output of my Network is one real value (i.e. the output layer has one neuron) and is a probability (so in the range [0, 1]), I believe I should use softmax as the activation function of the output layer (i.e. output neuron) in order to squash the network's output to [0, 1].
As it is a regression task, I opted for using the mean_squared_error loss (instead of cross_entropy_loss that is typically used in classification tasks and often paired with softmax).
However, as I am trying to fit(X, y) the loss does not change at all between epochs and remains constant. Any ideas why? Is the combination of softmax and mean_squared_error loss wrong for some reason and why?
If I remove the softmax it does work, but then my model would also predict non probabilities which I do not want. Yes, I could squash it myself later but it doesn't seem right.
My code basically is (after removing some irrelevant additional callbacks for EarlyStopping and learning rate scheduling):
model = Sequential()
model.add(Dense(W1_size, input_shape=(input_dims,), activation='relu'))
model.add(Dense(1, activation='softmax'))
# compile model
model.compile(optimizer=Adam(), loss='mse') # mse is the standard loss for regression
# fit
model.fit(X, y, batch_size=batch_size, epochs=MAX_EPOCHS)
Edit: Turns out I needed the sigmoid function to squash one real value to [0, 1] as the accepted answer suggests. The softmax function for a vector of size 1 is always 1.
As you stated, you want to perform a regression task, which means finding a continuous mapping between your input and the desired output.
The softmax function creates a pseudo-probability distribution for multi-dimensional outputs (all values sum up to 1). This is why softmax is a perfect fit for classification tasks (predicting probabilities for different classes).
As you want to perform a regression task and your output is one-dimensional, softmax will not work properly, because it is always 1 for a one-dimensional input.
A function that maps a one-dimensional input continuously to [0, 1] works fine here (e.g. sigmoid).
Note that you can interpret the output of both the sigmoid and the softmax function as probabilities. But be careful: these are only pseudo-probabilities, and they do not represent the certainty or uncertainty of your model in making predictions.
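The difference between the two activations on a single output neuron can be seen in a few lines. This is a minimal pure-Python sketch of the two formulas, not Keras code; softmax and sigmoid are hypothetical helper names and the logit values are arbitrary.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Logistic function, maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


logits = (-5.0, 0.0, 3.2)  # arbitrary pre-activation values of the single output neuron

# Softmax over a vector of length 1 normalizes the value by itself: always 1.0
single_logit_outputs = [softmax([z])[0] for z in logits]

# Sigmoid depends on the input, so the network output can actually change
sigmoid_outputs = [sigmoid(z) for z in logits]

print(single_logit_outputs)
print(sigmoid_outputs)
```

Because the softmax output is constant, the prediction never depends on the weights, which is consistent with the loss staying flat across epochs.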
I am a bit confused on how Keras fits the models. In general, Keras models are fitted by simply using model.fit(...) something like the following:
model.fit(X_train, y_train, epochs=300, batch_size=64, validation_data=(X_test, y_test))
My question is: since I passed the testing data through the argument validation_data=(X_test, y_test), does it mean that each epoch is independent? In other words, I understand that in each epoch Keras trains the model on the (shuffled) training data and then tests the trained model on the provided validation_data. If that's the case, then no matter how many epochs I choose, I only get the results of the last epoch!
If this scenario is correct, why do we need multiple epochs? Unless the epochs are somehow dependent, with each epoch starting from the NN weights of the previous epoch, correct?
Thank you
When Keras fits your model, it passes through the whole dataset in each epoch, in steps of batch_size samples.
For example, if you have a dataset of 1000 items and a batch_size of 8, the weights of your model are updated using 8 items at a time until the model has seen the whole dataset, i.e. 1000 / 8 = 125 updates per epoch.
At the end of that epoch, the model makes predictions on your validation set.
If we ran only one epoch, each element would contribute to a weight update only once (because the model "saw" the complete dataset only one time).
But to minimize the loss function through backpropagation, we need to update those weights many times to approach the optimum, so we pass through the whole dataset multiple times; in other words, multiple epochs. The weights carry over from one epoch to the next, so the epochs are not independent.
I hope that's clear; ask if you need more information.
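To make the dependence between epochs concrete, here is a minimal pure-Python sketch of mini-batch gradient descent on a one-weight model y = w * x. It is not how Keras is implemented: steps_per_epoch and train are hypothetical helpers, the data is made up, and shuffling (which Keras does by default for array inputs) is left out to keep the run deterministic.

```python
import math


def steps_per_epoch(n_samples, batch_size):
    """Number of weight updates in one pass over the data."""
    return math.ceil(n_samples / batch_size)


def train(xs, ys, epochs, batch_size, lr=0.01):
    """Fit y = w * x with mini-batch gradient descent on the MSE."""
    w = 0.0  # initialised once; every epoch continues from the previous one
    history = []
    for _ in range(epochs):
        for start in range(0, len(xs), batch_size):
            bx = xs[start:start + batch_size]
            by = ys[start:start + batch_size]
            # Gradient of the batch MSE with respect to w
            grad = sum(2 * (w * x - y) * x for x, y in zip(bx, by)) / len(bx)
            w -= lr * grad
        # Loss over the full dataset at the end of the epoch
        loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
        history.append(loss)
    return w, history


xs = [i / 10 for i in range(1, 41)]  # 40 toy samples
ys = [2.0 * x for x in xs]           # true relationship: w = 2

w, history = train(xs, ys, epochs=5, batch_size=8)
print(steps_per_epoch(1000, 8))      # updates per epoch for the 1000-item example
print(w, history)
```

The per-epoch loss in history keeps shrinking only because w is never reset between epochs; a single epoch leaves the weight well short of 2.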
I have implemented an autoencoder using Keras. I understand that I can add accuracy performance metric as follows:
autoencoder.compile(optimizer='adam',
                    loss='mean_squared_error',
                    metrics=['accuracy'])
My question is:
Is the accuracy metric applied on the last layer of the decoder by default? If so, how can I set it so that it would get the representations from middle (hidden) layer to compute accuracy performance? Do I need to define a custom metric? How would that work?
It seems that what you really want is a multiple output network.
So on top of your middle layer that defines your embedding, add a layer (or more) to do your classification.
Then have a look at Multiple outputs in Keras to create your global cost.
You may also want to start by training the autoencoder only, then the additional classifier layers only, to see the performance. You can also balance the accuracy of the encoder against the accuracy of the classifier in the loss, training "both" networks at the same time.
I trained a CNN model for just one epoch with very little data. I use Keras 2.05.
Here are the last layers of the (partial) CNN model; number_outputs = 201. The training targets are one-hot encoded over the 201 outputs.
model.add(Dense(200, activation='relu', name='full_2'))
model.add(Dense(40, activation='relu', name='full_3'))
model.add(Dense(number_outputs, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])
The model is saved to an h5 file. Then the saved model is loaded with the same architecture as above. batch_image is an image file.
prediction = loaded_model.predict(batch_image, batch_size=1)
I get prediction like this:
ndarray: [[ 0.00498065 0.00497852 0.00498095 0.00496987 0.00497506 0.00496112
0.00497585 0.00496474 0.00496769 0.0049708 0.00497027 0.00496049
0.00496767 0.00498348 0.00497927 0.00497842 0.00497095 0.00496493
0.00498282 0.00497441 0.00497477 0.00498019 0.00497417 0.00497654
0.00498381 0.00497481 0.00497533 0.00497961 0.00498793 0.00496556
0.0049665 0.00498809 0.00498689 0.00497886 0.00498933 0.00498056
Questions:
Shouldn't the prediction array contain just 1s and 0s? Why do I get output that looks as if the output activation were sigmoid and the loss were binary_crossentropy? What is wrong? I want to emphasize again that the model is not really trained well. It is almost just initialized with random weights.
If I don't train the network well (it has not converged yet), for example if the weights are just randomly initialized, should the prediction still be 1s and 0s?
If I want to get the probability of each prediction and then decide myself how to interpret it, how do I get the probability output after the CNN is trained?
Your number of outputs is 201; that is why your output has shape (1, 201) rather than being a single 1 or 0. You can get the class with the highest value using np.argmax; that class is your model's output for the given input.
Even though you have trained for only 1 epoch, your model has learned something. It may be very little, but it still bases its prediction on it.
You have used softmax as the activation in the last layer. It normalizes the outputs in a non-linear fashion so that they sum to 1 across all classes. So the value you get for each class can be interpreted as the probability the model assigns to that class for the given input. (For more clarity, you can look into how the softmax function works.)
And lastly, each class has a value like 0.0049 (roughly 1/201) because the model is not sure which class your input belongs to. It calculates a value for each class and then softmax normalizes them. That is why your output values are in the range 0 to 1.
For example, say I have four classes; one possible output is [0.223 0.344 0.122 0.311], which we read as a confidence score for each class. Looking at these scores, we can say the predicted class is 2 (the second class, index 1 for np.argmax), as it has the highest confidence score, 0.344.
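The four-class example and the "about 0.0049 per class" observation can be reproduced with plain Python. This is only an illustrative sketch: argmax is a hypothetical helper standing in for np.argmax, and the 201 equal logits are an assumption used to mimic an almost untrained network.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Zero-based index of the largest value (stand-in for np.argmax)."""
    return max(range(len(values)), key=values.__getitem__)


# The four-class example from the answer
probs = [0.223, 0.344, 0.122, 0.311]
predicted_index = argmax(probs)  # zero-based, so the "second class"

# An untrained network produces nearly identical logits for every class;
# with exactly equal logits, softmax gives each of the 201 classes 1/201.
uniform_probs = softmax([0.0] * 201)

print(predicted_index, probs[predicted_index])
print(round(uniform_probs[0], 5), round(sum(uniform_probs), 6))
```

The values in the question's output hover around 1/201, which is what a nearly uniform softmax looks like.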
The output of a softmax layer is not 0 or 1. It is a normalized layer whose outputs add up to 1: if you sum all your coefficients, you get 1. To get the prediction, take the one with the highest value. You can interpret them as probabilities even if they technically are not. See https://en.wikipedia.org/wiki/Softmax_function for the definition.
This layer is used in the training process in order to compare the prediction of a categorical classification with the true label.
It is required for the optimization because optimization works on differentiable functions (ones having a gradient), and a 0/1 output would not be differentiable (not even continuous). The optimization is then done on all these values.
An interesting example is the following: if your true target is [0 0 1 0] and your prediction output is [0.1 0.1 0.6 0.2], then even though the prediction is correct, the model can still learn, because it still gives a non-zero probability to the other classes, on which you can compute a gradient.
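The example above can be worked through numerically. The sketch below is pure Python rather than Keras; softmax and cross_entropy are hypothetical helpers, and it relies on the standard result that the gradient of categorical cross-entropy with respect to the logits feeding a softmax is (prediction - target).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, probs):
    """Categorical cross-entropy for a one-hot (or soft) target."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


target = [0.0, 0.0, 1.0, 0.0]
probs = [0.1, 0.1, 0.6, 0.2]
logits = [math.log(p) for p in probs]  # one choice of logits whose softmax equals probs

loss = cross_entropy(target, softmax(logits))
# Gradient of the loss w.r.t. the logits: prediction minus target
grad_wrt_logits = [p - t for p, t in zip(softmax(logits), target)]
predicted_class = probs.index(max(probs))  # zero-based index of the winning class

print(loss)
print([round(g, 3) for g in grad_wrt_logits])
print(predicted_class)
```

The predicted class already matches the target, yet the loss and every gradient entry are non-zero, so training still pushes the correct class towards 1 and the others towards 0.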
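To get the prediction output as a class instead of probabilities, use the following (predict_classes is available on Sequential models in older Keras versions; it was removed in later TensorFlow releases, where np.argmax(model.predict(x_train), axis=-1) gives the same result):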
model.predict_classes(x_train,batch_size)
My understanding is, Softmax says the likelihood of the value landing in that bucket out of the 201 buckets. With certainty of the first bucket you would get [1,0,0,0,0........]. Since very little training/learning/weight adjustment has occurred, the 201 values are all about 0.00497 which together sum to 1.
A decent description on developers.Google of SoftMax here
The output was specified as 'number_outputs' so you get 201 outputs, each of which tell you the likelihood (as a value between 0 and 1) of your prediction being THAT output.