I am currently trying to learn Deep Learning by focusing on Keras and the book "Deep Learning with Python-Keras".
I have an example where I understand the code but not the result, and this is where I need your help. The example is about analyzing movie reviews from the IMDB dataset, which is included in Keras. The code goes as follows:
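import numpy as np
from keras import models, layers
from keras.datasets import imdb

# Data loading is not shown in the question; assumed to follow the book.
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

def vectorize_sequences(sequences, dimension=10000):
    # Multi-hot encoding: one row per review, one column per word index.
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.  # mark every word that occurs in review i
    return results

X_train = vectorize_sequences(train_data)
X_test = vectorize_sequences(test_data)
y_train = np.asarray(train_labels)
y_test = np.asarray(test_labels)

model = models.Sequential()
model.add(layers.Dense(16, activation="relu", input_shape=(10000,)))
model.add(layers.Dense(16, activation="relu"))
model.add(layers.Dense(1, activation="sigmoid"))  # single output in [0, 1]
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
history = model.fit(X_train, y_train, epochs=4, batch_size=512)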
The explanation says that "the final layer will use a sigmoid activation so as to output a probability indicating how likely the sample is to have the target “1”".
I know that the sigmoid function outputs values in the range [0,1]. Suppose the output of my network is 0.6.
Why am I allowed to say that this value is the probability of the target "1" and not of the target "0"?
I am kind of stuck and need some help :)
The interpretation of your output depends on the labels you used during training. Here, train_labels and test_labels consist of 0s and 1s.
During training, the network is optimized to yield the correct label for each input sequence. So if your output is close to 0 or 1, the network is giving a confident classification. If your output is e.g. 0.5, the network is completely unsure which class your input belongs to.
An output of 0.6 therefore means the input might belong to class 1, but only with a confidence of 60 percent. It describes the probability of class 1, and not of class 0, because the loss rewards outputs near 1 for inputs labeled 1 and outputs near 0 for inputs labeled 0: for a label-1 input, an output of 1 is the best possible answer and an output of 0 the worst. Values between 0 and 1 thus express how strongly the network believes in class 1, so the output is a probability in the end.
The probability of class 0 is simply the complement: an output of 0.6 corresponds to 40 percent for class 0. If you had swapped the labels before training, the interpretation would be turned around as well.
So in the end, you have two options. First, you can take these values as they are and treat them as the probability that an input belongs to class 1. Second, you can introduce a threshold (0.5 is the natural choice here) and assign the input to class 1 if the output is larger than the threshold, and to class 0 otherwise. The closer your output is to 0.5, the more the network is just guessing the class.
The choice of the threshold has a direct influence on the performance of your network in the end. This can be evaluated for example with a ROC curve (https://en.wikipedia.org/wiki/Receiver_operating_characteristic).
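A minimal sketch of the two options described above, using NumPy on a made-up array of sigmoid outputs. The values and the name `probs` are illustrative assumptions, not results from the model in the question:

```python
import numpy as np

# Hypothetical sigmoid outputs of the network for five reviews.
probs = np.array([0.05, 0.40, 0.50, 0.60, 0.97])

# Option 1: read each output as P(label == 1); P(label == 0) is the complement.
p_class1 = probs
p_class0 = 1.0 - probs

# Option 2: turn the probabilities into hard labels with a threshold.
threshold = 0.5
predicted = (probs > threshold).astype(int)

print(p_class0)
print(predicted)
```

Moving `threshold` up or down trades false positives against false negatives, which is exactly what a ROC curve visualizes.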
Computer vision and deep learning literature usually says one should use binary_crossentropy for a binary (two-class) problem and categorical_crossentropy for more than two classes. Now I am wondering: is there any reason not to use the latter for a two-class problem as well?
categorical_crossentropy:
accepts only one correct class per sample
uses "only" the true neuron and computes the crossentropy with that neuron
binary_crossentropy:
accepts many correct classes per sample
computes the crossentropy for "all neurons", treating each neuron as a separate two-class problem (0 or 1)
A 2-class problem can be modeled as:
2-neuron output with only one correct class: softmax + categorical_crossentropy
1-neuron output, one class is 0, the other is 1: sigmoid + binary_crossentropy
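The two formulations give the same loss whenever the single sigmoid logit equals the difference of the two softmax logits. A small NumPy check of that claim, with arbitrary example logits `z0` and `z1` (hypothetical values, not taken from any model):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical logits for class 0 and class 1; the true class is 1.
z0, z1 = 0.3, 1.5
y_true = 1

# Formulation 1: 2 neurons, softmax + categorical crossentropy.
p = softmax(np.array([z0, z1]))
categorical_ce = -np.log(p[y_true])

# Formulation 2: 1 neuron, sigmoid + binary crossentropy.
q = sigmoid(z1 - z0)
binary_ce = -(y_true * np.log(q) + (1 - y_true) * np.log(1 - q))

print(categorical_ce, binary_ce)
```

So for two classes the choice is mostly about how the labels are laid out (one-hot pairs versus a single 0/1 column), not about the value of the loss.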
Explanation
Notice how in categorical crossentropy (the first equation), the term y_true is only 1 for the true neuron, so the terms of all other neurons become zero.
The equation can be reduced to simply: -ln(y_pred[correct_label]).
Now notice how binary crossentropy (the second equation in the picture) has two terms, one for considering 1 as the correct class, another for considering 0 as the correct class.
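The picture with the two equations is not reproduced here. Assuming it showed the standard definitions (an assumption, with $C$ classes, $y$ standing for y_true and $\hat{y}$ for y_pred), they read:

```latex
\begin{align}
L_{\text{categorical}} &= -\sum_{i=1}^{C} y_i \,\ln \hat{y}_i \\
L_{\text{binary}} &= -\bigl[\, y \,\ln \hat{y} + (1 - y)\,\ln(1 - \hat{y}) \,\bigr]
\end{align}
```

With a one-hot $y$, only the term of the correct class survives in the first sum. In the second equation, the first term is active when $y = 1$ and the second when $y = 0$.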
I have a sequence of multi-band images, say each sample is a tensor of size (50, 6, 30, 30), where 50 is the number of image frames in the sequence, 6 is the number of bands per pixel, and 30x30 is the spatial dimension of the image. The ground truth map is of size 30x30, but it is one-hot encoded (to use a crossentropy loss) over 7 classes, so it is a tensor of size (1, 7, 30, 30). I want to use a combination of convolutional and LSTM layers (or an integrated ConvLSTM2D layer) for my classification task, but there are the following problems:
1- Not every point has a valid label in the output map (i.e. some one-hot vectors are all-zero).
2- Not every pixel has a valid value at every time stamp. So, at any given time stamp, some of the pixels may have a zero value (meaning invalid) for all of their bands.
I read many Q&As on how to handle this issue and I think I should use the sample_weights option to mask the invalid points and classes, but I am really uncertain how to do it. Sample weights should be applied to every pixel and each time stamp independently. I think I could manage it if I didn't have the convolution part (a 2D approach). But I don't understand how it works when convolution is in place, because some pixel values in the convolution window are valid and some are invalid. If I mask those invalid pixels at a specific time (which I still don't know how to do), what will happen to the chain of forward and backward propagation and loss calculation? I think it will be ruined!
Looking for comments and help.
Possible solution:
Problem 1- For pixels that do not have a class at all, you can introduce a new class, labeled for example "noise".
That means your one-hot encoding gets an entry for this class as well, and the weights will be learned accordingly for the pixels of the noise class.
This is an indirect way to achieve the same thing you would do with sample weights,
because with the sample_weight technique you tell Keras or sklearn how much each sample should count toward the loss.
Problem 2- To answer part 2, consider the possible cases: for pixels with invalid values, is a class still present in the one-hot vector, or is it all zeros?
Alternatively, you can preprocess the data and add these pixels to the noise class as well; then point 2 is handled by point 1 automatically.
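A minimal NumPy sketch of the extra "noise" class idea, assuming the label map has the (7, 30, 30) layout from the question with the batch axis dropped. The random `class_map` and all variable names are hypothetical stand-ins for real ground truth:

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, height, width = 7, 30, 30

# Hypothetical ground truth: a class index per pixel, -1 marks "no valid label".
class_map = rng.integers(-1, num_classes, size=(height, width))

# One-hot encode to (7, 30, 30); unlabeled pixels stay all-zero, as in the question.
one_hot = np.zeros((num_classes, height, width), dtype=np.float32)
rows, cols = np.nonzero(class_map >= 0)
one_hot[class_map[rows, cols], rows, cols] = 1.0

# A pixel is unlabeled when its one-hot vector sums to zero over the class axis.
unlabeled = one_hot.sum(axis=0) == 0

# Append the extra "noise" channel so every pixel has exactly one active class.
noise_channel = unlabeled.astype(np.float32)[np.newaxis, ...]
one_hot_with_noise = np.concatenate([one_hot, noise_channel], axis=0)

print(one_hot_with_noise.shape)  # (8, 30, 30)
```

With this encoding the model's final layer would need 8 output channels instead of 7, and predictions of the eighth class can simply be ignored at evaluation time.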
I did this tutorial: https://www.tensorflow.org/get_started/eager
It was very helpful, but I didn't understand why the outputs always sum to 1. It is stated "For this example, the sum of the output predictions are 1.0", but not explained. I thought it might be a characteristic of the activation function, but I read that ReLU can take any value >0 (https://medium.com/the-theory-of-everything/understanding-activation-functions-in-neural-networks-9491262884e0).
I'd like to understand because I want to learn in which cases one should normalize the output variables and in which cases this is not necessary (I assume if they always sum up to 1, it's not necessary).
In the given example, the statement that the outputs always sum to 1 refers to the softmax function applied to the network's output; it has nothing to do with normalizing your data or with the ReLU activation used in the hidden layers. In the tutorial's Iris example we want to distinguish between three classes, and the class probabilities must of course add up to 100% (1.0).
For example, the softmax function, which is applied at the end of your network, could return [0.8, 0.1, 0.1]. That means the first class has the highest probability. Notice that the individual probabilities sum to 1.0.
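To see why the sum is always 1, here is a small NumPy sketch of softmax. The raw scores in `logits` are made-up values for three classes, not output from the tutorial's model:

```python
import numpy as np

def softmax(logits):
    # Exponentiate (after shifting by the max for stability), then divide by the total.
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

# Hypothetical raw scores of the last layer for the three Iris classes.
logits = np.array([2.0, -1.0, 0.5])
probs = softmax(logits)

print(probs, probs.sum())
```

Because every entry is divided by the same total, the results are positive and add up to 1 no matter what the raw scores were.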
I am implementing a neural network to recognize handwritten digits in Python. Following is the cost function,
In log(1-h(x)), if h(x) is 1, it results in log(1-1), i.e. log(0), so I'm getting a math error.
I'm initializing the weights randomly between 10 and 60. I'm not sure where and what I have to change!
In this formula, h(x) is usually a sigmoid: h(x)=sigmoid(x), so mathematically it's never exactly 1.0. In floating point it rounds to 1.0 only when the activations in the network are too large (which is bad and will cause problems anyway). The same problem is possible with log(h(x)) when h(x)=0, i.e., when x is a large negative number.
If you don't want to worry about numerical issues, simply add a small number before computing the log: log(h(x) + 1e-10), and likewise log(1 - h(x) + 1e-10).
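A minimal sketch of a numerically safe version of this loss. It clips the predictions into [eps, 1 - eps] rather than adding a constant, which is a closely related variant of the same idea. The function name and the sample arrays are hypothetical:

```python
import numpy as np

def binary_cross_entropy(y_true, h, eps=1e-10):
    # Keep predictions inside [eps, 1 - eps] so neither log() ever sees 0.
    h = np.clip(h, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(h) + (1.0 - y_true) * np.log(1.0 - h))

y_true = np.array([1.0, 0.0, 1.0])
h = np.array([1.0, 0.0, 0.7])  # contains an exact 1 and 0, which would break a plain log()

loss = binary_cross_entropy(y_true, h)
print(loss)
```

The clipping only changes values that are already saturated, so it does not hide the underlying problem of overly large activations.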
Other issues:
Weight initialization in the range [10, 60] doesn't look right; the weights should be small random numbers, e.g., from [-0.01, 0.01].
The formula above is computing binary cross-entropy loss. If you're working with MNIST, it has 10 classes, so the loss must be multi-class cross-entropy. See this question for details.
I trained a CNN model for just one epoch with very little data. I use Keras 2.05.
Here are the (partial) last layers of the CNN model, with number_outputs = 201. The training data output is one-hot encoded with 201 classes.
model.add(Dense(200, activation='relu', name='full_2'))
model.add(Dense(40, activation='relu', name='full_3'))
model.add(Dense(number_outputs, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])
The model is saved to an h5 file. Then the saved model is loaded with the same architecture as above. batch_image is an image file.
prediction = loaded_model.predict(batch_image, batch_size=1)
I get prediction like this:
ndarray: [[ 0.00498065 0.00497852 0.00498095 0.00496987 0.00497506 0.00496112
0.00497585 0.00496474 0.00496769 0.0049708 0.00497027 0.00496049
0.00496767 0.00498348 0.00497927 0.00497842 0.00497095 0.00496493
0.00498282 0.00497441 0.00497477 0.00498019 0.00497417 0.00497654
0.00498381 0.00497481 0.00497533 0.00497961 0.00498793 0.00496556
0.0049665 0.00498809 0.00498689 0.00497886 0.00498933 0.00498056
Questions:
Shouldn't the prediction array contain only 1s and 0s? Why do I get output that looks as if the output activation were sigmoid and the loss were binary_crossentropy? What is wrong? I want to emphasize again that the model is not really trained well. It's almost just initialized with random weights.
If I don't train the network well (it has not converged yet), e.g. the weights are just initialized with random numbers, should the prediction still consist of 1s and 0s?
If I want to get the prediction probabilities and then decide myself how to interpret them, how do I get the probability output after the CNN is trained?
Your number of outputs is 201, which is why your output has shape (1, 201) and is not a single 1 or 0. You can easily find the class with the highest value by using np.argmax; that class is your model's output for the given input.
As for the training: even though you have trained for only 1 epoch, your model has learned something. It may be very little, but it is still something, and the model predicts its output based on that.
You have used softmax as the activation in the last layer. It normalizes your output in a non-linear fashion so that the outputs for all classes sum to 1. So the value you get for each class can be interpreted as the probability that the model assigns to that class for the given input. (For more clarity, you can look into how the softmax function works.)
And lastly, each class has a value like 0.0049 or similar because the model is not sure which class your input belongs to. It calculates a score for each class and softmax then normalizes these scores. That is why your output values are in the range 0 to 1.
For example, say I have four classes; one possible output is [0.223 0.344 0.122 0.311], which we read as a confidence score for each class. Looking at these scores, we can say the predicted class is the second one (index 1 when using np.argmax), as it has the highest confidence score of 0.344.
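The four-class example can be reproduced with NumPy. The array below just copies the numbers from the text and is shaped (1, 4) to mimic what `predict` returns for a single sample (that shape is an assumption for illustration):

```python
import numpy as np

# Softmax output for one sample, shaped like a batch of size 1.
prediction = np.array([[0.223, 0.344, 0.122, 0.311]])

predicted_index = np.argmax(prediction, axis=-1)  # position of the largest score per row
confidence = prediction.max(axis=-1)              # the score at that position

print(predicted_index, confidence)
```

Note that `np.argmax` counts from zero, so the second class is reported as index 1.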
The output of a softmax layer is not 0 or 1. It is a normalized layer whose outputs add up to 1. If you sum all your coefficients, they will add up to 1. To get the prediction, take the one with the highest value. You can interpret them as probabilities even if they technically are not. See https://en.wikipedia.org/wiki/Softmax_function for the definition.
This layer is used in the training process in order to be able to compare the prediction of a categorical classification with the true label.
It is required for the optimization because the optimization is done on differentiable functions (having a gradient), and a 0/1 output would not be differentiable (not even continuous). The optimization is then done on all these values.
An interesting example is the following: if your true target is [0 0 1 0] and your prediction output is [0.1 0.1 0.6 0.2], then even though the prediction is correct, the network will still be able to learn, because it still gives a non-zero probability to the other classes, on which you can compute a gradient.
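A short worked version of this example in NumPy. It assumes the usual pairing of softmax with categorical crossentropy, for which the gradient with respect to the pre-softmax scores is the prediction minus the target:

```python
import numpy as np

y_true = np.array([0.0, 0.0, 1.0, 0.0])
y_pred = np.array([0.1, 0.1, 0.6, 0.2])  # softmax output from the example above

# Categorical crossentropy: only the true class contributes to the sum.
loss = -np.sum(y_true * np.log(y_pred))

# For softmax followed by crossentropy, the gradient with respect to the
# pre-softmax scores (logits) is y_pred - y_true.
grad_logits = y_pred - y_true

print(loss)
print(grad_logits)
```

The gradient is non-zero for every class even though the arg max is already correct, so training keeps pushing the true class toward 1 and the others toward 0.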
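In order to get the prediction output as a class instead of a probability, use:
model.predict_classes(x_train,batch_size)
(In recent Keras/TensorFlow versions, where predict_classes is no longer available, np.argmax(model.predict(x_train), axis=-1) gives the same result.)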
My understanding is that softmax gives the likelihood of the value landing in each of the 201 buckets. With certainty about the first bucket you would get [1,0,0,0,0........]. Since very little training/learning/weight adjustment has occurred, the 201 values are all about 0.00497, which together sum to 1.
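The value 0.00497 is what a nearly uniform distribution over 201 classes looks like, since 1/201 is about 0.004975. A small NumPy sketch, assuming the barely trained network produces pre-softmax scores that are all close to zero (the random `logits` are a hypothetical stand-in for the real ones):

```python
import numpy as np

number_outputs = 201
rng = np.random.default_rng(0)

# Hypothetical pre-softmax scores of a barely trained network: all close to zero.
logits = rng.normal(loc=0.0, scale=0.001, size=number_outputs)

# Softmax over the 201 scores.
exps = np.exp(logits - logits.max())
probs = exps / exps.sum()

print(1.0 / number_outputs)      # roughly 0.004975
print(probs.min(), probs.max())  # both very close to 1/201
```

As training progresses, the scores spread apart and the distribution moves away from this flat shape toward a clear peak.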
A decent description on developers.Google of SoftMax here
The output size was specified as number_outputs, so you get 201 outputs, each of which tells you the likelihood (as a value between 0 and 1) of your prediction being THAT output.