CNN for 2d image rotation estimation (angle regression) - keras

I am trying to build a CNN (in Keras) that can estimate the rotation of an image (or a 2d object). So basically, the input is an image and the output should be its rotation.
My first experiment is to estimate the rotation of MNIST digits (starting with only one digit "class", let's say the "3"). So I extracted all 3s from the MNIST set and built a "rotated 3s" dataset by randomly rotating these images multiple times and storing the rotated images together with their rotation angles as ground truth labels.
So my first problem was that a 2d rotation is cyclic and I didn't know how to model this behavior. Therefore, I encoded the angle as y=sin(ang), x = cos(ang). This gives me my dataset (the rotated 3s images) and the corresponding labels (x and y values).
For the CNN, as a starting point, I took the Keras MNIST CNN example (https://keras.io/examples/mnist_cnn/) and replaced the last dense layer (which had 10 outputs and a softmax activation) with a dense layer that has 2 outputs (x and y) and a tanh activation (since y=sin(ang), x = cos(ang) are within [-1,1]).
The last thing I had to decide was the loss function, where I basically want a distance measurement for angles. Therefore I thought "cosine_proximity" was the way to go.
When training the network, I can see the loss decreasing and converging to a certain point. However, when I then check the predictions against the ground truth, I observe a (for me) fairly surprising behavior: almost all x and y predictions tend towards 0 or +/-1. And since the "decoding" of my rotation is ang=atan2(y,x), the predictions are usually either +/- 0°, 45°, 90°, 135° or 180°.
However, my training and test data has only angles of 0°, 20°, 40°, ... 360°.
This doesn't really change if I change the complexity of the network. I also played around with the optimizer parameters without any success.
Is there anything wrong with the assumptions:
- x,y encoding for angle
- tanh activation to have values in [-1,1]
- cosine_proximity as loss function
Thanks in advance for any advice, tips, or pointers to a possible mistake I made!

It's hard to give you an exact answer so let's try with some ideas:
Change from Cosine Proximity to MSE or other losses and check if something changes.
Change the way you encode the target. You could just represent the angle as a number between 0 and 1. That doesn't seem to be a problem even though the angles are cyclic.
Ensure your preprocessing/augmentation steps make sense for this particular task.
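To make the first suggestion concrete, here is a minimal sketch of the sin/cos encoding from the question together with two candidate losses. It uses plain Python rather than Keras, and the function names are made up for illustration. It shows one property that may be relevant: a cosine-based loss only looks at the direction of the (x, y) output, not its length, while MSE also penalises the length. Whether that explains the saturated predictions in the question is only a guess.

```python
import math

def encode_angle(deg):
    """Encode an angle in degrees as (x, y) = (cos, sin) on the unit circle."""
    rad = math.radians(deg)
    return math.cos(rad), math.sin(rad)

def decode_angle(x, y):
    """Recover the angle in degrees, in [0, 360), from an (x, y) pair."""
    return math.degrees(math.atan2(y, x)) % 360.0

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; ignores their lengths."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)

def mse(a, b):
    """Mean squared error between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

target = encode_angle(40.0)
# Hypothetical prediction: same direction as the target, but three times too long.
scaled = (3.0 * target[0], 3.0 * target[1])

print(decode_angle(*target))              # round trip back to degrees
print(cosine_similarity(target, scaled))  # direction matches, so similarity is maximal
print(mse(target, scaled))                # MSE still sees the wrong length
```

With a cosine-based loss, any output pointing in the right direction is equally good, so nothing pulls the output towards the unit circle. Swapping in MSE is a cheap way to check whether that matters here.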

Loss from linear transformed output and ground truth for training

I have a prediction model in pytorch that takes inputs and generates outputs in a specific coordinate system. In my process I transform the output and ground truth into a different coordinate system (2-dimensional translation and rotation). I can now calculate the loss in both coordinate systems, which have the same values (RMSE and NLL loss).
Does it matter which loss I use for the training to run loss.backward() on?
TLDR:
Does it matter which loss I use for the training to run loss.backward() on?
No for MSE, Yes for NLL.
Assuming that ground truth vector is x and the output vector is y,
Old MSE = (x-y).T.dot(x-y)
After the transformation, the ground truth vector becomes A.dot(x) + t and the output becomes A.dot(y) + t, where A is the rotation matrix and t is the translation. The translation cancels in the difference, which is A.dot(x-y).
New MSE = (x-y).T.dot(M).dot(x-y) where M=A.T.dot(A).
Because A is a rotation, it is orthogonal, so A.T.dot(A)=I. (This does not hold for an arbitrary linear transformation, for example one that scales.)
So M is the identity matrix and hence the MSE remains unchanged.
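The MSE argument can be checked numerically. The sketch below uses plain Python and made-up points, applies a 2-D rotation plus translation to both ground truth and output, and compares the squared error before and after. For contrast it also applies a uniform scaling, which is linear but not orthogonal.

```python
import math

def transform(point, theta, shift):
    """Rotate a 2-D point by theta (radians), then translate it by shift."""
    x, y = point
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y + shift[0], s * x + c * y + shift[1])

def scale(point, factor):
    """Uniform scaling: a linear map that is NOT orthogonal for factor != 1."""
    return tuple(factor * v for v in point)

def squared_error(a, b):
    """Sum of squared differences, i.e. (a-b).T.dot(a-b)."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

# Illustrative values only.
truth = (1.0, 2.0)
output = (0.5, 2.5)
theta, shift = math.radians(30.0), (4.0, -1.0)

before = squared_error(truth, output)
after = squared_error(transform(truth, theta, shift),
                      transform(output, theta, shift))
stretched = squared_error(scale(truth, 2.0), scale(output, 2.0))

print(before, after)  # equal up to floating-point rounding
print(stretched)      # four times larger: scaling changes the loss
```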
Now, NLL loss which is typically applied after nn.LogSoftmax just does
Y[x].mean() where Y is the output after nn.LogSoftmax and x is the target.
(I am referring to this).
This is not the same as what you'd get after you linearly transform output and target.

What might be the best loss function when target is a gaussian label?

I have a simple CNN with the inputs as
Cropped grayscale patches of size MxN centered on the object of interest. The intensity of each patch is rescaled to [0, 1].
Target Gaussian label of the same size MxN with values ranging in [5.0155e-173, 1]. This label is kept fixed throughout the training.
The goal is to learn the target label and use the learned model to detect the object in a test image. I am using Adam optimizer with various loss functions such as categorical_crossentropy, mean_squared_error, and mean_absolute_error but training halts soon probably due to the low values returned by all these loss functions (vanishing gradients?). Increasing the batch size from 1 to 16~32 sometimes helps in completing the iteration but gives undesired outcomes at test time.
Is it because the loss function is too sensitive to the lower values in the target and even treats them as outliers hence steering the whole learning process in the wrong direction?
I'll be grateful for your help in fixing the loss function in such a scenario.
I think the best choice here is to use a pseudo-distance between probability distributions. The first one that comes to mind is the Kullback-Leibler divergence, which is already implemented in PyTorch and Keras (see kldivloss, https://pytorch.org/docs/stable/nn.html#kldivloss, and the Keras equivalent). Other well-known distances include the Jensen-Shannon divergence and the Earth Mover's distance (the same distance that was used in WGAN).
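As a rough illustration of the KL suggestion, the sketch below normalises a Gaussian label so that it sums to 1, turns raw network outputs into a distribution with a softmax, and computes KL(target || prediction). It is plain Python, not the PyTorch or Keras loss itself; the patch size, sigma, and function names are assumptions made for the example.

```python
import math

def gaussian_label(height, width, sigma):
    """Flattened 2-D Gaussian centred on the patch, normalised to sum to 1."""
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    values = [
        math.exp(-((r - cy) ** 2 + (c - cx) ** 2) / (2.0 * sigma ** 2))
        for r in range(height)
        for c in range(width)
    ]
    total = sum(values)
    return [v / total for v in values]

def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); entries with p == 0 contribute nothing to the sum."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0.0)

target = gaussian_label(5, 5, sigma=1.0)
uniform_pred = softmax([0.0] * 25)                     # uninformative prediction
perfect_pred = softmax([math.log(t) for t in target])  # reproduces the target

print(kl_divergence(target, uniform_pred))  # positive
print(kl_divergence(target, perfect_pred))  # approximately zero
```

Tiny target values contribute almost nothing here because each term is weighted by the target probability itself.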

Masking pixels or doing convolutional LSTM classification with Keras

I have a sequence of multi-band images, say each sample is a tensor of size (50, 6, 30, 30) where 50 is the number of image frames in the sequence, 6 is the number of bands per pixel, and 30x30 is the spatial dimension of the image. The ground truth map is of size 30x30, but it is one-hot encoded (to use crossentropy loss) to 7 classes, so it is a tensor of size (1, 7, 30, 30). I want to use a combination of convolutional and LSTM layers (or an integrated ConvLSTM2D layer) for my classification task, but I have the following problems:
1- Not every point has a valid label at the output map (i.e. some one-hot vectors are all-zero),
2- Not every pixel has a valid value in every time stamp. So, at every given time stamp, some of the pixels may have zero value (means invalid) for all of their band values.
I read many Q&As on how to handle this issue and I think I should use the sample_weights option to mask the invalid points and classes, but I am really uncertain how to do it. Sample_weights should be applied to every pixel and each timestamp independently. I think I could manage it if I didn't have the convolution part (a 2D approach). But I don't understand how it works when convolution is in place, because some pixel values in the convolution window are valid and some are invalid. If I mask those invalid pixels at a specific time (which I still don't know how to do), what will happen to the chain of forward and backward propagation and loss calculation? I think it will be ruined!
Looking for comments and help.
Possible solution:
Problem 1- For pixels that have no class at all, you can introduce a new class, labelled for example "noise".
Your one-hot encoding then has a value for these pixels as well, and weights will be learned for the noise class accordingly.
This is an indirect way to achieve what you would otherwise do with sample weights,
because with the sample_weight technique you tell Keras or sklearn how much weight each sample should carry.
Problem 2- For part 2, consider the possible cases: for pixels with invalid values, can the one-hot vector still contain a class, or will it be all zeros?
Alternatively, you can preprocess the data and add these pixels to the noise class as well; then problem 2 is handled by the solution to problem 1 automatically.
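For comparison with the extra-class approach, the sample-weight idea mentioned in the question boils down to multiplying each pixel's loss by a weight of 0 or 1. The sketch below shows that in plain Python on a hypothetical three-pixel, three-class example. It is not Keras code, and the function name and data are made up for illustration.

```python
import math

def masked_cross_entropy(one_hot_labels, probabilities, eps=1e-12):
    """Mean cross-entropy over labelled pixels only.

    one_hot_labels: per-pixel one-hot lists (all zeros means "no label").
    probabilities:  per-pixel predicted class probabilities.
    """
    total, n_valid = 0.0, 0
    for label, probs in zip(one_hot_labels, probabilities):
        weight = 1.0 if any(label) else 0.0  # per-pixel weight: 0 masks the pixel
        if weight == 0.0:
            continue
        total -= weight * sum(l * math.log(max(p, eps))
                              for l, p in zip(label, probs))
        n_valid += 1
    return total / n_valid if n_valid else 0.0

labels = [[1, 0, 0], [0, 0, 0], [0, 1, 0]]  # middle pixel is unlabelled
probs = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8], [0.25, 0.5, 0.25]]
probs_changed = [[0.7, 0.2, 0.1], [0.8, 0.1, 0.1], [0.25, 0.5, 0.25]]

loss = masked_cross_entropy(labels, probs)
print(loss)
# Changing the prediction at the unlabelled pixel leaves the loss untouched,
# so no gradient would flow back from that pixel.
print(masked_cross_entropy(labels, probs_changed) == loss)
```

The forward pass is unaffected by the mask; only the loss contribution of the masked pixels is removed.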

Binary classifier too confident to plot ROC curve with sklearn?

I have created a binary classifier in Tensorflow that will output a generator object containing predictions. I extract the predictions (e.g. [0.98, 0.02]) from the object into a list, later converting this into a numpy array. I have the corresponding array of labels for these predictions. Using these two arrays I believe I should be able to plot a ROC curve via:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve
fpr, tpr, thr = roc_curve(labels, predictions[:,1])
plt.plot(fpr, tpr)
plt.show()
print(fpr)
print(tpr)
print(thr)
Where predictions[:,1] gives the positive prediction score. However, running this code leads to only a flat line and only three values for each fpr, tpr and thr:
[Image: flat-line ROC plot and limited function outputs]
The only theory I have as to why this is happening is that my classifier is too sure of its predictions. Many, if not all, of the positive prediction scores are 1.0, or incredibly close to zero:
[[9.9999976e-01 2.8635742e-07]
[3.3693312e-11 1.0000000e+00]
[1.0000000e+00 9.8642090e-09]
...
[1.0106111e-15 1.0000000e+00]
[1.0000000e+00 1.0030269e-09]
[8.6156778e-15 1.0000000e+00]]
According to a few sources including this stackoverflow thread and this stackoverflow thread, the very polar values of my predictions could be creating an issue for roc_curve().
Is my intuition correct? If so is there anything I can do about it to plot my roc_curve?
I've tried to include all the information I think would be relevant to this issue but if you would like any more information about my program please ask.
ROC is generated by changing the threshold on your predictions and finding the sensitivity and specificity for each threshold. This generally means that as you increase the threshold, your sensitivity decreases but your specificity increases and it draws a picture of the overall quality of your predicted probabilities. In your case, since everything is either 0 or 1 (or very close to it) there are no meaningful thresholds to use. That's why the thr value is basically [ 1, 1, 1 ].
You can try to arbitrarily pull the values closer to 0.5 or alternatively implement your own ROC curve calculation with more tolerance for small differences.
On the other hand you might want to review your network because such result values often mean there is a problem there, maybe the labels leaked into the network somehow and therefore it produces perfect results.
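A hand-rolled ROC calculation along the lines suggested above could look like the sketch below. It is plain Python with made-up labels and scores in the style of the question, and it assumes both classes are present. It uses every distinct score as a threshold, so no points are merged. For what it's worth, sklearn's roc_curve also has a drop_intermediate argument that by default drops thresholds which do not change the shape of the curve, which may be part of why so few values come back.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr, thresholds) with one point per distinct score.

    A sample is predicted positive when its score >= threshold.
    The first point uses an infinite threshold, i.e. nothing is positive.
    Assumes labels contain at least one 0 and at least one 1.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    fpr, tpr = [], []
    for thr in thresholds:
        tp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 0)
        tpr.append(tp / positives)
        fpr.append(fp / negatives)
    return fpr, tpr, thresholds

# Illustrative data: very confident, perfectly separated scores.
labels = [0, 1, 0, 1, 1, 0]
scores = [2.86e-07, 1.0, 9.86e-09, 0.999999, 1.0, 1.0e-09]

fpr, tpr, thr = roc_points(labels, scores)
for point in zip(fpr, tpr, thr):
    print(point)
```

With perfectly separated scores the curve goes straight up to a true positive rate of 1 before the false positive rate leaves 0, which is the expected shape for a perfect classifier.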

Meaning of Weight Gradient in CNN

I developed a CNN using MatConvNet and am able to visualize the weights of the 1st layer. It looked very similar to what is shown here (also attached below in case I am not specific enough) http://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html
My question is, what are the weight gradients ? I'm not sure what those are and am unable to generate those...
Weights in a NN
In a neural network, a series of linear functions represented as matrices are applied to features (usually with a nonlinear joint between them). These functions are determined by the values in the matrices, referred to as weights.
You can visualize the weights of a normal neural network, but it usually means something slightly different to visualize the convolutional layers of a cnn. These layers are designed to learn a feature computation over the space.
When you visualize the weights, you're looking for patterns. A nice smooth filter may mean that the weights are well learned and "looking for something in particular". A noisy weight visualization may mean that you've undertrained your network, overfit it, need more regularization, or something else nefarious (a decent source for these claims).
From this decent review of weight visualizations, we can see patterns start to emerge from treating the weights as images:
Weight Gradients
"Visualizing the gradient" means taking the gradient matrix and treating it like an image [1], just like you took the weight matrix and treated it like an image before.
A gradient is just a derivative; for images, it's usually computed as a finite difference - grossly simplified, the X gradient subtracts pixels next to each other in a row, and the Y gradient subtracts pixels next to each other in a column.
For the common example of a filter that extracts edges, we may see a strong gradient in a particular direction. By visualizing the gradients (taking the matrix of finite differences and treating it like an image), you can get a more immediate idea of how your filter is operating on the input. There are a lot of cutting edge techniques (eg, eg) for interpreting these results, but making the image pop up is the easy part!
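The finite-difference description above can be sketched in a few lines of plain Python. The matrix below is a made-up vertical edge, and the function names are only for illustration.

```python
def x_gradient(matrix):
    """Difference between horizontally adjacent entries in each row."""
    return [[row[c + 1] - row[c] for c in range(len(row) - 1)]
            for row in matrix]

def y_gradient(matrix):
    """Difference between vertically adjacent entries in each column."""
    return [
        [matrix[r + 1][c] - matrix[r][c] for c in range(len(matrix[0]))]
        for r in range(len(matrix) - 1)
    ]

# Hypothetical 3x4 filter or image with a vertical edge in the middle.
vertical_edge = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

print(x_gradient(vertical_edge))  # non-zero only where the edge is
print(y_gradient(vertical_edge))  # all zeros: nothing changes along columns
```

Displaying either result as an image is then the same step as displaying the weights themselves.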
A similar technique involves visualizing the activations after a forward pass over the input. In this case, you're looking at how the input was changed by the weights; by visualizing the weights, you're looking at how you expect them to change the input.
Don't over-think it - the weights are interesting because they let us see how the function behaves, and the gradients of the weights are just another feature to help explain what's going on. There's nothing sacred about that feature: here are some cool clustering features (t-SNE) from the google paper that look at space separability.
[1] It can be more complicated if you introduce weight sharing, but not that much
My answer here covers this question https://stackoverflow.com/a/68988426/10661506
Long story short, the weight gradient of layer l is the gradient of the loss with respect to the weights of layer l.
If you have a correct implementation of backpropagation, you should have access to these gradients, as they are needed to compute the weight update at every layer.
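As a tiny worked example of that definition, the sketch below takes a single linear neuron with a squared-error loss, computes the gradient of the loss with respect to each weight analytically (what backpropagation would produce for this layer), and checks it against a numerical estimate. The neuron, the loss, and all values are assumptions chosen to keep the example small; it uses plain Python rather than MatConvNet.

```python
def forward(weights, inputs):
    """Single linear neuron: y = sum_i w_i * x_i."""
    return sum(w * x for w, x in zip(weights, inputs))

def loss(weights, inputs, target):
    """Squared error between the neuron output and the target."""
    return (forward(weights, inputs) - target) ** 2

def weight_gradient(weights, inputs, target):
    """Analytic dL/dw_i = 2 * (y - target) * x_i."""
    error = forward(weights, inputs) - target
    return [2.0 * error * x for x in inputs]

def numeric_gradient(weights, inputs, target, h=1e-6):
    """Central-difference estimate of dL/dw_i, used only as a sanity check."""
    grads = []
    for i in range(len(weights)):
        up, down = list(weights), list(weights)
        up[i] += h
        down[i] -= h
        grads.append((loss(up, inputs, target) - loss(down, inputs, target)) / (2 * h))
    return grads

weights = [0.5, -1.0, 2.0]
inputs = [1.0, 2.0, 3.0]
target = 1.0

analytic = weight_gradient(weights, inputs, target)
numeric = numeric_gradient(weights, inputs, target)
print(analytic)
print(numeric)

# One gradient-descent step: this is the weight update the gradients are used for.
learning_rate = 0.01
updated = [w - learning_rate * g for w, g in zip(weights, analytic)]
print(updated)
```

The gradient has the same shape as the weights, which is why it can be visualized in the same way.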
