How can I avoid getting overlapping keypoints during inference? - pytorch

I have been using Detectron2 to recognize 4 keypoints in each image. My dummy dataset consists of 1000 images, and I applied augmentations:
import detectron2.data.transforms as T
from detectron2.data import DatasetMapper, build_detection_train_loader

# Defined as a classmethod on a custom Trainer subclass
def build_train_loader(cls, cfg):
    augs = [
        T.RandomFlip(prob=0.5, horizontal=True),
        T.RandomFlip(prob=0.5, horizontal=False, vertical=True),
        T.RandomRotation(angle=[0, 180]),
        T.RandomSaturation(0.9, 1.9),
    ]
    return build_detection_train_loader(
        cfg,
        mapper=DatasetMapper(cfg, is_train=True, augmentations=augs),
    )
I have checked the images after applying those transforms (each type of transform was tested separately), and they look fine: the keypoints are positioned correctly.
Now, after the training phase (keypoint_rcnn_R_50_FPN_3x.yaml), I get some identical keypoints at inference time; in many images several of the predicted keypoints overlap.
Here are a few samples from my results:
[[[180.4211, 332.8872, 0.7105],
[276.3517, 369.3892, 0.7390],
[276.3517, 366.9956, 0.4788],
[220.5920, 296.9836, 0.9515]]]
And from another image:
[[[611.8049, 268.8926, 0.7576],
[611.8049, 268.8926, 1.2022],
[699.7122, 261.2566, 1.7348],
[724.5556, 198.2591, 1.4403]]]
I have compared the inference results with and without augmentations, and it seems that with augmentation the keypoints are barely recognized at all. How can that be?
Can someone please suggest how to overcome this kind of mistake? What am I doing wrong?
Thank you!
I have added a link to my Google Colab notebook:
https://colab.research.google.com/drive/1uIzvB8vCWdGrT7qnz2d2npEYCqOxET5S?usp=sharing

The problem is that there's nothing unique about the different corners of the rectangle. However, in your annotation and in your loss function there is an implicit assumption that the order of the corners is significant:
The corners are labeled in a specific order and the network is trained to output the corners in that specific order.
However, when you augment the dataset by flipping and rotating the images, you change the implicit order of the corners, and now the net does not know which of the four corners it should predict in each output slot.
As far as I can see you have two ways of addressing this issue:
Explicitly force order on the corners:
Make sure that no matter what augmentation the image underwent, the ground-truth points of each rectangle are ordered "top left", "top right", "bottom left", "bottom right". This means you'll have to not only transform the coordinates of the corners (as you are doing now) but also reorder them, as sketched below.
Adding this consistency should help your model overcome the ambiguity in identifying the different corners.
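A minimal sketch of such a reordering step (a hypothetical helper, assuming each instance has exactly four corner keypoints in (x, y, visibility) format and the rectangles are roughly axis-aligned after augmentation); you could call it from a custom mapper, right after the augmentations have transformed the keypoints:

import numpy as np

def reorder_corners(kpts):
    # kpts: array of shape (4, 3) with (x, y, visibility) per corner keypoint.
    # Returns the same points ordered top-left, top-right, bottom-left, bottom-right.
    kpts = np.asarray(kpts, dtype=float)
    by_y = kpts[np.argsort(kpts[:, 1])]          # sort by y: first two are the top pair
    top, bottom = by_y[:2], by_y[2:]
    top = top[np.argsort(top[:, 0])]             # within each pair, left before right
    bottom = bottom[np.argsort(bottom[:, 0])]
    return np.concatenate([top, bottom], axis=0)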
Make the loss invariant to the order of the predicted corners:
Suppose your ground-truth rectangle spans the domain [0, 1]x[0, 1]: the four corners you should predict are [[0, 0], [1, 1], [1, 0], [0, 1]]. Note that if you predict [[1, 1], [0, 0], [0, 1], [1, 0]] your loss is very high, although you predicted the right corners, just in a different order than the annotated one.
Therefore, you should make your loss invariant to the order of the predicted points:
loss = min over permutations π of Σ_i ||pred_π(i) − gt_i||²
where π(i) is a permutation of the corners.
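A minimal sketch of such an order-invariant loss (a hypothetical drop-in, not Detectron2's built-in keypoint loss; with only 4 corners, enumerating all 4! = 24 permutations is cheap):

import itertools
import torch

def order_invariant_corner_loss(pred, gt):
    # pred, gt: tensors of shape (4, 2) holding (x, y) for each corner.
    losses = []
    for perm in itertools.permutations(range(4)):
        losses.append(((pred[list(perm)] - gt) ** 2).sum())
    return torch.stack(losses).min()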

Related

Is it possible to add tensors of different sizes together in pytorch?

I have an image gradient of size (3, 224, 224) and a patch of (1, 768). Is it possible to add this gradient to the patch and get a result of the patch's size, (1, 768)?
Forgive my inquisitiveness. I know PyTorch also uses broadcasting, but I am not sure whether I will be able to do this with two tensors of different shapes in a way similar to the line below:
torch.add(a, b)
For example: the end product would be the same patch on the left with the gradient of the entire image on the right added to it. My understanding is that it's not possible, but knowledge isn't bounded.
No. Whether two tensors are broadcastable is defined by the following rules:
Each tensor has at least one dimension.
When iterating over the dimension sizes, starting at the trailing dimension, the dimension sizes must either be equal, one of them is 1, or one of them does not exist.
Because the second bullet doesn't hold in your example (i.e., 768 != 224, 1 not in {224, 768}), you can't broadcast the add. If you have some meaningful way to reshape your gradients, you might be able to.
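To make the rules concrete, here is a quick check with the shapes from the question (the (1, 224) tensor is just an illustration of a shape that does broadcast):

import torch

a = torch.randn(3, 224, 224)   # image gradient
b = torch.randn(1, 768)        # patch

try:
    torch.add(a, b)            # trailing dims 224 vs 768: neither equal nor 1, so this fails
except RuntimeError as e:
    print(e)

c = torch.randn(1, 224)        # trailing 224 matches, the 1 broadcasts, leading dim is missing
print((a + c).shape)           # torch.Size([3, 224, 224])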
I figured out how to do it myself. I divided the image gradient (right) into 16 x 16 patches and created a loop that adds each patch to the original image patch (left). This way, I was able to add a 224 x 224 image gradient to a 16 x 16 patch. I just wanted to see what would happen if I did that.
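A vectorized sketch of that patch loop (assuming 16x16 patches over the 3x224x224 gradient, so each patch flattens to 3 * 16 * 16 = 768 values and there are 14 * 14 = 196 of them):

import torch

grad = torch.randn(3, 224, 224)    # image gradient
patch = torch.randn(1, 768)        # original image patch

p = 16
patches = grad.unfold(1, p, p).unfold(2, p, p)                    # (3, 14, 14, 16, 16)
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * p * p)   # (196, 768)

result = patch + patches           # (1, 768) broadcasts over the 196 patch rows
print(result.shape)                # torch.Size([196, 768])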

How to calculate errors of best value of parameters that obtained from MCMC method and observational data

I had a model and some observational data. I used the MCMC method to obtain the best-fit free parameters and plotted contours of the 1- to 3-sigma confidence levels (as you see in the plot). I want the +/- value for each best-fit parameter at each confidence level, but I know the distribution is not symmetric, so the usual square-root-of-the-variance formula is not useful here. Is there any other way to calculate the +/- errors?
These are my contours.
This is what I want to get.
I used np.percentile(w, [15, 85]), np.percentile(w, [5, 95]) and np.percentile(w, [0.5, 99.5]) on the final Gaussian probability distribution and found the errors for each confidence level correctly.
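For reference, this is roughly what that looks like on a 1D chain of samples (w is assumed to be the flattened posterior chain of one parameter; the percentile pairs are the ones from the comment above):

import numpy as np

w = np.random.normal(loc=0.5, scale=0.1, size=100_000)  # stand-in for the posterior samples
best = np.median(w)                                      # or your best-fit value from the chain

for lo_p, hi_p in [(15, 85), (5, 95), (0.5, 99.5)]:
    lo, hi = np.percentile(w, [lo_p, hi_p])
    print(f"{best:.4f}  +{hi - best:.4f} / -{best - lo:.4f}   ({lo_p}-{hi_p} percentiles)")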

Conv3D size doesn’t make sense with NIFTI data?

So I am writing a custom dataset for medical images in .nii (NIfTI-1) format, but there is some confusion.
My dataloader returns the shape torch.Size([1, 1, 256, 256, 51]). But NIfTI volumes use anatomical axes, a different coordinate system, so it doesn't seem to make sense to permute the axes the way I normally would for a volume built from 2D images stored separately on disk with 51 slices (the depth), given that Conv3d follows the (N, C, D, H, W) convention.
So torch.Size([1, 1, 256, 256, 51]) (ordinarily 51 would be the depth) doesn't follow the (N, C, D, H, W) convention, but should I avoid permuting the axes because the data uses an entirely different coordinate system?
In PyTorch's 3D convolution layer, the naming of the three dimensions you convolve over is not really important (e.g. the layer doesn't give depth any special treatment compared to height). All the difference comes from the kernel_size argument (and also padding, if you use it). If you permute the dimensions and correspondingly permute the kernel_size values, nothing really changes. So you can either permute your input's dimensions using e.g. x.permute(0, 1, 4, 2, 3) or continue using your initial tensor with depth as the last dimension.
Just to clarify: if you wanted to use kernel_size=(2, 10, 10) on your DxHxW volume, you can instead use kernel_size=(10, 10, 2) on your HxWxD volume. If you want all your code to explicitly assume that the dimension order is always D, H, W, then create a tensor with permuted dimensions using x.permute(0, 1, 4, 2, 3).
Let me know if I somehow misunderstand the problem you have.
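A quick sketch of both options with the shape from the question (the kernel sizes here are only illustrative):

import torch
import torch.nn as nn

x = torch.randn(1, 1, 256, 256, 51)       # (N, C, H, W, D) as returned by the loader

# Option 1: keep the tensor as-is and order kernel_size as (H, W, D)
conv_hwd = nn.Conv3d(1, 8, kernel_size=(10, 10, 2))
print(conv_hwd(x).shape)                  # torch.Size([1, 8, 247, 247, 50])

# Option 2: permute to the usual (N, C, D, H, W) layout and use kernel_size=(D, H, W)
x_dhw = x.permute(0, 1, 4, 2, 3)          # (1, 1, 51, 256, 256)
conv_dhw = nn.Conv3d(1, 8, kernel_size=(2, 10, 10))
print(conv_dhw(x_dhw).shape)              # torch.Size([1, 8, 50, 247, 247])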

Direct Heatmap Regression with Fully Convolutional Nets

I'm trying to develop a fully-convolutional neural net to estimate the 2D locations of keypoints in images that contain renders of known 3D models. I've read plenty of literature on this subject (human pose estimation, model based estimation, graph networks for occluded objects with known structure) but no method I've seen thus far allows for estimating an arbitrary number of keypoints of different classes in an image. Every method I've seen is trained to output k heatmaps for k keypoint classes, with one keypoint per heatmap. In my case, I'd like to regress k heatmaps for k keypoint classes, with an arbitrary number of (non-overlapping) points per heatmap.
In this toy example, the network would output heatmaps around each visible location of an upper vertex for each shape. The cubes have 4 vertices on top, the extruded pentagons have 2, and the pyramids just have 1. Sometimes points are offscreen or occluded, and I don't wish to output heatmaps for occluded points.
The architecture is a 6-6 layer U-Net (as in this paper: https://arxiv.org/pdf/1804.09534.pdf). The ground-truth heatmaps are normal distributions centered around each keypoint. When training the network with a batch size of 5 and an L2 loss, the network learns to never make an estimate at all, just outputting blank images. Datatypes are converted properly and normalized from 0 to 1 for the input and 0 to 255 for the output. I'm not sure how to solve this; are there any red flags with my general approach? I'll post code if there's no clear problem in general.
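For context, a minimal sketch of the kind of multi-point ground truth described above (one channel per keypoint class, one Gaussian blob per visible keypoint; the sigma value here is arbitrary):

import numpy as np

def make_heatmaps(keypoints_per_class, height, width, sigma=3.0):
    # keypoints_per_class: list of length k; each entry is a list of visible (x, y) points.
    # Returns an array of shape (k, height, width) with values in [0, 1].
    ys, xs = np.mgrid[0:height, 0:width]
    heatmaps = np.zeros((len(keypoints_per_class), height, width), dtype=np.float32)
    for c, points in enumerate(keypoints_per_class):
        for (x, y) in points:
            blob = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
            heatmaps[c] = np.maximum(heatmaps[c], blob)   # max, so nearby blobs don't sum above 1
    return heatmaps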

Is there camera RELATIVE rotation matrix?

Suppose that I have two cameras. I don't know the exact poses of these two cameras, so their rotation matrices, denoted R1 and R2 respectively, are unknown. But I do know the relative angles of these cameras along the three axes: if the angles of the two cameras along the three axes are (alpha1, beta1, gamma1) and (alpha2, beta2, gamma2), then the relative angles (deltaX, deltaY, deltaZ) = (alpha2 - alpha1, beta2 - beta1, gamma2 - gamma1) are known.
My question is: can we form a "relative" rotation matrix R12 such that R2 = R12*R1?
I ask because there are many methods to construct a rotation matrix, and their results differ (I still don't understand why a camera can have different rotation matrices). In this case, I construct the rotation matrix by multiplying three rotation matrices, one per axis. More specifically, R = Rz*Ry*Rx.
When I test with code in Matlab,
R(alpha2, 0, 0)*R(alpha1, 0, 0) = R(alpha1+alpha2, 0, 0).
But
R(alpha2, beta2, gamma2)*R(alpha1, beta1, gamma1) != R(alpha1+alpha2, beta1+beta2, gamma1+gamma2).
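A numpy version of that check, assuming the same R = Rz*Ry*Rx convention (angles in radians):

import numpy as np

def rot(alpha, beta, gamma):
    # R = Rz(gamma) @ Ry(beta) @ Rx(alpha)
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    Rx = np.array([[1, 0, 0], [0, ca, -sa], [0, sa, ca]])
    Ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
    Rz = np.array([[cg, -sg, 0], [sg, cg, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

a1, b1, g1 = 0.3, 0.5, 0.2
a2, b2, g2 = 0.4, 0.1, 0.6

# Single-axis rotations compose by adding angles:
print(np.allclose(rot(a2, 0, 0) @ rot(a1, 0, 0), rot(a1 + a2, 0, 0)))                     # True
# General Euler-angle rotations do not:
print(np.allclose(rot(a2, b2, g2) @ rot(a1, b1, g1), rot(a1 + a2, b1 + b2, g1 + g2)))     # False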
Thanks!
