Calculating the kernel, stride and padding size in a Conv3D - pytorch

I have a 5D input tensor and I also know what 5D tensor I need as the output. How can I find out whether I need padding or not? Also, how can I calculate the kernel size and stride size? There are formulas for this, but the stride and kernel sizes each contain three elements, like (a, b, c). When it comes to calculating these multiple elements, solving the equations becomes more complicated. How can I compute those elements?
For example, if I have a tensor of size [16, 1024, 16, 14, 14], how can I calculate the stride and kernel sizes in a Conv3d if I want an output with the size of [16, 32, 16, 3, 5]? How do I know whether padding is needed?
Is there any website that can automatically calculate these parameters?
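For reference, Conv3d's output size along each spatial dimension follows out = floor((in + 2*padding - kernel) / stride) + 1 (with dilation 1), and each of the three dimensions can be solved independently. Below is a minimal sketch that checks one possible combination for the example shapes above; the kernel/stride/padding values here are just one hand-picked solution, not the only one:

import torch
import torch.nn as nn

x = torch.randn(16, 1024, 16, 14, 14)
# One combination that maps (16, 14, 14) -> (16, 3, 5); others exist as well
conv = nn.Conv3d(in_channels=1024, out_channels=32,
                 kernel_size=(3, 5, 6), stride=(1, 4, 2), padding=(1, 0, 0))
print(conv(x).shape)  # torch.Size([16, 32, 16, 3, 5])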

Related

Is it possible to add tensors of different sizes together in pytorch?

I have an image gradient of size (3, 224, 224) and a patch of (1, 768). Is it possible to add this gradient to the patch and get a result with the size of the patch, (1, 768)?
Forgive my inquisitiveness. I know PyTorch also utilizes broadcasting, but I am not sure whether I will be able to do so with two tensors of different shapes in a way similar to the line below:
torch.add(a, b)
For example:
The end product would be the same patch on the left with the gradient of an entire image on the right added to it. My understanding is that it’s not possible, but knowledge isn’t bounded.
No. Whether two tensors are broadcastable is defined by the following rules:
Each tensor has at least one dimension.
When iterating over the dimension sizes, starting at the trailing dimension, the dimension sizes must either be equal, one of them is 1, or one of them does not exist.
Because the second bullet doesn't hold in your example (i.e., 768 != 224, 1 not in {224, 768}), you can't broadcast the add. If you have some meaningful way to reshape your gradients, you might be able to.
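A quick illustration of the rule with the shapes from the question (the (1, 224) tensor is just a made-up example of a shape that does broadcast):

import torch

a = torch.randn(3, 224, 224)
b = torch.randn(1, 768)
try:
    torch.add(a, b)           # trailing dims 224 vs 768: not equal and neither is 1
except RuntimeError as e:
    print("broadcast failed:", e)

c = torch.randn(1, 224)       # trailing dims line up: 224 == 224, then 224 vs 1
print(torch.add(a, c).shape)  # torch.Size([3, 224, 224])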
I figured out how to do it myself. I divided the image gradient (right) into 16 x 16 patches and created a loop that adds each patch to the original image patch (left). This way, I was able to add a 224 x 224 image gradient to a 16 x 16 patch. I just wanted to see what would happen if I did that.
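A rough sketch of what that loop might look like, assuming the (1, 768) patch is a flattened 3 x 16 x 16 patch (the tile size and flattening order here are guesses, not from the original post):

import torch

grad = torch.randn(3, 224, 224)   # image gradient
patch = torch.randn(1, 768)       # flattened 3 x 16 x 16 patch

# Cut the gradient into non-overlapping 16 x 16 tiles: (3, 14, 14, 16, 16)
tiles = grad.unfold(1, 16, 16).unfold(2, 16, 16)
# Flatten each tile so it lines up with the patch: (196, 768)
tiles = tiles.permute(1, 2, 0, 3, 4).reshape(-1, 3 * 16 * 16)

result = patch.clone()
for t in tiles:                   # add every tile of the gradient to the single patch
    result = result + t
print(result.shape)               # torch.Size([1, 768])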

Conv3D size doesn’t make sense with NIFTI data?

So I am writing a custom dataset for medical images in .nii (NIFTI1 format), but there is some confusion.
My dataloader returns the shape torch.Size([1, 1, 256, 256, 51]). But NIFTI volumes use anatomical axes, a different coordinate system, so it doesn't seem to make sense to permute the axes, which I normally would do for a volume built from 51 separate 2D slice images (the depth) stored on the local drive, since Conv3d follows the convention (N, C, D, H, W).
So torch.Size([1, 1, 256, 256, 51]) (ordinarily 51 would be the depth) doesn't follow the convention (N, C, D, H, W), but should I avoid permuting the axes since the data uses an entirely different coordinate system?
In PyTorch's 3D convolution layer, the naming of the three dimensions you convolve over is not really important (e.g. the layer has no special treatment for depth compared to height). All the difference comes from the kernel_size argument (and also padding, if you use it). If you permute the dimensions and correspondingly permute the kernel_size entries, nothing really changes. So you can either permute your input's dimensions using e.g. x.permute(0, 1, 4, 2, 3) or continue using your initial tensor with depth as the last dimension.
Just to clarify: if you wanted to use kernel_size=(2, 10, 10) on your DxHxW image, you can now instead use kernel_size=(10, 10, 2) on your HxWxD image. If you want all of your code to explicitly assume that the dimension order is always D, H, W, then you can create a tensor with permuted dimensions using x.permute(0, 1, 4, 2, 3).
Let me know if I somehow misunderstand the problem you have.
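A quick sketch of the two equivalent options (out_channels=8 is an arbitrary choice for the example):

import torch
import torch.nn as nn

x = torch.randn(1, 1, 256, 256, 51)      # (N, C, H, W, D) straight from the NIFTI loader

# Option 1: keep the tensor as-is and order kernel_size as (H, W, D)
conv_hwd = nn.Conv3d(1, 8, kernel_size=(10, 10, 2))
print(conv_hwd(x).shape)                 # torch.Size([1, 8, 247, 247, 50])

# Option 2: permute to the conventional (N, C, D, H, W) and order kernel_size as (D, H, W)
conv_dhw = nn.Conv3d(1, 8, kernel_size=(2, 10, 10))
print(conv_dhw(x.permute(0, 1, 4, 2, 3)).shape)  # torch.Size([1, 8, 50, 247, 247])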

Using CNN with Dataset that has different depths between volumes

I am working with medical images, where I have 130 patient volumes, and each volume consists of N DICOM images/slices.
The problem is that the number of slices N varies between volumes.
The majority (about 50%) of the volumes have 20 slices; the rest vary by 3 or 4 slices, and some by more than 10 (so much so that interpolating to make the number of slices equal between volumes is not feasible).
I am able to use Conv3d for volumes where the depth N (number of slices) is the same, but I have to make use of the entire dataset for the classification task. So how do I incorporate the entire dataset and feed it to my network model?
If I understand your question, you have 130 3-dimensional images, which you need to feed into a 3D ConvNet. I'll assume your batches, if N were the same for all of your data, would be tensors of shape (batch_size, channels, N, H, W), and your problem is that N varies between data samples.
So there are two problems. First, there's the problem of your model needing to handle data with different values of N. Second, there's the more implementation-related problem of batching data of different lengths.
Both problems come up in video classification models. For the first, I don't think there's a way of getting around interpolating SOMEWHERE in your model (unless you're willing to pad/cut/sample) -- if you're doing any kind of classification task, you pretty much need a constant-sized layer at your classification head. However, the interpolation doesn't have to happen right at the beginning. For example, if for an input tensor of size (batch, 3, 20, 256, 256) your network conv-pools down to (batch, 1024, 4, 1, 1), then you can perform an adaptive pool (e.g. https://pytorch.org/docs/stable/nn.html#torch.nn.AdaptiveAvgPool3d) right before the output to downsample everything larger to that size before prediction.
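A toy sketch of that idea (the layer widths and the 64x64 spatial size below are made up, not from the answer); the point is that AdaptiveAvgPool3d gives the head a fixed-size input regardless of N:

import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(64, 256, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool3d((4, 1, 1))   # fixed output size for any depth N
        self.head = nn.Linear(256 * 4, num_classes)

    def forward(self, x):
        x = self.pool(self.features(x))
        return self.head(x.flatten(1))

net = TinyNet()
print(net(torch.randn(2, 3, 20, 64, 64)).shape)   # torch.Size([2, 2])
print(net(torch.randn(2, 3, 17, 64, 64)).shape)   # torch.Size([2, 2]), despite a different N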
The other option is padding and/or truncating and/or resampling the images so that all of your data is the same length. For videos, sometimes people pad by looping the frames, or you could pad with zeros. What's valid depends on whether your length axis represents time, or something else.
For the second problem, batching: If you're familiar with pytorch's dataloader/dataset pipeline, you'll need to write a custom collate_fn which takes a list of outputs of your dataset object and stacks them together into a batch tensor. In this function, you can decide whether to pad or truncate or whatever, so that you end up with a tensor of the correct shape. Different batches can then have different values of N. A simple example of implementing this pipeline is here: https://github.com/yunjey/pytorch-tutorial/blob/master/tutorials/03-advanced/image_captioning/data_loader.py
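A minimal sketch of such a collate_fn, assuming the dataset returns (volume, label) pairs with volumes shaped (C, N, H, W), that zero-padding along N is acceptable, and with my_dataset as a placeholder name:

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

def collate_volumes(batch):
    volumes, labels = zip(*batch)
    max_n = max(v.shape[1] for v in volumes)
    # F.pad's tuple is ordered (W_left, W_right, H_left, H_right, N_left, N_right)
    padded = [F.pad(v, (0, 0, 0, 0, 0, max_n - v.shape[1])) for v in volumes]
    return torch.stack(padded), torch.tensor(labels)

# loader = DataLoader(my_dataset, batch_size=4, collate_fn=collate_volumes)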
Something else that might help with batching is putting your data into buckets depending on their N dimension. That way, you might be able to avoid lots of unnecessary padding.
You'll need to flatten the dataset and treat every individual slice as an input to the CNN. You can encode each categorical variable as a boolean Yes/No flag, and for numerical variables you can set the input to the equivalent of none (usually 0).

Padding in Conv2D gives wrong result?

I'm using the Conv2D method of Keras. In the documentation it is written that
padding: one of "valid" or "same" (case-insensitive). Note that "same"
is slightly inconsistent across backends with strides != 1, as
described here
As input I have images of size (64, 80, 1) and I'm using a kernel of size 3x3. Does that mean that the padding is wrong when using Conv2D(32, 3, strides=2, padding='same')(input)?
How can I fix it using ZeroPadding2D?
Based on your comment and seeing that you defined a stride of 2, I believe what you want to achieve is an output size that's exactly half of the input size, i.e. output_shape == (32, 40, 32) (where the last 32 is the number of filters).
In that case, just call model.summary() on the final model and you will see if that is the case or not.
If it is, there's nothing else to do.
If it's bigger than you want, you can add a Cropping2D layer to cut off pixels from the borders of the image.
If it's smaller than you want, you can add a ZeroPadding2D layer to add zero-pixels to the borders of the image.
The syntax to create these layers is
Cropping2D(cropping=((a, b), (c, d)))
ZeroPadding2D(padding=((a, b), (c, d)))
a: number of rows you want to add/cut off to/from the top
b: number of rows you want to add/cut off to/from the bottom
c: number of columns you want to add/cut off to/from the left
d: number of columns you want to add/cut off to/from the right
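A small sketch of checking this in code (the tf.keras API is assumed here; the crop/pad amounts in the comments are only illustrative):

from tensorflow.keras import layers, models

inp = layers.Input(shape=(64, 80, 1))
x = layers.Conv2D(32, 3, strides=2, padding='same')(inp)
model = models.Model(inp, x)
model.summary()   # the conv output here should come out as (None, 32, 40, 32)

# If the spatial size were larger than intended, trim it, e.g.:
#   x = layers.Cropping2D(cropping=((0, 1), (0, 1)))(x)
# If it were smaller than intended, pad it instead, e.g.:
#   x = layers.ZeroPadding2D(padding=((0, 1), (0, 1)))(x)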
Note, however, that there is no strict technical need to perfectly halve the size with each convolution layer. Your model might work well without any padding or cropping; you will have to experiment in order to find out.

process 35 x 35 kernel using convolution method

Dear all, I would like to do a convolution using a 35 x 35 kernel. Any suggestions, or is there a method already in OpenCV I can use? Right now cvFilter2D can only support kernels up to 10 x 10.
If you just need a quick-and-dirty solution due to OpenCV's size limitation, you can divide the 35x35 kernel into a 5x5 set of 7x7 "kernel tiles", apply each "kernel tile" to the image to get an output, then shift the results and combine them to get the final sum.
General suggestions for convolution with large 2D kernels:
Try to use kernels that are separable, i.e. a kernel that is the outer product of a column vector and a row vector. In other words, the matrix that represents the kernel is rank-1.
Try the FFT method. Convolution in the spatial domain corresponds to elementwise multiplication in the frequency domain (correlation corresponds to multiplication by the complex conjugate).
If the kernel is full-rank and cannot be modified for the application's purposes, then consider using SVD to decompose the kernel into a set of 35 rank-1 matrices (each of which can be expressed as the outer product of a column vector and a row vector), and perform convolution only with the matrices associated with the largest singular values. This introduces errors into the results, but the error can be estimated based on the singular values. (a.k.a. the MATLAB method; a rough sketch follows below this list.)
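A rough sketch of the SVD idea in Python with OpenCV (the kernel and image here are random placeholders; note that OpenCV's filter functions actually compute correlation, so flip the kernel first if you need true convolution):

import cv2
import numpy as np

def filter_svd(image, kernel, rank=5):
    # Rank-r approximation: keep only the terms with the largest singular values
    u, s, vt = np.linalg.svd(kernel)
    out = np.zeros_like(image, dtype=np.float64)
    for i in range(rank):
        col = (u[:, i] * np.sqrt(s[i])).reshape(-1, 1)   # column filter
        row = (vt[i, :] * np.sqrt(s[i])).reshape(1, -1)  # row filter
        # Each rank-1 term is separable, so it costs two 1D passes instead of one 2D pass
        out += cv2.sepFilter2D(image.astype(np.float64), -1, row, col)
    return out

img = np.random.rand(256, 256)
kernel = np.random.rand(35, 35)
approx = filter_svd(img, kernel, rank=5)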
Other special cases:
Kernels that can be expressed as a sum of overlapping rectangular blocks can be computed using the integral image (the method used in Viola-Jones face detection).
Kernels that are smooth and modal (with a small number of peaks) can be approximated by a sum of 2D Gaussians.

Resources