Torch squeeze and the batch dimension - pytorch

Does anyone here know whether the torch.squeeze function respects the batch (i.e. first) dimension? From some inline code it seems it does not, but maybe someone else knows the inner workings better than I do.
Btw, the underlying problem is that I have a tensor of shape (n_batch, channel, x, y, 1). I want to remove the last dimension with a simple function, so that I end up with a shape of (n_batch, channel, x, y).
A reshape is of course possible, or even selecting the last axis. But I want to embed this functionality in a layer so that I can easily add it to a ModuleList or Sequential object.
EDIT: I just found out that for TensorFlow (2.5.0) the function tf.linalg.diag DOES respect the batch dimension. Just an FYI that it might differ per function you are using.

No! squeeze doesn't respect the batch dimension. It's a potential source of error if you use squeeze when the batch dimension may be 1. Rule of thumb is that only classes and functions in torch.nn respect batch dimensions by default.
This has caused me headaches in the past. I recommend using reshape, or only using squeeze with the optional dimension argument. In your case you could use .squeeze(4) (or equivalently .squeeze(-1)) to remove only the last dimension. That way nothing unexpected happens. squeeze without the dimension argument has led me to unexpected results, specifically when:
the input shape to the model may vary
batch size may vary
nn.DataParallel is being used (in which case batch size for a particular instance may be reduced to 1)
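If you want this as a layer you can drop into an nn.Sequential or ModuleList (as asked above), here is a minimal sketch of such a module; the class name is made up for illustration:

import torch
import torch.nn as nn

class SqueezeDim(nn.Module):
    # Squeeze one explicitly chosen dimension; the batch dimension is never touched.
    def __init__(self, dim=-1):
        super().__init__()
        self.dim = dim

    def forward(self, x):
        return x.squeeze(self.dim)

# (n_batch, channel, x, y, 1) -> (n_batch, channel, x, y), even when n_batch == 1
layer = SqueezeDim(-1)
print(layer(torch.zeros(1, 3, 8, 8, 1)).shape)  # torch.Size([1, 3, 8, 8])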

The accepted answer is sufficient for the problem, i.e. squeezing the last dimension. However, I had a tensor of shape (batch, 1280, 1, 1) and wanted (batch, 1280). A single squeeze call didn't allow for that: squeeze(tensor, 1).shape -> (batch, 1280, 1, 1) and squeeze(tensor, 2).shape -> (batch, 1280, 1). I could have called squeeze twice, but you know, aesthetics :).
What helped me was torch.flatten(tensor, start_dim=1) -> (batch, 1280). Trivial, but I forgot about it. A warning though: this function may create a copy instead of a view, so be careful.
https://pytorch.org/docs/stable/generated/torch.flatten.html
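If you need it as a layer rather than a function call (e.g. inside nn.Sequential), nn.Flatten takes the same start_dim; a small sketch of the case above:

import torch
import torch.nn as nn

x = torch.randn(8, 1280, 1, 1)           # (batch, 1280, 1, 1)
flat = torch.flatten(x, start_dim=1)     # functional form -> (8, 1280)
layer_out = nn.Flatten(start_dim=1)(x)   # layer form, drop-in for nn.Sequential -> (8, 1280)
print(flat.shape, layer_out.shape)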

Related

Is it possible to add tensors of different sizes together in pytorch?

I have an image gradient of size (3, 224, 224) and a patch of size (1, 768). Is it possible to add this gradient to the patch so that the result keeps the patch size (1, 768)?
Forgive my inquisitiveness. I know PyTorch also uses broadcasting, and I am not sure whether I will be able to do so with two differently sized tensors in a way similar to the line below:
torch.add(a, b)
For example, the end product would be the same patch on the left with the gradient of the entire image on the right added to it. My understanding is that it's not possible, but knowledge isn't bounded.
No. Whether two tensors are broadcastable is defined by the following rules:
Each tensor has at least one dimension.
When iterating over the dimension sizes, starting at the trailing dimension, the dimension sizes must either be equal, one of them is 1, or one of them does not exist.
Because the second bullet doesn't hold in your example (i.e., 768 != 224, 1 not in {224, 768}), you can't broadcast the add. If you have some meaningful way to reshape your gradients, you might be able to.
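To make the rule concrete, here is a quick check with the shapes from the question (random data, just to see the error and a shape that does broadcast):

import torch

a = torch.randn(3, 224, 224)   # image gradient
b = torch.randn(1, 768)        # patch

# Trailing dimensions are compared first: 224 vs 768, neither equal nor 1, so this raises
# a RuntimeError along the lines of "The size of tensor a (224) must match the size of tensor b (768)".
try:
    torch.add(a, b)
except RuntimeError as e:
    print(e)

# A shape that does satisfy the rules broadcasts fine: (3, 224, 224) + (1, 224) -> (3, 224, 224)
c = torch.randn(1, 224)
print(torch.add(a, c).shape)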
I figured out how to do it myself. I divided the image gradient (right) into 16 x 16 patches and created a loop that adds each of them to the original image patch (left). This way, I was able to add a 224 x 224 image gradient to a 16 x 16 patch. I just wanted to see what would happen if I did that.
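A vectorized sketch of that idea, assuming non-overlapping 16 x 16 tiles and that 768 = 3 * 16 * 16 is the flattened patch size (shapes as in the question, data random):

import torch

grad = torch.randn(3, 224, 224)   # image gradient
patch = torch.randn(1, 768)       # flattened 16x16x3 patch

# Cut the gradient into non-overlapping 16x16 tiles: (3, 14, 14, 16, 16),
# then flatten each tile to 3*16*16 = 768 values -> (196, 768).
tiles = grad.unfold(1, 16, 16).unfold(2, 16, 16)
tiles = tiles.permute(1, 2, 0, 3, 4).reshape(-1, 768)

# Now broadcasting works: (196, 768) + (1, 768) -> (196, 768),
# i.e. the same patch with every gradient tile added to it.
summed = tiles + patch
print(summed.shape)               # torch.Size([196, 768])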

Conv3D size doesn’t make sense with NIFTI data?

So I am writing a custom dataset for medical images in .nii (NIfTI-1) format, but there is some confusion.
My dataloader returns the shape torch.Size([1, 1, 256, 256, 51]). But NIfTI volumes use anatomical axes, a different coordinate system, so it doesn't seem to make sense to permute the axes, which I normally would do for a volume built from 2D images (51 slices, i.e. the depth) stored separately on the local drive, since Conv3d follows the convention (N, C, D, H, W).
So torch.Size([1, 1, 256, 256, 51]) (ordinarily 51 would be the depth) doesn't follow the convention (N, C, D, H, W), but should I not permute the axes since the data uses an entirely different coordinate system?
In PyTorch's 3D convolution layer, the naming of the three dimensions you convolve over is not really important (i.e. the layer has no special treatment for depth compared to height). All the difference comes from the kernel_size argument (and from padding, if you use that). If you permute the dimensions and correspondingly permute the kernel_size entries, nothing really changes. So you can either permute your input's dimensions, e.g. with x.permute(0, 1, 4, 2, 3), or keep using your initial tensor with depth as the last dimension.
Just to clarify: if you wanted to use kernel_size=(2, 10, 10) on your DxHxW image, you can instead use kernel_size=(10, 10, 2) on your HxWxD image. If you want all your code to explicitly assume that the dimension order is always D, H, W, then you can create a tensor with permuted dimensions using x.permute(0, 1, 4, 2, 3).
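A quick shape check of the two equivalent options (random weights and data, only the output sizes are meant to match):

import torch
import torch.nn as nn

x = torch.randn(1, 1, 256, 256, 51)      # (N, C, H, W, D) as loaded from the NIfTI file

# Option 1: keep the tensor as-is and order kernel_size as (H, W, D)
conv_hwd = nn.Conv3d(1, 8, kernel_size=(10, 10, 2))
print(conv_hwd(x).shape)                 # torch.Size([1, 8, 247, 247, 50])

# Option 2: permute to the documented (N, C, D, H, W) layout and use a (D, H, W) kernel
x_dhw = x.permute(0, 1, 4, 2, 3)         # (1, 1, 51, 256, 256)
conv_dhw = nn.Conv3d(1, 8, kernel_size=(2, 10, 10))
print(conv_dhw(x_dhw).shape)             # torch.Size([1, 8, 50, 247, 247])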
Let me know if I somehow misunderstand the problem you have.

Using CNN with Dataset that has different depths between volumes

I am working with medical images, where I have 130 patient volumes, each consisting of N DICOM images/slices.
The problem is that the number of slices N varies between volumes.
The majority (50%) of the volumes have 20 slices; the rest vary by 3 or 4 slices, some by even more than 10 slices (so much so that interpolating to equalize the number of slices between volumes is not possible).
I am able to use Conv3d for volumes where the depth N (number of slices) is the same, but I have to make use of the entire dataset for the classification task. So how do I incorporate the entire dataset and feed it to my network model?
If I understand your question, you have 130 3-dimensional images, which you need to feed into a 3D ConvNet. I'll assume your batches, if N was the same for all of your data, would be tensors of shape (batch_size, channels, N, H, W), and your problem is that your N varies between different data samples.
So there are two problems. First, your model needs to handle data with different values of N. Second, there's the more implementation-related problem of batching data of different lengths.
Both problems come up in video classification models. For the first, I don't think there's a way of getting around having to interpolate SOMEWHERE in your model (unless you're willing to pad/cut/sample) -- if you're doing any kind of classification task, you pretty much need a constant-sized layer at your classification head. However, the interpolation doesn't have to happen right at the beginning. For example, if for an input tensor of size (batch, 3, 20, 256, 256) your network conv-pools down to (batch, 1024, 4, 1, 1), then you can perform an adaptive pool (e.g. https://pytorch.org/docs/stable/nn.html#torch.nn.AdaptiveAvgPool3d) right before the output to downsample everything larger to that size before prediction.
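A minimal sketch of that idea (the shapes and the 3-class head are illustrative, not taken from your model):

import torch
import torch.nn as nn

features = torch.randn(2, 1024, 7, 4, 4)        # (batch, channels, D', H', W'); D' depends on N
pool = nn.AdaptiveAvgPool3d(1)                  # output is always (batch, 1024, 1, 1, 1)
head = nn.Linear(1024, 3)                       # e.g. a 3-class classification head

logits = head(pool(features).flatten(start_dim=1))
print(logits.shape)                             # torch.Size([2, 3])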
The other option is padding and/or truncating and/or resampling the images so that all of your data is the same length. For videos, sometimes people pad by looping the frames, or you could pad with zeros. What's valid depends on whether your length axis represents time, or something else.
For the second problem, batching: If you're familiar with pytorch's dataloader/dataset pipeline, you'll need to write a custom collate_fn which takes a list of outputs of your dataset object and stacks them together into a batch tensor. In this function, you can decide whether to pad or truncate or whatever, so that you end up with a tensor of the correct shape. Different batches can then have different values of N. A simple example of implementing this pipeline is here: https://github.com/yunjey/pytorch-tutorial/blob/master/tutorials/03-advanced/image_captioning/data_loader.py
Something else that might help with batching is putting your data into buckets depending on their N dimension. That way, you might be able to avoid lots of unnecessary padding.
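A minimal sketch of such a collate_fn, assuming the dataset returns (volume, label) pairs where each volume is a (C, N, H, W) tensor (the names here are made up):

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

def collate_volumes(batch):
    # Zero-pad each (C, N, H, W) volume along N up to the longest N in this batch.
    volumes, labels = zip(*batch)
    max_n = max(v.shape[1] for v in volumes)
    # F.pad lists pads for the last dims first: (W_left, W_right, H_left, H_right, N_left, N_right)
    padded = [F.pad(v, (0, 0, 0, 0, 0, max_n - v.shape[1])) for v in volumes]
    return torch.stack(padded), torch.tensor(labels)

# loader = DataLoader(volume_dataset, batch_size=4, shuffle=True, collate_fn=collate_volumes)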
You'll need to flatten the dataset and treat every individual slice as an input to the CNN. You can set each variable as a boolean Yes/No flag if it is categorical, or, if it is numerical, set the input to the equivalent of none (usually 0).

Padding in Conv2D gives wrong result?

I'm using the Conv2D method of Keras. In the documentation it is written that
padding: one of "valid" or "same" (case-insensitive). Note that "same" is slightly inconsistent across backends with strides != 1, as described here
As input I have images of size (64, 80, 1) and I'm using a kernel of size 3x3. Does that mean that the padding is wrong when using Conv2D(32, 3, strides=2, padding='same')(input)?
How can I fix it using ZeroPadding2D?
Based on your comment and seeing that you defined a stride of 2, I believe what you want to achieve is an output size that's exactly half of the input size, i.e. output_shape == (32, 40, 32) (the trailing 32 is the number of feature channels).
In that case, just call model.summary() on the final model and you will see if that is the case or not.
If it is, there's nothing else to do.
If it's bigger than you want, you can add a Cropping2D layer to cut off pixels from the borders of the image.
If it's smaller than you want, you can add a ZeroPadding2D layer to add zero-pixels to the borders of the image.
The syntax to create these layers is
Cropping2D(cropping=((a, b), (c, d)))
ZeroPadding2D(padding=((a, b), (c, d)))
a: number of rows you want to add/cut off to/from the top
b: number of rows you want to add/cut off to/from the bottom
c: number of columns you want to add/cut off to/from the left
d: number of columns you want to add/cut off to/from the right
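For example, a minimal sketch of the whole check (assuming tf.keras; the crop/pad amounts are hypothetical and only needed if model.summary() shows an unexpected size):

from tensorflow.keras.layers import Input, Conv2D, Cropping2D, ZeroPadding2D
from tensorflow.keras.models import Model

inp = Input(shape=(64, 80, 1))
x = Conv2D(32, 3, strides=2, padding='same')(inp)    # 'same' with stride 2 -> (32, 40, 32) here
# x = Cropping2D(cropping=((0, 1), (0, 0)))(x)       # cut one row off the bottom, if needed
# x = ZeroPadding2D(padding=((0, 1), (0, 0)))(x)     # or add one zero row at the bottom, if needed
model = Model(inp, x)
model.summary()                                      # check the actual output shapes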
Note, however, that there is no strict technical need to always exactly halve the size with each convolution layer. Your model might work well without any padding or cropping. You will have to experiment with it in order to find out.

What should be the batch for Keras LSTM CNN to process image sequence

I want to use an image sequence to predict 1 output.
training data:
[(x_img1, y1), (x_img2, y2), ..., (x_img10, y10)]
Color image dimension:
(100, 120, 3)
Output dimension: (1)
Model implemented in Keras:
img_sequence_length = 3
model = Sequential()
model.add(TimeDistributed(Convolution2D(24, 5, 5, subsample=(2, 2), border_mode="same", activation='relu', name='conv1'),
                          input_shape=(img_sequence_length, 100, 120, 3)))
….
model.add(LSTM(64, return_sequences=True, name='lstm_1'))
model.add(LSTM(10, return_sequences=False, name='lstm_2'))
model.add(Dense(256))
model.add(Dense(1, name='output'))
The batch should be:
A)
[ [(x_img1, y1), (x_img2, y2), (x_img3, y3)],
[(x_img2, y2), (x_img3, y3), (x_img4, y4)],
…
]
Or
B)
[ [(x_img1, y1), (x_img2, y2), (x_img3, y3)],
[(x_img4, y4), (x_img5, y5), (x_img6, y6)],
…
]
Why?
This choice really depends on what you want to achieve. Understanding what your data is totally influences the decision. (Not only the shape and type of the data, but what it means and what you want from it. Is it a video? Many videos? Do I want the name of the character in a little segment of a video? Or to know the state of the plot continuously along the video?)
In option A:
This option is used when all your images form a single long sequence and you want to predict the next element in the sequence by knowing a specific number of previous images.
Each group of 3 images in that batch is completely independent.
The layer doesn't keep a memory between them, and the actual memory is length 3.
The simulation of a long sequence happens because you are repeating images in each batch, like a sliding window. But there is no connection or memory transfer from one group to another.
You use this if the sequence has any logical possibility of being predicted from 3 images.
Imagine you have one long video, but you watch only 3 seconds of it and try to deduce something from those 3 seconds. Then your memory is completely washed away before you watch another 3 seconds. When you watch these 3 new seconds, you will not be able to remember what you watched before, and you will not be able to say you have watched 4 seconds. Everything you learn will be confined to 3-second segments.
In option B:
In this option, each group of 3 images has absolutely no connection at all to the others. You can use this as if every group of 3 images were a different sequence (not belonging to a long sequence).
Imagine you have lots of videos, and they are about different things. One is Titanic, another is The Avengers, and so on.
This batching may also be used for a case similar to the one proposed in A, but your sliding window would have a step of 3. That would make training faster, but the model would also learn less.
Other options:
You can take a look at this question, its answer and the comments to have more ideas.
Some hints on splitting the data:
First, input and output data must be separate:
X = [item[0] for item in training_data]
Y = [item[1] for item in training_data]
Then you must separate the sequences properly.
As you defined in the input_shape, X must follow the same shape.
X.shape must be (numberOfSequences, img_sequence_length, 100, 120, 3)
So if it's a list of images, you must make sure that every image is a numpy array (transform them if necessary), and that you will later convert X to numpy:
X = np.asarray(X_with_numpy_images)
And if you have only one Y for each sequence, you may have it shaped as:
Y.shape must be (numberOfSequences,1)
You would probably build it by taking values in steps of 3:
Y = [Y[(i+1)*3 - 1] for i in range(numberOfSequences)]
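A hedged sketch of how the two batch options from the question could be built (the dummy training_data below stands in for the real list of (x_img, y) pairs, with each image already a numpy array):

import numpy as np

img_sequence_length = 3
training_data = [(np.zeros((100, 120, 3)), float(i)) for i in range(10)]   # stand-in for the real data
X_all = [item[0] for item in training_data]
Y_all = [item[1] for item in training_data]

# Option A: overlapping sliding window (step 1)
starts_a = range(len(X_all) - img_sequence_length + 1)
X_a = np.asarray([X_all[i:i + img_sequence_length] for i in starts_a])
Y_a = np.asarray([Y_all[i + img_sequence_length - 1] for i in starts_a])

# Option B: non-overlapping groups (step 3)
starts_b = range(0, len(X_all) - img_sequence_length + 1, img_sequence_length)
X_b = np.asarray([X_all[i:i + img_sequence_length] for i in starts_b])
Y_b = np.asarray([Y_all[i + img_sequence_length - 1] for i in starts_b])

print(X_a.shape, Y_a.shape)   # (8, 3, 100, 120, 3) (8,)
print(X_b.shape, Y_b.shape)   # (3, 3, 100, 120, 3) (3,)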
Now it's important to understand if each sequence of 3 images is independent of the other sequences, or if you have just one huge sequence divided in small parts.
In case one, use LSTM(..., stateful=False); in case two, use LSTM(..., stateful=True).
And you will also probably need to reshape the tensors properly in the transition from the convolutional layers to the LSTM layers, because LSTM will require inputs shaped as (NumberOfSequences, SequenceLength, Features)
A suggestion is to use reshape layers:
model.add(Reshape((img_sequence_length, 100*120*3)))
# of course the flattened size will differ depending on the strides, padding, number of filters, and any pooling in the convolutional layers
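A hedged alternative to computing the flattened size by hand: TimeDistributed(Flatten()) flattens each time step's feature map automatically. The sketch below uses the newer Conv2D signature (the question's Convolution2D(24, 5, 5, subsample=(2, 2), border_mode="same") corresponds to these arguments), so treat it as an illustration rather than a drop-in for the original code:

from keras.models import Sequential
from keras.layers import TimeDistributed, Conv2D, Flatten, LSTM, Dense

model = Sequential()
model.add(TimeDistributed(Conv2D(24, 5, strides=(2, 2), padding='same', activation='relu'),
                          input_shape=(3, 100, 120, 3)))   # 3-image sequences of 100x120x3 frames
model.add(TimeDistributed(Flatten()))    # each step becomes a flat vector; no manual size arithmetic
model.add(LSTM(64, return_sequences=True))
model.add(LSTM(10, return_sequences=False))
model.add(Dense(256))
model.add(Dense(1))
print(model.output_shape)                # (None, 1)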
