I don't understand the patch size and the input size of the fully connected layers. Why does the first fully connected layer have an input with 3 dimensions?
Thanks
Patch size for fully connected layers
It's simply the size of the weight matrix of each fully connected layer. For example, the first fully connected layer takes a 5x5x2048 = 51200-sized input (the flattened feature map) and produces a 1024-sized output. Therefore it has a 51200 x 1024 weight matrix.
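As a concrete sketch in PyTorch (the shapes are the point here, not the rest of the model): flattening a 5x5x2048 feature map gives a 51200-dimensional vector, and the first fully connected layer maps it to 1024 values. Note that PyTorch stores the weight matrix as its transpose, (1024, 51200).

import torch
import torch.nn as nn

features = torch.randn(1, 2048, 5, 5)   # hypothetical conv output: (batch, channels, height, width)
flat = features.flatten(start_dim=1)    # (1, 51200), since 5 * 5 * 2048 = 51200
fc1 = nn.Linear(51200, 1024)            # the 51200 x 1024 weight matrix (stored transposed)
out = fc1(flat)                         # (1, 1024)
print(fc1.weight.shape)                 # torch.Size([1024, 51200])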
Input size
It's an image of size 224x224 with 3 channels. Three channels are simply the RGB channels.
Related
I understand that in order to create a color image, the three-channel information of the input data must be maintained inside the network. However, the data must be flattened to pass through a linear layer. If so, can a GAN consisting of only FC layers generate only black-and-white images?
Your fully connected network can generate whatever you want, even three-channel outputs. However, the question is: does it make sense to do so? Once flattened, your input inherently loses all the spatial and feature consistency that is naturally available when it is represented as an RGB map.
Remember that an RGB image can be thought of as a 3-element feature describing each spatial location of a 2D image. In other words, each of the three channels gives additional information about a given pixel; treating these channels as separate entities is a loss of information.
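As a minimal sketch (the generator below and all its sizes are made up for illustration), a purely fully connected generator can still emit a colour image simply by reshaping its flat output back into 3 channels:

import torch
import torch.nn as nn

class FCGenerator(nn.Module):
    """Toy fully connected generator that outputs a 3-channel 64x64 image."""
    def __init__(self, latent_dim=100, img_shape=(3, 64, 64)):
        super().__init__()
        self.img_shape = img_shape
        out_dim = img_shape[0] * img_shape[1] * img_shape[2]   # 3 * 64 * 64 = 12288
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 512),
            nn.ReLU(),
            nn.Linear(512, out_dim),
            nn.Tanh(),                              # pixel values in [-1, 1]
        )

    def forward(self, z):
        flat = self.net(z)                          # (batch, 12288)
        return flat.view(-1, *self.img_shape)       # reshape to (batch, 3, 64, 64)

z = torch.randn(4, 100)
print(FCGenerator()(z).shape)                       # torch.Size([4, 3, 64, 64])

The reshape is what makes it a colour image again; nothing in the FC layers themselves knows that neighbouring outputs belong to the same pixel, which is exactly the loss of spatial structure described above.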
A 3D CNN works with video, MRI, and scan datasets. If I have to feed a video as input to the proposed 3D CNN network and train its weights, how can I do that? A 3D CNN expects 5-dimensional inputs:
[batch size, channels, depth, height, width]
How can I extract the depth from the videos?
Say I have 10 videos from 10 different classes. The duration of each video is 6 seconds. I extract 2 frames per second, which gives 12 frames for each video.
The size of the RGB videos is 112x112 --> Height = 112, Width = 112, and Channels = 3
If I keep the batch size equal to 2:
1 video --> 6 seconds --> 12 frames (1 sec == 2 frames) [each frame (3, 112, 112)]
10 videos (10 classes) --> 60 seconds --> 120 frames
So the 5 dimensions will be something like this: [2, 3, 12, 112, 112]
2 --> two videos are processed in each batch
3 --> RGB channels
12 --> each video contains 12 frames
112 --> height of each frame
112 --> width of each frame
Am I right?
Yes, that seems to make sense if you're looking to use a 3D CNN. You're essentially adding a dimension to your input, the temporal one, so it is logical to use the depth dimension for it. This way you keep the channel axis as the feature channel (i.e. not a spatio-temporal dimension).
Keep in mind that 3D CNNs are really memory intensive. There exist other methods for working with temporally dependent input. Here you are not really dealing with a third dimension (a 'spatial' dimension, that is), so you're not required to use a 3D CNN.
Edit:
If I give an input of the above dimensions to the 3D CNN, will it learn both kinds of features (spatial and temporal)? [...] Can you help me understand spatial and temporal features?
If you use a 3D CNN then your filters will have a 3D kernel, and the convolution will be three-dimensional: along the two spatial dimensions (width and height) as well as the depth dimension (here corresponding to the temporal dimension, since you're using the depth dimension for the sequence of video frames). A 3D CNN will allow you to capture local spatial and temporal information ('local' because the receptive field is limited by the sizes of the kernels and the overall number of layers in the CNN).
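A minimal PyTorch sketch of this (kernel size and channel counts are arbitrary): the 3D kernel slides over depth, height and width at once, so each output activation mixes information from neighbouring frames as well as neighbouring pixels.

import torch
import torch.nn as nn

clips = torch.randn(2, 3, 12, 112, 112)   # (batch, channels, depth, height, width)

# a 3x3x3 kernel convolves over (depth, height, width) simultaneously
conv3d = nn.Conv3d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
print(conv3d(clips).shape)                # torch.Size([2, 16, 12, 112, 112])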
I used Python 3, and when I insert a random crop transform of size 224 it gives a mismatch error.
Here is my code.
What did I do wrong?
Your code makes variations on resnet: you changed the number of channels, the number of bottlenecks at each "level", and you removed a "level" entirely. As a result, the dimension of the feature map you have at the end of layer3 is not 64: you have a larger spatial dimension than the nn.AvgPool2d(8) anticipates. The error message you got actually tells you that the output of layer3 is of shape 64x56x56, and after average pooling with kernel and stride 8 you have a 64x7x7 = 3136-dimensional feature vector, instead of the 64 you are expecting.
What can you do?
As opposed to "standard" resnet, you removed the stride from conv1 and you do not have a max pool after conv1. Moreover, you removed layer4, which also has a stride. Therefore, you can add pooling to your net to reduce the spatial dimensions of layer3.
Alternatively, you can replace nn.AvgPool2d(8) with nn.AdaptiveAvgPool2d((1, 1)), an average pool that outputs only one value per channel regardless of the spatial dimensions of the input feature map.
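As a sketch (not your exact code), here is the difference between the two pooling choices on a 64x56x56 feature map:

import torch
import torch.nn as nn

feature_map = torch.randn(1, 64, 56, 56)             # output of your layer3

fixed_pool = nn.AvgPool2d(8)                         # leaves a 64 x 7 x 7 map
print(fixed_pool(feature_map).flatten(1).shape)      # torch.Size([1, 3136])

adaptive_pool = nn.AdaptiveAvgPool2d((1, 1))         # always collapses to 64 x 1 x 1
print(adaptive_pool(feature_map).flatten(1).shape)   # torch.Size([1, 64])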
I am working with medical images, where I have 130 patient volumes, each volume consisting of N DICOM images/slices.
The problem is that the number of slices N varies between volumes.
The majority (about 50%) of the volumes have 20 slices; the rest vary by 3 or 4 slices, and some by even more than 10 slices (so much so that interpolating to make the number of slices equal between volumes is not possible).
I am able to use Conv3d for volumes where the depth N (number of slices) is the same between volumes, but I have to make use of the entire dataset for the classification task. So how do I incorporate the entire dataset and feed it to my network model?
If I understand your question, you have 130 3-dimensional images, which you need to feed into a 3D ConvNet. I'll assume your batches, if N was the same for all of your data, would be tensors of shape (batch_size, channels, N, H, W), and your problem is that your N varies between different data samples.
So there are two problems. First, there's the problem of your model needing to handle data with different values of N. Second, there's the more implementation-related problem of batching data of different lengths.
Both problems come up in video classification models. For the first, I don't think there's a way of getting around having to interpolate SOMEWHERE in your model (unless you're willing to pad/cut/sample) -- if you're doing any kind of classification task, you pretty much need a constant-sized layer at your classification head. However, the interpolation doesn't have to happen right at the beginning. For example, if for an input tensor of size (batch, 3, 20, 256, 256) your network conv-pools down to (batch, 1024, 4, 1, 1), then you can perform an adaptive pool (e.g. https://pytorch.org/docs/stable/nn.html#torch.nn.AdaptiveAvgPool3d) right before the output to downsample everything larger to that size before prediction.
The other option is padding and/or truncating and/or resampling the images so that all of your data is the same length. For videos, sometimes people pad by looping the frames, or you could pad with zeros. What's valid depends on whether your length axis represents time, or something else.
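Here is a small sketch of the adaptive-pool idea (the intermediate shapes are hypothetical): whatever depth the convolutional part produces, the adaptive pool squashes it to a fixed size before the classifier.

import torch
import torch.nn as nn

feat_a = torch.randn(1, 1024, 4, 8, 8)    # hypothetical features from a 20-slice volume
feat_b = torch.randn(1, 1024, 6, 8, 8)    # hypothetical features from a 28-slice volume

pool = nn.AdaptiveAvgPool3d((1, 1, 1))    # output is (batch, 1024, 1, 1, 1) for any input depth
classifier = nn.Linear(1024, 2)

for feat in (feat_a, feat_b):
    pooled = pool(feat).flatten(1)        # (1, 1024) regardless of N
    print(classifier(pooled).shape)       # torch.Size([1, 2])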
For the second problem, batching: If you're familiar with pytorch's dataloader/dataset pipeline, you'll need to write a custom collate_fn which takes a list of outputs of your dataset object and stacks them together into a batch tensor. In this function, you can decide whether to pad or truncate or whatever, so that you end up with a tensor of the correct shape. Different batches can then have different values of N. A simple example of implementing this pipeline is here: https://github.com/yunjey/pytorch-tutorial/blob/master/tutorials/03-advanced/image_captioning/data_loader.py
Something else that might help with batching is putting your data into buckets depending on their N dimension. That way, you might be able to avoid lots of unnecessary padding.
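A rough sketch of such a collate_fn (assuming your dataset returns (volume, label) pairs where each volume has shape (channels, N, H, W)), zero-padding every volume up to the deepest one in the batch:

import torch

def pad_collate(batch):
    """Pad each volume along the slice axis to the deepest volume in the batch."""
    volumes, labels = zip(*batch)                      # each volume: (C, N, H, W)
    max_depth = max(v.shape[1] for v in volumes)

    padded = []
    for v in volumes:
        pad_n = max_depth - v.shape[1]
        if pad_n > 0:
            zeros = torch.zeros(v.shape[0], pad_n, *v.shape[2:])
            v = torch.cat([v, zeros], dim=1)           # zero-pad the missing slices
        padded.append(v)

    return torch.stack(padded), torch.tensor(labels)   # (B, C, max_N, H, W), (B,)

# loader = torch.utils.data.DataLoader(dataset, batch_size=4, collate_fn=pad_collate)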
Alternatively, you can flatten the dataset and treat every individual slice as an input to the CNN. You can set each variable as a boolean Yes/No flag if it is categorical, or, if it is numerical, set the input to the equivalent of none (usually 0).
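If you go the slice-level route, a rough sketch (the dataset class here is made up) could look like this: each slice becomes an independent 2D sample that inherits its volume's label, so volumes with different N simply contribute different numbers of samples.

import torch
from torch.utils.data import Dataset

class SliceDataset(Dataset):
    """Flatten a list of (volume, label) pairs into per-slice 2D samples."""
    def __init__(self, volumes, labels):
        # volumes: list of tensors with shape (N_i, H, W); N_i may differ per volume
        self.samples = [
            (vol[i], label)
            for vol, label in zip(volumes, labels)
            for i in range(vol.shape[0])
        ]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        slice_2d, label = self.samples[idx]
        return slice_2d.unsqueeze(0), label            # add a channel axis: (1, H, W)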
I performed data augmentation for images with two channels. My dataset is formatted in the shape (image_numbers, image_height, image_width, image_channels), where image_channels = 2.
When performing data augmentation using datagen (created by ImageDataGenerator), a UserWarning message is generated:
UserWarning: NumpyArrayIterator is set to use the data format convention
"channels_last" (channels on axis 3),
i.e. expected either 1, 3 or 4 channels on axis 3.
However, it was passed an array with shape (1, 150, 150, 2) (2 channels).
Does the warning imply that the data augmentation was unsuccessful? Was it only performed for one-channel images? If so, how do I perform data augmentation for two-channel images (rather than augmenting one channel at a time and then concatenating)?
It means they don't expect two-channel images; that's non-standard.
The standard images are:
1 channel: grayscale
3 channels: RGB
4 channels: RGBA
Since it's a warning, we don't really know what's going on.
Check the outputs of this generator yourself.
x, y = theGenerator[someIndex]
Plot x[0] and others.
If the generated images aren't good, you can do the augmentations yourself using a Python generator or a custom keras.utils.Sequence.
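For instance, a minimal keras.utils.Sequence sketch (the random horizontal flip is just a placeholder for whatever augmentations you actually need); because you write the augmentation yourself, both channels are transformed together:

import numpy as np
from keras.utils import Sequence

class TwoChannelAugmenter(Sequence):
    """Yields batches of augmented two-channel images."""
    def __init__(self, images, labels, batch_size=32):
        self.images, self.labels = images, labels      # images: (N, H, W, 2) numpy array
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.images) / self.batch_size))

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        x, y = self.images[sl].copy(), self.labels[sl]
        flip = np.random.rand(len(x)) < 0.5            # randomly flip about half the batch
        x[flip] = x[flip, :, ::-1, :]                  # horizontal flip applied to both channels
        return x, y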