A 3D CNN works with video, MRI, and scan datasets. If I want to feed a video as input to the proposed 3D CNN network and train its weights, how can I do that? A 3D CNN expects a 5-dimensional input:
[batch size, channels, depth, height, width]
How can I extract depth from the videos?
Suppose I have 10 videos from 10 different classes. Each video is 6 seconds long, and I extract 2 frames per second, which gives 12 frames per video.
The RGB videos are 112x112 --> Height = 112, Width = 112, Channels = 3
If I keep the batch size equal to 2:
1 video --> 6 seconds --> 12 frames (1 sec == 2 frames) [each frame (3, 112, 112)]
10 videos (10 classes) --> 60 seconds --> 120 frames
So the 5 dimensions will be something like this: [2, 3, 12, 112, 112]
2 --> two videos are processed in each batch
3 --> RGB channels
12 --> each video contains 12 frames
112 --> height of each frame
112 --> width of each frame
Am I right?
Yes, that makes sense if you're looking to use a 3D CNN. You're essentially adding a dimension to your input, the temporal one, and it is logical to use the depth dimension for it. This way you keep the channel axis as the feature channel (i.e. not a spatio-temporal dimension).
Keep in mind that 3D CNNs are very memory intensive. There are other methods for working with temporally dependent input. Here you are not really dealing with a third spatial dimension, so you are not required to use a 3D CNN.
Edit:
If I give the input of the above dimension to the 3d CNN, will it learn both features (spatial and temporal)? [...] Can you make me understand, spatial and temporal features?
If you use a 3D CNN then your filters will have a 3D kernel, and the convolution will be three-dimensional: along the two spatial dimensions (width and height) as well as the depth dimension (here corresponding to the temporal dimension, since you're using the depth dimension for the sequence of video frames). A 3D CNN will therefore allow you to capture local spatial and temporal information ('local' because the receptive field is limited by the kernel sizes and the overall number of layers in the CNN).
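As a minimal sketch (assuming PyTorch and a toy single-layer model, not the exact architecture you have in mind), you can verify that nn.Conv3d accepts an input of shape [2, 3, 12, 112, 112] and convolves over depth, height, and width:

```python
import torch
import torch.nn as nn

# Dummy batch: 2 videos, 3 RGB channels, 12 frames, 112x112 pixels
x = torch.randn(2, 3, 12, 112, 112)

# A single 3D convolution with a 3x3x3 kernel (depth x height x width)
conv3d = nn.Conv3d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

out = conv3d(x)
print(out.shape)  # torch.Size([2, 16, 12, 112, 112])
```

Each 3x3x3 kernel slides over two neighbouring frames on either side as well as the 3x3 spatial neighbourhood, which is what lets the filters pick up motion (temporal) patterns in addition to spatial ones.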
I didn't understand the patch size and the input size of the fully connected layers. Why does the first fully connected layer have an input with 3 dimensions?
Thanks
Patch size for fully connected layers
It's simply the size of the weight matrix of each fully connected layer. For example, the first fully connected layer takes a 5x5x2048 = 51200 sized input and produces a 1024-sized output, so it has a 51200 x 1024 weight matrix.
Input size
It's an image of size 224x224 with 3 channels. The three channels are simply the RGB channels.
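As a quick sketch of the arithmetic (PyTorch-style, with illustrative variable names rather than the ones from the paper), the 3-dimensional 5x5x2048 feature map is flattened into a single vector before entering the fully connected layer:

```python
import torch
import torch.nn as nn

# Feature map coming out of the convolutional part: 2048 channels, 5x5 spatial
feat = torch.randn(1, 2048, 5, 5)

flat = feat.flatten(start_dim=1)       # shape (1, 51200)
fc1 = nn.Linear(5 * 5 * 2048, 1024)    # 51200 inputs, 1024 outputs
out = fc1(flat)
print(flat.shape, out.shape)           # torch.Size([1, 51200]) torch.Size([1, 1024])
```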
I used Python 3, and when I insert a random crop transform of size 224 it gives a mismatch error.
Here is my code.
What did I do wrong?
Your code makes variations on ResNet: you changed the number of channels, the number of bottlenecks at each "level", and you removed a "level" entirely. As a result, the feature map at the end of layer3 is not the 64-dimensional vector you expect: its spatial dimensions are larger than anticipated by nn.AvgPool2d(8). The error message actually tells you that the output of layer3 has shape 64x56x56, and after average pooling with kernel size and stride 8 you get a 64x7x7 = 3136-dimensional feature vector, instead of only 64.
What can you do?
As opposed to "standard" ResNet, you removed the stride from conv1 and you do not have a max pool after conv1. Moreover, you removed layer4, which also has a stride. Therefore, you can add pooling to your net to reduce the spatial dimensions of layer3.
Alternatively, you can replace nn.AvgPool2d(8) with nn.AdaptiveAvgPool2d([1, 1]), an average pool that outputs a single value per channel regardless of the spatial dimensions of the input feature map.
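A minimal sketch of the second option (the tensor shape is the one reported by your error message; the variable names are illustrative):

```python
import torch
import torch.nn as nn

# Output of layer3 as reported by the error message: 64 channels, 56x56 spatial
feat = torch.randn(2, 64, 56, 56)

# Original pooling: 64 x 7 x 7 = 3136 features after flattening
fixed_pool = nn.AvgPool2d(8)
print(fixed_pool(feat).flatten(1).shape)      # torch.Size([2, 3136])

# Adaptive pooling: always 64 x 1 x 1 = 64 features, matching a final Linear(64, ...)
adaptive_pool = nn.AdaptiveAvgPool2d((1, 1))
print(adaptive_pool(feat).flatten(1).shape)   # torch.Size([2, 64])
```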
I'm trying to develop a fully-convolutional neural net to estimate the 2D locations of keypoints in images that contain renders of known 3D models. I've read plenty of literature on this subject (human pose estimation, model based estimation, graph networks for occluded objects with known structure) but no method I've seen thus far allows for estimating an arbitrary number of keypoints of different classes in an image. Every method I've seen is trained to output k heatmaps for k keypoint classes, with one keypoint per heatmap. In my case, I'd like to regress k heatmaps for k keypoint classes, with an arbitrary number of (non-overlapping) points per heatmap.
In this toy example, the network would output heatmaps around each visible location of an upper vertex for each shape. The cubes have 4 vertices on top, the extruded pentagons have 2, and the pyramids just have 1. Sometimes points are offscreen or occluded, and I don't wish to output heatmaps for occluded points.
The architecture is a 6-6 layer Unet (as in this paper https://arxiv.org/pdf/1804.09534.pdf). The ground truth heatmaps are normal distributions centered around each keypoint. When training the network with a batch size of 5 and L2 loss, the network learns to never make an estimate whatsoever, just outputting blank images. Datatypes are converted properly, with inputs normalized to 0-1 and outputs scaled to 0-255. I'm not sure how to solve this. Are there any red flags with my general approach? I'll post code if there's no clear problem in general...
I performed data augmentation for images with two channels. My dataset is formatted in the shape (image_numbers, image_height, image_width, image_channels), where image_channels = 2.
When performing data augmentation using datagen (created by ImageDataGenerator), a UserWarning is generated:
UserWarning: NumpyArrayIterator is set to use the data format convention
"channels_last" (channels on axis 3),
i.e. expected either 1, 3 or 4 channels on axis 3.
However, it was passed an array with shape (1, 150, 150, 2) (2 channels).
Does the warning imply that the data augmentation was unsuccessful? Was it only performed for one-channel images? If so, how do I perform data augmentation for two-channel images (rather than augmenting one channel at a time and then concatenating)?
It means they don't expect two-channel images; that's non-standard.
The standard images are:
1 channel: grayscale
3 channels: RGB
4 channels: RGBA
Since it's a warning, we don't really know what's going on.
Check the outputs of this generator yourself.
x, y = theGenerator[someIndex]
Plot x[0] and others.
In case the generated images aren't good, you can do the augmentations yourself using a python generator or a custom keras.utils.Sequence.
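As a rough sketch of the keras.utils.Sequence route (the augmentation here is just a random horizontal flip applied to both channels together; the class name, batch size, and transforms are placeholders for whatever you actually need):

```python
import numpy as np
from tensorflow import keras

class TwoChannelAugmenter(keras.utils.Sequence):
    """Yields batches of two-channel images with a simple random flip."""

    def __init__(self, images, labels, batch_size=32):
        self.images = images          # shape (N, 150, 150, 2)
        self.labels = labels
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.images) / self.batch_size))

    def __getitem__(self, idx):
        x = self.images[idx * self.batch_size:(idx + 1) * self.batch_size].copy()
        y = self.labels[idx * self.batch_size:(idx + 1) * self.batch_size]
        # Randomly flip some images horizontally; both channels stay aligned
        flip = np.random.rand(len(x)) < 0.5
        x[flip] = x[flip, :, ::-1, :]
        return x, y
```

Because you write the transform yourself, both channels of an image are always augmented with the same random parameters, which is exactly what the built-in generator can't guarantee for a two-channel array.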
This is probably a very silly question, but I couldn't find details anywhere.
I have an audio recording (a wav file) that is 3 seconds long. That is my sample, and it needs to be classified as [class_A] or [class_B].
Following a tutorial on MFCC, I divided the sample into frames (291 frames to be exact) and computed MFCCs for each frame.
Now I have 291 feature vectors, each of length 13.
My question is: how exactly do you use those vectors with a classifier (k-NN for example)? I have 291 vectors that represent 1 sample. I know how to work with 1 vector per sample, but I don't know what to do when I have 291 of them. I couldn't really find an explanation anywhere.
Each of your vectors represents the spectral characteristics of your audio file as it varies in time. Depending on the length of your frames, you might want to group some of them (for example by averaging per dimension) to match the time resolution you want the classifier to work with. As an example, think of a particular sound whose envelope has an attack time of 2 ms: that may be as fine-grained as you want to get with your time quantization, so you could a) group and average the MFCC vectors that cover 2 ms, or b) recompute the MFCCs with the desired time resolution.
If you really want to keep the resolution that fine, you can concatenate the 291 vectors and treat the result as a single vector (of 291 x 13 dimensions), which will probably need a huge dataset to train on.
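A small sketch of both options with NumPy and scikit-learn (the MFCC array and labels here are random placeholders, assuming 291 frames x 13 coefficients per recording):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Placeholder: MFCCs for 20 labelled recordings, 291 frames x 13 coefficients each
mfccs = np.random.randn(20, 291, 13)
labels = np.random.randint(0, 2, size=20)    # class_A = 0, class_B = 1

# Option a) summarize the frames, e.g. average over time -> one 13-dim vector per sample
x_mean = mfccs.mean(axis=1)                  # shape (20, 13)

# Option b) concatenate all frames -> one 291 * 13 = 3783-dim vector per sample
x_concat = mfccs.reshape(len(mfccs), -1)     # shape (20, 3783)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(x_mean, labels)                      # or knn.fit(x_concat, labels)
print(knn.predict(x_mean[:2]))
```

Either way, the classifier sees exactly one feature vector per recording, which is the form k-NN expects.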