I am new to CNNs and am building a model using Keras to combine inputs from multiple sources. Two of my sources have different dimensions, and their sizes are not related by an integer scale factor (e.g., one is not exactly 2x or 3x smaller than the other). Therefore, simple max-pooling will not work. I am having trouble figuring out how to downsample the larger image. Here are the exact dimensions:
Image1: 7000 x 4000
Image2: 2607 x 1370
Is there a best practice for dealing with non-conventional downsampling?
I am applying a Conv2D layer and am thinking that combining an appropriately sized filter (1787x1261 with stride=1) with max pooling (2x2 with stride=2) would give me the correct dimensions. Is there any reason why that is a bad idea? This does seem like a large filter compared to the total size of the image.
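The arithmetic behind that proposal can be checked with a short sketch. It assumes no padding (Keras "valid" mode), which the question does not state explicitly; the helper names are made up for illustration.

```python
def window_output_size(size, kernel, stride=1, padding=0):
    """Output length along one axis for a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_then_pool(size, kernel, pool=2):
    """Stride-1 convolution followed by non-overlapping pooling (assumed unpadded)."""
    after_conv = window_output_size(size, kernel, stride=1)
    return window_output_size(after_conv, pool, stride=pool)


width = conv_then_pool(7000, 1787)
height = conv_then_pool(4000, 1261)

# Weights in ONE such filter, per input channel (bias not counted).
weights_per_filter = 1787 * 1261

print(width, height, weights_per_filter)
```

The sizes do come out to 2607 x 1370, but each filter then carries over two million weights per input channel, which is the cost the question is worried about.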
Somewhat related, would it be better to run the model on smaller chunks of the full image? That way I could control the size of each chunk?
I am trying to implement an image classification task on grayscale images that were converted from sensor readings. That is, I initially had time-series data (e.g., acceleration or displacement), which I then transformed into images. Before the transformation, I normalized the data. Each image is 1000x9, where 1000 is the number of time steps and 9 is the number of data points. The split is 70%, 15%, and 15% for the training, validation, and test sets. There are 10 different labels with 100 images each; it is a multi-class classification task.
An example of my array before image conversion is:
As you can see above, the values are very precision-sensitive. When I convert them into images, I can see the dark and light parts of the image:
Imagine that I have directories D1 to D9 (damaged cases) and UN (healthy case), each containing many images like this.
Then I have a CNN whose goal is to classify them. But there is a significant overfitting issue, and whatever I do, it's not working out. One of the architectures I've been working on:
Model summary;
I also augment the data. After 250 epochs, this is what I get;
So, what I wonder is this: I tried applying regularization and augmentation, but they did not give me solid results. I also experimented with the number of hidden units, layers, etc. Do you think I need to change my architecture completely? It basically consists of two CNN blocks with FC layers at the end. This is not the first time I've worked on images like this, but I cannot mitigate the overfitting. I would appreciate any solid suggestions for getting smoother results. I was thinking of using a pre-trained model for transfer learning, but the image dimensions cause some problems. Do you know if I can use any of those pre-trained models with a 1000x9 image? I know there are some overfitting topics in the forum, but since these images come from numerical arrays and I could not make it work, I wanted to create a new topic. Thank you!
I am testing some well known models for computer vision: UNet, FC-DenseNet103, this implementation
I train them with 224x224 randomly cropped patches and do the same on the validation set.
Now when I run inference on some videos, I pass it the frames directly (1280x640) and it works. It runs the same operations on different image sizes and never gives an error. It actually gives a nice output, but the quality of the output depends on the image size...
Now it's been a long time since I've worked with neural nets but when I was using tensorflow I remember I had to crop the input images to the train crop size.
Why don't I need to do this anymore? What's happening under the hood?
It seems that the models you are using have no linear (fully connected) layers, so the output of the convolutional layers goes straight into the softmax function. Convolutions slide the same kernels over whatever spatial size they receive, and softmax does not require a specific input shape, so the model will run on any image shape. However, its accuracy will probably be far worse on image shapes that differ from the one you trained on.
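One way to see why a network without linear layers accepts any image size is to count parameters. The sketch below is plain Python, not code from any of the models mentioned; the layer sizes (3 input channels, 16 filters, 10 outputs) are made-up values for illustration.

```python
def conv2d_params(in_channels, out_channels, kernel):
    """Weights plus biases of a conv layer: independent of image size."""
    return out_channels * (in_channels * kernel * kernel + 1)


def dense_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return out_features * (in_features + 1)


report = {}
for height, width in [(224, 224), (640, 1280)]:
    conv = conv2d_params(in_channels=3, out_channels=16, kernel=3)
    # A dense layer after flattening a 16-channel feature map of the same
    # spatial size (assumes 'same' padding, so height and width are kept).
    dense = dense_params(height * width * 16, out_features=10)
    report[(height, width)] = (conv, dense)
    print(height, width, conv, dense)
```

The convolution needs the same number of weights for both sizes, while the dense layer's weight matrix would have to change shape, which is why only models with such a layer are tied to one input size.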
There is always a specific input size in the documentation of the model. You should use this size. These are the current model limitations.
For UNets this may even be a ratio. I think it depends on implementation.
Just a note on resize:
transform.Resize((h,w))
transform.Resize(d)
In the case of (h, w), the output size is matched to it exactly.
In the second case, with a single size d, the smaller edge of the image is matched to d.
For example, if height > width, then the image is rescaled to (d * height / width, d).
The idea is to not ruin the aspect ratio of the image.
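The two cases can be written out as a small size calculator. This is a plain-Python sketch of the rule described above, not the library's own code, and resize_output_size is a hypothetical helper name.

```python
def resize_output_size(height, width, size):
    """Return the (height, width) an image would have after resizing.

    size may be a (h, w) tuple, which is used as-is, or a single int d,
    in which case the smaller edge becomes d and the aspect ratio is kept.
    """
    if isinstance(size, tuple):
        return size
    if height > width:
        return int(size * height / width), size
    return size, int(size * width / height)


landscape = resize_output_size(640, 1280, 224)
portrait = resize_output_size(1280, 640, 224)
exact = resize_output_size(640, 1280, (224, 224))
print(landscape, portrait, exact)
```

Only the tuple form can change the aspect ratio; the single-int form always scales both edges by the same factor.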
In a normal ANN, each training sample is represented by a row of a matrix, and in that way batches of training data can be processed. But how are multiple images processed in a CNN?
The same as with an ANN: you can stack the images into an n-dimensional tensor to be processed.
For CNNs that are trained on images, say your dataset consists of RGB (3-channel) images that are 256x256 pixels. A single image can be represented by a 3 x 256 x 256 tensor. If you set your batch size to 10, that means you're stacking 10 images together into a 10 x 3 x 256 x 256 tensor.
Tuning the batch size is one aspect of getting training right. If your batch size is too small, the gradient estimate will vary a lot from batch to batch, and your training loss curve will bounce around a lot. But if it's too large, your GPU will run out of memory to hold it, or training will progress too slowly to see whether the optimization is diverging early on.
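As a rough illustration of the batch layout and its memory cost, here is a plain-Python sketch. It assumes 32-bit floats (4 bytes per value) and counts only the input batch itself; activations and gradients inside the network take additional memory that is not modelled here.

```python
def batch_shape(batch_size, channels, height, width):
    """Shape of a batch of images stacked along a new leading axis."""
    return (batch_size, channels, height, width)


def batch_bytes(shape, bytes_per_value=4):
    """Memory needed to hold one batch, assuming float32 values."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_value


shape = batch_shape(10, 3, 256, 256)
size_bytes = batch_bytes(shape)
print(shape, size_bytes, size_bytes / 2**20, "MiB")
```

Doubling the batch size doubles this figure, which is one reason large batches eventually stop fitting on the GPU.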
I am using Keras for image segmentation on a highly imbalanced dataset, and I want to re-weight the classes in proportion to the number of pixels in each class, as described here. If I have binary classes with weights = [0.8, 0.2], how can I modify K.sparse_categorical_crossentropy(y_true, y_pred) to re-weight the loss according to the class each pixel belongs to?
The input has shape (4, 256, 256, 1) (batch, height, width, channels) and the output is a vector of 0's and 1's (4, 65536, 1) (positive and negative class). The model and data is similar to the one here with the difference being the images are grayscale and the masks are binary (2 classes).
This is the custom loss function I used for my semantic segmentation project. It is modified from the categorical_crossentropy function found in keras/backend/tensorflow_backend.py.
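import tensorflow as tf

def class_weighted_pixelwise_crossentropy(target, output):
    # target: one-hot labels; output: predicted probabilities, classes on the last axis
    output = tf.clip_by_value(output, 10e-8, 1.-10e-8)  # avoid log(0)
    weights = [0.8, 0.2]  # one weight per class, broadcast over the last axis
    return -tf.reduce_sum(target * weights * tf.math.log(output))  # tf.log in TF 1.x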
Note that my final version did not use class weighting - I found that it encouraged the model to use the underrepresented classes as filler for patches of the image that it was unsure about instead of making more realistic guesses, and thereby hurt performance.
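The function above expects one-hot targets, while the question uses integer (sparse) labels. The per-pixel arithmetic for the sparse case can be sketched in plain Python as follows; this is not Keras backend code, and in a real loss the same lookup would be done with tensor operations (for example tf.gather). The function name and the example values are made up for illustration.

```python
import math


def weighted_sparse_crossentropy(labels, probs, class_weights, eps=1e-7):
    """Sum over pixels of -w[y] * log(p[y]).

    labels: one integer class id per pixel.
    probs: one list of predicted class probabilities per pixel.
    class_weights: one weight per class, indexed by class id.
    """
    total = 0.0
    for y, p in zip(labels, probs):
        p_true = min(max(p[y], eps), 1.0 - eps)  # clip to avoid log(0)
        total += -class_weights[y] * math.log(p_true)
    return total


# Two pixels, one of each class, both predicted at 50/50.
loss = weighted_sparse_crossentropy(
    labels=[0, 1],
    probs=[[0.5, 0.5], [0.5, 0.5]],
    class_weights=[0.8, 0.2],
)
print(loss)
```

Each pixel contributes only through the probability assigned to its true class, scaled by that class's weight.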
Jessica's answer is clean and works well. I generally recommend it. But for the sake of variety:
I have found that sampling regions of interest that include a better ratio between the classes is an effective way to quickly learn skewed pixelwise classes.
In my case, I had two classes like you, which makes things easier. I look for areas in the image where the less represented class appears, then crop a fixed-size bounding box around each one with some random offset (I repeat the process multiple times per image). This yields a large set of small images that have fairly equal ratios of each class.
I should probably add here that the network's input shape has to be set to (None, None, num_channels) for it to then work on your original images.
Because you skip the vast majority of pixels (those that belong to the majority class), training is very fast, but it doesn't leverage all of the data for the majority class.
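A minimal sketch of this kind of region sampling, in plain Python on a nested-list mask, is given below. It is one possible reading of the description rather than the answerer's actual code; the function name, the box format (top, left, bottom, right), and the clamping at the image border are all assumptions.

```python
import random


def sample_minority_crops(mask, crop, per_image, minority=1, rng=None):
    """Pick fixed-size boxes that each contain at least one minority pixel."""
    rng = rng or random.Random()
    height, width = len(mask), len(mask[0])
    hits = [(r, c) for r in range(height) for c in range(width)
            if mask[r][c] == minority]
    boxes = []
    if not hits or crop > height or crop > width:
        return boxes
    for _ in range(per_image):
        r, c = rng.choice(hits)
        # Random offset: the chosen pixel may land anywhere inside the box.
        top = r - rng.randrange(crop)
        left = c - rng.randrange(crop)
        # Clamp so the box stays inside the image (still contains the pixel).
        top = min(max(top, 0), height - crop)
        left = min(max(left, 0), width - crop)
        boxes.append((top, left, top + crop, left + crop))
    return boxes


# Toy 8x8 mask with a single minority pixel near the corner.
mask = [[0] * 8 for _ in range(8)]
mask[6][7] = 1
boxes = sample_minority_crops(mask, crop=4, per_image=5, rng=random.Random(0))
print(boxes)
```

Slicing the image and mask with these boxes gives the small training patches; the same boxes must be applied to both so that pixels and labels stay aligned.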
In tensorflow 2.x the model.fit method has a class_weight argument to do this natively, passing a dictionary of weights for each class. Documentation
Main Problem
I cannot understand the plot of the weights of a specific layer.
I used a function from nolearn: plot_conv_weights(layer, figsize=(6, 6))
I'm using Lasagne as my neural-network library.
The plot comes out fine, but I don't know how I should interpret it.
Neural Network Structure
The structure I'm using:
InputLayer 1x31x31
Conv2DLayer 20x3x3
Conv2DLayer 20x3x3
Conv2DLayer 20x3x3
MaxPool2DLayer 2x2
Conv2DLayer 40x3x3
Conv2DLayer 40x3x3
Conv2DLayer 40x3x3
MaxPool2DLayer 40x2x2
DropoutLayer
DenseLayer 96
DropoutLayer 96
DenseLayer 32
DropoutLayer 32
DenseLayer 1 as sigmoid
Here are the weights of the first 3 layers:
** About the Images **
To me, they look random and I cannot interpret them!
However, CS231n says the following:
Conv/FC Filters. The second common strategy is to visualize the
weights. These are usually most interpretable on the first CONV layer
which is looking directly at the raw pixel data, but it is possible to
also show the filter weights deeper in the network. The weights are
useful to visualize because well-trained networks usually display nice
and smooth filters without any noisy patterns. Noisy patterns can be
an indicator of a network that hasn’t been trained for long enough, or
possibly a very low regularization strength that may have led to
overfitting
http://cs231n.github.io/understanding-cnn/
Then why are mine random?
The structure is trained and performs well for its task.
References
http://cs231n.github.io/understanding-cnn/
https://github.com/dnouri/nolearn/blob/master/nolearn/lasagne/visualize.py
Normally when you visualize the weights you want to check 2 things:
That they are smooth and cover a wide range of values, i.e., that they are not just a bunch of 1's and 0's. That would mean the non-linearity is being saturated.
That they have some kind of structure. Normally you tend to see oriented edges, although this is more difficult to see with small filters like 3x3.
That being said, your weights do not appear to be saturated, but they indeed seem to be too random.
During training, did the network converge correctly?
I am also surprised at how big your filters are (30x30). Not sure what you are trying to accomplish with that.