I am pretty new to deep learning, so I got one question:
Assume an input grayscale image of shape (128,128,1). The target (output) is also a (128,128,1) image, e.g. for segmentation or depth prediction. Usually, with valid padding, the image shrinks after several convolution layers.
What are decent (maybe not the most difficult) ways to keep the size or predict a same-sized image? Is it via same-padding? Is it via transpose convolution or upsampling? Should I use an FCN at the end and reshape the output to the image size? I am using PyTorch. I would be glad for any hints, because I didn't find much on the internet.
Best
TLDR; You want to look at Deconv networks (Convolution transpose) that help regenerate an image using convolution operations. You want to build an encoder-decoder convolution architecture that compresses an image to a latent representation using convolutions and then decodes an image from this compressed representation. For image segmentation, a popular architecture is U-net.
NOTE: I can't answer for PyTorch, so I will be sharing the TensorFlow equivalent. Please feel free to ignore the code; since you are looking for the concept, I can still help you with what you need to solve this.
You are trying to generate an image as the output of the network.
A series of convolution operations downsamples an image. Since you need a 2D matrix (a grayscale image) as output, you need to upsample as well. Such a network is called a Deconv network.
The first series of layers convolves over the input, 'flattening' it into a vector of channels. The next set of layers uses 2D Conv Transpose (Deconv) operations to turn the channels back into a 2D matrix (the grayscale image).
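To make the shrinking and growing concrete, here is a minimal sketch of the usual output-size arithmetic for convolutions and transposed convolutions (no dilation, no output padding). The helper names `conv_out` and `conv_transpose_out` are made up for this illustration, and the choice of three 3x3 layers is an assumption.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose_out(size, kernel, stride=1, padding=0):
    """Spatial size after a transposed convolution (no dilation, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


# Three valid (padding=0) 3x3 convolutions shrink 128 -> 126 -> 124 -> 122.
size = 128
for _ in range(3):
    size = conv_out(size, kernel=3)
shrunk = size

# Three 3x3 transposed convolutions grow the size back by 2 each time.
for _ in range(3):
    size = conv_transpose_out(size, kernel=3)
restored = size

# "Same" padding for a 3x3 kernel with stride 1 means padding=1: the size is unchanged.
same = conv_out(128, kernel=3, padding=1)
```

So both options from the question work in principle: same-padding keeps the size at every layer, while valid convolutions followed by matching transposed convolutions shrink it and then restore it.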
Refer to this image for reference -
Here is sample code that shows how you can take a (128,128,1) image down to a (122,122,6) feature map and back up to a (128,128,1) image using a deconv net.
The PyTorch counterpart of the Conv2DTranspose layer is torch.nn.ConvTranspose2d.
from tensorflow.keras import layers, Model, utils

inp = layers.Input((128, 128, 1))

# Convolution part: each valid 3x3 convolution shrinks height and width by 2
x = layers.Conv2D(2, (3, 3))(inp)
x = layers.Conv2D(4, (3, 3))(x)
x = layers.Conv2D(6, (3, 3))(x)

# Deconvolution part: each 3x3 transposed convolution grows height and width by 2
x = layers.Conv2DTranspose(6, (3, 3))(x)
x = layers.Conv2DTranspose(4, (3, 3))(x)
out = layers.Conv2DTranspose(1, (3, 3))(x)

model = Model(inp, out)
utils.plot_model(model, show_shapes=True, show_layer_names=False)
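Since the question was about PyTorch, the Keras model above might translate roughly as follows. This is only a sketch under the same assumptions (3x3 kernels, no padding, no activations); note that PyTorch uses channels-first tensors, so the input is (N, 1, 128, 128) rather than (N, 128, 128, 1).

```python
import torch
from torch import nn

model = nn.Sequential(
    # Convolution part: 128 -> 126 -> 124 -> 122
    nn.Conv2d(1, 2, kernel_size=3),
    nn.Conv2d(2, 4, kernel_size=3),
    nn.Conv2d(4, 6, kernel_size=3),
    # Deconvolution part: 122 -> 124 -> 126 -> 128
    nn.ConvTranspose2d(6, 6, kernel_size=3),
    nn.ConvTranspose2d(6, 4, kernel_size=3),
    nn.ConvTranspose2d(4, 1, kernel_size=3),
)

x = torch.zeros(1, 1, 128, 128)  # one dummy grayscale image, channels first
with torch.no_grad():
    y = model(x)
out_shape = tuple(y.shape)  # same spatial size as the input
```

A real model would normally add non-linearities (e.g. nn.ReLU) between the layers; they are left out here only to mirror the Keras code.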
Also, if you are looking for tried and tested architectures in this domain, check out U-Net (U-Net: Convolutional Networks for Biomedical Image Segmentation). This is an encoder-decoder (conv2d, conv2d-transpose) architecture that uses skip connections to avoid information loss and generate better image segmentation masks.
I am testing some well known models for computer vision: UNet, FC-DenseNet103, this implementation
I train them with 224x224 randomly cropped patches and do the same on the validation set.
Now when I run inference on some videos, I pass it the frames directly (1280x640) and it works. It runs the same operations on different image sizes and never gives an error. It actually gives a nice output, but the quality of the output depends on the image size...
Now it's been a long time since I've worked with neural nets but when I was using tensorflow I remember I had to crop the input images to the train crop size.
Why don't I need to do this anymore? What's happening under the hood?
It seems that the models you are using have no linear (fully connected) layers, so the output of the convolutional layers goes straight into the softmax function. Softmax does not require a specific input shape, so the model will run on images of any shape. However, the accuracy will probably be far worse on image shapes that differ from the ones you trained on.
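A small numpy sketch of why this works: a convolution slides the same fixed-size kernel over whatever input it is given, so the weights never depend on the image size. The function `conv2d_valid` and the averaging kernel are stand-ins invented for this illustration, not part of any of the models mentioned.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv2d_valid(image, kernel):
    """Valid 2D cross-correlation of a single-channel image with one kernel."""
    # windows has shape (H - kh + 1, W - kw + 1, kh, kw)
    windows = sliding_window_view(image, kernel.shape)
    return np.einsum("ijkl,kl->ij", windows, kernel)


kernel = np.ones((3, 3)) / 9.0  # stands in for learned 3x3 weights

# The same kernel runs on a training-sized crop and on a full video frame.
small = conv2d_valid(np.zeros((224, 224)), kernel)
large = conv2d_valid(np.zeros((640, 1280)), kernel)
```

A linear layer, by contrast, has a weight matrix whose size is tied to the flattened input, which is why models containing one need a fixed input size.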
There is always a specific input size in the documentation of the model. You should use this size. These are the current model limitations.
For UNets this may even be a ratio. I think it depends on implementation.
Just a note on resize:
transforms.Resize((h, w))
transforms.Resize(d)
In the first case, with (h, w), the output size is matched to it exactly.
In the second case, with a single int d, the smaller edge of the image is matched to d.
For example, if height > width, the image is rescaled to (d * height / width, d).
The idea is to not ruin the aspect ratio of the image.
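The rule above can be written out as a small pure-Python helper. This is a sketch of the described behaviour, not the library's own code; the name `resized_shape` is hypothetical, and truncating with int() is an assumption about rounding.

```python
def resized_shape(height, width, size):
    """Output (height, width) for the Resize rule described above."""
    if isinstance(size, (tuple, list)):
        # (h, w): the output is matched to it exactly
        h, w = size
        return (h, w)
    # single int: the smaller edge becomes `size`, the other edge keeps the aspect ratio
    if height <= width:
        return (size, int(size * width / height))
    return (int(size * height / width), size)


portrait = resized_shape(400, 200, 100)          # height > width
landscape = resized_shape(200, 400, 100)         # width > height
explicit = resized_shape(300, 500, (50, 60))     # aspect ratio is not preserved here
```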
I am looking at using Landsat imagery to train a CNN for unsupervised pixel-wise semantic segmentation classification. That said, I have been unable to find a method that allows me to crop images from the larger Landsat image for training and then predict on the original image. Essentially here is what I am trying to do:
Original Landsat image (5,000 x 5,000 - this is an arbitrary size, not exactly sure of the actual dimensions off-hand) -> crop the image into (100 x 100) chunks -> train the model on these cropped images -> output a prediction for each pixel in the original (uncropped) image.
That said, I am not sure if I should predict on the cropped images and stitch them together after they are predicted or if I can predict on the original image.
Any clarification/code examples would be greatly appreciated. For reference, I use both pytorch and tensorflow.
Thank you!
Lance D
Borrowing from Ronneberger et al., what we have been doing is to split the input Landsat scene and corresponding ground truth mask into overlapping tiles. Take the original image and pad it by the overlap margin (we use reflection for the padding) then split into tiles. Here is a code snippet using scikit-image:
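from skimage.util import view_as_windows

# Each window covers one tile plus the overlap margin on every side; the step is one
# tile, so neighbouring windows overlap by 2 * image_margin pixels.
patches = view_as_windows(
    image,
    (self.tile_height + 2 * self.image_margin,
     self.tile_width + 2 * self.image_margin,
     raster_value['channels']),
    (self.tile_height, self.tile_width, raster_value['channels']),
)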
I don't know what you are using for a loss function for unsupervised segmentation. In our case with supervised learning, we crop the final segmentation prediction to match the ground truth output shape. In the Ronneberger paper they relied on shrinkage due to the use of valid padding.
For predictions you would do the same (split into overlapping tiles) and stitch the result.
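A rough numpy-only sketch of that predict-and-stitch step is below. It assumes a (H, W, C) image, a 100-pixel tile with a 10-pixel margin, and a `predict_fn` that returns one value per pixel of the padded tile; `predict_tiled` and `predict_fn` are names made up for this example, and a trained model's predict call would take the place of `predict_fn`.

```python
import numpy as np


def predict_tiled(image, predict_fn, tile=100, margin=10):
    """Predict on overlapping tiles of a (H, W, C) image and stitch the centres."""
    height, width = image.shape[:2]
    # Extra padding so that height and width become multiples of the tile size.
    pad_h = (-height) % tile
    pad_w = (-width) % tile
    padded = np.pad(
        image,
        ((margin, margin + pad_h), (margin, margin + pad_w), (0, 0)),
        mode="reflect",
    )
    out = np.zeros((height + pad_h, width + pad_w), dtype=np.float32)
    for top in range(0, height + pad_h, tile):
        for left in range(0, width + pad_w, tile):
            # Tile plus margin on every side, taken from the padded image.
            patch = padded[top:top + tile + 2 * margin,
                           left:left + tile + 2 * margin]
            pred = predict_fn(patch)
            # Keep only the central tile; the margin only provides context.
            out[top:top + tile, left:left + tile] = pred[margin:margin + tile,
                                                         margin:margin + tile]
    # Drop the rows and columns that were added to reach a multiple of the tile size.
    return out[:height, :width]
```

For example, `predict_tiled(scene, lambda p: p[..., 0])` with a dummy "model" that just returns the first band reproduces that band exactly, which is a convenient way to check that the stitching lines up.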
In order to do multiclass segmentation, the masks need to be one-hot encoded. For example, if I have 100 images of shape 224x224x3 with 5 different classes, I would have a set of masks with shape (100, 224, 224, 5), i.e. the last dimension (the channel) refers to the class of the pixel. Given a grayscale mask that contains 6 classes, where each pixel has a label 1-6, I can easily convert it to the categorical mask I need using tf.keras.utils.to_categorical.
If I use the ImageDataGenerator provided with Keras, I know I can create a generator for both images and masks and then zip them together (as the code below shows). Where I'm confused is how to convert the masks into this categorical one-hot encoded structure whilst using the ImageDataGenerator. The ImageDataGenerator only finds files in directories that are saved as images, so I can't convert the masks and then save them as numpy arrays (the one-hot encoded masks) for the generator to pick up, as images can't have more than 4 channels, right? Is there some way of telling the generator to do this conversion? Or does this limit the number of classes I can have in my problem?
One solution is to write my own custom generator with the Sequence class, which I have done, but I'm keen to understand whether this is possible with Keras' built-in ImageDataGenerator. Could writing a Lambda layer in the network be the solution?
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

mask_categorical = tf.keras.utils.to_categorical(mask)  # converts a 224x224 grayscale mask to its one-hot encoded version

imageDataGen = ImageDataGenerator(rescale=1/255.)
maskDataGen = ImageDataGenerator()
imageGenerator = imageDataGen.flow_from_directory("dataset/image/",
                                                  class_mode=None, seed=40)
maskGenerator = maskDataGen.flow_from_directory("dataset/mask/",
                                                class_mode=None, seed=40)
trainGenerator = zip(imageGenerator, maskGenerator)
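One possible way to do the conversion on the fly, sketched here with numpy only, is to wrap the zipped generator and one-hot encode each mask batch as it is produced. The wrapper name `one_hot_masks` is hypothetical, and the sketch assumes the mask batches hold integer class labels in the range 0 to num_classes - 1 (labels 1-6 would first need shifting to 0-5).

```python
import numpy as np


def one_hot_masks(pair_generator, num_classes):
    """Wrap an (images, masks) generator and one-hot encode each mask batch."""
    for images, masks in pair_generator:
        labels = np.asarray(masks).astype(np.int64)
        if labels.ndim == 4:
            # (batch, H, W, channels) -> (batch, H, W); uses the first channel only
            labels = labels[..., 0]
        # Indexing an identity matrix with the labels gives (batch, H, W, num_classes).
        yield images, np.eye(num_classes, dtype=np.float32)[labels]
```

Usage would then look like `trainGenerator = one_hot_masks(zip(imageGenerator, maskGenerator), num_classes=6)`, so the mask files on disk can stay ordinary single-channel images regardless of the number of classes.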
I am using Keras and Python for satellite image segmentation. It is my understanding that, to get pixel-level predictions for image segmentation, the model reshapes a layer of dimension (-1, num_classes, height, width) to shape (-1, num_classes, height*width). This is then followed by an activation function like softmax or sigmoid. My question is how to recover images after this step, in either channels-first or channels-last format.
Example code:
o = (Reshape(( num_classes , outputHeight*outputWidth)))(o)
o = (Permute((2, 1)))(o)
o = (Activation('softmax'))(o)
I have tried adding the following layer at the end of the model:
o = (Reshape((outputHeight, outputWidth, num_classes)))(o)
Is this correct? Will this reorient the image pixels in the same order as the original or not?
Another alternative may be to use the following code on individual images:
array.reshape(height, width, num_classes)
Which method should I use to get a pixel-level segmentation result?
No, if you are interested in image segmentation, you should not flatten and then reshape your tensors. Instead, use a fully convolutional model, like the U-Net. You can find many example implementations of it on GitHub, e.g. here
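If the reshape/permute pattern from the question is kept anyway, the pixel ordering can be checked with a small numpy experiment. This sketch assumes that the Keras Reshape and Permute layers follow the same row-major ordering as numpy's reshape and transpose; the array sizes are arbitrary.

```python
import numpy as np

num_classes, height, width = 3, 4, 5

# Channels-first scores with a unique value per (class, row, column) position.
scores = np.arange(num_classes * height * width).reshape(num_classes, height, width)

flat = scores.reshape(num_classes, height * width)    # Reshape((num_classes, H*W))
flat = flat.transpose(1, 0)                           # Permute((2, 1)) -> (H*W, num_classes)
restored = flat.reshape(height, width, num_classes)   # Reshape((H, W, num_classes))

# The result should equal the original scores moved to channels-last layout.
order_preserved = np.array_equal(restored, scores.transpose(1, 2, 0))
```

Under that assumption, the extra Reshape to (outputHeight, outputWidth, num_classes) puts every pixel back in its original position in channels-last format.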
I am trying to train a classifier to separate images taken by a particle physics detector into two classes. For each image, I also have a coordinate (x,y,z) describing where the particle interaction took place. That coordinate is very useful in understanding these images by eye, but doesn't have an obvious translation to weighting image pixels.
I've been trying some basic machine learning techniques in scikit-learn, feeding in data points with 103 features: the three axes of the coordinates, and the 10x10 pixels of the image. Those basic techniques aren't cutting it, unfortunately, so I thought I'd try to take advantage of the properties of convolutional neural networks. Since I've never tried that before, Keras seemed like an easy way to get started.
Looking at Keras, I see that I ought to provide an input shape. I could presumably use an input shape of (103), but if I understand CNNs correctly, I'd lose all the advantages of a CNN for images. Intuitively, what I want the input shape to be is (3)+(10,10). Is that a sensible concept in the world of CNNs? Can it be done in Keras?
You might want to look into the Merge layer (in current Keras versions, merging layers such as Concatenate used with the functional API). In essence, this allows you to use two independent inputs, maybe give each a few processing layers of its own, and then combine them for the rest of the model.
With this you could, for example, do several convolutional layers to process the image and then simply merge it with the coordinate inputs.
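A minimal sketch of such a two-input model with the Keras functional API is shown below. The layer sizes, the single convolution, and the sigmoid output for the two-class problem are arbitrary choices made for illustration, not a recommended architecture.

```python
from tensorflow.keras import layers, Model

# Two independent inputs: the 10x10 single-channel image and the (x, y, z) coordinate.
image_in = layers.Input((10, 10, 1))
coord_in = layers.Input((3,))

# Image branch: a convolution followed by flattening to a feature vector.
x = layers.Conv2D(8, (3, 3), activation="relu")(image_in)
x = layers.Flatten()(x)

# Merge the image features with the raw coordinates.
merged = layers.Concatenate()([x, coord_in])
h = layers.Dense(16, activation="relu")(merged)
out = layers.Dense(1, activation="sigmoid")(h)  # probability of one of the two classes

model = Model([image_in, coord_in], out)
```

Training would then pass both arrays together, e.g. `model.fit([images, coords], labels)`, with `images` of shape (N, 10, 10, 1) and `coords` of shape (N, 3).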