
Difference between transform & target_transform in pytorch?
Context: Every TorchVision Dataset includes two arguments: transform and target_transform to modify the samples and labels respectively.

If you look at the source code, particularly the __getitem__ method for any of the torchvision Dataset classes, e.g., torchvision.datasets.DatasetFolder, you can see that transform and target_transform are used to modify / augment / transform the image and the target respectively.
Examples where this might be useful include object detection and semantic segmentation, where if you apply a translation / rotation / shear / scale / crop on the source image, you would also want corresponding transforms on the bounding boxes / segmentation masks.
As an additional example, you can take a look at this official tutorial where target_transform is used to convert integer class labels to one-hot format for image classification.
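For instance, here is a minimal sketch along the lines of that tutorial (using FashionMNIST and its ten classes):

import torch
from torchvision import datasets
from torchvision.transforms import ToTensor, Lambda

ds = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor(),     # applied to each PIL image -> float tensor
    target_transform=Lambda(  # applied to each integer label -> one-hot vector
        lambda y: torch.zeros(10, dtype=torch.float).scatter_(0, torch.tensor(y), value=1)
    ),
)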

Related

How can I control (or at least record) parameters passed to a torchvision transform on an image-by-image basis?

I am studying the effects of blur and noise on an image classifier, and I would like to use torchvision transforms to apply varied amounts of Gaussian blur and Poisson noise to my images. It's pretty trivial to specify a probability distribution for the noise and blur parameters, but I can't figure out how to either control those parameters on an image-by-image basis or get PyTorch to record the parameters actually used for each image. Could I do this by defining my transform inside the dataset class rather than passing it to the dataloader, so that each time I load an image a custom transform is created and its parameters are returned with the image and its label?
The transform mechanism provided by PyTorch uses simple callable objects that are called automatically upon loading samples from a Dataset. There is nothing fundamentally stopping you from doing all your transforms inside the Dataset itself. Since you haven't provided any code, I can only offer some pseudocode:
import os
from torch.utils.data import Dataset
from torchvision.io import read_image
from torchvision.transforms.functional import gaussian_blur

class CoolDataset(Dataset):
    def __init__(self, root_dir):
        self.image_list = [os.path.join(root_dir, f) for f in os.listdir(root_dir)]
        self.sample_wise_blur_std = [0.1, 0.2, ...]  # one blur sigma per image

    def __len__(self):
        return len(self.image_list)

    def __getitem__(self, i):
        img = read_image(self.image_list[i])  # .shape is (C,H,W)
        blurred = gaussian_blur(img, [3, 3], sigma=self.sample_wise_blur_std[i])
        return blurred, self.sample_wise_blur_std[i]  # blurred image plus its parameter
With the transform parameters fused into your Dataset definition, the DataLoader will collate them for you:
from torch.utils.data import DataLoader

cooldl = DataLoader(CoolDataset('/path/to/images'), ...)
for X, blurs in cooldl:
    # X.shape is (B,C,H,W)
    # blurs.shape is (B,)
    pass
I hope this is what you are looking for.

Output of CNN should be image

I am pretty new to deep learning, so I have one question:
Assume an input grayscale image of shape (128,128,1). The target (output) is also a (128,128,1) image, e.g. for segmentation, depth prediction, etc. Usually, with valid padding, the size of the image shrinks after several convolution layers.
What are decent (maybe not the most difficult) ways to keep the size or predict a same-sized image? Is it via same padding? Is it via transpose convolution or upsampling? Should I use an FCN at the end and reshape its output to the image size? I am using PyTorch. I would be glad for any hints, because I didn't find much on the internet.
Best
TLDR; You want to look at Deconv networks (Convolution transpose) that help regenerate an image using convolution operations. You want to build an encoder-decoder convolution architecture that compresses an image to a latent representation using convolutions and then decodes an image from this compressed representation. For image segmentation, a popular architecture is U-net.
NOTE: I can't answer for PyTorch, so I will be sharing the TensorFlow equivalent. Please feel free to ignore the code, but since you are looking for the concept, I can help you with what you need to solve this.
You are trying to generate an image as the output of the network.
A series of convolution operations helps to downsample an image. Since you need a 2D matrix (grayscale image) as output, you want to upsample as well. Such a network is called a deconv network.
The first series of layers convolves over the input, 'flattening' it into a vector of channels. The next set of layers uses 2D Conv Transpose (deconv) operations to change the channels back into a 2D matrix (grayscale image).
Here is sample code that shows how you can take a (128,128,1) image through an encoder-decoder bottleneck and back to a (128,128,1) image using a deconv net.
You can find the equivalent transpose-convolution layer in PyTorch (torch.nn.ConvTranspose2d) here.
from tensorflow.keras import layers, Model, utils

inp = layers.Input((128, 128, 1))
# Convolution (encoder) part: each valid 3x3 conv shrinks H and W by 2
x = layers.Conv2D(2, (3, 3))(inp)
x = layers.Conv2D(4, (3, 3))(x)
x = layers.Conv2D(6, (3, 3))(x)              # -> (122, 122, 6)
# Deconvolution (decoder) part: each 3x3 transpose conv grows H and W by 2
x = layers.Conv2DTranspose(6, (3, 3))(x)
x = layers.Conv2DTranspose(4, (3, 3))(x)
out = layers.Conv2DTranspose(1, (3, 3))(x)   # back to (128, 128, 1)

model = Model(inp, out)
utils.plot_model(model, show_shapes=True, show_layer_names=False)
Also, if you are looking for tried and tested architectures in this domain, check out U-Net (U-Net: Convolutional Networks for Biomedical Image Segmentation). This is an encoder-decoder (Conv2D, Conv2DTranspose) architecture that uses a concept called skip connections to avoid information loss and generate better image segmentation masks.
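For a rough PyTorch equivalent of the Keras model above (an illustrative sketch, not a drop-in implementation):

import torch
from torch import nn

# Mirrors the Keras model: valid 3x3 convs shrink H/W by 2 each,
# and transpose convs grow them back, so (1,128,128) -> (1,128,128).
model = nn.Sequential(
    nn.Conv2d(1, 2, 3),
    nn.Conv2d(2, 4, 3),
    nn.Conv2d(4, 6, 3),           # -> (6, 122, 122)
    nn.ConvTranspose2d(6, 6, 3),
    nn.ConvTranspose2d(6, 4, 3),
    nn.ConvTranspose2d(4, 1, 3),  # -> (1, 128, 128)
)

x = torch.randn(1, 1, 128, 128)   # (batch, channels, H, W)
print(model(x).shape)             # torch.Size([1, 1, 128, 128])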

How to use ImageDataGenerator with multi-label masks for multi-class image segmentation?

In order to do multiclass segmentation, the masks need to be one-hot encoded. For example, if I have 100 images of shape 224x224x3 with 5 different classes, I would have a set of masks with shape (100, 224, 224, 5), i.e. the last dimension (the channel) refers to the class of the pixel. Given a grayscale mask that contains 6 classes where each pixel has a label 1-6, I can easily convert this to the categorical mask I need using tf.keras.utils.to_categorical.
If I use the ImageDataGenerator provided with Keras, I know I can create a generator for both images and masks and then zip them together for the problem (as the code below shows), but where I'm confused is: how do I convert the masks into this categorical one-hot-encoded structure whilst using the ImageDataGenerator? The ImageDataGenerator only finds files in directories that are saved as images, therefore I can't convert the masks and then save them down as numpy arrays (the one-hot-encoded masks) for the generator to pick up, as images can't have more than 4 channels, right? Is there some way of telling the generator to do this conversion? Or does this limit the number of classes I can have in my problem?
One solution is to write my own custom generator with the Sequence class, which I have done, but I'm keen on understanding whether this is possible with Keras's inbuilt ImageDataGenerator. Could writing a Lambda layer on the network be the solution?
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

mask_categorical = tf.keras.utils.to_categorical(mask)  # converts 224x224 grayscale mask to its one-hot version

imageDataGen = ImageDataGenerator(rescale=1/255.)
maskDataGen = ImageDataGenerator()
imageGenerator = imageDataGen.flow_from_directory("dataset/image/",
                                                  class_mode=None, seed=40)
maskGenerator = maskDataGen.flow_from_directory("dataset/mask/",
                                                class_mode=None, seed=40)
trainGenerator = zip(imageGenerator, maskGenerator)
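One common workaround (a sketch, assuming the mask generator yields single-channel integer-labelled batches, e.g. with color_mode='grayscale') is to wrap the zipped generators and apply to_categorical on the fly:

def one_hot_mask_generator(image_gen, mask_gen, num_classes):
    # One-hot encode each mask batch as it is produced, so the masks on
    # disk stay single-channel label images with any number of classes.
    for images, masks in zip(image_gen, mask_gen):
        # masks: (batch, H, W, 1) integer labels -> (batch, H, W, num_classes)
        yield images, tf.keras.utils.to_categorical(masks[..., 0], num_classes=num_classes)

trainGenerator = one_hot_mask_generator(imageGenerator, maskGenerator, num_classes=5)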

Graph cut performed before training or as a post-processing step to a pixel-based classification

I'm currently performing a pixel-based classification of an image using simple supervised classifiers implemented in Scikit-learn. The image is first reshaped into a vector of single pixel intensities, then the training and the classification are carried out as in the following:
from sklearn.linear_model import SGDClassifier
classifier = SGDClassifier(verbose=True)
classifier.fit(training_data, training_target)
predictions = classifier.predict(test_data)
The problem with pixel-based classification is the noisy nature of the resulting classified image. To prevent it, I wanted to use graph cuts (e.g. the Boykov-Kolmogorov implementation) to take the spatial context between pixels into account. But the implementations I found in Python (NetworkX, Graph-tool) and in C++ (OpenGM and the original implementation: [1] and [2]) don't show how to go from an image to a graph, except for [2], which is in MATLAB, and I'm not familiar enough with either graph cuts or MATLAB.
So my question is basically how can graph cuts be integrated into the previous classification (e.g. before the training or as a post-processing)?
I had a look at the graph algorithms in scikit-image (here), but these work only on RGB images with discrete values, whereas my pixel values are continuous.
I found this image restoration tutorial, which does more or less what I was looking for. It uses a Python wrapper library (PyMaxflow) to call the maxflow algorithm to partition the graph.
It starts from a noisy binary image and, by taking the spatial constraints between pixels into account, recovers a clean binary image.
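The core of that tutorial looks roughly like this (a sketch along the lines of PyMaxflow's binary restoration example; the smoothness weight of 50 is an assumption you would tune):

import numpy as np
import maxflow

# img: a noisy grayscale image as a 2D uint8 array (random here for illustration)
img = np.random.randint(0, 256, (64, 64)).astype(np.uint8)

g = maxflow.Graph[int]()
nodeids = g.add_grid_nodes(img.shape)       # one graph node per pixel
g.add_grid_edges(nodeids, 50)               # smoothness term between neighbouring pixels
g.add_grid_tedges(nodeids, img, 255 - img)  # data term derived from pixel intensities
g.maxflow()                                 # solve the min-cut / max-flow problem
segments = g.get_grid_segments(nodeids)     # boolean label per pixel
denoised = np.int_(np.logical_not(segments))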

How to do weight imbalanced classes for cross entropy loss in Keras?

Using Keras for image segmentation on a highly imbalanced dataset, I want to re-weight the classes proportionally to the pixel count of each class, as described here. If I have binary classes with weights = [0.8, 0.2], how can I modify K.sparse_categorical_crossentropy(y_true, y_pred) to re-weight the loss according to the class the pixel belongs to?
The input has shape (4, 256, 256, 1) (batch, height, width, channels) and the output is a vector of 0s and 1s of shape (4, 65536, 1) (positive and negative class). The model and data are similar to the ones here, with the difference being that the images are grayscale and the masks are binary (2 classes).
This is the custom loss function I used for my semantic segmentation project. It is modified from the categorical_crossentropy function found in keras/backend/tensorflow_backend.py.
import tensorflow as tf

def class_weighted_pixelwise_crossentropy(target, output):
    output = tf.clip_by_value(output, 10e-8, 1. - 10e-8)  # avoid log(0)
    weights = [0.8, 0.2]
    return -tf.reduce_sum(target * weights * tf.math.log(output))  # tf.log in TF 1.x
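You can then pass it directly as the loss when compiling (illustrative):

model.compile(optimizer='adam', loss=class_weighted_pixelwise_crossentropy)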
Note that my final version did not use class weighting - I found that it encouraged the model to use the underrepresented classes as filler for patches of the image that it was unsure about instead of making more realistic guesses, and thereby hurt performance.
Jessica's answer is clean and works well. I generally recommend it. But for the sake of variety:
I have found that sampling regions of interest with a better ratio between the classes is an effective way to quickly learn skewed pixelwise classes.
In my case, I had two classes like you, which makes things easier. I look for areas in the image that contain the less-represented class and crop a constant-size bounding box around them with some random offset, repeating the process multiple times per image; a rough sketch follows below. This yields a large set of small images that have fairly equal ratios of each class.
I should probably add here that the network will have to be set to an input shape of (None, None, num_channels) for this to then work on your original images.
Because you skip the vast majority of pixels (those that belong to the majority class), training is very fast, but it doesn't leverage all of the data for the majority class.
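A rough sketch of that sampling idea (function and parameter names here are illustrative, not from the original answer):

import numpy as np

def sample_minority_crop(image, mask, crop_size=128, minority_label=1, max_offset=32):
    # Pick a random pixel of the under-represented class and crop a
    # fixed-size window around it with a random offset, so each crop
    # contains a healthier ratio of the two classes.
    ys, xs = np.where(mask == minority_label)
    i = np.random.randint(len(ys))
    cy = ys[i] + np.random.randint(-max_offset, max_offset + 1)
    cx = xs[i] + np.random.randint(-max_offset, max_offset + 1)
    # Clamp the top-left corner so the crop stays inside the image
    y0 = int(np.clip(cy - crop_size // 2, 0, mask.shape[0] - crop_size))
    x0 = int(np.clip(cx - crop_size // 2, 0, mask.shape[1] - crop_size))
    return (image[y0:y0 + crop_size, x0:x0 + crop_size],
            mask[y0:y0 + crop_size, x0:x0 + crop_size])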
In TensorFlow 2.x, the model.fit method has a class_weight argument to do this natively: you pass a dictionary of weights for each class. Documentation
