GANs on color images - PyTorch

Most open-source (PyTorch) GAN examples work on the MNIST dataset, i.e. grayscale images.
Can I use a GAN on each channel of a color image, then combine the result?

You can just have your generator and discriminator generate and classify 3-channel images - speaking in terms of implementation, make them work on B x 3 x H x W tensors instead of B x 1 x H x W, as they do for MNIST.
You can't just use your GAN on each channel separately and concatenate at the end, because you would have no way to ensure that each channel corresponds to the same image. Say you're generating celebrity faces by first generating red channel, then green and finally blue. How would you make sure that you don't get a female sample for the red channel and a male for the green?
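For illustration, a minimal DCGAN-style sketch in PyTorch (the layer sizes and the 64 x 64 output resolution are assumptions, not from the question); the only difference from a grayscale setup is nc = 3:

import torch
import torch.nn as nn

nc = 3       # image channels: 3 for RGB instead of 1 for MNIST
nz = 100     # latent vector size
ngf, ndf = 64, 64

# Generator: latent vector -> B x 3 x 64 x 64 image
netG = nn.Sequential(
    nn.ConvTranspose2d(nz, ngf * 8, 4, 1, 0, bias=False), nn.BatchNorm2d(ngf * 8), nn.ReLU(True),
    nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False), nn.BatchNorm2d(ngf * 4), nn.ReLU(True),
    nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False), nn.BatchNorm2d(ngf * 2), nn.ReLU(True),
    nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False), nn.BatchNorm2d(ngf), nn.ReLU(True),
    nn.ConvTranspose2d(ngf, nc, 4, 2, 1, bias=False), nn.Tanh(),
)

# Discriminator: B x 3 x 64 x 64 image -> real/fake score
netD = nn.Sequential(
    nn.Conv2d(nc, ndf, 4, 2, 1, bias=False), nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(ndf, ndf * 2, 4, 2, 1, bias=False), nn.BatchNorm2d(ndf * 2), nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1, bias=False), nn.BatchNorm2d(ndf * 4), nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(ndf * 4, ndf * 8, 4, 2, 1, bias=False), nn.BatchNorm2d(ndf * 8), nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(ndf * 8, 1, 4, 1, 0, bias=False), nn.Sigmoid(),
)

fake = netG(torch.randn(8, nz, 1, 1))   # B x 3 x 64 x 64
score = netD(fake)                      # B x 1 x 1 x 1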

Related

How to crop an image using OpenCV's convexHull coordinates

So I am writing a small program to crop the white part of a license plate (see the image below). This is the image I use as a base (I am covering the numbers for privacy reasons).
I succeeded in finding the white rectangle using an HSV mask (with low and white colours) and by filtering the contours by size. Nevertheless, using:
(x, y, w, h) = cv2.boundingRect(contour)
gives a rectangle which crops a larger part of the license plate than I need (when the plate is sideways). For this reason I used the following after the filtering:
hull = cv2.convexHull(contour)
cv2.drawContours(copy, contours=[hull],
                 contourIdx=0,
                 color=(255, 0, 0), thickness=2)
This marks the correct area on the picture as seen below:
[image: white marked area]
Now my main problem is how I can crop only the marked part that was detected using the convexHull function. I am quite new to the world of computer vision and I could not find anything that helps here. From what I understand from my experiments with HSV and HSL, I need to create a mask that restricts the crop to that specific area of the image, but how can I create a mask from the hull result?
Using the boundingRect method I normally do:
# img_plate is the original image
img_plate[y:y + h, x: x + w]
But this will crop a larger image and not the one I really need.
Thank you in advance for all your answers.
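For illustration, here is a minimal sketch of one common way to do this, assuming the hull and img_plate variables from the question: cv2.fillConvexPoly rasterizes the hull into a binary mask, cv2.bitwise_and blanks out everything outside it, and boundingRect then gives a tight crop.

import cv2
import numpy as np

# single-channel mask with the hull area filled in white
mask = np.zeros(img_plate.shape[:2], dtype=np.uint8)
cv2.fillConvexPoly(mask, hull, 255)

# keep only the pixels inside the hull; everything else becomes black
masked = cv2.bitwise_and(img_plate, img_plate, mask=mask)

# crop to the hull's bounding box so the result is as tight as possible
x, y, w, h = cv2.boundingRect(hull)
cropped = masked[y:y + h, x:x + w]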

Direct Heatmap Regression with Fully Convolutional Nets

I'm trying to develop a fully-convolutional neural net to estimate the 2D locations of keypoints in images that contain renders of known 3D models. I've read plenty of literature on this subject (human pose estimation, model based estimation, graph networks for occluded objects with known structure) but no method I've seen thus far allows for estimating an arbitrary number of keypoints of different classes in an image. Every method I've seen is trained to output k heatmaps for k keypoint classes, with one keypoint per heatmap. In my case, I'd like to regress k heatmaps for k keypoint classes, with an arbitrary number of (non-overlapping) points per heatmap.
In this toy example, the network would output heatmaps around each visible location of an upper vertex for each shape. The cubes have 4 vertices on top, the extruded pentagons have 2, and the pyramids just have 1. Sometimes points are offscreen or occluded, and I don't wish to output heatmaps for occluded points.
The architecture is a 6-6 layer U-Net (as in this paper: https://arxiv.org/pdf/1804.09534.pdf). The ground truth heatmaps are normal distributions centered around each keypoint. When training the network with a batch size of 5 and an L2 loss, the network learns never to make an estimate at all, just outputting blank images. Datatypes are converted properly and normalized to 0-1 for the input and 0-255 for the output. I'm not sure how to solve this; are there any red flags with my general approach? I'll post code if there's no clear problem in general.
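For reference, a minimal NumPy sketch of how such targets can be built for an arbitrary number of visible keypoints per class (the function name and sigma are made up for illustration; the 0-255 output range follows the description above):

import numpy as np

def make_heatmaps(keypoints, num_classes, height, width, sigma=3.0):
    # keypoints: list of (class_id, x, y) for every *visible* keypoint in the image
    ys, xs = np.mgrid[0:height, 0:width]
    heatmaps = np.zeros((num_classes, height, width), dtype=np.float32)
    for cls, kx, ky in keypoints:
        g = np.exp(-((xs - kx) ** 2 + (ys - ky) ** 2) / (2.0 * sigma ** 2))
        # take the max so nearby peaks of the same class don't add up past 1
        heatmaps[cls] = np.maximum(heatmaps[cls], g)
    return heatmaps * 255.0  # match the 0-255 output range described above

# e.g. three keypoints, two of class 0 and one of class 2, in a 128 x 128 image
hm = make_heatmaps([(0, 30, 40), (0, 70, 80), (2, 10, 90)], num_classes=3, height=128, width=128)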

Keras image augmentation for images with two channels

I performed data augmentation for images with two channels. My data set is formatted with the shape (image_numbers, image_height, image_width, image_channels), where image_channels = 2.
When performing data augmentation using datagen (created by ImageDataGenerator), a UserWarning message is generated:
UserWarning: NumpyArrayIterator is set to use the data format convention
"channels_last" (channels on axis 3),
i.e. expected either 1, 3 or 4 channels on axis 3.
However, it was passed an array with shape (1, 150, 150, 2) (2 channels).
Does the warning imply the data augmentation was unsuccessful? Was it only performed on one-channel images? If so, how can I perform data augmentation for two-channel images (rather than one channel at a time followed by concatenation)?
It means they don't expect two-channel images. It's non-standard.
The standard images are:
1 channel: grayscale
3 channels: RGB
4 channels: RGBA
Since it's a warning, we don't really know what's going on.
Check the outputs of this generator yourself.
x, y = theGenerator[someIndex]
Plot x[0] and others.
In case the generated images aren't good, you can do the augmentations yourself using a Python generator or a custom keras.utils.Sequence.
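For example, a minimal keras.utils.Sequence sketch that works on two-channel arrays directly (the class name and the random horizontal flip are placeholders for whatever augmentations you actually need):

import numpy as np
from keras.utils import Sequence

class TwoChannelAugmenter(Sequence):
    def __init__(self, x, y, batch_size=32):
        self.x, self.y, self.batch_size = x, y, batch_size

    def __len__(self):
        return int(np.ceil(len(self.x) / self.batch_size))

    def __getitem__(self, idx):
        bx = self.x[idx * self.batch_size:(idx + 1) * self.batch_size].copy()
        by = self.y[idx * self.batch_size:(idx + 1) * self.batch_size]
        # example augmentation: random horizontal flip, applied to both channels at once
        flip = np.random.rand(len(bx)) < 0.5
        bx[flip] = bx[flip, :, ::-1, :]
        return bx, by

# usage: model.fit_generator(TwoChannelAugmenter(X_train, Y_train), epochs=10)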

What should the batch be for a Keras LSTM-CNN to process an image sequence

I want to use an image sequence to predict 1 output.
training data:
[(x_img1, y1), (x_img2, y2), ..., (x_img10, y10)]
Color image dimension:
(100, 120, 3)
Output dimension: (1)
Model implemented in Keras:
img_sequence_length = 3
model = Sequential()
model.add(TimeDistributed(Convolution2D(24, 5, 5, subsample=(2, 2), border_mode="same", activation='relu', name='conv1'),
                          input_shape=(img_sequence_length, 100, 120, 3)))
...
model.add(LSTM(64, return_sequences=True, name='lstm_1'))
model.add(LSTM(10, return_sequences=False, name='lstm_2'))
model.add(Dense(256))
model.add(Dense(1, name='output'))
The batch should be:
A)
[ [(x_img1, y1), (x_img2, y2), (x_img3, y3)],
[(x_img2, y2), (x_img3, y3), (x_img4, y4)],
…
]
Or
B)
[ [(x_img1, y1), (x_img2, y2), (x_img3, y3)],
[(x_img4, y4), (x_img5, y5), (x_img6, y6)],
…
]
Why?
This choice really depends on what you want to achieve. Understanding what your data is totally influences the decision. (Not only the shape and type of the data, but what it means and what you want from it. Is it a video? Many videos? Do I want the name of the character in a little segment of a video? Or to know the state of the plot continuously along the video?)
In option A:
This option is used when all your images form a single long sequence and you want to predict the next element in the sequence by knowing a specific number of previous images.
Each group of 3 images in that batch is completely independent.
The layer doesn't keep a memory between them, and the actual memory is length 3.
The simulation of a long sequence happens because you are repeating images in each batch, like a sliding window. But there is no connection or memory transfer from one group to another.
You use this if the sequence has any logical possibility of being predicted from 3 images.
Imagine you have one long video, but you watch only 3 seconds of it and try to deduce something from those 3 seconds. Then your memory is completely washed away before you watch another 3 seconds. When you watch these new 3 seconds, you will not be able to remember what you watched before, and you will not be able to say you watched 4 seconds. All you learn will be confined to 3-second segments.
In option B:
In this option, each group of 3 images has absolutely no connection at all to the others. You can use this as if every group of 3 images were a different sequence (not belonging to a long sequence).
Imagine you have lots of videos, and they talk about different things. One is Titanic, another is The Avengers, and so on.
This batch may be used for a case similar to the one proposed in A, but your sliding window would have a step of 3. This would make training faster, but the model would also learn less, since it sees fewer (non-overlapping) windows.
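To make the difference concrete, here is a small NumPy sketch (frames and labels stand for the X and Y arrays split out further below) that builds the sequences for both options from one long array of frames:

import numpy as np

seq_len = 3  # frames: (num_frames, 100, 120, 3), labels: (num_frames,)

# Option A: sliding window with step 1 (overlapping windows of one long sequence)
X_a = np.stack([frames[i:i + seq_len] for i in range(len(frames) - seq_len + 1)])
Y_a = labels[seq_len - 1:]                    # label of the last frame in each window

# Option B: non-overlapping windows (each group is an independent sequence)
n = (len(frames) // seq_len) * seq_len
X_b = frames[:n].reshape(-1, seq_len, 100, 120, 3)
Y_b = labels[seq_len - 1:n:seq_len]           # again one label per group, the last frame's

# both give X with shape (numberOfSequences, seq_len, 100, 120, 3)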
Other options:
You can take a look at this question, its answer and the comments for more ideas.
Some hints on splitting the data:
First, input and output data must be separate:
X = [item[0] for item in training_data]
Y = [item[1] for item in training_data]
Then you must separate the sequences properly.
As you defined in the input_shape, X must follow the same shape.
X.shape must be (numberOfSequences, img_sequence_length, 100, 120, 3)
So if it's a list of images, you must make sure that every image is a numpy array (transform them if necessary), and that you will later convert X to numpy:
X = np.asarray(X_with_numpy_images)
And if you have only one Y for each sequence, you may have it shaped as:
Y.shape must be (numberOfSequences,1)
You would probably build it by taking values in steps of 3:
Y = [Y[(i+1)*3 - 1] for i in range(numberOfSequences)]
Now it's important to understand if each sequence of 3 images is independent of the other sequences, or if you have just one huge sequence divided in small parts.
In case one, use LSTM(..., stateful=False); in case two, use LSTM(..., stateful=True).
And you will also probably need to reshape the tensors properly in the transition from the convolutional layers to the LSTM layers, because LSTM will require inputs shaped as (NumberOfSequences, SequenceLength, Features)
A suggestion is to use reshape layers:
model.add(Reshape((img_sequence_length,100*120*3)))
# of course the last dimension may be different if you don't use `padding='same'` in the convolutions or if you use pooling.

Training of SVM classifier using SIFT features

I would like to classify a set of images into 4 classes using SIFT descriptors and an SVM. Using the SIFT extractor I get a different number of keypoints per image, e.g. img1 has 100 keypoints, img2 has 55 keypoints, and so on. How can I build histograms that give fixed-size vectors in MATLAB?
In this case, perhaps dense SIFT is a good choice.
There are two main stages:
Stage 1: Creating a codebook.
Divide the input image into a set of sub-images.
Apply SIFT on each sub-image. Each keypoint will have a 128-dimensional feature vector.
Encode these vectors to create a codebook by simply applying k-means clustering with a chosen k. Each image i (with i <= n, where n is the number of images used to create the codebook) produces a matrix Vi of size 128 x m, where m is the number of keypoints gathered from that image. The input to k-means is therefore a big matrix V created by horizontal concatenation of the Vi for all i. The output of k-means is a matrix C of size 128 x k.
Stage 2: Calculating Histograms.
For each image in the dataset, do the following:
Create a histogram vector h of size k and initialize it to zeros.
Apply dense SIFT as in step 2 of stage 1.
For each keypoint's vector, find the index of its "best match" vector in the codebook matrix C (e.g. the one with the minimum Euclidean distance).
Increase the bin corresponding to this index in h by 1.
Normalize h by its L1 or L2 norm.
Now h is ready for classification.
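To make the two stages concrete, here is a rough Python sketch of the same pipeline (the question asks for MATLAB, but the steps map one-to-one; OpenCV's SIFT and scikit-learn's KMeans are used here as stand-ins, and training_images is assumed to exist):

import cv2
import numpy as np
from sklearn.cluster import KMeans

sift = cv2.SIFT_create()
k = 200  # codebook size, chosen freely

# Stage 1: gather descriptors from all training images and cluster them
all_desc = []
for img in training_images:                    # grayscale uint8 images
    _, desc = sift.detectAndCompute(img, None) # desc: (num_keypoints, 128)
    if desc is not None:
        all_desc.append(desc)
codebook = KMeans(n_clusters=k).fit(np.vstack(all_desc))

# Stage 2: one fixed-size histogram per image, regardless of keypoint count
def bow_histogram(img):
    _, desc = sift.detectAndCompute(img, None)
    h = np.zeros(k)
    if desc is not None:
        for word in codebook.predict(desc):    # index of the best-matching codeword
            h[word] += 1
    return h / max(h.sum(), 1)                 # L1 normalisation

features = np.array([bow_histogram(img) for img in training_images])
# features (n_images, k) and the class labels can now be fed to an SVM, e.g. sklearn.svm.SVC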
Another possibility is to use Fisher vectors instead of a codebook: https://hal.inria.fr/file/index/docid/633013/filename/jegou_aggregate.pdf
You will always get a different number of keypoints for different images, but the size of the feature vector of each descriptor stays the same, i.e. 128. People therefore use vector quantization or k-means clustering to build a bag-of-words histogram. You can have a look at this thread.
Using the conventional SIFT approach you will never have the same number of keypoints in every image. One way of achieving that is to sample the descriptors densely, using dense SIFT, which places a regular grid on top of the image. If all images have the same size, then you will have the same number of keypoints per image.

Resources