I'm trying to use part of Google DeepMind's CGQN network in Keras (DeepMind paper). Depending on how many input images they give to the network, the network understands more about the 3D environment it is trying to predict. Here is a scheme of their network:
I would also like to use multiple input "images" like they did with the Mθ network. So my question is: Using Keras, how can I reuse a part of the network an arbitrary number of times and then sum all of the outputs it generates, which will be used as an input to the next part of the network?
Thanks in advance!
You can achieve this using the functional API; here is a proof of concept:
from tensorflow.keras.layers import Input, Conv2D, TimeDistributed

images_in = Input(shape=(None, 32, 32, 3))  # some number of 32x32 colour images
# think of it as a video, a sequence of images for example
shared_conv = Conv2D(32, 2)  # some shared layer that you want to apply to every image
features = TimeDistributed(shared_conv)(images_in)  # applies shared_conv to every image
Here TimeDistributed applies a given layer across the time dimension, which in our case means it is applied to every image and you get an output for every image. There are more examples in the Keras documentation. You can also wrap a shared set of layers / a submodel the same way, apply it to every image, and then take the reduced sum, as sketched below.
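To make this concrete, here is a minimal sketch of sharing a small encoder across all images and summing the per-image outputs (the encoder layers and the output head are arbitrary placeholders, not the actual CGQN architecture):

import tensorflow as tf
from tensorflow.keras.layers import Input, Conv2D, TimeDistributed, GlobalAveragePooling2D, Lambda, Dense
from tensorflow.keras.models import Model, Sequential

# shared "encoder" applied to every image (a stand-in for the M-theta part)
shared_encoder = Sequential([
    Conv2D(32, 3, activation='relu', input_shape=(32, 32, 3)),
    GlobalAveragePooling2D(),
])

images_in = Input(shape=(None, 32, 32, 3))                      # arbitrary number of images
per_image = TimeDistributed(shared_encoder)(images_in)          # shape: (batch, n_images, 32)
summed = Lambda(lambda x: tf.reduce_sum(x, axis=1))(per_image)  # sum over the image axis
out = Dense(10, activation='softmax')(summed)                   # the "next part" of the network

model = Model(images_in, out)
model.summary()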
We want to train our model on varying input dimensions. Every input in a given batch and across batches has different dimensions.
We cannot resize our input (since we would lose our microscopic features), so converting the inputs into batches as a single numpy array is impossible. To handle this, I have put the inputs into a list where each element has shape (height, width, 1); height is variable and width is constant.
Sometimes my input is excessively large, so I plan to use model.fit_generator(). In the generator, we find the max height and width of the inputs in a batch and pad every other input with zeros so that every input in the batch has equal dimensions. We can then easily convert the batch to a numpy array or a tensor and pass it to fit_generator(). The model should learn to ignore the zeros and learn features from the intended portion of the padded input. This way every batch has equal input dimensions internally, but different batches have different shapes (due to differences in the max height and width across batches), roughly as in the sketch below.
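A rough sketch of the kind of generator I have in mind (a keras.utils.Sequence that pads each batch to its own max height; the names and the simple bottom-padding are my own assumptions):

import numpy as np
from tensorflow.keras.utils import Sequence

class PaddedBatchGenerator(Sequence):
    def __init__(self, images, labels, batch_size=8):
        self.images = images          # list of arrays with shape (height, width, 1)
        self.labels = labels
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.images) / self.batch_size))

    def __getitem__(self, idx):
        batch = self.images[idx * self.batch_size:(idx + 1) * self.batch_size]
        y = np.array(self.labels[idx * self.batch_size:(idx + 1) * self.batch_size])
        max_h = max(img.shape[0] for img in batch)   # width is constant, height varies
        width = batch[0].shape[1]
        x = np.zeros((len(batch), max_h, width, 1), dtype='float32')
        for i, img in enumerate(batch):
            x[i, :img.shape[0], :, :] = img          # zero-pad at the bottom
        return x, y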
Up to here I have described what I have learned and what I plan to do with variable-size input data. But I am stuck on the following questions:
1- I plan to use a CNN first and then an LSTM on top of it. I am using tensorflow keras, which provides padding and masking. As far as I know, an LSTM with masking can ignore 0-padded values. However, I am concerned about the CNN (does a CNN ignore 0-padded values?), because my padded input will be fed to the CNN first. I have seen some discussion in the following links:
How to apply masking layer to sequential CNN model in Keras?
https://github.com/keras-team/keras/issues/411
In these links, they mention that masking is unfortunately not yet supported by the Keras Conv layers. However, there has since been a lot of development, specifically in tensorflow Keras. So I am wondering: does tensorflow keras now support masking for such inputs?
2- To feed the data, we can write a custom keras generator. I went through a very good tutorial on this and made up my mind to use it. But I am wondering whether there is any built-in facility in tensorflow keras for this that would save me from writing a custom generator?
I am testing some well-known models for computer vision: UNet, FC-DenseNet103, this implementation
I train them with 224x224 randomly cropped patches and do the same on the validation set.
Now when I run inference on some videos, I pass it the frames directly (1280x640) and it works. It runs the same operations on different image sizes and never gives an error. It actually gives a nice output, but the quality of the output depends on the image size...
Now it's been a long time since I've worked with neural nets but when I was using tensorflow I remember I had to crop the input images to the train crop size.
Why don't I need to do this anymore? What's happening under the hood?
It seems that the models you are using have no linear (fully connected) layers. Because of this, the output of the convolutional layers goes straight into the softmax function. The softmax function doesn't require a specific input shape, so it can take any shape as input. Because of this, your model will work with any image shape, but the accuracy will probably be far worse on image shapes different from the one you trained on.
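As a minimal illustration (a toy fully convolutional head, not the actual UNet/FC-DenseNet code), a model built only from convolutions runs on any spatial size:

import torch
import torch.nn as nn

# toy fully convolutional model: no Linear layers, so no fixed input size is baked in
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 2, kernel_size=1),   # 2-class per-pixel logits
    nn.Softmax(dim=1),
)

print(model(torch.randn(1, 3, 224, 224)).shape)   # torch.Size([1, 2, 224, 224])
print(model(torch.randn(1, 3, 640, 1280)).shape)  # torch.Size([1, 2, 640, 1280])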
There is usually a specific input size given in the documentation of the model; you should use that size, as it reflects the model's current limitations.
For UNets this may even be a ratio. I think it depends on implementation.
Just a note on resize:
transform.Resize((h,w))
transform.Resize(d)
In the case of (h, w), the output size will be matched to this exactly.
In the second case, with a single size d, the smaller edge of the image will be matched to d.
For example, if height > width, then the image will be rescaled to (d * height / width, d).
The idea is to not ruin the aspect ratio of the image.
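For reference, a small usage sketch of both forms (assuming the usual torchvision import; the frame file is hypothetical):

from PIL import Image
from torchvision import transforms

img = Image.open("frame.jpg")                 # hypothetical 1280x640 frame

fixed = transforms.Resize((224, 224))(img)    # exactly 224x224, aspect ratio may change
scaled = transforms.Resize(224)(img)          # smaller edge becomes 224, aspect ratio kept
print(fixed.size, scaled.size)                # e.g. (224, 224) and (448, 224)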
So I've got a simple pytorch example of how to train a ResNet CNN to learn MNIST labeling from this link:
https://zablo.net/blog/post/using-resnet-for-mnist-in-pytorch-tutorial/index.html
It's working great, but I want to hack it a bit so that it does 2 things. First, instead of predicting digits, it predicts animal shapes/colors for a project I'm working on. That's already working quite well and I'm happy with it.
Second, I'd like to hack the training (and possibly the layers) so that prediction is done in parallel on multiple images at a time. In the MNIST example, prediction (or output) would basically be done for an image that has 10 digits concatenated together by me. For clarity, each 10-image input will contain the digits 0-9 exactly once each. The key here is that each of the 10 digits gets a unique class/label from the CNN/ResNet and each class gets assigned exactly once, so that digits predicted with high confidence prevent other digits with lower confidence from using that label (a Hungarian algorithm type of approach).
So in my use case I want to train on concatenated images (not single images) as in Fig A below and force the classifier to learn to predict the best unique label for each of the concatenated images and do this all at once. Such an approach should outperform single image classification - and it's particularly useful for my animal classification because otherwise the CNN can sometimes return the same ID for multiple animals which is impossible in my application.
I can already predict in series as in Fig B below. And indeed, looking at the confidence of each prediction, I am able to implement a Hungarian-algorithm-like approach post-prediction to assign the best (most confident) unique IDs in each batch of 4 animals, roughly as in the snippet below. But this doesn't always work, and I'm wondering if ResNet can try to learn the greedy Hungarian assignment as well.
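For reference, this is roughly the post-prediction assignment step I use now (scipy's Hungarian solver; the confidence matrix here is just placeholder data):

import numpy as np
from scipy.optimize import linear_sum_assignment

# rows: the 4 animals in a batch, cols: candidate IDs; values: prediction confidences
confidences = np.random.rand(4, 4)                  # placeholder for the CNN's softmax outputs
rows, cols = linear_sum_assignment(-confidences)    # negate to maximise total confidence
assignments = dict(zip(rows, cols))                 # animal i -> unique ID cols[i]
print(assignments)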
In particular, it's not clear that simply augmenting the input data and labels in the training set will implement A automatically, because I don't know how to penalize or disallow returning the same label twice within each group of images. So for now I can generate these training datasets like this:
print(train_loader.dataset.data.shape)      # torch.Size([60000, 28, 28])
print(train_loader.dataset.targets.shape)   # torch.Size([60000])
And I guess I would want the targets to be [60000, 10], and each input image to be [1, 28, 28, 10]? But I'm not sure what the correct approach would be; a rough sketch of what I have in mind is below.
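Purely as a sketch of the grouping I have in mind (the names, shapes and grouping logic here are my own guesses, not something from the tutorial):

import torch

def make_grouped_dataset(data, targets, n_groups=6000):
    # data: [60000, 28, 28], targets: [60000] -- the raw MNIST tensors
    # build groups that contain each digit 0-9 exactly once, in shuffled positions
    idx_by_digit = [torch.nonzero(targets == d, as_tuple=True)[0] for d in range(10)]
    images, labels = [], []
    for g in range(n_groups):
        picks = torch.stack([idx_by_digit[d][g % len(idx_by_digit[d])] for d in range(10)])
        perm = torch.randperm(10)                 # shuffle the digits' positions
        images.append(data[picks[perm]])          # [10, 28, 28]; position i shows digit perm[i]
        labels.append(perm)                       # label for position i is perm[i]
    return torch.stack(images).float() / 255.0, torch.stack(labels)  # [N, 10, 28, 28], [N, 10]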
Any advice or available links?
I think this is a specific type of training, but I forgot the name.
I know that it might be a dumb question, but I searched everywhere for an answer and could not find one.
Okay, first, to properly explain my question:
When I was learning CNNs I was told that kernels or filters or activation maps represent a feature of the image.
To be specific, in cat image identification, a feature map might represent "whiskers",
and in images where the activation of this feature map is high it is inferred that a whisker is present, and so the image is a cat. (Correct me if I am wrong.)
Well, now when I made a Keras ConvNet I saved the model,
then loaded the model and
saved all the filters as png images.
What I saw were 3x3 px images where each pixel was a different colour (green, blue, their various variants, and so on).
So how do these 3x3 px random-colour-pattern kernel images represent the "whisker" or any other feature of a cat in any way?
Or how could I know which png image is which feature, i.e. which is the whisker-detector filter, etc.?
I am asking this because I might be asked in oral examination by teacher.
Sorry for the length of the answer (but I had to make it this long to explain properly).
You need to take a further look into how convolutional neural networks operate, the main topic being the convolution itself. The convolution is applied between the input image and the filters/kernels to produce feature maps; a feature map is what may highlight important features.
The filters/kernels themselves do not know anything about the input data, so when you save them you are only going to see pseudo-random-looking images.
Put simply, where * is the convolution operator,
input_image * filter = feature map
What you want to save, if you want to visualise what is occurring during convolution, are the feature maps, as in the sketch below. This website gives a very detailed account of how to do so, and it is the method I have used in the past.
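As a minimal sketch of pulling feature maps out of a saved Keras model (the model file name, layer name and input image here are hypothetical stand-ins):

import numpy as np
from tensorflow.keras.models import Model, load_model

model = load_model("my_convnet.h5")                        # hypothetical saved model
layer_name = "conv2d"                                      # hypothetical conv layer name
feature_extractor = Model(inputs=model.input,
                          outputs=model.get_layer(layer_name).output)

image = np.random.rand(1, 64, 64, 3).astype("float32")     # stand-in for a cat image
feature_maps = feature_extractor.predict(image)            # shape: (1, H, W, n_filters)
# each channel feature_maps[0, :, :, i] shows what filter i responds to in this image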
I'd like to create an audio classification system with Keras that simply determines whether a given sample contains human voice or not. Nothing else. This would be my first machine learning attempt.
This audio preprocessor exists. It claims not to be done, but it's been forked a few times:
https://github.com/drscotthawley/audio-classifier-keras-cnn
I don't understand how this one would work, but I'm ready to give it a try:
https://github.com/keunwoochoi/kapre
But let's say I got one of those to work, would the rest of the process be similar to image classification? Basically, I've never fully understood when to use Softmax and when to use ReLU. Would this be similar with sound as it is with images once I've got the data mapped as a tensor?
Sounds can be seen as a 1D image and worked on with 1D convolutions.
Often, dilated convolutions can do a good job; see WaveNet.
Sounds can also be seen as sequences and worked on with RNN layers (but they may be too bulky in amount of data for that).
For your case, you need only one output with a 'sigmoid' activation at the end and a 'binary_crossentropy' loss.
Result = 0 -> no voice
Result = 1 -> there's voice
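A minimal sketch of the output end of such a model (the convolutional part and the input length are arbitrary placeholders, assuming 1 second of 16 kHz audio):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, GlobalMaxPooling1D, Dense

model = Sequential([
    Conv1D(32, 9, activation='relu', input_shape=(16000, 1)),  # raw waveform in
    GlobalMaxPooling1D(),
    Dense(1, activation='sigmoid')        # single output: probability of voice
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])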
When to use 'softmax'?
The softmax function is good for multiclass problems (not your case) where you want exactly one class as a result. All outputs of a softmax sum to 1; it's intended to be like a probability for each class.
It's mainly used at the final layer, because you only get classes as the final result.
It's good for cases when only one class is correct. And in this case, it goes well with the loss categorical_crossentropy.
Relu and other activations in the middle of the model
There are no strict rules for these; there are lots of possibilities. I often see relu in image convolutional models.
The important thing to know is their ranges: what are the limits of their outputs?
Sigmoid: from 0 to 1 -- at the end of the model this is the best option for your presence/absence classification. Also good for models where many classes can be present together.
Tanh: from -1 to 1
Relu: from 0 to infinity (it simply cuts off negative values)
Softmax: from 0 to 1, but constrained so that the sum of all values is 1. Good at the end of models that want exactly one class among many.
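For contrast with the sigmoid head above, a multiclass head (not your voice/no-voice case) would look roughly like this (the layer size is arbitrary):

from tensorflow.keras.layers import Dense

# 10 mutually exclusive classes: softmax makes the outputs sum to 1
multiclass_head = Dense(10, activation='softmax')
# and the model would be compiled with loss='categorical_crossentropy'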
Oftentimes it is useful to preprocess the audio into a spectrogram.
Using this as input, you can use classical image classification approaches (like convolutional neural networks). In your case you could divide the input audio into frames of around 20ms-100ms (depending on the time resolution you need) and convert those frames to spectrograms, for example as sketched below. Convolutional networks can also be combined with recurrent units to take a larger time context into account.
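As a rough sketch of such a preprocessing step (scipy-based; the frame length and STFT parameters are just example values):

import numpy as np
from scipy.signal import spectrogram

def audio_to_spectrograms(audio, sr=16000, frame_ms=100):
    frame_len = int(sr * frame_ms / 1000)
    frames = [audio[i:i + frame_len]
              for i in range(0, len(audio) - frame_len + 1, frame_len)]
    specs = []
    for frame in frames:
        f, t, S = spectrogram(frame, fs=sr, nperseg=256, noverlap=128)
        specs.append(np.log(S + 1e-10))   # log-magnitude spectrogram as a 2D "image"
    return np.array(specs)                # shape: (n_frames, n_freq_bins, n_time_bins)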
It is also possible to train neural networks on raw waveforms using 1D convolutions. However, research has shown that preprocessing approaches using a frequency transformation generally achieve better results.