Diffusion with melspectrogram data - pytorch

I am trying to feed a dataset of mel spectrogram tensors into a diffusion model. Each tensor has shape (128, 646) (a 15-second audio file).
I want to run it through a diffusion model like the one in this notebook: (https://colab.research.google.com/drive/1sjy9odlSSy0RBVgMTgP7s99NXsqglsUL?usp=sharing)
This code is for images of size 64 x 64.
My questions are as follows:
How do I adjust the model to accept tensors instead of images?
Would it be a viable solution to pad the tensors to look 'square'?
Do you have any other advice for diffusion on tensors?
Thank you.
I haven't tried anything yet; I am still researching how to do this.
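A minimal sketch of one possible way to approach the padding question, not taken from the linked notebook. It assumes the diffusion model is a 2D convolutional U-Net that halves the spatial size four times, so height and width only need to be divisible by 16 rather than square. The helper name pad_to_multiple and the batch size are hypothetical, and the random tensor stands in for real mel spectrograms.

```python
import torch
import torch.nn.functional as F


def pad_to_multiple(x, multiple=16):
    """Zero-pad the last two dims of a (B, C, H, W) tensor up to the next multiple."""
    h, w = x.shape[-2:]
    pad_h = (-h) % multiple
    pad_w = (-w) % multiple
    # For the last two dims, F.pad takes (left, right, top, bottom).
    return F.pad(x, (0, pad_w, 0, pad_h)), (h, w)


mel = torch.randn(8, 128, 646)   # stand-in batch of mel spectrograms
x = mel.unsqueeze(1)             # add a channel dim: (8, 1, 128, 646)

padded, (h, w) = pad_to_multiple(x, multiple=16)
print(padded.shape)              # torch.Size([8, 1, 128, 656])

# After sampling from the model, crop back to the original size.
restored = padded[..., :h, :w]
```

Under the same assumption, the model's first convolution would also need in_channels=1 instead of 3 if it was written for RGB images, and the last one out_channels=1.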

Related

Classification of Images with size correlation

I work on a project of classification of images (in 5 types).
The images have various sizes, so I reshape them first and then use a CNN.
I already have pretty good results. However, the type of my images is strongly correlated with their shape (for example, images of type A are very small and images of type B are very big, in 99% of cases).
So I think my score would increase significantly if I gave my model the information about the original size of the image. But I don't know how to do this... How can I give this information to my model? (I work with Keras)
Thanks in advance,
Larapa

Torch model forward with a different image size

I am testing some well known models for computer vision: UNet, FC-DenseNet103, this implementation
I train them with 224x224 randomly cropped patches and do the same on the validation set.
Now when I run inference on some videos, I pass it the frames directly (1280x640) and it works. It runs the same operations on different image sizes and never gives an error. It actually gives a nice output, but the quality of the output depends on the image size...
Now it's been a long time since I've worked with neural nets but when I was using tensorflow I remember I had to crop the input images to the train crop size.
Why don't I need to do this anymore? What's happening under the hood?
It seems that the models you are using have no linear layers, so the output of the convolutional layers goes straight into the softmax function. The softmax function does not require a specific input shape, so the model will run on images of any size. However, the accuracy will probably be far worse on image sizes different from the one you trained on.
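A minimal sketch of this behaviour, using a tiny made-up network rather than UNet or FC-DenseNet103: convolutions slide over whatever spatial size they receive, so the same weights run on a 224x224 crop and on a larger frame. The layer sizes and input shapes are arbitrary assumptions.

```python
import torch
from torch import nn

# A toy fully convolutional net: no nn.Linear, so no fixed input size.
fcn = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 2, kernel_size=1),   # 2 output classes per pixel
)

with torch.no_grad():
    small = fcn(torch.randn(1, 3, 224, 224))   # training-size crop
    large = fcn(torch.randn(1, 3, 320, 640))   # larger frame, same weights

print(small.shape)   # torch.Size([1, 2, 224, 224])
print(large.shape)   # torch.Size([1, 2, 320, 640])

# Softmax over the channel dim works per pixel, whatever the spatial size.
probs = large.softmax(dim=1)
```

A model with an nn.Linear layer after a flatten would instead fail on the larger frame, because the flattened feature count would no longer match the layer's in_features.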
There is always a specific input size in the documentation of the model. You should use this size. These are the current model limitations.
For UNets this may even be a ratio. I think it depends on implementation.
Just a note on resize:
transforms.Resize((h, w))
transforms.Resize(d)
In the first case, (h, w), the output size is matched exactly to this.
In the second case, d, the smaller edge of the image is matched to d.
For example, if height > width, then the image is rescaled to (d * height / width, d).
The idea is to not ruin the aspect ratio of the image.
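The rule above can be sketched in plain Python. This mirrors the rule as described here and is not torchvision's own code; resized_shape is a hypothetical helper, and the 640x1280 frame is an assumed (height, width) reading of the frames mentioned in the question.

```python
def resized_shape(height, width, size):
    """Return the (height, width) produced by the resize rule described above."""
    if isinstance(size, int):
        # A single int: match the smaller edge, keep the aspect ratio.
        if height > width:
            return (int(size * height / width), size)
        return (size, int(size * width / height))
    # A (h, w) pair: the output is exactly that size.
    return tuple(size)


print(resized_shape(640, 1280, 224))          # (224, 448)
print(resized_shape(1280, 640, 224))          # (448, 224)
print(resized_shape(640, 1280, (224, 224)))   # (224, 224), aspect ratio changed
```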

Output of CNN should be image

I am pretty new to deep learning, so I have one question:
Assume an input grayscale image of shape (128,128,1). The target (output) is also a (128,128,1) sized image, e.g. for segmentation, depth prediction, etc. Usually with valid padding the size of the image shrinks after several convolution layers.
What are decent (maybe not the toughest) variants to keep the size or predict a same-sized image? Is it via same-padding? Is it via transpose convolution or upsampling? Should I use an FCN at the end and reshape the output to the image size? I am using pytorch. I would be glad for any hints, because I didn't find much on the internet.
Best
TLDR; You want to look at Deconv networks (Convolution transpose) that help regenerate an image using convolution operations. You want to build an encoder-decoder convolution architecture that compresses an image to a latent representation using convolutions and then decodes an image from this compressed representation. For image segmentation, a popular architecture is U-net.
NOTE: I can't answer for pytorch, so I will be sharing the Tensorflow equivalent. Please feel free to ignore the code, but since you are looking for the concept, I can help you with what you need to solve this.
You are trying to generate an image as the output of the network.
A series of convolution operations helps to downsample an image. Since you need an output 2D matrix (grayscale image), you want to upsample as well. Such a network is called a Deconv network.
The first series of layers convolve over the input, 'flattening' it into a vector of channels. The next set of layers use 2D Conv Transpose or Deconv operations to change the channels back into a 2D matrix (grayscale image).
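The size bookkeeping behind this can be sketched with the standard output-size formulas, assuming dilation 1 and no output padding. The helper names conv_out and conv_transpose_out are hypothetical.

```python
def conv_out(n, kernel, stride=1, padding=0):
    """Output length of a convolution along one spatial dimension."""
    return (n + 2 * padding - kernel) // stride + 1


def conv_transpose_out(n, kernel, stride=1, padding=0):
    """Output length of a transposed convolution along one spatial dimension."""
    return (n - 1) * stride - 2 * padding + kernel


# Three 3x3 'valid' convolutions followed by three 3x3 transposed convolutions.
sizes = [128]
for _ in range(3):
    sizes.append(conv_out(sizes[-1], kernel=3))
for _ in range(3):
    sizes.append(conv_transpose_out(sizes[-1], kernel=3))
print(sizes)   # [128, 126, 124, 122, 124, 126, 128]

# 'Same' padding (padding=1 for a 3x3 kernel) keeps the size without any upsampling.
print(conv_out(128, kernel=3, padding=1))   # 128
```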
Refer to this image for reference -
Here is sample code that shows how you can take a (128,128,1) image through a deconv net and get a (128,128,1) image back.
You can find the conv2dtranspose layer implementation in pytorch here.
from tensorflow.keras import layers, Model, utils

inp = layers.Input((128, 128, 1))

# Convolution part: each 3x3 'valid' convolution shrinks height and width by 2
x = layers.Conv2D(2, (3, 3))(inp)
x = layers.Conv2D(4, (3, 3))(x)
x = layers.Conv2D(6, (3, 3))(x)

# Deconvolution part: each 3x3 transposed convolution grows height and width by 2
x = layers.Conv2DTranspose(6, (3, 3))(x)
x = layers.Conv2DTranspose(4, (3, 3))(x)
out = layers.Conv2DTranspose(1, (3, 3))(x)

model = Model(inp, out)
utils.plot_model(model, show_shapes=True, show_layer_names=False)
Also, if you are looking for tried and tested architectures in this domain, check out U-Net (U-Net: Convolutional Networks for Biomedical Image Segmentation). This is an encoder-decoder (conv2d, conv2d-transpose) architecture that uses a concept called skip connections to avoid information loss and generate better image segmentation masks.
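Since the question is about pytorch, here is a rough PyTorch counterpart of the Keras model above, as a sketch rather than a tuned architecture. It keeps the same channel counts and omits activations, as the Keras code does. Note that PyTorch uses channels-first (N, C, H, W) tensors, so the (128,128,1) image becomes (1, 128, 128) per sample.

```python
import torch
from torch import nn

model = nn.Sequential(
    # Convolution part: 128 -> 126 -> 124 -> 122 (no padding)
    nn.Conv2d(1, 2, kernel_size=3),
    nn.Conv2d(2, 4, kernel_size=3),
    nn.Conv2d(4, 6, kernel_size=3),
    # Deconvolution part: 122 -> 124 -> 126 -> 128
    nn.ConvTranspose2d(6, 6, kernel_size=3),
    nn.ConvTranspose2d(6, 4, kernel_size=3),
    nn.ConvTranspose2d(4, 1, kernel_size=3),
)

x = torch.randn(1, 1, 128, 128)   # one grayscale image, channels first
with torch.no_grad():
    y = model(x)
print(y.shape)   # torch.Size([1, 1, 128, 128])
```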

Convolutional Neural Network - Visualizing weights

Main Problem
I cannot understand the plot of the weights of a specific layer.
I used a method from nolearn: plot_conv_weights(layer, figsize=(6, 6))
I'm using lasagne as my neural-network library.
The plot comes out fine, but I don't know how I should interpret it.
Neural Network Structure
The structure I'm using:
InputLayer 1x31x31
Conv2DLayer 20x3x3
Conv2DLayer 20x3x3
Conv2DLayer 20x3x3
MaxPool2DLayer 2x2
Conv2DLayer 40x3x3
Conv2DLayer 40x3x3
Conv2DLayer 40x3x3
MaxPool2DLayer 40x2x2
DropoutLayer
DenseLayer 96
DropoutLayer 96
DenseLayer 32
DropoutLayer 32
DenseLayer 1 as sigmoid
Here are the weights of the first 3 Layers :
About the Images
So for me, they look random and I cannot interpret them!
However, CS231n says the following:
Conv/FC Filters. The second common strategy is to visualize the
weights. These are usually most interpretable on the first CONV layer
which is looking directly at the raw pixel data, but it is possible to
also show the filter weights deeper in the network. The weights are
useful to visualize because well-trained networks usually display nice
and smooth filters without any noisy patterns. Noisy patterns can be
an indicator of a network that hasn’t been trained for long enough, or
possibly a very low regularization strength that may have led to
overfitting
http://cs231n.github.io/understanding-cnn/
Then why are mine random?
The structure is trained and performs well for its task.
References
http://cs231n.github.io/understanding-cnn/
https://github.com/dnouri/nolearn/blob/master/nolearn/lasagne/visualize.py
Normally when you visualize the weights you want to check 2 things:
That they are smooth and cover a wide range of values, i.e. it's not a bunch of 1's and 0's. That would mean the non-linearity is being saturated.
That they have some kind of structure. Normally you tend to see oriented edges although this is more difficult to see when you have small filters like 3x3.
That being said, your weights do not appear to be saturated, but they indeed seem to be too random.
During training, did the network converge correctly?
I am also surprised at how big your filters are (30x30). Not sure what you are trying to accomplish with that.
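The first check (a wide range of values, not saturated) can also be done numerically instead of by eye. This is a rough NumPy sketch; filter_summary and to_display are hypothetical helpers, and the random array is only a stand-in for real first-layer weights of shape (filters, channels, height, width).

```python
import numpy as np


def filter_summary(weights):
    """Per-filter min, max and standard deviation of a (F, C, H, W) weight array."""
    flat = weights.reshape(weights.shape[0], -1)
    return {"min": flat.min(axis=1), "max": flat.max(axis=1), "std": flat.std(axis=1)}


def to_display(weights):
    """Rescale each filter to [0, 1] independently so it can be shown as an image."""
    lo = weights.min(axis=(1, 2, 3), keepdims=True)
    hi = weights.max(axis=(1, 2, 3), keepdims=True)
    return (weights - lo) / np.maximum(hi - lo, 1e-12)


rng = np.random.default_rng(0)
weights = rng.normal(scale=0.1, size=(20, 1, 3, 3))   # stand-in for 20 filters of 3x3

summary = filter_summary(weights)
images = to_display(weights)
print(summary["std"].shape, images.shape)   # (20,) (20, 1, 3, 3)
```

A filter whose values sit almost entirely at its own min or max would be a hint of saturation. The second check, structure such as oriented edges, still has to be judged visually.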

Can I use Keras or a similar CNN tool on a paired image and coordinate?

I am trying to train a classifier to separate images taken by a particle physics detector into two classes. For each image, I also have a coordinate (x,y,z) describing where the particle interaction took place. That coordinate is very useful in understanding these images by eye, but doesn't have an obvious translation to weighting image pixels.
I've been trying some basic machine learning techniques in scikit-learn, feeding in data points with 103 features: the three axes of the coordinates, and the 10x10 pixels of the image. Those basic techniques aren't cutting it, unfortunately, so I thought I'd try to take advantage of the properties of convolutional neural networks. Since I've never tried that before, Keras seemed like an easy way to get started.
Looking at Keras, I see that I ought to provide an input shape. I could presumably use an input shape of (103), but if I understand CNNs correctly, I'd lose all the advantages of a CNN for images. Intuitively, what I want the input shape to be is (3)+(10,10). Is that a sensible concept in the world of CNNs? Can it be done in Keras?
You might want to look into the Merge layer. In essence this allows you to use two independent inputs, maybe give them a few different processing layers and then combine them for the rest of the model.
With this you could, for example, do several convolutional layers to process the image and then simply merge it with the coordinate inputs.
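In more recent Keras versions this idea is usually written with the functional API and a Concatenate layer. A minimal sketch, assuming 10x10 single-channel images, a 3-value coordinate and a binary output; the layer sizes and input names are arbitrary choices, not a recommendation.

```python
from tensorflow.keras import layers, Model

# Two independent inputs: the image and the (x, y, z) coordinate.
image_in = layers.Input((10, 10, 1), name="image")
coord_in = layers.Input((3,), name="coord")

# Convolutional branch for the image only.
x = layers.Conv2D(8, (3, 3), activation="relu")(image_in)
x = layers.Flatten()(x)

# Merge the image features with the raw coordinates, then classify.
merged = layers.Concatenate()([x, coord_in])
h = layers.Dense(16, activation="relu")(merged)
out = layers.Dense(1, activation="sigmoid")(h)

model = Model([image_in, coord_in], out)
model.summary()
```

Training would then take a list of two arrays as input, for example model.fit([images, coords], labels), with images of shape (N, 10, 10, 1) and coords of shape (N, 3).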

Resources