Differentiable image compression operations in PyTorch

While training a CNN classification model, I apply JPEG encoding/compression to the images in PyTorch as part of the loss calculation. When I call loss.backward(), the gradients must also backpropagate through the encoding and compression operations performed on the images.
Are those compression operations (e.g., encoding and JPEG compression) differentiable? If not, how can the loss gradient be backpropagated through them?
If those operations are not differentiable, is there any differentiable compression algorithm in PyTorch that performs H.264 encoding and JPEG compression?
Any suggestions would be highly appreciated.

To start with, carefully consider whether you need to differentiate across the JPEG compression step. The vast majority of projects do not differentiate across this step, and if you're unsure whether you need to, you probably don't.
If you really need to differentiate across an image compressor, you might consider a codec that is easier to implement than JPEG. Wavelet-based compression (the technology behind the ill-fated JPEG 2000 format) is mathematically elegant and easy to differentiate across. In a recent application of this technique, Thies et al. 2019 represent an image as a Laplacian pyramid, with a loss component that serves to force sparsity in the higher-resolution levels.
Now, as a thought experiment, we can look at the different steps within JPEG compression and determine if they could be implemented in a differentiable way.
Color transform (RGB to YCbCr): We can represent this as a point-wise (1x1) convolution; see the sketch after this list.
Chroma downsampling: Easy enough with torch.nn.functional.interpolate on the chroma channels.
Discrete Cosine Transform (DCT): Now things are getting interesting. Here is a PyTorch implementation of DCT that might work: https://github.com/zh217/torch-dct.
Quantization table: Easy again. This is just element-wise division of the DCT output by the values in the table, followed by rounding; the rounding is where differentiability breaks (see the sketch below).
Huffman encoding: Hard; I'm not sure this is possible. The number of output elements is going to vary based on the image entropy, which rules out many differentiable building blocks. Depending on your application, you might be able to skip this step (this step is lossless compression; so if you're trying to differentiate across the compression artifacts introduced by JPEG, the previous steps should be sufficient).
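To make the differentiable steps concrete, here is a minimal PyTorch sketch of the color transform, chroma downsampling, and quantization pieces. It assumes float RGB input in [0, 1]; the DCT and Huffman steps are omitted, and the straight-through rounding estimator is a common workaround for the non-differentiable round, not part of the JPEG standard.

import torch
import torch.nn.functional as F

# Standard RGB -> YCbCr conversion coefficients.
RGB_TO_YCBCR = torch.tensor([
    [ 0.299,     0.587,     0.114    ],
    [-0.168736, -0.331264,  0.5      ],
    [ 0.5,      -0.418688, -0.081312 ],
])

def rgb_to_ycbcr(x):
    # x: (N, 3, H, W). A 1x1 (point-wise) convolution applies the color
    # matrix to every pixel, so autograd flows through it. The usual
    # +0.5 chroma offset is omitted for brevity.
    return F.conv2d(x, RGB_TO_YCBCR.view(3, 3, 1, 1))

def downsample_chroma(ycbcr, factor=2):
    # 4:2:0-style chroma subsampling via differentiable interpolation.
    y, cb, cr = ycbcr[:, :1], ycbcr[:, 1:2], ycbcr[:, 2:3]
    cb = F.interpolate(cb, scale_factor=1.0 / factor, mode='bilinear', align_corners=False)
    cr = F.interpolate(cr, scale_factor=1.0 / factor, mode='bilinear', align_corners=False)
    return y, cb, cr

def quantize(dct_coeffs, table):
    # Division by the quantization table is differentiable; rounding is
    # not, so round in the forward pass but let gradients pass through
    # unchanged (straight-through estimator).
    scaled = dct_coeffs / table
    rounded = scaled + (torch.round(scaled) - scaled).detach()
    return rounded * table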
For an interesting related work on inputting JPEG DCT components directly into a neural net, see Faster Neural Networks Straight from JPEG.

Related

Using MFCCs and Mel-Spectrograms with CNN

I would like some feedback as to why, in a lot of research papers, researchers pass MFCCs through a Convolutional Neural Network (CNN). Inherently, the CNN is itself a feature extraction process.
Any tips and advice as to why this process is commonly used would be appreciated.
Thanks!
MFCCs mimic the non-linear human ear's perception of sound and approximate the human auditory system's response.
Therefore, MFCCs are widely used in speech recognition.
While CNNs are used for feature extraction, raw audio signals are not commonly used as input to CNNs.
The reason is that audio signals are inherently prone to noise and are often contaminated with frequency bands that are not useful for the intended application.
Therefore, it is a common practice to preprocess the signal to remove noise and remove irrelevant frequency bands by means of bandpass filters, and then extract relevant features from it.
The features can be time-domain features, such as amplitude envelope, root mean square energy, or zero-crossing rate; frequency-domain features, such as band energy ratio, spectral centroid, and spectral flux; or time-frequency representations, such as the spectrogram and mel-spectrogram.
CNNs are then used to extract local patterns in these extracted features.
In particular, for time-frequency representations, 2D CNNs are used to extract features, similar to the feature extraction process in image recognition applications.
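For illustration, here is a minimal librosa sketch (with a hypothetical file name) extracting a few of the features mentioned above:

import librosa
import numpy as np

y, sr = librosa.load('speech.wav')                         # hypothetical input file
zcr = librosa.feature.zero_crossing_rate(y)                # time-domain feature
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)   # frequency-domain feature
mel = librosa.feature.melspectrogram(y=y, sr=sr)           # time-frequency representation
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)         # (13, n_frames)
# A time-frequency representation such as `mel` can be fed to a 2D CNN
# as a single-channel "image" of shape (1, n_mels, n_frames).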

How to use a trained deeplearning model at different resolutions?

I have trained a model for an image segmentation task on 320x240x3 images using TensorFlow 2.x. I am wondering if there is a way to use the same model, or tweak it, to make it work at different resolutions.
I have to use a model trained at 320x240 on Full HD (1920x1080) and SD (1280x720) images, but as GPU memory is not sufficient to train my architecture at those resolutions, I trained it on 320x240 images.
I am looking for a scalable solution that works at all these resolutions. Any suggestions?
The answer to your question is no: you cannot use a model trained at one resolution directly at a different resolution; in essence, this is why we train models at different resolutions, to check the performance and possibly improve it.
The suggestion below omits one crucial aspect: depending on the task at hand, increasing the resolution can considerably improve results in object detection and image segmentation, particularly if you have small objects.
The only solution to your problem, given the GPU memory constraint, is to split the initial image into smaller parts (tiles), run the model per part (say, 320x240), and then reconstruct the initial image; a sketch follows below. Otherwise, there is no alternative to increasing GPU memory so that you can train at higher resolutions.
PS: I only understood your question after reading it a couple of times; I suggest you clarify the details regarding the resolutions a little.
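Here is a hypothetical sketch of the tiling approach, assuming a Keras `model` whose output segmentation map has the same spatial size as its input; padding/overlap handling for the leftover rows is omitted for brevity.

import tensorflow as tf

TILE_H, TILE_W = 240, 320  # the training resolution

def segment_by_tiles(model, image):
    # image: (H, W, 3), e.g. 1920x1080. Note 1080 / 240 = 4.5, so real
    # code would need padding or overlapping tiles; leftover rows and
    # columns are simply dropped here.
    h, w, _ = image.shape
    rows = []
    for top in range(0, h - TILE_H + 1, TILE_H):
        row = []
        for left in range(0, w - TILE_W + 1, TILE_W):
            tile = image[top:top + TILE_H, left:left + TILE_W]
            pred = model(tf.expand_dims(tile, 0), training=False)[0]
            row.append(pred)
        rows.append(tf.concat(row, axis=1))   # stitch tiles along width
    return tf.concat(rows, axis=0)            # stitch rows along height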
Yes, you can use high-resolution images, but a small resolution is easier to train on, and it is easier for the model to find the features of the image. Training at a small resolution saves time and makes your model faster, since each forward pass processes far fewer pixels. HD images contain a large number of pixels, so training and inference at higher resolutions are slower, and it can be harder for the model to find features in a high-resolution image. So you are mostly advised to use a lower resolution instead of a higher one.

How to make feature vectors size equal for training neural networks?

I am training a neural network, but the feature vectors do not have the same size.
This problem might be fixed by adding zeros or removing some values, but the greater concern would be data loss or generating meaningless data.
So, is there any approach to make them equal in size without the weaknesses mentioned? Maybe a transformation to another dimensionality?
I do not want to use random values or "NA".
Adding zeros (zero padding) is the most common method of lengthening very short audio signals, and it can likewise be used to match the lengths of audio data before feature extraction.
In my understanding, this does not affect the outcome of the analysis, especially as you are using a neural network.
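For example, a minimal NumPy sketch of zero-padding feature vectors to a common length:

import numpy as np

def pad_features(vectors):
    # Zero-pad each 1-D feature vector to the length of the longest one
    # so they can be stacked into a single training matrix.
    max_len = max(len(v) for v in vectors)
    return np.stack([np.pad(v, (0, max_len - len(v))) for v in vectors])

batch = pad_features([np.array([1.0, 2.0]), np.array([3.0, 4.0, 5.0])])
# batch.shape == (2, 3); shorter vectors get trailing zeros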

Which Spectrogram best represents features of an audio file for CNN based model?

I am looking to understand various spectrograms for audio analysis. I want to split an audio file into 10-second chunks, generate a spectrogram for each, and train a CNN model on those images to classify them as good or bad.
I have looked at linear, log, mel, etc., and read somewhere that a mel-based spectrogram is best for this, but without properly verifiable information. I have used the following simple code to generate a mel spectrogram:
import librosa
import librosa.display
import numpy as np

y, sr = librosa.load(r'C:\Users\Tej\Desktop\NoiseWork\NoiseOnly\song.wav')
S = librosa.feature.melspectrogram(y=y, sr=sr)
librosa.display.specshow(librosa.power_to_db(S, ref=np.max))
My question is: which spectrogram best represents the features of an audio file for training a CNN? I have used linear spectrograms, but for some audio files the linear spectrograms look nearly identical.
To add to what has been stated, I recommend reading through A Comparison of Audio Signal Preprocessing Methods for Deep Neural Networks on Music Tagging by Keunwoo Choi, György Fazekas, Kyunghyun Cho, and Mark Sandler.
For their data, they achieved nearly identical classification accuracy between simple STFTs and melspectrograms. So melspectrograms seem to be the clear winner for dimension reduction if you don't mind the preprocessing. The authors also found, as jonner mentions, that log-scaling (essentially converting amplitude to a db scale) improves accuracy. You can easily do this with Librosa (using your code) like this:
y, sr = librosa.load(r'C:\Users\Tej\Desktop\NoiseWork\NoiseOnly\song.wav')
S = librosa.feature.melspectrogram(y=y, sr=sr)
S_db = librosa.power_to_db(S)
As for normalization after db-scaling, that seems hit or miss depending on your data. From the paper above, the authors found nearly no difference using various normalization techniques for their data.
One last thing that should be mentioned is a somewhat new method called Per-Channel Energy Normalization. I recommend reading Per-Channel Energy Normalization: Why and How by Vincent Lostanlen, Justin Salamon, Mark Cartwright, Brian McFee, Andrew Farnsworth, Steve Kelling, and Juan Pablo Bello. Unfortunately, there are some parameters that need adjusting depending on the data, but in many cases it seems to do as well as or better than log-mel spectrograms. You can implement it in Librosa like this:
y, sr = librosa.load(r'C:\Users\Tej\Desktop\NoiseWork\NoiseOnly\song.wav')
S = librosa.feature.melspectrogram(y=y, sr=sr)
S_pcen = librosa.pcen(S)
Although, like I mentioned, there are parameters within pcen that need adjusting! Here is Librosa's documentation on PCEN to get you started if you are interested.
Log-scaled mel-spectrograms are the current "standard" for use with Convolutional Neural Networks. They were the most commonly used representation in the Audio Event Detection and Audio Scene Classification literature between 2015 and 2018.
To be more invariant to amplitude changes, normalization is usually applied, either to entire clips or to the windows being classified. Mean/std normalization generally works fine; a minimal sketch follows.
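import numpy as np

def normalize(spec, eps=1e-8):
    # Per-clip mean/std normalization of a (log-mel) spectrogram;
    # eps guards against division by zero on silent clips.
    return (spec - spec.mean()) / (spec.std() + eps)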
But from the perspective of a CNN, there is relatively little difference between the various spectrogram variants. So this is unlikely to fix your issue if two or more spectrograms look basically the same.

audio features extraction using restricted boltzmann machine

I want to extract audio features using an RBM (Restricted Boltzmann Machine). For this, I am giving a spectrogram (PCA-whitened) as input to the RBM.
For each audio file, the spectrogram is a matrix with a fixed number of columns but a different number of rows. My question is: how can I train my RBM, or how can I extract features from audio using an RBM, given this spectrogram matrix? I read the paper by Honglak Lee, titled Unsupervised Feature Learning for Audio Classification using convolutional deep belief networks (http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2009_1171.pdf), which states:
"We then trained 300 first layer bases with a filter length of 6 and a max-pooling ratio of 3."
First, what is meant by bases here? (They used Convolutional Deep Belief Networks, so I guess bases do not mean weights here.)
Second, what do they mean by using a filter length of 6? How can I do that? Any hint will be appreciated. (I am new to RBMs.)
I think what is confusing here is that they add a convolutional layer to their deep belief network. The idea of the convolutional layer is that its kernels are specific to a small region of the input, in their case a window of 6 elements. I'm not an expert in audio problems, but I believe bases refer to the different bands in the spectrogram. The sketch below illustrates the quoted hyperparameters.
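As an illustration of the quoted hyperparameters, here is an ordinary 1-D convolution sketch in PyTorch (not the paper's convolutional DBN; the number of frequency bins is an assumption):

import torch
import torch.nn as nn

n_freq_bins = 128                         # assumed spectrogram height
# 300 filters correspond to the "300 first layer bases"; the kernel spans
# 6 time frames ("filter length of 6"); pooling uses a ratio of 3.
conv = nn.Conv1d(n_freq_bins, 300, kernel_size=6)
pool = nn.MaxPool1d(kernel_size=3)

spec = torch.randn(1, n_freq_bins, 200)   # (batch, freq bins, time frames)
features = pool(conv(spec))               # -> (1, 300, 65)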
