Audio feature extraction using a Restricted Boltzmann Machine

I want to extract audio features using an RBM (Restricted Boltzmann Machine). For this, I am giving the spectrogram (PCA-whitened) as input to the RBM.
For each audio file, the spectrogram is a matrix with a fixed number of columns but a different number of rows. My question is: how can I train my RBM, and how can I extract features from audio using an RBM, given this spectrogram matrix? I read the paper Unsupervised Feature Learning for Audio Classification using convolutional deep belief networks by Honglak Lee et al.: http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2009_1171.pdf
"We then trained 300 first layer bases with a filter length of 6 and a max-pooling ratio of 3."
First, what is meant by bases here? (They use Convolutional Deep Belief Networks, so I guess bases do not mean weights here.)
Second, what do they mean by a filter length of 6, and how can I implement it? Any hint will be appreciated. (I am new to RBMs.)

I think what is confusing here is that they add a convolutional layer to their deep belief network. The idea of the convolutional layer is that it uses kernels that are specific to a small region of the input, in their case a 6-element window. I'm not an expert in audio problems, but I believe bases refers to the different bands in the spectrogram.
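To make the "filter length of 6" part concrete, here is a minimal numpy sketch of sliding one length-6 filter along the time axis of a spectrogram. The shapes are illustrative assumptions, not the paper's actual configuration:
import numpy as np

spec = np.random.randn(100, 80)  # (time frames, frequency bins), stand-in for a PCA-whitened spectrogram
filt = np.random.randn(6, 80)    # one "basis": spans 6 time frames and all frequency bins
# valid 1-D convolution along time: one activation per window position
response = np.array([np.sum(spec[t:t+6] * filt) for t in range(spec.shape[0] - 5)])
print(response.shape)            # (95,)
In the paper, these per-position activations would then be max-pooled with a ratio of 3, i.e. reduced to one value per 3 consecutive positions.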

Related

Which Spectrogram best represents features of an audio file for CNN based model?

I am looking to understand the various spectrograms used for audio analysis. I want to split an audio file into 10-second chunks, generate a spectrogram for each, and train a CNN on those images to classify them as good or bad.
I have looked at linear, log, mel, etc., and read somewhere that a mel-based spectrogram is best for this, but without any properly verifiable source. I have used the following simple code to generate a mel spectrogram.
import librosa
import librosa.display
import numpy as np
y, sr = librosa.core.load(r'C:\Users\Tej\Desktop\NoiseWork\NoiseOnly\song.wav')
S = librosa.feature.melspectrogram(y=y, sr=sr)
librosa.display.specshow(librosa.power_to_db(S, ref=np.max))
My question is: which spectrogram best represents the features of an audio file for training a CNN? I have used linear spectrograms, but for some audio files the linear spectrograms look almost identical.
To add to what has been stated, I recommend reading through A Comparison of Audio Signal Preprocessing Methods for Deep Neural Networks on Music Tagging by Keunwoo Choi, György Fazekas, Kyunghyun Cho, and Mark Sandler.
For their data, they achieved nearly identical classification accuracy between simple STFTs and mel spectrograms. Since mel spectrograms give about the same accuracy with far fewer frequency bins, they seem to be the clear winner for dimensionality reduction if you don't mind the preprocessing. The authors also found, as jonner mentions, that log-scaling (essentially converting amplitude to a dB scale) improves accuracy. You can easily do this with Librosa (using your code) like this:
import librosa
y, sr = librosa.core.load(r'C:\Users\Tej\Desktop\NoiseWork\NoiseOnly\song.wav')
S = librosa.feature.melspectrogram(y=y, sr=sr)
S_db = librosa.core.power_to_db(S)  # convert power to dB (log-scaling)
As for normalization after dB-scaling, that seems hit-or-miss depending on your data. In the paper above, the authors found almost no difference between various normalization techniques on their data.
One last thing that should be mentioned is a somewhat new method called Per-Channel Energy Normalization (PCEN). I recommend reading Per-Channel Energy Normalization: Why and How by Vincent Lostanlen, Justin Salamon, Mark Cartwright, Brian McFee, Andrew Farnsworth, Steve Kelling, and Juan Pablo Bello. Unfortunately, some parameters need adjusting depending on the data, but in many cases it seems to do as well as or better than log-mel spectrograms. You can implement it in Librosa like this:
import librosa
y, sr = librosa.core.load(r'C:\Users\Tej\Desktop\NoiseWork\NoiseOnly\song.wav')
S = librosa.feature.melspectrogram(y=y, sr=sr)
S_pcen = librosa.pcen(S)  # per-channel energy normalization with default parameters
Although, as I mentioned, there are parameters within pcen that need adjusting! Librosa's documentation on pcen is a good place to start if you are interested.
Log-scaled mel spectrograms are the current "standard" for use with Convolutional Neural Networks. They were the most commonly used representation in the Audio Event Detection and Audio Scene Classification literature between 2015 and 2018.
To be more invariant to amplitude changes, normalization is usually applied, either to entire clips or to the windows being classified. Mean/std normalization generally works fine.
But from the perspective of a CNN, there is relatively little difference between the spectrogram variants. So this is unlikely to fix your issue if two or more spectrograms look basically the same.
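For example, per-clip mean/std normalization of a log-scaled mel spectrogram can be done like this (a sketch; 'song.wav' is a hypothetical input, and the small epsilon only avoids division by zero):
import librosa

y, sr = librosa.load('song.wav')  # hypothetical input file
S_db = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr))
S_norm = (S_db - S_db.mean()) / (S_db.std() + 1e-8)  # zero mean, unit variance per clip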

Choosing filters in a convolutional neural network

I have done the implementation part of a convolutional neural network, but I am still confused about how to select the filters that produce the convolved features. As I understand it, we detect features (like eyes, nose, mouth) to recognize a face in an image using a convolution layer with the help of filters. Is it true that a filter contains an eye, a nose, or a mouth in order to recognize a face in an image?
There is no hard rule for this purpose.
In many university courses, and even in models implemented in papers, researchers use 3x3 or 5x5 filters with a stride of 1 or 2.
Filter size is one of the hyperparameters you should tune for your model. A good practical approach is to look at the documentation of published models (by Google and others) and find sizes that worked well for conv layers similar to yours.
The last thing you should know is that the purpose of using small filters is to reduce the number of parameters while keeping high-quality features.
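As a back-of-envelope illustration of the parameter savings (assuming C input channels and C output channels, ignoring biases):
C = 64                                  # hypothetical channel count
params_one_5x5 = 5 * 5 * C * C          # a single 5x5 conv layer
params_two_3x3 = 2 * (3 * 3 * C * C)    # two stacked 3x3 convs, same 5x5 receptive field
print(params_one_5x5, params_two_3x3)   # 102400 vs 73728 weights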
Here is a link to all models implemented using TensorFlow for different tasks.
Good luck

How do you decide on the dimensions of the convolutional neural filter?

If I have an image which is WxHx3 (RGB), how do I decide how big to make the filter masks? Is it a function of the dimensions (W and H) or something else? How do the dimensions of the second, third, ... filters compare to the dimensions of the first filter? (Any concrete pointers would be appreciated.)
I have seen the following, but they don't answer the question.
Dimensions in convolutional neural network
Convolutional Neural Networks: How many pixels will be covered by each of the filters?
How do you decide the parameters of a Convolutional Neural Network for image classification?
It would be great if you added details about what you are trying to extract from the image and about the dataset you are trying to use.
A general idea of suitable filter mask sizes can be drawn from AlexNet and ZFNet. There is no specific formula for which size to use for a particular input format, but the size is kept small when finer analysis is required, since many small details can be missed with larger filter sizes. The Inception network papers describe how to utilize computing resources effectively; if resources are not an issue, the ZFNet visualizations across multiple layers show that many finer details become visible. We can call a network a CNN even if it has only one convolution layer and one pooling layer; the number of layers depends on how fine-grained the required features are.
I am not an expert, but I can recommend the following: if your dataset is small (a few thousand samples), not much feature extraction is required, and you are not sure about the size, you can simply go with small filters (a small, popular choice is 5x5, as in LeNet-5).
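For instance, a minimal LeNet-5-style stack built around 5x5 filters might look like this in Keras (a sketch; the layer widths follow the classic paper, while the 32x32x3 input shape and 10 classes are assumptions):
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(6, kernel_size=5, activation='tanh', input_shape=(32, 32, 3)),
    tf.keras.layers.AveragePooling2D(pool_size=2),
    tf.keras.layers.Conv2D(16, kernel_size=5, activation='tanh'),
    tf.keras.layers.AveragePooling2D(pool_size=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(120, activation='tanh'),
    tf.keras.layers.Dense(84, activation='tanh'),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.summary()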

Trying to come up with features to extract from sound waves to use for an AI song composer

I am planning on making an AI song composer that would take in a bunch of songs of one instrument, extract musical notes (like ABCDEFG) and certain features from the sound wave, perform machine learning (most likely with recurrent neural networks), and output a sequence of ABCDEFG notes (i.e., generate its own songs/music).
I think that this would be an unsupervised learning problem, but I am not really sure.
I figured that I would use recurrent neural networks, but I have a few questions on how to approach this:
- Which features should I extract from the sound wave so that the output music is melodious?
I also have a few other questions:
- Is it possible, with recurrent neural networks, to output a vector of sequenced musical notes (ABCDEFG)?
- Is there a smart way to feed in the features of the sound waves as well as the sequence of musical notes?
Well, I did something similar once (making a Shazam-like app in MATLAB). I think you can use the FFT (Fast Fourier Transform) to break the signal down into its constituent frequencies and their corresponding amplitudes. Then you can use the frequency ranges of different instruments to pick them out of the mix and classify them.
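A minimal numpy sketch of the FFT idea, using a synthetic 440 Hz tone as a stand-in for a real recording:
import numpy as np

sr = 22050                              # sample rate (an assumption)
t = np.arange(sr) / sr                  # one second of sample times
y = np.sin(2 * np.pi * 440 * t)         # pure A4 test tone
spectrum = np.abs(np.fft.rfft(y))       # amplitudes of the constituent frequencies
freqs = np.fft.rfftfreq(len(y), d=1/sr)
print(freqs[np.argmax(spectrum)])       # ~440.0 Hz, the dominant frequency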
I already tried something similar with an RNN (Recurrent Neural Network). Try using an LSTM network (Long Short-Term Memory) instead; from what I read afterward, they are WAY better than plain RNNs for this type of data processing because they do not suffer from the "vanishing gradient problem".
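To make the "sequence of notes" idea concrete, here is a minimal Keras sketch; the shapes are assumptions (13 features per frame, 7 note classes for A-G):
import tensorflow as tf

model = tf.keras.Sequential([
    # variable-length input: (time steps, 13 features per frame)
    tf.keras.layers.LSTM(64, return_sequences=True, input_shape=(None, 13)),
    # one softmax over the 7 note classes (A-G) per time step
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(7, activation='softmax')),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')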
What Chris Thaliyath said is a good hint on how to train the feature detector.

How to make the image_shape dynamic in the convolution in Theano

I tried to process a tweets dataset using a CNN in Theano. Unlike images, different tweets have different lengths (corresponding to the image shape), so the shape of each tweet is different. However, in Theano the convolution requires the shape information to be constant. My question is: is there some way to make the image_shape dynamic?
Kalchbrenner et al. (2014) implemented a CNN that accepts variable-length input and pools it into k elements (k-max pooling; see the sketch after the links below). If there are fewer than k elements to begin with, the remainder is zero-padded. Their experiments with sentence classification show that such networks successfully represent grammatical structures.
For details check out:
the paper (http://arxiv.org/pdf/1404.2188v1.pdf)
Matlab code (link on page 2 of the paper)
suggestion for DCNNs for Theano/Keras (https://github.com/fchollet/keras/issues/373)
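A minimal numpy sketch of the k-max pooling step (keep the k largest activations per feature while preserving their temporal order, zero-padding sequences shorter than k):
import numpy as np

def k_max_pool(x, k):
    # x: (time, features) activations for one variable-length input
    if x.shape[0] < k:
        x = np.vstack([x, np.zeros((k - x.shape[0], x.shape[1]))])
    idx = np.sort(np.argsort(x, axis=0)[-k:], axis=0)  # top-k rows per column, in original order
    return np.take_along_axis(x, idx, axis=0)

print(k_max_pool(np.array([[1., 9.], [5., 2.], [3., 7.], [4., 4.]]), k=2))
# [[5. 9.]
#  [4. 7.]]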
Convolutional neural networks are really better suited to processing images.
For processing tweets, you might want to read about recursive neural networks.
http://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf
