I would like to get some feedback as to why in a lot of research papers that the researchers pass MFCCs through a Convolution Neural Network (CNN)? Inherently, the CNN in itself is a feature extraction process.
Any tips and advice as to why this process is commonly used.
Thanks!
MFCCs mimic non-linear human ear perception of sound and it approximates the human auditory system's response.
Therefore, MFCCs are widely used in speech recognition.
While CNNs are being used in feature extraction, raw audio signals are not commonly used as input in CNNs.
The reason for this is audio signals are inherently being prone to noise, and are often contaminated with frequency bands that are not useful for the intended applications.
Therefore, it is a common practice to preprocess the signal to remove noise and remove irrelevant frequency bands by means of bandpass filters, and then extract relevant features from it.
The features can either be time-domain features; such as amplitude envelope, root mean square energy, or zero-crossing rate, or frequency domain features; such as band energy ratio, spectral centroid, and spectral flux, or time-frequency representations; such as spectrogram and mel-spectrogram.
CNNs are then used to extract local patterns in these extracted features.
Especially, for the time-frequency representations, 2D CNNs are used to extract features, similar to the feature extraction process in image recognition applications.
Related
I've noticed in some papers that the authors use MFCC to classify non-speech audio files. Can you offer some advice as to what aspects of the MFCC algorithm that makes it useful for urban sound (that is, non speech audio files) classification and why? Thanks!
I am looking to understand various spectrograms for audio analysis. I want to convert an audio file into 10 second chunks, generate spectrograms for each and use a CNN model to train on top of those images to see if they are good or bad.
I have looked at linear, log, mel, etc and read somewhere that mel based spectrogram is best to be used for this. But with no proper verifiable information. I have used the simple following code to generate mel spectrogram.
y,sr= librosa.core.load(r'C:\Users\Tej\Desktop\NoiseWork\NoiseOnly\song.wav')
S = librosa.feature.melspectrogram(y=y, sr=sr)
librosa.display.specshow(librosa.power_to_db(S, ref=np.max))
My question is which spectrogram best represents features of an audio file for training with CNN? I have used linear but some audio files the linear spectrogram seems to be the same
To add to what has been stated, I recommend reading through A Comparison of Audio Signal Preprocessing Methods for Deep Neural Networks on Music Tagging by Keunwoo Choi, György Fazekas, Kyunghyun Cho, and Mark Sandler.
For their data, they achieved nearly identical classification accuracy between simple STFTs and melspectrograms. So melspectrograms seem to be the clear winner for dimension reduction if you don't mind the preprocessing. The authors also found, as jonner mentions, that log-scaling (essentially converting amplitude to a db scale) improves accuracy. You can easily do this with Librosa (using your code) like this:
y,sr= librosa.core.load(r'C:\Users\Tej\Desktop\NoiseWork\NoiseOnly\song.wav')
S = librosa.feature.melspectrogram(y=y, sr=sr)
S_db = librosa.core.power_to_db(S)
As for normalization after db-scaling, that seems hit or miss depending on your data. From the paper above, the authors found nearly no difference using various normalization techniques for their data.
One last thing that should be mentioned is a somewhat new method called Per-Channel Energy Normalization. I recommend reading Per-Channel Energy Normalization: Why and How by Vincent Lostanlen, Justin Salamon, Mark Cartwright, Brian McFee,
Andrew Farnsworth, Steve Kelling, and Juan Pablo Bello. Unfortunately, there are some parameters that need adjusting depending on the data, but in many cases seems to do as well as or better than logmelspectrograms. You can implement it in Librosa like this:
y,sr= librosa.core.load(r'C:\Users\Tej\Desktop\NoiseWork\NoiseOnly\song.wav')
S = librosa.feature.melspectrogram(y=y, sr=sr)
S_pcen = librosa.pcen(S)
Although, like I mentioned, there are parameters within pcen that need adjusting! Here is Librosa's documentation on PCEN to get you started if you are interested.
Log-scaled mel-spectrograms is the current "standard" for use with Convolutional Neural Networks. It was the most commonly used in Audio Event Detection and Audio Scene Classification literature between 2015-2018.
To be more invariant to amplitude changes, normalized is usually applied. Either to entire clips or the windows being classified. Mean/std normalization works fine, generally.
But from the perspective of a CNN, there is relatively small difference between the different spectrometer variations. So this is unlikely to fix your issue if two or more spectrograms are basically the same.
I have a task with speaker verification.
My task is calculate the similarity between two audio speech voice, then compare with a threshold.
Ex: similarity score between two audio is 70%, threshold is 50%. Hence the speaker is the same person.
The speech is text-independent, it's can be any conversation.
I have experiment in using MFCC, GMM for speaker recognition task, but this task is difference, just compare two audio feature to have the similarity score. I don't know which feature is good for speaker verification and which algorithm can help me to calculate similarity score between 2 patterns.
Hope to have you guys's advices,
Many thanks.
State of the art these days is xvectors:
Deep Neural Network Embeddings for Text-Independent Speaker Verification
Implementation in Kaldi is here.
I am also working on TIMIT Dataset for speaker verification. I have extracted mfcc features and trained a UBM for same, and adapted for each speaker.When it comes to adaptation I have used diagonal matrix.
How are you testing the wav files? However, when it comes to features you can use pitch and energy.
I am planning on making an AI song composer that would take in a bunch of songs of one instrument, extract musical notes (like ABCDEFG) and certain features from the sound wave, preform machine learning (most likely through recurrent neural networks), and output a sequence of ABCDEFG notes (aka generate its own songs / music).
I think that this would be an unsupervised learning problem, but I am not really sure.
I figured that I would use recurrent neural networks, but I have a few questions on how to approach this:
- What features from the sound wave I should extract so that the output music is melodious?
Also, I have a few other questions as well
- Is it possible, with recurrent neural networks, to output a vector of sequenced musical notes (ABCDEF)?
- Any smart way I can feed in the features of the soundwaves as well as sequence of musical notes?
Well i did something similar once(making a shazam like app on matlab) , i think you can use FFT(Fast Fourier Transform ) to break it down into the constituent frequencies and their corresponding amplitudes .Then you can use the frequency range of different instruments to select them out of the whole bunch and classify .
I already tried something similar with an RNN (Recurrent Neural Network). Try using an LSTM network (Long Short Term Memory), they are a WAY better than RNNs for this type of data processing from what I read afterward, because they do not suffer from the "vanishing gradient problem".
What Chris Thaliyath said is a good hint on how to train the feature detector.
I want to extract Audio Features using RBM (Restricted Boltzmann Machine). For this, I am giving the spectrogram (PCA whitened) as an input to the RBM.
For each audio file, The spectrogram is a matrix with no. of columns fixed but with different number of rows for each audio file. My question how can I train my RBM, or how can I extract the features from audio using RBM, given this spectrogram matrix. I read in a paper by Honglak Lee, paper title Unsupervised Feature Learning for Audio Classification using convolutional deep belief networks. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2009_1171.pdf
"We then trained 300 first layer bases with a filter length of 6 and a max-pooling ratio of 3."
First, what is meant by bases here. (They have used Convolutional Deep Belief Networks, so I guess, bases do not mean weights here).
Second, what do they mean by using a filter length of 6? How can I do it? Any hint will be appreciated. (I am new to RBM)
I think what is confusing here is they add a convolutional layer to their deep belief network. The idea of the convolutional layer is they use kernels that are specific to a small region of the image, in their case a 6 element window. I'm not an expert in audio problems, but I believe bases refer to the different bands in the spectrograph.