Extract MFCC coefficients without the DCT? - audio

I am trying to replicate the work of a paper in which a CNN is trained on MFCC features without the DCT performed at the end. The features are essentially the log energies of the filter banks.
I know that Kaldi can compute MFCC features using the make_mfcc.sh script. Can the script be altered to compute the MFCCs without the final DCT? If not, are there other tools that can do so?
MFCCs are commonly derived as follows:
1. Take the Fourier transform of (a windowed excerpt of) a signal.
2. Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows.
3. Take the logs of the powers at each of the mel frequencies.
4. Take the discrete cosine transform of the list of mel log powers, as if it were a signal.
5. The MFCCs are the amplitudes of the resulting spectrum.

You can use the make_fbank.sh script to extract the log filter-bank energies, which are essentially the MFCC pipeline stopped before the DCT.
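If you want to see what those features are outside of Kaldi, here is a minimal NumPy sketch of the first three steps listed above (windowed FFT, triangular mel filters, log), with the DCT simply left out. This is not Kaldi's implementation: the function names, the 512-sample frame, the 160-sample hop, and the 40 filters are all assumptions chosen for illustration, and Kaldi's own defaults and options differ.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular overlapping filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    # n_mels + 2 band edges, equally spaced on the mel scale
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_energies(signal, sr, n_fft=512, hop=160, n_mels=40):
    """Windowed FFT -> mel filter bank -> log. The DCT is deliberately skipped."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # (n_frames, n_fft // 2 + 1)
    mel_energies = power @ mel_filterbank(n_mels, n_fft, sr).T
    return np.log(mel_energies + 1e-10)                # small floor avoids log(0)

# One second of a 440 Hz tone as a stand-in for real speech
sr = 16000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440.0 * t)
feats = log_mel_energies(signal, sr)
print(feats.shape)  # (97, 40): 97 frames, 40 log filter-bank energies each
```

Applying a DCT along the last axis of `feats` would give MFCC-like coefficients; feeding `feats` to the CNN directly is the "MFCC without the DCT" setup.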

Related

Using MFCCs and Mel-Spectrograms with CNN

I would like some feedback on why, in a lot of research papers, the researchers pass MFCCs through a Convolutional Neural Network (CNN). Inherently, the CNN is itself a feature extraction process.
Any tips or advice as to why this process is commonly used?
Thanks!
MFCCs mimic the non-linear way the human ear perceives sound and approximate the human auditory system's response.
Therefore, MFCCs are widely used in speech recognition.
While CNNs are used for feature extraction, raw audio signals are not commonly used as their input.
The reason is that audio signals are inherently prone to noise and are often contaminated with frequency bands that are not useful for the intended application.
Therefore, it is common practice to preprocess the signal to remove noise and irrelevant frequency bands by means of bandpass filters, and then extract relevant features from it.
The features can be time-domain features, such as amplitude envelope, root mean square energy, or zero-crossing rate; frequency-domain features, such as band energy ratio, spectral centroid, and spectral flux; or time-frequency representations, such as the spectrogram and mel-spectrogram.
CNNs are then used to extract local patterns in these extracted features.
In particular, for the time-frequency representations, 2D CNNs are used to extract features, similar to the feature extraction process in image recognition applications.
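To make a few of the hand-crafted features named above concrete, here is a small NumPy sketch of frame-wise root mean square energy, zero-crossing rate, and spectral centroid. The function names, the frame length, and the hop size are assumptions made for this example, not taken from any particular paper or library.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Cut a 1-D signal into overlapping frames, shape (n_frames, frame_len)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])

def rms_energy(frames):
    """Time-domain feature: root mean square energy per frame."""
    return np.sqrt(np.mean(frames ** 2, axis=1))

def zero_crossing_rate(frames):
    """Time-domain feature: fraction of adjacent sample pairs that change sign."""
    signs = np.signbit(frames)
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

def spectral_centroid(frames, sr):
    """Frequency-domain feature: magnitude-weighted mean frequency per frame."""
    window = np.hanning(frames.shape[1])
    mag = np.abs(np.fft.rfft(frames * window, axis=1))
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sr)
    return (mag * freqs).sum(axis=1) / (mag.sum(axis=1) + 1e-10)

# One second of a 1 kHz tone sampled at 8 kHz as a stand-in for real audio
sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 1000.0 * t)

frames = frame_signal(x, frame_len=400, hop=200)
rms = rms_energy(frames)
zcr = zero_crossing_rate(frames)
centroid = spectral_centroid(frames, sr)
print(frames.shape, rms.mean(), zcr.mean(), centroid.mean())
```

Each of these gives one number per frame, so a clip becomes a short 1-D sequence per feature, whereas a spectrogram or mel-spectrogram gives a 2-D array per clip, which is why the latter pair naturally with 2D CNNs.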

CNN Keras Multi Label Output Prediction

I'm trying to implement the CNN code from Andreas Werdich: https://github.com/awerdich/physionet
"The goal of this project was to implement a deep-learning algorithm that classifies electrocardiogram (ECG) recordings from a single-channel handheld ECG device into four distinct categories: normal sinus rhythm (N), atrial fibrillation (A), other rhythm (O), or too noisy to be classified (~)."
Executing the code works fine, but now that the model is trained I'm not sure how to predict on a different ECG signal. He uses ECG signals stored in hdf5 files.
"For each group of data in the hdf5 file representing a single ECG time series, the following metadata was saved as attributes:
baseline voltage in uV
bit depth
gain
sampling frequency
measurement units"
After training I saved the model with
model.save(filepath)
I put it on filedropper: http://www.filedropper.com/ecgcnn
And I have an hdf5 file full of ECG signals that I'd like to predict: http://www.filedropper.com/physioval
I tried using the model.predict function, but it didn't work. I'm not quite sure how to pass on the ECG signal, because I need 4 different classifications.
Does anyone know how I can make the prediction work?
Thanks

Which Spectrogram best represents features of an audio file for CNN based model?

I am looking to understand various spectrograms for audio analysis. I want to convert an audio file into 10 second chunks, generate spectrograms for each and use a CNN model to train on top of those images to see if they are good or bad.
I have looked at linear, log, mel, etc., and read somewhere that a mel-based spectrogram is best for this, but without any verifiable source. I used the following simple code to generate a mel spectrogram.
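import numpy as np
import librosa
import librosa.display
y, sr = librosa.load(r'C:\Users\Tej\Desktop\NoiseWork\NoiseOnly\song.wav')  # resampled to 22050 Hz by default
S = librosa.feature.melspectrogram(y=y, sr=sr)  # power mel spectrogram
librosa.display.specshow(librosa.power_to_db(S, ref=np.max))  # plot in dB relative to the peak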
My question is: which spectrogram best represents the features of an audio file for training a CNN? I have used linear spectrograms, but for some audio files they look the same.
To add to what has been stated, I recommend reading through A Comparison of Audio Signal Preprocessing Methods for Deep Neural Networks on Music Tagging by Keunwoo Choi, György Fazekas, Kyunghyun Cho, and Mark Sandler.
For their data, they achieved nearly identical classification accuracy with simple STFTs and with melspectrograms, so melspectrograms seem to be the clear winner for dimension reduction if you don't mind the preprocessing. The authors also found, as jonner mentions, that log-scaling (essentially converting amplitude to a dB scale) improves accuracy. You can easily do this with Librosa (using your code) like this:
import librosa
y, sr = librosa.load(r'C:\Users\Tej\Desktop\NoiseWork\NoiseOnly\song.wav')
S = librosa.feature.melspectrogram(y=y, sr=sr)
S_db = librosa.power_to_db(S)  # log-scaled (dB) mel spectrogram
As for normalization after dB-scaling, that seems hit or miss depending on your data. In the paper above, the authors found nearly no difference between the various normalization techniques for their data.
One last thing that should be mentioned is a somewhat new method called Per-Channel Energy Normalization. I recommend reading Per-Channel Energy Normalization: Why and How by Vincent Lostanlen, Justin Salamon, Mark Cartwright, Brian McFee, Andrew Farnsworth, Steve Kelling, and Juan Pablo Bello. Unfortunately, there are some parameters that need adjusting depending on the data, but in many cases it seems to do as well as or better than log-melspectrograms. You can implement it in Librosa like this:
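import librosa
y, sr = librosa.load(r'C:\Users\Tej\Desktop\NoiseWork\NoiseOnly\song.wav')
# Librosa's PCEN example uses a magnitude (power=1) mel spectrogram scaled to the int32 range
S = librosa.feature.melspectrogram(y=y, sr=sr, power=1)
S_pcen = librosa.pcen(S * (2**31), sr=sr)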
Although, like I mentioned, there are parameters within pcen that need adjusting! Here is Librosa's documentation on PCEN to get you started if you are interested.
Log-scaled mel-spectrograms are the current "standard" for use with Convolutional Neural Networks. They were the most commonly used representation in the Audio Event Detection and Audio Scene Classification literature between 2015 and 2018.
To be more invariant to amplitude changes, normalization is usually applied, either to entire clips or to the windows being classified. Mean/std normalization generally works fine.
But from the perspective of a CNN, there is relatively little difference between the different spectrogram variations. So switching representation is unlikely to fix your issue if two or more spectrograms are basically the same.
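As a small illustration of the mean/std normalization mentioned above, the sketch below standardizes a log-scaled mel-spectrogram over a whole clip. The array here is random stand-in data rather than a real spectrogram, and the function name and shapes are assumptions for the example. A gain change shows up as a constant offset in dB, so subtracting the mean removes it.

```python
import numpy as np

def standardize(log_mel, eps=1e-8):
    """Zero-mean, unit-variance scaling over one whole clip (or one window)."""
    return (log_mel - log_mel.mean()) / (log_mel.std() + eps)

# Stand-in for a dB-scaled mel-spectrogram: 128 mel bands x 431 frames
rng = np.random.default_rng(0)
log_mel = rng.normal(loc=-40.0, scale=12.0, size=(128, 431))

# The same clip recorded 20 dB quieter is just a constant offset in dB
quiet = log_mel - 20.0

a = standardize(log_mel)
b = standardize(quiet)
print(np.allclose(a, b))  # True: the level difference is normalized away
```

To normalize per window instead of per clip, the same function can be applied to each slice of frames before it is passed to the network.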

Graph cut performed before training or as a post-processing to a pixel-based classification

I'm currently performing a pixel-based classification of an image using simple supervised classifiers implemented in Scikit-learn. The image is first reshaped into a vector of single pixel intensities, then the training and the classification are carried out as in the following:
from sklearn.linear_model import SGDClassifier
classifier = SGDClassifier(verbose=True)
classifier.fit(training_data, training_target)
predictions = classifier.predict(test_data)
The problem with pixel-based classification is the noisy nature of the resulting classified image. To prevent it, I wanted to use Graph Cut (e.g. the Boykov-Kolmogorov implementation) to take into account the spatial context between pixels. But the implementations I found in Python (NetworkX, Graph-tool) and in C++ (OpenGM and the original implementation: [1] and [2]) don't show how to go from an image to a graph, except for [2], which is in MATLAB, and I'm not familiar enough with either Graph Cut or MATLAB.
So my question is basically how can graph cuts be integrated into the previous classification (e.g. before the training or as a post-processing)?
I had a look at the graph algorithms in Scikit-image (here), but these work only on RGB images with discrete values, whereas my pixel values are continuous.
I found this image restoration tutorial, which does more or less what I was looking for. It uses a Python wrapper library (PyMaxflow) to call the maxflow algorithm that partitions the graph.
It starts from the noisy image on the left, and takes into account the spatial constraint between pixels, to obtain the binary image on the right.
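For anyone else stuck on the "image to graph" step, here is a self-contained sketch of graph cut used as post-processing on per-pixel class probabilities. It is not the tutorial's code and does not use PyMaxflow: to keep it dependency-free I swapped in a plain Edmonds-Karp max-flow, which is only practical for tiny grids. The function name, the `lam` smoothness weight, and the assumption that the classifier gives a foreground probability per pixel are all mine.

```python
import math
from collections import deque

def graph_cut_smooth(prob, lam=1.0, eps=1e-6):
    """Binary labeling of a 2-D grid of foreground probabilities via min-cut."""
    rows, cols = len(prob), len(prob[0])
    n = rows * cols
    src, sink = n, n + 1
    cap = [dict() for _ in range(n + 2)]   # cap[u][v] = residual capacity

    def add_edge(u, v, c_uv, c_vu):
        cap[u][v] = cap[u].get(v, 0.0) + c_uv
        cap[v][u] = cap[v].get(u, 0.0) + c_vu

    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            p = min(max(prob[r][c], eps), 1.0 - eps)
            # Source side means foreground. An edge is cut (paid) when the
            # pixel ends up on the other side of it.
            add_edge(src, i, -math.log(1.0 - p), 0.0)  # cost of calling i background
            add_edge(i, sink, -math.log(p), 0.0)       # cost of calling i foreground
            # Smoothness: neighbors with different labels pay lam
            if c + 1 < cols:
                add_edge(i, i + 1, lam, lam)
            if r + 1 < rows:
                add_edge(i, i + cols, lam, lam)

    def reachable_from_source():
        parent = {src: None}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v, c_uv in cap[u].items():
                if c_uv > 1e-12 and v not in parent:
                    parent[v] = u
                    if v == sink:
                        return parent
                    queue.append(v)
        return parent

    while True:
        parent = reachable_from_source()
        if sink not in parent:
            break                      # no augmenting path left: max flow reached
        flow, v = float("inf"), sink
        while parent[v] is not None:   # bottleneck along the path
            flow = min(flow, cap[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:   # push the flow
            u = parent[v]
            cap[u][v] -= flow
            cap[v][u] += flow
            v = u

    # Pixels still reachable from the source are on the foreground side of the cut
    return [[1 if r * cols + c in parent else 0 for c in range(cols)]
            for r in range(rows)]

# Left half foreground, right half background, with two unreliable pixels
prob = [
    [0.9, 0.9, 0.1, 0.1],
    [0.4, 0.9, 0.1, 0.1],
    [0.9, 0.9, 0.1, 0.6],
    [0.9, 0.9, 0.1, 0.1],
]
thresholded = [[1 if p > 0.5 else 0 for p in row] for row in prob]
smoothed = graph_cut_smooth(prob, lam=1.0)
for row in smoothed:
    print(row)
```

Plain thresholding mislabels the two unreliable pixels, while the cut relabels them to agree with their neighbors. So this is graph cut as post-processing: train and predict per pixel as before, then feed the probabilities in as terminal edge weights. For real image sizes a dedicated max-flow library would be needed.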

Building GMM using SIDEKIT 1.2

I have two-dimensional data in the form of a text file. I have to build a GMM based on this data using Sidekit 1.2.
Which function should I use to estimate the parameters of the Gaussian model (mean, covariance matrix, weighted average, etc.)?
Can you please provide a small example with your own set of (x,y) data and build a GMM using that?
Any help would be greatly appreciated.
Sidekit is a toolkit built mainly for the task of speaker recognition, and its framework (like other similar toolkits) relies on the training data consisting of audio files in .wav, .sph, or raw PCM format.
If you're just building a GMM and don't plan to use it for speaker recognition experiments, I would recommend using another toolkit for general statistical purposes (scikit-learn might be a good choice).
If you do plan to do speaker recognition tasks, you will have to do some initial work on your data. If your text data is some form of speaker data, you could convert it to the appropriate format. For example, if the y part is raw audio, convert it to WAV files. If y is cepstral features or other features, store it in .h5 format. After doing this, you can build a GMM for speaker recognition tasks by following the tutorials on the Sidekit homepage.
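For the general-purpose route, here is a minimal sketch of fitting a GMM to 2-D points with scikit-learn's GaussianMixture. The data is synthetic, standing in for the rows of your text file, and the choice of two components and the file name in the comment are assumptions for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic (x, y) points standing in for the contents of the text file.
# For a real file, something like np.loadtxt("points.txt") would give the
# same kind of (n_samples, 2) array ("points.txt" is a hypothetical name).
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2))
cluster_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(200, 2))
data = np.vstack([cluster_a, cluster_b])

# Fit a two-component GMM with full covariance matrices using EM
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(data)

print(gmm.weights_)      # mixture weights, one per component, summing to 1
print(gmm.means_)        # component means, shape (2, 2)
print(gmm.covariances_)  # covariance matrices, shape (2, 2, 2)
```

The number of components has to be chosen by you; comparing `gmm.bic(data)` across a few values of `n_components` is one common way to pick it.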

Resources