Meaning of MFCC - audio

I have a conceptual problem.
I know what a mel scale is and what it represents, and I know that this kind of spectrogram still has too much information for what I need.
I understand that if we want to reduce the amount of information in the spectrogram, we use MFCCs.
But I really don't get what the MFCC is and what it represents.
I use an MFCC matrix in a speech recognition process, but I don't understand what all the numbers inside that matrix represent.
The array is 13x130 and I don't know what all these floats mean. I have understood that the longer my audio track is, the bigger my matrix becomes (e.g. 13x250, 13x400).
I hope I have made myself clear.
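For reference, here is a minimal sketch (my own illustration, assuming librosa and a placeholder file "speech.wav") of where a matrix with this shape comes from: 13 is the number of coefficients requested per frame, and the second dimension is the number of analysis frames, which grows with the length of the track.

```python
# Minimal sketch (not my actual pipeline): compute MFCCs with librosa and
# show how the matrix shape relates to track length. "speech.wav" is a placeholder.
import librosa

y, sr = librosa.load("speech.wav", sr=None)   # audio samples and sample rate
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=512)

# mfcc.shape == (13, n_frames): 13 coefficients per frame,
# one frame roughly every hop_length samples, so a longer track -> more columns.
print(mfcc.shape)
print("frames expected:", 1 + len(y) // 512)
```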

Related

What is Mel spectrogram as an audio sequence and how do I apply it?

I was under the impression that Mel-spectrograms were simply spectrograms with mel scale as the y axis. However, recently, I read in a research paper this line "Data representations such as Mel-Spectrograms can be seen from two different perspectives: either as an image, or as an audio sequence."
What does this mean? It implies Mel-spectrograms are not just spectrograms, but can be interpreted in another way. If so, what is it exactly, and how can it be applied?
Spectrograms are 2-dimensional data, with the axes being Time and Frequency. There is 1 channel, which is the Energy/Power at a given Time-Frequency bin.
Images are also 2-dimensional data, where the axes are spatial extent (X/Y). If the image is grayscale, it also has just 1 channel.
Since many signal processing approaches don't particularly care about the meaning of the axes, one can use many image processing techniques on spectrograms, and it can be quite useful.
There is, however, nothing mel-specific about this. The same applies to a linear/STFT spectrogram, a chromagram, or any other time-frequency representation.
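To illustrate, here is a minimal sketch (assuming librosa and a placeholder file "clip.wav"): the same 2D array can be read either as a single-channel image or as a sequence of per-frame feature vectors.

```python
# A mel spectrogram is just a 2D array (frequency x time), so it can be
# treated either as an "image" or as an audio/feature sequence.
import librosa

y, sr = librosa.load("clip.wav", sr=None)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
log_mel = librosa.power_to_db(mel)

print(log_mel.shape)      # (128, n_frames) -> a single-channel "image"
frames = log_mel.T        # (n_frames, 128) -> a sequence of frame vectors over time
print(frames[0].shape)    # one 128-dimensional "timestep"
```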

About the usage of vocoders

I'm quite new to AI and I'm currently developing a model for non-parallel voice conversions. One confusing problem that I have is the use of vocoders.
So my model needs Mel spectrograms as the input and the current model that I'm working on is using the MelGAN vocoder (Github link) which can generate 22050Hz Mel spectrograms from raw wav files (which is what I need) and back. I recently tried WaveGlow Vocoder (PyPI link) which can also generate Mel spectrograms from raw wav files and back.
But in other models, such as
WaveRNN, VocGAN, WaveGrad,
there is no clear explanation of wav-to-mel-spectrogram generation. Do most of these models not provide the wav-to-mel-spectrogram feature because they largely cater to TTS models like Tacotron? Or is it possible that all of them have that feature and I'm just not aware of it?
A clarification would be highly appreciated.
How neural vocoders handle audio -> mel
Check e.g. this part of the MelGAN code: https://github.com/descriptinc/melgan-neurips/blob/master/mel2wav/modules.py#L26
Specifically, the Audio2Mel module simply uses standard methods to create log-magnitude mel spectrograms like this:
Compute the STFT by applying the Fourier transform to windows of the input audio,
Take the magnitude of the resulting complex spectrogram,
Multiply the magnitude spectrogram by a mel filter matrix. Note that they actually get this matrix from librosa!
Take the logarithm of the resulting mel spectrogram.
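Put together, that recipe is only a few lines. Here is a rough sketch of the same steps (not the MelGAN code itself; the window, hop, and mel settings below are placeholders):

```python
# Rough sketch of the audio -> log-mel recipe described above
# (parameter values are placeholders, not necessarily MelGAN's exact settings).
import librosa
import numpy as np

y, sr = librosa.load("speech.wav", sr=22050)

# 1. STFT: Fourier transform over windows of the input audio
stft = librosa.stft(y, n_fft=1024, hop_length=256, win_length=1024)

# 2. Magnitude of the resulting complex spectrogram
magnitude = np.abs(stft)

# 3. Mel filter matrix (the part MelGAN also gets from librosa)
mel_basis = librosa.filters.mel(sr=sr, n_fft=1024, n_mels=80)
mel = mel_basis @ magnitude

# 4. Logarithm (with a small floor to avoid log(0))
log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))
```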
Regarding the confusion
Your confusion might stem from the fact that, usually, authors of Deep Learning papers only mean their mel-to-audio "decoder" when they talk about "vocoders" -- the audio-to-mel part is always more or less the same. I say this might be confusing since, to my understanding, the classical meaning of the term "vocoder" includes both an encoder and a decoder.
Unfortunately, these methods will not always work in exactly the same manner, as there are, e.g., different methods to create the mel filter matrix, different padding conventions, etc.
For example, librosa.stft has a center argument that will pad the audio before applying the STFT, while tensorflow.signal.stft does not have this (it would require manual padding beforehand).
An example for the different methods to create mel filters would be the htk argument in librosa.filters.mel, which switches between the "HTK" method and "Slaney". Again taking Tensorflow as an example, tf.signal.linear_to_mel_weight_matrix does not support this argument and always uses the HTK method. Unfortunately, I am not familiar with torchaudio, so I don't know if you need to be careful there, as well.
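To make these two differences concrete, here is a sketch (the parameter values are arbitrary):

```python
# Sketch of the two differences mentioned above (parameter values are arbitrary).
import librosa
import tensorflow as tf

y, sr = librosa.load("speech.wav", sr=22050)

# Padding convention: librosa pads by default (center=True); with center=False
# it behaves more like tf.signal.stft, which expects you to pad beforehand.
stft_padded   = librosa.stft(y, n_fft=1024, hop_length=256, center=True)
stft_unpadded = librosa.stft(y, n_fft=1024, hop_length=256, center=False)

# Mel filter construction: librosa can use either the Slaney or the HTK formula,
# while tf.signal.linear_to_mel_weight_matrix always uses the HTK formula.
mel_slaney = librosa.filters.mel(sr=sr, n_fft=1024, n_mels=80, htk=False)
mel_htk    = librosa.filters.mel(sr=sr, n_fft=1024, n_mels=80, htk=True)
mel_tf = tf.signal.linear_to_mel_weight_matrix(
    num_mel_bins=80, num_spectrogram_bins=1024 // 2 + 1, sample_rate=sr)
```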
Finally, there are of course many parameters such as the STFT window size, hop length, the frequencies covered by the mel filters, etc., and changing these relative to what a reference implementation used may impact your results. Since different code repositories likely use slightly different parameters, I suppose the answer to your question "will every method do the operation (to create a mel spectrogram) in the same manner?" is "not really". At the end of the day, you will have to settle for one set of parameters either way...
Bonus: Why are these all only decoders and the encoder is always the same?
The direction Mel -> Audio is hard. Not even Mel -> ("normal") spectrogram is well-defined since the conversion to mel spectrum is lossy and cannot be inverted. Finally, converting a spectrogram to audio is difficult since the phase needs to be estimated. You may be familiar with methods like Griffin-Lim (again, librosa has it so you can try it out). These produce noisy, low-quality audio. So the research focuses on improving this process using powerful models.
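If you want to hear how rough that classical route sounds, a quick sketch with librosa (the parameters are placeholders):

```python
# Sketch of the classical (non-neural) mel -> audio route with librosa.
# Griffin-Lim phase estimation; expect noticeably noisy, low-quality output.
import librosa

y, sr = librosa.load("speech.wav", sr=22050)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)

# Approximately inverts the mel filterbank and runs Griffin-Lim internally.
y_reconstructed = librosa.feature.inverse.mel_to_audio(
    mel, sr=sr, n_fft=1024, hop_length=256)
```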
On the other hand, Audio -> Mel is simple, well-defined and fast. There is no need to define "custom encoders".
Now, a whole different question is whether mel spectrograms are a "good" encoding. Using methods like variational autoencoders, you could perhaps find better (e.g. more compact, less lossy) audio encodings. These would include custom encoders and decoders and you would not get away with standard librosa functions...

Extraction of sound features for goodness of pronunciation evaluation

I'm working on the concept of a mobile application for children's speech-therapy exercises (goodness-of-pronunciation evaluation). In the first iteration we want to implement evaluation of the correct pronunciation of one isolated consonant (the Russian equivalent of the English "sh" [ʃ] sound). The result could be "correct" or "incorrect" (or better, a score, e.g. from 1 to 5).
We have ~50 samples recorded by speech therapists and rated on a 5-point quality scale. Each sample contains a separate sound (0.5-2 seconds). We can get more samples in the future.
In general, I split this problem into the following steps:
Preprocess the sound signal (reduce noise, amplify/attenuate, remove silent periods);
Extract suitable signal features that correlate with consonant pronunciation quality. Features are a vector of numbers produced from a sound chunk (frame). Feature candidates: the frequency spectrum of the sound, MFCC coefficients, the amplitude spectrum, ... Another question is the feature frame size (time duration);
Use some classification algorithm ("machine learning" in general) to classify sounds based on these features from a training set (a rough sketch of steps 2 and 3 follows this list).
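For concreteness, this is how I currently imagine steps 2 and 3 (an illustration only, assuming librosa and scikit-learn; I have not validated this):

```python
# Minimal sketch of steps 2-3 (illustration only): per-sample mean MFCC
# features plus a simple classifier. librosa and scikit-learn are assumed,
# and sample_paths / sample_scores are placeholders for the rated recordings.
import librosa
import numpy as np
from sklearn.svm import SVC

def extract_features(path):
    y, sr = librosa.load(path, sr=None)
    # 13 MFCCs per 25 ms frame (10 ms hop), averaged over the whole sample
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
    return mfcc.mean(axis=1)

X = np.array([extract_features(p) for p in sample_paths])
y_labels = np.array(sample_scores)  # therapist scores from 1 to 5

clf = SVC(kernel="rbf")
clf.fit(X, y_labels)
```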
The main problem I am stuck with is the lack of a methodology for extracting features.
I have tried the MFCC approach, but it seems that the feature vector depends mostly on sound intensity variation during the sample (frankly, I drew that conclusion just by looking at plots of MFCC coefficients like https://drive.google.com/file/d/0BzBavyZHrcMlS0xLQ2phbmxoRVk/view?usp=sharing, where the X values are the 13 MFCC coefficients and each line represents one 25 ms sound frame).
I am not sure about pure spectrum characteristics because of the noisy nature of consonants.
A lot of papers and blog posts describe the problem of speech recognition in a word- and utterance-level context. My intuition says that I need a different approach for my problem.
Examples of good features for similar tasks and a general methodology for evaluating features would both be useful to me. Thanks.

Is the Canny algorithm enough to create a feature descriptor image to feed to an SVM?

I retrieve contours from images by using the Canny algorithm. Is it enough to have a descriptor image, feed it to an SVM, and find similarities? Or do I necessarily need other features like elongation, perimeter, and area?
I ask because, inspired by this example: http://scikit-learn.org/dev/auto_examples/plot_digits_classification.html, I fed my images first in grayscale and then as Canny edge maps, and in both cases my confusion matrix was full of zeros, as were the precision, recall, f1-score, and support measures.
My advice is:
Unless you have a low number of images in your database and/or the recognition task is really specific (not a random thing, for example), I would highly recommend applying one or more feature extractors such as SIFT, Fourier descriptors, Haralick features, or the Hough transform to extract more details, which can be summarised in a short vector.
Then you could apply an SVM on top of all this in order to get better accuracy.
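As one concrete illustration of that advice (a sketch only): Haralick-style gray-level co-occurrence texture features computed with a recent scikit-image, fed into a scikit-learn SVM. The image and label loading is a placeholder.

```python
# Sketch: Haralick-style texture features (gray-level co-occurrence matrix)
# plus an SVM, instead of feeding raw Canny edge images to the classifier.
# scikit-image / scikit-learn are assumed; `images` and `labels` are placeholders.
import numpy as np
from skimage.feature import graycomatrix, graycoprops
from sklearn.svm import SVC

def texture_features(image_uint8):
    # Co-occurrence matrix at a few offsets/angles, then summary statistics.
    glcm = graycomatrix(image_uint8, distances=[1, 2], angles=[0, np.pi / 2],
                        levels=256, symmetric=True, normed=True)
    props = ["contrast", "homogeneity", "energy", "correlation"]
    return np.hstack([graycoprops(glcm, p).ravel() for p in props])

# images: list of 2D uint8 grayscale arrays; labels: their classes
X = np.array([texture_features(img) for img in images])
clf = SVC(kernel="rbf").fit(X, labels)
```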

Where can I learn how to work with audio data formats?

I'm working on an openGL project that involves a speaking cartoon face. My hope is to play the speech (encoded as mp3s) and animate its mouth using the audio data. I've never really worked with audio before so I'm not sure where to start, but some googling led me to believe my first step would be converting the mp3 to pcm.
I don't really anticipate the need for any Fourier transforms, though that could be nice. The mouth really just needs to move around when there's audio (I was thinking of basing it on volume).
Any tips on how to implement something like this, or pointers to resources, would be much appreciated. Thanks!
-S
Whatever you do, you're going to need to decode the MP3s into PCM data first. There are a number of third-party libraries that can do this for you. Then, you'll need to analyze the PCM data and do some signal processing on it.
Automatically generating realistic lipsync data from audio is a very hard problem, and you're wise to not try to tackle it. I like your idea of simply basing it on the volume. One way you could compute the current volume is to use a rolling window of some size (e.g. 1/16 second), and compute the average power in the sound wave over that window. That is, at frame T, you compute the average power over frames [T-N, T], where N is the number of frames in your window.
Thanks to Parseval's theorem, we can easily compute the power in a wave without having to take the Fourier transform or anything complicated -- the average power is just the sum of the squares of the PCM values in the window, divided by the number of frames in the window. Then, you can convert the power into a decibel rating by dividing it by some base power (which can be 1 for simplicity), taking the logarithm, and multiplying by 10.
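Here is a sketch of that computation on a decoded mono PCM buffer (numpy assumed; the 44.1 kHz rate and 1/16-second window are placeholder choices):

```python
# Sketch of the rolling-window volume computation described above.
# `pcm` is a mono float array of decoded samples.
import numpy as np

sample_rate = 44100
window = sample_rate // 16          # ~1/16 second of samples

def volume_db_at(pcm, t):
    """Average power over frames [t - window, t], expressed in decibels."""
    start = max(0, t - window)
    chunk = pcm[start:t + 1]
    power = np.mean(chunk ** 2)             # sum of squares / number of frames
    return 10.0 * np.log10(power + 1e-12)   # dB relative to a base power of 1

# Example: drive the mouth from the volume at the current playback position.
# mouth_openness = np.interp(volume_db_at(pcm, current_frame), [-60, 0], [0, 1])
```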
