About the usage of vocoders - audio

I'm quite new to AI and I'm currently developing a model for non-parallel voice conversions. One confusing problem that I have is the use of vocoders.
So my model needs Mel spectrograms as the input and the current model that I'm working on is using the MelGAN vocoder (Github link) which can generate 22050Hz Mel spectrograms from raw wav files (which is what I need) and back. I recently tried WaveGlow Vocoder (PyPI link) which can also generate Mel spectrograms from raw wav files and back.
But for other models, such as WaveRNN, VocGAN and WaveGrad, there's no clear explanation of how Mel spectrograms are generated from wav files. Do most of these models not need a wav-to-Mel feature because they largely cater to TTS models like Tacotron? Or do they all have that feature and I'm just not aware of it?
A clarification would be highly appreciated.

How neural vocoders handle audio -> mel
Check e.g. this part of the MelGAN code: https://github.com/descriptinc/melgan-neurips/blob/master/mel2wav/modules.py#L26
Specifically, the Audio2Mel module simply uses standard methods to create log-magnitude mel spectrograms like this:
Compute the STFT by applying the Fourier transform to windows of the input audio,
Take the magnitude of the resulting complex spectrogram,
Multiply the magnitude spectrogram by a mel filter matrix. Note that they actually get this matrix from librosa!
Take the logarithm of the resulting mel spectrogram.
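To make the four steps above concrete, here is a minimal pure-Python sketch. It is not the MelGAN code (which uses torch and the librosa filter matrix). The function names, the tiny FFT size, the HTK-style mel formula, the log10 and the 1e-5 floor are all assumptions chosen for illustration, and the naive DFT is only practical for short signals.

```python
import cmath
import math


def hz_to_mel(hz):
    # HTK-style formula (an assumption; other conventions exist).
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters, one row per mel band, over n_fft // 2 + 1 bins."""
    max_mel = hz_to_mel(sample_rate / 2.0)
    # n_mels + 2 points equally spaced on the mel axis give the filter edges.
    edges = [mel_to_hz(max_mel * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        bank.append([max(0.0, min((f - lo) / (mid - lo), (hi - f) / (hi - mid)))
                     for f in bin_hz])
    return bank


def log_mel_spectrogram(audio, sample_rate, n_fft=64, hop=32, n_mels=8):
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_fft) for n in range(n_fft)]
    bank = mel_filterbank(n_mels, n_fft, sample_rate)
    frames = []
    for start in range(0, len(audio) - n_fft + 1, hop):
        frame = [audio[start + n] * window[n] for n in range(n_fft)]
        # Steps 1 and 2: Fourier transform of the windowed frame, then magnitude.
        mags = [abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                        for n in range(n_fft)))
                for k in range(n_fft // 2 + 1)]
        # Step 3: multiply the magnitudes by the mel filter matrix.
        mel = [sum(w * x for w, x in zip(row, mags)) for row in bank]
        # Step 4: logarithm, with a small floor to avoid log(0).
        frames.append([math.log10(max(v, 1e-5)) for v in mel])
    return frames
```

Each returned row is one time frame with n_mels values. No padding is applied here, so the number of frames is 1 + (len(audio) - n_fft) // hop.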
Regarding the confusion
Your confusion might stem from the fact that, usually, authors of Deep Learning papers only mean their mel-to-audio "decoder" when they talk about "vocoders" -- the audio-to-mel part is always more or less the same. I say this might be confusing since, to my understanding, the classical meaning of the term "vocoder" includes both an encoder and a decoder.
Unfortunately, these methods will not always work exactly in the same manner as there are e.g. different methods to create the mel filter matrix, different padding conventions etc.
For example, librosa.stft has a center argument that will pad the audio before applying the STFT, while tf.signal.stft does not have this (it would require manual padding beforehand).
An example of the different methods to create mel filters would be the htk argument in librosa.filters.mel, which switches between the "HTK" method and "Slaney". Again taking TensorFlow as an example, tf.signal.linear_to_mel_weight_matrix does not support this argument and always uses the HTK method. Unfortunately, I am not familiar with torchaudio, so I don't know if you need to be careful there as well.
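To show how much these conventions can matter, here is a small sketch comparing the two mel scales and the frame counts with and without centre padding. The function names are made up for illustration, and the formulas are my understanding of the HTK and Slaney conventions, so check them against the library you actually use.

```python
import math


def hz_to_mel_htk(hz):
    # HTK convention: logarithmic everywhere.
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def hz_to_mel_slaney(hz):
    # Slaney convention: linear below 1 kHz, logarithmic above.
    f_sp = 200.0 / 3.0
    min_log_hz = 1000.0
    if hz < min_log_hz:
        return hz / f_sp
    min_log_mel = min_log_hz / f_sp
    logstep = math.log(6.4) / 27.0
    return min_log_mel + math.log(hz / min_log_hz) / logstep


def frame_count(n_samples, n_fft, hop, center):
    # With centring, n_fft // 2 samples of padding are added on each side.
    if center:
        n_samples += 2 * (n_fft // 2)
    return 1 + (n_samples - n_fft) // hop
```

For one second of 22050 Hz audio with n_fft=1024 and hop=256, the centred version yields 87 frames and the uncentred one 83, so mel spectrograms produced with one convention do not even have the same shape as those produced with the other.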
Finally, there are of course many parameters such as the STFT window size, hop length, the frequencies covered by the mel filters, etc., and changing these relative to what a reference implementation used may impact your results. Since different code repositories likely use slightly different parameters, I suppose the answer to your question "will every method do the operation (to create a mel spectrogram) in the same manner?" is "not really". At the end of the day, you will have to settle for one set of parameters either way...
Bonus: Why are these all only decoders and the encoder is always the same?
The direction Mel -> Audio is hard. Not even Mel -> ("normal") spectrogram is well-defined since the conversion to mel spectrum is lossy and cannot be inverted. Finally, converting a spectrogram to audio is difficult since the phase needs to be estimated. You may be familiar with methods like Griffin-Lim (again, librosa has it so you can try it out). These produce noisy, low-quality audio. So the research focuses on improving this process using powerful models.
On the other hand, Audio -> Mel is simple, well-defined and fast. There is no need to define "custom encoders".
Now, a whole different question is whether mel spectrograms are a "good" encoding. Using methods like variational autoencoders, you could perhaps find better (e.g. more compact, less lossy) audio encodings. These would include custom encoders and decoders and you would not get away with standard librosa functions...

Related

What is a Coded Feature in an Auto Encoder? Is there any formula used to calculate the Coded Feature?

I am working on an autoencoder. I have drawn graphs of the input values and the reconstructed values using Matplotlib. How do I draw a graph of the coded features, and what are coded features?
An auto-encoder is a type of neural network where you combine an encoder and a decoder. Typically the output of the encoder has fewer dimensions than the input, and the decoder takes this reduced representation to recreate the original input as closely as possible. You can think of the output of the encoder (also named 'latent vector') as a lossy compression of the input. There are however multiple other use-cases and types of auto-encoders. You can read more about them on Wikipedia and this blogpost on towardsdatascience.com also gives a very good explanation with examples.
If you want an explanation on how to draw graphs, you will need to give us more information on your input dataset and maybe show us your graphs of the input and output, because I'm not sure what you mean.
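As for a formula: the coded features are simply whatever the encoder outputs for a given input, so the "formula" is the encoder itself with its learned weights. Below is a minimal sketch assuming a single dense layer with tanh activation. The weights are invented for illustration and are not from any trained model.

```python
import math


def encode(x, weights, biases):
    """Map an input vector to its coded features (latent vector)."""
    return [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


# Hypothetical "trained" encoder: 4 inputs -> 2 coded features.
WEIGHTS = [[0.5, -0.5, 0.5, -0.5],
           [0.25, 0.25, 0.25, 0.25]]
BIASES = [0.0, 0.0]

coded = encode([1.0, 1.0, 1.0, 1.0], WEIGHTS, BIASES)
```

The resulting list can be plotted like any other sequence of numbers, the same way you already plot the input and reconstructed values.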

Extraction of sound features for goodness of pronunciation evaluation

I'm working on the concept for a mobile application for children's speech therapy (logopaedic) exercises, specifically goodness of pronunciation evaluation. In the first iteration we want to implement evaluation of the correct pronunciation of one isolated consonant (the Russian equivalent of the English “sh” [ʃ] sound). The result could be “correct” or “incorrect” (or, better, a score, e.g. from 1 to 5).
We have ~50 samples recorded by speech therapists and rated on a 5-point quality scale. Each sample contains a single sound (0.5-2 seconds). We can get more samples in the future.
In general, I split this problem into the following steps:
Preprocess sound signal (reduce noise, amplify/attenuate, remove silent periods);
Extract proper signal features which are correlated with consonant pronunciation quality. Features are vectors of numbers produced from a sound chunk (frame). Feature candidates: frequency spectrum of the sound, MFCC coefficients, amplitude spectrum, ... Another question is the feature frame size (time duration).
Use some classification algorithm ("machine learning" in general) to classify based on features from the training set of sounds.
The main problem I am stuck on is the lack of a methodology for extracting features.
I have tried the MFCC approach, but it seems that the feature vector depends more on the variation in sound intensity during the sample. (Frankly, I drew that conclusion just by looking at plots of MFCC coefficients like https://drive.google.com/file/d/0BzBavyZHrcMlS0xLQ2phbmxoRVk/view?usp=sharing where the X values are the 13 MFCC coefficients and each line represents one sound frame of 25 ms.)
I am not sure about pure spectrum characteristics, because of the noise-like nature of consonants.
A lot of papers and blog posts describe the problem of speech recognition in a word and utterance context. My intuition says that I need a different approach for my problem.
Examples of good features for similar tasks and a general methodology for evaluating features would both be useful to me. Thanks.

How can audio data be abstracted for comparison purposes?

I am working on a project involving machine learning and data comparison.
For the purpose of this project, I am feeding abstracted video data to a neural network.
Now, abstracting image data is quite simple. I can take still-frames at certain points in the video, scale them down into 5 by 5 pixels (or any other manageable resolution) and get the pixel values for analysis.
The resulting data gives a unique, small and somewhat data-rich sample (even 5 samples of 5x5 px are enough to distinguish a drama from a nature documentary, etc).
However, I am stuck on the audio part. Since audio consists of samples and each sample by itself has no inherent meaning, I can't find a way to abstract audio down into processable blocks.
Are there common techniques for this process? If not, what metrics can audio data be quantified and abstracted in?
The process you require is audio feature extraction. A large number of feature detection algorithms exist, usually specialising in signals that are music or speech.
For music, chroma, rhythm and harmonic distribution are all features you might extract, along with many more.
Typically, audio feature extraction algorithms work at a fairly macro level - that is to say thousands of samples at a time.
A good place to get started is Sonic Visualiser, which is a plug-in host for audio visualisation algorithms, many of which are feature extractors.
YAAFE may also have some useful stuff in it.
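To illustrate what "working at a macro level" means, here is a minimal sketch that reduces each block of samples to two numbers: RMS level and zero-crossing rate. These two features and the block size are just assumptions picked for simplicity. Real extractors compute far richer descriptors.

```python
import math


def block_features(samples, block_size=1024):
    """Return one (rms, zero_crossing_rate) pair per block of samples."""
    features = []
    for start in range(0, len(samples) - block_size + 1, block_size):
        block = samples[start:start + block_size]
        # RMS is a rough measure of loudness for the block.
        rms = math.sqrt(sum(x * x for x in block) / block_size)
        # Zero-crossing rate is a rough measure of how "bright" or noisy it is.
        crossings = sum(1 for a, b in zip(block, block[1:]) if (a < 0) != (b < 0))
        features.append((rms, crossings / (block_size - 1)))
    return features
```

A few such pairs per clip give a small, fixed-size summary in the same spirit as the 5x5 pixel frames described in the question.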

FFTW for exponential frequency axis

I have a group of related questions regarding FFTW and audio analysis on Linux.
What is the easiest-to-use, most comprehensive audio library in Linux/Ubuntu that will allow me to decode any of a variety of audio formats (MP3, etc.) and acquire a buffer of raw 16-bit PCM values? gstreamer?
I intend on taking that raw buffer and feeding it to FFTW to acquire frequency-domain data (without complex information or phase information). I think I should use one of their "r2r" methods, probably the DHT. Is this correct?
It seems that FFTW's output frequency axis is discretized in linear increments that are based on the buffer length. It further seems that I can't change this discretization within FFTW so I must do it after the DHT. Instead of a linear frequency axis, I need an exponential axis that follows 2^(i/12). I think I'll have to take the DHT output and run it through some custom anti-aliasing function. Is there a Linux library to do such anti-aliasing? If not, would a basic cosine-based anti-aliasing function work?
Thanks.
This is an age old problem with FFTs and working with audio - ideally we want a log frequency scale for audio but the DFT/FFT has a linear scale. You will need to choose an FFT size that gives sufficient resolution at the low end of your frequency range, and then accumulate bins across the frequency range of interest to give yourself a pseudo-logarithmic representation. There are more complex schemes, but essentially it all boils down to the same thing.
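The bin accumulation described above could look roughly like this. It is a sketch assuming you already have the magnitude of each linear FFT bin. The function name, the 55 Hz starting frequency and the number of bands are arbitrary choices for illustration.

```python
import math


def semitone_bands(magnitudes, sample_rate, n_fft, f_min=55.0, n_bands=48):
    """Sum linear FFT magnitude bins into bands spaced by a factor of 2^(1/12)."""
    bin_hz = sample_rate / n_fft
    bands = [0.0] * n_bands
    for k, mag in enumerate(magnitudes):
        freq = k * bin_hz
        if freq <= 0.0:
            continue  # skip the DC bin, log2(0) is undefined
        # Index of the nearest band centre f_min * 2^(index / 12).
        index = int(round(12.0 * math.log2(freq / f_min)))
        if 0 <= index < n_bands:
            bands[index] += mag
    return bands
```

If the FFT is too short, several low bands receive no bins at all, which is why the FFT size has to be chosen for the low end of the range.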
I've seen libsndfile used all over the place:
http://www.mega-nerd.com/libsndfile/
It's LGPL too. It can read pretty much all the open source and lossless audio formats you would care about. It doesn't do MP3, however, because of licensing costs.

Real time pitch detection

I'm trying to do real-time pitch detection of a user's singing, but I'm running into a lot of problems. I've tried lots of methods, including FFT (FFT Problem (Returns random results)) and autocorrelation (Autocorrelation pitch detection returns random results with mic input), but I can't seem to get any method to give a good result. Can anyone suggest a method for real-time pitch tracking, or how to improve on a method I already have? I can't seem to find any good C / C++ methods for real-time pitch detection.
Thanks,
Niall.
Edit: Just to note, I've checked that the mic input data is correct, and that when using a sine wave the results are more or less the correct pitch.
Edit: Sorry this is late, but at the moment I'm visualizing the autocorrelation by taking each index and value out of the results array and plotting the index on the X axis and the value on the Y axis (both are divided by 100000 or something, and I'm using OpenGL). Plugging the data into a VST host and using VST plugins isn't an option for me. At the moment, it just looks like some random dots. Am I doing it correctly? If not, can you please point me towards some code for doing it, or help me understand how to visualize the raw audio data and autocorrelation data?
Taking a step back... To get this working you MUST figure out a way to plot intermediate steps of this process. What you're trying to do is not particularly hard, but it is error prone and fiddly. Clipping, windowing, bad wiring, aliasing, DC offsets, reading the wrong channels, the weird FFT frequency axis, impedance mismatches, frame size errors... who knows. But if you can plot the raw data, and then plot the FFT, all will become clear.
I found several open source implementations of real-time pitch tracking
dywapitchtrack uses a wavelet-based algorithm
"Realtime C# Pitch Tracker" uses a modified autocorrelation approach (now removed from CodePlex; try searching on GitHub)
aubio (mentioned by piem; several algorithms are available)
There are also some pitch trackers out there which might not be designed for real-time, but may be usable that way for all I know, and could also be useful as a reference to compare your real-time tracker to:
Praat is an open source package sometimes used for pitch extraction by linguists and you can find the algorithm documented at http://www.fon.hum.uva.nl/paul/praat.html
Snack and WaveSurfer also contain a pitch extractor
I know this answer isn't going to make everyone happy but here goes.
This stuff is hard, very hard. Firstly go read as many tutorials as you can find on FFT, Autocorrelation, Wavelets. Although I'm still struggling with DSP I did get some insights from the following.
https://www.coursera.org/course/audio the course isn't running at the moment but the videos are still available.
http://miracle.otago.ac.nz/tartini/papers/Philip_McLeod_PhD.pdf thesis about the development of a pitch recognition algorithm.
http://dsp.stackexchange.com a whole site dedicated to digital signal processing.
If like me you didn't do enough maths to completely follow the tutorials don't give up as some of the diagrams and examples still helped me to understand what was going on.
Next is test data and testing. Write yourself a library that generates test files to use in checking your algorithm/s.
1) A super simple pure sine wave generator. Say you are looking at writing YAT (Yet Another Tuner): use your sine generator to create a series of files around 440Hz, say from 420-460Hz in varying increments, and see how sensitive and accurate your code is. Can it resolve to within 5Hz, 1Hz, finer still?
2) Then upgrade your sine wave generator so that it adds a series of weaker harmonics to the signal.
3) Next are real world variations on harmonics. Whilst for most stringed instruments you'll see a series of harmonics as simple multiples of the fundamental frequency F0, for instruments like the clarinet, because of the way the air behaves in the bore, the even harmonics will be missing or very weak. And for some instruments F0 is missing but can be determined from the distribution of the other harmonics. F0 being what the human ear perceives as pitch.
4) Throw in some deliberate distortion by shifting the harmonic peak frequencies up and down in an irregular manner.
The point being that if you are creating files with known results then it's easier to verify that what you are building actually works, bugs aside of course.
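A test-file generator along the lines of steps 1 to 3 can be very small. The sketch below uses only the Python standard library. The function names and the default harmonic gains are assumptions, and the irregular distortion of step 4 is left out.

```python
import math
import struct
import wave


def harmonic_tone(f0, sample_rate=44100, seconds=1.0, harmonic_gains=(1.0, 0.5, 0.25)):
    """Fundamental plus weaker harmonics, scaled so the output stays within [-1, 1]."""
    n_samples = int(round(sample_rate * seconds))
    total = sum(abs(g) for g in harmonic_gains)
    samples = []
    for i in range(n_samples):
        t = i / sample_rate
        # harmonic_gains[0] is the fundamental, harmonic_gains[1] the 2nd harmonic, etc.
        value = sum(g * math.sin(2.0 * math.pi * f0 * (h + 1) * t)
                    for h, g in enumerate(harmonic_gains))
        samples.append(value / total)
    return samples


def write_wav(path, samples, sample_rate=44100):
    """Write mono 16-bit PCM."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples))
```

Passing harmonic_gains=(1.0,) gives the pure sine of step 1, and something like (1.0, 0.0, 0.3, 0.0, 0.1) approximates a tone with weak even harmonics. A gain of 0.0 for the first entry simulates a missing fundamental.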
There are also a number of "libraries" out there containing sound samples.
https://freesound.org from the Coursera series mentioned above.
http://theremin.music.uiowa.edu/MIS.html
Next be aware that your microphone is not perfect and unless you have spent thousands of dollars on it will have a fairly variable frequency response range. In particular if you are working with low notes then cheaper microphones, read the inbuilt ones in your PC or Phone, have significant rolloff starting at around 80-100Hz. For reasonably good external ones you might get down to 30-40Hz. Go find the data on your microphone.
You can also check what happens by playing the tone through speakers and then recording with your favourite microphone. But of course now we are talking about 2 sets of frequency response curves.
When it comes to performance there are a number of freely available libraries out there although do be aware of the various licensing models.
Above all don't give up after your first couple of tries. Best of luck.
Here's the C++ source code for an unusual two-stage algorithm that I devised which can do real-time pitch detection on polyphonic MP3 files while they are being played on Windows. This free application (PitchScope Player, available on the web) is frequently used to detect the notes of a guitar or saxophone solo in an MP3 recording. The algorithm is designed to detect the most dominant pitch (a musical note) at any given moment in time within an MP3 music file. Note onsets are inferred from a significant change in the most dominant pitch at any given moment during the MP3 recording.
When a single key is pressed upon a piano, what we hear is not just one frequency of sound vibration, but a composite of multiple sound vibrations occurring at different mathematically related frequencies. The elements of this composite of vibrations at differing frequencies are referred to as harmonics or partials. For instance, if we press the Middle C key on the piano, the individual frequencies of the composite's harmonics will start at 261.6 Hz as the fundamental frequency, 523 Hz would be the 2nd Harmonic, 785 Hz would be the 3rd Harmonic, 1046 Hz would be the 4th Harmonic, etc. The later harmonics are integer multiples of the fundamental frequency, 261.6 Hz ( ex: 2 x 261.6 = 523, 3 x 261.6 = 785, 4 x 261.6 = 1046 ). Linked at the bottom, is a snapshot of the actual harmonics which occur during a polyphonic MP3 recording of a guitar solo.
Instead of an FFT, I use a modified DFT transform, with logarithmic frequency spacing, to first detect these possible harmonics by looking for frequencies with peak levels (see diagram below). Because of the way that I gather data for my modified Log DFT, I do NOT have to apply a windowing function to the signal, nor do I have to overlap and add. And I have created the DFT so its frequency channels are logarithmically located in order to directly align with the frequencies where harmonics are created by the notes on a guitar, saxophone, etc.
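For readers unfamiliar with the idea of logarithmically spaced frequency channels, here is a generic sketch: a plain DFT evaluated only at semitone-spaced frequencies. This is NOT the modified transform used in PitchScope (see the linked source for that). The function name and default parameters are assumptions for illustration.

```python
import cmath
import math


def log_spaced_dft(samples, sample_rate, f_min=110.0, n_bins=36, bins_per_octave=12):
    """Return (frequency, magnitude) pairs at logarithmically spaced frequencies."""
    result = []
    for b in range(n_bins):
        freq = f_min * 2.0 ** (b / bins_per_octave)
        # Correlate the signal with a complex sinusoid at exactly this frequency.
        acc = sum(s * cmath.exp(-2j * math.pi * freq * n / sample_rate)
                  for n, s in enumerate(samples))
        result.append((freq, abs(acc) / len(samples)))
    return result
```

With bins_per_octave=12, every channel sits on a note of the equal-tempered scale, so a peak can be read off directly as a note rather than interpolated from linear FFT bins.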
Now being retired, I have decided to release the source code for my pitch detection engine within a free demonstration app called PitchScope Player. PitchScope Player is available on the web, and you can download the executable for Windows to see my algorithm at work on an MP3 file of your choosing. The GitHub link below will lead you to my full source code, where you can view how I detect the harmonics with a custom Logarithmic DFT transform, and then look for partials (harmonics) whose frequencies satisfy the correct integer relationship which defines a 'pitch'.
My Pitch Detection Algorithm is actually a two-stage process: a) First the ScalePitch is detected ('ScalePitch' has 12 possible pitch values: {E, F, F#, G, G#, A, A#, B, C, C#, D, D#} ) b) and after ScalePitch is determined, then the Octave is calculated by examining all the harmonics for the 4 possible Octave-Candidate notes. The algorithm is designed to detect the most dominant pitch (a musical note) at any given moment in time within a polyphonic MP3 file. That usually corresponds to the notes of an instrumental solo. Those interested in the C++ source code for my Two-Stage Pitch Detection algorithm might want to start at the Estimate_ScalePitch() function within the SPitchCalc.cpp file at GitHub.com.
https://github.com/CreativeDetectors/PitchScope_Player
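To make the split between the 12 pitch classes and the octave concrete, here is a small sketch that maps a fundamental frequency to a note name and octave. It is not taken from the PitchScope source. It assumes equal temperament, A4 = 440 Hz and the common MIDI convention where note 60 is C4.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def pitch_class_and_octave(freq_hz, a4_hz=440.0):
    """Return (note name, octave) of the nearest equal-tempered note."""
    # MIDI note number: 69 is A4, each semitone is a factor of 2^(1/12).
    midi = int(round(69 + 12 * math.log2(freq_hz / a4_hz)))
    return NOTE_NAMES[midi % 12], midi // 12 - 1
```

For the Middle C example above, pitch_class_and_octave(261.6) gives ("C", 4), and its 2nd harmonic at 523 Hz gives ("C", 5): same pitch class, different octave. That is why deciding the octave is a separate step.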
Below is the image of a Logarithmic DFT (created by my C++ software) for 3 seconds of a guitar solo on a polyphonic mp3 recording. It shows how the harmonics appear for individual notes on a guitar, while playing a solo. For each note on this Logarithmic DFT we can see its multiple harmonics extending vertically, because each harmonic will have the same time-width. After the Octave of the note is determined, then we know the frequency of the Fundamental.
I had a similar problem with microphone input on a project I did a few years back - turned out to be due to a DC offset.
Make sure you remove any bias before attempting FFT or whatever other method you are using.
It is also possible that you are running into headroom or clipping problems.
Graphs are the best way to diagnose most problems with audio.
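Removing the bias can be as simple as subtracting the mean of each buffer. The sketch below also includes a crude clipping check. Both function names are made up, and the full-scale value assumes 16-bit samples.

```python
def remove_dc_offset(samples):
    """Subtract the mean so the signal is centred on zero."""
    if not samples:
        return []
    mean = sum(samples) / len(samples)
    return [s - mean for s in samples]


def clipped_fraction(samples, full_scale=32767):
    """Fraction of samples sitting at (or beyond) full scale."""
    if not samples:
        return 0.0
    return sum(1 for s in samples if abs(s) >= full_scale) / len(samples)
```

For a continuous real-time stream, a per-buffer mean can cause small steps between buffers. A slow high-pass filter is the usual alternative, but the mean is good enough to find out whether DC offset is the culprit.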
Take a look at this sample application:
http://www.codeproject.com/KB/audio-video/SoundCatcher.aspx
I realize the app is in C# and you need C++, and I realize this is .NET/Windows and you're on a Mac... But I figured his FFT implementation might be a starting reference point. Try to compare your FFT implementation to his. (His is the iterative, breadth-first version of the Cooley-Tukey FFT.) Are they similar?
Also, the "random" behavior you're describing might be because you're grabbing data returned by your sound card directly without assembling the values from the byte-array properly. Did you ask your sound card to sample 16 bit values, and then gave it a byte-array to store the values in? If so, remember that two consecutive bytes in the returned array make up one 16-bit audio sample.
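As an illustration of the byte assembly, here is how the conversion looks in Python. The same idea applies in C or C++ by combining two bytes into one int16_t. The function name is made up, and little-endian byte order is an assumption that holds for most PC audio APIs but should be checked for your device.

```python
import struct


def bytes_to_int16(raw, little_endian=True):
    """Combine consecutive byte pairs into signed 16-bit samples."""
    count = len(raw) // 2
    fmt = ("<" if little_endian else ">") + str(count) + "h"
    # Any trailing odd byte is ignored.
    return list(struct.unpack(fmt, raw[:count * 2]))
```

Treating each byte as its own sample, or getting the byte order wrong, turns a clean sine wave into something that looks like noise, which matches the "random" results described.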
Java code for a real-time pitch detector is available at http://code.google.com/p/freqazoid/.
It works fairly well on any computer running post-2008 real-time Java. The project has been dropped and could be picked up by any interested party. Contact me if you want further details.
Check out aubio, an open source library which includes several state-of-the-art methods for pitch tracking.
I have asked a similar question here:
C/C++/Obj-C Real-time algorithm to ascertain Note (not Pitch) from Vocal Input
EDIT:
Performous contains a C++ module for realtime pitch detection
Also Yin Pitch-Tracking algorithm
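For reference, the core of the YIN idea fits in a few lines. The sketch below is a simplified version: it implements the difference function, the cumulative mean normalised difference and the absolute threshold, but leaves out the parabolic interpolation and later refinement steps of the published algorithm. The function name and default parameters are assumptions, and the pure-Python loops are far too slow for real-time use, so treat it as a reference to port to C or C++.

```python
def yin_pitch(samples, sample_rate, f_min=80.0, f_max=1000.0, threshold=0.1):
    """Estimate the fundamental frequency in Hz, or return None if unvoiced."""
    tau_min = max(1, int(sample_rate / f_max))
    tau_max = int(sample_rate / f_min)
    window = len(samples) - tau_max
    if window <= 0:
        raise ValueError("need more samples than the longest lag")
    # Difference function: how badly the signal matches itself shifted by tau.
    diff = [0.0] * (tau_max + 1)
    for tau in range(1, tau_max + 1):
        diff[tau] = sum((samples[i] - samples[i + tau]) ** 2 for i in range(window))
    # Cumulative mean normalised difference: divide by the running average.
    cmnd = [1.0] * (tau_max + 1)
    running = 0.0
    for tau in range(1, tau_max + 1):
        running += diff[tau]
        cmnd[tau] = diff[tau] * tau / running if running > 0.0 else 1.0
    # Take the first dip below the threshold and follow it to its local minimum.
    tau = tau_min
    while tau < tau_max:
        if cmnd[tau] < threshold:
            while tau + 1 <= tau_max and cmnd[tau + 1] < cmnd[tau]:
                tau += 1
            return sample_rate / tau
        tau += 1
    return None
```

Because the lag is an integer number of samples, the resolution gets coarse for high pitches. Interpolating around the minimum is the usual remedy.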
You could do real time pitch detection, be it of a singer's voice, with TarsosDSP
https://github.com/JorenSix/TarsosDSP
just in case anyone hasn't heard of it yet :-)
Can you adapt anything from instrument tuners? My delightfully compact guitar tuner is able to detect the pitch of the strings pretty well. I see this reference to a piano tuner which explains an algorithm to some extent.
Here are some open source libraries that implement pitch detection:
WORLD : speech analysis/synthesis toolkit. This is especially suitable if your source signal is voice.
aubio : audio feature extraction library. Implements many pitch detection algorithms.
Pitch detection : a collection of pitch detection algorithms implemented in C++.
dywapitchtrack : a high quality pitch detection algorithm.
YIN : another implementation of the YIN algorithm in a single C++ source file.

Resources