Detect multiple voices without speech recognition - audio

Is there a way to detect in real time whether multiple people are speaking? Do I need a voice recognition API for that?
I don't want to separate the audio, and I don't want to transcribe it either. My approach would be to record frequently using one mic (-> mono) and then analyse those recordings. But how would I then detect and distinguish voices? I'd narrow it down by looking only at relevant frequencies, but then...
I do understand that this is no trivial undertaking. That's why I hope there's an API out there capable of doing this out of the box, preferably a mobile/web-friendly API.
Now this might sound like a Christmas shopping list, but as mentioned I do not need to know anything about the content. So my guess is that full-fledged speech recognition would take a heavy toll on performance.

Most similar problems (adult/child classifier, speech/music classifier, single voice/voice mixture classifier) are standard machine learning problems. You can solve them with a classifier such as a GMM (Gaussian mixture model). You only need to construct training data for your task:
- Take some clean single-speaker recordings; you can download audiobooks, for example.
- Prepare mixed data by mixing the clean recordings.
- Train one GMM on the clean data and another on the mixed data.
- Compare the likelihoods from the clean-speech GMM and the mixed-speech GMM, and decide whether a mixture is present from the ratio of the two.
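The steps above can be sketched roughly as follows. This is only an illustration under loud assumptions: it uses scikit-learn's GaussianMixture (a library this answer does not name), and the "features" are random numbers standing in for real per-frame audio features such as MFCCs. The helper names fake_features and is_mixture, the number of components, and the threshold are all hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

def fake_features(n_frames, spread):
    """Stand-in for per-frame features (e.g. 13 MFCCs). NOT real audio."""
    return rng.normal(0.0, spread, size=(n_frames, 13))

# Assumption for the demo: "clean" frames are tightly clustered and
# "mixed" frames are more spread out. Real features come from recordings.
clean_train = fake_features(2000, spread=1.0)
mixed_train = fake_features(2000, spread=3.0)

# One GMM per class, as in the steps above.
gmm_clean = GaussianMixture(n_components=4, covariance_type="diag",
                            random_state=0).fit(clean_train)
gmm_mixed = GaussianMixture(n_components=4, covariance_type="diag",
                            random_state=0).fit(mixed_train)

def is_mixture(frames, threshold=0.0):
    """True if the mixed-speech model explains the frames better.

    score() returns the average per-frame log-likelihood, so the
    difference below is a log-likelihood ratio.
    """
    log_ratio = gmm_mixed.score(frames) - gmm_clean.score(frames)
    return bool(log_ratio > threshold)

print(is_mixture(fake_features(200, spread=1.0)))  # clean-like frames
print(is_mixture(fake_features(200, spread=3.0)))  # mixed-like frames
```

In a real setup the threshold would be tuned on held-out recordings, and the decision would be made on short windows of frames to get near-real-time behaviour.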
You can find some code samples here:
https://github.com/littleowen/Conceptor
For example you can try
https://github.com/littleowen/Conceptor/blob/master/Gender.ipynb

Related

speech to text training for impaired voice

I want to train and use an ML-based personal voice-to-text converter for a highly impaired voice, for a small set of 300-400 words. This is to be used by people with a voice impairment. It cannot be generic, because each person will pronounce the words in their own way, depending on their type of impairment.
I would like to know whether there are any ML engines that allow for such training. If not, what is the best approach to go about it?
Thanks
Most speech recognition engines support training (wav2letter, deepspeech, espnet, kaldi, etc.); you just need to feed in the data. The only issue is that you need a lot of data to train reliably (on the order of 1000 samples for each word). You can check the Google Speech Commands dataset for an example of how to train from scratch.
Since the training dataset will be pretty small in your case and will consist of just a few samples, you can probably start with an existing pretrained model and fine-tune it on your samples to get the best accuracy. You need to look at "few-shot learning" setups.
You can probably look at the wav2vec 2.0 pretrained model; it should be effective for such learning. You can find examples and commands for fine-tuning and inference here.
You can also try fine-tuning Jasper models on Google Speech Commands with NVIDIA NeMo. It might be a little less effective, but it could still work and should be easier to set up.
I highly recommend watching episode two of the first season of the YouTube Originals series "The Age of A.I.".
Basically, Google has already done this for people with impaired voices who can't really form normal words. It is very interesting, and it talks a little about how they did it, and are still doing it, with ML technologies.

Which Spectrogram best represents features of an audio file for CNN based model?

I am looking to understand the various spectrograms used for audio analysis. I want to split an audio file into 10-second chunks, generate a spectrogram for each, and train a CNN model on those images to see whether they are good or bad.
I have looked at linear, log, mel, etc. and read somewhere that a mel-based spectrogram is the best choice for this, but I found no properly verifiable information. I have used the following simple code to generate a mel spectrogram.
import numpy as np
import librosa
import librosa.display
y, sr = librosa.load(r'C:\Users\Tej\Desktop\NoiseWork\NoiseOnly\song.wav')
S = librosa.feature.melspectrogram(y=y, sr=sr)
librosa.display.specshow(librosa.power_to_db(S, ref=np.max))
My question is: which spectrogram best represents the features of an audio file for training a CNN? I have used linear spectrograms, but for some audio files they seem to be the same.
To add to what has been stated, I recommend reading through A Comparison of Audio Signal Preprocessing Methods for Deep Neural Networks on Music Tagging by Keunwoo Choi, György Fazekas, Kyunghyun Cho, and Mark Sandler.
For their data, they achieved nearly identical classification accuracy between simple STFTs and mel spectrograms. So mel spectrograms seem to be the clear winner for dimension reduction, if you don't mind the preprocessing. The authors also found, as jonner mentions, that log-scaling (essentially converting amplitude to a dB scale) improves accuracy. You can easily do this with Librosa (using your code) like this:
import librosa
y, sr = librosa.load(r'C:\Users\Tej\Desktop\NoiseWork\NoiseOnly\song.wav')
S = librosa.feature.melspectrogram(y=y, sr=sr)
S_db = librosa.power_to_db(S)
As for normalization after dB-scaling, that seems hit or miss depending on your data. In the paper above, the authors found almost no difference between the various normalization techniques for their data.
One last thing that should be mentioned is a somewhat new method called Per-Channel Energy Normalization. I recommend reading Per-Channel Energy Normalization: Why and How by Vincent Lostanlen, Justin Salamon, Mark Cartwright, Brian McFee, Andrew Farnsworth, Steve Kelling, and Juan Pablo Bello. Unfortunately, there are some parameters that need adjusting depending on the data, but in many cases it seems to do as well as or better than log-mel spectrograms. You can implement it in Librosa like this:
import librosa
y, sr = librosa.load(r'C:\Users\Tej\Desktop\NoiseWork\NoiseOnly\song.wav')
S = librosa.feature.melspectrogram(y=y, sr=sr)
S_pcen = librosa.pcen(S)
Although, like I mentioned, there are parameters within pcen that need adjusting! Here is Librosa's documentation on PCEN to get you started if you are interested.
Log-scaled mel spectrograms are the current "standard" for use with convolutional neural networks. They were the most commonly used representation in the Audio Event Detection and Audio Scene Classification literature between 2015 and 2018.
To be more invariant to amplitude changes, normalization is usually applied, either to entire clips or to the windows being classified. Mean/std normalization generally works fine.
But from the perspective of a CNN, there is relatively little difference between the spectrogram variants. So switching between them is unlikely to fix your issue if two or more spectrograms are basically the same.
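As a rough illustration of the log-scaling and mean/std normalization discussed above, here is a minimal NumPy-only sketch. The array S is random data standing in for a real mel power spectrogram, and the helper names log_scale and standardize are made up for this example; it is not what Librosa does internally in every detail.

```python
import numpy as np

def log_scale(power_spec, amin=1e-10):
    """Convert a power spectrogram to decibels: 10 * log10(power)."""
    # amin avoids log(0) for silent bins (assumed floor value).
    return 10.0 * np.log10(np.maximum(power_spec, amin))

def standardize(spec_db, eps=1e-8):
    """Zero-mean, unit-variance normalization over one whole clip."""
    return (spec_db - spec_db.mean()) / (spec_db.std() + eps)

# Hypothetical stand-in for a mel spectrogram: 128 bands x 431 frames.
rng = np.random.default_rng(0)
S = rng.random((128, 431)) ** 2 + 1e-6

S_norm = standardize(log_scale(S))

# The same clip with 100x the power (20 dB louder) normalizes to the
# same values, which is the amplitude invariance mentioned above.
S_louder_norm = standardize(log_scale(S * 100.0))
print(np.allclose(S_norm, S_louder_norm))
```

Normalizing per clip, as here, is only one option; statistics computed over the whole training set or per classification window are equally plausible choices.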

Building GMM using SIDEKIT 1.2

I have two-dimensional data in the form of a text file. I have to build a GMM based on this data using Sidekit 1.2.
Which function should I use to estimate the parameters of the Gaussian model (mean, covariance matrix, weights, etc.)?
Can you please provide a small example with your own set of (x, y) data and build a GMM using that?
Any help would be greatly appreciated.
Sidekit is a toolkit built mainly for the task of speaker recognition, and its framework (like that of other similar toolkits) relies on the training data consisting of audio files in .wav, .sph or raw PCM format.
If you're just building a GMM and don't plan to use it for speaker recognition experiments, I would recommend using another toolkit for general statistical purposes (scikit-learn might be a good choice).
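For the general-purpose route, a minimal sketch with scikit-learn's GaussianMixture could look like this. The (x, y) points are synthetic and generated in the script; the file name in the comment is hypothetical, and the choice of two components is an assumption made for the demo.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic (x, y) data: two blobs, one around (0, 0), one around (5, 5).
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(300, 2))
cluster_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(300, 2))
data = np.vstack([cluster_a, cluster_b])
# With a real text file of two columns, something like this would load it:
# data = np.loadtxt("points.txt")  # hypothetical file name

# Fit a two-component GMM with full covariance matrices using EM.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(data)

# Estimated parameters of the mixture.
print("weights:", gmm.weights_)          # shape (2,)
print("means:", gmm.means_)              # shape (2, 2)
print("covariances:", gmm.covariances_)  # shape (2, 2, 2)
```

The number of components is usually not known in advance; comparing gmm.bic(data) across several values of n_components is one common way to choose it.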
If you do plan to do speaker recognition tasks, you will have to do some initial work on your data. If your text data is some form of speaker data, you could convert it to the appropriate format. For example, if the y part is raw audio, convert it to wav files. If y is cepstral features or other features, store it in .h5 format. After doing this, you can build a GMM for speaker recognition tasks by following the tutorials on the Sidekit homepage.

Trying to come up with features to extract from sound waves to use for an AI song composer

I am planning on making an AI song composer that would take in a bunch of songs played on one instrument, extract musical notes (like ABCDEFG) and certain features from the sound wave, perform machine learning (most likely through recurrent neural networks), and output a sequence of ABCDEFG notes (i.e. generate its own songs/music).
I think that this would be an unsupervised learning problem, but I am not really sure.
I figured that I would use recurrent neural networks, but I have a few questions on how to approach this:
- What features should I extract from the sound wave so that the output music is melodious?
I have a few other questions as well:
- Is it possible, with recurrent neural networks, to output a vector of sequenced musical notes (ABCDEFG)?
- Is there a smart way to feed in the features of the sound waves together with the sequence of musical notes?
Well, I did something similar once (making a Shazam-like app in MATLAB). I think you can use an FFT (Fast Fourier Transform) to break the signal down into its constituent frequencies and their corresponding amplitudes. Then you can use the frequency ranges of different instruments to pick them out of the whole bunch and classify them.
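A minimal NumPy sketch of that FFT idea: find the strongest frequency in a short signal and map it to the nearest note name. The test tone is synthesized here rather than read from a file, and the helper names dominant_frequency and nearest_note are made up for this example. It assumes equal temperament with A4 = 440 Hz and a single dominant pitch, which real recordings often will not have.

```python
import numpy as np

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def dominant_frequency(signal, sample_rate):
    """Return the frequency (Hz) of the largest FFT magnitude peak."""
    windowed = signal * np.hanning(len(signal))  # reduce spectral leakage
    magnitudes = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return float(freqs[np.argmax(magnitudes)])

def nearest_note(freq_hz):
    """Map a frequency to the nearest note name, assuming A4 = 440 Hz."""
    midi = int(round(69 + 12 * np.log2(freq_hz / 440.0)))
    return NOTE_NAMES[midi % 12] + str(midi // 12 - 1)

# Synthetic one-second tone: a 440 Hz fundamental plus a weaker overtone.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440.0 * t) + 0.3 * np.sin(2 * np.pi * 880.0 * t)

peak = dominant_frequency(tone, sample_rate)
print(peak, nearest_note(peak))
```

Running this over short consecutive windows of a recording would give a sequence of note names, which is the kind of symbolic input a sequence model could be trained on.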
I already tried something similar with an RNN (recurrent neural network). Try using an LSTM (long short-term memory) network. From what I read afterward, LSTMs are far better than plain RNNs for this type of data processing because they are much less prone to the "vanishing gradient problem".
What Chris Thaliyath said is a good hint on how to train the feature detector.

Gender Detection by audio

I've been searching everywhere for some form of gender detection that works by reading the frequency data of an audio file. I've had no luck finding a program that can do that, or even anything that outputs audio data so that I can write a basic program to read and manipulate it to determine the gender of the speaker.
Do any of you know where I can find something to help me with this?
To reiterate, I basically want to have a program that when a person talks into a microphone it will say the gender of the speaker with a fair amount of precision. My full plan is to also have speech to text feature on it, so the program will write out what the speaker said and give some extremely basic demographics on the speaker.
*Preferably with a common scripting language that's cross-platform or supported on Linux.
This is an old question, but in case someone is still interested in gender detection from audio: you can do this fairly easily by extracting MFCC (mel-frequency cepstral coefficient) features and modelling them with a GMM (Gaussian mixture model).
One can follow this tutorial, which implements exactly that and evaluates it on a gender-labelled subset extracted from Google's AudioSet.
https://appliedmachinelearning.wordpress.com/2017/06/14/voice-gender-detection-using-gmms-a-python-primer/
You're going to want to look into formant detection and linear predictive coding. Here's a paper that has some signal flow diagrams that could be ported over to scipy/numpy.
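To give a feel for what linear predictive coding does, here is a small NumPy sketch of the autocorrelation method applied to a synthetic signal with one known resonance. It is not taken from the paper mentioned above; the function names, the 700 Hz resonance, and the model order are assumptions chosen for the demo. Real speech would need a higher order (a common rule of thumb is about 2 + sample_rate / 1000), plus pre-emphasis and windowing of short frames.

```python
import numpy as np

def lpc_coefficients(signal, order):
    """Autocorrelation-method LPC. Returns [1, a1, ..., a_order]."""
    n = len(signal)
    # Autocorrelation at lags 0..order.
    r = np.array([np.dot(signal[: n - k], signal[k:]) for k in range(order + 1)])
    # Toeplitz normal equations: R a = -r[1:].
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, -r[1:])
    return np.concatenate(([1.0], a))

def resonance_frequencies(lpc, sample_rate):
    """Frequencies (Hz) of the complex poles of the LPC filter."""
    roots = np.roots(lpc)
    roots = roots[np.imag(roots) > 0]  # keep one of each conjugate pair
    return np.sort(np.angle(roots) * sample_rate / (2 * np.pi))

# Synthetic signal: white noise through a two-pole resonator at 700 Hz.
sample_rate = 8000
true_freq = 700.0
radius = 0.97
theta = 2 * np.pi * true_freq / sample_rate
a1, a2 = -2 * radius * np.cos(theta), radius ** 2

rng = np.random.default_rng(0)
noise = rng.normal(size=4000)
x = np.zeros(len(noise))
for n in range(len(noise)):
    prev1 = x[n - 1] if n >= 1 else 0.0
    prev2 = x[n - 2] if n >= 2 else 0.0
    x[n] = noise[n] - a1 * prev1 - a2 * prev2

estimated = resonance_frequencies(lpc_coefficients(x, order=2), sample_rate)
print(estimated)  # should be close to the 700 Hz resonance
```

For voiced speech, the lowest few resonances found this way approximate the formants, and formant positions together with pitch are features that tend to differ between typical male and female voices.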

Resources