Gender Detection by audio

I've been searching everywhere for some form of gender detection by reading the frequency data of an audio file. I've had no luck finding a program that can do that, or even anything that can output audio data so I can write a basic program to read it and manipulate it to determine the gender of the speaker.
Do any of you know where I can find something to help me with this?
To reiterate, I basically want a program that, when a person talks into a microphone, says the gender of the speaker with a fair amount of precision. My full plan is to also have a speech-to-text feature, so the program will write out what the speaker said and give some extremely basic demographics on the speaker.
*Preferably with a common scripting language that's cross-platform or Linux-supported.

Though this is an old question, if someone is still interested in doing gender detection from audio: you can do this by extracting MFCC (Mel-frequency cepstral coefficient) features and modelling them with a GMM (Gaussian mixture model).
The tutorial below implements exactly this and evaluates it on a gender-labelled subset extracted from Google's AudioSet:
https://appliedmachinelearning.wordpress.com/2017/06/14/voice-gender-detection-using-gmms-a-python-primer/
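For reference, here is a minimal sketch of that approach, assuming librosa and scikit-learn are installed; the folder names, file names and GMM settings are placeholders, not the tutorial's exact code:

    import glob
    import numpy as np
    import librosa
    from sklearn.mixture import GaussianMixture

    def mfcc_features(path, n_mfcc=13):
        """Load an audio file and return its MFCC frames (n_frames x n_mfcc)."""
        y, sr = librosa.load(path, sr=16000)
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

    def train_gmm(wav_dir, n_components=16):
        """Fit one GMM on the pooled MFCC frames of all files in a folder."""
        frames = np.vstack([mfcc_features(f) for f in glob.glob(wav_dir + "/*.wav")])
        return GaussianMixture(n_components=n_components, covariance_type="diag").fit(frames)

    male_gmm = train_gmm("male")        # placeholder folders of labelled WAVs
    female_gmm = train_gmm("female")

    def predict_gender(path):
        """Pick the model under which the file's frames are more likely."""
        feats = mfcc_features(path)
        return "male" if male_gmm.score(feats) > female_gmm.score(feats) else "female"

    print(predict_gender("test.wav"))   # placeholder test file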

You're going to want to look into formant detection and linear predictive coding (LPC). Here's a paper that has some signal flow diagrams that could be ported over to scipy/numpy.
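If you want to experiment before reading the paper, here is a rough scipy/numpy sketch of the standard LPC autocorrelation recipe for formant estimation; the file name, frame length and model order are arbitrary choices, not the paper's:

    import numpy as np
    from scipy.linalg import solve_toeplitz
    from scipy.io import wavfile

    def lpc(frame, order):
        """LPC coefficients a[1..order] via the autocorrelation (normal-equation) method."""
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        return solve_toeplitz((r[:order], r[:order]), r[1:order + 1])

    def formants(frame, fs, order=10):
        """Estimate formant frequencies from the roots of the LPC polynomial."""
        frame = frame * np.hamming(len(frame))          # window the analysis frame
        a = lpc(frame, order)
        roots = np.roots(np.concatenate(([1.0], -a)))   # A(z) = 1 - sum_k a_k z^-k
        roots = roots[np.imag(roots) > 0]               # one root per conjugate pair
        freqs = np.angle(roots) * fs / (2 * np.pi)      # pole angle -> frequency in Hz
        return np.sort(freqs[freqs > 90])               # drop near-DC poles

    fs, x = wavfile.read("voice.wav")                   # placeholder mono recording
    frame = x[:int(0.03 * fs)].astype(float)            # one 30 ms frame of voiced speech
    print(formants(frame, fs)[:3])                      # roughly F1, F2, F3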

Related

Speech-to-text training for an impaired voice

I want to train and use an ML-based personal voice-to-text converter for a highly impaired voice, for a small set of 300-400 words. This is to be used by people with voice impairments, but it cannot be generic, because each person will produce a unique voice input for each word depending on their type of impairment.
I wanted to know if there are any ML engines which allow for such training. If not, what is the best approach to go about it?
Thanks
Most speech recognition engines support training (wav2letter, DeepSpeech, ESPnet, Kaldi, etc.); you just need to feed in the data. The only issue is that you need a lot of data to train reliably (on the order of 1000 samples for each word). You can check the Google Speech Commands dataset for an example of how to train from scratch.
Since the training dataset will be pretty small in your case and will consist of just a few samples, you can probably start with an existing pretrained model and fine-tune it on your samples to get the best accuracy. You should look into "few-shot learning" setups.
You can probably look at the wav2vec 2.0 pretrained model; it should be effective for this kind of learning. You can find examples and commands for fine-tuning and inference here.
You can also try fine-tuning Jasper models on Google Speech Commands with NVIDIA NeMo. It might be a little less effective but could still work, and should be easier to set up.
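As a starting point, here is a minimal wav2vec 2.0 inference sketch, assuming the Hugging Face transformers and torchaudio packages; the file name is a placeholder, and fine-tuning on your 300-400 words is left to the linked examples:

    import torch
    import torchaudio
    from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

    processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
    model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

    # Load a recording and resample to the 16 kHz rate the model expects.
    waveform, sr = torchaudio.load("sample.wav")        # placeholder input file
    waveform = torchaudio.functional.resample(waveform, sr, 16000)

    inputs = processor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits

    # Greedy CTC decoding: most likely token per frame, collapsed to text.
    predicted_ids = torch.argmax(logits, dim=-1)
    print(processor.batch_decode(predicted_ids)[0])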
I highly recommend watching season one, episode two of the YouTube Originals series "The Age of A.I.".
Basically, Google has already done this for people who can't really form normal words because of an impaired voice. It is very interesting and talks a little about how they did, and continue to do, that with ML technologies.

I want to understand the 'd-vector' for speaker diarization

My understanding is that when segmented speech audio is fed into a DNN model, the average of the features extracted from the last hidden layer is the 'd-vector'.
If so, I want to know whether a d-vector can be extracted for a speaker whose voice was not part of the training data.
And, using this, if segments of a voice file spoken by multiple people (represented with mel-filterbank or MFCC features) are put in, can we distinguish the speakers by clustering the extracted d-vectors as mentioned above?
To answer your questions:
After you train the model, you can get the d-vector simply by forward-propagating the input vector through the network. Normally you look at the output (final) layer of the ANN, but you can equally retrieve values from the penultimate layer (the d-vector).
Yes, you can distinguish speakers with the d-vector, as it is essentially a high-level embedding of the audio signal whose features will differ between speakers. See e.g. this paper.
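To make the idea concrete, here is a toy PyTorch sketch of pulling the penultimate-layer activations as a d-vector and clustering them; the network is untrained and the frame features are random placeholders, so it only illustrates the mechanics, not a working diarization system:

    import torch
    import torch.nn as nn
    from sklearn.cluster import AgglomerativeClustering

    class SpeakerNet(nn.Module):
        def __init__(self, n_feat=40, n_hidden=256, n_speakers=100):
            super().__init__()
            self.hidden = nn.Sequential(
                nn.Linear(n_feat, n_hidden), nn.ReLU(),
                nn.Linear(n_hidden, n_hidden), nn.ReLU(),
            )
            self.out = nn.Linear(n_hidden, n_speakers)   # used only while training

        def d_vector(self, frames):
            """Average the last hidden layer over a segment's frames."""
            return self.hidden(frames).mean(dim=0)

    net = SpeakerNet()
    segments = [torch.randn(200, 40) for _ in range(6)]  # 6 segments of fake filterbank frames
    d_vectors = torch.stack([net.d_vector(s) for s in segments]).detach().numpy()

    # Cluster the d-vectors; each cluster should ideally correspond to one speaker.
    labels = AgglomerativeClustering(n_clusters=2).fit_predict(d_vectors)
    print(labels)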

Find the timestamp of a word in an audio file

I have an audio file of human speech, about 1 minute long. I want to find the timestamp of a word or a phrase spoken in the audio.
Is there any existing library that can do the task?
There are at least two ways to approach this issue: speech recognition and machine learning. Which is more suitable depends on your circumstances.
With speech recognition you could run the audio through an established speech-to-text recognizer and assess the timestamp of the word based on its distance from the beginning of the resulting string. With machine learning you would establish a model for the audio produced by the word or phrase from training data, then slice the test audio into suitable lengths and run each against the model to assess the likelihood of its being the word you are looking for.
The machine learning approach is likely to be the more accurate with respect to timestamp, but of course requires a lot of training data to establish the model in the first place.
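For the speech-recognition route, one off-the-shelf option (my suggestion, not something the question names) is the Vosk recognizer, which can return per-word timestamps directly. The sketch below assumes a 16 kHz mono WAV file and a Vosk model downloaded to ./model:

    import json
    import wave
    from vosk import Model, KaldiRecognizer

    wf = wave.open("speech.wav", "rb")                  # placeholder input file
    rec = KaldiRecognizer(Model("model"), wf.getframerate())
    rec.SetWords(True)                                  # ask for word-level timing

    words = []
    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        if rec.AcceptWaveform(data):
            words += json.loads(rec.Result()).get("result", [])
    words += json.loads(rec.FinalResult()).get("result", [])

    # Each entry has "word", "start" and "end" (in seconds).
    target = "hello"                                    # placeholder word to look for
    print([(w["start"], w["end"]) for w in words if w["word"] == target])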

Detect multiple voices without speech recognition

Is there a way to just detect in real time whether there are multiple people speaking? Do I need a voice recognition API for that?
I don't want to separate the audio and I don't want to transcribe it either. My approach would be to frequently record using one mic (-> mono) and then analyse those recordings. But how would I then detect and distinguish voices? I'd narrow it down by looking only at relevant frequencies, but then...
I do understand that this is no trivial undertaking. That's why I hope there's an API out there capable of doing this out of the box - preferably a mobile/web-friendly API.
Now this might sound like a shopping list for Christmas, but as mentioned I do not need to know anything about the content. So my guess is that full-fledged speech recognition would take too high a toll on performance.
Most similar problems (adult/child classification, speech/music classification, single voice vs. voice mixture classification) are standard machine learning problems. You can solve them with a classifier like a GMM. You only need to construct training data for your task, so:
Take some amount of clean recordings; you can download an audiobook
Prepare mixed data by mixing the clean recordings
Train a GMM classifier on both
Compare the probabilities from the clean-speech GMM and the mixed-speech GMM, and decide whether a mixture is present from the ratio of the probabilities of the two classifiers (a code sketch follows the links below).
You can find some code samples here:
https://github.com/littleowen/Conceptor
For example you can try
https://github.com/littleowen/Conceptor/blob/master/Gender.ipynb
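Here is a minimal sketch of the steps above, assuming librosa and scikit-learn, and two placeholder folders: clean/ with single-voice recordings and mixed/ with overlapped recordings made by summing pairs of clean files; the threshold and model sizes are arbitrary:

    import glob
    import numpy as np
    import librosa
    from sklearn.mixture import GaussianMixture

    def mfcc_frames(path):
        y, sr = librosa.load(path, sr=16000)
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).T

    def fit_gmm(folder):
        """Pool the MFCC frames of every file in a folder and fit one GMM."""
        frames = np.vstack([mfcc_frames(f) for f in glob.glob(folder + "/*.wav")])
        return GaussianMixture(n_components=32, covariance_type="diag").fit(frames)

    clean_gmm = fit_gmm("clean")    # single-voice training recordings
    mixed_gmm = fit_gmm("mixed")    # artificially overlapped recordings

    def multiple_voices(path, threshold=0.0):
        """Positive log-likelihood ratio in favour of the mixed model => overlap."""
        feats = mfcc_frames(path)
        return (mixed_gmm.score(feats) - clean_gmm.score(feats)) > threshold

    print(multiple_voices("chunk.wav"))   # placeholder recorded chunk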

Trying to come up with features to extract from sound waves to use for an AI song composer

I am planning on making an AI song composer that would take in a bunch of songs of one instrument, extract the musical notes (like ABCDEFG) and certain features from the sound wave, perform machine learning (most likely through recurrent neural networks), and output a sequence of ABCDEFG notes (i.e. generate its own songs/music).
I think that this would be an unsupervised learning problem, but I am not really sure.
I figured that I would use recurrent neural networks, but I have a few questions on how to approach this:
- What features should I extract from the sound wave so that the output music is melodious?
I have a few other questions as well:
- Is it possible, with recurrent neural networks, to output a vector of sequenced musical notes (ABCDEFG)?
- Is there a smart way to feed in the features of the sound waves as well as the sequence of musical notes?
Well, I did something similar once (making a Shazam-like app in MATLAB). I think you can use the FFT (Fast Fourier Transform) to break the signal down into its constituent frequencies and their corresponding amplitudes. Then you can use the frequency ranges of different instruments to select them out of the whole bunch and classify them.
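To illustrate the FFT step, here is a toy numpy/scipy sketch that maps the dominant frequency of one frame to a note name; real note extraction would also need onset detection and polyphony handling, and the file name is a placeholder:

    import numpy as np
    from scipy.io import wavfile

    NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

    def dominant_note(frame, fs):
        """Map the strongest FFT peak of one frame to the nearest note name."""
        spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
        freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
        f0 = freqs[np.argmax(spectrum[1:]) + 1]           # skip the DC bin
        midi = int(round(69 + 12 * np.log2(f0 / 440.0)))  # nearest MIDI note number
        return NOTE_NAMES[midi % 12] + str(midi // 12 - 1)

    fs, x = wavfile.read("melody.wav")                    # placeholder mono recording
    frame = x[:4096].astype(float)                        # one short analysis frame
    print(dominant_note(frame, fs))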
I already tried something similar with an RNN (recurrent neural network). Try using an LSTM network (Long Short-Term Memory); from what I read afterwards, they are WAY better than plain RNNs for this type of sequence data, because they do not suffer from the "vanishing gradient problem".
What Chris Thaliyath said is a good hint on how to train the feature detector.
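If you go the LSTM route, a minimal Keras sketch of next-note prediction looks like this; the training data here is random placeholder noise, so it only shows the model shape, not a trained composer:

    import numpy as np
    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import Embedding, LSTM, Dense

    VOCAB = 8       # e.g. notes A-G plus a rest symbol (toy vocabulary)
    SEQ_LEN = 32

    model = Sequential([
        Embedding(VOCAB, 16),                 # small learned embedding per note
        LSTM(128),                            # summarise the note sequence
        Dense(VOCAB, activation="softmax"),   # distribution over the next note
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

    X = np.random.randint(0, VOCAB, size=(500, SEQ_LEN))   # placeholder note sequences
    y = np.random.randint(0, VOCAB, size=(500,))           # placeholder next notes
    model.fit(X, y, epochs=3, batch_size=32)

    # Generate by sampling from the predicted distribution for a seed sequence.
    probs = model.predict(X[:1])[0]
    probs = probs / probs.sum()               # guard against float rounding
    print(np.random.choice(VOCAB, p=probs))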
