Find the timestamp of a word in an audio file

I have an audio file of human speech, about one minute long. I want to find the timestamp of a word or phrase spoken in the audio.
Is there any existing library that can do the task?

There are at least two ways to approach this issue: speech recognition and machine learning. Which is more suitable depends on your circumstances.
With speech recognition, you could run the audio through an established speech-to-text recognizer and estimate the timestamp of the word from its position in the resulting transcript (or, better, from the word-level timing the recognizer reports). With machine learning, you would build a model of the audio produced by the word or phrase from training data, then slice the test audio into suitable windows and score each one against the model to assess the likelihood that it contains the word you are looking for.
The machine learning approach is likely to be more accurate with respect to the timestamp, but of course it requires a lot of training data to build the model in the first place.
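For the speech-recognition route, a recognizer that emits word-level timestamps saves you from mapping string offsets back to time. Below is a minimal sketch using the open-source Vosk recognizer; the model path, file name, and target word are assumptions, not something prescribed by the question.

```python
# Sketch: locate a word's timestamp via Vosk's word-level output.
# Assumes a 16 kHz mono WAV file ("speech.wav") and an unpacked Vosk model directory ("model").
import json
import wave

from vosk import Model, KaldiRecognizer

TARGET = "hello"  # word (or first word of the phrase) you are looking for

wf = wave.open("speech.wav", "rb")
model = Model("model")
rec = KaldiRecognizer(model, wf.getframerate())
rec.SetWords(True)  # ask Vosk to report per-word start/end times

words = []
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        words.extend(json.loads(rec.Result()).get("result", []))
words.extend(json.loads(rec.FinalResult()).get("result", []))

for w in words:
    if w["word"] == TARGET:
        print(f"'{TARGET}' spoken at {w['start']:.2f}s - {w['end']:.2f}s")
```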

Related

speech to text training for impaired voice

I want to train and use an ML-based personal voice-to-text converter for a highly impaired voice, for a small set of 300-400 words. This is to be used by people with voice impairment, but it cannot be generic because each person will have a unique voice input for each word, depending on their type of impairment.
I wanted to know whether there are any ML engines which allow for such training. If not, what is the best approach to go about it?
Thanks
Most speech recognition engines support training (wav2letter, DeepSpeech, ESPnet, Kaldi, etc.); you just need to feed in the data. The only issue is that you need a lot of data to train reliably (on the order of a thousand samples for each word). You can check the Google Speech Commands dataset for an example of how to train from scratch.
Since the training dataset in your case will be pretty small and consist of just a few samples per word, you can probably start with an existing pretrained model and fine-tune it on your samples to get the best accuracy. You will want to look at "few-shot learning" setups.
You can look at the wav2vec 2.0 pretrained model; it should be effective for this kind of learning. You can find examples and commands for fine-tuning and inference here.
You can also try fine-tuning Jasper models on Google Speech Commands with NVIDIA NeMo. It might be a little less effective but could still work and should be easier to set up.
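If you take the wav2vec 2.0 route, one way to frame a 300-400-word personal vocabulary is as an audio classification problem with one class per word. Below is a rough sketch using the Hugging Face transformers implementation; the checkpoint name, label count, and data handling are assumptions, not a recipe from the linked examples.

```python
# Sketch: fine-tune a pretrained wav2vec 2.0 model as a keyword classifier
# (one class per word). Checkpoint, label count, and data loading are assumptions.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification

NUM_WORDS = 350  # size of the personal vocabulary

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=NUM_WORDS
)

# Freeze the convolutional feature encoder so the few samples per word
# only update the transformer layers and the new classification head.
model.freeze_feature_encoder()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def training_step(waveform, label):
    """One gradient step on a single (16 kHz mono waveform, word index) pair."""
    inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")
    outputs = model(**inputs, labels=torch.tensor([label]))
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```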
I highly recommend watching season one, episode two of the YouTube Originals series "The Age of A.I.".
Basically, Google has already done this for people whose impaired voices can't form normal words. It is very interesting and explains a little about how they did, and are doing, this with ML technologies.

I want to understand the 'd-vector' for speaker diarization

As I understand it, when segmented speech audio is fed to the DNN model, the average of the features extracted from the last hidden layer is the 'd-vector'.
In that case, I want to know whether a speaker's d-vector can be extracted even if I feed in the voice of a speaker the model was not trained on.
Using this, if segments of a voice file spoken by multiple people (represented as mel-filterbank or MFCC features) are fed in, can we distinguish the speakers by clustering the extracted d-vectors, as mentioned before?
To answer your questions:
After you train the model, you can get the d-vector simply by forward-propagating the input through the network. Normally you look at the output (final layer) of the ANN, but you can equally retrieve values from the penultimate (d-vector) layer.
Yes, you can distinguish speakers with the d-vector, as it is in effect a high-level embedding of the audio signal that will have distinctive features for different people. See e.g. this paper.
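To make the first point concrete, here is a rough sketch of pulling the penultimate-layer activations out of a (toy) PyTorch speaker network, averaging them over frames to get a d-vector, and clustering the d-vectors of several segments. The architecture, feature shapes, and cluster count are illustrative assumptions; in practice the network would be trained and the features would come from a mel-filterbank or MFCC front end.

```python
# Sketch: extract a d-vector (average of penultimate-layer activations) from a
# speaker network and cluster d-vectors from several segments.
import torch
import torch.nn as nn
from sklearn.cluster import AgglomerativeClustering

class SpeakerNet(nn.Module):
    def __init__(self, n_mels=40, hidden=256, n_speakers=100):
        super().__init__()
        self.hidden_layers = nn.Sequential(
            nn.Linear(n_mels, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),   # penultimate layer -> d-vector
        )
        self.classifier = nn.Linear(hidden, n_speakers)  # only used during training

    def forward(self, frames):                      # frames: (n_frames, n_mels)
        return self.classifier(self.hidden_layers(frames))

    def d_vector(self, frames):
        with torch.no_grad():
            h = self.hidden_layers(frames)          # (n_frames, hidden)
        return h.mean(dim=0)                        # average over frames

model = SpeakerNet()  # in practice: load trained weights here
segments = [torch.randn(200, 40) for _ in range(6)]  # placeholder filterbank segments

d_vectors = torch.stack([model.d_vector(s) for s in segments]).numpy()
labels = AgglomerativeClustering(n_clusters=2).fit_predict(d_vectors)
print(labels)  # cluster index per segment ~ speaker identity
```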

Which features and algorithms are good for speaker verification?

I have a speaker verification task.
The task is to calculate the similarity between two speech recordings, then compare it with a threshold.
For example: the similarity score between two recordings is 70% and the threshold is 50%, hence they are from the same speaker.
The speech is text-independent; it can be any conversation.
I have experience using MFCC and GMM for speaker recognition, but this task is different: I just compare the features of two recordings to get a similarity score. I don't know which features are good for speaker verification, or which algorithm can help me calculate a similarity score between two patterns.
Hoping for your advice,
many thanks.
The state of the art these days is x-vectors:
Deep Neural Network Embeddings for Text-Independent Speaker Verification
Implementation in Kaldi is here.
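Whatever embedding you end up with (Kaldi x-vectors or otherwise), the verification step usually reduces to scoring the similarity of two embeddings against a threshold; the Kaldi recipe uses PLDA scoring, while plain cosine similarity is a common simpler baseline. A minimal sketch of that last step, with random vectors standing in for the embeddings your extractor would produce:

```python
# Sketch: decide "same speaker" from two speaker embeddings (e.g. x-vectors)
# using cosine similarity and a threshold. The embeddings below are random
# placeholders; in practice they come from your embedding extractor.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_speaker(emb_a, emb_b, threshold=0.5):
    score = cosine_similarity(emb_a, emb_b)
    return score >= threshold, score

# placeholder embeddings for two recordings
emb_1 = np.random.randn(512)
emb_2 = np.random.randn(512)
decision, score = same_speaker(emb_1, emb_2)
print(f"same speaker: {decision} (score {score:.2f})")
```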
I am also working on the TIMIT dataset for speaker verification. I have extracted MFCC features, trained a UBM on them, and adapted it for each speaker. For the adaptation I used a diagonal covariance matrix.
How are you testing the wav files? As for features, you can also use pitch and energy.

Detect multiple voices without speech recognition

Is there a way to just detect in realtime if there are multiple people speaking? Do I need a voice recognition api for that?
I don't want to separate the audio and I don't want to transcribe it either. My approach would be to record frequently using one mic (-> mono) and then analyse those recordings. But how would I then detect and distinguish voices? I'd narrow it down by looking only at relevant frequencies, but then...
I do understand that this is no trivial undertaking. That's why I hope there's an API out there capable of doing this out of the box - preferably a mobile/web-friendly API.
Now this might sound like a shopping list for Christmas, but as mentioned I do not need to know anything about the content. So my guess is that full-fledged speech recognition would take a heavy toll on performance.
Most similar problems (adult/child classification, speech/music classification, single-voice vs. voice-mixture classification) are standard machine learning problems. You can solve them with a classifier like a GMM. You only need to construct training data for your task, so:
Take some amount of clean recordings; you can download audiobooks
Prepare mixed data by mixing the clean recordings
Train a GMM classifier on both
Compare the probabilities from the clean-speech GMM and the mixed-speech GMM, and decide whether a mixture is present from the ratio of the two probabilities (see the sketch after the links below).
You can find some code samples here:
https://github.com/littleowen/Conceptor
For example you can try
https://github.com/littleowen/Conceptor/blob/master/Gender.ipynb
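For reference, here is a minimal sketch of the clean-vs-mixed GMM comparison described above, using librosa for MFCC features and scikit-learn's GaussianMixture. The file lists, MFCC settings, and component counts are assumptions.

```python
# Sketch: train one GMM on MFCCs of single-speaker audio and one on artificial
# mixtures, then flag overlapping speech when the mixture GMM scores higher.
import librosa
import numpy as np
from sklearn.mixture import GaussianMixture

def mfcc_frames(path):
    y, sr = librosa.load(path, sr=16000, mono=True)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T  # (n_frames, 13)

clean_feats = np.vstack([mfcc_frames(p) for p in ["clean1.wav", "clean2.wav"]])
mixed_feats = np.vstack([mfcc_frames(p) for p in ["mixed1.wav", "mixed2.wav"]])

gmm_clean = GaussianMixture(n_components=16).fit(clean_feats)
gmm_mixed = GaussianMixture(n_components=16).fit(mixed_feats)

def has_multiple_voices(path):
    feats = mfcc_frames(path)
    # score() returns the average per-frame log-likelihood under each model
    return gmm_mixed.score(feats) > gmm_clean.score(feats)

print(has_multiple_voices("test.wav"))
```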

Gender Detection by audio

I've been searching everywhere for some form of gender detection by reading frequency data of an audio file. I've had no luck finding a program that could do that, or even anything that can output audio data so I can write a basic program to read and manipulate it to determine the gender of the speaker.
Do any of you know where I can find something to help me with this?
To reiterate, I basically want to have a program that when a person talks into a microphone it will say the gender of the speaker with a fair amount of precision. My full plan is to also have speech to text feature on it, so the program will write out what the speaker said and give some extremely basic demographics on the speaker.
*Preferably with a common scripting language that's cross-platform or Linux-supported.
Though this is an old question, if someone is still interested in doing gender detection from audio, you can do it by extracting MFCC (Mel-frequency cepstral coefficient) features and modelling them with a GMM (Gaussian mixture model).
You can follow this tutorial, which implements exactly that and evaluates it on a gender-wise subset extracted from Google's AudioSet:
https://appliedmachinelearning.wordpress.com/2017/06/14/voice-gender-detection-using-gmms-a-python-primer/
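The linked tutorial follows roughly this pattern: one GMM per gender trained on MFCC frames, with new audio assigned to whichever model gives the higher likelihood. Below is a hedged sketch of that idea using librosa and scikit-learn; the training file lists and model sizes are placeholders.

```python
# Sketch of the MFCC + GMM approach to gender detection.
import librosa
import numpy as np
from sklearn.mixture import GaussianMixture

def mfcc_frames(path):
    y, sr = librosa.load(path, sr=16000, mono=True)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T

# placeholder training file lists, one per gender
gmm_male = GaussianMixture(n_components=8).fit(
    np.vstack([mfcc_frames(p) for p in ["male1.wav", "male2.wav"]]))
gmm_female = GaussianMixture(n_components=8).fit(
    np.vstack([mfcc_frames(p) for p in ["female1.wav", "female2.wav"]]))

def predict_gender(path):
    feats = mfcc_frames(path)
    # pick the model with the higher average per-frame log-likelihood
    return "male" if gmm_male.score(feats) > gmm_female.score(feats) else "female"

print(predict_gender("speaker.wav"))
```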
You're going to want to look into formant detection and linear predictive coding. Here's a paper that has some signal-flow diagrams that could be ported over to scipy/numpy.
