Is Speech-to-Text voice training data sampled at 48kHz still good for improving recognition of 16kHz speech

We are training our Azure Cognitive Services Custom Speech model using data recorded in .wav (RIFF) format at 16-bit, 16kHz, as per the documentation.
However, we have obtained a dataset of speech recorded at 48kHz and encoded as MP3. Speech Studio seems to be able to train the service with this data without problems, but we would like to know whether the higher sample rate will only help when recognising streamed data at that same higher rate, or whether the rate does not matter?

A higher sample rate like the one you describe is desirable in terms of audio quality, but it generally won't influence speech recognition accuracy. As long as you meet the minimum audio format requirements, speech recognition should work just fine.
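If you would rather normalise the 48kHz MP3 data to the documented 16kHz/16-bit WAV format before uploading, something like the following works. This is a minimal sketch assuming pydub (which needs ffmpeg on the PATH); the file names are placeholders:

    from pydub import AudioSegment  # pip install pydub; requires ffmpeg

    # Load the 48kHz MP3 and downsample to 16kHz, 16-bit mono RIFF WAV.
    # "input.mp3" / "output.wav" are placeholder names.
    audio = AudioSegment.from_mp3("input.mp3")
    audio = audio.set_frame_rate(16000).set_channels(1).set_sample_width(2)  # 2 bytes = 16-bit
    audio.export("output.wav", format="wav")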

Related

Building GMM using SIDEKIT 1.2

I have 2-dimensional data in the form of a text file. I have to build a GMM based on this data using Sidekit 1.2.
Which function should I use to estimate the parameters of the Gaussian model (means, covariance matrices, weights, etc.)?
Can you please provide a small example with your own set of (x, y) data and build a GMM using that?
Any help would be greatly appreciated.
Sidekit is a toolkit built mainly for the task of speaker recognition, and its framework (like other similar toolkits) expects the training data to consist of audio files in .wav, .sph or raw PCM format.
If you're just building a GMM and don't plan to use it for speaker recognition experiments, I would recommend using another toolkit for general statistical purposes (scikit-learn might be a good choice).
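For illustration, a minimal scikit-learn sketch fitting a GMM to some made-up 2-D points (the data here is random, just to show the API; for your case you could load the text file with numpy.loadtxt instead):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Made-up 2-D data: two Gaussian blobs (stand-in for your (x, y) text file,
    # which you could load with X = np.loadtxt("data.txt") instead).
    rng = np.random.RandomState(0)
    X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

    # Fit a 2-component GMM with full covariance matrices.
    gmm = GaussianMixture(n_components=2, covariance_type="full").fit(X)

    print(gmm.weights_)      # mixture weights
    print(gmm.means_)        # component means
    print(gmm.covariances_)  # component covariance matrices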
If you do plan to do speaker recognition tasks, you will have to do some initial work on your data. If your text data is some form of speaker data, you could convert it to the appropriate format. For example, if the y part is raw audio, convert it to .wav files. If y is cepstral features or other features, store it in HDF5 (.h5) format. After doing this, you can build a GMM for speaker recognition tasks by following the tutorials on the Sidekit homepage.

Find timestamp of a word in an audio

I have an audio file of human speech. The length of the audio is about 1 minute. I want to find the timestamp of a word or a phrase spoken in the audio.
Is there any existing library that can do the task?
There are at least two ways to approach this issue: speech recognition and machine learning. Which is more suitable depends on your circumstances.
With speech recognition, you could run the audio through an established speech-to-text recognizer and estimate the timestamp of the word from its position in the resulting transcript (some recognizers report word-level timestamps directly). With machine learning, you would build a model of the audio produced by the word or phrase from training data, then slice the test audio into suitable windows and score each against the model to assess the likelihood that it contains the word you are looking for.
The machine learning approach is likely to be the more accurate with respect to the timestamp, but of course requires a lot of training data to establish the model in the first place.
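As a concrete example of the speech-recognition route, here is a minimal sketch using the Vosk library, which reports word-level timestamps. The model path, audio file name, and target word are placeholders:

    import json
    import wave
    from vosk import Model, KaldiRecognizer  # pip install vosk

    TARGET = "hello"                    # placeholder: the word you are looking for
    wf = wave.open("speech.wav", "rb")  # placeholder: 16kHz mono PCM WAV

    model = Model("model")              # path to a downloaded Vosk model directory
    rec = KaldiRecognizer(model, wf.getframerate())
    rec.SetWords(True)                  # ask for per-word start/end times

    def report(result):
        for w in json.loads(result).get("result", []):
            if w["word"] == TARGET:
                print(f"{TARGET!r} at {w['start']:.2f}s - {w['end']:.2f}s")

    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        if rec.AcceptWaveform(data):
            report(rec.Result())
    report(rec.FinalResult())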

Detect multiple voices without speech recognition

Is there a way to just detect in real time if there are multiple people speaking? Do I need a voice recognition API for that?
I don't want to separate the audio and I don't want to transcribe it either. My approach would be to record frequently using one mic (-> mono) and then analyse those recordings. But how would I then detect and distinguish voices? I'd narrow it down by looking only at relevant frequencies, but then...
I do understand that this is no trivial undertaking. That's why I hope there's an API out there capable of doing this out of the box - preferably a mobile/web-friendly API.
Now this might sound like a shopping list for Christmas, but as mentioned I do not need to know anything about the content. So my guess is that a full-fledged speech recognition would take a high toll on performance.
Most similar problems (adult/children classifier, speech/music classifier, single voice / voice mixture classifier) are standard machine learning problems. You can solve them with a classifier like a GMM. You only need to construct training data for your task, so (see the sketch below the links):
Take a set of clean single-speaker recordings; you can download an audiobook
Prepare mixed data by mixing the clean recordings
Train a GMM classifier on both
Compare probabilities from the clean-speech GMM and the mixed-speech GMM, and decide the presence of a mixture by the ratio of the probabilities from the two classifiers.
You can find some code samples here:
https://github.com/littleowen/Conceptor
For example you can try
https://github.com/littleowen/Conceptor/blob/master/Gender.ipynb
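As a rough sketch of the last two steps (not the code from the links above), assuming librosa and scikit-learn, MFCC features, and placeholder file lists for the clean and mixed training audio:

    import librosa
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def mfcc_frames(path):
        # Frame-level MFCC features; rows are frames, columns are coefficients.
        y, sr = librosa.load(path, sr=16000)
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T

    # Placeholder training lists: single-speaker vs. artificially mixed audio.
    clean = np.vstack([mfcc_frames(p) for p in ["clean1.wav", "clean2.wav"]])
    mixed = np.vstack([mfcc_frames(p) for p in ["mixed1.wav", "mixed2.wav"]])

    gmm_clean = GaussianMixture(n_components=16).fit(clean)
    gmm_mixed = GaussianMixture(n_components=16).fit(mixed)

    # score() is the average log-likelihood per frame; a positive difference
    # in log space favours the "multiple voices" hypothesis.
    test = mfcc_frames("test.wav")
    llr = gmm_mixed.score(test) - gmm_clean.score(test)
    print("multiple voices" if llr > 0 else "single voice")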

How do I tell the difference between an 8kHz acoustic model and a 16kHz model?

I'm able to get a reasonable level of accuracy with 8kHz audio files. Now I want to try a higher sample rate, if I can.
Looking at the acoustic models available on this page, they list:
en-us-8khz.tar.gz
en-us-semi-full.tar.gz
en-us-semi.tar.gz
en-us.tar.gz
The one that says 8khz is obviously the one for the 8kHz sample rate, but what about the other three? What sample rates do they match?
If I use a 16kHz audio file, which of these acoustic models do I need to use?
And in the absence of the sample rate being in the file name, how do I figure out the sample rate of an acoustic model?
You can open the file feat.params in the model folder and look for the -upperf parameter. In an 8kHz model, -upperf is usually 3500 or 4000. For a 16kHz model, -upperf is more than 4000, usually 6800.
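A small sketch of that heuristic in Python (feat.params is a plain-text file of "-name value" lines; the 4000 Hz cut-off follows the rule of thumb above, and the model directory name is a placeholder):

    import os

    def model_sample_rate(model_dir):
        """Guess a Sphinx acoustic model's expected sample rate from -upperf."""
        with open(os.path.join(model_dir, "feat.params")) as f:
            for line in f:
                parts = line.split()
                if len(parts) == 2 and parts[0] == "-upperf":
                    # Upper filter-bank edge <= 4000 Hz -> 8kHz model.
                    return 8000 if float(parts[1]) <= 4000 else 16000
        return None  # no -upperf found

    print(model_sample_rate("en-us"))  # e.g. a directory extracted from en-us.tar.gz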

Gender Detection by audio

I've been searching everywhere for some form of gender detection by reading frequency data of an audio file. I've had no luck finding a program that can do that, or even anything that can output audio data so I can write a basic program to read it and manipulate it to determine the gender of the speaker.
Do any of you know where I can find something to help me with this?
To reiterate, I basically want a program such that when a person talks into a microphone, it will say the gender of the speaker with a fair amount of precision. My full plan is to also have a speech-to-text feature, so the program will write out what the speaker said and give some extremely basic demographics on the speaker.
*Preferably with a common scripting language that's cross-platform or Linux-supported.
Though this is an old question, if someone is still interested in doing gender detection from audio, you can easily do this by extracting MFCC (Mel-frequency cepstral coefficient) features and modelling them with a GMM (Gaussian mixture model).
One can follow this tutorial, which implements the same approach and evaluates it on a gender-labelled subset extracted from Google's AudioSet.
https://appliedmachinelearning.wordpress.com/2017/06/14/voice-gender-detection-using-gmms-a-python-primer/
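In the same spirit as that tutorial (though not its actual code), a minimal sketch: train one GMM per gender on MFCC features and pick the class with the higher log-likelihood. The training file lists are placeholders:

    import librosa
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def features(path):
        y, sr = librosa.load(path, sr=16000)
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T  # frames x coefficients

    # Placeholder training file lists, one per class.
    train = {"male": ["m1.wav", "m2.wav"], "female": ["f1.wav", "f2.wav"]}
    models = {g: GaussianMixture(n_components=8).fit(np.vstack([features(p) for p in paths]))
              for g, paths in train.items()}

    # Classify an utterance by the model with the highest average log-likelihood.
    test = features("unknown.wav")
    print(max(models, key=lambda g: models[g].score(test)))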
You're going to want to look into formant detection and linear predictive coding. Here's a paper that has some signal flow diagrams that could be ported over to scipy/numpy.
