Is there an algorithm for Speaker Error Rate for speech-to-text diarization?

Some speech-to-text services, such as Google Speech-to-Text, offer speaker differentiation via diarization, which attempts to identify and separate multiple speakers in a single audio recording. This is often needed when multiple speakers in a meeting room share a single microphone.
Is there an algorithm and implementation to calculate the correctness of speaker separation?
This would be used in conjunction with Word Error Rate (WER), which is often used to test the correctness of the baseline transcription.

The commonly used approach for this appears to be the Diarization Error Rate (DER) defined by NIST in the NIST-RT projects.
A newer evaluation metric is the Jaccard Error Rate (JER) introduced in DIHARD II: The Second DIHARD Speech Diarization Challenge.
Two projects for measuring these include:
https://github.com/nryant/dscore
https://github.com/wq2012/SimpleDER
DER is referenced in these papers:
A Comparison of Neural Network Feature Transforms for Speaker Diarization
The ICSI RT-09 Speaker Diarization System
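As a quick illustration, the SimpleDER project above scores a diarization hypothesis against a reference with a single function call, where each segment is a (speaker_label, start_seconds, end_seconds) tuple. The snippet below is a minimal sketch following that package's documented usage; the segment timestamps are made up.

```python
# Minimal sketch using the simpleder package linked above (pip install simpleder).
# Speaker labels need not match between reference and hypothesis, since DER is
# computed over an optimal mapping between reference and hypothesis speakers.
import simpleder

# Reference (ground-truth) segments: (speaker_label, start_sec, end_sec).
ref = [("A", 0.0, 1.0),
       ("B", 1.0, 1.5),
       ("A", 1.6, 2.1)]

# Hypothesis segments produced by the diarization system.
hyp = [("1", 0.0, 0.8),
       ("2", 0.8, 1.4),
       ("1", 1.5, 2.1)]

error = simpleder.DER(ref, hyp)
print("DER = {:.3f}".format(error))
```

The dscore tool linked above works on RTTM files instead and reports both DER and JER, which may be more convenient if your reference annotations already use NIST-style formats.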

Related

Speech rate detection in python

I need to detect the speech rate (the speed of spoken words) in an audio file. Most of the available code, including pyAudioAnalysis etc., provides the sampling rate, silence detection, or even emotion detection.
What I need is to know how fast the speaker is speaking. Can anyone suggest some code or a technique, please?
I worked with speech-to-text, but there are two main problems:
Not all the words produced by the engine are correct.
There can be long pauses in the audio that don't help with detecting the speech rate.
I was working with the PRAAT software, and there is a Python interface for it (https://github.com/YannickJadoul/Parselmouth). A detailed explanation of the procedure is given here.
There is a Praat script for detecting speech rate (https://sites.google.com/site/speechrate/Home/praat-script-syllable-nuclei-v2), and using Parselmouth we can run that script from Python. If you are OK with using the PRAAT software directly, a step-by-step tutorial is available at https://sites.google.com/site/speechrate/Home/tutorial.
The script returns the number of syllables, number of pauses, duration, speech rate, articulation rate, and ASD (speaking time / number of syllables).
Reference paper: https://www.researchgate.net/publication/24274554_Praat_script_to_detect_syllable_nuclei_and_measure_speech_rate_automatically
Also check https://github.com/Shahabks/myprosody, which could work as well.
Hope this helps.
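For what it's worth, a minimal sketch of driving such a Praat script from Python via Parselmouth is shown below. The script file name, its argument list, and the audio path are assumptions; they must be adapted to the exact version of the syllable-nuclei script you download.

```python
# Minimal sketch: running a Praat speech-rate script via Parselmouth.
# "syllable_nuclei.praat" is a hypothetical local copy of the script linked
# above; the argument order is an assumption and must match the form defined
# at the top of the script you actually use.
from parselmouth.praat import run_file

result = run_file(
    "syllable_nuclei.praat",  # local copy of the Praat script (hypothetical name)
    -25,                      # silence threshold (dB)
    2,                        # minimum dip between peaks (dB)
    0.3,                      # minimum pause duration (s)
    "recording.wav",          # audio file to analyse (hypothetical path)
    capture_output=True,      # also return the text the script prints
)
# The captured text contains the syllable count, pauses, speech rate,
# articulation rate, and ASD reported by the script.
print(result)
```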

Is it possible to, as accurately as possible, decompose an audio into MIDI, given the SoundFont that was used?

If I know the SoundFont that a MIDI-to-audio track has used, can I theoretically reverse the audio back into its (most likely) MIDI components? If so, what would be one of the best approaches to doing this?
The end goal is to try encoding audio (even voice samples) into MIDI such that I can reproduce the original audio in MIDI format better than, say, BearFileConverter. Hopefully with better results than just bandpass filters or FFT.
And no, this is not for any lossy audio compression or sheet transcription, this is mostly for my curiosity.
For monophonic music only, with no background sound, and if your SoundFont synthesis engine and your recording sample rates are exactly matched (synchronized to 1 ppm or better, no additional effects, and both using a known A440 reference frequency, known intonation, etc.), then you can try using a set of cross-correlations of your recorded audio against synthesized waveform samples at each MIDI pitch from your a-priori known font, to create a timeline of statistical likelihoods for each MIDI note. Find the local maxima across your pitch range, threshold, and peak-pick to find the most likely MIDI note onset times.
Another possibility is sliding sound fingerprinting, but at an even higher computational cost.
This fails in real life due to imperfectly matched sample rates plus added noise, speaker and room acoustic effects, multipath reverb, etc. You might also get false positives for note waveforms that are very similar to their own overtones. Voice samples vary even more from any template.
Forget bandpass filters or looking for FFT magnitude peaks, as that works reliably only for nearly pure sine waves, which very few musical instruments or interesting fonts sound like (or are as boring as).
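As a rough illustration of the cross-correlation idea above, the sketch below correlates a recording against one synthesized note template and peak-picks the result. Generating templates from the actual SoundFont is assumed to happen elsewhere; a plain sine is used here as a stand-in, which real fonts will not match.

```python
# Rough sketch of the cross-correlation approach: slide one synthesized note
# template over the recording and look for strong correlation peaks.
# NOTE: the sine "template" is a stand-in for a real SoundFont render.
import numpy as np
from scipy.signal import correlate, find_peaks

SR = 44100  # sample rate; must match the synthesis engine exactly

def note_template(midi_pitch, duration=0.2):
    """Stand-in template: a sine at the MIDI pitch (replace with a SoundFont render)."""
    f = 440.0 * 2.0 ** ((midi_pitch - 69) / 12.0)
    t = np.arange(int(SR * duration)) / SR
    return np.sin(2 * np.pi * f * t)

def likely_onsets(recording, midi_pitch, threshold=0.5):
    """Return sample indices where the normalized cross-correlation peaks."""
    template = note_template(midi_pitch)
    corr = correlate(recording, template, mode="valid")
    corr /= np.max(np.abs(corr)) + 1e-12            # normalize to [-1, 1]
    peaks, _ = find_peaks(corr, height=threshold,
                          distance=int(0.05 * SR))  # keep peaks >= 50 ms apart
    return peaks

# Example: scan a recording (1-D float array at SR) for likely middle-C onsets.
# onsets = likely_onsets(audio, midi_pitch=60)
```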

What is the ideal audio level for Sphinx?

On my system, using my USB microphone, I've found that the audio level that works best with CMU Sphinx is about 20% of the maximum. This gives me 75% voice recognition accuracy. If I amplify this digitally I get far worse recognition accuracy (25%). Why is this? What is the recommended audio level for Sphinx? [Also I am using 16,000 samples/sec, 16-bit.]
The pocketsphinx decoder uses channel amplitude normalization (cepstral mean normalization). The initial normalization value inside the model is indeed configured for roughly a 20% audio level (the -cmninit parameter in feat.params). However, the level is updated as you decode, so it only affects the first utterance. If you decode properly in continuous mode, the level should not matter. Do not restart the recognizer for every utterance; let it adapt to the noise and audio level.
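To make the "continuous mode" advice concrete, here is a minimal sketch that keeps a single decoder alive across the whole audio stream so the normalization estimate can adapt. The model and audio paths are placeholders, and the Decoder/Config calls follow the classic python-pocketsphinx bindings, which may differ in newer releases.

```python
# Minimal sketch of continuous decoding with the classic python-pocketsphinx
# bindings, keeping one Decoder alive across the whole stream so the cepstral
# mean (CMN) estimate keeps adapting. Model and audio paths are placeholders.
from pocketsphinx import Decoder

config = Decoder.default_config()
config.set_string('-hmm', 'model/en-us')                # acoustic model (placeholder)
config.set_string('-lm', 'model/en-us.lm.bin')          # language model (placeholder)
config.set_string('-dict', 'model/cmudict-en-us.dict')  # dictionary (placeholder)
decoder = Decoder(config)

# 16 kHz, 16-bit, mono raw PCM, as in the question.
with open('audio_16k_16bit_mono.raw', 'rb') as f:
    decoder.start_utt()
    while True:
        buf = f.read(1024)
        if not buf:
            break
        # Feed audio incrementally; do NOT recreate the decoder per utterance,
        # otherwise CMN falls back to -cmninit every time.
        decoder.process_raw(buf, False, False)
    decoder.end_utt()

if decoder.hyp() is not None:
    print(decoder.hyp().hypstr)
```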

How can audio data be abstracted for comparison purposes?

I am working on a project involving machine learning and data comparison.
For the purpose of this project, I am feeding abstracted video data to a neural network.
Now, abstracting image data is quite simple. I can take still-frames at certain points in the video, scale them down into 5 by 5 pixels (or any other manageable resolution) and get the pixel values for analysis.
The resulting data gives a unique, small and somewhat data-rich sample (even 5 samples of 5x5 px are enough to distinguish a drama from a nature documentary, etc).
However, I am stuck on the audio part. Since audio consists of samples and each sample by itself has no inherent meaning, I can't find a way to abstract audio down into processable blocks.
Are there common techniques for this process? If not, by what metrics can audio data be quantified and abstracted?
The process you require is audio feature extraction. A large number of feature detection algorithms exist, usually specialising in signals that are music or speech.
For music, chroma, rhythm, and harmonic distribution are all features you might extract, along with many more.
Typically, audio feature extraction algorithms work at a fairly macro level - that is to say thousands of samples at a time.
A good place to get started is Sonic Visualiser, which is a plug-in host for audio visualisation algorithms, many of which are feature extractors.
YAAFE may also have some useful stuff in it.
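As one concrete way to reduce frame-level features to a fixed-size block, here is a minimal sketch using librosa (an assumed library choice, not mentioned above); the particular features and the mean/std pooling are assumptions, not the only option.

```python
# Minimal sketch of macro-level audio feature extraction with librosa
# (an assumed library choice): turn a clip's raw samples into one compact,
# fixed-length vector that a neural network can consume.
import numpy as np
import librosa

def audio_features(path, n_mfcc=13):
    """Summarise a clip as the mean and std of frame-level MFCC and chroma features."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # timbre / spectral envelope
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)        # harmonic / pitch-class content
    frames = np.vstack([mfcc, chroma])                      # shape: (n_features, n_frames)
    # Collapse the time axis so every clip yields the same-sized descriptor.
    return np.concatenate([frames.mean(axis=1), frames.std(axis=1)])

# vec = audio_features("clip.wav")  # e.g. a 50-dim vector for 13 MFCC + 12 chroma
```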

Speaker Recognition [closed]

How could I differentiate between two people speaking? As in, if someone says "hello" and then another person says "hello", what kind of signature should I be looking for in the audio data? Periodicity?
Thanks a lot to anyone who can answer this!
The solution to this problem lies in Digital Signal Processing (DSP). Speaker recognition is a complex problem that brings computing and communication engineering to work hand in hand. Most speaker-identification techniques combine signal processing with machine learning (training over a speaker database, then identification using the training data). An outline of the algorithm that may be followed:
Record the audio in a raw format. This serves as the digital signal that needs to be processed.
Apply some pre-processing routines to the captured signal. These could be simple signal normalization, or filtering to remove noise (using a band-pass filter for the normal frequency range of the human voice; a band-pass filter can in turn be built by combining a low-pass and a high-pass filter).
Once it is fairly certain that the captured signal is largely free of noise, the feature extraction phase begins. Some well-known techniques for extracting voice features are Mel-Frequency Cepstral Coefficients (MFCC), Linear Predictive Coding (LPC), or simple FFT features.
Now there are two phases: training and testing.
First, the system needs to be trained on the voice features of different speakers before it can distinguish between them. To ensure that the features are calculated reliably, it is recommended to collect several (>10) voice samples from each speaker for training.
Training can be done using different techniques, such as neural networks or distance-based classification, to find the differences between the voice features of different speakers.
In the testing phase, the training data is used to find the voice feature set that lies at the lowest distance from the signal being tested. Different distances, such as Euclidean or Chebyshev distance, can be used to calculate this proximity (a minimal sketch of this approach is shown below).
Two open-source implementations that enable speaker identification are ALIZE (http://mistral.univ-avignon.fr/index_en.html) and MARF (http://marf.sourceforge.net/).
I know it's a bit late to answer this question, but I hope someone finds it useful.
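A minimal sketch of the MFCC + distance-based classification outline above, using librosa for feature extraction (an assumption; any MFCC implementation would do). Each speaker is modelled as the mean MFCC vector of their training clips, and a test clip is assigned to the nearest model.

```python
# Minimal sketch of MFCC features + distance-based classification, using
# librosa for feature extraction (an assumption; any MFCC implementation works).
import numpy as np
import librosa

def mfcc_vector(path, n_mfcc=20):
    """Average the frame-level MFCCs of one recording into a single vector."""
    y, sr = librosa.load(path, sr=16000, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)

def train(speaker_files):
    """speaker_files: dict mapping speaker name -> list of training WAV paths."""
    return {name: np.mean([mfcc_vector(p) for p in paths], axis=0)
            for name, paths in speaker_files.items()}

def identify(models, test_path):
    """Return the speaker whose model lies at the lowest Euclidean distance."""
    v = mfcc_vector(test_path)
    return min(models, key=lambda name: np.linalg.norm(models[name] - v))

# models = train({"alice": ["alice1.wav", "alice2.wav"],
#                 "bob":   ["bob1.wav", "bob2.wav"]})
# print(identify(models, "unknown.wav"))
```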
This is an extremely hard problem, even for experts in speech and signal processing. This page has much more information: http://en.wikipedia.org/wiki/Speaker_recognition
And some suggested technology starting points:
The various technologies used to process and store voice prints include frequency estimation, hidden Markov models, Gaussian mixture models, pattern matching algorithms, neural networks, matrix representation, vector quantization, and decision trees. Some systems also use "anti-speaker" techniques, such as cohort models and world models.
Having only two people to differentiate, especially if they are uttering the same word or phrase, will make this much easier. I suggest starting with something simple and only adding complexity as needed.
To begin, I'd try sample counts of the digital waveform, binned by time and magnitude, or (if you have the software functionality handy) an FFT of the entire utterance. I'd also consider a basic modeling process first, such as a linear discriminant (or whatever you already have available).
Another way to go is to use an array of microphones and differentiate between the positions and directions of the vocal sources. I consider this to be an easier approach, since the position calculation is much less complicated than separating different speakers from a mono or stereo source.
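A rough sketch of that microphone-array idea: estimate where a voice comes from via the time delay of arrival between two synchronized channels. The microphone spacing and sample rate below are assumptions.

```python
# Rough sketch of the microphone-array idea: estimate the direction of a voice
# from the time delay between two synchronized microphones (cross-correlation
# TDOA). Microphone spacing and sample rate are assumptions.
import numpy as np

def tdoa_seconds(mic_a, mic_b, sr):
    """Time delay of arrival between two synchronized channels, in seconds."""
    corr = np.correlate(mic_a, mic_b, mode="full")
    lag = np.argmax(corr) - (len(mic_b) - 1)   # lag in samples (can be negative)
    return lag / sr

def bearing_degrees(delay_s, mic_spacing_m=0.2, speed_of_sound=343.0):
    """Convert the delay into a rough arrival angle for a two-microphone array."""
    x = np.clip(delay_s * speed_of_sound / mic_spacing_m, -1.0, 1.0)
    return float(np.degrees(np.arcsin(x)))

# If speaker A consistently shows up near +30 degrees and speaker B near -20,
# the bearing alone separates them without modelling their voices at all.
```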
