How does Google Speech-to-Text work? - audio

I would like to know how Google converts speech to text in their Speech Recognition API.
Have they stored almost all sounds and matched against them at a particular frequency level, or do they have some audio encoding and decoding algorithm that analyses the voice for different sound patterns such as "A", "The", "B", "V", "D", "Hello", etc.?
It would also be great if someone could share how audio is encoded and how stored audio can be filtered into its different sounds. For example: given music containing the sound of a guitar, drums, and a voice, I would like to filter it into three outputs, with the guitar separately, the drums separately, and the voice separately, and then further decode the voice to text.
Any documentation link or research paper for university would be great.
Thanks

The Google speech recognizer is described here. To understand it, you will probably first need to read a textbook such as Automatic Speech Recognition: A Deep Learning Approach.
Separation of guitar and drums is usually implemented with Non-Negative Matrix Factorization (NMF).
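To make the NMF idea concrete, here is a minimal sketch using scikit-learn's `NMF` on the magnitude spectrogram of a synthetic two-tone mix. It is a toy under stated assumptions, not a production separator: real music needs many more components, better initialization, and post-processing.

```python
import numpy as np
from scipy.signal import stft, istft
from sklearn.decomposition import NMF

# Synthetic mix of two "instruments": a steady 440 Hz tone and gated 880 Hz bursts.
sr = 8000
t = np.arange(2 * sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 440 * t)
bursts = 0.5 * np.sin(2 * np.pi * 880 * t) * (np.sin(2 * np.pi * 2 * t) > 0)
mix = tone + bursts

# STFT -> magnitude spectrogram (NMF requires non-negative input).
f, frames, Z = stft(mix, fs=sr, nperseg=512)
mag, phase = np.abs(Z), np.angle(Z)

# Factor: W holds spectral templates (one per source), H their activations in time.
model = NMF(n_components=2, init="random", random_state=0, max_iter=500)
W = model.fit_transform(mag)   # shape (freq_bins, 2)
H = model.components_          # shape (2, time_frames)

# Rebuild each source with a soft mask and the mixture's phase.
sources = []
for k in range(2):
    mask = np.outer(W[:, k], H[k]) / (W @ H + 1e-9)
    _, src = istft(mask * mag * np.exp(1j * phase), fs=sr, nperseg=512)
    sources.append(src)
```

The soft mask (each component's share of the factored spectrogram) is a common way to turn NMF components back into audible signals while reusing the mixture's phase.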

Related

Algorithm to change speech pitch

I'm looking for a way to heighten the pitch of recorded speech audio.
I'd like to change the pitch only at the end of the speech, to create a sort of "up speak".
What are the typical algorithms to do this?
Thanks.
PSOLA (Pitch Synchronous Overlap and Add) is a digital signal processing technique used for speech processing and more specifically speech synthesis. It can be used to modify the pitch and duration of a speech signal.
Example code:
https://github.com/joaocarvalhoopen/Pitch_Shifter_using_PSOLA_algorithm
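PSOLA proper needs pitch-mark detection, which the linked repository handles. As a toy illustration of the same stretch-then-resample family of techniques, here is a synchronized overlap-add (SOLA) sketch in Python; it is a simplified stand-in, not PSOLA itself:

```python
import numpy as np

def sola_stretch(x, rate, win=1024, hop=256, search=150):
    """Synchronized overlap-add time stretch.  rate < 1 lengthens the signal,
    rate > 1 shortens it.  Each new grain is slid by up to +/-search samples
    to line up with what is already written, avoiding phase cancellation."""
    window = np.hanning(win)
    n_frames = max(1, int((len(x) - win - 2 * search) / (hop * rate)))
    out = np.zeros(n_frames * hop + win)
    norm = np.full_like(out, 1e-8)
    for i in range(n_frames):
        src = int(i * hop * rate) + search      # leave room to slide left
        if i:
            ref = out[i * hop:i * hop + hop] / norm[i * hop:i * hop + hop]
            src += max(range(-search, search),
                       key=lambda d: np.dot(ref, x[src + d:src + d + hop]))
        out[i * hop:i * hop + win] += x[src:src + win] * window
        norm[i * hop:i * hop + win] += window
    return out / norm

def pitch_shift(x, semitones):
    """Shift pitch while roughly preserving duration: stretch, then resample."""
    r = 2.0 ** (semitones / 12.0)
    stretched = sola_stretch(x, rate=1.0 / r)    # r times longer
    idx = np.arange(0, len(stretched) - 1, r)    # resample back: pitch * r
    return np.interp(idx, np.arange(len(stretched)), stretched)

sr = 8000
t = np.arange(sr) / sr
voice = np.sin(2 * np.pi * 220 * t)    # stand-in for a recorded phrase
shifted = pitch_shift(voice, 12)       # raise by one octave -> ~440 Hz
```

For "up speak", apply the shift only to the last few hundred milliseconds of the recording and crossfade with the unshifted portion.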

How to compare / match two non-identical sound clips

I need to take short sound samples every 5 seconds and then upload these to our cloud server.
I then need a way to check whether that sample is part of a longer audio file.
The samples will be recorded from a phone's microphone, so they will not be exact.
I know this topic can get quite technical and complex, but I am sure there must be libraries or online services that can assist with this kind of audio matching / pairing.
One idea was to use an audio-to-text conversion service and then match based on the actual dialog. However, that does not feel efficient to me, whereas matching based on actual sound frequencies or patterns should be much more efficient.
I know there are services out there, such as Shazam, that do this type of audio matching. However, I would imagine their services are all proprietary.
One factor that could help: both audio samples will be timestamped, so we do not have to search through the entire sound clip.
To get traction on an answer you need to focus on an answerable question where you have done battle and can show your code.
Off the top of my head, I would walk across the audio and pluck out a bucket of several samples, then slide the bucket forward and pluck another, allowing each bucket to overlap with the previous and the next. Fewer samples per bucket means quicker computation; more samples means greater accuracy, up to a point (YMMV).
Feed each bucket into a Fourier transform to render the time-domain audio into its frequency-domain counterpart, then record into a database the salient attributes of each bucket's FFT, such as the X frequencies with the most energy (greatest magnitude in the FFT).
Perhaps also store the standard deviation of those top X frequencies with respect to their energy (how dispersed those frequencies are), and define additional attributes as needed. For such a frequency-domain approach to work you need relatively few samples in each bucket, since the FFT assumes periodic time-series data; if you feed it 500 milliseconds of complex audio like speech or music, the input is no longer periodic and you get mush.
Once all existing audio has been through the above processing, do the same to your live audio, then identify which prior audio contains the sequence of buckets most similar to your current input. Use a Bayesian approach so your guesses carry probabilistic weights, which lend themselves to real-time updates.
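The bucket-and-FFT recipe above can be sketched roughly as follows. Treat it as toy code: production fingerprinters like Shazam hash constellations of spectral peaks rather than storing raw top frequencies.

```python
import numpy as np

def fingerprint(signal, sr, win=2048, hop=1024, top_x=5):
    """Slide an overlapping "bucket" across the audio, FFT it, and keep the
    top-X strongest frequencies plus their spread as the bucket's signature."""
    freqs = np.fft.rfftfreq(win, d=1.0 / sr)
    window = np.hanning(win)
    prints = []
    for start in range(0, len(signal) - win + 1, hop):
        mag = np.abs(np.fft.rfft(signal[start:start + win] * window))
        top = np.argsort(mag)[-top_x:]                 # strongest bins
        prints.append({"top_freqs": set(freqs[top].round(1)),
                       "spread": float(np.std(freqs[top]))})
    return prints

def similarity(a, b):
    """Fraction of aligned buckets sharing at least one top frequency."""
    hits = sum(bool(x["top_freqs"] & y["top_freqs"]) for x, y in zip(a, b))
    return hits / max(1, min(len(a), len(b)))

sr = 8000
tone = np.sin(2 * np.pi * 440 * np.arange(2 * sr) / sr)
fp = fingerprint(tone, sr)
```

To match a live sample, slide its fingerprint along the stored fingerprints and keep the offset with the highest similarity.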
Sounds like a very cool project, good luck. Here are some audio fingerprint resources:
does audio clip A appear in audio file B
Detecting audio inside audio [Audio Recognition]
Detecting a specific pattern from a FFT in Arduino
Audio Fingerprinting using the AudioContext API
https://news.ycombinator.com/item?id=21436414
https://iq.opengenus.org/audio-fingerprinting/
Chromaprint is the core component of the AcoustID project.
It's a client-side library that implements a custom algorithm for extracting fingerprints from any audio source
https://acoustid.org/chromaprint
Audio landmark fingerprinting as a Node Stream module - nodejs converts a PCM audio signal into a series of audio fingerprints.
https://github.com/adblockradio/stream-audio-fingerprint
SO followup
How to compare / match two non-identical sound clips
Audio fingerprinting and recognition in Python
https://github.com/worldveil/dejavu
Audio Fingerprinting with Python and Numpy
http://willdrevo.com/fingerprinting-and-audio-recognition-with-python/
MusicBrainz: an open music encyclopedia (musicbrainz.org)
https://news.ycombinator.com/item?id=14478515
How does Chromaprint work?
https://oxygene.sk/2011/01/how-does-chromaprint-work/
https://acoustid.org/
MusicBrainz is an open music encyclopedia that collects music metadata and makes it available to the public.
https://musicbrainz.org/
Audio Matching (Audio Fingerprinting)
Is it possible to compare two similar songs given their wav files?
audio hash
https://en.wikipedia.org/wiki/Hash_function#Finding_similar_records
audio fingerprint
https://encrypted.google.com/search?hl=en&pws=0&q=python+audio+fingerprinting
ACRCloud
https://www.acrcloud.com/
How to recognize a music sample using Python and Gracenote?

Comparing voice input with existing audio sources

I'm currently working on a script that would compare audio input with existing audio sources and return a match if any.
The idea is that the voice input would not be convertible to text. These would be vocalizations such as a dog sound ("woof") or a cat sound ("meow").
In the end, I would like the script to conclude whether the input was a cat sound, a dog sound, or neither.
I understand that it would require pre-processing the sound input (low-pass filtering, noise reduction, etc.), then doing a spectrum analysis of the sound before comparing it to the existing spectrum analyses in the DB, but I don't know where to start.
Are there any libraries for this kind of small project that could help?
How do I compare spectrum analyses?
How does spectrum-analysis comparison account for the possibility that two different people could make the same meow sound? Does it match up to a specific percentage?
Thanks for any guidance regarding this matter.
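As a starting point, one simple (and admittedly crude) approach is to reduce each clip to an average log-magnitude spectrum and compare signatures with cosine similarity; the similarity score directly gives a "match percentage", and a threshold decides "neither". The reference sounds below are synthetic placeholders; a real system would use MFCC features and a trained classifier.

```python
import numpy as np
from scipy.signal import stft

def spectral_signature(x, sr, nperseg=512):
    """Average log-magnitude spectrum: a crude, length-independent signature."""
    _, _, Z = stft(x, fs=sr, nperseg=nperseg)
    return np.log1p(np.abs(Z)).mean(axis=1)

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# Toy "reference DB" (synthetic stand-ins for stored meow/woof analyses).
sr = 8000
t = np.arange(sr) / sr
meow_ref = np.sin(2 * np.pi * 700 * t)   # placeholder high-pitched call
woof_ref = np.sin(2 * np.pi * 150 * t)   # placeholder low-pitched call
query = np.sin(2 * np.pi * 690 * t)      # a slightly different "meow"

db = {"cat": spectral_signature(meow_ref, sr),
      "dog": spectral_signature(woof_ref, sr)}
q = spectral_signature(query, sr)
scores = {label: cosine_similarity(q, sig) for label, sig in db.items()}
best = max(scores, key=scores.get)   # threshold scores[best] to allow "neither"
```

Averaging over time is what tolerates different speakers making the same call: two meows at slightly different pitches still concentrate energy in nearby frequency bins, so their signatures stay correlated.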

Transforming speechUtterance into a Wave form

I've been researching far and wide for a simple solution to animate the speech being played by WebKit's speechUtterance, but it seems there is no library for that.
The question is: can I join the visualization of audio with that of a speech utterance? All I can currently find is that the AudioContext requires some sort of data stream, which is not what speechUtterance outputs.

Compare a source audio and microphone input based on intonation

I am working on a language learning app geared toward intermediate learners. One of the things I want to do is help learners with intonation.
My idea is to:
play an audio clip along with its pitch contour
as learners shadow the audio, display the pitch contour of the microphone input on top of the audio's pitch contour as feedback to the user
compare the microphone input with the audio to produce a score.
I'm not interested in pronunciation or voice recognition - they could be whistling for all I care, as long as the intonation is good.
I've done a bit of online research (mostly with "pitch contour" and "intonation comparison" as keywords) and these are the closest I have found:
https://github.com/danafallon/IntonationCoach
https://github.com/melizalab/chirp
Based on my search there isn't anything ready-made out there but if anyone can point out what other keywords to search for, that would be appreciated.
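For the pitch-contour part, a plain autocorrelation pitch tracker plus a shape-only comparison (so a learner with a deeper voice is not penalized) can be sketched like this. It is a toy baseline under stated assumptions, not a substitute for a robust tracker such as pYIN:

```python
import numpy as np

def pitch_contour(x, sr, win=1024, hop=256, fmin=60, fmax=500):
    """Per-frame pitch estimate via autocorrelation: pick the lag with the
    strongest self-similarity inside the plausible speaking-pitch range."""
    lag_min, lag_max = int(sr / fmax), int(sr / fmin)
    contour = []
    for start in range(0, len(x) - win + 1, hop):
        frame = x[start:start + win]
        frame = frame - frame.mean()
        ac = np.correlate(frame, frame, mode="full")[win - 1:]
        lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))
        contour.append(sr / lag)
    return np.array(contour)

def contour_score(ref, user):
    """Compare contour shapes, not absolute pitch: z-normalize each contour,
    then correlate, so overall voice depth does not affect the score."""
    n = min(len(ref), len(user))
    a = (ref[:n] - ref[:n].mean()) / (ref[:n].std() + 1e-9)
    b = (user[:n] - user[:n].mean()) / (user[:n].std() + 1e-9)
    return float(np.mean(a * b))   # 1.0 means identical shape

# Demo: a rising "question" intonation, sweeping 100 Hz up to 300 Hz.
sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * np.cumsum(100 + 200 * t) / sr)
contour = pitch_contour(x, sr)
```

For real speech you would also need voicing detection (skip unvoiced frames) and some time alignment, e.g. dynamic time warping, before scoring.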

Resources