Transforming speechUtterance into a waveform - audio

I've been researching far and wide for a simple solution to animate the speech being played by WebKit's speechUtterance, but it seems there is no library for that.
Now the question is: can I join the visualization of audio with that of a speech utterance? All I can currently find is that the AudioContext requires some sort of data stream, which is not what SpeechSynthesisUtterance outputs.

Related

Does converting from mulaw to linear impact audio quality?

I want to change audio encoding from mulaw to linear in order to use a linear speech recognition model from Google.
I'm using a telephony channel, so the audio is encoded in mulaw, 8 bits, 8000 Hz.
When I use Google's mulaw model, there are some issues with recognizing short single words: basically they are not recognized at all, and the API returns None.
I was wondering if it is good practice to change the encoding to LINEAR16 or FLAC?
I already did it, but I cannot really measure the degree of improvement.
It is always best practice to use either LINEAR16 for headerless audio data or FLAC for headered audio data; both are lossless codecs. It is good practice to set the sampling rate to 16000 Hz; otherwise, set sample_rate_hertz to match the native sample rate of the audio source (instead of re-sampling). Since the Google Speech-to-Text API provides various ways to improve audio quality, you can use word-level confidence to measure the accuracy of the response.
Ideally the audio would be recorded from the start using a lossless codec like LINEAR16 or FLAC. But once you have it in a format like mulaw, transcoding it before sending it to Google Speech-to-Text is not helpful.
Consider using model=phone_call and use_enhanced=true for better telephony quality.
For quick experimentation you can use the STT UI: https://cloud.google.com/speech-to-text/docs/ui-overview.
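Putting the advice above together, a minimal sketch of a recognition request for 8 kHz mulaw telephony audio with the google-cloud-speech Python client might look like the following; the audio bytes and language code are placeholders, the mulaw payload is left in its native encoding and rate, the enhanced phone_call model is selected, and word-level confidence is enabled:

    from google.cloud import speech

    client = speech.SpeechClient()

    # Telephony audio: keep the native mulaw/8000 Hz encoding, pick the
    # enhanced phone_call model, and ask for word-level confidence.
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.MULAW,
        sample_rate_hertz=8000,
        language_code="en-US",            # placeholder language
        model="phone_call",
        use_enhanced=True,
        enable_word_confidence=True,
    )
    audio = speech.RecognitionAudio(content=mulaw_bytes)  # placeholder raw mulaw payload

    response = client.recognize(config=config, audio=audio)
    for result in response.results:
        best = result.alternatives[0]
        print(best.transcript, best.confidence)
        for word in best.words:
            print(word.word, word.confidence)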

Algorithm to change speech pitch

I'm looking for a way to heighten the pitch of recorded speech audio.
I'd like to change the pitch only at the end of the speech, to create a sort of "up speak".
What are the typical algorithms to do this?
Thanks.
PSOLA (Pitch Synchronous Overlap and Add) is a digital signal processing technique used for speech processing and more specifically speech synthesis. It can be used to modify the pitch and duration of a speech signal.
Example code: https://github.com/joaocarvalhoopen/Pitch_Shifter_using_PSOLA_algorithm
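If you just want to prototype the "up speak" effect before implementing PSOLA yourself, a minimal sketch (not PSOLA, but librosa's phase-vocoder-based pitch_shift) could raise only the tail of the recording and crossfade it back in; the file names, tail length and shift amount are placeholders:

    import numpy as np
    import librosa
    import soundfile as sf

    y, sr = librosa.load("speech.wav", sr=None)      # placeholder input file

    tail_seconds = 0.5                                # portion of the ending to raise
    n_tail = int(tail_seconds * sr)
    head, tail = y[:-n_tail], y[-n_tail:]

    # Shift the ending up by two semitones.
    tail_up = librosa.effects.pitch_shift(tail, sr=sr, n_steps=2.0)

    # Short crossfade so the pitch change does not click at the boundary.
    fade = int(0.02 * sr)
    ramp = np.linspace(0.0, 1.0, fade)
    tail_up[:fade] = (1.0 - ramp) * tail[:fade] + ramp * tail_up[:fade]

    sf.write("speech_upspeak.wav", np.concatenate([head, tail_up]), sr)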

How to compare / match two non-identical sound clips

I need to take short sound samples every 5 seconds, and then upload these to our cloud server.
I then need to find a way to compare / check if that sample is part of a full long audio file.
The samples will be recorded from a phones microphone, so they will indeed not be exact.
I know this topic can get quite technical and complex, but I am sure there must be some libraries or online services that can assist in this complex audio matching / pairing.
One idea was to use an audio-to-text conversion service and then do the matching based on the actual dialog. However, this does not feel efficient to me, whereas matching based on actual sound frequencies or patterns would be a lot more efficient.
I know there are services out there such as Shazam that do this type of audio matching. However, I would imagine their services are all proprietary.
Some factors that could influence it:
Both audio samples will be timestamped, so we do not have to search through the entire sound clip.
To get traction on an answer you need to focus on an answerable question where you have done battle, and show your code.
Off the top of my head, I would walk across the audio and pluck out a bucket of several samples, then slide the bucket forward by several samples and pluck again, letting each bucket overlap with the previous and next buckets. Fewer samples per bucket means quicker computation; more samples means greater accuracy, up to a point (YMMV).
Feed each bucket into a Fourier Transform to render the time-domain input audio into its frequency-domain counterpart, and record into a database the salient attributes of each bucket's FFT, such as the X frequencies with the most energy (greatest magnitude in the FFT).
Perhaps also store the standard deviation of those top X frequencies with respect to their energy (how dispersed those frequencies are), and define additional attributes as needed. For such a frequency-domain approach to work you need relatively few samples in each bucket, since the FFT assumes periodic time-series data: if you feed it 500 milliseconds of complex audio like speech or music you no longer have periodic audio, you have mush.
Then, once all existing audio has been sent through the above processing, do the same to your live new audio and identify which prior audio contains the sequence of buckets most similar to your current input. Use a Bayesian approach so your guesses have probabilistic weights attached, which lend themselves to real-time updates. A sketch of the bucketing step appears below.
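A minimal sketch of that bucketing step, assuming numpy and a mono signal already loaded as an array; the window size, hop length and number of peaks kept are placeholders to tune:

    import numpy as np

    def bucket_fingerprints(samples, sample_rate, win=2048, hop=1024, top_k=5):
        """Slide overlapping windows over the audio and keep a few salient
        FFT attributes per window as a crude fingerprint."""
        fingerprints = []
        window = np.hanning(win)
        for start in range(0, len(samples) - win, hop):
            bucket = samples[start:start + win] * window
            spectrum = np.abs(np.fft.rfft(bucket))
            freqs = np.fft.rfftfreq(win, d=1.0 / sample_rate)
            top = np.argsort(spectrum)[-top_k:]           # strongest bins
            fingerprints.append({
                "time": start / sample_rate,
                "peak_freqs": freqs[top].tolist(),         # X loudest frequencies
                "peak_spread": float(np.std(freqs[top])),  # how dispersed they are
            })
        return fingerprints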
Sounds like a very cool project, good luck. Here are some audio fingerprint resources:
does audio clip A appear in audio file B
Detecting audio inside audio [Audio Recognition]
Detecting a specific pattern from a FFT in Arduino
Audio Fingerprinting using the AudioContext API
https://news.ycombinator.com/item?id=21436414
https://iq.opengenus.org/audio-fingerprinting/
Chromaprint is the core component of the AcoustID project.
It's a client-side library that implements a custom algorithm for extracting fingerprints from any audio source
https://acoustid.org/chromaprint
Detecting a specific pattern from a FFT
Detecting a specific pattern from a FFT in Arduino
Audio landmark fingerprinting as a Node Stream module - nodejs converts a PCM audio signal into a series of audio fingerprints.
https://github.com/adblockradio/stream-audio-fingerprint
SO followup
How to compare / match two non-identical sound clips
Audio fingerprinting and recognition in Python
https://github.com/worldveil/dejavu
Audio Fingerprinting with Python and Numpy
http://willdrevo.com/fingerprinting-and-audio-recognition-with-python/
MusicBrainz: an open music encyclopedia (musicbrainz.org)
https://news.ycombinator.com/item?id=14478515
https://acoustid.org/chromaprint
How does Chromaprint work?
https://oxygene.sk/2011/01/how-does-chromaprint-work/
https://acoustid.org/
MusicBrainz is an open music encyclopedia that collects music metadata and makes it available to the public.
https://musicbrainz.org/
Audio Matching (Audio Fingerprinting)
Is it possible to compare two similar songs given their wav files?
audio hash
https://en.wikipedia.org/wiki/Hash_function#Finding_similar_records
audio fingerprint
https://encrypted.google.com/search?hl=en&pws=0&q=python+audio+fingerprinting
ACRCloud
https://www.acrcloud.com/
How to recognize a music sample using Python and Gracenote?

Compare source audio and microphone input based on intonation

I am working on a language learning app geared toward intermediate learners. One of the things I want to do is help learners with intonation.
My idea is to:
play an audio clip with its pitch contour
as learners shadow the audio, display the pitch contour of the microphone input on top of the audio's pitch contour as feedback to the user
compare the microphone input with the audio to produce a score.
I'm not interested in pronunciation or voice recognition - they could be whistling for all I care, as long as the intonation is good.
I've done a bit of online research (mostly with "pitch contour" and "intonation comparison" keywords) and these are the closest I have found:
https://github.com/danafallon/IntonationCoach
https://github.com/melizalab/chirp
Based on my search there isn't anything ready-made out there but if anyone can point out what other keywords to search for, that would be appreciated.
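Lacking a ready-made library, one way to prototype the comparison step is to extract an f0 contour for each recording and align the two with dynamic time warping, so the score tolerates differences in speaking rate. A minimal sketch, assuming librosa; the file names, frequency range and scoring rule are placeholders:

    import numpy as np
    import librosa

    def pitch_contour(path):
        """Return the voiced pitch contour of a recording in semitones."""
        y, sr = librosa.load(path, sr=16000)
        f0, voiced, _ = librosa.pyin(y, fmin=65.0, fmax=500.0, sr=sr)
        semitones = 12.0 * np.log2(f0 / 55.0)   # log scale downplays register differences
        return semitones[voiced]                 # keep only voiced frames

    ref = pitch_contour("reference.wav")         # placeholder source audio
    mic = pitch_contour("learner.wav")           # placeholder microphone capture

    # Compare the shape of the contours, not the absolute pitch,
    # by removing each speaker's mean pitch before alignment.
    ref -= np.nanmean(ref)
    mic -= np.nanmean(mic)

    # Dynamic time warping aligns the two contours despite tempo differences.
    D, wp = librosa.sequence.dtw(X=ref.reshape(1, -1), Y=mic.reshape(1, -1))
    score = D[-1, -1] / len(wp)                  # mean per-step cost; lower means closer
    print(f"intonation distance: {score:.2f}")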

How does Google Speech to Text work?

I would like to know how Google converts speech to text in their Speech Recognition API.
Have they stored almost all sounds and matched against them at a particular frequency level, or do they have some different audio encoding and decoding algorithm that analyses the voice for different sound patterns like "A", "The", "B", "V", "D", "Hello", etc.?
It would also be great if someone could share how the audio is encoded and how stored audio can be filtered into its different sounds. For example:
Music that contains guitar, drums and voice: I would like to filter it into 3 outputs, with the guitar sound separate, the drum sound separate, and the voice separate, and then further decode the voice to text.
Any documentation link or research paper for university would be great.
Thanks
Google's speech recognizer is described here. To understand it you probably need to read the textbook Automatic Speech Recognition: A Deep Learning Approach first.
Separation of guitar and drums is usually implemented with Non-Negative Matrix Factorization.
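As a rough illustration of that idea, here is a minimal sketch, assuming librosa: factor the mixture's magnitude spectrogram with NMF and resynthesise the part explained by a chosen subset of components. Which components correspond to guitar, drums or voice has to be decided by listening or inspection; the file name and component indices are placeholders.

    import numpy as np
    import librosa
    import soundfile as sf

    y, sr = librosa.load("mixture.wav", sr=None)       # placeholder input mix
    S = librosa.stft(y)
    mag, phase = np.abs(S), np.angle(S)

    # Factor the magnitude spectrogram: mag is approximated by components @ activations.
    components, activations = librosa.decompose.decompose(mag, n_components=8)

    # Rebuild the part of the mixture explained by a chosen subset of components
    # (e.g. the ones that sound drum-like) using a soft, Wiener-style mask.
    chosen = [0, 1]                                     # placeholder component indices
    part = components[:, chosen] @ activations[chosen, :]
    full = components @ activations
    mask = part / np.maximum(full, 1e-10)

    y_part = librosa.istft(mask * mag * np.exp(1j * phase), length=len(y))
    sf.write("separated_part.wav", y_part, sr)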

Resources