How does Google Speech API chunk the audio for transcription? - audio

When the Google Speech API returns long audio transcriptions, it returns it in the form of short chunks of text of varying length, each with some associated confidence value. I was wondering how the underlying algorithm decides where to place boundaries between the transcribed chunks of audio, since it seems to be more complicated than simply chunking the audio into fixed-duration pieces and transcribing each separately (although I could be wrong about this).

Related

does converting from mulaw to linear impact audio quality?

I want to change audio encoding from mulaw to linear in order to use a linear speech recognition model from Google.
I'm using a telephony channel, so audio is encoded in mulaw, 8bits, 8000Hz.
When I use Google Mulaw model, there are some issue with recognizing some short single words -> basically they are not recognized at all -> API returns None
I was wondering if it is a good practise to change the encoding for Linear or Flac?
I already did it, but I cannot really measure the degree of this improvement.
It is always best practice to use either LINEAR16 for headerless audio data or FLAC for headered audio data. They both provide lossless codec. It is good practice to set the sampling rate to 16000 Hz otherwise you can set the sample_rate_hertz to match the native sample rate of the audio source (instead of re-sampling). Since Google Speech to Text API provides various ways to improve the audio quality, you can use World Level Confidence to measure the accuracy for response.
Ideally the audio would be recorded to start with using lossless codec like linear16 ot flac. But once you have it in format like mulaw transcoding it before sending to Google speech-to-text is not helpful.
Consider using model=phone_call and use_enhanced=true for better telephony quality.
For quick experimentation you can use STT UI https://cloud.google.com/speech-to-text/docs/ui-overview.

I need to analyse many audio WAV files for characteristic noise, ideas?

I need to be able to analyze (search thru) hundreds of WAV files and detect but not remove static noise. As done currently now, I must listen to each conversation and find the characteristic noise/static manually, which takes too much time. Ideally, I would need a program that can read each new WAV file and be able to detect characteristic signatures of the static noise such as periods of bursts of white noise or full audio band, high amplitude noise (like AM radio noise over phone conversation such as a wall of white noise) or bursts of peek high frequency high amplitude (as in crackling on the phone line) in a background of normal voice. I do not need to remove the noise but simply detect it and flag the recording for further troubleshooting. Ideas?
I can listen to the recordings and find the static or crackling but this takes time. I need an automated or batch process that can run on its own and flag the troubled call recordings (WAV files for a phone PBX). These are SIP and analog conversations depending on the leg of the conversation so RTSP/SIP packet analysis might be an option, but the raw WAV file is the simplest. I can use Audacity, but this still requires opening each file and looking at the visual representation of the audio spectrometry and is only a little faster than listening to each call but still cumbersome.
I currently have no code or methods for this task. I simply listen to each call wav file to find the noise.
I need a batch Wav file search that can render wav file recordings that contain the characteristic noise or static or crackling over the recording phone conversation.
Unless you can tell the program how the noise looks like, it's going to be challenging to run any sort of batch processing. I was facing a similar challenge and that prompted me to develop (free and open source) software to help user in audio exploration, analysis and signal separation:
App: https://audioexplorer.online/
Docs: https://tracek.github.io/audio-explorer/
Source code: https://github.com/tracek/audio-explorer
Essentially, it visualises audio as a 2d scatter plot rather than only "linear", as in waveform or spectrogram. When you upload audio the following happens:
Onsets are detected (based on high-frequency content algorithm from aubio) according to the threshold you set. Set it to None if you want all.
Per each audio fragment, calculate audio features based on your selection. There's no universal best set of features, all depends on the application. You might try for starter with e.g. Pitch statistics. Consider setting proper values for bandpass filter and sample length (that's the length of audio fragment we're going to use). Sample length could be in future established dynamically. Check docs for more info.
The result is that for each fragment you have many features, e.g. 6 or 60. That means we have then k-dimensional (where k is number of features) structure, which we then project to 2d space with dimensionality reduction algorithm of your selection. Uniform Manifold Approximation and Projection is a sound choice.
In theory, the resulting embedding should be such that similar sounds (according to features we have selected) are closely together, while different further apart. Your noise should be now separated from your "not noise" and form cluster.
When you hover over the graph, in right-upper corner a set of icons appears. One is lasso selection. Use it to mark points, inspect spectrogram and e.g. download table with features that describe that signal. At that moment you can also reduce the noise (extra button appears) in a similar way to Audacity - it analyses the spectrum and reduces these frequencies with some smoothing.
It does not completely solve your problem right now, but could severely cut the effort. Going through hundreds of wavs could take better part of the day, but you will be done. Want it automated? There's CLI (command-line interface) that I am developing at the same time. In not-too-distant future it should take what you have labelled as noise and signal and then use supervised machine learning to go through everything in batch mode.
Suggestions / feedback? Drop an issue on GitHub.

How to compare / match two non-identical sound clips

I need to take short sound samples every 5 seconds, and then upload these to our cloud server.
I then need to find a way to compare / check if that sample is part of a full long audio file.
The samples will be recorded from a phones microphone, so they will indeed not be exact.
I know this topic can get quite technical and complex, but I am sure there must be some libraries or online services that can assist in this complex audio matching / pairing.
One idea was to use a audio to text conversion service and then do matching based on the actual dialog. However this does not feel efficient to me. Where as matching based on actual sound frequencies or patterns would be a lot more efficient.
I know there are services out there such as Shazam that do this type of audio matching. However I would imagine their services are all propriety.
Some factors that could influence it:
Both audio samples with be timestamped. So we donot have to search through the entire sound clip.
To give you traction on getting an answer you need to focus on an answerable question where you have done battle and show your code
Off top of my head I would walk across the audio to pluck out a bucket of several samples ... then slide your bucket across several samples and perform another bucket pluck operation ... allow each bucket to contain overlap samples also contained in previous bucket as well as next bucket ... less samples quicker computation more samples greater accuracy to an extent YMMV
... feed each bucket into a Fourier Transform to render the time domain input audio into its frequency domain counterpart ... record into a database salient attributes of the FFT of each bucket like what are the X frequencies having most energy (greatest magnitude on your FFT)
... also perhaps store the standard deviation of those top X frequencies with respect to their energy (how disperse are those frequencies) ... define additional such attributes as needed ... for such a frequency domain approach to work you need relatively few samples in each bucket since FFT works on periodic time series data so if you feed it 500 milliseconds of complex audio like speech or music you no longer have periodic audio, instead you have mush
Then once all existing audio has been sent through above processing do same to your live new audio then identify what prior audio contains most similar sequence of buckets matching your current audio input ... use a Bayesian approach so your guesses have probabilistic weights attached which lend themselves to real-time updates
Sounds like a very cool project good luck ... here are some audio fingerprint resources
does audio clip A appear in audio file B
Detecting audio inside audio [Audio Recognition]
Detecting audio inside audio [Audio Recognition]
Detecting a specific pattern from a FFT in Arduino
Detecting a specific pattern from a FFT in Arduino
Audio Fingerprinting using the AudioContext API
https://news.ycombinator.com/item?id=21436414
https://iq.opengenus.org/audio-fingerprinting/
Chromaprint is the core component of the AcoustID project.
It's a client-side library that implements a custom algorithm for extracting fingerprints from any audio source
https://acoustid.org/chromaprint
Detecting a specific pattern from a FFT
Detecting a specific pattern from a FFT in Arduino
Audio landmark fingerprinting as a Node Stream module - nodejs converts a PCM audio signal into a series of audio fingerprints.
https://github.com/adblockradio/stream-audio-fingerprint
SO followup
How to compare / match two non-identical sound clips
How to compare / match two non-identical sound clips
Audio fingerprinting and recognition in Python
https://github.com/worldveil/dejavu
Audio Fingerprinting with Python and Numpy
http://willdrevo.com/fingerprinting-and-audio-recognition-with-python/
MusicBrainz: an open music encyclopedia (musicbrainz.org)
https://news.ycombinator.com/item?id=14478515
https://acoustid.org/chromaprint
How does Chromaprint work?
https://oxygene.sk/2011/01/how-does-chromaprint-work/
https://acoustid.org/
MusicBrainz is an open music encyclopedia that collects music metadata and makes it available to the public.
https://musicbrainz.org/
Chromaprint is the core component of the AcoustID project.
It's a client-side library that implements a custom algorithm for extracting fingerprints from any audio source
https://acoustid.org/chromaprint
Audio Matching (Audio Fingerprinting)
Is it possible to compare two similar songs given their wav files?
Is it possible to compare two similar songs given their wav files?
audio hash
https://en.wikipedia.org/wiki/Hash_function#Finding_similar_records
audio fingerprint
https://encrypted.google.com/search?hl=en&pws=0&q=python+audio+fingerprinting
ACRCloud
https://www.acrcloud.com/
How to recognize a music sample using Python and Gracenote?
Audio landmark fingerprinting as a Node Stream module - nodejs converts a PCM audio signal into a series of audio fingerprints.
https://github.com/adblockradio/stream-audio-fingerprint

Watson Speech To Text service works faster for which type of audio file?

I have tried the Watson Speech to Text API for MP3 as well as WAV files. As per my observation, the same length of audio takes less time if its given in MP3 format as compared to WAV. 10 consecutive API calls with different audios took on an average 8.7 seconds for MP3 files. On the other hand the same input in WAV format took average 11.1 seconds. Does the service response time depend on the file type? Which file type is recommended to use to obtain the results faster?
Different encoding formats have different bitrates. mp3 and opus are lossy compression formats (although suitable for speech recognition when bitrates are not too low) so they offer the lowest bitrates. If you need to push less bytes over the network that is typically better for latency, so depending on your network speed you can see shorter processing times when using encoding with lower bitrates.
However, regarding the actual speech recognition process (ignoring the data transfer over the network) all encodings are equally fast since before the recognition starts all the audio is uncompressed, if necessary, and converted to the sampling rate of the target model (broadband or narrowband).

Is it possible to, as accurately as possible, decompose an audio into MIDI, given the SoundFont that was used?

If I know the SoundFont that a MIDI to audio track has used, can I theoretically reverse the audio back into it's (most likely) MIDI components? If so, what would be one of the best approach to doing this?
The end goal is to try encoding audio (even voice samples) into MIDI such that I can reproduce the original audio in MIDI format better than, say, BearFileConverter. Hopefully with better results than just bandpass filters or FFT.
And no, this is not for any lossy audio compression or sheet transcription, this is mostly for my curiosity.
For monophonic music only, with no background sound, and if your SoundFont synthesis engine and your record sample rates are exactly matched (synchronized to 1ppm or better, have no additional effects, also both using a known A440 reference frequency, known intonation, etc.), then you can try using a set of cross correlations of your recorded audio against a set of synthesized waveform samples at each MIDI pitch from your a-priori known font to create a time line of statistical likelihoods for each MIDI note. Find the local maxima across your pitch range, threshold, and peak pick to find the most likely MIDI note onset times.
Another possibility is sliding sound fingerprinting, but at an even higher computational cost.
This fails in real life due to imperfectly matched sample rates plus added noise, speaker and room acoustic effects, multi-path reverb, and etc. You might also get false positives for note waveforms that are very similar to their own overtones. Voice samples vary even more from any template.
Forget bandpass filters or looking for FFT magnitude peaks, as this works reliably only for close to pure sinewaves, which very few musical instruments or interesting fonts sound like (or are as boring as).

Resources