Which is best sound format for IBM Speech to Text? - audio

IBM advises using Opus sound format for audio submitted to its Watson Speech to Text service. The idea being that Opus is designed specifically for speech.
Otherwise, it says you will get better quality transcription when submitting audio in flac format than in mp3 format. The latter has the obvious advantage of its small size. There is after all a 100Mb limit for file submissions. So you weigh the balance of your needs. That all makes sense so far.
But looking at conversions done on a source WAV file, the Opus file is size is comparable with mp3.
Downsampling a 366Mb wav file to 8k sample rate (one of two sample rates advised for using the service), created a wav file of 66.4Mb. Converting that to flac, wav and opus produced flac: 43.6Mb; mp3: 6.2Mb; opus: 9.8Mb.
So is opus really the best choice for getting the most accurate transcription? And how can that be when it is so small compared to flac?

Opus is designed to efficiently encode speech. The details are explained in the linked wiki article, but just to give you a gist, consider that human vocalisation range is rather limited, roughly from 80 to 260 Hz. On the other hand, or hearing range is far greater, up to 20000 Hz. Whereas music encoders (like mp3) have to work roughly within our hearing range, voice-specialised encoders (like Opus) can focus on what matters to efficiently encode human voice, with no interest what lies significantly above our vocalisation range. That I hope provides some intuition why Opus is so efficient.
Is it the best? It's somewhat opinionated, but yes, I think it's among the best choices out there. To cite after Wikipedia, Opus replaces both Vorbis and Speex for new applications, and several blind listening tests have ranked it higher-quality than any other standard audio format at any given bitrate.

Related

Does ReplayGain work on Opus audio files, too, and how do I apply it?

ReplayGain is a proposed technical standard published by David Robinson in 2001 to measure and normalize the perceived loudness of audio in computer audio formats such as MP3 and Ogg Vorbis.
Does ReplayGain work on audio files encoded with Opus, too? And what's a command-line solution to apply it?

How to combine jpeg frames with uncompressed mono audio into an h264 stream or any other format processed by web browsers out of the box?

So I have an esp32 which captures images and sound. The esp32-camera library already returns the jpeg encoded buffer. The audio however is uncompressed and is just a digital representation of signal strength at high sample rate.
I use esp32 to host a webpage which contains <image> element and a JavaScript snippet, which constantly sends GET requests to a branching url for image data and updates the element. This approach is not very good, especially that now I've added audio capabilities to the circuit.
I'm curious if it would be possible to combine jpeg encoded frames and some audio data into a chunk of h264 and then send it directly as a response to a GET request making it a stream?
This not only would simplify the whole serving multiple webpages thing, but also remove the issues of syncing the audio and video if they are sent separately.
In particular I'm also curious how easy would it be to do on esp32 since it doesn't have a whole bunch of ram and computational power. It would be challenging to find or port large libraries which could help as well, so i guess I would have to code it myself.
I also am not sure if h264 is the best option. I know its supported on most browser out of the box and is using jpeg compression behind the scenes for the frames, but perhaps a simpler format exists which is also widely supported.
So to sum it up: Is h264 a best bet in the provided context? Is combining jpeg and uncompressed mono audio into h264 possible in the provided context? If an answer to either of previous questions is a no, what alternatives do i have if any?
I'm curious if it would be possible to combine jpeg encoded frames and some audio data into a chunk of h264 and then send it directly as a response to a GET request making it a stream?
H.264 is a video codec. It doesn't have anything to do with audio.
I know its supported on most browser out of the box and is using jpeg compression behind the scenes for the frames
No, this isn't true. H.264 is its own thing. It's far more powerful than JPEG and is specifically designed for motion, whereas JPEG was not.
You need a few things:
A video codec, to efficiently handle your frames. Most of these embedded camera libraries can give you an MJPEG stream. I'd use that if possible. I don't think your ESP32 has other video encoding capability, does it? H.264 is a good choice, but only if you can actually encode it.
A container format, to aid in streaming your audio and video streams together. ISOBMFF/MP4 is common, as is WebM/Matroska.
If you're only streaming to a single client (which seems likely given the limited horsepower of the board), and if you have enough capability to do the audio/video encoding, you can generate a WebM stream on the fly that is directly playable in a <video> element. This seems exactly what you are asking for.

Determining the quality of mp3 audio streams

I have built a source client using Portaudio and LAME which streams the microphone input to an Icecast server to be listened to online via the HTML5 tag. I have managed to (supposedly) get the quality of the stream to MP3 320kbps at 44.1kHz and am looking for a way to confirm this using tests and or benchmarks.
I have an indication that these stats are somewhat correct from looking at stream inspectors in software such as iTunes and VLC, but I am looking to get a more in-depth data set.
What I basically want is to be able to test how much of the original file is being lost over the stream and if or how much the quality changes depending on environmental conditions of the broadcaster or streamer.
Does anyone know of any tools, frameworks to get some hard numbers or representations of this data?
If VLC tells you the stream is 320kbit CBR, then it is.
It sounds like what you're looking for is a comparison of the actual audio content. This is highly subjective. MP3 is built to use features of how our hearing works to save bandwidth. For example, quiet sounds are masked by loud sounds. High frequencies are harder to hear and are simply rolled off.
You can compare the spectral analysis between the original PCM-sampled waveform and the MP3 decoded waveform, but this doesn't tell you how humans interpret that sound. For that, you would have to survey humans.

Voice Detection in C#

I'm looking for a simple C# real-time voice detection library.
The input should be an audio stream, and the output should be "human voice" or "not a human voice".
I have no knowledge in speech recognition or signal processing, and I'll appreciate any kind of assistance.
Take a look at the answer for "Detecting audio silence in WAV files using C#". I am assuming the input is a WAV file. If not please provide the format of the audio stream, or if you are intending on taking input from the microphone directly. If you can measure the amount of silence in an audio stream and you know the duration of the audio stream then you can calculate the amount of talk time. The link in the answer is dead, but if you go to codeproject.com and search on "C# wave form" you will get a hit on a number of projects that show you how to interpret and manipulate wav files. Detecting silence may be a little subjective if there is background noise. You will need to pick a minimum volume threshold for silence where anything below it is considered silence.

Best Voice Compression Algorithms/Formats

We have some raw voice audio that we need to distribute over the internet. We need decent quality, but it doesn't need to be of musical quality. Our main concern is usability by the consumer (i.e. what and where they can play it) and size of the download. My experience has shown that mp3s do not produce the best compression numbers for voice audio, but I am at a loss for what the best alternatives are. Ultimately we would like to automate the conversion process to allow the consumer to choose the quality vs. size level that they would like.
You should give Opus a try. Example compression command line:
ffmpeg -i x.wav -b:a 32k x.opus
Start here.
As you rightly point out, voice compression is different from general audio compression. You'll find many codecs dedicated to telephony applications, ranging from PCM and ADPCM through later packet based encodings such as CELP used on GSM cellular networks.
Still, VOIP voice encoding is slightly different from that due to the medium used. you can find a good, free (unencumbered and open source (BSD)) library for speech encoding/decoding in the Speex software library.
Again, which you choose depends on the speech you're encoding and the medium it's being transmitted over. Also note that many libraries have several algorithms they can use depending on the circumstances, and some will even switch on the fly based on conditions of the sound and network.
To get more help, narrow your question down.
-Adam
The most frequently used compression formats used in live voice audio (like VoIP telephony) are μ-Law (mu-Law/u-Law is used in the US) and a-Law (used in Europe, etc.) which, unlike Uncompressed PCM, don't support as wide of a frequency range (a smaller range of possible values ignores sounds outside of the necessary spectrum and requires less space to store).
For usability sake it is easiest to use mpeg compressions (mp2/3/4) for streaming to standard media players as the algorithms are readily available and typically quite fast and almost all media players should support it, but for voice you might try to specify a lower bitrate or do your conversion from a lower quality file in the first place (WAV can be at several sampling rates and voice requires a much lower sampling rate than music or effects, it's basically like frame-per-second on video). Alternatively you can use Real Media, WMA or other proprietary formats, but this would limit usability since the users would require specific third party software for playback, though WMA has an excellent compression ratio as well as compression options specific to voice audio.
Assuming your users will be running Windows, there is a WMA speech compression codec that you can use with the Windows Media Encoder SDK. Failing that, you can use ACM to use something like G723/G728, ADPCM, mu-law or a-law, some of which are installed as standard on Windows XP & above. These can be packaged inside WAV files. You'll need to experiment a little to find the right bitrate/quality (probably don't bother with mu-law or a-law). With voice data you can get away with quite low sample rates - e.g. 16000 or 8000, as there isn't much above 4Khz in the human spoken voice.
I think AMR is one of the best speech codecs. I was using it about a year ago and I remember that quality was very good and size levels were rather small.
One drawback, especially in your case is that, as far as I know, it isn't supported by wide range of media players. QuickTime and RealPlayer are two which I know to play .amr files.
Try speex ... unencumbered by patents, good performance both sizewise and CPU-wise. I've been having good luck using it on iPhone.

Resources