I am confused: is H.264 a codec (software used to encode/decode) or a standard/format, given that there are software codecs that use H.264 to encode/decode?
As with most things in computing, the terminology is loosely used.
Looking at the basic definitions:
H.264 is a video encoding standard.
Codecs are software or hardware which encode or decode videos.
So a codec would encode or decode a video, usually following some standard or agreed encoding format.
In practice, people often speak of an H.264 codec or an H.265 codec. What this usually means is a codec that implements the H.264 or H.265 standard.
The codecs themselves can be either software or hardware based; more common codecs, like the ones that implement H.264, are often implemented in hardware to increase performance and reduce battery usage on devices.
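For instance, ffmpeg ships several distinct encoders that all implement the same H.264 standard (the libx264 software encoder, plus hardware-backed ones on supported machines). Here is a minimal sketch invoking one from Python; "in.mp4" and "out.mp4" are placeholder file names:

    import subprocess

    # The H.264 *standard* stays the same; the *codec* (here the libx264
    # software encoder) is what ffmpeg's -c:v flag selects.
    subprocess.run(
        ["ffmpeg", "-i", "in.mp4", "-c:v", "libx264", "out.mp4"],
        check=True,
    )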
IBM advises using the Opus sound format for audio submitted to its Watson Speech to Text service, the idea being that Opus is designed specifically for speech.
Otherwise, it says you will get better-quality transcription when submitting audio in flac format than in mp3 format. The latter has the obvious advantage of its small size, and there is, after all, a 100 MB limit for file submissions. So you weigh the balance of your needs. That all makes sense so far.
But looking at conversions done on a source WAV file, the Opus file size is comparable with the mp3.
Downsampling a 366 MB wav file to an 8 kHz sample rate (one of two sample rates advised for using the service) created a wav file of 66.4 MB. Converting that to flac, mp3 and opus produced flac: 43.6 MB; mp3: 6.2 MB; opus: 9.8 MB.
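For reference, these conversions can be scripted with ffmpeg, e.g. from Python (a sketch; the file names are placeholders, and ffmpeg picks each encoder from the output extension):

    import subprocess

    # Downsample to 8 kHz, then encode the result as flac, mp3 and opus.
    subprocess.run(["ffmpeg", "-i", "source.wav", "-ar", "8000", "down.wav"], check=True)
    for ext in ("flac", "mp3", "opus"):
        subprocess.run(["ffmpeg", "-i", "down.wav", f"down.{ext}"], check=True)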
So is Opus really the best choice for getting the most accurate transcription? And how can that be, when it is so small compared to flac?
Opus is designed to efficiently encode speech. The details are explained in the linked wiki article, but to give you the gist, consider that the human vocalisation range is rather limited, roughly 80 to 260 Hz, whereas our hearing range is far greater, up to 20000 Hz. Music encoders (like mp3) have to work across roughly our whole hearing range, while voice-specialised encoders (like Opus) can focus on what matters for encoding human voice efficiently, with no interest in what lies significantly above our vocalisation range. That, I hope, provides some intuition for why Opus is so efficient.
Is it the best? That is somewhat a matter of opinion, but yes, I think it's among the best choices out there. To quote Wikipedia, Opus replaces both Vorbis and Speex for new applications, and several blind listening tests have ranked it higher quality than any other standard audio format at any given bitrate.
I have recently started going through sound card drivers in Linux (ALSA).
Can anyone suggest a link or reference where I can get a good grounding in the basics of audio, like:
sampling rate, bit size, etc.
I want to know exactly how samples are stored in audio files on a computer, and the reverse: how samples (numbers) are played back.
The Audacity tutorial is a good place to start, and there is another introduction that covers similar ground. The PureData tutorial at flossmanuals is also a good starting point. Wikipedia is a good source once you have the basics down.
Audio is input into a computer via an analog-to-digital converter (ADC). Digital audio is output via a digital-to-analog converter (DAC).
Sample rate is the number of times per second at which the analog signal is measured and stored digitally. You can think of the sample rate as the time resolution of an audio signal. Bit size is the number of bits used to store each sample. You can think of it as analogous to the color depth of an image pixel.
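To make both concepts concrete, here is a minimal Python sketch using the standard-library wave module ("example.wav" is a placeholder for any 16-bit PCM file) that reads those header fields and the first few raw samples:

    import wave
    import struct

    with wave.open("example.wav", "rb") as wf:
        print("sample rate:", wf.getframerate())   # samples per second, e.g. 44100
        print("bit size:", wf.getsampwidth() * 8)  # bytes per sample * 8, e.g. 16
        print("channels:", wf.getnchannels())
        frames = wf.readframes(4)                  # raw bytes of the first 4 frames
        # For 16-bit PCM, each sample is a signed little-endian short.
        samples = struct.unpack("<" + "h" * (len(frames) // 2), frames)
        print("first samples:", samples)

Playback is the reverse: the same numbers are handed to the DAC at the stated sample rate.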
David Cottle's SuperCollider book also has a great introduction to digital audio.
I was in the same situation; this kind of information is certainly out there, but you need to do some research first. This is what I have found:
Digital audio processing is a branch of DSP (Digital Signal Processing).
“DSP is one of the most powerful technologies that will shape science and engineering in the twenty-first century. Revolutionary changes have already been made in a broad range of fields: communications, medical imaging, radar & sonar, high fidelity music reproduction, and oil prospecting, to name just a few. Each of these areas has developed a deep DSP technology, with its own algorithms, mathematics, and specialized techniques…”
This quote was taken from a very helpful guide that covers every topic in depth, called “The Scientist and Engineer's Guide to Digital Signal Processing”. And though you are not asking about DSP specifically, there is a chapter that covers all digital-audio topics with a very good explanation.
You can find it in Chapter 22, Audio Processing, which covers all of these topics:
Human Hearing: how sound is perceived by our ears; this is the basis for how sound is then generated artificially.
Timbre: explains the properties of sound, like loudness, pitch and timbre.
Sound Quality vs. Data Rate: once you know the previous concepts, this starts translating them to the electronic side.
High Fidelity Audio: gives you a picture of how sound is processed digitally.
Companding: how sound is processed and compressed for telecommunications.
Speech Synthesis and Recognition: more processes applied to sound, like filters, synthesis, etc.
Nonlinear Audio Processing: more advanced, but understandable; covers sound treatment and other topics.
The guide explains the basics of sound in the real world and then how sound is processed in the computer, including what you are asking about, in case you want to take a look.
But there are other, more specific topics that can be found on Wikipedia. The “Digital audio” page, for instance, explains every detail of this topic and can be used as a reference for further research; right at the beginning you will find links to sample rate, sound waves, digital formats, standards, bit depth, telecommunications, etc. There are a few things you might need to study more, like the Nyquist–Shannon sampling theorem, Fourier transforms, complex numbers and so on, but these only come up in very specific and advanced topics that you may never need; I mention them just in case you are interested. You can find information in both the DSP guide and Wikipedia, although you will need to study some math.
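As a quick illustration of why the Nyquist–Shannon theorem matters for choosing sample rates (my own example, not from the book): a tone above half the sample rate is indistinguishable, sample for sample, from a lower one.

    import numpy as np

    # Sampled at 1000 Hz, a 900 Hz tone aliases onto a 100 Hz tone:
    # sin(2*pi*900*n/1000) equals sin(2*pi*(-100)*n/1000) at every sample n.
    fs = 1000
    n = np.arange(16)
    high = np.sin(2 * np.pi * 900 * n / fs)
    low = np.sin(2 * np.pi * -100 * n / fs)
    print(np.allclose(high, low))  # True: the sampled values coincide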
I've been using Python to develop and study these subjects with code, since it has a lot of useful libraries, like numpy, sounddevice, scipy, etc., and then you can start playing with sound. On YouTube you can find lots of videos that guide you through this. I've found synthesis, filters, voice recognition; you can even create wav files with just code, which is great (see the sketch below). I've also seen projects in C/C++, JavaScript and other languages, so it can help keep learning and coding fun.
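For example, a minimal sketch of “creating wav files with just code” using numpy and scipy (the file name and tone are arbitrary choices):

    import numpy as np
    from scipy.io import wavfile

    # Synthesize one second of a 440 Hz sine tone as 16-bit PCM.
    sample_rate = 44100                          # samples per second
    t = np.linspace(0, 1, sample_rate, endpoint=False)
    tone = 0.5 * np.sin(2 * np.pi * 440 * t)     # amplitude 0.5 avoids clipping
    pcm = (tone * 32767).astype(np.int16)        # scale floats to signed 16-bit
    wavfile.write("tone.wav", sample_rate, pcm)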
There are a few other references across the internet, but you need to know what you are looking for. This book and the Wikipedia page would be my starting points, since they give you the basics and explain every topic in depth. Then, depending on the goal you want to achieve, you can start looking for more information.
Information about JPEG-LS is easily found on Google and in a lot of DICOM chapters.
However, there are links/pages/readings that mention JPEG-LL too. I have taken a deeper look at the DICOM standard, and no chapter ever mentions JPEG-LL, yet it comes up in a lot of conformance statements/forums/articles...
So, what is the difference between JPEG-LS and JPEG-LL?
I suppose JPEG-LL is the lossless version of JPEG (usually called JPEG Lossless), with transfer syntax 1.2.840.10008.1.2.4.70, and it's the most common lossless compression. Anyway, it's not the best possible: it does not support signed pixel values, and its compression ratios are only about 2-2.5.
JPEG-LS is a completely different standard, with transfer syntax 1.2.840.10008.1.2.4.80, developed by HP, which is faster and has better ratios than standard JPEG. However, it's not widely used.
I suppose JPEG-LL refers to ITU-T T.81 / ISO/IEC IS 10918-1, a.k.a. JPEG Lossless, while JPEG-LS refers to ITU-T T.87 / ISO/IEC IS 14495-1.
Technically speaking, JPEG-LL is defined in exactly the same standard as the usual 8-bit lossy JPEG found on internet websites (T.81 also specifies a lossless process). JPEG-LS is much, much less common.
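To keep the two straight, here is the mapping as a plain Python lookup table (just restating the UIDs and standards cited above):

    # DICOM transfer syntax UIDs for the two lossless JPEG flavours.
    TRANSFER_SYNTAXES = {
        "1.2.840.10008.1.2.4.70": "JPEG Lossless (ITU-T T.81 / ISO/IEC 10918-1), a.k.a. JPEG-LL",
        "1.2.840.10008.1.2.4.80": "JPEG-LS Lossless (ITU-T T.87 / ISO/IEC 14495-1)",
    }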
We have some raw voice audio that we need to distribute over the internet. We need decent quality, but it doesn't need to be of musical quality. Our main concern is usability by the consumer (i.e. what and where they can play it) and size of the download. My experience has shown that mp3s do not produce the best compression numbers for voice audio, but I am at a loss for what the best alternatives are. Ultimately we would like to automate the conversion process to allow the consumer to choose the quality vs. size level that they would like.
You should give Opus a try. Example compression command line:
ffmpeg -i x.wav -b:a 32k x.opus
Start here.
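If you want to automate the quality-vs-size choice you mentioned, a sketch along these lines could work (the quality names and bitrate values are illustrative, not a recommendation):

    import subprocess

    # Map user-facing quality levels to Opus bitrates.
    BITRATES = {"low": "16k", "medium": "32k", "high": "64k"}

    def encode(wav_path, opus_path, quality="medium"):
        subprocess.run(
            ["ffmpeg", "-i", wav_path, "-b:a", BITRATES[quality], opus_path],
            check=True,
        )

    encode("x.wav", "x.opus", "high")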
As you rightly point out, voice compression is different from general audio compression. You'll find many codecs dedicated to telephony applications, ranging from PCM and ADPCM through later packet-based encodings such as the CELP family used on GSM cellular networks.
Still, VoIP voice encoding is slightly different from that, due to the medium used. You can find a good, free (patent-unencumbered, open source, BSD-licensed) library for speech encoding/decoding in the Speex software library.
Again, which you choose depends on the speech you're encoding and the medium it's being transmitted over. Also note that many libraries have several algorithms they can use depending on the circumstances, and some will even switch on the fly based on conditions of the sound and network.
To get more help, narrow your question down.
-Adam
The most frequently used compression formats for live voice audio (like VoIP telephony) are μ-law (mu-law/u-law, used in the US) and A-law (used in Europe, etc.), which, unlike uncompressed linear PCM, don't keep as wide a dynamic range: a smaller range of possible sample values ignores detail outside the necessary range and requires less space to store.
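As an illustration of that companding idea, here is a sketch of the standard μ-law compression curve (μ = 255; just the math, not production telephony code):

    import numpy as np

    # mu-law companding: compress a sample in [-1, 1] non-linearly so that
    # quiet sounds keep proportionally more resolution than loud ones.
    def mu_law_encode(x, mu=255):
        return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

    # Small inputs get boosted toward the top of the range before quantization.
    print(mu_law_encode(np.array([0.01, 0.1, 1.0])))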
For usability's sake, it is easiest to use MPEG compression (mp2/mp3/mp4) for streaming to standard media players: the algorithms are readily available, typically quite fast, and almost all media players support them. For voice, you might specify a lower bitrate or do your conversion from a lower-quality file in the first place (WAV can use several sampling rates, and voice needs a much lower sampling rate than music or effects; it's roughly analogous to frames per second in video). Alternatively, you can use RealMedia, WMA or other proprietary formats, but this would limit usability, since users would need specific third-party software for playback; that said, WMA has an excellent compression ratio, as well as compression options specific to voice audio.
Assuming your users will be running Windows, there is a WMA speech-compression codec that you can use with the Windows Media Encoder SDK. Failing that, you can use ACM to access something like G.723/G.728, ADPCM, mu-law or A-law, some of which are installed as standard on Windows XP and above; these can be packaged inside WAV files. You'll need to experiment a little to find the right bitrate/quality (probably don't bother with mu-law or A-law). With voice data you can get away with quite low sample rates, e.g. 16000 or 8000 Hz, as there isn't much above 4 kHz in the human spoken voice.
I think AMR is one of the best speech codecs. I was using it about a year ago, and I remember that the quality was very good and the file sizes were rather small.
One drawback, especially in your case, is that as far as I know it isn't supported by a wide range of media players. QuickTime and RealPlayer are the two I know of that play .amr files.
Try Speex... unencumbered by patents, with good performance both size-wise and CPU-wise. I've had good luck using it on the iPhone.