The meaning of rate in ALSA - audio

I am trying to understand the meaning of "rate" as it applies to ALSA. It is always reported in units of Hz, and is often expanded in text as "sample rate". However, usage seems to indicate that it is actually frame rate or, possibly, byte rate of an audio stream.
The confusion may arise from what exactly is referred to by "sample". If each channel is sampling at a particular frequency, then that is the frame rate of the overall stream.
So, for example, if I have a rate of 44100 Hz on a 3-channel, 16-bit audio stream, am I processing 44,100 bytes per second, 88,200 bytes per second (44,100 samples per second), or 264,600 bytes per second (44,100 frames per second)?
Question rather related to [1] and [2], and was probably the motive behind [3].
Elaboration of ALSA's meaning of "frame" and "sample" at Introduction to Sound Programming with ALSA.

In ALSA, the rate is the frame rate.
Historically, this value is called "sample rate" because it is the rate at which samples arrive at each DAC. This view is correct only if each channel has its own DAC. Nowadays, most DAC chips have at least two channels, so the actual sample rate does not really occur anywhere in the system.
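As a minimal sketch of what "rate is the frame rate" means for the 3-channel, 16-bit example in the question, the arithmetic can be written out as follows. The function name `stream_rates` is hypothetical and is not part of any ALSA API.

```python
def stream_rates(rate_hz, channels, bits_per_sample):
    """Return (frames/s, samples/s, bytes/s), treating rate_hz as a frame rate."""
    frames_per_second = rate_hz                       # one frame = one sample per channel
    samples_per_second = frames_per_second * channels
    bytes_per_second = samples_per_second * bits_per_sample // 8
    return frames_per_second, samples_per_second, bytes_per_second


frames, samples, nbytes = stream_rates(44100, channels=3, bits_per_sample=16)
print(frames, samples, nbytes)  # 44100 132300 264600
```

Under this reading, the third option in the question (264,600 bytes per second) is the one that matches.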

Related

How samples are aligned in the audio file?

I'm trying to better understand how samples are aligned in the audio file.
Let's say we have a 2s audio file with sampling rate = 3.
I think there are three possible ways to align those samples. Looking at the picture below, can you tell me which one is correct?
Also, is this a standard for all audio files, or do different formats have different rules?
Cheers!
Sampling rate in audio typically tells you how many samples are taken in one second, expressed in hertz (Hz). Strictly speaking, the correct answer would be (1), as you have 3 samples within one second. Assuming there's no latency, PCM and other formats dictate that audio starts at 0. The next "cycle" (the next second) also starts at zero, the same principle as with a clock.
To get the total length of the audio (following the question in the comment), simply take the number of samples / rate. Example from a 30 s WAV using soxi, one of the canonical tools used in the community for sound manipulation:
Input File : 'book_00396_chp_0024_reader_11416_5_door_Freesound_validated_380721_0-door_Freesound_validated_381380_0-9IfN8dUgGaQ_snr10_fileid_1138.wav'
Channels : 1
Sample Rate : 16000
Precision : 16-bit
Duration : 00:00:30.00 = 480000 samples ~ 2250 CDDA sectors
File Size : 960k
Bit Rate : 256k
Sample Encoding: 16-bit Signed Integer PCM
480000 samples / (16000 samples/second) = 30 seconds exactly. Citing the manual, duration is "Equivalent to number of samples divided by the sample-rate."
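The soxi figures above can be cross-checked with a few lines of arithmetic. This is only a sketch: the helper names are hypothetical, and it assumes uncompressed PCM so that the bit rate is simply rate times bit depth times channels.

```python
def duration_seconds(num_samples, sample_rate_hz):
    """Length of the audio: number of samples (per channel) divided by the rate."""
    return num_samples / sample_rate_hz


def pcm_bit_rate(sample_rate_hz, bits_per_sample, channels):
    """Bits per second of uncompressed PCM."""
    return sample_rate_hz * bits_per_sample * channels


print(duration_seconds(480000, 16000))  # 30.0
print(pcm_bit_rate(16000, 16, 1))       # 256000
print(480000 * 16 // 8)                 # 960000 bytes of sample data
```

The results line up with the reported duration (30 s), bit rate (256k) and file size (960k) for this mono file.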

Difference between sampling rate, bit rate and bit depth

This is kind of a basic question which might sound too obvious to many of you, but I am getting badly confused.
Here is what a Quora user says. Now it is clear to me what a sampling rate is: the number of samples you take of a sound signal (in one second) is its sampling rate.
Now my doubt here is - This rate should have nothing to do with the quantisation, right?
About bit-depth: is the quantisation dependent on bit-depth? As in 32-bit (2^32 levels) and 64-bit (2^64 levels). Or is it something else?
And the bit-rate: is it the number of bits transferred in one second? If an audio file says 320 kbps, what does that really mean?
I assume readers can sense how I am panicking about where the bit rate and bit depth have significance.
EDIT: Also find this question if you have worked with linux OS and gstreamer framework.
"Now my doubt here is - This rate should have nothing to do with the quantisation, right?"
Mostly right. The sampling rate only sets how often samples are taken; quantisation is a separate step that happens alongside sampling when a signal is digitised. Sampling, as the name implies, means taking samples (amplitudes) of a (usually) continuous signal (e.g. audio) at regular time intervals and converting them to a different representation thereof. In digital signal processing, this representation is discrete (not continuous) in both time and amplitude. An example of the result of this process is a wave file (e.g. recording your own voice and saving it as a WAV).
"About bit-depth, is the quantisation dependent on bit-depth? As in 32-bit (2^32 levels) and 64-bit (2^64 levels). Or is it something else?"
Yes. The CD format, for example, has a bit depth of 16 (16 bits per sample). Bit depth is a part of the format of a sound (wave) file (along with the number of channels and sampling rate).
Since sound (think of a pure sine tone) has both positive and negative parts, 16-bit samples are usually stored as signed integers: the 2^16 = 65536 levels are split into roughly 2^16 / 2 = 32768 on each side of zero.
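To make the relation between bit depth and amplitude levels concrete, here is a small sketch. It assumes two's-complement signed integer PCM (as in the CD format); the function name is hypothetical.

```python
def signed_pcm_range(bit_depth):
    """Return (number of levels, minimum value, maximum value) for signed PCM."""
    levels = 2 ** bit_depth
    return levels, -(levels // 2), levels // 2 - 1


print(signed_pcm_range(16))  # (65536, -32768, 32767)
print(signed_pcm_range(8))   # (256, -128, 127)
```

Each extra bit doubles the number of levels, which is why bit depth, not sampling rate, controls how finely each sample is quantised.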
"And the bit-rate, is it the number of bits transferred in one second? If an audio file says 320 kbps, what does that really mean?"
Yes. Bit rates are usually meaningful in the context of network transfers. 320 kbps == 320 000 bits per second. (for kilobit you multiply by 1000, rather than 1024)
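As a rough illustration of what 320 kbps implies for storage, the sketch below converts a bit rate and a duration into bytes. It assumes a constant bit rate and ignores any container or metadata overhead; the function name is hypothetical.

```python
def bytes_for_duration(bit_rate_kbps, seconds):
    """Approximate payload size in bytes for a constant-bit-rate stream."""
    bits = bit_rate_kbps * 1000 * seconds  # kilo = 1000 for bit rates
    return bits // 8


print(bytes_for_duration(320, 1))    # 40000
print(bytes_for_duration(320, 180))  # 7200000
```

So a three-minute track at 320 kbps carries about 7.2 MB of audio data under these assumptions.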
Let's take a worked example: 'Red Book' CD audio.
The bit depth is 16-bit. This is the number of bits used to represent each sample. This is intimately coupled with quantisation.
The sample rate is 44.1 kHz.
The frame rate is 44.1 kHz (each frame holds one sample for each of the two channels that make up a stereo pair).
The bit rate is therefore 16 * 44100 * 2 = 1411200 bits/sec.
There are a few twists with compressed audio streams such as MP3 or AAC. In these, there is a non-linear relationship between bit-rate, sample-rate and bit-depth. The bit-rate is generally the maximum rate per second, and the efficiency of the codec is content dependent.

About audio record sample rate

We want to record stereo audio signals by AudioRecord as the below.
If we set sample rate to 44,100, are both stereo channels recorded
at 44,100Hz or 22,050Hz?
According to our implementation, it seems that half the sampling frequency is applied to each channel:
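// Arguments: audio source, sample rate in Hz, channel config, sample encoding, buffer size in bytes
AudioRecord audioInputStream = new AudioRecord(MediaRecorder.AudioSource.CAMCORDER,
        sampleRate, AudioFormat.CHANNEL_IN_STEREO, AudioFormat.ENCODING_PCM_16BIT,
        samplesPerBuffer * bytesPerSample);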
The sample rate is constant no matter what the number of channels. So with 1 channel at 44.1 kHz you get 44100 total samples per second, and with 2 channels you get 88200 total samples per second.
I don't really know about the API you are using, but I can point to one possible area of confusion that arises from terminology: the difference between a sample and a frame. Usually you consider a sample to be a single value and a frame to contain a single sample for each channel. So if you encounter any API that looks something like this: process(double* samples, int numChannels, int numFrames), just beware that the actual number of samples in the buffer is numChannels*numFrames. Misinterpreting something like that could definitely lead to consuming half as many samples as you expect. Also, some APIs will confusingly use the term numSamples when they should have used numFrames, etc.
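The sample/frame distinction can be sketched with a tiny buffer. This example assumes an interleaved layout (L0 R0 L1 R1 ...), which is common but not stated in the answer above; `split_channels` is a hypothetical helper.

```python
def split_channels(interleaved, num_channels):
    """Split an interleaved sample buffer into one list per channel."""
    if len(interleaved) % num_channels != 0:
        raise ValueError("buffer does not hold a whole number of frames")
    return [interleaved[ch::num_channels] for ch in range(num_channels)]


buffer = [10, -10, 20, -20, 30, -30]   # six samples: L0 R0 L1 R1 L2 R2
left, right = split_channels(buffer, 2)
num_frames = len(buffer) // 2          # three frames, not six
print(num_frames, left, right)  # 3 [10, 20, 30] [-10, -20, -30]
```

Reading such a buffer as if it held six mono frames would play each channel's data at half the expected rate, which is one way the "half sampling frequency" impression can arise.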

Relation between bandwidth and play time in a CD

I have recently read that uncompressed CD-quality audio has a bandwidth of 1.411 Mbps in the case of stereo. Does that mean a CD is played back at a rate of 1.411 Mbps, i.e. does it play 1.411 Mbits of stereo audio every second?
Two channels, each with 44,100 16-bit samples per second. That is 2 x 44100 x 16 = 1,411,200bps. That is 1.411Mbps. (176400 bytes per second)
Each second requires 1.411Mb. If you reduced the sample rate by half, you would double the number of seconds that can be recorded on a CD. Same if you dropped it to one channel, or 8-bit.
To imagine the impact of reducing the sample rate, let's suppose a technology that sampled only once per second. This would be like pressing mute over and over: you would only catch parts.
Reducing the channel to one is easy to imagine, that's monaural.
Reducing to 8-bit is harder to describe. Imagine we reduced it to 1-bit. That would essentially mean the speaker has two states, fully centered and fully driven. That is not much variation. 16 bits gives 65536 positions.
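The trade-off between bandwidth and play time described above can be sketched as a ratio against the CD bit rate. It assumes a fixed storage capacity in bits and ignores any format overhead; the names are hypothetical.

```python
CD_BITS_PER_SECOND = 2 * 44100 * 16  # 1,411,200 bps


def playtime_factor(channels, rate_hz, bit_depth):
    """How many times longer than CD audio the same capacity would last."""
    return CD_BITS_PER_SECOND / (channels * rate_hz * bit_depth)


print(playtime_factor(2, 22050, 16))  # 2.0 (half the sample rate)
print(playtime_factor(1, 44100, 16))  # 2.0 (mono)
print(playtime_factor(2, 44100, 8))   # 2.0 (8-bit)
```

Each of the three reductions halves the bit rate, so each doubles the recordable time.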

About definition for terms of audio codec

When I was studying Cocoa Audio Queue document, I met several terms in audio codec. There are defined in a structure named AudioStreamBasicDescription.
Here are the terms:
1. Sample rate
2. Packet
3. Frame
4. Channel
I know about sample rate and channel. However, I am confused by the other two. What do the other two terms mean?
You can also answer this question by example. For example, I have a dual-channel PCM-16 source with a sample rate of 44.1 kHz, which means there are 2*44100 = 88200 samples (176400 bytes) of PCM data per second. But how about packet and frame?
Thank you in advance!
You are already familiar with the sample rate definition.
The sampling frequency or sampling rate, fs, is defined as the number of samples obtained in one second (samples per second), thus fs = 1/T, where T is the sampling period (the time between two consecutive samples).
So for a sampling rate of 44100 Hz, you have 44100 samples per second (per audio channel).
The number of frames per second in video is a similar concept to the number of samples per second in audio. Frames for our eyes, samples for our ears. Additional information here.
If you have 16-bit stereo PCM, it means you have 16*44100*2 = 1411200 bits per second => 176400 bytes (~172 KiB) per second => around 10 MB per minute.
Here are the definitions, reworded from Apple's terms:
Sample: a single number representing the value of one audio channel at one point in time.
Frame: a group of one or more samples, with one sample for each channel, representing the audio on all channels at a single point in time.
Packet: a group of one or more frames, representing the audio format's smallest encoding unit, and the audio for all channels across a short amount of time.
As you can see, there is a subtle difference between the audio and video notions of a frame. In one second of stereo audio at 44.1 kHz you have 88200 samples and thus 44100 frames.
Compressed formats like MP3 and AAC pack multiple frames into packets (these packets can then be written to an MP4 file, for example, where they can be efficiently interleaved with video content). Dealing with larger packets helps the encoder identify bit patterns for better coding efficiency.
MP3, for example, uses packets of 1152 frames, which are the basic atomic unit of an MP3 stream. PCM audio is just a series of samples, so it can be divided down to the individual frame, and it really has no packet size at all.
For AAC you can have 1024 (or 960) frames per packet. This is described in the Apple document you pointed at:
The number of frames in a packet of audio data. For uncompressed audio, the value is 1. For variable bit-rate formats, the value is a larger fixed number, such as 1024 for AAC. For formats with a variable number of frames per packet, such as Ogg Vorbis, set this field to 0.
In MPEG-based file formats a packet is referred to as a data frame (not to be confused with the previous audio frame notion). See Brad's comment for more information on the subject.
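To get a feel for the packet sizes mentioned above, the sketch below converts frames per packet into a duration. The 44.1 kHz rate is taken from the question's example and is an assumption here; the function name is hypothetical.

```python
def packet_duration_ms(frames_per_packet, sample_rate_hz):
    """Duration in milliseconds of audio covered by one packet."""
    return 1000.0 * frames_per_packet / sample_rate_hz


print(round(packet_duration_ms(1152, 44100), 2))  # 26.12 (MP3)
print(round(packet_duration_ms(1024, 44100), 2))  # 23.22 (AAC)
print(round(packet_duration_ms(1, 44100), 4))     # 0.0227 (PCM, one frame per packet)
```

This shows why compressed streams can only be cut on packet boundaries of a few tens of milliseconds, while PCM can be divided down to a single frame.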

Resources