I figured out that the default setting on my device for audio is kAudioFormatLinearPCM.
I get 4 bytes per sample in mData in the AudioBuffer.
Is each value an absolute amplitude value? Is it always a positive number?
You need to know the stream format. If the format is unsigned then the value is always positive. If the sample format is signed, then the value can be either positive or negative.
Depending on the endianness of the format, the endianness of the processor (little-endian on ARM iOS devices), and how the value is read from the stream, the value may also need to be byte-swapped before it can be interpreted as a linear amplitude value.
Is each value an absolute amplitude value?
Yes.
Is it always a positive number?
It varies across the APIs and implementations you will encounter. You will have to refer to the other fields of the AudioStreamBasicDescription (in particular mFormatFlags) to determine the sample format precisely.
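As a rough sketch (not from either answer above), this is the kind of flag check you might do on the AudioStreamBasicDescription before interpreting the 4-byte values. The field and flag names are the standard Core Audio ones; the CoreAudio headers are assumed to be available, and the exact header path may differ by SDK version:

```c
/* Sketch only: inspect an AudioStreamBasicDescription to decide how to
 * interpret each 4-byte value in mData. Header path may vary by SDK. */
#include <CoreAudio/CoreAudioTypes.h>
#include <stdio.h>

static void describe_format(const AudioStreamBasicDescription *asbd) {
    UInt32 flags = asbd->mFormatFlags;

    if (asbd->mFormatID != kAudioFormatLinearPCM) {
        printf("Not linear PCM; samples are not raw amplitudes.\n");
        return;
    }
    if (flags & kAudioFormatFlagIsFloat) {
        printf("Float samples, nominally in the range [-1.0, 1.0].\n");
    } else if (flags & kAudioFormatFlagIsSignedInteger) {
        printf("Signed integer samples (may be 8.24 fixed point on older iOS).\n");
    } else {
        printf("Unsigned integer samples; always non-negative.\n");
    }
    printf("Endianness: %s, bits per channel: %u\n",
           (flags & kAudioFormatFlagIsBigEndian) ? "big" : "little",
           (unsigned)asbd->mBitsPerChannel);
}
```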
Each WAV file depends on a Sampling Rate and a Bit Depth. The former governs how many different samples are played per second, and the latter governs how many possibilities there are for each timeslot.
If the sampling rate is, for example, 1000 Hz and the bit depth is 8, then every 1/1000 of a second the audio device plays one of $2^8$ possible different sounds.
Hence the bulk of the WAV file is a sequence of 8-bit numbers. There is also a header which contains the Sampling Rate and Bit Depth and other specifics of how the data should be read:
The above comes from running xxd on a WAV file to view its bytes in the terminal. The first column is just the byte offset in hexadecimal. The last one seems to show where the header ends. So the data looks like this:
Each of those 8-bit numbers is a sample. So the device reads left to right and converts the samples in order into sounds. But how, in principle, can each number correspond to a sound? I would think each number should somehow encode an amplitude and a pitch, with each coming from a finite range. But I cannot find any reference to, for example, the first half of the bits encoding an amplitude and the second half a pitch.
I have found references to the numbers encoding "signal strength", but I do not know what this means. Can anyone explain in principle how the data is read and converted to audio?
In your example, over the course of a second, 1000 values are sent to a DAC (digital-to-analog converter), where the discrete values are smoothed out into a waveform. The pitch is determined by the rate and pattern at which that stream of values rises and falls.
Steven W. Smith gives some good diagrams and explanations in the chapter "ADC and DAC" of his very helpful book The Scientist and Engineer's Guide to Digital Signal Processing.
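To make that concrete, here is a minimal sketch (mine, not from the answer) that generates one second of a tone as 8-bit unsigned samples, using an assumed 8000 Hz sampling rate and 440 Hz tone. Each byte is just the instantaneous amplitude of the waveform at that instant; the pitch emerges from how quickly the values rise and fall, not from any bits reserved for frequency:

```c
/* Minimal sketch (not from the original post): one second of a 440 Hz tone
 * as 8-bit unsigned samples at an assumed 8000 Hz sampling rate. */
#include <math.h>
#include <stdint.h>

#define SAMPLE_RATE 8000
#define FREQ_HZ     440.0

static const double PI = 3.14159265358979323846;

static void make_tone(uint8_t samples[SAMPLE_RATE]) {
    for (int n = 0; n < SAMPLE_RATE; n++) {
        double t = (double)n / SAMPLE_RATE;               /* time of this sample */
        double amplitude = sin(2.0 * PI * FREQ_HZ * t);   /* -1.0 .. 1.0 */
        /* 8-bit WAV data is unsigned, centred on 128 */
        samples[n] = (uint8_t)(128.0 + 127.0 * amplitude);
    }
}
```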
For a project I am decoding WAV files and am using the values in the data chunk. I am using the node package "node-wav". From what I understand the values should be in the thousands, but I am seeing values that are scaled between -1 and 1. If I want the actual values, do I need to multiply the scaled value by some number?
Part of the reason I am asking is that I still do not fully understand how WAV files store the necessary data.
I don't know the specifics of node.js, but audio data is often stored as floating-point values, so it makes sense to see it scaled between -1 and 1.
What I pulled from the website:
Data format
Data is always returned as Float32Arrays. While reading and writing 64-bit float WAV files is supported, data is truncated to 32-bit floats.
And endianness if you need it for some reason:
Endianness
This module assumes a little endian CPU, which is true for pretty much every processor these days (in particular Intel and ARM).
If you needed to scale from float to fixed-point integer, you'd multiply the value by the maximum value of the target integer type. For example, if you're trying to convert to 16-bit integers: y = (2^15 - 1) * x, where x is the float data value and y is the scaled value.
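A short sketch of that conversion (the function name and the clipping are my additions, not part of node-wav):

```c
/* Sketch of the float -> 16-bit conversion described above.
 * Input samples are assumed to be in the nominal [-1.0, 1.0] range. */
#include <stdint.h>

static int16_t float_to_int16(float x) {
    if (x > 1.0f)  x = 1.0f;          /* clip anything outside the nominal range */
    if (x < -1.0f) x = -1.0f;
    return (int16_t)(x * 32767.0f);   /* 2^15 - 1 */
}
```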
This is probably a very silly question, but after searching for a while, I couldn't find a straight answer.
If a source filter (such as the LAV Audio codec) is processing a 24-bit integer audio stream, how are individual audio samples delivered to the graph?
(for simplicity, let's consider a monophonic stream)
Are they stored individually on a 32-bit integer with the most-significant bits unused, or are they stored in a packed form, with the least significant bits of the next sample occupying the spare, most-significant bits of the current sample?
The format is similar to 16-bit PCM: the values are signed integers, little endian.
With 24-bit audio you normally define the format with the help of WAVEFORMATEXTENSIBLE structure, as opposed to WAVEFORMATEX (well, the latter is also possible in terms of being accepted by certain filters, but in general you are expected to use the former).
The structure has two values: number of bits per sample and number of valid bits per sample. So it's possible to have the 24-bit data represented as 24-bit values, and also as 24-bit meaningful bits of 32-bit values. The payload data should match the format.
There is no mix of bits of different samples within a byte:
However, wBitsPerSample is the container size and must be a multiple of 8, whereas wValidBitsPerSample can be any value not exceeding the container size. For example, if the format uses 20-bit samples, wBitsPerSample must be at least 24, but wValidBitsPerSample is 20.
To the best of my knowledge it is typical to have just 24-bit values, that is, three bytes per PCM sample.
Non-PCM formats might define different packing and use the "unused" bits more efficiently, so that, for example, two samples of 20-bit audio consume 5 bytes.
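As an illustration (my sketch, not from the answer), here is how one might read a signed, little-endian 24-bit sample in the two layouts discussed above: packed three-byte samples, and 24 valid bits carried in a 32-bit container, where the convention is that the valid bits occupy the most significant bits:

```c
/* Sketch: reading one signed, little-endian 24-bit sample in the two
 * container layouts discussed above. */
#include <stdint.h>

/* Packed: three bytes per sample, no padding between samples. */
static int32_t read_packed_24(const uint8_t *p) {
    int32_t v = (int32_t)p[0] | ((int32_t)p[1] << 8) | ((int32_t)p[2] << 16);
    if (v & 0x00800000)      /* sign-extend from 24 bits to 32 bits */
        v -= 0x01000000;
    return v;
}

/* 24-in-32: wBitsPerSample = 32, wValidBitsPerSample = 24. The valid bits sit
 * in the most significant bits, so the least significant byte is padding. */
static int32_t read_24_in_32(const uint8_t *p) {
    return read_packed_24(p + 1);
}
```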
What do the values in the mData member represent? It looks like each value is a 4 byte integer...
I guess my question is, what is each sample supposed to represent, and what does the mNumberChannels member represent?
If I had to apply some sort of transform on the sound pattern, can I treat these samples as discrete samples in time? If so, what time period does each buffer of 512 samples represent?
Thanks
Deshawn
The mData buffer array elements can represent 16-bit signed integers, stereo pairs of 16-bit signed integers, 32-bit 8.24/s7.24 scaled-integer or fixed-point values, or 32-bit floating-point values, etc., depending on the Audio Unit and how it was configured.
The buffer duration will be its length in frames divided by the audio sample rate; for instance, 512/44100 is about 11.61 milliseconds.
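That arithmetic, spelled out as a tiny sketch (the 512 and 44100 figures are just the example values above):

```c
/* Tiny sketch: duration of a buffer is its length in frames divided by
 * the sample rate. */
#include <stdio.h>

int main(void) {
    double frames      = 512.0;
    double sample_rate = 44100.0;
    double seconds     = frames / sample_rate;
    printf("%.2f ms per buffer, %.6f ms per sample\n",
           seconds * 1000.0, 1000.0 / sample_rate);
    return 0;
}
```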
I'm thinking of using libsamplerate to resample audio files which seems fairly simple.
In the FAQ it states that after resampling the audio should be normalised, which I'm not sure how to do.
It states that the audio samples should be in the range (-1.0, 1.0).
Is it just a case of:
Finding the sample which lies furthest from zero (i.e. has the largest absolute value)
Calculating the coefficient that will result in its value being -1.0 or 1.0
Applying that coefficient to every sample in the audio file?
Basically yes: you have to find the sample of largest absolute value and divide all samples by that value, which ensures all samples will lie in the (-1.0, 1.0) range. Of course, this requires that you have access to the whole audio data in advance (you cannot normalize a stream, since you do not know what samples you will be getting, e.g. 3 seconds into the future).
Keep in mind though that this operation will probably result in a change of perceived loudness ('volume'). If you want the overall loudness to be preserved after resampling, you have to measure it before and after resampling, and apply a proper coefficient.
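A sketch of that peak-normalization step over a plain float buffer (this is my own illustration, not part of libsamplerate's API; the function name is mine):

```c
/* Sketch of the peak normalization described above, applied to a float buffer. */
#include <math.h>
#include <stddef.h>

static void normalize_peak(float *samples, size_t count) {
    float peak = 0.0f;
    for (size_t i = 0; i < count; i++) {      /* pass 1: find largest |sample| */
        float a = fabsf(samples[i]);
        if (a > peak) peak = a;
    }
    if (peak == 0.0f)
        return;                               /* silence: nothing to scale */
    float gain = 1.0f / peak;
    for (size_t i = 0; i < count; i++)        /* pass 2: scale everything */
        samples[i] *= gain;
}
```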