For a project I am decoding wav files and am using the values in the data channel. I am using the node package "node-wav". From what I understand the values should be in the thousands, but I am seeing values that are scaled between -1 and 1. If I want the actual values do I need to multiply the scaled value by some number?
Part of the reason I am asking is that I still do not fully understand how WAV files store the necessary data.
I don't exactly know how node.js is but usually audio data is stored in float values so it makes sense to see it scaled between -1 and 1.
What I pulled from the website:
Data format
Data is always returned as Float32Arrays. While reading and writing 64-bit float WAV files is supported, data is truncated to 32-bit floats.
And endianness if you need it for some reason:
Endianness
This module assumes a little endian CPU, which is true for pretty much every processor these days (in particular Intel and ARM).
If you needed it to scale from float to fixed point integer, you'd multiply the value by the number of bits. For example, if you're trying to convert to 16 bit integers; y = (2^15 - 1) * x, where x is the data value, y is the scaled value.
Related
I need to create a software can capture sound (from NOAA Satellite with RTL-SDR). The problem is not capture the sound, the problem is how I converted the audio or waves into an image. I read many things, Fourier Fast Transformed, Hilbert Transform, etc... but I don't know how.
If you can give me an idea it would be fantastic. Thank you!
Over the past year I have been writing code which makes FFT calls and have amassed 15 pages of notes so the topic is vast however I can boil it down
Open up your WAV file ... parse the 44 byte header and note the given bit depth and endianness attributes ... then read across the payload which is everything after that header ... understand notion of bit depth as well as endianness ... typically a WAV file has a bit depth of 16 bits so each point on the audio curve will be stored across two bytes ... typically WAV file is little endian not big endian ... knowing what that means you take the next two bytes then bit shift one byte to the left (if little endian) then bit OR that pair of bytes into an integer then convert that int which typically varies from 0 to (2^16 - 1) into its floating point equivalent so your audio curve points now vary from -1 to +1 ... do that conversion for each set of bytes which corresponds to each sample of your payload buffer
Once you have the WAV audio curve as a buffer of floats which is called raw audio or PCM audio then perform your FFT api call ... all languages have such libraries ... output of FFT call will be a set of complex numbers ... pay attention to notion of the Nyquist Limit ... this will influence how your make use of output of your FFT call
Now you have a collection of complex numbers ... the index from 0 to N of that collection corresponds to frequency bins ... the size of your PCM buffer will determine how granular your frequency bins are ... lookup this equation ... in general more samples in your PCM buffer you send to the FFT api call will give you finer granularity in the output frequency bins ... essentially this means as you walk across this collection of complex numbers each index will increment the frequency assigned to that index
To visualize this just feed this into a 2D plot where X axis is frequency and Y axis is magnitude ... calculate this magnitude for each complex number using
curr_mag = 2.0 * math.Sqrt(curr_real*curr_real+curr_imag*curr_imag) / number_of_samples
For simplicity we will sweep under the carpet the phase shift information available to you in your complex number buffer
This only scratches the surface of what you need to master to properly render a WAV file into a 2D plot of its frequency domain representation ... there are libraries which perform parts or all of this however now you can appreciate some of the magic involved when the rubber hits the road
A great explanation of trade offs between frequency resolution and number of audio samples fed into your call to an FFT api https://electronics.stackexchange.com/questions/12407/what-is-the-relation-between-fft-length-and-frequency-resolution
Do yourself a favor and checkout https://www.sonicvisualiser.org/ which is one of many audio workstations which can perform what I described above. Just hit menu File -> Open -> choose a local WAV file -> Layer -> Add Spectrogram ... and it will render the visual representation of the Fourier Transform of your input audio file as such
This is probably a very silly question, but after searching for a while, I couldn't find a straight answer.
If a source filter (such as the LAV Audio codec) is processing a 24-bit integral audio stream, how are individual audio samples delivered to the graph?
(for simplicity lets consider a monophonic stream)
Are they stored individually on a 32-bit integer with the most-significant bits unused, or are they stored in a packed form, with the least significant bits of the next sample occupying the spare, most-significant bits of the current sample?
The format is similar to 16-bit PCM: the values are signed integers, little endian.
With 24-bit audio you normally define the format with the help of WAVEFORMATEXTENSIBLE structure, as opposed to WAVEFORMATEX (well, the latter is also possible in terms of being accepted by certain filters, but in general you are expected to use the former).
The structure has two values: number of bits per sample and number of valid bits per sample. So it's possible to have the 24-bit data represented as 24-bit values, and also as 24-bit meaningful bits of 32-bit values. The payload data should match the format.
There is no mix of bits of different samples within a byte:
However, wBitsPerSample is the container size and must be a multiple of 8, whereas wValidBitsPerSample can be any value not exceeding the container size. For example, if the format uses 20-bit samples, wBitsPerSample must be at least 24, but wValidBitsPerSample is 20.
To my best knowledge it's typical to have just 24-bit values, that is three bytes per PCM sample.
Non-PCM formats might define different packing and use "unused" bits more efficiently, so that, for example, to samples of 20-bit audio consume 5 bytes.
SystemML comes packaged with a range of scripts that generate random input data files for use by the various algorithms. Each script accepts an option 'format' which determines whether the data files should be written in CSV or binary format.
I've taken a look at the binary files but they're not in any format I recognize. There doesn't appear to be documentation anywhere online. What is the binary format? What fields are in the header? For dense matrices, are the data contiguously packed at the end of the file (IEEE-754 32-bit float), or are there metadata fields spaced throughout the file?
Essentially, our binary format for matrices and frames are hadoop sequence files (single file or directory of part files) of type <MatrixIndexes,MatrixBlock> (with MatrixIndexes being a long-long pair for row/column block indexes) and <LongWritable,FrameBlock>, respectively. So anybody with hadoop io libraries and SystemML in the classpath can consume these files.
In detail, this binary blocked format is our internal tiled matrix representation (with default blocksize of 1K x 1K entries, and hence fixed logical but potentially variable physical size). Any external format provided to SystemML, such as csv or matrix market, is automatically converted into binary block format and all operations work over these binary intermediates. Depending on the backend, there are different representations, though:
For singlenode, in-memory operations and storage, the entire matrix is represented as a single block in deserialized form (where we use linearized double arrays for dense and MCSR, CSR, or COO for sparse).
For spark operations and storage, a matrix is represented as JavaPairRDD<MatrixIndexes, MatrixBlock> and we use MEMORY_AND_DISK (deserialized) as default storage level in aggregated memory.
For mapreduce operations and storage, matrices are actually persisted to sequence files (similar to inputs/outputs).
Furthermore, in serialized form (as written to sequence files or during shuffle), matrix blocks are encoded in one of the following: (1) empty (header: int rows, int cols, byte type), (2) dense (header plus serialized double values), (3) sparse (header plus for each row: nnz per row, followed by column index, value pairs), (4) ultra-sparse (header plus triples of row/column indexes and values, or pairs of row indexes and values for vectors). Note that we also redirect java serialization via writeExternal(ObjectOutput os) and readExternal(ObjectInput is) to the same serialization code path.
There are more details, especially with regard to the recently added compressed matrix blocks and frame blocks - so please ask if you're interested in anything specific here.
What is the "standard way" of working with 24-bit audio? Well, there are no 24-bit data types available, really. Here are the methods that come into my mind:
Represent 24-bit audio samples as 32-bit ints and ignore the upper eight bits.
Just like (1) but ignore the lower eight bits.
Represent 24-bit audio samples as 32-bit floats.
Represent the samples as structs of 3 bytes (acceptable for C/C++, but bad for Java).
How do you work this out?
Store them them as 32- or 64-bit signed ints or float or double unless you are space conscious and care about packing them into the smallest space possible.
Audio samples often appear as 24-bits to and from audio hardware since this is commonly the resolution of the DACs and ADCs - although on most computer hardware, don't be surprised to find the bottom 3 of 4 bits banging away randomly with noise.
Digital signal processing operations - which is what usually happens downstream from the acquisition of samples - all involve addition of weighted sums of samples. A sample stored in an integer type can be considered to be fixed-point binary with an implied binary point at some arbitrary point - the position of which you can chose strategically to maintain as many bits of precision as possible.
For instance, the sum of two 24-bit integer yields a result of 25 bits. After 8 such additions, the 32-bit type would overflow and you would need to re-normalize by rounding and shifting right.
Therefore, if you're using integer types to store your samples, use the largest you can and start with the samples in the least significant 24 bits.
Floating point types of course take care of this detail for you, although you get less choice about when renormalisation takes place. They are the usual choice for audio processing where hardware support is available. A single precision float has a 24-bit mantissa, so can hold a 24-bit sample without loss of precision.
Usually floating point samples are stored in the range -1.0f < x < 1.0f.
What do the values in the mData member represent? It looks like each value is a 4 byte integer...
I guess my question is, what does each sample supposed to represent and what does the mNumberChannels member represent?
If I had to apply some sort of transform on the sound pattern, can I treat these samples as discrete samples in time? If so, what time period does each 512 samples represent?
Thanks
Deshawn
The mData buffer array elements can represent 16-bit signed integers, stereo pairs of 16-bit signed integers, 32-bit 8.24/s7.24 scaled-integer or fixed-point values, or 32-bit floating-point values, etc., depending on the Audio Unit and how it was configured.
The buffer duration will be its length in frames divided by the audio sample rate, for instance 512/44100 is about 11.61 milliSeconds.