XNA Microphone audio buffer format?

I'm working on an XNA script in which I want to read data from the microphone every couple of frames and estimate its pitch. My input code is based almost exactly on this page (http://msdn.microsoft.com/en-us/library/ff827802.aspx).
Now I've got a buffer full of bytes. What does it represent? I reset everything and look at my buffer every 10th frame, so it appears to be a giant array holding 9 instances of 1764 bytes from different points in time (the whole thing is 15876 bytes). I'm assuming it's the time domain of sound pressure, because I can't find any information on the format of microphone input. Does anybody know how this works? I have a friend who has an FFT up and running, but we're trying to learn as much as we can about the data I'm collecting before we attempt to plug it in.

The samples are little-endian, 16-bit linear PCM (i.e. the time-domain signal you guessed). Convert each pair of bytes into a signed short:
short sample = (short)(buffer[i] | buffer[i+1] << 8);
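If it helps to see the whole buffer converted at once, here is a minimal sketch along the same lines (it assumes mono 16-bit, which also explains your numbers: 1764 bytes = 882 samples = 20 ms of audio at 44100 Hz):

// Convert a raw microphone buffer (little-endian 16-bit mono PCM)
// into signed samples, normalized to [-1, 1) for FFT use.
static float[] ToSamples(byte[] buffer)
{
    float[] samples = new float[buffer.Length / 2];
    for (int i = 0; i < samples.Length; i++)
    {
        short s = (short)(buffer[2 * i] | buffer[2 * i + 1] << 8);
        samples[i] = s / 32768f; // 2^15, since the samples are signed
    }
    return samples;
}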

Related

What is PCM audio's frame sync word to identify the beginning position?

As the title says: in some compressed formats such as EAC3 and AC3, each frame starts with a sync word.
So what is the sync word for PCM (raw audio)? How do I identify the beginning of a PCM frame?
I ran into a problem where the audio is a concatenation of several audio segments, each with a different frame size, and I need to identify the start positions.
Thanks in advance.
There is no such concept as a frame in PCM. The purpose of a frame is to mark points of random access; in PCM, every single sample is a point of random access, so start indicators are not required, and there is no standard frame size. It's all up to you.
A PCM frame is different from the frames you're describing: a frame is just a single sample across all channels. That is, if I'm recording 16-bit stereo PCM audio, each frame is 4 bytes (32 bits) long.
There is no sync word, nor frame header in raw PCM. It's just a stream of data. You need to know the bit depth, channel count, and current offset if you want to sync to it. (Or, you need to do some simple heuristics. For example, apply several different formats and offsets to a small chunk of data and see which one has the least variance/randomness from sample to sample.)
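To illustrate that last heuristic (just a sketch, not a standard algorithm): decode a small chunk under each candidate interpretation and keep the one whose successive samples differ the least, since real audio is locally smooth. Something like:

using System;

class FormatGuesser
{
    // Score one interpretation of raw bytes: the average absolute
    // difference between successive 16-bit samples. Real audio is
    // locally smooth, so the correct offset/endianness tends to
    // give the lowest score.
    static double Score(byte[] data, int offset, bool littleEndian)
    {
        long total = 0;
        int count = 0;
        short prev = 0;
        for (int i = offset; i + 1 < data.Length; i += 2)
        {
            short s = littleEndian
                ? (short)(data[i] | data[i + 1] << 8)
                : (short)(data[i + 1] | data[i] << 8);
            if (count > 0) total += Math.Abs(s - prev);
            prev = s;
            count++;
        }
        return count > 1 ? (double)total / (count - 1) : double.MaxValue;
    }
}

Trying offsets 0 and 1 with both byte orders over a few kilobytes is usually enough to separate the candidates.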

Decoding incomplete audio file

I was given an uncompressed .wav audio file (360 MB) which seems to be broken. The file was recorded using a small USB recorder (I don't have more information about the recorder at the moment). It was unreadable by any player, and I tried GSpot (https://www.headbands.com/gspot/) to detect whether it was perhaps a different format than wav, but to no avail. The file is big, which hints at it being in some uncompressed format. It is missing the RIFF-WAVE characters at the start, though, which could indicate some other format or (more likely in this case) a missing header.
I've tried converting the bytes of the file directly to audio, and this creates a VERY noisy audio file, though voices can be made out. I was able to determine the sample rate was probably 22050 Hz (given a sample size of 8 bits) and the file length about 4 hours and 45 minutes. Running it through some filters in Audition resulted in a file that was understandable in some places, but still way too noisy in others.
Next I ran the data through some Java code that produces an image out of the bytes, and it showed me lots of noise, but also 3-byte separators every 1024 bytes: first a byte close to either 0 or 255 (but not 100% of the time), then a byte representing a number distributed somewhere around 25 (with some variation), and then a 00000000 (always, 100%). The first 'chunk header' (as I suppose these are) is located 513 bytes into the file, again close to a power of two, like the chunk size. That seems a bit too perfect for coincidence, so I'm mentioning it as it could be important. https://imgur.com/a/sgZ0JFS: the first image shows a 1024x1024 image of the first 1 MB of the file (row-wise), and the second image shows the distribution of the 3 'chunk header' bytes.
Besides these headers, the file also has areas that clearly show structure, almost wave-like patterns. I suppose this is the actual audio I'm after, but it's riddled with noise: https://imgur.com/a/sgZ0JFS, third image, showing a region of the file with audio structures.
I also created a histogram for the entire file (ignoring the 3-byte 'chunk headers'): https://imgur.com/a/sgZ0JFS, fourth image. I've flipped the lower half of the range as I think audio data should be centered around some mean value, but correct me if I'm wrong. Maybe the non-symmetric nature of the histogram has something to do with signed/unsigned data or two's-complement. Perhaps the data representation is in 8-bit floats or something similar, I don't know.
I've run into a wall now and have no idea what else to try. Is there anyone out there who sees something I missed? Perhaps someone can give me some pointers on what else to try. I would really like to extract the audio data from this file, as it contains some important information.
Sorry for the bother. I've been able to track down the owner of the voice recorder and had him record a minute of audio with it and send me that file. I was able to determine the audio was IMA 4-bit ADPCM encoded, 16-bit audio at 48000 Hz. Looking at the structure of the file, I realized simply placing the header of the good file in front of the data of the bad file should be possible, and lo and behold, I had a working file again :)
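In case anyone wants to reproduce that header transplant, here is a rough sketch of the idea (file names are placeholders; it assumes both recordings share the same format and that the donor file's header ends at its 'data' chunk):

using System;
using System.IO;

byte[] good = File.ReadAllBytes("good.wav");   // short, intact recording
byte[] bad = File.ReadAllBytes("broken.wav");  // headerless data

// Find the "data" chunk in the intact file; the header is everything
// up to and including its 4-byte size field.
int dataPos = -1;
for (int i = 12; i + 8 <= good.Length; i++)
{
    if (good[i] == 'd' && good[i + 1] == 'a' && good[i + 2] == 't' && good[i + 3] == 'a')
    { dataPos = i; break; }
}
if (dataPos < 0) throw new InvalidDataException("no data chunk found");
int headerLen = dataPos + 8;

byte[] repaired = new byte[headerLen + bad.Length];
Array.Copy(good, repaired, headerLen);
Array.Copy(bad, 0, repaired, headerLen, bad.Length);

// Patch the RIFF size (offset 4) and the data chunk size.
// (BitConverter writes little-endian on typical machines, as RIFF requires.)
BitConverter.GetBytes(repaired.Length - 8).CopyTo(repaired, 4);
BitConverter.GetBytes(bad.Length).CopyTo(repaired, dataPos + 4);

File.WriteAllBytes("repaired.wav", repaired);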
I'm still very much interested in how that ADPCM works and whether I can write my own decoder, but that's for another day when I'm strolling through Wikipedia again. Have a great day everyone!

Realtime STFT and ISTFT in Julia for Audio Processing

I'm new to audio processing and dealing with data that's being streamed in real-time. What I want to do is:
listen to a built-in microphone
chunk together samples into 0.1-second chunks
convert the chunk into a periodogram via the short-time Fourier transform (STFT)
apply some simple functions
convert back to time series data via the inverse STFT (ISTFT)
play back the new audio on headphones
I've been looking around for "real time spectrograms" to give me a guide on how to work with the data, but no dice. I have, however, discovered some interesting packages, including PortAudio.jl, DSP.jl and MusicProcessing.jl.
It feels like I'd need multiprocessing techniques just to store the incoming data into suitable chunks, whilst simultaneously applying some function to a previous chunk, whilst also playing another previously processed chunk. All of this feels overcomplicated and has been putting me off approaching this project for a while now.
Any help will be greatly appreciated, thanks.
As always, start with a simple version of what you really need ... ignore pulling in audio from a microphone for now; instead, write some code to synthesize a sine curve of a known frequency and use that as your input audio, or read in audio from a wav file - the benefit here is that it's known and reproducible, unlike microphone audio
this post shows how to use some of the libs you mention: http://www.seaandsailor.com/audiosp_julia.html
You speak of a "real time spectrogram" ... this is simply repeatedly processing a window of audio, so let's initially simplify that as well ... once you are able to read in the wav audio file, send it into an FFT call, which will return that audio curve in its frequency-domain representation ... as you correctly state, this frequency-domain data can then be sent into an inverse FFT call to give you back the original time-domain audio curve
Once you get the above working, wrap it in a call which supplies a sliding window of audio samples to give you the "real time" benefit of being able to parse incoming audio from your microphone ... keep in mind you always use a power-of-two number of audio samples in the window you feed into your FFT and IFFT calls ... let's say your window is 16384 samples ... your Julia server will need to juggle multiple demands: (1) pluck the next buffer of samples from your microphone feed, (2) send a window of samples into your FFT and IFFT calls ... be aware the number of audio samples in your sliding window will typically be wider than your incoming microphone buffer - hence the notion of a sliding window ... over time, add each mic buffer to the front of this window and remove the same number of samples from its tail end
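A sketch of that juggling act, written in C# to match the snippet at the top of this page (the FFT/IFFT/playback calls are stubs standing in for whatever DSP and audio library you end up using):

using System;
using System.Numerics;

class SlidingWindow
{
    // Power-of-two analysis window, wider than the incoming mic buffers.
    const int WindowSize = 16384;
    readonly float[] window = new float[WindowSize];

    // Called every time a new microphone buffer arrives.
    public void OnMicBuffer(float[] micBuffer)
    {
        // Slide: discard the oldest samples...
        Array.Copy(window, micBuffer.Length, window, 0, WindowSize - micBuffer.Length);
        // ...and append the fresh ones at the end.
        Array.Copy(micBuffer, 0, window, WindowSize - micBuffer.Length, micBuffer.Length);

        Complex[] spectrum = Fft(window);   // stand-in for your FFT library call
        // ... apply your frequency-domain processing to 'spectrum' here ...
        float[] processed = Ifft(spectrum); // stand-in for the inverse FFT
        Play(processed);                    // hand off to the audio output
    }

    // Stubs: wire these up to your actual DSP/audio library.
    static Complex[] Fft(float[] x) => throw new NotImplementedException();
    static float[] Ifft(Complex[] s) => throw new NotImplementedException();
    static void Play(float[] x) { }
}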

Can we get a relation between time in seconds and bytes of an audio file?

I want the relation between time and bytes in an Ogg file. Say I have a 5-second Ogg file that is 68*1024 bytes long; if I cut a chunk from that file and save it, can I know the resulting size beforehand? For example, I want to cut from 2.4 s to 3.2 s.
Is there some mathematical calculation that gives an accurate answer in bytes? Can anyone tell me please if this is possible?
Bit rate: 128 kbps, 16-bit, sample rate: 44.1 kHz, stereo
I used the logic from a linked answer but can't get an accurate result.
Any such direct mapping between file size and play time will work, but not if the codec uses variable bit rate (vbr) encoding ... an encoder is vbr when its output rate depends on the informational density of the source media - repetitive audio compresses more efficiently than, say, random noise ... vbr algorithms are typically more space-efficient, because to maintain a constant bit rate a cbr encoder pads the buffer with filler data just so its throughput stays at a constant bytes per second
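Under a constant-bit-rate assumption (and only then), the mapping is straightforward. A sketch using the numbers from the question:

// Only valid for constant-bit-rate streams; Ogg Vorbis is usually
// VBR, so treat the result as an approximation at best.
int bitRate = 128000;                  // 128 kbps, from the question
double bytesPerSecond = bitRate / 8.0; // 16000 bytes per second

double startSec = 2.4, endSec = 3.2;
long startByte = (long)(startSec * bytesPerSecond);             // 38400
long chunkBytes = (long)((endSec - startSec) * bytesPerSecond); // 12800

Note that the question's own numbers already hint at VBR: 5 s at 16000 bytes/s would be 80000 bytes, but the file is 68*1024 = 69632 bytes.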

What do the bytes in a .wav file represent?

When I store the data in a .wav file into a byte array, what do these values mean?
I've read that they are in two-byte representations, but what exactly is contained in these two-byte values?
You will have heard that audio signals are represented by some kind of wave. If you have ever seen one of those wave diagrams with a line going up and down -- that's basically what's inside those files. Take a look at this picture from http://en.wikipedia.org/wiki/Sampling_rate
You see your audio wave (the gray line). The current value of that wave is repeatedly measured and given as a number. That's what the numbers in those bytes are. There are two things that can be adjusted here: the number of measurements you take per second (that's the sampling rate, given in Hz -- how many you grab per second), and how exactly you measure. In the 2-byte case, you take two bytes for one measurement (that normally allows values from -32768 to 32767). With those numbers, you can recreate the original wave (up to a limited quality, of course, but that's always the case when storing things digitally). And recreating the original wave is what your speaker tries to do on playback.
There are some more things you need to know. First, since it's two bytes, you need to know the byte order (big-endian or little-endian) to recreate the numbers correctly. Second, you need to know how many channels you have and how they are stored. Typically you'd have mono (one channel) or stereo (two), but more are possible. If you have more than one channel, they are often interleaved: you get one value for each channel for every point in time, and after that all values for the next point in time.
To illustrate: if you have 8 bytes of data for two channels and 16-bit numbers:
abcdefgh
Here a and b make up the first 16-bit number, the first value for channel 1; c and d are the first number for channel 2. e and f are the second value of channel 1, g and h the second value for channel 2. You wouldn't hear much there, because that doesn't come close to a second of data...
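Here's a small sketch of reading such interleaved data back into per-channel samples (assuming little-endian byte order, as .wav uses):

// Split interleaved 16-bit little-endian stereo bytes
// (a b c d e f g h -> L0 R0 L1 R1) into per-channel samples.
static void Deinterleave(byte[] raw, out short[] left, out short[] right)
{
    int frames = raw.Length / 4; // 4 bytes per frame: 2 channels x 2 bytes
    left = new short[frames];
    right = new short[frames];
    for (int f = 0; f < frames; f++)
    {
        left[f]  = (short)(raw[4 * f]     | raw[4 * f + 1] << 8);
        right[f] = (short)(raw[4 * f + 2] | raw[4 * f + 3] << 8);
    }
}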
If you put together all the information you have, you can calculate the bit rate, i.e. how many bits of information are generated by the recorder per second. In our example, you generate 2 bytes per channel on every sample. With two channels, that's 4 bytes. You need about 44000 samples per second to represent the sounds a human being can normally hear. So you end up with 176000 bytes per second, which is 1408000 bits per second.
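The same calculation in code form, using the exact CD values rather than the rounded 44000:

int sampleRate = 44100, channels = 2, bitsPerSample = 16;
int byteRate = sampleRate * channels * bitsPerSample / 8; // 176400 bytes/s
int bitRate = byteRate * 8;                               // 1411200 bits/s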
And of course, these are not 2-bit values, but two-byte (16-bit) values, or you would have really bad quality.
The first 44 bytes are commonly a standard RIFF header, as described here:
http://tiny.systems/software/soundProgrammer/WavFormatDocs.pdf
and here: http://www.topherlee.com/software/pcm-tut-wavformat.html
Apple/OSX/macOS/iOS created .wav files might add an 'FLLR' padding chunk to the header and thus increase the size of the initial header RIFF from 44 bytes to 4k bytes (perhaps for better disk or storage block alignment of the raw sample data).
The rest is very often 16-bit linear PCM in signed 2's-complement little-endian format, representing arbitrarily scaled samples at a rate of 44100 Hz.
The WAVE (.wav) file contains a header, which carries the formatting information of the audio file's data. Following the header is the actual raw audio data. You can check their exact meaning below.
Positions Typical Value Description
1 - 4 "RIFF" Marks the file as a RIFF multimedia file.
Characters are each 1 byte long.
5 - 8 (integer) The overall file size in bytes (32-bit integer)
minus 8 bytes. Typically, you'd fill this in after
file creation is complete.
9 - 12 "WAVE" RIFF file format header. For our purposes, it
always equals "WAVE".
13-16 "fmt " Format sub-chunk marker. Includes trailing null.
17-20 16 Length of the rest of the format sub-chunk below.
21-22 1 Audio format code, a 2 byte (16 bit) integer.
1 = PCM (pulse code modulation).
23-24 2 Number of channels as a 2 byte (16 bit) integer.
1 = mono, 2 = stereo, etc.
25-28 44100 Sample rate as a 4 byte (32 bit) integer. Common
values are 44100 (CD), 48000 (DAT). Sample rate =
number of samples per second, or Hertz.
29-32 176400 (SampleRate * BitsPerSample * Channels) / 8
This is the Byte rate.
33-34 4 (BitsPerSample * Channels) / 8, the block alignment.
1 = 8 bit mono, 2 = 8 bit stereo or 16 bit mono, 4
= 16 bit stereo.
35-36 16 Bits per sample.
37-40 "data" Data sub-chunk header. Marks the beginning of the
raw data section.
41-44 (integer) The number of bytes of the data section below this
point. Also equal to (#ofSamples * #ofChannels *
BitsPerSample) / 8
45+ The raw audio data.
I copied all of this from http://www.topherlee.com/software/pcm-tut-wavformat.html.
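To make the table concrete, here's a minimal reader sketch; it assumes the canonical 44-byte PCM layout above, and the file name is a placeholder. Real files may carry extra chunks (e.g. the 'FLLR' padding mentioned earlier), in which case you must walk the chunks instead of assuming fixed offsets.

using System;
using System.IO;

// Read the canonical 44-byte PCM header field by field.
using var r = new BinaryReader(File.OpenRead("test.wav")); // placeholder name
string riff = new string(r.ReadChars(4));   // "RIFF"
int riffSize = r.ReadInt32();               // overall size - 8
string wave = new string(r.ReadChars(4));   // "WAVE"
string fmtId = new string(r.ReadChars(4));  // "fmt "
int fmtSize = r.ReadInt32();                // 16 for plain PCM
short audioFormat = r.ReadInt16();          // 1 = PCM
short numChannels = r.ReadInt16();          // 1 = mono, 2 = stereo
int sampleRate = r.ReadInt32();             // e.g. 44100
int byteRate = r.ReadInt32();               // sampleRate * channels * bits / 8
short blockAlign = r.ReadInt16();           // channels * bits / 8
short bitsPerSample = r.ReadInt16();        // e.g. 16
string dataId = new string(r.ReadChars(4)); // "data"
int dataSize = r.ReadInt32();               // bytes of raw audio following
byte[] rawAudio = r.ReadBytes(dataSize);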
As others have pointed out, there's metadata in the wav file, but I think your question may be, specifically, what do the bytes (of data, not metadata) mean? If that's true, the bytes represent the value of the signal that was recorded.
What does that mean? Well, if you extract the two bytes (say) that represent each sample (assume a mono recording, meaning only one channel of sound was recorded), then you've got a 16-bit value. In WAV, 16-bit is (always?) signed and little-endian (AIFF, Mac OS's answer to WAV, is big-endian, by the way). So if you take the value of that 16-bit sample and divide it by 2^15 (32768, since the data is signed), you'll end up with a sample that is normalized to be within the range -1 to 1. Do this for all samples and plot them versus time (and time is determined by how many samples/second are in the recording; e.g. 44.1 kHz means 44.1 samples per millisecond, so the first sample value is plotted at t=0, the 45th at about t=1 ms, etc.) and you've got a signal that roughly represents what was originally recorded.
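That normalization step as a sketch (mono, 16-bit, little-endian assumed):

// Normalize raw 16-bit little-endian mono samples to [-1, 1).
static double[] Normalize(byte[] data)
{
    double[] signal = new double[data.Length / 2];
    for (int i = 0; i < signal.Length; i++)
    {
        short s = (short)(data[2 * i] | data[2 * i + 1] << 8);
        signal[i] = s / 32768.0; // divide by 2^15 because samples are signed
    }
    return signal;
}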
I suppose your question is "What do the bytes in the data block of a .wav file represent?" Let us go through everything systematically.
Prelude:
Let us say we play a 5 kHz sine wave using some device and record it in a file called 'sine.wav', with the recording done on a single channel (mono). You already know what the header in that file represents.
Let us go through some important definitions:
Sample: A sample of any signal is the amplitude of that signal at the point where the sample is taken.
Sampling rate: Many such samples can be taken within a given interval of time. Suppose we take 10 samples of our sine wave within 1 second; each sample is then spaced 0.1 second apart. That gives us 10 samples per second, so the sampling rate is 10 Hz. Bytes 25 to 28 in the header denote the sampling rate.
Now coming to the answer of your question:
It is not possible practically to write the whole sine wave to the file because there are infinite points on a sine wave. Instead, we fix a sampling rate and start sampling the wave at those intervals and record the amplitudes. (The sampling rate is chosen such that the signal can be reconstructed with minimal distortion, using the samples we are going to take. The distortion in the reconstructed signal because of the insufficient number of samples is called 'aliasing'.)
To avoid aliasing, the sampling rate is chosen to be more than twice the frequency of our sine wave (5 kHz). (This is the 'sampling theorem', and twice the frequency is called the 'Nyquist rate'.) Thus we decide to go with a sampling rate of 12 kHz, which means we will sample our sine wave 12000 times in one second.
Once we start recording, if we record our 5 kHz sine wave for 5 seconds, we will have 12000*5 = 60000 samples (values). We take these 60000 values and put them in an array. Then we create the proper header to reflect our metadata, and convert the samples, which we have noted in decimal, to their hexadecimal equivalents. These values are then written into the data bytes of our .wav file.
(Plot created with http://fooplot.com)
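The sampling step above, as a sketch in code (scaling the +/-1 sine into the full signed 16-bit range is an arbitrary choice here, an assumption about the recorder):

using System;

// Sample a 5 kHz sine at 12 kHz for 5 seconds -> 60000 samples.
int sampleRate = 12000, freq = 5000, seconds = 5;
short[] samples = new short[sampleRate * seconds];
for (int n = 0; n < samples.Length; n++)
{
    double t = (double)n / sampleRate;
    samples[n] = (short)(Math.Sin(2 * Math.PI * freq * t) * short.MaxValue);
}
// Written little-endian after a valid header, these shorts
// form the data block of 'sine.wav'.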
Two-bit audio wouldn't sound very good :) Most commonly, the bytes represent sample values as 16-bit signed numbers, describing the audio waveform sampled at a frequency such as 44.1 kHz.
