GStreamer - Generate audio waveform from MP4 file - audio

I have a two-part question:
1) I have an MP4 file and want to generate its audio waveform.
2) I have another MP4 file which has audio on channel [0] and channel [1] plus a video track; I want to generate waveforms for both channels as separate images.
How can I achieve both of the above using GStreamer?

Is the raw audio in 16-bit or 32-bit format? What is the sample rate (e.g. 44100 Hz), and what is the duration?
Anyway, assuming 44.1 kHz and a 10-second duration: you can't draw all 441,000 samples across the pixel width, so choose a final display size (e.g. width = 800px and height = 600px) and do the math:
//# is (samplerate * duration) / width...
(44100 * 10) / 800 = 551;
After reading first 2 values, you will jump ahead by 551 bytes and repeat until total of (.
So in your raw data starting from pos = 0;...
1) Check this and next sample, then multiply their values together (sample[pos] x sample[pos+1]).
2) Take that result and divide by 65535 (maximum value of 16 bits, or 2 bytes). That's the final value of your first sample or point.
3) Draw a line to fit according to image height (eg: 600px) so if sample = 0.83 then:
line_height = (600 x 0.83); //# gives 498 as line height
line_count += 1; //# add plus 1 to line count (stop when it reaches 800)
4) Skip ahead from position [pos] by +551 bytes and repeat step (1) until line_count == 800;
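The bucketing idea above can be sketched in Python. This is only an illustration under stated assumptions, not the answer's exact method: it assumes the audio has already been decoded to one channel of signed 16-bit PCM, and it takes the peak absolute value of each bucket instead of combining two adjacent values. The function name `waveform_heights` and its parameters are hypothetical.

```python
def waveform_heights(samples, width=800, height=600, full_scale=32768):
    """Reduce one channel of signed 16-bit samples to at most `width` line heights.

    Each output value is the peak magnitude of one bucket of samples,
    scaled to the image height in pixels.
    """
    if not samples:
        return []
    # Samples per pixel column, e.g. (44100 * 10) // 800 = 551.
    step = max(len(samples) // width, 1)
    heights = []
    for start in range(0, len(samples), step):
        if len(heights) == width:  # stop once every column has a line
            break
        bucket = samples[start:start + step]
        peak = max(abs(s) for s in bucket)
        heights.append(round(height * peak / full_scale))
    return heights
```

For the two-channel case, interleaved stereo samples could be split first with `samples[0::2]` and `samples[1::2]`, and each half passed through the same function to produce two separate images.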


Applying CNN to Short-time Fourier transform?

I have code that returns a Short-Time Fourier Transform spectrum of a .wav file. I want to be able to take, say, a millisecond of the spectrum and train a CNN on it.
I'm not quite sure how I would implement that. I know how to format the image data to feed into the CNN, and how to train the network, but I'm lost on how to take the FFT data and divide it into small time frames.
The FFT code (sorry for the very long code):
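import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from skimage import util
rate, audio = wavfile.read('scale_a_lydian.wav')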
audio = np.mean(audio, axis=1)
N = audio.shape[0]
L = N / rate
M = 1024
# Audio is 44.1 kHz, or ~44100 samples / second
# window function takes 1024 samples or 0.02 seconds of audio (1024 / 44100 = ~0.02 seconds)
# and shifts the window 100 over each time
# so there would end up being (total_samplesize - 1024)/(100) total steps done (or slices)
slices = util.view_as_windows(audio, window_shape=(M,), step=100) #slices overlap
win = np.hanning(M + 1)[:-1]
slices = slices * win #each slice is 1024 samples (0.02 seconds of audio)
slices = slices.T #transpose matrix -> make each column 1024 samples (ie. make each column one slice)
spectrum = np.fft.fft(slices, axis=0)[:M // 2 + 1:-1] #perform fft on each slice and then take the first half of each slice, and reverse
spectrum = np.abs(spectrum) #take absolute value of slices
# take SampleSize * Slices
# transpose into slices * samplesize
# Take the first row -> slice * samplesize
# transpose back to samplesize * slice (essentially get 0.01s of spectrum)
spectrum2 = spectrum.T
spectrum2 = spectrum2[:1]
spectrum2 = spectrum2.T
The following outputs an FFT spectrum:
N = spectrum2.shape[0]
L = N / rate
f, ax = plt.subplots(figsize=(4.8, 2.4))
S = np.abs(spectrum2)
S = 20 * np.log10(S / np.max(S))
ax.imshow(S, origin='lower', cmap='viridis',
extent=(0, L, 0, rate / 2 / 1000))
ax.set_ylabel('Frequency [kHz]')
ax.set_xlabel('Time [s]');
(Feel free to correct any theoretical errors that I put in the comments)
So from what I understand, I have a numpy array (spectrum) with each column being a slice with 510 samples (Cut in half, because half of each FFT slice is redundant (useless?)), with each sample having the list of frequencies?
EDIT: So the above code theoretically outputs 0.01s of audio as a spectrum, which is exactly what I need. Is this true, or am I not thinking right?
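One possible way to cut the spectrum into small time frames, sketched under assumptions: each column of `spectrum` covers 1024 samples and consecutive columns start 100 samples apart (100 / 44100 ≈ 2.27 ms), so a fixed number of adjacent columns can be grouped into one training example. The helper name `split_into_frames` and the `cols_per_frame` parameter are hypothetical, not part of the code above.

```python
import numpy as np

def split_into_frames(spectrum, cols_per_frame):
    """Split a (freq_bins, time_slices) array into equal-width time frames.

    Returns an array of shape (n_frames, freq_bins, cols_per_frame).
    Trailing columns that do not fill a whole frame are dropped.
    """
    freq_bins, time_slices = spectrum.shape
    n_frames = time_slices // cols_per_frame
    usable = spectrum[:, :n_frames * cols_per_frame]
    # (freq_bins, n_frames, cols_per_frame) -> (n_frames, freq_bins, cols_per_frame)
    return usable.reshape(freq_bins, n_frames, cols_per_frame).transpose(1, 0, 2)
```

Each element of the result is a small 2-D image (frequency by time) that could be fed to a CNN in the same way as any other single-channel image.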

Working out sample rate and bit depth of aiff audio from file size

I need some help with Maths/logic here. Working with aif files.
I have written the following:
LnByte = FileLen(ToCheck) 'Returns Filesize in Bytes
LnBit = LnByte * 8 'Get filesize in Bits
Chan = 1 'Channels in audio: mono = 1
BDpth = 24 'Bit Depth
SRate = 48000 'Sample Rate
BRate = 1152000 'Expected Bit Rate
Time_Secs = LnBit / Chan / BDpth / SRate 'Size in Bits / Channels / Bit Depth / Sample Rate
FSize = (BRate / 8) * Time_Secs '(Bitrate / 8) * Length of file in seconds
ToCheck is the current file when looping through a folder of files.
So I'm finding the length of audio based on the file size in bits / channels / bit depth / sample rate. This assumes that the bit depth and sample rate are correct (I need the files to be 24-bit/48kHz).
Time_Secs = Length of the file in seconds.
FSize = File size based on 24/48kHz using the Time_Secs
Probably because the FSize uses Time_Secs, I can't work out how to, from this, work out if the file sample rate and/or bit depth are indeed correct...
Assuming 24/48k should give 144,000 Bytes per second
Assuming 16/48k should give 96,000 Bytes per second
If I check a file that is 16-bit/48 kHz using the above code, it gives the incorrect time in seconds (naturally) but the correct file size, even though the bit rate of 1,152,000 should be wrong.
-- It would seem that the difference in time is making up for the difference in bit rate, or I'm looking at it wrong.
How would I adapt my formula, or do the maths to work out if the sample rate/bit depth of a file is actually 48,000 Hz /24-bit? Or is there a different way entirely? Remembering that they are aif files, not wavs.
Hope that makes sense.
Many Thanks in advance!
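File size alone cannot confirm the format here: Time_Secs is derived from the assumed bit depth and sample rate, and since BRate = Chan * BDpth * SRate (1 * 24 * 48000 = 1,152,000), FSize always works out to LnByte again. The actual values are stored in the AIFF file's COMM chunk. The following is a minimal Python sketch of reading them, assuming the whole file has been loaded into a bytes object; the function names are hypothetical and the sketch ignores AIFC compression details.

```python
import struct

def parse_extended(raw):
    """Decode the 80-bit big-endian extended float AIFF uses for the sample rate."""
    sign_exp, mantissa = struct.unpack(">HQ", raw)
    sign = -1.0 if sign_exp & 0x8000 else 1.0
    exponent = sign_exp & 0x7FFF
    if exponent == 0 and mantissa == 0:
        return 0.0
    return sign * mantissa * 2.0 ** (exponent - 16383 - 63)

def read_aiff_format(data):
    """Return (channels, sample_frames, bits_per_sample, sample_rate)."""
    if data[:4] != b"FORM" or data[8:12] not in (b"AIFF", b"AIFC"):
        raise ValueError("not an AIFF file")
    pos = 12
    while pos + 8 <= len(data):
        chunk_id, size = struct.unpack(">4sI", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + size]
        if chunk_id == b"COMM":
            channels, frames, bits = struct.unpack(">hIh", body[:8])
            return channels, frames, bits, parse_extended(body[8:18])
        pos += 8 + size + (size & 1)  # chunks are padded to an even length
    raise ValueError("no COMM chunk found")
```

A file passes the check if the returned bit depth is 24 and the sample rate is 48000.0; the duration is then sample_frames / sample_rate, independent of the file size.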

What is the correct audio volume slider formula?

I'm building a VoIP application. If I take the slider value and just multiply audio samples by it, I get incorrect, nonlinear sounding results. What's the correct formula to get smooth results?
The correct formula is the decibel formula solved for Prms. Here's example code in C++:
#include <cmath>
#include <cstddef>
#include <cstdint>
// level is 0 to 1, silence is dBFS at level 0
void AdjustVolume(int16_t* buffer, size_t length, float level, float silence = -96)
{
    // Map the slider position to a linear gain: 0 dB at level 1, `silence` dB at level 0.
    float factor = std::pow(10.0f, (1 - level) * silence / 20.0f);
    for (size_t i = 0; i < length; i++)
        buffer[i] = static_cast<int16_t>(buffer[i] * factor);
}
There's one tweakable: silence. It's the level of noise present when there's no sound, or put differently, the loudness level below which you can't hear the sound because of the background noise. The theoretical noise floor for 16-bit audio samples is about -96 dB (the dynamic range of 16 bits, 20 * log10(65536) ≈ 96.3 dB). In the real world, however, there's background noise produced by the audio equipment and the surroundings of the listener, so you might want to pick a noisier silence level, like -30 dB or so. Picking the correct silence value will maximize the useful area of your volume slider, that is, minimize the part of the slider where no perceptible change in volume occurs.
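To see how the slider position maps to a gain, here is the same formula as a small Python sketch. The function name `slider_gain` is hypothetical; the formula is the one used in AdjustVolume.

```python
def slider_gain(level, silence_db=-96.0):
    """Linear gain for a slider position in [0, 1].

    level = 1 gives 0 dB (gain 1.0); level = 0 gives `silence_db` dB.
    """
    attenuation_db = (1.0 - level) * silence_db
    return 10.0 ** (attenuation_db / 20.0)
```

With silence_db = -40, for example, the midpoint of the slider corresponds to -20 dB, which is a linear gain of 0.1.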

Amplitude of audio signal harmonics in Unity3D

I have managed to calculate the pitch of audio input from the microphone using the GetSpectrumData function. But now I need to get the amplitudes of the first 7 harmonics of the audio (project requirement).
I have very little knowledge of audio DSP. The only thing I understand is that harmonics are multiples of the fundamental frequency. But how do I get the amplitudes of the harmonics?
First you need to figure out which FFT bin your fundamental frequency is in. Say it resides in bin# 10. The harmonics will reside in integer multiples of that bin so the 2nd harmonic will be in bin 20, 3rd in bin 30 and so on. For each of these harmonic bins you need to compute the amplitude. Depending on the window function you used in the FFT you will need to include a small number of bins in the calculation (google spectral leakage if you're interested).
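// Amplitude of one harmonic: square root of the summed squared magnitudes around its bin.
// harmonic = 1 is the fundamental at peakBin; harmonic k is centered on bin k * peakBin.
double computeAmpl(double[] spectrum, int windowHalfLen, int peakBin, int harmonic)
{
    int center = peakBin * harmonic;
    double sumOfSquares = 0.0;
    for (int bin = center - windowHalfLen; bin <= center + windowHalfLen; bin++)
        sumOfSquares += spectrum[bin] * spectrum[bin];
    return Math.Sqrt(sumOfSquares);
}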
As I mentioned the window half length depends on the window. Some common ones are:
blackman-harris 3 - 3
blackman-harris 4 - 4
flat top - 5
hann - 3
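As a rough illustration of applying this to the first 7 harmonics, here is a Python sketch (not Unity code). It assumes `spectrum` is a list of FFT bin magnitudes and that the fundamental's bin is already known; the function name and the clamping at the spectrum edges are assumptions added for the example.

```python
import math

def harmonic_amplitudes(spectrum, fundamental_bin, half_len, count=7):
    """Amplitude of harmonics 1..count, each summed over 2 * half_len + 1 bins."""
    amplitudes = []
    for k in range(1, count + 1):
        center = k * fundamental_bin
        lo = max(center - half_len, 0)
        hi = min(center + half_len, len(spectrum) - 1)
        if lo > hi:  # harmonic lies beyond the end of the spectrum
            amplitudes.append(0.0)
            continue
        energy = sum(spectrum[b] ** 2 for b in range(lo, hi + 1))
        amplitudes.append(math.sqrt(energy))
    return amplitudes
```

For a Hann window, `half_len` would be 3 according to the list above.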

Finding number of samples in .wav file and Hex Editor

I need help with a hex editor and audio files. I am having trouble figuring out the formula to get the number of samples in my .wav files.
I downloaded StripWav, which tells me the number of samples in the .wav files, but I still cannot figure out the formula.
Can you please download these two .wavs, open them in a hex editor, and tell me the formula to get the number of samples?
If you kindly do this for me, please tell me the number of samples for each .wav so I can make sure the formula is correct.
Here is the problem. I have two programs:
one reads the wav data and the other shows the number of samples.
Here is the data:
RIFF 'WAVE' (wave file)
<fmt > (format description)
PCM format
2 channel
44100 frames per sec
176400 bytes per sec
4 bytes per frame
16 bits per sample
<data> (waveform data - 92252 bytes)
But the other program says NumSamples is
23,063 samples
One more thing: I did the calculation with 2 files.
This one is correct:
92,296 bytes and the number of samples is 23,063
But this other one is not coming out correctly. It is over 2 MB; I just subtracted 44 bytes. Am I doing it wrong here? Here is the file size:
2,473,696 bytes
But the correct numsamples is
WAVE format
You must read the fmt header to determine the number of channels and bits per sample, then read the size of the data chunk to determine how many bytes of data are in the audio. Then:
NumSamples = NumBytes / (NumChannels * BitsPerSample / 8)
There is no simple formula for determining the number of samples in a WAV file. A so-called "canonical" WAV file consists of a 44-byte header followed by the actual sample data. So, if you know that the file uses 2 bytes per sample, then the number of samples is equal to the size of the file in bytes, minus 44 (for the header), and then divided by 2 (since there are 2 bytes per sample).
Unfortunately, not all WAV files are "canonical" like this. A WAV file uses the RIFF format, so the proper way to parse a WAV file is to search through the file and locate the various chunks.
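A minimal sketch of that chunk search in Python, assuming the whole file has been read into a bytes object. The function name `wav_frame_count` is hypothetical; it returns both the number of sample frames (what the question calls NumSamples) and the total number of individual samples across all channels.

```python
import struct

def wav_frame_count(data):
    """Walk the RIFF chunks and return (frames, total_samples)."""
    if data[:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    block_align = None
    channels = None
    pos = 12
    while pos + 8 <= len(data):
        chunk_id, size = struct.unpack("<4sI", data[pos:pos + 8])
        if chunk_id == b"fmt ":
            # wFormatTag, nChannels, nSamplesPerSec, nAvgBytesPerSec,
            # nBlockAlign, wBitsPerSample
            (_tag, channels, _rate, _byte_rate,
             block_align, _bits) = struct.unpack("<HHIIHH", data[pos + 8:pos + 24])
        elif chunk_id == b"data":
            if block_align is None:
                raise ValueError("data chunk found before fmt chunk")
            frames = size // block_align
            return frames, frames * channels
        pos += 8 + size + (size & 1)  # chunks are padded to an even length
    raise ValueError("no data chunk found")
```

For the file in the question (2 channels, 4 bytes per frame, 92252 bytes of data) this gives 92252 / 4 = 23063 frames, which matches the reported NumSamples.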
Here is a sample (not sure what language you need to do this in):
A WAVE's format chunk (fmt) has the 'bytes per sample frame' specified as wBlockAlign.
So: framesTotal = data.ck_size / fmt.wBlockAlign;
and samplesTotal = framesTotal * wChannels;
Thus, samplesTotal === framesTotal IFF wChannels === 1!!
Note how the above answer elegantly avoided explaining that the key equations in the spec (and answers based on them) are WRONG:
consider, for example, a 2-channel wave with 12 bits per sample (bps).
The spec explains that we put each 12-bit sample in a 2-byte word:
note: t=point in time, chan = channel
|            frame 1            |            frame 2            | etc
|  chan 1 # t1  |  chan 2 # t1  |  chan 1 # t2  |  chan 2 # t2  | etc
| byte  | byte  | byte  | byte  | byte  | byte  | byte  | byte  | etc
So.. how many bytes does the sample-frame (BlockAlign) for a 2ch 12bps wave have according to spec?
<sarcasm> CEIL(wChannels * bps / 8) = 3 bytes.. </sarcasm>
Obviously the correct equation is: wBlockAlign=wChannels*CEIL(bps/8)
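The difference between the two equations is easy to check numerically. A small Python sketch, with hypothetical function names, comparing the per-sample rounding against the rounding of the whole frame:

```python
def block_align(channels, bits_per_sample):
    """wBlockAlign = wChannels * CEIL(bps / 8): round each sample up to whole bytes."""
    bytes_per_sample = (bits_per_sample + 7) // 8
    return channels * bytes_per_sample

def naive_block_align(channels, bits_per_sample):
    """CEIL(wChannels * bps / 8): rounds the whole frame, which packs samples too tightly."""
    return (channels * bits_per_sample + 7) // 8
```

For 2 channels at 12 bits per sample the first gives 4 bytes per frame and the second gives 3; for bit depths that are already a multiple of 8 the two agree.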
