How do I display a spectrogram from a wav file in C++? - audio

I am doing a project in which I want to embed images into a .wav file so that when one views the spectrogram with certain parameters, they will see the hidden image. My question is: in C++, how can I use the data in a wav file to display a spectrogram without using any signal processing libraries?
An explanation of the math (especially the Hanning window) would also be of great help, as I am fairly new to signal processing. Also, since this is a very broad question, detailed steps are preferable to actual code.
Example (image): output spectrogram above; input audio waveform (.wav file) below.

Some of the steps (write C code for each; a rough sketch of the core windowing/FFT steps follows this list):
Convert the data into a numeric sample array.
Chop the sample array into chunks of some size, (usually) overlapping.
(usually) Window with some window function.
FFT each chunk.
Take the Magnitude.
(usually) Take the Log.
Assemble all the 1D FFT result vectors into a 2D matrix.
Scale.
Color the matrix.
Render the 2D bitmap.
(optional) (optimize by rolling some of the above into a loop.)
Add plot decorations (scale, grid marks, etc.)
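
As a rough illustration of the windowing / FFT / magnitude / log steps, here is a minimal C++ sketch, assuming the wav data has already been decoded into a vector of double samples and chopped into chunks; the function names are illustrative, and a naive O(N^2) DFT is used for clarity since the question asks to avoid signal-processing libraries (swap in a real FFT for speed). The Hann(ing) window used here is w[n] = 0.5 * (1 - cos(2*pi*n / (N-1))); it tapers each chunk to zero at its edges, which reduces the spectral leakage you would otherwise get from cutting the signal abruptly.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

const double kPi = 3.14159265358979323846;

// Hann (Hanning) window: w[n] = 0.5 * (1 - cos(2*pi*n / (N-1))).
std::vector<double> hannWindow(std::size_t N) {
    std::vector<double> w(N);
    for (std::size_t n = 0; n < N; ++n)
        w[n] = 0.5 * (1.0 - std::cos(2.0 * kPi * n / (N - 1)));
    return w;
}

// Naive DFT of one chunk: O(N^2), kept simple on purpose.
std::vector<std::complex<double>> dft(const std::vector<double>& x) {
    const std::size_t N = x.size();
    std::vector<std::complex<double>> X(N);
    for (std::size_t k = 0; k < N; ++k) {
        std::complex<double> sum(0.0, 0.0);
        for (std::size_t n = 0; n < N; ++n) {
            const double angle = -2.0 * kPi * double(k) * double(n) / double(N);
            sum += x[n] * std::complex<double>(std::cos(angle), std::sin(angle));
        }
        X[k] = sum;
    }
    return X;
}

// One spectrogram column: window the chunk, transform, take log magnitude (dB).
std::vector<double> spectrogramColumn(const std::vector<double>& chunk) {
    const std::vector<double> w = hannWindow(chunk.size());
    std::vector<double> windowed(chunk.size());
    for (std::size_t n = 0; n < chunk.size(); ++n)
        windowed[n] = chunk[n] * w[n];

    const std::vector<std::complex<double>> X = dft(windowed);
    // Only bins 0..N/2 are unique for real-valued input.
    std::vector<double> column(chunk.size() / 2 + 1);
    for (std::size_t k = 0; k < column.size(); ++k)
        column[k] = 20.0 * std::log10(std::abs(X[k]) + 1e-12); // epsilon avoids log(0)
    return column;
}
```

Each returned column becomes one vertical strip of the spectrogram; collect the columns of all (overlapping) chunks into a 2D matrix, scale it, map the values to colors, and render the bitmap.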

Related

How do I iterate over an audio file to not miss a part that fits a class?

I have an audio file that contains a part that matches an audio class I trained, for instance the letter R in a speech.
I would set an arbitrary length, like 20 ms. Then I would split the audio file into 20 ms intervals, send each to predictclass.py, and take the part where the probability for my class is the highest. Yet with this method I could be exactly at the corner of the wanted area, it could be stretched (longer than the original file), etc.
How do I cut an audio file to present the right portions to my classifier?
The standard approach is to use overlap for your windows. Split the time series into fixed-length analysis windows (e.g. window_length=10x20ms), but when computing the next window, move it forward by only a fraction of the window size. This step size is usually called the 'hop length'. Moving by 10% (hop_length=1x20ms), for example, means that each new window has 90% overlap with the previous one.
librosa.util.frame is a convenient function to do this on audio. It can also be done on spectrograms.
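
For a C++ reader, here is a minimal sketch of the same framing idea under assumed names (librosa.util.frame does this for you in Python): split the sample buffer into fixed-length frames that advance by a hop length smaller than the frame length.

```cpp
#include <cstddef>
#include <vector>

// Split samples into overlapping analysis frames.
// frame_length and hop_length are given in samples; a hop_length of
// frame_length / 10 reproduces the 90% overlap example above.
std::vector<std::vector<float>> frameSignal(const std::vector<float>& samples,
                                            std::size_t frame_length,
                                            std::size_t hop_length) {
    std::vector<std::vector<float>> frames;
    if (samples.size() < frame_length || hop_length == 0) return frames;
    for (std::size_t start = 0; start + frame_length <= samples.size(); start += hop_length) {
        frames.emplace_back(samples.begin() + start,
                            samples.begin() + start + frame_length);
    }
    return frames;
}
```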

readAudio -> cropAudio -> STFT == readAudio -> STFT -> cropAudio

Are the following two the same?
Read Audio file, then crop it to a certain frame length and perform a Short Time Fourier Transform on the excerpt
Code: stft(cropAudio(readAudio(wav)))
Read Audio file, then perform the Short Time Fourier Transform on the whole Audio file and then crop the interesting part out
Code: cropAudio(stft(readAudio(wav)))
The first option is much more efficient, since the STFT is only performed on a small part of the file - though I'm wondering if the results are the same.
No, they are not the same. In example 1 you are shortening the time-domain waveform, reducing the duration of the signal. In example 2 the data that you are cropping is in the frequency domain, so you are throwing away frequency information.

Direct3D vector output?

Is there any means to interpret Direct3D output as a series of vectors instead of a raster image? I am hoping I could use such a feature to generate a PDF file containing the rendered Direct3D output. Am I being too optimistic?
Well, there is nothing specifically stopping you from interpreting the input data as vectors. Direct3D is, however, fundamentally a rasteriser. Pixel shaders entirely stop making sense the moment you convert to vector data.
Still, you know what your transforms are and you know what the vertex data is, so you could output it as vector data in whatever format you want...
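
As a sketch of that idea (not a Direct3D API call; the types and names here are illustrative): if you keep your own copy of the vertex data and the combined world-view-projection matrix, you can apply the transform yourself, perform the perspective divide, and write the resulting 2D points out as paths in a vector format such as PDF or SVG.

```cpp
#include <array>
#include <vector>

struct Vec3 { float x, y, z; };
struct Vec2 { float x, y; };

// Row-major 4x4 matrix, applied as row-vector * matrix (the usual Direct3D convention).
using Mat4 = std::array<std::array<float, 4>, 4>;

// Apply the combined world-view-projection transform to a vertex and do the
// perspective divide, yielding normalized device coordinates in [-1, 1].
Vec2 projectVertex(const Vec3& v, const Mat4& wvp) {
    const float x = v.x * wvp[0][0] + v.y * wvp[1][0] + v.z * wvp[2][0] + wvp[3][0];
    const float y = v.x * wvp[0][1] + v.y * wvp[1][1] + v.z * wvp[2][1] + wvp[3][1];
    const float w = v.x * wvp[0][3] + v.y * wvp[1][3] + v.z * wvp[2][3] + wvp[3][3];
    return { x / w, y / w };
}

// Project every vertex; the resulting 2D points could then be emitted as
// line/fill paths in whatever vector format you want (PDF, SVG, ...).
std::vector<Vec2> projectVertices(const std::vector<Vec3>& vertices, const Mat4& wvp) {
    std::vector<Vec2> out;
    out.reserve(vertices.size());
    for (const Vec3& v : vertices) out.push_back(projectVertex(v, wvp));
    return out;
}
```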

What exactly is a "Sample"?

From the OpenAL documentation it looks as if a Sample is one single floating point value, like, let's say, 1.94422.
Is that correct? Or is a sample an array of a lot of values? What are audio programming dudes talking about when they say "Sample"? Is it the smallest possible snippet of an audio file?
I imagine an uncompressed audio file to look like a giant array with millions of floating point values, where every value is a point in a graph that forms the sound wave. So every little point is a sample?
Exactly. A sample is a value.
When you convert an analog signal to its digital representation, you convert a continuous function into a discrete and quantized one.
It means that you have a grid of vertical and horizontal lines, and all the possible values lie on the intersections of the lines. The gap between vertical lines represents the distance between two consecutive samples, and the gap between horizontal ones is the minimum difference you can represent.
On every vertical line you have a sample, which (in linear encoding) is equal to n times k, where k is the quantum, the minimum difference referenced above.
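
As a small illustration of that "n times k" idea, here is a hypothetical C++ sketch assuming 16-bit linear PCM, where the quantum k is 1/32768 of full scale; the function names are illustrative.

```cpp
#include <cmath>
#include <cstdint>

// Linear 16-bit quantization: full scale [-1.0, 1.0) is divided into 65536 steps,
// so the quantum k is 1/32768. Each stored sample is an integer n, and the value
// it represents is n * k.
std::int16_t quantize(double amplitude) {
    const double k = 1.0 / 32768.0;        // quantum (step between horizontal grid lines)
    double n = std::round(amplitude / k);  // nearest grid line
    if (n > 32767.0)  n = 32767.0;         // clamp to the representable range
    if (n < -32768.0) n = -32768.0;
    return static_cast<std::int16_t>(n);
}

double dequantize(std::int16_t sample) {
    return sample * (1.0 / 32768.0);       // reconstruct n * k
}
```
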
I imagine an uncompressed audio file to look like a giant array with millions of floating point values, where every value is a point in a graph that forms the sound wave. So every little point is a sample?
Yes, that is right. A sample is the value calculated by your A/D converter for that particular point in time. There's a sample for each channel (e.g. left and right in stereo mode). Both samples together form a frame.
According to the Wikipedia article on signal processing:
A sample refers to a value or set of values at a point in time and/or space.
So yes, it could just be a single floating point value. Although, as Johannes pointed out, if there are multiple channels of audio (e.g. right/left), you would expect one value for each channel.
In audio programming, the term "sample" does indeed refer to a single measurement value. Among audio engineers and producers, however, the term "sample" normally refers to an entire snippet of sound taken (or sampled) from a famous song or movie or some other original audio source.
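
To make the sample/frame/channel distinction concrete, here is a small sketch assuming interleaved 16-bit stereo PCM (the struct and function names are illustrative): one frame holds one sample per channel taken at the same instant.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One frame of interleaved stereo PCM: one sample per channel at the same instant.
struct StereoFrame {
    std::int16_t left;
    std::int16_t right;
};

// Group an interleaved stereo buffer (L, R, L, R, ...) into frames.
std::vector<StereoFrame> toFrames(const std::vector<std::int16_t>& interleaved) {
    std::vector<StereoFrame> frames;
    frames.reserve(interleaved.size() / 2);
    for (std::size_t i = 0; i + 1 < interleaved.size(); i += 2)
        frames.push_back({ interleaved[i], interleaved[i + 1] });
    return frames;
}
```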

Where can I learn how to work with audio data formats?

I'm working on an OpenGL project that involves a speaking cartoon face. My hope is to play the speech (encoded as MP3s) and animate its mouth using the audio data. I've never really worked with audio before, so I'm not sure where to start, but some googling led me to believe my first step would be converting the MP3 to PCM.
I don't really anticipate the need for any Fourier transforms, though that could be nice. The mouth really just needs to move around when there's audio (I was thinking of basing it on volume).
Any tips on how to implement something like this, or pointers to resources, would be much appreciated. Thanks!
-S
Whatever you do, you're going to need to decode the MP3s into PCM data first. There are a number of third-party libraries that can do this for you. Then, you'll need to analyze the PCM data and do some signal processing on it.
Automatically generating realistic lipsync data from audio is a very hard problem, and you're wise to not try to tackle it. I like your idea of simply basing it on the volume. One way you could compute the current volume is to use a rolling window of some size (e.g. 1/16 second), and compute the average power in the sound wave over that window. That is, at frame T, you compute the average power over frames [T-N, T], where N is the number of frames in your window.
Thanks to Parseval's theorem, we can easily compute the power in a wave without having to take the Fourier transform or anything complicated -- the average power is just the sum of the squares of the PCM values in the window, divided by the number of frames in the window. Then, you can convert the power into a decibel rating by dividing it by some base power (which can be 1 for simplicity), taking the logarithm, and multiplying by 10.
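
Here is a minimal C++ sketch of that volume estimate, assuming the PCM samples have already been decoded and normalized to the range [-1, 1]; the function names are illustrative. The average power over the window is the mean of the squared sample values, and the decibel conversion is 10 * log10(power / reference).

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Average power of the last `window` samples ending at index t (assumes t < pcm.size()):
// the mean of the squared sample values over that rolling window.
double averagePower(const std::vector<double>& pcm, std::size_t t, std::size_t window) {
    const std::size_t start = (t + 1 >= window) ? t + 1 - window : 0;
    double sum = 0.0;
    for (std::size_t i = start; i <= t; ++i)
        sum += pcm[i] * pcm[i];
    return sum / static_cast<double>(t - start + 1);
}

// Convert power to decibels relative to a reference power (1.0 for simplicity).
double powerToDecibels(double power, double reference = 1.0) {
    return 10.0 * std::log10(power / reference + 1e-12); // epsilon avoids log(0) in silence
}
```

The dB value can then be mapped (and smoothed over a few frames) to how far the mouth opens.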
