What exactly is a "Sample"? - audio

From the OpenAL documentation it looks like a sample is one single floating point value, like let's say 1.94422.
Is that correct? Or is a sample an array of many values? What are audio programmers talking about when they say "sample"? Is it the smallest possible snippet of an audio file?
I imagine an uncompressed audio file to look like a giant array with millions of floating point values, where every value is a point in a graph that forms the sound wave. So every little point is a sample?

Exactly. A sample is a value.
When you convert an analog signal to its digital representation, you convert a continuous function to a discrete and quantized one.
It means that you have a grid of vertical and horizontal lines and all the possible values lie on the intersections of the lines. The gap between vertical lines represents the distance between two consecutive samples; the gap between horizontal lines is the minimum difference you can represent.
On every vertical line you have a sample, which (in linear encoding) is equal to n times k, where k is the quantum, the minimum difference referenced above.
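To make that concrete, here is a minimal sketch of linear quantization (the bit depth, sample rate and test signal are made up for illustration):

import numpy as np

bit_depth = 16                      # e.g. CD-quality linear PCM
k = 2.0 / (2 ** bit_depth)          # quantum: smallest step representable in [-1, 1]

t = np.arange(0, 0.01, 1 / 44100)   # sample instants: the vertical grid lines
analog = np.sin(2 * np.pi * 440 * t)

n = np.round(analog / k)            # integer step count n for each sample
samples = n * k                     # each stored sample is n times the quantum k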

I imagine an uncompressed audio file to look like a giant array with millions of floating point values, where every value is a point in a graph that forms the sound wave. So every little point is a sample?
Yes, that is right. A sample is the value calculated by your A/D converter for that particular point in time. There's a sample for each channel (e.g. left and right in stereo mode); both samples together form a frame.

According to the Wikipedia article on signal processing:
A sample refers to a value or set of values at a point in time and/or space.
So yes, it could just be a single floating point value. Although, as Johannes pointed out, if there are multiple channels of audio (e.g. right/left), you would expect one value for each channel.

In audio programming, the term "sample" does indeed refer to a single measurement value. Among audio engineers and producers, however, the term "sample" normally refers to an entire snippet of sound taken (or sampled) from a famous song or movie or some other original audio source.

Related

Efficient generation of sampled waveforms without aliasing artifacts

For a project of mine I am working with sampled sound generation and I need to create various waveforms at various frequencies. When the waveform is sinusoidal, everything is fine, but when the waveform is rectangular, there is trouble: it sounds as if it came from the eighties, and as the frequency increases, the notes sound wrong. On the 8th octave, each note sounds like a random note from some lower octave.
The undesirable effect is the same regardless of which of the following two approaches I use:
The purely mathematical way of generating a rectangular waveform as sample = sign( secondsPerHalfWave - (timeSeconds % secondsPerWave) ), where secondsPerWave = 1.0 / wavesPerSecond and secondsPerHalfWave = secondsPerWave / 2.0 (a sketch of this approach follows the list below).
My preferred way, which is to describe one period of the wave using line segments and to interpolate along these lines. So, a rectangular waveform is described (regardless of sampling rate and regardless of frequency) by a horizontal line from x=0 to x=0.5 at y=1.0, followed by another horizontal line from x=0.5 to x=1.0 at y=-1.0.
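For reference, a minimal sketch of the first (naive, aliasing-prone) approach, using the formula from the question; the sample rate and note frequency are assumed for the example:

import numpy as np

sampleRate = 44100.0
wavesPerSecond = 3520.0                          # A7, high enough for audible aliasing
secondsPerWave = 1.0 / wavesPerSecond
secondsPerHalfWave = secondsPerWave / 2.0

timeSeconds = np.arange(int(sampleRate)) / sampleRate    # one second of sample instants
# Naive square wave: +1 during the first half of each period, -1 during the second.
samples = np.sign(secondsPerHalfWave - (timeSeconds % secondsPerWave))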
From what I gather, the literature considers these waveform generation approaches "naive", resulting in "aliasing", which is the cause of all the undesirable effects.
What this all practically translates to when I look at the generated waveform is that the samples-per-second value is not an exact multiple of the waves-per-second value, so each wave does not have an even number of samples, which in turn means that the number of samples at level 1.0 is often not equal to the number of samples at level -1.0.
I found a certain solution here: https://www.nayuki.io/page/band-limited-square-waves which even includes source code in Java, and it does indeed sound awesome: all undesirable effects are gone, and each note sounds pure and at the right frequency. However, this solution is entirely unsuitable for me, because it is extremely computationally expensive. (Even after I have replaced sin() and cos() with approximations that are ten times faster than Java's built-in functions.) Besides, when I look at the resulting waveforms they look awfully complex, so I wonder whether they can legitimately be called rectangular.
So, my question is:
What is the most computationally efficient method for the generation of periodic waveforms such as the rectangular waveform that does not suffer from aliasing artifacts?
Examples of what the solution could entail:
The computer audio problem of generating correct sample values at discrete time intervals to describe a sound wave seems to me somewhat related to the computer graphics problem of generating correct integer y coordinates at discrete integer x coordinates for drawing lines. The Bresenham line generation algorithm is extremely efficient, (even if we disregard for a moment the fact that it is working with integer math,) and it works by accumulating a certain error term which, at the right time, results in a bump in the Y coordinate. Could some similar mechanism perhaps be used for calculating sample values?
Sampling is usually understood as reading the value of the analog signal at a specific, infinitely narrow point in time. Perhaps a better approach would be to consider reading the area of the entire slice of the analog signal between the last sample and the current sample. This way, sampling a 1.0 right before the edge of the rectangular waveform would contribute a little to the sample value, while sampling a -1.0 considerable time after the edge would contribute a lot, thus naturally yielding a point which is between the two extreme values. Would this solve the problem? Does such an algorithm exist? Has anyone ever tried it?
Please note that I have posted this question here as opposed to dsp.stackexchange.com because I do not want to receive answers with preposterous jargon like band-limiting, harmonics and low-pass filters, lagrange interpolations, DC compensations, etc. and I do not want answers that come from the purely analog world or the purely theoretical outer space and have no chance of ever receiving a practical and efficient implementation using a digital computer.
I am a programmer, not a sound engineer, and in my little programmer's world, things are simple: I have an array of samples which must all be between -1.0 and 1.0, and will be played at a certain rate (44100 samples per second.) I have arithmetic operations and trigonometric functions at my disposal, I can describe lines and use simple linear interpolation, and I need to generate the samples extremely efficiently because the generation of a dozen waveforms simultaneously and also the mixing of them together may not consume more than 1% of the total CPU time.
I'm not sure, but you may have a few misconceptions about the nature of aliasing. I base this on your putting the term in quotes, and on the following quote:
What this all practically translates to when I look at the generated waveform is that the samples-per-second value is not an exact multiple of the waves-per-second value, so each wave does not have an even number of samples, which in turn means that the number of samples at level 1.0 is often not equal to the number of samples at level -1.0.
The samples/sec and waves/sec don't have to be exact multiples at all! One can play back all pitches below the Nyquist. So I'm not clear what your thinking on this is.
The characteristic sound of a square wave arises from the presence of odd harmonics, e.g., with a note of 440 Hz (A4), the square wave sound could be generated by combining sines of 440, 1320, 2200, 3080, 3960, etc., progressing in increments of 880. This raises the question: how many odd harmonics? We could go to infinity, theoretically, for the sharpest possible corner on our square wave. If you simply "draw" this in the audio stream, the progression will continue well beyond the Nyquist frequency.
But there is a problem in that harmonics that are higher than the Nyquist value cannot be accurately reproduced digitally. Attempts to do so result in aliasing. So, to get as good a sounding square wave as the system is able to produce, one has to avoid the higher harmonics that are present in the theoretically perfect square wave.
I think the most common solution is to use a low-pass filtering algorithm. The computations are definitely more CPU-intensive than just calculating sine waves (or doing FM synthesis, which was my main interest). I am also weak on the math for DSP and concerned about CPU expense, and so avoided this approach for a long time. But it is quite viable and worth an additional look, imho.
Another approach is to use additive synthesis, and include as many sine harmonics as you need to get the tonal quality you want. The problem then is that the more harmonics you add, the more computation you are doing. Also, the top harmonics must be kept track of, as they limit the highest note you can play. For example, if using 10 harmonics, the note 500 Hz would include content at 10500 Hz. That's below the Nyquist value for 44100 fps (which is 22050 Hz). But you'll only be able to go up about another octave (which doubles everything) with a 10-harmonic wave, and a little more, before your harmonic content goes over the limit and starts aliasing.
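As an illustration of the additive approach, here is a minimal sketch that sums only the odd sine harmonics that stay below the Nyquist frequency (the sample rate and note frequency are chosen arbitrarily):

import numpy as np

sample_rate = 44100.0
nyquist = sample_rate / 2.0
freq = 440.0                                     # fundamental of the note
t = np.arange(int(sample_rate)) / sample_rate    # one second of sample instants

square = np.zeros_like(t)
k = 1
while k * freq < nyquist:                        # keep only harmonics below the Nyquist
    square += np.sin(2 * np.pi * k * freq * t) / k
    k += 2                                       # odd harmonics only: 1, 3, 5, ...

square *= 4.0 / np.pi                            # Fourier-series scaling for a unit square wave
# Note: the result slightly overshoots +/-1 near the edges (Gibbs phenomenon).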
Instead of computing multiple sines on the fly, another solution you might consider is to instead create a set of lookup tables (LUTs) for your square wave. To create the values in the table, iterate through and add the values from the sine harmonics that will safely remain under the Nyquist for the range in which you use the given table. I think a table of something like 1024 values to encode a single period could be a good first guess as to what would work.
For example, I am guesstimating, but the table for the octave C4-C5 might use 10 harmonics, the table for C5-C6 only 5, and the table for C3-C4 might have 20. I can't recall what this strategy/technique is called, but I do recall it has a name; it is an accepted way of dealing with the situation. Depending on how the transitions sound and the amount of high-end content you want, you can use fewer or more LUTs.
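A minimal sketch of building such band-limited lookup tables (the table size and the per-octave harmonic counts are the guesses from above, not prescriptions):

import numpy as np

TABLE_SIZE = 1024

def make_square_table(num_odd_harmonics):
    # One period of a square wave, band-limited to the given number of odd harmonics.
    phase = np.arange(TABLE_SIZE) / TABLE_SIZE            # 0 .. 1, one period
    table = np.zeros(TABLE_SIZE)
    for m in range(num_odd_harmonics):
        k = 2 * m + 1                                     # 1, 3, 5, ...
        table += np.sin(2 * np.pi * k * phase) / k
    return table / np.max(np.abs(table))                  # normalize to [-1, 1]

# One table per octave, with fewer harmonics for higher octaves (guesstimated counts).
tables = {"C3-C4": make_square_table(20),
          "C4-C5": make_square_table(10),
          "C5-C6": make_square_table(5)}

At playback you would step through the chosen table by freq * TABLE_SIZE / sample_rate positions per output sample (with interpolation), picking the table whose highest partial stays below the Nyquist for the note being played.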
There may be other methods to consider. The Wikipedia entry on aliasing describes a technique it refers to as "bandpass" that seems to be intentionally using aliasing. I don't know what that is about or how it relates to the article you cite.
The Soundpipe library has the concept of a frequency table, which is a data structure that holds a precomputed waveform such as a sine. You can initialize the frequency table with the desired waveform and play it through an oscillator. There is even a module named oscmorph which allows you to morph between two or more wavetables.
This is an example of how to generate a sine wave, taken from Soundpipe's documentation.
int main() {
    /* UserData and the write_osc callback come from the full documentation
       example; UserData holds the oscillator and the function table, and
       write_osc computes the oscillator output inside the render loop. */
    UserData ud;
    sp_data *sp;
    sp_create(&sp);                      /* allocate the Soundpipe context */

    sp_ftbl_create(sp, &ud.ft, 2048);    /* 2048-point function table */
    sp_osc_create(&ud.osc);
    sp_gen_sine(sp, ud.ft);              /* fill the table with one period of a sine */
    sp_osc_init(sp, ud.osc, ud.ft);      /* the oscillator reads from the table */

    ud.osc->freq = 500;                  /* oscillator frequency in Hz */
    sp->len = 44100 * 5;                 /* render 5 seconds at 44.1 kHz */
    sp_process(sp, &ud, write_osc);      /* run the render loop */

    sp_ftbl_destroy(&ud.ft);
    sp_osc_destroy(&ud.osc);
    sp_destroy(&sp);
    return 0;
}

How to find what time a part of audio starts and ends in another audio?

I have two audio files in which a sentence is read (like singing a song) by two different people, so they have different lengths. They are just vocals, with no instruments in them.
A1: Audio File 1
A2: Audio File 2
Sample sentence : "Lorem ipsum dolor sit amet, ..."
I know the time every word starts and ends in A1, and I need to find automatically at what time every word starts and ends in A2. (Any language, preferably Python or C#.)
Times are saved in XML, so I can split the A1 file by word. So, how can I find the sound of a word in another audio file that has a different duration (of the word) and a different voice?
So from what I read, it seems you would want to use Dynamic Time Warping (DTW). Of course, I'll leave the explanation to Wikipedia, but it is generally used to recognize speech patterns without being thrown off by differences in pronunciation.
Sadly, I am better versed in C, Java and Python, so I will be suggesting Python libraries.
fastdtw
pydtw
mlpy
rpy2
With rpy2 you can actually use R's DTW library and its implementation of DTW in your Python code. Sadly, I couldn't find any good tutorials for this, but there are good examples if you choose to use R.
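For instance, a minimal sketch with fastdtw; the feature sequences here are tiny made-up stand-ins for per-frame features (e.g. energy or MFCC vectors) extracted from A1 and A2:

import numpy as np
from scipy.spatial.distance import euclidean
from fastdtw import fastdtw

# Two feature sequences of different lengths, standing in for A1 and A2.
a1_features = np.array([[0.1], [0.9], [0.8], [0.2], [0.1]])
a2_features = np.array([[0.1], [0.2], [0.9], [0.9], [0.7], [0.2]])

distance, path = fastdtw(a1_features, a2_features, dist=euclidean)

# `path` is a list of (i, j) index pairs aligning A1 frames to A2 frames, so a
# word boundary known at frame i in A1 can be projected onto frame j in A2.
a1_to_a2 = {}
for i, j in path:
    a1_to_a2.setdefault(i, j)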
Please let me know if that doesn't help, Cheers!
My approach for this would be to record the dB volume at a constant interval (such as every 100 milliseconds) and store this volume in a list or array. I found a way of doing this in Java here: Decibel values at specific points in wav file. It is possible in other languages. Meanwhile, take note of the max volume:
max = 0
currentVolume = f(x)
if currentVolume > max
{
    max = currentVolume
}
Then divide the maximum volume by an editable threshold; in my example I went for 7. Say the maximum volume is 21: 21/7 = 3 dB; let's call this measure X.
We take a second threshold, such as 1, and multiply it by X. Whenever the volume is greater than this new value (1*X), we consider that to be the start of a word. When it is less than the given value, we consider it to be the end of a word.
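A minimal sketch of that thresholding step, assuming `volumes` is the list of dB readings taken at the constant interval (the numbers and the divisor of 7 come from the example above):

volumes = [2, 3, 18, 21, 19, 4, 2, 15, 17, 3]    # dB readings, e.g. one per 100 ms

DIVISOR = 7                                      # the editable threshold from above
SECOND_THRESHOLD = 1

x = max(volumes) / DIVISOR                       # 21 / 7 = 3 dB in the example
cutoff = SECOND_THRESHOLD * x

words = []
start = None
for i, v in enumerate(volumes):
    if v > cutoff and start is None:             # rising above the cutoff: word starts
        start = i
    elif v <= cutoff and start is not None:      # falling below the cutoff: word ends
        words.append((start, i))
        start = None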
Without knowing how sophisticated your understanding of the problem space is, it isn't easy to know whether to point you in a direction or provide detail on why this problem is non-trivial.
I'd suggest that you start with something like https://cloud.google.com/speech/ and try to convert the speech blocks to text and then perform a similarity comparison on these.
If you really want to try to do the processing yourself, you could look at doing some spectrographic analysis. Take the waveform data, perform an FFT to get frequency distributions, and look for marker patterns that align your samples.
With only single-word comparisons between different speakers, you are probably not going to be able to apply any kind of neural network unless you are able to train it on the two speakers' entire speech sets and use the network to then try to compare the individual word chunks.
It's been a few years since I did any of this so maybe it's easier these days but my recollection is that although this sounds conceptually simple it might prove to be more difficult than you realise.
The Dynamic Time Warping looks like the most promising suggestion.
The secret sauce of what follows: pointA - pointB is zero if both points have the same value, that is, numerically do pointA minus pointB. The approach below leverages this to identify which file index offset gives us this zero value when comparing the raw audio curves from a pair of input files, or a value close to zero in a relative sense if the two source audios differ even slightly.
The approach is to open both files and pluck out the raw audio curve of each. Define two variables, bestSum and currentSum, and set both to MAX_INT_VALUE (any arbitrarily high value). Iterate across both files simultaneously, obtaining the integer value of the current raw audio curve level of file A and doing the same for file B; for each pair of integers, subtract the integer from file B from the integer from file A, and inside the loop add the value of that subtraction to currentSum. Continue this loop until you have reached the end of one file. When the loop finishes, update bestSum to become currentSum if currentSum < bestSum, and also store the current file index offset.
Create an outer loop which repeats all of the above, introducing a time offset into one file and then relaunching the inner loop. Your common audio is found when you are using the offset which has the minimum total sum value, that is, the offset at which you encountered bestSum.
Do not start coding until you have gained the intuition that the above makes perfect sense.
I highly encourage you to plot the curve of the raw audio for one file to confirm you are accessing this sequence of integers; do this before attempting the above algorithm.
It will help to visualize the above by viewing each input source audio as a curve: you simply keep one curve steady as you slide the other audio curve left or right until you see the curve shapes match or get very close to matching.
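A minimal sketch of that sliding comparison, assuming the raw curves have already been read into integer arrays, using the absolute difference so that positive and negative differences do not cancel, and normalizing per sample so shorter overlaps are not unfairly favoured:

import numpy as np

def best_offset(curve_a, curve_b, max_offset):
    # Slide curve_b across curve_a; return the offset with the smallest average difference.
    best_sum, best_off = float("inf"), 0
    for offset in range(max_offset):
        length = min(len(curve_a) - offset, len(curve_b))
        if length <= 0:
            break
        diff = curve_a[offset:offset + length].astype(np.int64) - curve_b[:length]
        current_sum = np.abs(diff).sum() / length
        if current_sum < best_sum:
            best_sum, best_off = current_sum, offset
    return best_off, best_sum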

Changing frequency amplitude with RealFFT, flickering sound

I have been trying to modify the amplitude of specific frequencies. Here is what I have done:
I get the data as a 2048-element float array with a value range of [-1, 1]. It's raw data.
I use this RealFFT algorithm http://www.lomont.org/Software/Misc/FFT/LomontFFT.html
I divide the raw data into left and right channels (this works great).
I perform RealFFT (forward enabled) on both left and right, and I use this equation to find which index corresponds to the frequency that I want: freq / (samplerate / sizeOfBuffer / 2.0)
I modify the frequency that i want.
I perform RealFFT (forward disabled) to go back to the time domain.
Now when I play it back, I hear the change that I made to the frequency, but there is a flickering noise (kind of like the flickering when you play an old vinyl record).
Any idea what I might be doing wrong?
It was a while ago that I took my signal processing course at university, so I might have forgotten something.
Thanks in advance!
The comments may be confusing. Here are some clarifications.
The imaginary part is not the phase. The real and imaginary parts form a vector: think of a 2-D plot where the real part is on the x axis and the imaginary part on the y axis. The amplitude of a frequency is the length of the line from the origin to that point. So the phase is the arctangent of the imaginary part divided by the real part, and the magnitude is the square root of the sum of the squares of the real and imaginary parts.
So, for the first step: if you want to change the magnitude of the vector, you must scale both the real and imaginary parts.
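A minimal sketch of that scaling step, using NumPy's real FFT for illustration rather than the Lomont RealFFT from the question; the buffer and target frequency are made up:

import numpy as np

sample_rate = 44100
buffer = np.random.uniform(-1.0, 1.0, 2048)                 # stand-in for one channel of raw data

spectrum = np.fft.rfft(buffer)
bin_index = int(round(1000 * len(buffer) / sample_rate))    # bin nearest 1000 Hz

magnitude = np.abs(spectrum[bin_index])                     # sqrt(re^2 + im^2)
phase = np.angle(spectrum[bin_index])                       # atan2(im, re)

# Scaling the complex value scales re and im together, so the phase is preserved.
spectrum[bin_index] *= 0.5

modified = np.fft.irfft(spectrum, n=len(buffer))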
That's easy. The second part is much more complicated. The Fourier transform's "view" of the world is that it is infinitely periodic - that is, it looks like the signal wraps from the end back to the beginning. Suppose you put a perfect sine tone into your algorithm, and the period of the sine tone is 4096 samples: the first sample into the FFT is +1 and the last sample into the FFT is -1. If you look at the spectrum from the FFT, it will appear as if there are lots of high frequencies, which are the harmonics of transforming a signal that has a jump from -1 to 1. The longer the FFT, the closer the FFT comes to showing you the "real" view of the signal.
Techniques to smooth out the transitions between FFT blocks have been developed, by windowing and overlapping the FFT blocks, so that the transitions between the blocks are not so "discontinuous". A fairly common technique is to use a Hann window and overlap by a factor of 4. That is, for every 2048 samples, you actually do 4 FFTs, and every FFT overlaps the previous block by 1536. The Hann window gets mathy, but basically it has nice properties so that you can do overlaps like this and everything sums up nicely.
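A minimal sketch of that windowing and overlap (block size 2048, hop 512, i.e. a factor-of-4 overlap as described; the spectral modification itself is left as a placeholder):

import numpy as np

N, HOP = 2048, 512                               # each block overlaps the previous by 1536
window = np.hanning(N)

signal = np.random.uniform(-1.0, 1.0, 44100)     # stand-in input
output = np.zeros(len(signal) + N)

for start in range(0, len(signal) - N, HOP):
    block = signal[start:start + N] * window     # analysis window
    spectrum = np.fft.rfft(block)
    # ... modify `spectrum` here (e.g. scale selected bins) ...
    block = np.fft.irfft(spectrum, n=N)
    output[start:start + N] += block * window    # synthesis window, then overlap-add

output /= 1.5                                    # Hann^2 at 75% overlap sums to ~1.5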
I found this pretty fun blog showing exactly the same learning pains that you're going through: http://www.katjaas.nl/FFTwindow/FFTwindow&filtering.html
This technique is different from what another commenter mentions, Overlap-Save. That is a method developed to use FFTs to do FIR filtering. However, designing the FIR filter will typically be done in a mathematical package like Matlab/Octave.
If you use a series of shorter FFTs to modify a longer signal, then you should zero-pad each window so that it uses a longer FFT (longer by the length of the impulse response of the modification's spectrum), and combine the series of longer FFTs by overlap-add or overlap-save. Otherwise, waveform changes that should ripple past the end of each FFT/IFFT modification will, due to circular convolution, ripple around to the beginning of each window and cause the periodic flickering distortion you hear.
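A minimal sketch of that zero-padded overlap-add idea, filtering each block with a short impulse response h (made up here) so that the convolution tail is added into the next block instead of wrapping around:

import numpy as np

signal = np.random.uniform(-1.0, 1.0, 44100)     # stand-in input
h = np.array([0.25, 0.5, 0.25])                  # made-up 3-tap impulse response
BLOCK = 2048
NFFT = BLOCK + len(h) - 1                        # longer FFT: room for the convolution tail

H = np.fft.rfft(h, n=NFFT)
output = np.zeros(len(signal) + len(h) - 1)

for start in range(0, len(signal), BLOCK):
    block = signal[start:start + BLOCK]
    X = np.fft.rfft(block, n=NFFT)               # zero-padded FFT of the block
    y = np.fft.irfft(X * H, n=NFFT)              # linear, not circular, convolution
    end = min(start + NFFT, len(output))
    output[start:end] += y[:end - start]         # overlap-add the tails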

Search different audio files for equal short samples

Consider multiple (at least two) different audio files, like several different mixes or remixes. Naively I would say it must be possible to detect samples, especially the vocals, that are almost equal in two or more of the files, but of course only if the vocal samples aren't modified, stretched, pitched, or reverbed too much, etc.
So with what kind of algorithm or technique could this be done? Let's say the user would try to set time markers in all files as best as possible, describing the data windows to compare, which contain the presumably equal sounds, vocals, etc.
I know that no direct approach of comparing the wav data in any way is useful. But even if I have the frequency-domain data (e.g. from an FFT), I would have to use a comparison algorithm that shifts the comparison windows through the time scale, since I cannot assume the samples I want to find are time-synced across all files.
Thanks in advance for any suggestions.
Hi, this is possible!
You can use a technique called LSH (locality-sensitive hashing), which is very robust.
Another way to do this is to try spectrogram analysis on your audio files (a rough sketch follows the two lists below).
Construct the database song:
1. Record your full song.
2. Transform the sound into a spectrogram.
3. Slice your spectrogram into chunks and get the three or four strongest frequencies in each.
4. Store all the points.
Match the song:
1. Record one short sample.
2. Transform the sound into another spectrogram.
3. Slice your spectrogram into chunks and get the three or four strongest frequencies in each.
4. Compare the collected frequencies with your database song.
5. Your match is the song with the highest number of hits!
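A rough, minimal sketch of the peak-picking and matching part of that idea (the frame size, peak count and brute-force matching are illustrative choices, not a full audio-fingerprinting implementation):

import numpy as np

def peak_fingerprint(samples, frame_size=4096, peaks_per_frame=4):
    # For each frame, keep the indices of the strongest frequency bins.
    prints = []
    for i in range(0, len(samples) - frame_size + 1, frame_size):
        spectrum = np.abs(np.fft.rfft(samples[i:i + frame_size]))
        prints.append(frozenset(np.argsort(spectrum)[-peaks_per_frame:].tolist()))
    return prints

def count_hits(song_prints, sample_prints):
    # Slide the short sample's frames along the song and count matching peaks.
    best = 0
    for offset in range(len(song_prints) - len(sample_prints) + 1):
        hits = sum(len(a & b) for a, b in zip(song_prints[offset:], sample_prints))
        best = max(best, hits)
    return best

The song in your database with the largest count_hits value for the recorded snippet is the match.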
You can see how to do it here:
http://translate.google.com/translate?hl=EN&sl=pt&u=http://ederwander.wordpress.com/2011/05/09/audio-fingerprint-em-python/
ederwander

Determining Note Durations based on Onset Locations

I have a question regarding how to determine the Duration of notes given their Onset Locations.
So for example, I have an array of amplitude values (containing shorts) and another array of the same size that contains a 1 if a note onset is detected and a 0 if not. So basically, the distance between each pair of 1s will be used to determine the duration.
How can I do this? I know that I have to use the sample rate and other attributes of the audio data, but is there a particular formula that I can use?
Thank you!
So you are starting with a list of ONSETS; what you are really looking for is a list of OFFSETS.
There are many methods for onset detection (here is a paper on it) https://adamhess.github.io/Onset_Detection_Nov302011.pdf
Many of the same methods can be applied to offset detection (a rough sketch follows the steps below):
Since an onset is marked by an INCREASE in spectral content, you can measure a DECREASE in spectral content instead.
Take a reasonable time window before and after your onset (0.25-0.5 s).
Chop up the window into smaller segments and take 50%-overlapping Fourier transforms.
Compute the difference between the Fourier coefficients of two successive windows and keep only the negative changes in the spectral difference (SD).
Multiply your results by -1.
Pick the peaks off of the results.
Voila, offsets.
(Look at page 7 of the paper listed above for more detail about the spectral difference function; you can apply a modified, as above, version of it.)
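A minimal sketch of that spectral-difference idea (the window sizes and the onset position are made up; audio is the mono sample array):

import numpy as np

sample_rate = 44100
audio = np.random.uniform(-1.0, 1.0, sample_rate)      # stand-in signal
onset = 20000                                          # known onset sample index

# 0.5 s window after the onset, chopped into 50%-overlapping segments.
segment, hop = 1024, 512
region = audio[onset:onset + sample_rate // 2]

spectra = [np.abs(np.fft.rfft(region[i:i + segment]))
           for i in range(0, len(region) - segment, hop)]

# Spectral difference between successive windows, keeping only decreases.
sd = [np.sum(np.minimum(spectra[i + 1] - spectra[i], 0.0)) for i in range(len(spectra) - 1)]
sd = -np.array(sd)                                     # multiply by -1, as described above

offset_frame = int(np.argmax(sd))                      # the peak marks the likely offset
offset_sample = onset + offset_frame * hop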
Well, if your sample rate in Hz is fs, then the time between two onsets is equal to
1/fs * <index of the next onset 1 minus index of the current onset 1>
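A minimal sketch of that calculation (the sample rate and the onset array are made up):

import numpy as np

fs = 44100                                       # sample rate in Hz
onsets = np.zeros(5 * fs, dtype=np.int8)         # onset array: 1 marks a detected onset
onsets[[4410, 26460, 66150, 154350]] = 1

indices = np.flatnonzero(onsets)                 # sample indices of the 1s
durations = np.diff(indices) / fs                # seconds between consecutive onsets
# durations -> [0.5, 0.9, 2.0]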
Very simple :-)
Regards
