I have two wave files for which I have the digital samples extracted. I need to play both at the same time. How do I combine the two samples to produce the output sample that is both sounds playing together. How is this done for N simultaneous samples? Is it as simple as adding the samples and taking the average?
Combining sounds (at the same sample rate) just involves an element-wise addition of the two arrays. You do not need to divide by N unless you have an issue with headroom. If the value of the sum exceeds the maximum output level, this will result in clipping, giving an audible distortion.
Unless you have a large N, or a small N where each of your source sounds are normalised to the maximum output level, you should not have a problem with clipping. If you know the waveforms of the signals in advance, you can simply scale each waveform by the same scalar value beforehand so that the output does not clip. Alternatively, if you are rendering the sound offline, you can just sum your waveforms and then normalise the composite signal so it does not clip.
If you are dealing with a live input stream of N sources, you can minimise clipping using a limiter.
http://en.wikipedia.org/wiki/Dynamic_range_compression#Limiting
Yes, you can simply sum the two, and divide by two.
Indeed, that's the average.
When both samples have the same sample-rate it's really as straightforward as that.
Combine digital audio by adding the individual samples together.
There will be a loudness increase when combining several uncorrelated sound sources, but the relationship between loudness and N number of sources is not linear. Four simultaneous sounds will be approximately twice as loud as one, not four times as loud. (That's a 6dB increase.)
As you suspected you do need to keep in mind the final output volume when playing back multiple sounds simultaneously but dividing by N when combining N simultaneous sources is not the right way to do so.
The easiest way is to add a volume control to your application. The user will turn down your application when it's too loud. This is simple and usually the correct approach when combining a small number of sounds.
A manual volume control is not the right solution for all problems. For example a first person shooter. Imagine running from a quiet corridor out into a raging gun battle. The sound environment will go from very quiet with a few sound sources to very loud with lots of sound sources. In these cases you'll likely need some form of automatic gain control.
Related
I am looking for scaling a PNG file according to an audio provided, a frequency range (20hz-1000hz for example) and a threshold, for a smooth effect.
For example, when there is a kick, scale go to 120% smoothly, I would like to make those audio visualizers such as dubstep, etc... where when kicks comes in, their image are "pumping".
First, is it doable with ffmpeg?
Where to start?
I found showcqt that takes frequencies in input etc., but its output is a video so I don't think I can use it in my case. Any help appreciated.
If you are able to read the PCM values as they are being output, then you might consider using a rolling RMS average in order to get a continuous stream of amplitudes. IDK the best length of the array. Perhaps it should correspond to the number of audio frames that would give you an update for each visual frame? The folks at the DSP site would have the best insights.
If you do a rolling average, computations are not terribly expensive. You'd do the square on the incoming and add that to a ring buffer (circular queue) and drop the outgoing. Only those data points need be added to the rolling average when computing the new rolling average, since the denominator is fixed and known. I found a video that describes the basic RMS math here using Matlab.
It might be necessary to add some smoothing to visualizer that is receiving the volume updates. Also, handing off data from the audio thread should likely employ some form of loose coupling. It would not be good if the thread that is processing the audio was also handling graphics.
I'm a little over my head, but I think this is what is generally done for visualizers.
For a project of mine I am working with sampled sound generation and I need to create various waveforms at various frequencies. When the waveform is sinusoidal, everything is fine, but when the waveform is rectangular, there is trouble: it sounds as if it came from the eighties, and as the frequency increases, the notes sound wrong. On the 8th octave, each note sounds like a random note from some lower octave.
The undesirable effect is the same regardless of whether I use either one of the following two approaches:
The purely mathematical way of generating a rectangular waveform as sample = sign( secondsPerHalfWave - (timeSeconds % secondsPerWave) ) where secondsPerWave = 1.0 / wavesPerSecond and secondsPerHalfWave = secondsPerWave / 2.0
My preferred way, which is to describe one period of the wave using line segments and to interpolate along these lines. So, a rectangular waveform is described (regardless of sampling rate and regardless of frequency) by a horizontal line from x=0 to x=0.5 at y=1.0, followed by another horizontal line from x=0.5 to x=1.0 at y=-1.0.
From what I gather, the literature considers these waveform generation approaches "naive", resulting in "aliasing", which is the cause of all the undesirable effects.
What this all practically translates to when I look at the generated waveform is that the samples-per-second value is not an exact multiple of the waves-per-second value, so each wave does not have an even number of samples, which in turn means that the number of samples at level 1.0 is often not equal to the number of samples at level -1.0.
I found a certain solution here: https://www.nayuki.io/page/band-limited-square-waves which even includes source code in Java, and it does indeed sound awesome: all undesirable effects are gone, and each note sounds pure and at the right frequency. However, this solution is entirely unsuitable for me, because it is extremely computationally expensive. (Even after I have replaced sin() and cos() with approximations that are ten times faster than Java's built-in functions.) Besides, when I look at the resulting waveforms they look awfully complex, so I wonder whether they can legitimately be called rectangular.
So, my question is:
What is the most computationally efficient method for the generation of periodic waveforms such as the rectangular waveform that does not suffer from aliasing artifacts?
Examples of what the solution could entail:
The computer audio problem of generating correct sample values at discrete time intervals to describe a sound wave seems to me somewhat related to the computer graphics problem of generating correct integer y coordinates at discrete integer x coordinates for drawing lines. The Bresenham line generation algorithm is extremely efficient, (even if we disregard for a moment the fact that it is working with integer math,) and it works by accumulating a certain error term which, at the right time, results in a bump in the Y coordinate. Could some similar mechanism perhaps be used for calculating sample values?
The way sampling works is understood to be as reading the value of the analog signal at a specific, infinitely narrow point in time. Perhaps a better approach would be to consider reading the area of the entire slice of the analog signal between the last sample and the current sample. This way, sampling a 1.0 right before the edge of the rectangular waveform would contribute a little to the sample value, while sampling a -1.0 considerable time after the edge would contribute a lot, thus naturally yielding a point which is between the two extreme values. Would this solve the problem? Does such an algorithm exist? Has anyone ever tried it?
Please note that I have posted this question here as opposed to dsp.stackexchange.com because I do not want to receive answers with preposterous jargon like band-limiting, harmonics and low-pass filters, lagrange interpolations, DC compensations, etc. and I do not want answers that come from the purely analog world or the purely theoretical outer space and have no chance of ever receiving a practical and efficient implementation using a digital computer.
I am a programmer, not a sound engineer, and in my little programmer's world, things are simple: I have an array of samples which must all be between -1.0 and 1.0, and will be played at a certain rate (44100 samples per second.) I have arithmetic operations and trigonometric functions at my disposal, I can describe lines and use simple linear interpolation, and I need to generate the samples extremely efficiently because the generation of a dozen waveforms simultaneously and also the mixing of them together may not consume more than 1% of the total CPU time.
I'm not sure but you may have a few of misconceptions about the nature of aliasing. I base this on your putting the term in quotes, and from the following quote:
What this all practically translates to when I look at the generated
waveform is that the samples-per-second value is not an exact multiple
of the waves-per-second value, so each wave does not have an even
number of samples, which in turn means that the number of samples at
level 1.0 is often not equal to the number of samples at level -1.0.
The samples/sec and waves/sec don't have to be exact multiples at all! One can play back all pitches below the Nyquist. So I'm not clear what your thinking on this is.
The characteristic sound of a square wave arises from the presence of odd harmonics, e.g., with a note of 440 (A5), the square wave sound could be generated by combining sines of 440, 1320, 2200, 3080, 3960, etc. progressing in increments of 880. This begs the question, how many odd harmonics? We could go to infinity, theoretically, for the sharpest possible corner on our square wave. If you simply "draw" this in the audio stream, the progression will continue well beyond the Nyquist number.
But there is a problem in that harmonics that are higher than the Nyquist value cannot be accurately reproduced digitally. Attempts to do so result in aliasing. So, to get as good a sounding square wave as the system is able to produce, one has to avoid the higher harmonics that are present in the theoretically perfect square wave.
I think the most common solution is to use a low-pass filtering algorithm. The computations are definitely more cpu-intensive than just calculating sine waves (or doing FM synthesis, which was my main interest). I am also weak on the math for DSP and concerned about cpu expense, and so, avoided this approach for long time. But it is quite viable and worth an additional look, imho.
Another approach is to use additive synthesis, and include as many sine harmonics as you need to get the tonal quality you want. The problem then is that the more harmonics you add, the more computation you are doing. Also, the top harmonics must be kept track of as they limit the highest note you can play. For example if using 10 harmonics, the note 500Hz would include content at 10500 Hz. That's below the Nyquist value for 44100 fps (which is 22050 Hz). But you'll only be able to go up about another octave (doubles everything) with a 10-harmonic wave and little more before your harmonic content goes over the limit and starts aliasing.
Instead of computing multiple sines on the fly, another solution you might consider is to instead create a set of lookup tables (LUTs) for your square wave. To create the values in the table, iterate through and add the values from the sine harmonics that will safely remain under the Nyquist for the range in which you use the given table. I think a table of something like 1024 values to encode a single period could be a good first guess as to what would work.
For example, I am guestimating, but the table for the octave C4-C5 might use 10 harmonics, the table for C5-C6 only 5, the table for C3-C4 might have 20. I can't recall what this strategy/technique is called, but I do recall it has a name, it is an accepted way of dealing with the situation. Depending on how the transitions sound and the amount of high-end content you want, you can use fewer or more LUTs.
There may be other methods to consider. The wikipedia entry on Aliasing describes a technique it refers to as "bandpass" that seems to be intentionally using aliasing. I don't know what that is about or how it relates to the article you cite.
The Soundpipe library has the concept of a frequency table, which is a data structure that holds a precomputed waveform such as a sine. You can initialize the frequency table with the desired waveform and play it through an oscilator. There is even a module named oscmorph which allows you to morph between two or more wavetables.
This is an example of how to generate a sine wave, taken from Soundpipe's documentation.
int main() {
UserData ud;
sp_data *sp;
sp_create(&sp);
sp_ftbl_create(sp, &ud.ft, 2048);
sp_osc_create(&ud.osc);
sp_gen_sine(sp, ud.ft);
sp_osc_init(sp, ud.osc, ud.ft);
ud.osc->freq = 500;
sp->len = 44100 * 5;
sp_process(sp, &ud, write_osc);
sp_ftbl_destroy(&ud.ft);
sp_osc_destroy(&ud.osc);
sp_destroy(&sp);
return 0;
}
Consider multiple (at least two) different audio-files, like several different mixes or remixes. Naively I would say, it must be possible to detect samples, especially the vocals, that are almost equal in two or more of the files, of course only then, if the vocal samples aren't modified, stretched, pitched, reverbed too much etc.
So with what kind of algorithm or technique this could be done? Let's say, the user would try to set time markers in all files best possible, which describe the data windows to compare, containing the presumably equal sounds, vocals etc.
I know that no direct approach, trying to directly compare wav data in any way is useful. But even if I have the frequency domain data (e.g. from FFT), I would have to use a comparison algorithm that kind of shifts the comparing-windows through time scale, since I cannot assume the samples, I want to find, are time sync over all files.
Thanks in advance for any suggestions.
Hi this is possible !!
You can use one technique called LSH (locality sensitive hashing), is very robust.
another way to do this is try make spectrogram analysis in your audio files ...
Construct database song
1. Record your Full Song
2. Transform the sound to spectrum
3. slice your Spectrogram in chunk and get three or four high Frequencies
4. Store all the points
Match the song
1. Record one short sample.
2. Transform the sound into another spectrum
3. slice your Spectrogram in chunk and get three or four hight Frequencies
4. Compare the collected frequencies with your database song.
5. your match is the song with have the high hit !
you can see here how make ..
http://translate.google.com/translate?hl=EN&sl=pt&u=http://ederwander.wordpress.com/2011/05/09/audio-fingerprint-em-python/
ederwander
I have a bunch of different audio recordings in WAV format (all different instruments and pitches), and I want to "normalize" them so that they all sound approximately the same volume when played.
I've tried measuring the average sample magnitude (the sum of all absolute values divided by the number of samples), but normalizing by this measurement doesn't work very well. I think this method isn't working because it doesn't take into account the frequency of the sounds, and I know that higher-frequency recordings sound louder than lower-frequency sounds of the same amplitude.
Does anyone know a good method for measuring the loudness of a sound?
Root Mean Square is often used to estimate the loudness of sound files. This is because a sound that is very loud might not be perceived that way if it is very short. Also remember that power increases exponentially with the square of amplitude.
The audio geeks at Hydrogen Audio know a ton about this stuff...check out their free Replay Gain software. You may not need to do any programming at all.
EDIT: Included comment feedback on power vs. amplitude.
To add to PeterAllenWebb's response:
Before you calculate the RMS, you should "center" your sample first (think of a 5-minute .wav where each sample has the maximum +amplitude). The best way to do that is to use a highpass filter at a subsonic frequency.
That would still not take the frequencies that humans are sensitive to in count. To do that, you could use A-weighting. There's a page where you can calculate it online:
http://www.diracdelta.co.uk/science/source/a/w/aweighting/source.html
The code seems to be here:
http://www.diracdelta.co.uk/science/source/a/w/aweighting/multicalc.js
Well not being an expert on audio and adding to the previous comment, you should figure out what you define as the "shortest amount of time for peak power" and then just convert the wave to raw floating point and use RMS over the stretch of time and continuously take chunks of that length of time, find the MAX and there you have your highest peak power.
To reiterate what some other people have said, use RMS value to estimate the "loudness" of a passage of sound.
But, if you're dealing with impulsive sounds like plucking or drum hits, you'd want to do a sliding RMS value and pick out only the peak RMS value. Measure 100 ms of the sound, slide the window, measure again, etc. and then normalize according to the largest value you find.
Definitely remove any DC value before doing the RMS, and A-weighting will make it more like how we hear. Here's code for A-weighting in MATLAB/Octave and Python.
I might be way off here, but, if you have wavepad you can load in multiple files and mess with the volumes a little bit so they are all the same. Also, if you have certain sections of a file that are louder, you can select that section and lower the volume for that one section.
EDIT: And sorry, it;s not really a "method" for measuring volume, but if you just need to make them all the same this should work fine.
I'm working on an openGL project that involves a speaking cartoon face. My hope is to play the speech (encoded as mp3s) and animate its mouth using the audio data. I've never really worked with audio before so I'm not sure where to start, but some googling led me to believe my first step would be converting the mp3 to pcm.
I don't really anticipate the need for any Fourier transforms, though that could be nice. The mouth really just needs to move around when there's audio (I was thinking of basing it on volume).
Any tips on to implement something like this or pointers to resources would be much appreciated. Thanks!
-S
Whatever you do, you're going to need to decode the MP3s into PCM data first. There are a number of third-party libraries that can do this for you. Then, you'll need to analyze the PCM data and do some signal processing on it.
Automatically generating realistic lipsync data from audio is a very hard problem, and you're wise to not try to tackle it. I like your idea of simply basing it on the volume. One way you could compute the current volume is to use a rolling window of some size (e.g. 1/16 second), and compute the average power in the sound wave over that window. That is, at frame T, you compute the average power over frames [T-N, T], where N is the number of frames in your window.
Thanks to Parseval's theorem, we can easily compute the power in a wave without having to take the Fourier transform or anything complicated -- the average power is just the sum of the squares of the PCM values in the window, divided by the number of frames in the window. Then, you can convert the power into a decibel rating by dividing it by some base power (which can be 1 for simplicity), taking the logarithm, and multiplying by 10.