Basic math behind mixing audio channels - audio

I have an app where I flick the touchscreen and unleash a dot which animates across the screen, reads the pixel color under it, and converts that to audio based on some parameters. This is working great for the most part.
Currently I'm creating one audio channel per dot (iPhone AudioComponent). This works well until I get up to about 15 dots, then the audio starts getting "choppy": dropping in and out, etc.
I think if I were to mix the waveform of all of these channels together, then send that waveform out to maybe one or two channels, I could get much better performance for high numbers of dots. This is where I'm looking for advice.
I am assuming that for any time t, I can take ((f1(t) + f2(t)) / 2.0). Is this a typical approach to mixing audio signals? This way I can never exceed the (normalized) -1.0 .. 1.0 range; however, I'm worried I'll get the opposite problem: quiet audio. Maybe it won't matter so much if there are many dots.
If someone can drop the name of any technique for this, I'll go read up on it. Or, any links would be great.

Yes, just adding the waveforms together will mix them. And as you say, if you then divide by the number of waveforms, you make sure the resulting waveform doesn't clip. You'll obviously get a drop in the volume of the individual waveforms, but what you suggest is the most straightforward method.
There are more sophisticated methods of mixing multiple sources together that try to get a consistent output volume by calculating RMS/peak-type parameters and varying the output gain. If you want to find out more about this, search for automixers.
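The sum-and-divide approach can be sketched in a few lines. Python here purely for illustration; `mix_equal_gain` is a made-up name, and a real iOS render callback would do the same per-sample math in C:

```python
# Minimal sketch of equal-gain mixing: sum N channels, divide by N.
# Guarantees the output stays in -1.0 .. 1.0 if each input does.
def mix_equal_gain(channels):
    """Mix several same-length sample lists into one, scaled by 1/N."""
    n = len(channels)
    length = len(channels[0])
    return [sum(ch[i] for ch in channels) / n for i in range(length)]

# Two toy "waveforms" in the normalized -1.0 .. 1.0 range:
a = [1.0, -0.5, 0.25]
b = [0.0,  0.5, 0.75]
print(mix_equal_gain([a, b]))  # [0.5, 0.0, 0.5]
```

The trade-off is exactly the one raised in the question: each source gets quieter as more dots are added.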

I know this is way too late to answer this but someone may be doing something similar and looking to these responses to help them.
There are classically two answers to the challenge of getting the levels right when mixing (summing) multiple audio sources. This is because it's a vector problem and the answer is different depending on whether the sounds are coherent or not.
If the two sources are coherent, then you would divide by the number of channels. In other words, for ten channels you sum them all and divide by 10 (attenuate by 20dB). For all ten channels to be coherent though, they all have to be carrying the same signal. Generally, that makes no sense - why would ten channels carry the same signal?
There is one case though where coherence is common, where you are summing left and right from a stereo pair. In many cases these two separate signals are closer to coherent, closer to identical, than not.
If the channels are not coherent, then the volume will increase not by the number of sources, but by the square root of the number of sources. For ten sources this means the sum would be 3.16 times as big as each of the sources (assuming that they are all the same level). This corresponds to an attenuation of 10dB. So, to sum 10 channels of different sounds (all of the same loudness) you should attenuate everything by 10dB.
10 dB = 20 x log10(3.16), where 3.16 is the square root of 10.
There's a practical side to this as well. We assumed that the channels are all equally loud, but what if they aren't? Quite often you have some channels that are similar in level and others that are quieter, say voices plus background music, where the music is quieter than the voices. As a rule of thumb, you can ignore the quieter channels. So, assume there are four voice channels and two quieter music channels. We start by ignoring the music channels, which leaves four incoherent voice channels. The square root of four is two, so in this case we halve the audio level, i.e. attenuate by 6dB.
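The square-root rule for incoherent sources is easy to verify numerically. This sketch sums ten incoherent noise sources (uniform random samples, an assumption; real program material differs) and checks that the RMS level grows by roughly sqrt(10) ≈ 3.16 rather than 10:

```python
# Empirical check that summing N incoherent sources raises the RMS
# level by roughly sqrt(N), not N.
import math
import random

random.seed(42)

def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))

n_sources, n_samples = 10, 100000
sources = [[random.uniform(-1, 1) for _ in range(n_samples)]
           for _ in range(n_sources)]
mixed = [sum(src[i] for src in sources) for i in range(n_samples)]

print(rms(mixed) / rms(sources[0]))  # close to sqrt(10), about 3.16
```

Ten copies of the *same* source (the coherent case) would instead give a ratio of exactly 10.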

You can use an AGC (automatic gain control, or automatic limiter) algorithm or process on the output of the mixer to prevent clipping when the mix gets loud.
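As one deliberately crude stand-in for a real limiter, a tanh soft-clipper keeps the mixed output inside -1.0 .. 1.0; a production AGC/limiter would add attack/release smoothing, which is omitted in this sketch:

```python
# Soft-limit a mixed signal into -1.0 .. 1.0 using tanh instead of
# hard-clipping at the rails. A crude illustration, not a real AGC.
import math

def soft_limit(samples, drive=1.0):
    """Squash samples through tanh so they can never exceed +/-1.0."""
    return [math.tanh(drive * s) for s in samples]

hot_mix = [0.2, 1.5, -2.0, 0.9]   # a mix that would hard-clip
print(soft_limit(hot_mix))        # every value now strictly inside -1..1
```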

Related

Synthesized polyphonic sound completely different from the "real" one

I am making a software audio synthesizer, and so far I've managed to play a single tone at a time.
My goal was to make it polyphonic, i.e. when I press 2 keys, both are active and produce sound (I'm aware that a speaker can only output one waveform at a time).
From what I've read so far, to achieve a pseudo-polyphonic effect you are supposed to add the tones together with different amplitudes.
The code I have is too big to post in its entirety, but I've tested it and it's correct (it implements what I described above; as for whether that's the correct thing to do, I'm not so sure anymore).
Here is some pseudo-code of my mixing:
sample = 0.8 * sin(2pi * freq[key1] * time) + 0.2 * sin(2pi * freq[key2] * time)
The issue I have with this approach is that when I tried to play C and C# it resulted in a weird wobbling sound with distortion; it appears to make the entire waveform oscillate at around 3-5 Hz.
I'm also aware that this is the "correct" behavior, because I graphed a scenario like this and the waveform is very similar to what I'm experiencing here.
I know this is the beat effect and that's what happens when you add two tones close in frequency, but that's not what happens when you press 2 keys on a piano, which makes me think this approach is incorrect.
Just as a test I made a second version that uses a stereo configuration: when a second key is pressed it plays the second tone on a different channel, and that produces the exact effect I was looking for.
Here is a comparison
Normal https://files.catbox.moe/2mq7zw.wav
Stereo https://files.catbox.moe/rqn2hr.wav
Any help would be appreciated, but don't say it's impossible, because all serious synthesizers can achieve this effect.
Working backwards from the sound, the "beating" is what would arise from two pitches in the vicinity of 5 or 6 Hz apart. (It was too short for me to count the exact number of beats per second.) Are you playing MIDI 36 (C2) = 65.4 Hz and MIDI 37 (C#2) = 69.3 Hz? Those could be expected to beat roughly 4 times per second; MIDI 48 and 49 would be closer to 8 times a second.
The pitch I'm hearing sounds more like an A than a C, and A2 (110 Hz) + A#2 (116.5 Hz) would have a beat rate that plausibly matches what's heard.
I would double-check that the code you are using in the two scenarios (mono and stereo) is truly sending the frequencies you think it is.
What sample rate are you using? I wonder if the result could be an artifact of an abnormally low number of samples per second in your data generation. The tones I hear have a lot of overtones for sine functions; I'm assuming the harmonics are due to a lack of smoothness, with relatively few steps (a very "blocky" looking signal).
I'm not sure my reasoning is right here, but maybe this is a plausible scenario. Let's assume your computer is able to send out signals at 44,100 fps. That should render even a rather "blocky" sine (with lots of harmonics) pretty well, though there might be some aliasing due to high-frequency content (above the Nyquist frequency) arising from the blockiness.
Let's further assume that your addition function is NOT running at 44,100 fps, but at a much lower sample rate. That would lower the Nyquist frequency and increase the aliasing, so the mixed sound would be more subject to aliasing-related distortion than the scenario where the signals are output separately.
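For reference, the beat-rate estimates above follow from the trig identity sin(a) + sin(b) = 2 sin((a+b)/2) cos((a-b)/2): the sum of two close tones equals one tone at the average frequency with an envelope pulsing at the difference frequency. A small check at the suggested A2/A#2 pitches (a sketch; the OP's actual frequencies are unknown):

```python
# Verify that summing 110 Hz and 116.5 Hz sines equals a single tone
# at the average frequency modulated by an envelope at half the
# difference frequency (heard as 6.5 beats per second).
import math

f1, f2, rate = 110.0, 116.5, 44100

def mixed(t):
    return math.sin(2 * math.pi * f1 * t) + math.sin(2 * math.pi * f2 * t)

def product_form(t):
    return (2 * math.sin(math.pi * (f1 + f2) * t)
              * math.cos(math.pi * (f2 - f1) * t))

# The two forms agree sample-for-sample at the full sample rate:
for n in range(0, rate, 997):
    t = n / rate
    assert abs(mixed(t) - product_form(t)) < 1e-9

print("beat rate:", f2 - f1, "Hz")  # beat rate: 6.5 Hz
```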

Finding the "noise level" of an audio recording programmatically

I am tasked with something seemingly trivial: finding out how "noisy" a given recording is.
The recordings came from a voice recorder, an OLYMPUS VN-733 PC, which was fairly cheap (I am not advertising; I mention it only because I am in no way aiming for anything "professional" here, I simply need to solve a seemingly simple problem).
To preface this, I have already obtained several datasets from different outside locations, in particular parks and near-road recordings. The goal is to capture the noise that exists at these specific locations and then compare it, on average, across locations. In other words: I must find out how noisy location A is compared to locations B and C.
I made 1-minute recordings at each location, so that at least the time spans of the recordings are comparable (and I used the very same voice recorder at all positions, at the same height, etc.).
A sample file can be found at:
http://shevegen.square7.ch/test.mp3
(This may eventually be moved later on; it just serves as an example of how these recordings sound right now. I am unhappy about the initial noisy clipping sound; ideally I'd only capture the background noise of the cars etc., but for now this must suffice.)
Now my specific question is: how can I find out how "noisy" or "loud" this is?
The primary goal is to compare the recordings to the other .mp3 files, which would suffice for my purpose just fine. But ideally it would be nice to calculate on average how "loud" every individual .mp3 is and then compare it to the other ones (there are several recordings per geolocation, so I could even merge them together).
There are some similar questions, but I was not able to find one in particular that answers this in an objective manner, or perhaps I did not understand the problem at hand. I have all the audio datasets already, but I have no idea how to find out how "loud" any one of them is individually. There are some apps on smartphones that claim they can do this automatically, but since I do not have a smartphone, this is a dead end for me.
Any general advice will be much appreciated.
"Noise" is a difficult notion to define, so I will focus on loudness instead.
You could compute the energy of each file. For that, you need access to the samples of the audio signal (generally via a built-in function of your programming language); then you can compute the RMS energy of the signal.
That would be the most basic approach.

audio mixing wrt time

I've read through many questions on Stack Overflow which state that to mix audio streams, you just have to add the sample frames together (and clip when necessary). But what should I do if I want to mix one audio stream into another with some offset? For example, I want to mix the second audio into the first one when the first audio reaches its 5th second.
Any help would be appreciated!
Typically when working with audio on a computer, you will be working in the time domain, with PCM samples. That is, many times per second, the pressure level at that point in time is measured and quantized into a number. If you are working with CD-quality audio, the sample rate is 44,100 samples per second, and each sample is often quantized into a 16-bit integer (-32,768 to 32,767). (Other sample rates, bit depths, and quantizations are out there and often used; this is just an example.)
If you want to mix two audio streams of the same sample rate, it is possible to simply add the values of each pair of samples together. If you think about it, if you were to hear sound from two sources, their pressure levels would affect each other in much the same way: sometimes they cancel each other out, sometimes they add to each other. You mentioned clipping... you can do this, but you will be introducing distortion into the mix. When a sound is too loud to be represented, it is clipped at the maximum and minimum of the representable range, causing audible clicks, pops, and poor-quality sound. If you want to avoid this problem, you can cut the level of each stream in half, guaranteeing that even with both streams at their maximum level, the result stays within range.
Now, your question is about mixing audio with an offset. It's absolutely no different. If you want to start mixing 5 seconds in, then 5 * 44,100 = 220,500, meaning align sample zero of one stream with sample 220,500 of the other stream and mix.
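A sketch of that offset alignment (Python for illustration; `mix_with_offset` is a made-up helper, and the 0.5 gain is the halving suggested above to avoid clipping):

```python
# Mix stream b into stream a starting at a sample offset.
# At 44,100 Hz, an offset of 5 seconds is 5 * 44100 = 220500 samples.
def mix_with_offset(a, b, offset_samples, gain=0.5):
    """Mix b into a starting at offset_samples; both scaled by gain."""
    out = [s * gain for s in a]
    # Extend the output if b runs past the end of a.
    end = offset_samples + len(b)
    out.extend([0.0] * max(0, end - len(out)))
    for i, s in enumerate(b):
        out[offset_samples + i] += s * gain
    return out

# Tiny toy streams so the arithmetic is visible:
a = [1.0, 1.0, 1.0, 1.0]
b = [1.0, 1.0]
print(mix_with_offset(a, b, 2))  # [0.5, 0.5, 1.0, 1.0]
```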

Frequency differences from MP3 to mic

I'm trying to compare sound clips based on microphone recordings. Simply put, I play an MP3 file while recording from the speakers, then attempt to match the two files. I have algorithms in place that work, but I'm seeing a slight difference I'd like to sort out to get better accuracy.
The microphone seems to favor some frequencies (adding amplitude) and be slightly off on others (peaks are wider on the mic).
I'm wondering what the cause of this difference is, and how to compensate for it.
Background:
Because of speed issues in how I'm doing the comparison, I select certain frequencies with certain characteristics. The problem is that a high percentage of these (depending on how many I choose) don't match between the MP3 and the mic.
It's called the frequency response characteristic of the microphone. Unfortunately, you can't easily get around it without buying a different, presumably more expensive, microphone.
If you can measure the actual microphone frequency response by some method (which generally requires a reference ("etalon") acoustic system and an anechoic chamber), you can compensate for it by applying an equaliser tuned to the exactly inverse characteristic, as discussed here. But in practice, as Kilian says, it's much simpler to get a more precise microphone. I'd recommend a condenser or an electrostatic one.

How to mix audio samples?

My question is not completely programming-related, but nevertheless I think SO is the right place to ask.
In my program I generate some audio data and save the track to a WAV file. Everything works fine with one sound generator. But now I want to add more generators and mix the generated audio data into one file. Unfortunately it is more complicated than it seems at first sight.
Moreover I didn't find much useful information on how to mix a set of audio samples.
So is there anyone who can give me advice?
edit:
I'm programming in C++. But it doesn't matter, since I'm interested in the theory behind mixing two audio tracks. The problem I have is that I cannot just sum up the samples, because this often produces distorted sound.
I assume your problem is that for every audio source you're adding in, you're having to lower the levels.
If the app gives control to a user, just let them control the levels directly. Hotness is their responsibility, not yours. This is "summing."
If the mixing is automated, you're about to go on a journey. You'll probably need compression, if not limiting. (Limiting is an extreme version of compression.)
Note that anything you do to the audio (including compression and limiting) is a form of distortion, so you WILL have coloration of the audio. Your choice of compression and limiting algorithms will affect the sound.
Since you're not generating the audio in real time, you have the possibility of doing "brick wall" limiting. That's because you have foreknowledge of the levels. Realtime limiting is more limited because you can't know what's coming up--you have to be reactive.
Is this music, sound effects, voices, what?
Programmers here deal with this all the time.
Mixing audio samples just means adding them together, that's all. Typically you add them into a larger data type so that you can detect overflow and clamp the values before casting back into your destination buffer. If you know beforehand that you will have overflow, then you can scale the amplitudes prior to addition: simply multiply by a floating-point value between 0 and 1, again keeping in mind the issue of precision, perhaps converting to a larger data type first.
If you have a specific problem that is not addressed by this, feel free to update your original question.
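The add-into-a-wider-type-then-clamp approach might look like this sketch (Python's unbounded ints stand in for a wider accumulator such as int32 in C++; names are illustrative):

```python
# Sum two 16-bit PCM sample streams, clamping each result to the
# int16 range so overflow becomes hard clipping instead of wraparound.
INT16_MIN, INT16_MAX = -32768, 32767

def mix_i16(a, b):
    """Sum two 16-bit sample lists, clamping each result to int16."""
    return [max(INT16_MIN, min(INT16_MAX, x + y)) for x, y in zip(a, b)]

print(mix_i16([30000, -20000, 100], [10000, -20000, 50]))
# [32767, -32768, 150]
```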
A dirty mix of two samples:
mix = (a + b) - a * b * sign(a + b)
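Written out as runnable code (normalized -1..1 samples assumed), this formula keeps the result in range without a separate gain stage, at the price of some nonlinear distortion:

```python
# The "dirty mix": mix = (a + b) - a * b * sign(a + b).
# For inputs in -1..1, the output also stays in -1..1.
def sign(x):
    return (x > 0) - (x < 0)

def dirty_mix(a, b):
    return (a + b) - a * b * sign(a + b)

print(dirty_mix(1.0, 1.0))  # 1.0  (instead of clipping at 2.0)
print(dirty_mix(0.5, 0.0))  # 0.5  (mixing with silence leaves a unchanged)
```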
You never said what programming language and platform you're using; for now I'll assume Windows using C#.
http://www.codeplex.com/naudio
Great open source library that really covers off lots of the stuff you'd encounter during most audio operations.
