Synthesized polyphonic sound completely different from the "real" one

I am making a software audio synthesizer, and so far I've managed to play a single tone at a time.
My goal is to make it polyphonic, i.e. when I press 2 keys both are active and produce sound (I'm aware that a speaker can only output one waveform at a time).
From what I've read so far, to achieve a pseudo-polyphonic effect you are supposed to add the tones to each other with different amplitudes.
The code I have is too big to post in its entirety, but I've tested it and it's correct; it implements what I described above (as for whether that's the correct thing to do, I'm not so sure anymore).
Here is some pseudo-code of my mixing:
sample = 0.8 * sin(2 * pi * freq[key1] * time) + 0.2 * sin(2 * pi * freq[key2] * time)
The issue I have with this approach is that when I try to play C and C# together, the result is a weird wobbling sound with distortion; the entire waveform appears to oscillate at around 3-5 Hz.
I'm also aware that this is the "correct" behavior, because I graphed a scenario like this and the waveform is very similar to what I'm hearing.
I know this is the beat effect, and that it's what happens when you add two tones close in frequency, but that's not what happens when you press 2 keys on a piano, which makes me think this approach is incorrect.
Just as a test I made a second version that uses a stereo configuration: when a second key is pressed it plays the second tone on a different channel, and that produces exactly the effect I was looking for.
Here is a comparison:
Normal https://files.catbox.moe/2mq7zw.wav
Stereo https://files.catbox.moe/rqn2hr.wav
Any help would be appreciated, but please don't say it's impossible, because all serious synthesizers can achieve this effect.

Working backwards from the sound, the "beating" is what would arise from two pitches in the vicinity of 5 or 6 Hz apart. (The clip was too short for me to count the exact number of beats per second.) Are you playing MIDI 36 (C2) = 65.4 Hz and MIDI 37 (C#2) = 69.3 Hz? Those could be expected to beat roughly 4 times per second; MIDI 48 and 49 would be closer to 8 times a second.
The pitch I'm hearing sounds more like an A than a C, and A2 (110 Hz) + A#2 (116.5 Hz) would have a beat rate that plausibly matches what's heard.
I would double-check that the code in the two scenarios (mono and stereo) is truly sending the frequencies you think it is.
What sample rate are you using? I wonder if the result could be an artifact of an abnormally low number of samples per second in your data generation. The tones I hear have a lot of overtones for being sine functions. I'm assuming the harmonics are due to a lack of smoothness, i.e. there being relatively few steps (a very "blocky"-looking signal).
I'm not sure my reasoning is right here, but maybe this is a plausible scenario. Let's assume your computer is able to send out signals at 44100 fps. This should be able to decode a rather "blocky" sine (with lots of harmonics) pretty well. There might be some aliasing due to high frequency content (over the Nyquist value) arising from the blockiness.
Let's further assume that your addition function is NOT occurring at 44100 fps, but at a much lower sample rate. This would lower the Nyquist and increase the aliasing. Thus the mixed sounds would be more subject to aliasing-related distortion than the scenario where the signals are output separately.
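The summing approach from the question can be sketched as follows (a minimal NumPy illustration, not the poster's actual code; the 44100 Hz rate and the note frequencies are assumptions). Note that summing C2 and C#2 at the full output rate still produces beats at the difference frequency; that part is physics, and the question is whether extra distortion is being added on top of it:

```python
import numpy as np

SAMPLE_RATE = 44100  # assumed output rate, samples per second

def mix_tones(freqs, amps, duration):
    """Sum sine tones sample-by-sample at the full output rate."""
    t = np.arange(int(SAMPLE_RATE * duration)) / SAMPLE_RATE
    out = np.zeros_like(t)
    for f, a in zip(freqs, amps):
        out += a * np.sin(2 * np.pi * f * t)
    return out

# C2 (65.41 Hz) + C#2 (69.30 Hz), weighted as in the question's pseudo-code.
# The sum's envelope pulses at the |f1 - f2| = 3.89 Hz difference frequency,
# which matches the 3-5 Hz wobble described in the question.
mixed = mix_tones([65.41, 69.30], [0.8, 0.2], duration=2.0)
```

If this clean version beats gently while the real synth sounds harshly distorted, the distortion is coming from somewhere other than the addition itself (e.g. a lower internal rate or integer overflow).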


Addition of PCM Audio Files - Mixing Audio

I am posed with the task of mixing raw data from audio files. I am currently struggling to get a clean sound from mixing the data; I keep getting distortion or white noise.
Let's say that I have two byte arrays of data from two AudioInputStreams. The AIS is used to stream a byte array from a given audio file; I can play back single audio files using SourceDataLine's write method. I want to play two audio files simultaneously, so I am aware that I need to perform some sort of PCM addition.
Can anyone recommend whether this addition should be done with float values or byte values? Also, when it comes to adding 3, 4 or more audio files, I am guessing my problem will be even harder! Do I need to divide by a certain amount to avoid overflow? Let's say I am adding two 16-bit audio files (min -32,768, max 32,767).
I admit, I have had some advice on this before but can't seem to get it working! I have code of what I have tried but not with me!
Any advice would be great.
Thanks
First off, I question whether you are actually working with fully decoded PCM data values. If you are directly adding bytes, that would only make sense if the sound was recorded at 8-bit resolution, which is done less and less. These days, audio is more commonly recorded as 16-bit values, or more. There are some situations that don't require as much fidelity, but with current systems the CPU savings aren't as critical, so people opt to keep at least "CD quality" (16-bit resolution, stereo, 44,100 fps).
So step one, you have to make sure that you are properly converting the byte streams to valid PCM. For example, if 16-bit encoding, the two bytes have to be appended in the correct order (may be either big-endian or little-endian), and the resulting value used.
Once that is properly handled, it is usually sufficient to simply add the values and maybe impose a min and max filter to ensure the signal doesn't go beyond the defined range. I can think of two reasons why this works: (a) audio is usually recorded at a low enough volume that summing will not cause overflow, (b) the signals are random enough, with both positive and negative values, that moments where all the contributors line up in either the positive or negative direction are rare and short-lived.
Using a min and max will "clip" the signals, and can introduce some audible distortion, but it is a much less horrible sound than overflow! If your sources are routinely hitting the min and max, you can simply multiply a volume factor (within the range 0 to 1) to one or more of the contributing signals as a whole, to bring the audio values down.
For 16-bit data, it works to perform operations directly on the signed integers that result from appending the two bytes together (-32768 to 32767). But it is a more common practice to "normalize" the values, i.e., convert the 16-bit integers to floats ranging from -1 to 1, perform operations at that level, and then convert back to integers in the range -32768 to 32767 and break those integers into byte pairs.
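The decode, normalize, sum, and clip steps described above can be sketched like this (a NumPy illustration under a 16-bit little-endian assumption; the function names are mine, not from any particular library):

```python
import numpy as np

def bytes_to_floats(raw, little_endian=True):
    """Interpret a 16-bit PCM byte array as floats in roughly [-1, 1)."""
    dtype = '<i2' if little_endian else '>i2'
    samples = np.frombuffer(raw, dtype=dtype).astype(np.float64)
    return samples / 32768.0

def floats_to_bytes(samples, little_endian=True):
    """Clip to the representable range and convert back to 16-bit PCM bytes."""
    clipped = np.clip(samples, -1.0, 32767 / 32768.0)
    dtype = '<i2' if little_endian else '>i2'
    return (clipped * 32768.0).astype(dtype).tobytes()

def mix(raw_a, raw_b):
    """Mix two equal-length 16-bit PCM byte streams with min/max clipping."""
    return floats_to_bytes(bytes_to_floats(raw_a) + bytes_to_floats(raw_b))
```

Working at the float level keeps the addition simple, and the clip in `floats_to_bytes` is the "min and max filter" mentioned above; apply a volume factor before mixing if the clip is being hit routinely.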
There is a free book on digital signal processing that is well worth reading: Steven Smith's "The Scientist and Engineer's Guide to Digital Signal Processing." It will give much more detail and background.

audio mixing wrt time

I've read through many questions on Stack Overflow which state that to mix audio you just have to add the byte frames together (and clip when necessary). But what should I do if I want to mix one audio stream into another with some offset? For example, I want to mix the second audio into the first one when the first audio reaches its 5th second.
Any help would be appreciated!
Typically when working with audio on a computer, you will be working in the time domain, in the format of PCM samples. That is, many times per second, the pressure level at that point in time is measured and quantized into a number. If you are working with CD-quality audio, the sample rate is 44,100 samples per second. The value is often quantized into 16-bit integers (-32,768 to 32,767). (Other sample rates, bit depths, and quantizations are out there and often used; this is just an example.)
If you want to mix two audio streams of the same sample rate, it is possible to simply add the values of each sample together. If you think about it, if you were to hear sound from two sources, their pressure levels would affect each other in much the same way. Sometimes they will cancel each other out, sometimes they will add to each other. You mentioned clipping... you can do this, but you will be introducing distortion into the mix. When a sound is too loud to be quantized, it is clipped at the maximum and minimum of the quantizable range, causing audible clicks, pops, and poor quality sound. If you want to avoid this problem, you can cut the level of each stream in half, guaranteeing that even with both streams at their maximum level, the sum will stay within the representable range.
Now, your question is about mixing audio with an offset. It's absolutely no different. If you want to start mixing 5 seconds in, then 5 * 44,100 = 220,500, meaning you align sample zero of one stream with sample 220,500 of the other stream and mix.
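That alignment can be sketched as follows (a NumPy illustration on normalized float samples; the function name and the 44100 Hz rate are assumptions):

```python
import numpy as np

SAMPLE_RATE = 44100  # assumed sample rate

def mix_with_offset(a, b, offset_seconds):
    """Mix stream b into stream a, starting offset_seconds into a.

    Both inputs are float arrays of normalized samples in [-1, 1].
    """
    start = int(round(offset_seconds * SAMPLE_RATE))
    out = np.zeros(max(len(a), start + len(b)), dtype=np.float64)
    out[:len(a)] += a                      # stream a starts at sample 0
    out[start:start + len(b)] += b         # stream b starts at the offset
    return np.clip(out, -1.0, 1.0)         # guard against overflow
```

For a 5-second offset you would pass `offset_seconds=5.0`, which lands b at sample 220,500 exactly as described above.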

How can I detect the sound in a raw sound file

I am developing software which can automatically record and extract every word in my voice. I used the PortAudio library for this, but I am stuck on detecting the sound: I set the silence value to zero, so if there is a sample which is zero, it must be a start or end point of a sound. But when I ran it, the program produced far too many words. I think it's because the values I read with PortAudio are raw data, so they can't be processed like that. Am I right? How can I fix it? By the way, I am coding in C++ :D
To detect the presence of a signal in a PCM stream, you need a way to distinguish it from background noise. As dprogramz said, the noise floor of your sound card is probably not perfect, so there will be some noise signal recorded (even with no mic connected).
The solution is to use a VOX or VAD algorithm to detect the presence of your voice. VOX can be tricky, since in most consumer-grade electronics the noise floor is just low enough to be "silence" to the human ear, relative to the signal. This means that the difference in amplitude between the noise floor and the signal may be slight. If your sound card has AGC turned on this can make it even more difficult, since the noise floor may move. Having said that, VOX can be implemented successfully on consumer-grade equipment; it just takes more effort to establish the threshold. When done best, the threshold is recalculated periodically while the stream is active.
If I were doing this I'd implement a VAD algorithm. Since your objective is to detect your voice this should provide a reliable result regardless of the equipment you use.
I don't think it's because it is a RAW value. RAW sound files are simply a stream of amplitude samples over time.
However, the value will rarely (if ever) be exactly zero. You have to take into account that there is a small amount of electrical noise made by the mic. Figure out the "idle" dB of your mic (just test the level when you aren't talking into it). You then need to set a silence threshold (below a certain dB level for a certain number of samples) to detect the beginning/end. Attempting to detect an exact zero value is going to be nearly impossible.
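A threshold like this is usually applied per frame of samples rather than per individual sample. A minimal sketch (the frame size and threshold here are illustrative values you would calibrate against your own mic's idle level):

```python
import numpy as np

def detect_sound_frames(samples, frame_size=512, threshold=0.02):
    """Mark frames whose RMS level exceeds a silence threshold.

    samples: float array of normalized audio in [-1, 1].
    threshold: normalized amplitude; calibrate it slightly above the
    RMS measured while no one is speaking into the mic.
    """
    n_frames = len(samples) // frame_size
    frames = samples[:n_frames * frame_size].reshape(n_frames, frame_size)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))  # one level per frame
    return rms > threshold                       # True = sound present
```

Word boundaries then fall where the boolean sequence switches between runs of True and runs of False, which is far more robust than looking for exact zeros.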

Basic math behind mixing audio channels

I have an app where I flick the touchscreen and unleash a dot which animates across the screen, reads the pixel color under it, and converts that to audio based on some parameters. This is working great for the most part.
Currently I'm creating one audio channel per dot (iPhone AudioComponent). This works well until I get up to about 15 dots, then it starts getting "choppy": audio dropping in/out, etc.
I think if I were to mix the waveform of all of these channels together, then send that waveform out to maybe one or two channels, I could get much better performance for high numbers of dots. This is where I'm looking for advice.
I am assuming that for any time t, I can take ((f1(t) + f2(t)) / 2.0). Is this a typical approach to mixing audio signals? This way I can never exceed the normalized -1.0 .. 1.0 range; however, I'm worried I'll get the opposite problem: quiet audio. Maybe it won't matter so much when there are many dots.
If someone can drop the name of any technique for this, I'll go read up on it. Or, any links would be great.
Yes, just adding the waveforms together will mix them. And as you say, if you then divide by the number of waveforms then you'll make sure you don't clip on the resulting waveform. You'll obviously get a drop in the volume of the individual waveforms, but what you suggest is the most straightforward method.
There are more sophisticated methods of mixing multiple sources together to try and get a consistent volume output which calculate RMS/peak type parameters to vary the output gain. If you want to find out more about this, do a search on automixers.
I know this is way too late to answer this but someone may be doing something similar and looking to these responses to help them.
There are classically two answers to the challenge of getting the levels right when mixing (summing) multiple audio sources. This is because it's a vector problem and the answer is different depending on whether the sounds are coherent or not.
If the two sources are coherent, then you would divide by the number of channels. In other words, for ten channels you sum them all and divide by 10 (attenuate by 20dB). For all ten channels to be coherent though, they all have to be carrying the same signal. Generally, that makes no sense - why would ten channels carry the same signal?
There is one case though where coherence is common, where you are summing left and right from a stereo pair. In many cases these two separate signals are closer to coherent, closer to identical, than not.
If the channels are not coherent, then the volume will increase not by the number of sources, but by the square root of the number of sources. For ten sources this means the sum would be 3.16 times as big as each of the sources (assuming that they are all the same level). This corresponds to an attenuation of 10dB. So, to sum 10 channels of different sounds (all of the same loudness) you should attenuate everything by 10dB.
10 dB = 20 × log10(3.16), where 3.16 is the square root of 10.
There's a practical part to this as well. We assumed that the channels are all equally loud, but what if they aren't? Quite often you have some channels that are similar and others that are quieter. Like say adding voices plus background music - where the music is quieter than the voices. As a rule of thumb, you can ignore the quieter channels. So, assume there are four voice channels and two quieter music channels. We start by ignoring the music channels which leaves four incoherent voice channels. The square root of four is two, so in this case we halve the audio level - attenuate it by 6dB.
You can use an AGC (automatic gain control or automatic limiter) algorithm or process on the output of the mixer to prevent clipping at less quiet volume mix levels.
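The two attenuation rules above (divide by N for coherent sources, by the square root of N for incoherent ones) can be written down directly (a small sketch; the function names are mine):

```python
import math

def mix_gain(n_sources, coherent=False):
    """Gain to apply to each of n equally loud sources before summing.

    Coherent sources (identical signals) add linearly, so divide by n.
    Incoherent sources add in power, so divide by sqrt(n).
    """
    if coherent:
        return 1.0 / n_sources
    return 1.0 / math.sqrt(n_sources)

def attenuation_db(n_sources, coherent=False):
    """The same gain expressed as an attenuation in decibels."""
    return 20.0 * math.log10(1.0 / mix_gain(n_sources, coherent))
```

For the ten-channel example above, the incoherent rule gives a gain of 1/3.16, i.e. 10 dB of attenuation, versus 20 dB for the coherent case.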

How can I look for certain sounds in a live sound input?

I've combed Stack Overflow and the web for questions on whistle detection and the like, and many people explained as much as they could about how they go about detecting their target sounds.
capturing sound for analysis and visualizing frequences in android
analyzing whistle sound for pitch note
But what I don't get is how does FFT help you to detect certain sounds in a given sample audio data?
Here's what I understand so far from some stuff I found here and there.
- The sine wave is more or less the building block of ALL signals, musical or not.
- Three parameters (FREQUENCY, AMPLITUDE, and INITIAL PHASE) characterize every steady sine wave completely.
- They make each and every kind of wave unique.
- A Fourier transform can be used to inspect what kinds of sine waves there are in a signal.
Source: Audio signal processing basics
Audio data that the computer generates as received from the mic or other input source, for live processing, is an array of amplitudes processed (or stored or taken) at a particular sample rate.
So how does one go from that to detecting whistles and claps?
And complex things such as say, a short period of whistling to a particular song?
My theory of detection is that we examine our whistles in a spectrogram and record their particular frequency and amplitude characteristics. Then, if those particular characteristics appear again in the input, we've detected a whistle.
Am I right or wrong?
This sound processing stuff is a little complicated.
Forgot to mention this: I'm using Python. Java is also okay, since most of the example code I found was for Android, which is in Java, and I can work in Java too. Any mention of libraries or APIs would be helpful.
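A rough sketch of the spectrogram idea described above, using an FFT to find the strongest frequency in a chunk of samples (the function names, whistle band limits, and FFT size are illustrative assumptions, not a robust detector):

```python
import numpy as np

SAMPLE_RATE = 44100  # assumed capture rate

def dominant_frequency(samples):
    """Return the frequency (Hz) of the strongest spectral peak."""
    windowed = samples * np.hanning(len(samples))  # reduce spectral leakage
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / SAMPLE_RATE)
    return freqs[np.argmax(spectrum)]

def looks_like_whistle(samples, lo=1000.0, hi=3000.0):
    """Crude test: is the dominant peak inside a typical whistle band?"""
    return lo <= dominant_frequency(samples) <= hi
```

A real detector would also require the peak to persist across consecutive frames and to stand well above the rest of the spectrum, but this is the core of how an FFT turns "an array of amplitudes" into something you can match characteristics against.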
