I've read through many questions on Stack Overflow which state that to mix audio, you just have to add the byte frames together (and make sure to clip when necessary). But what should I do if I want to mix one audio stream into another at some offset? For example, I want to mix the second audio into the first one when the first audio reaches the 5-second mark.
Any help would be appreciated!
Typically when working with audio on a computer, you will be working with audio in the time domain, in the format of PCM samples. That is, many times per second, the pressure level at that point in time is measured and quantized into a number. If you are working with CD-quality audio, the sample rate is 44,100 samples per second. The number is often quantized into 16-bit integers (-32,768 to 32,767). (Other sample rates, bit depths, and quantization schemes are out there and often used; this is just an example.)
If you want to mix two audio streams of the same sample rate, it is possible to simply add the values of each sample together. If you think about it, if you were to hear sound from two sources, their pressure levels would affect each other in much the same way. Sometimes they will cancel each other out, sometimes they will add to each other. You mentioned clipping... you can do this, but you will be introducing distortion into the mix. When a sound is too loud to be quantized, it is clipped at the maximum and minimum of the representable range, causing audible clicks, pops, and poor quality sound. If you want to avoid this problem, you can cut the level of each stream in half, guaranteeing that even with both streams at their maximum level, the sum will be within the appropriate range.
Now, your question is about mixing audio with offset. It's absolutely no different. If you want to start mixing 5 seconds in, then 5 * 44,100 = 220500, meaning align sample zero of one stream to sample 220500 of the other stream and mix.
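To make the offset arithmetic concrete, here is a minimal sketch in Python. It assumes both streams are already decoded into lists of signed 16-bit mono samples at the same sample rate; the function name `mix_with_offset` and its parameters are made up for illustration and are not from any library.

```python
def mix_with_offset(base, overlay, offset_seconds, sample_rate=44100):
    """Mix `overlay` into `base`, starting `offset_seconds` into `base`.

    Both inputs are lists of signed 16-bit sample values (mono, same rate).
    """
    # Convert the time offset into a sample index, e.g. 5 s * 44100 = 220500.
    offset = int(round(offset_seconds * sample_rate))
    if offset < 0:
        raise ValueError("offset_seconds must not be negative")

    # The result must be long enough to hold both streams.
    length = max(len(base), offset + len(overlay))
    mixed = list(base) + [0] * (length - len(base))

    # Sample zero of the overlay lines up with sample `offset` of the base.
    for i, sample in enumerate(overlay):
        mixed[offset + i] += sample

    # Clip to the 16-bit range; this distorts if the sum actually overflows.
    return [max(-32768, min(32767, s)) for s in mixed]
```

If clipping is a concern, halving both inputs before calling this (as described above) keeps the sum in range at the cost of a quieter mix.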
I am making a software audio synthesizer and so far I've managed to play a single tone at once.
My goal was to make it polyphonic, i.e. when I press 2 keys both are active and produce sound (I'm aware that a speaker can only output one waveform at a time).
From what I've read so far, to achieve a pseudo-polyphonic effect what you are supposed to do is add the tones to each other with different amplitudes.
The code I have is too big to post in its entirety, but I've tested it and it's correct (it implements what I described above; as for whether it's the correct thing to do, I'm not so sure anymore).
Here is some pseudo-code of my mixing
sample = 0.8 * sin(2pi * freq[key1] * time) + 0.2 * sin(2pi * freq[key2] * time)
The issue I have with this approach is that when I tried to play C and C# together it resulted in a weird wobble-like sound with distortions; it appears to make the entire waveform oscillate at around 3-5 Hz.
I'm also aware that this is the "correct" behavior because I graphed a scenario like this and the waveform is very similar to what I'm experiencing here.
I know this is the beat effect and that's what happens when you add two tones close in frequency, but that's not what happens when you press 2 keys on a piano, which means this approach is incorrect.
Just as a test I made a second version that uses a stereo configuration: when a second key is pressed it plays the second tone on a different channel, and it produces the exact effect I was looking for.
Here is a comparison
Normal https://files.catbox.moe/2mq7zw.wav
Stereo https://files.catbox.moe/rqn2hr.wav
Any help would be appreciated, but don't say it's impossible because all of the serious synthesizers can achieve this effect
Working backwards from the sound, the "beating" sound is one that would arise from two pitches in the vicinity of 5 or 6 Hz apart. (It was too short for me to count the exact number of beats per second.) Are you playing Midi 36 (C2) = 65.4Hz and Midi 37 (C#2) 69.3Hz? These could be expected to beat at roughly 4 x per sec. Midi 48 & 49 would be closer to 8 times a second.
The pitch I'm hearing sounds more like an A than a C. And A2 (110) + A#2 (116.5) would have beat rate that plausibly matches what's heard.
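As a quick check on the numbers above, here is a small Python sketch that converts MIDI note numbers to frequencies and takes the difference as the beat rate. It assumes standard equal temperament with A4 (MIDI 69) at 440 Hz; the function names are just for illustration.

```python
def midi_to_freq(note, a4=440.0):
    """Frequency in Hz of a MIDI note in equal temperament (A4 = MIDI 69)."""
    return a4 * 2.0 ** ((note - 69) / 12.0)


def beat_rate(note_a, note_b):
    """Beats per second heard when two pure tones are summed."""
    return abs(midi_to_freq(note_a) - midi_to_freq(note_b))


# MIDI 36/37 (C2/C#2), 45/46 (A2/A#2) and 48/49 (C3/C#3)
for pair in ((36, 37), (45, 46), (48, 49)):
    print(pair, round(beat_rate(*pair), 2), "beats per second")
```

Comparing these rates against the wobble you hear is one way to work out which two frequencies are actually being generated.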
I would double check that the code you are using in the two scenarios (mono and stereo) is truly sending the frequencies that you think it is.
What sample rate are you using? I wonder if the result could be an artifact due to an abnormally low number of samples per second in your data generation. The tones I hear have a lot of overtones for being sine functions. I'm assuming the harmonics are due to a lack of smoothness due to there being relatively few steps (a very "blocky" looking signal).
I'm not sure my reasoning is right here, but maybe this is a plausible scenario. Let's assume your computer is able to send out signals at 44100 fps. This should be able to decode a rather "blocky" sine (with lots of harmonics) pretty well. There might be some aliasing due to high frequency content (over the Nyquist value) arising from the blockiness.
Let's further assume that your addition function is NOT occurring at 44100 fps, but at a much lower sample rate. This would lower the Nyquist and increase the aliasing. Thus the mixed sounds would be more subject to aliasing-related distortion than the scenario where the signals are output separately.
I am posed with the task of mixing raw data from audio files. I am currently struggling to get a clean sound from mixing the data, I keep getting distortion or white noise.
Let's say that I have two byte arrays of data from two AudioInputStreams. The AIS is used to stream a byte array from a given audio file. Here I can play back single audio files using SourceDataLine's write method. I want to play two audio files simultaneously, therefore I am aware that I need to perform some sort of PCM addition.
Can anyone recommend whether this addition should be done with float values or byte values? Also, when it comes to adding 3, 4 or more audio files, I am guessing my problem will be even harder! Do I need to divide by a certain amount to avoid this overflow? Let's say I am adding two 16-bit audio files (min -32,768, max 32,767).
I admit, I have had some advice on this before but can't seem to get it working! I have code of what I have tried but not with me!
Any advice would be great.
Thanks
First off, I question whether you are actually working with fully decoded PCM data values. If you are directly adding bytes, that would only make sense if the sound was recorded at 8-bit resolution, which is done less and less. These days, audio is recorded more commonly as 16-bit values, or more. I think there are some situations that don't require as much frequency content, but with current systems, the CPU savings aren't as critical so people opt to keep at least "CD Quality" (16-bit resolution, stereo, 44100 fps).
So step one, you have to make sure that you are properly converting the byte streams to valid PCM. For example, if 16-bit encoding, the two bytes have to be appended in the correct order (may be either big-endian or little-endian), and the resulting value used.
Once that is properly handled, it is usually sufficient to simply add the values and maybe impose a min and max filter to ensure the signal doesn't go beyond the defined range. I can think of two reasons why this works: (a) audio is usually recorded at a low enough volume that summing will not cause overflow, (b) the signals are random enough, with both positive and negative values, that moments where all the contributors line up in either the positive or negative direction are rare and short-lived.
Using a min and max will "clip" the signals, and can introduce some audible distortion, but it is a much less horrible sound than overflow! If your sources are routinely hitting the min and max, you can simply multiply a volume factor (within the range 0 to 1) to one or more of the contributing signals as a whole, to bring the audio values down.
For 16-bit data, it works to perform operations directly on the signed integers that result from appending the two bytes together (-32768 to 32767). But it is a more common practice to "normalize" the values, i.e., convert the 16-bit integers to floats ranging from -1 to 1, perform operations at that level, and then convert back to integers in the range -32768 to 32767 and break those integers into byte pairs.
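The decode, normalize, mix, clip, and re-encode steps described above can be sketched as follows. This is in Python rather than Java, just to show the arithmetic; it assumes signed 16-bit little-endian mono data, and the function names are hypothetical. For big-endian data the `<` in the format string would become `>`.

```python
import struct


def decode_pcm16(data):
    """Turn little-endian 16-bit PCM bytes into floats in [-1.0, 1.0)."""
    count = len(data) // 2
    ints = struct.unpack("<%dh" % count, data[:count * 2])
    return [s / 32768.0 for s in ints]


def encode_pcm16(samples):
    """Turn floats back into little-endian 16-bit PCM bytes, clamping to range."""
    ints = [max(-32768, min(32767, int(round(s * 32768.0)))) for s in samples]
    return struct.pack("<%dh" % len(ints), *ints)


def mix_pcm16(a, b, gain_a=1.0, gain_b=1.0):
    """Mix two PCM byte strings; gains in the range 0 to 1 act as volume factors."""
    xs, ys = decode_pcm16(a), decode_pcm16(b)
    n = max(len(xs), len(ys))
    xs += [0.0] * (n - len(xs))  # pad the shorter stream with silence
    ys += [0.0] * (n - len(ys))
    # The min/max here is the "clip" step: distorted, but far better than overflow.
    mixed = [max(-1.0, min(1.0, gain_a * x + gain_b * y)) for x, y in zip(xs, ys)]
    return encode_pcm16(mixed)
```

Adding the raw bytes directly, without the decode step, is a likely source of the white noise described in the question.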
There is a free book on digital signal processing that is well worth reading: Steven Smith's "The Scientist and Engineer's Guide to Digital Signal Processing." It will give much more detail and background.
I have a .MP3 file stored on my server, and I'd like to modify it to be a bit lower in pitch. I know this can be achieved by increasing the length of the audio, however, I don't know of any libraries in node that can do this.
I've tried using the node web audio API, and soundbank-pitch-shift, but the former doesn't seem to have the capabilities of pitch shifting (AFAIK), and the latter seems designed for client-side use.
I need the solution within the realm of node ONLY- that means no external programs, etc., and it needs to be automated as well, so I can't manually pitch shift.
An ideal solution would be a function that takes a file/filepath as an input, and then creates (or overwrites) another MP3 file but with the pitch shifted by x amount, but really, any solution that produces something with a lower pitch than the original, works.
I'm totally lost here. Please help.
An audio file is basically a list of numbers. Those numbers are read one at a time at a particular speed called the 'sample rate'. The sample rate is otherwise defined as the number of audio samples read every second, e.g. if an audio file's sample rate is 44100, then there are 44100 samples (or numbers) read every second.
If you are with me so far, the simplest way to lower the pitch of an audio file is to play the file back at a lower sample rate (which is normally fixed in place). In most cases you won't be able to do this, so you need to achieve the same effect by resampling the file, i.e. adding new samples to the file in between the old samples to make it literally longer. For this you would need to understand interpolation.
The drawback to this technique in either case is that the sound will also play back at a slower speed, as well as at a lower pitch. If it is a problem that the sound has slowed down as well as lowered in pitch as a result of your processing, then you will also have to use a timestretching algorithm to fix the playback speed.
You may also have problems doing this using MP3 files. In this case you may have to uncompress the data in the MP3 file before you can operate on it in such a way that changes the pitch of the file. WAV files are more ideal in audio processing. In any case, you essentially need to turn the file into a list of floating point numbers, and change those numbers to be effectively read back at a slower rate.
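To illustrate the "add new samples in between the old ones" idea, here is a rough sketch using linear interpolation. It is in Python rather than node, purely to show the arithmetic, and it assumes the MP3 has already been decoded to a list of floats (the decoding is not shown). The function names are made up for this example, and linear interpolation is the simplest choice rather than the best-sounding one.

```python
def stretch_factor_for_semitones(semitones_down):
    """How much longer the audio must become to drop by this many semitones."""
    return 2.0 ** (semitones_down / 12.0)


def stretch_linear(samples, factor):
    """Resample so the result is `factor` times as long (factor > 1 lowers pitch).

    Played back at the original sample rate, the output is both lower in
    pitch and slower, which is the drawback described above.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if len(samples) < 2:
        return list(samples)
    last = len(samples) - 1
    out = []
    for i in range(int(len(samples) * factor)):
        pos = i / factor            # fractional position in the input
        j = min(int(pos), last)     # input sample just before that position
        frac = pos - j
        nxt = samples[min(j + 1, last)]
        # Weighted average of the two neighbouring input samples.
        out.append(samples[j] * (1.0 - frac) + nxt * frac)
    return out
```

For example, `stretch_linear(samples, stretch_factor_for_semitones(12))` doubles the length and drops the pitch by an octave.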
Other methods of pitch shifting would probably need to involve the use of FFTs, and would be a more complicated affair to say the least.
I am not familiar with nodejs I'm afraid.
I managed to get it working with help from Ollie M's answer and node-lame.
I hadn't known previously that sample rate could affect the speed, but thanks to Ollie, suddenly this problem became a lot more simple.
Using node-lame, all I did was take one of the examples (mp32wav.js), and make it so that I change the parameter sampleRate of the format object, so that it is lower than the base sample rate, which in my application was always a static 24,000. I could also make it dynamic since node-lame can grab the parameters of the input file in the format object.
Ollie, however, perfectly describes the drawback with this method:
"The drawback to this technique in either case is that the sound will also play back at a slower speed, as well as at a lower pitch. If it is a problem that the sound has slowed down as well as lowered in pitch as a result of your processing, then you will also have to use a timestretching algorithm to fix the playback speed."
I don't have a particular need to implement a time stretching algorithm at the moment (thankfully, because that's a whole other can of worms), since I have the ability to change the initial speed of the file, but others may in the future.
See https://www.npmjs.com/package/audio-decode, https://github.com/audiojs/audio-buffer, and related linked at bottom of audio-buffer readme.
I have an app where I flick the touchscreen and unleash a dot which animates across the screen, reads the pixel color under it, and converts that to audio based on some parameters. This is working great for the most part.
Currently I'm creating one audio channel per dot (iPhone AudioComponent). This works well until I get up to about 15 dots, then it starts getting "choppy": audio dropping in and out, etc.
I think if I were to mix the waveform of all of these channels together, then send that waveform out to maybe one or two channels, I could get much better performance for high numbers of dots. This is where I'm looking for advice.
I am assuming for any time t, I can take ((f1(x) + f2(x)) / 2.0). Is this a typical approach to mixing audio signals? This way I can never exceed (normalized) 1.0 .. -1.0, however I'm worried that I'll get the opposite of that; quiet audio. Maybe it won't matter so much if there are so many dots.
If someone can drop the name of any technique for this, I'll go read up on it. Or, any links would be great.
Yes, just adding the waveforms together will mix them. And as you say, if you then divide by the number of waveforms then you'll make sure you don't clip on the resulting waveform. You'll obviously get a drop in the volume of the individual waveforms, but what you suggest is the most straightforward method.
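A minimal sketch of that sum-and-divide approach, written in Python for brevity rather than for the iPhone; it assumes each waveform is a list of normalized floats in the range -1.0 to 1.0, and the name `mix_average` is hypothetical.

```python
def mix_average(waveforms):
    """Sum several waveforms sample by sample and divide by their count.

    If every input stays within -1.0..1.0, so does the output.
    """
    if not waveforms:
        return []
    length = max(len(w) for w in waveforms)
    mixed = [0.0] * length
    for w in waveforms:
        for i, sample in enumerate(w):
            mixed[i] += sample      # shorter inputs contribute silence at the end
    count = len(waveforms)
    return [s / count for s in mixed]
```

With many sources this gets quiet quickly, which is the volume drop mentioned above.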
There are more sophisticated methods of mixing multiple sources together to try and get a consistent volume output which calculate RMS/peak type parameters to vary the output gain. If you want to find out more about this, do a search on automixers.
I know this is way too late to answer this but someone may be doing something similar and looking to these responses to help them.
There are classically two answers to the challenge of getting the levels right when mixing (summing) multiple audio sources. This is because it's a vector problem and the answer is different depending on whether the sounds are coherent or not.
If the sources are coherent, then you would divide by the number of channels. In other words, for ten channels you sum them all and divide by 10 (attenuate by 20dB). For all ten channels to be coherent though, they all have to be carrying the same signal. Generally, that makes no sense - why would ten channels carry the same signal?
There is one case though where coherence is common, where you are summing left and right from a stereo pair. In many cases these two separate signals are closer to coherent, closer to identical, than not.
If the channels are not coherent, then the volume will increase not by the number of sources, but by the square root of the number of sources. For ten sources this means the sum would be 3.16 times as big as each of the sources (assuming that they are all the same level). This corresponds to an attenuation of 10dB. So, to sum 10 channels of different sounds (all of the same loudness) you should attenuate everything by 10dB.
10dB = 20 x log10(3.16), where 3.16 is the square root of 10.
There's a practical part to this as well. We assumed that the channels are all equally loud, but what if they aren't? Quite often you have some channels that are similar and others that are quieter. Like say adding voices plus background music - where the music is quieter than the voices. As a rule of thumb, you can ignore the quieter channels. So, assume there are four voice channels and two quieter music channels. We start by ignoring the music channels which leaves four incoherent voice channels. The square root of four is two, so in this case we halve the audio level - attenuate it by 6dB.
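The two rules can be written as one small calculation. This is only a sketch of the arithmetic above, assuming all counted sources are equally loud; the function name is made up.

```python
import math


def mix_attenuation_db(num_sources, coherent=False):
    """Attenuation in dB to apply to the sum of equally loud sources.

    Coherent sources add in amplitude (gain = N); incoherent sources add
    in power (gain = sqrt(N)).
    """
    if num_sources < 1:
        raise ValueError("need at least one source")
    gain = num_sources if coherent else math.sqrt(num_sources)
    return 20.0 * math.log10(gain)


print(round(mix_attenuation_db(10, coherent=True), 1))  # ten identical channels
print(round(mix_attenuation_db(10), 1))                 # ten different sounds
print(round(mix_attenuation_db(4), 1))                  # the four-voice example
```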
You can use an AGC (automatic gain control or automatic limiter) algorithm or process on the output of the mixer to prevent clipping at louder mix levels.
I already asked about audio volume normalization. On most methods (e.g. ReplayGain, which I am most interested in), I might get peaks that exceed the PCM limit (as can also be read here).
Simple clipping would probably be the worst thing I can do. As Wikipedia suggests, I should do some form of dynamic range compression.
I am speaking about the function which I'm applying on each individual PCM sample value. On another similar question, one answer suggests that doing this is not enough or not the thing I should do. However, I don't really understand that as I still have to handle the clipping case. Does the answer suggest doing the range compression on multiple samples at once and doing simple hard clipping in addition on every sample?
Leaving that aside, the functions discussed in the Wikipedia article seem to be somewhat not what I want (in many cases, I would still have the clipping case in the end). I am thinking about using something like tanh. Is that a bad idea? It would reduce the volume slightly but guarantee that I don't get any clipping.
My application is a generic music player. I am searching for a solution which mostly works best for everyone so that I can always turn it on and the user very likely does not want to turn this off.
Using any instantaneous dynamic range processing (such as clipping or tanh non-linearity) will introduce audible distortion. Put a sine wave into an instantaneous non-linear function and you no longer have a sine wave. While useful for certain audio applications, it sounds like you do not want these artefacts.
Normalization does not affect the dynamics (in terms of min/max ratio) of a waveform. Normalization involves element-wise multiplication of a waveform by a constant scalar value to ensure no samples ever exceed a maximum value. This process can only be done off-line, as you need to analyse the entire signal before processing. Normalization is also a bad idea if your waveform contains any intense transients. Your entire signal will be attenuated by the ratio of the transient peak value divided by the clipping threshold.
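For reference, peak normalization as described here is only a few lines. This is a sketch in Python on a list of float samples; the name `normalize_peak` is hypothetical.

```python
def normalize_peak(samples, target_peak=1.0):
    """Scale the whole signal by one constant so its largest peak hits target_peak.

    The entire signal has to be scanned first, which is why this is off-line.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)        # silence: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]
```

A single transient at 4.0 in an otherwise quiet signal forces a gain of 0.25 on everything, which is the drawback with intense transients mentioned above.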
If you just want to protect the output from clipping, you are best off using a side-chain type compressor. A specific form of this is the limiter (infinite compression ratio above a threshold with zero attack time). A side-chain compressor calculates the smoothed energy envelope of a signal and then applies a varying gain according to that function. These are not instantaneous, so you reduce the audible distortion that you'd get from the functions you mention. A limiter can have instantaneous attack to prevent clipping, but you allow a release time so that the limiter keeps attenuating for subsequent waveform peaks. The subsequent waveform is just turned down, so there is no distortion. After the intense sound, the limiter recovers.
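Here is a very reduced sketch of the limiter idea (instant attack, gradual release) in Python. It is not a production design: there is no lookahead and no envelope smoothing, and `release` is an assumed per-sample recovery coefficient rather than a time in milliseconds. The name `limit` is made up for this example.

```python
def limit(samples, threshold=1.0, release=0.999):
    """Keep |output| <= threshold using a gain that drops instantly and recovers slowly.

    `release` is in [0, 1): closer to 1 means slower recovery. With
    release=0 the gain resets every sample and this degrades to hard clipping.
    """
    if threshold <= 0 or not 0.0 <= release < 1.0:
        raise ValueError("need threshold > 0 and 0 <= release < 1")
    gain = 1.0
    out = []
    for s in samples:
        # Release: let the gain drift back toward 1.0 a little each sample.
        gain = 1.0 - (1.0 - gain) * release
        # Attack: if this sample would still exceed the threshold, cut gain at once.
        if abs(s) * gain > threshold:
            gain = threshold / abs(s)
        # The clamp only guards against floating-point rounding.
        out.append(max(-threshold, min(threshold, s * gain)))
    return out
```

Because the reduced gain persists after a peak, the samples that follow are turned down rather than reshaped, which is the difference from per-sample clipping or tanh.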
You can get a pumping type sound from this type of processing if there are a lot of high intensity peaks in the waveform. If this becomes problematic, you can then move to the next level and do the dynamics processing within sub-bands. This way, only the offending parts of the frequency spectrum will be attenuated, leaving the rest of the sound unaffected.
The general solution is to normalize to some gain level significantly below 1 such that very few songs require adding gain. In other words, most of the time you will be lowering the volume of signal rather than increasing. Experiment with a wide variety of songs in different styles to figure out what this level is.
Now, occasionally, you'll still come across a song that requires enough gain that, at some point, it would clip. You have two options: 1. don't add that much gain. This one song will sound a bit quieter. C'est la vie. (This is a common approach.) Or 2. apply a small amount of dynamic range compression and/or limiting. Of course, you can also do some combination of 1 and 2. I believe iTunes uses a combination of 1 and 2, but they've worked very hard on #2, and they apply very little.
Your suggestion, using a function like tanh on a sample-by-sample basis, will result in audible distortion. You don't want to do this for a generic music player. This is the sort of thing that's done in guitar amp simulators to make them sound "dirty" and "grungy". It might not be audible in rock, pop, or other modern music which is heavy on distortion already, but on carefully recorded choral, jazz or solo violin music people will be upset. This has nothing to do with the choice of tanh, by the way. Any nonlinear function will produce distortion.
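A small experiment makes the distortion visible. The Python sketch below pushes one cycle of a sine wave through tanh and measures how much third harmonic appears; the drive factor of 2.0 and the helper name `harmonic_amplitude` are arbitrary choices for illustration.

```python
import math


def harmonic_amplitude(samples, harmonic):
    """Amplitude of one harmonic, assuming `samples` holds exactly one fundamental cycle."""
    n = len(samples)
    re = sum(s * math.cos(2 * math.pi * harmonic * i / n) for i, s in enumerate(samples))
    im = sum(s * math.sin(2 * math.pi * harmonic * i / n) for i, s in enumerate(samples))
    return 2.0 * math.hypot(re, im) / n


n = 1024
sine = [math.sin(2 * math.pi * i / n) for i in range(n)]   # one clean cycle
shaped = [math.tanh(2.0 * s) for s in sine]                # per-sample tanh

print("3rd harmonic, clean sine:", harmonic_amplitude(sine, 3))
print("3rd harmonic, after tanh:", harmonic_amplitude(shaped, 3))
```

The clean sine has essentially no energy at the third harmonic, while the tanh-shaped version clearly does. That added content is what is heard as distortion.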
Dynamic range compression uses envelopes that are applied over time to the signal: http://en.wikipedia.org/wiki/Dynamic_range_compression
This is tricky to get right, and you can never create a compressor that is truly "transparent". A limiter can be thought of as an extreme version of a compressor that (at least in theory) prevents signal from going above a certain level. A digital "lookahead" limiter can do so without noticeable clipping. When judiciously used, it is pretty transparent.
If you take this approach, make sure that this feature can be turned off, because no matter how transparent you think it is, someone will hear it and not like it.