Combining two digital sound files - audio

Why does combining two digital sound files into one digital sound file often involve adjusting the dB level?
Mathematically, a 5.1 sound file is much larger than a stereo sound file. Why?

Audio samples are usually held in memory on a normalized linear scale (for example 0 to 1, or -1 to 1). Directly summing two such signals raises the overall level in a way that does not map neatly onto the logarithmic decibel scale, and the sum can exceed the available range. The levels also need to be adjusted to set the relative balance between the two signals. Both corrections are applied before summing, so that the result is neither too loud nor out of balance.
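As a rough numpy sketch of that (the signal names, the -6 dB figures, and the test tones are placeholders, not a canonical implementation): a dB adjustment is converted to a linear gain and applied to each signal before the sum.

    import numpy as np

    def db_to_gain(db):
        # A decibel adjustment expressed as a linear gain factor.
        return 10.0 ** (db / 20.0)

    def mix(a, b, a_db=-6.0, b_db=-6.0):
        # a, b: float arrays normalised to -1.0..1.0 (or 0..1).
        # Attenuate each signal to set its relative level, then sum.
        out = a * db_to_gain(a_db) + b * db_to_gain(b_db)
        # Guard against any remaining overshoot before writing out.
        return np.clip(out, -1.0, 1.0)

    # Example: two one-second test tones at 44.1 kHz.
    t = np.arange(44100) / 44100.0
    mono = mix(np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 660 * t))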
To answer your other question, a 5.1 file typically carries six channels of audio at any given time, while a stereo file has only two. So the 5.1 version of a sound file (metadata excluded) will usually be right around three times larger than a stereo file of equivalent per-channel bitrate.

Related

'Mono' FFT Visualization of a Stereo Analog Audio Source

I have created a really basic FFT visualizer using a Teensy microcontroller, a display panel, and a pair of headphone jacks. I used kosme's FFT library for Arduino: https://github.com/kosme/arduinoFFT
Analog audio flows into the headphone input and to a junction where the microcontroller samples it. That junction is also connected to an audio out jack so that audio can be passed to some speakers.
This is all fine and good, but currently I'm only sampling the left audio channel. Any time music is stereo separated, the visualization cannot account for any sound on the right channel. I want to rectify this but I'm not sure whether I should start with hardware or software.
Is there a circuit I should build to mix the left and right audio channels? I figure I could do something like so:
But I'm pretty sure that my schematic is misguided. I included bias voltage to try and DC couple the audio signal so that it will properly ride over the diodes. Making sure that the output matches the input is important to me though.
Or maybe should this best be approached in software? Should I instead just be sampling both channels separately and then doing some math to combine them?
Combining the stereo channels in hardware on just the branch that feeds the microcontroller, without also combining them on the pass-through output, is very difficult. Working in software is much easier.
If you take two sets of samples, you've doubled the amount of math that the microcontroller needs to do.
But if you take readings from both pins and divide them by two, you can add them together and have one set of samples which represents the 'mono' signal.
Keep in mind that human ears have an uneven response to sound volumes, so a 'medium' volume reading on both pins, summed and halved, will result in a 'lower-medium' value. It's better to divide by 1.5 or 1.75 if you can spare the cycles for more complicated division.
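If you go the software route, the arithmetic is tiny. A rough Python/numpy sketch of the idea (this is not Teensy code; the divisor and the example readings are placeholders, and on the microcontroller you would do the same per sample in C):

    import numpy as np

    def to_mono(left, right, divisor=1.75):
        # left, right: arrays of ADC readings (or float samples).
        # Dividing by 2 is the safe average; 1.5-1.75 keeps the perceived
        # level closer to the original at a small risk of clipping.
        return (left.astype(np.float64) + right.astype(np.float64)) / divisor

    # Example with 10-bit ADC readings centred on 512:
    left = np.array([512, 700, 300, 512])
    right = np.array([512, 650, 400, 512])
    mono = to_mono(left, right)   # feed this array to the FFT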

How to recognize if an audio sample has been compressed and then decompressed?

Some years ago I made a music audio recording, and I can't find the original WAV files, I have only compressed MP3s. Now I found an audio CD, but I don't know if it was made using the original, uncompressed WAVs, or if it was made from compressed MP3 or OGG files.
Is there a way to detect whether an audio sample has been compressed and decompressed using a lossy codec such as MP3, OGG, ..., without having the original to compare to?
Update:
Following MisterHenson's suggestion, I plotted the spectra of the two samples, with obvious differences in the graphs:
The sample from the CD:
The sample from the MP3:
This practically solves my current problem, but I still have these open questions:
If the spectra were visually indistinguishable, I wouldn't know whether there is genuinely no difference, or I just can't see it (i.e. the compression is of better quality). What else could I try?
Similarly what would I do if I didn't have the MP3 file to compare to, just a single audio sample?
Is there an automated method, that'd answer the question with a reasonable probability?
I made an example to show the typical spectral shape of MP3 transcodes, the source material being a Chopin nocturne: MP3 on top, lossless on bottom. All recordings have background noise of some amplitude, and that noise is faintly visible here. What the MP3 transcode (LAME's V2 preset in this case) does is create a hard limit at ~16 kHz. On a 320 kbps, 44.1 kHz sample-rate MP3, this hard limit appears at around 20 kHz, but it would still be visibly different in this image.
You can pick out this shelf without having the original lossless file for comparison. I'm willing to say all music has amplitude at frequencies above even 19kHz. Here's an example for which I do not have the lossless source file, just a 320kbps MP3. You can see the very hard limit at 20kHz as well as a milder cutoff at 19kHz. Were it lossless, that red blob in the middle would extend all the way up to 22kHz since the sample rate is 44.1kHz.
I would say this process is probably automatable, but I do not know of any attempts to automate it. If this were automated, though, I'd say it could pick Lossy from Lossless with much higher accuracy than you or I, by virtue of it being able to analyze the entire spectrum as opposed to just the high frequency cutoffs.
Full res images:
http://i.imgur.com/dezONol.jpg
http://i.imgur.com/1qokxAN.jpg
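If you want to produce this kind of plot yourself, here is a minimal sketch using scipy and matplotlib (assumed to be installed; the file name is hypothetical):

    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import spectrogram
    import matplotlib.pyplot as plt

    rate, data = wavfile.read("sample.wav")      # hypothetical file name
    if data.ndim > 1:
        data = data.mean(axis=1)                 # mix to mono for analysis

    f, t, Sxx = spectrogram(data, fs=rate, nperseg=4096)
    plt.pcolormesh(t, f, 10 * np.log10(Sxx + 1e-12))
    plt.ylabel("Frequency [Hz]")
    plt.xlabel("Time [s]")
    plt.title("Look for a hard shelf around 16-20 kHz")
    plt.show()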
The above approaches sound very promising, though maybe a little complicated. You might first try something easy, like checking the distribution of the least significant bit. In a natural sample, the LSB should be an almost exact 50/50 split between zeroes and ones (strictly, across many samples the counts follow a binomial distribution, but with millions or billions of bits any given sample will be ridiculously close to 50/50). In a sample that has been through lossy compression, you will often find an unlikely LSB distribution.
Something like this (a rough sketch follows below):
1 -- extract LSB from each data point
2 -- apply chi-squared test to judge if distribution is unusual
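For what it's worth, a minimal Python sketch of those two steps might look like this (numpy and scipy assumed available, file name hypothetical, and 16-bit integer PCM assumed, since LSB statistics only make sense on integer samples):

    import numpy as np
    from scipy.io import wavfile
    from scipy.stats import chisquare

    rate, data = wavfile.read("sample.wav")      # 16-bit PCM assumed
    lsb = (data.astype(np.int64) & 1).ravel()    # step 1: least significant bits

    ones = int(lsb.sum())
    zeros = lsb.size - ones
    stat, p = chisquare([zeros, ones])           # step 2: test against 50/50

    print(f"zeros={zeros} ones={ones} chi2={stat:.2f} p={p:.4f}")
    # A very small p-value suggests the LSBs are not uniformly
    # distributed, which is one hint of lossy processing.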
Here is the deal.
A raw sample (or a raw piece of sound) is encoded at a certain quality. Some sound cards can go further, up to 64-bit sampling, but let's assume that we have sound files of a certain KNOWN quality. CD quality is fine for the human ear; a studio, though, would use higher-quality samples, with 24-bit as a typical standard.
So you have a waveform filename.wav with a sample rate of 44100 Hz. What does that mean? It means the computer takes 44100 samples every second to represent the sound almost exactly. Is the sound original? That depends on how it was made. If it was produced entirely on your computer by a piece of software using a default 16-bit sound card, yes, it is. If it came from an analogue recording, though, it loses some quality in the digitization at 44100 Hz; fortunately that loss is not very significant for the human ear.
NOTE THAT recording straight to mp3 is a bad idea for professional work. But since mp3 recordings do exist... this adds complexity to your question. :P
So some sound quality is lost on digitization with a 16-bit sound card. A similar thing happens when you encode something to mp3. Check out your picture: above 17000 Hz there is no sound. It was cut off to make the file significantly smaller without doing significant damage to the audio quality. Is it the same piece of sound? No. It sounds the same, though. But a sound engineer LOVES original, good-quality samples, precisely because of the information that is NOT cut.
Imagine me making an original sound so balanced and compressed that even after an mp3 conversion it is hard to tell whether it is the original or not: using equalizers to cut any sharp edges, and gate effects to normalize it to an extreme, with my sound generators being 8-bit oscillators passed through some effects and filters. If I convert that back to a waveform, there might be no visible difference.
For instance:

    [UNCHANGED FREQUENCIES][CUT FREQUENCIES]
    Waveform: =================================
    mp3:      =======================
    Waveform: =======================

    [UNCHANGED FREQUENCIES][CUT FREQUENCIES]
    Waveform: =================
    mp3:      =================
    Waveform: =================
The following seems impossible to me (unless the converter has bugs, something that can be heard):
    [UNCHANGED FREQUENCIES][CUT FREQUENCIES]
    Waveform: =========================
    mp3:      =======================
    Waveform: =============================
So your question depends on the original source used for the first waveform. The good news is that a sample is RARELY that limited and compressed. So it seems to me that the CD you have probably sounds like the original waveform, while, as you can see, the mp3 has frequencies cut out. To be sure, of course, you need a frequency analyzer and a spectrum, as MischaNix has already shown.
There are many mp3 encodings too: some use a constant (static) bitrate, some a variable (dynamic) one, some cut more sound information and some less, and some are bigger than others for that reason. There are lossless formats as well, and then there is ogg, which is small and also has great quality. This question can become a huge topic for no reason here, so I will not go into all of that.
If the issue is identifying an original sample, your pictures show significant differences between the two. A waveform made from the cut-down mp3 should look like that cut-down version; you cannot get information out of nothing.
Burn the mp3 to a CD, rip it back to a wave, and compare the new waveform with the old one and with the mp3 waveform. It will probably not match, so you may have hit the jackpot here: it is possible you have an original backup on your hands.
From now on, though, try to keep the raw material and store it on a CD or DVD before discarding anything, or at least keep good uncompressed samples in a backup.
Open questions:
If the spectra were visually indistinguishable, I wouldn't know if there is a real difference, or that I just can't distinguish them.
Correct. But this would rarely happen unless it was done to the sample on purpose.
Why ask such a question? :) Do you have steganography in mind?
If so, keep in mind the nature of the piece of sound you are going to use: bare samples are not appropriate, "finished songs" are!
Similarly what would I do if I didn't have the MP3 file to compare to, just a single audio sample?
Since there are many mp3 encoding settings of different qualities, you can check whether the lowest quality was used. If not, there is uncertainty because of what the compression can do. If that applies to the whole sample, then you have to ask whether compression was needed at all. That's why you cannot be certain with a song: you don't record with SO much compression in the first place. I guess this is another meta-reason why you need a natural sound. So if it is a plain recording, you might be lucky.
With a finished, mastered song, things get rough again. It comes down to the nature, the type, of the sound. With a plain recording it is easier to figure out what is going on, provided you know it was recorded as a waveform; an mp3 recording is of course a waste of time. A finished song, on the other hand, nowadays runs compressors, limiters, gates and compressor chains into the ground; the amount of such processing in modern mastering is enormous. So... you will really need luck to work out whether the piece was lossily compressed at some point if you don't have an original waveform to begin with.
Is there an automated method, that'd answer the question with a reasonable probability?
None that I know. Sorry. :(
But that doesn't mean that nobody can make one.
BUT!
A stereo sample is usually split into two channels, left and right. If you have a spectrum analyzer in a Digital Audio Workstation and look only at the left channels of two different samples, you can see on the fly whether they are the same or not, I guess.
To understand what I mean, take a look at THIS link; go to 05:00 and just watch the interface.
Phew. Hope this will help you further, since it took some time. :P
Cheers.
Edit: Fixing some stuff here and there.
I found a description of the problem, a solution, and an implementation in Python by Maurits van der Schee (it works on FLAC, though).
From the sample only the first 30 seconds are analyzed. For every second the frequency spectrum of the sample is computed by applying a Hanning window and doing a Fast Fourier Transform. These spectrums are added, so that eventually you end up with 30 stacked spectrums. These are divided by 30 to get the average spectrum. Then the spectrum is normalized using log10. After that we applied a rolling average on the spectrum with a window size of 1/100th of the frequency, being 44100/100 = 441 samples.

If there is an unnatural cutoff in the frequency spectrum, this cutoff is the thing we need to find. We sweep the spectrum from the 44100th back to the 1st frequency, where the variable frequency is f. As soon as the magnitude at f-220 is more than 1.25 higher than the magnitude at f, and the magnitude at f is no bigger than 1.1x the magnitude at 44100, we have found the cutoff point. The cutoff point is multiplied by 100 and divided by the frequency to get the percentage of the spectrum not cut off.
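Below is a rough Python/numpy sketch of the approach as described above; it is not Maurits van der Schee's code. The bin counts and thresholds are taken loosely from the quoted text ("1.25 higher" is read as a ratio here, though it may mean an additive difference in the log domain), so treat them as tunable.

    import numpy as np
    from scipy.io import wavfile

    rate, data = wavfile.read("sample.wav")       # hypothetical input, >= 1 s long
    if data.ndim > 1:
        data = data.mean(axis=1)                  # analyse a mono mix
    data = data.astype(np.float64)

    seconds = min(30, len(data) // rate)          # first 30 seconds at most
    window = np.hanning(rate)
    spectrum = np.zeros(rate // 2 + 1)
    for s in range(seconds):                      # one windowed FFT per second
        chunk = data[s * rate:(s + 1) * rate] * window
        spectrum += np.abs(np.fft.rfft(chunk))
    spectrum /= seconds                           # average spectrum
    spectrum = np.log10(spectrum + 1e-12)         # log10 "normalisation"
    spectrum -= spectrum.min()                    # assumption: shift to >= 0 so the
                                                  # ratio tests below stay meaningful

    # Rolling average over roughly 1/100th of the bins (the text says 441 of
    # 44100; an rfft of one second has about half as many bins).
    win = max(1, len(spectrum) // 100)
    smooth = np.convolve(spectrum, np.ones(win) / win, mode="same")

    # Sweep from the top of the spectrum downwards looking for the shelf.
    top = smooth[-1]
    cutoff = len(smooth) - 1
    for f in range(len(smooth) - 1, 220, -1):
        if smooth[f - 220] > 1.25 * smooth[f] and smooth[f] <= 1.1 * top:
            cutoff = f
            break

    print("spectrum retained up to about",
          round(100.0 * cutoff / (len(smooth) - 1), 1), "percent")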
Things to look for:
Cut-off frequency changing on frame boundaries (not going to be a 100% hard cut, but look for "audible" to "inaudible" and vice versa)
Frequencies disappearing or appearing on frame boundaries (again, not 100%)
Noise levels changing on frame boundaries (actually pretty solid for lossy codecs)
For MP3, the frame boundaries are precisely every 1152 samples, though you might be able to "see" the granules every 576 samples.
For Vorbis, the frame boundaries are typically every 128 or 1024 samples depending on transients the encoder "saw". You can probably get away with doing every 128 samples...
You'll have to research the other formats to know their frame sizes (I don't know them offhand).

How do I combine digital audio?

I have two wave files for which I have the digital samples extracted. I need to play both at the same time. How do I combine the two sets of samples to produce an output sample that is both sounds playing together? How is this done for N simultaneous sounds? Is it as simple as adding the samples and taking the average?
Combining sounds (at the same sample rate) just involves an element-wise addition of the two arrays. You do not need to divide by N unless you have an issue with headroom. If the value of the sum exceeds the maximum output level, this will result in clipping, giving an audible distortion.
Unless you have a large N, or a small N where each of your source sounds is normalised to the maximum output level, you should not have a problem with clipping. If you know the waveforms of the signals in advance, you can simply scale each waveform by the same scalar value beforehand so that the output does not clip. Alternatively, if you are rendering the sound offline, you can just sum your waveforms and then normalise the composite signal so it does not clip.
If you are dealing with a live input stream of N sources, you can minimise clipping using a limiter.
http://en.wikipedia.org/wiki/Dynamic_range_compression#Limiting
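As a rough numpy sketch of the above (the function names are made up; a real limiter would smooth its gain rather than hard-clip):

    import numpy as np

    def mix_offline(sources):
        # sources: list of equal-length float arrays in the range -1.0..1.0.
        mixed = np.sum(sources, axis=0)          # element-wise addition
        peak = np.max(np.abs(mixed))
        if peak > 1.0:                           # normalise only if it would clip
            mixed /= peak
        return mixed

    # For a live stream you cannot look ahead, so a crude safety net is to
    # hard-limit each summed block instead:
    def hard_limit(block, ceiling=1.0):
        return np.clip(block, -ceiling, ceiling)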
Yes, you can simply sum the two, and divide by two.
Indeed, that's the average.
When both samples have the same sample-rate it's really as straightforward as that.
Combine digital audio by adding the individual samples together.
There will be a loudness increase when combining several uncorrelated sound sources, but the relationship between loudness and N number of sources is not linear. Four simultaneous sounds will be approximately twice as loud as one, not four times as loud. (That's a 6dB increase.)
As you suspected, you do need to keep the final output volume in mind when playing back multiple sounds simultaneously, but dividing by N when combining N simultaneous sources is not the right way to do it.
The easiest way is to add a volume control to your application. The user will turn down your application when it's too loud. This is simple and usually the correct approach when combining a small number of sounds.
A manual volume control is not the right solution for all problems, though. Take a first-person shooter, for example: imagine running from a quiet corridor out into a raging gun battle. The sound environment goes from very quiet, with a few sound sources, to very loud, with many. In cases like this you'll likely need some form of automatic gain control.

Basic unit of Sound?

If we consider computer graphics to be the art of image synthesis, where the basic unit is a pixel, what is the basic unit of sound synthesis?
[This relates to programming as I want to generate this via a computer program.]
Thanks!
The basic unit is a sample.
In a WAVE file, a sample is just an integer specifying the position the speaker cone should move to.
The sample rate determines how often a new sample is fed to the speakers (I'm not entirely sure how this part works, but it does get converted to an analog signal first). The samples are typically laid out in the file one right after another.
When you plot all the samples with x-axis being time and y-axis being sample_value, you can see the waveform.
In a wave file, the bit size of a sample can (theoretically) be anything from 1 to 65535 bits, and it remains constant throughout the file; in practice, 16 or 24 bits are typical.
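For example, a small Python sketch (standard-library wave module plus numpy; the file name is hypothetical and 16-bit PCM is assumed) that pulls the raw samples out of a WAV:

    import wave
    import numpy as np

    w = wave.open("example.wav", "rb")        # hypothetical file
    rate = w.getframerate()                   # samples per second
    channels = w.getnchannels()
    frames = w.readframes(w.getnframes())     # raw bytes, samples back to back
    w.close()

    # Assuming 16-bit PCM: each sample is a signed 16-bit integer.
    samples = np.frombuffer(frames, dtype=np.int16)
    if channels > 1:
        samples = samples.reshape(-1, channels)   # one column per channel

    print(rate, "samples per second,", len(samples), "sample frames")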
Computer graphics can also have vector shapes as basic units, not just pixels. Generally, vector graphics are generated via computer tools while captured data tends to appear as a grid of pixels (corresponding to an array of sensors in a camera or other capture device). Obviously there is considerable crossover between those classifications.
Similarly, there are sampled (such as .WAV) and generative (such as .MIDI) forms of computer audio. In the sampled case, the smallest unit is a single sample. Just as an array of pixels in the brightness, x-, and y-dimensions comes together to form an image, an array of samples in the loudness and time dimensions comes together to form a sound. In the generative case, it will be something more like a single tone rendered in a particular voice, just as vector graphics have paths drawn with particular textures.
A pixel can have a value and be encoded in digital bitmap samples. The same properties apply to sound and digital audio samples.
A pixel is a physical device that can only render the amplitudes of 3 frequencies of light (red, green, blue) at a time. A speaker is a physical device that can render the amplitudes of a wide range of frequencies (~40,000) at a time. The bit resolution of a sample (the number of bits used to store the value of a sample) mainly determines how many colors/tones can be rendered - the fidelity of the physical playback device.
Also, as patterns of pixels can be encoded or compressed, most patterns of sound samples are also encoded or compressed (or both).
The fundamental unit of signal processing (of which audio is a special case) would be the sample.
The frequency at which you need to sample a signal depends on the maximum frequency present in the waveform. The sampling theorem states that it is sufficient to sample at a rate at least twice the maximum frequency present in the signal.
http://en.wikipedia.org/wiki/Sampling_theorem
The human ear is sensitive to sounds up to around 20 kHz (the upper limit falls with age). This is why music on CD is sampled at 44.1 kHz.
It is often more useful to think of music as being comprised of individual frequencies.
http://www.phys.unsw.edu.au/jw/sound.spectrum.html
Most sound analysis and creation is based on this idea.
Related concepts:
Psychoacoustics: Human perception of sound. Relates to modern sound compression techniques such as mp3.
Fourier series: How complex waveforms are composed of individual frequencies.
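To make that concrete, here is a small numpy sketch (the frequencies and amplitudes are arbitrary) that builds a signal from two sine components and recovers their amplitudes with an FFT:

    import numpy as np

    rate = 44100
    t = np.arange(rate) / rate                    # one second of samples
    signal = 0.6 * np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 880 * t)

    spectrum = np.abs(np.fft.rfft(signal)) / (rate / 2)   # amplitude per bin
    freqs = np.fft.rfftfreq(rate, d=1.0 / rate)           # bin centre frequencies

    for f in (440, 880, 1000):
        print(f, "Hz amplitude:", round(spectrum[np.argmin(np.abs(freqs - f))], 3))
    # Prints roughly 0.6 and 0.3 at the two component frequencies, ~0 elsewhere.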
I would say the basic unit of sound synthesis is the sine wave. But your definition of synthesis is perhaps different from what audio people mean by sound synthesis: sound synthesis is the creation of sound from the fundamental components of sound.
With sine waves, we can synthesise sounds using many techniques, such as subtractive synthesis, additive synthesis or FM synthesis.
Fourier theory states that every sound is a summation of sine waves of differing phases, frequencies and amplitudes.
OK, so how do we represent a sine wave on a computer? Well, a sine wave is generated as a buffer (array) of 'samples' that have been produced by a function or read from a table. The same technique applies to any sound captured on a computer.
A 'sample' is typically represented as a number between -1 and 1 that directly corresponds to the amplitude of the sound at a given moment in time. A typical sound recorded at 16-bit depth has 65536 (2^16) possible amplitude values. When recording, a sample is typically captured 44100 times per second of sound; this is called the sampling frequency, or simply the sample rate.
On playback from your computer, each sample passes through a Digital-to-Analogue converter and drives a vibration of your PC speaker, which in turn causes your ear to perceive the recorded sound.
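A minimal Python/numpy sketch of such a buffer (the sample rate, frequency and 16-bit scaling are just example values):

    import numpy as np

    rate = 44100                       # samples per second
    freq = 440.0                       # A4
    t = np.arange(rate) / rate         # one second
    buffer = np.sin(2 * np.pi * freq * t)          # floats in -1.0..1.0

    # Quantise to 16-bit depth: 2**16 = 65536 possible amplitude values.
    pcm = np.round(buffer * 32767).astype(np.int16)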
Sound can be expressed in several different units, but the most common in synthesis/computer music is the decibel (dB), a relative logarithmic measure of amplitude. Levels are normally given relative to the maximum amplitude of the audio system.
When measuring sound in "real life", the units are normally A-weighted decibels, or dB(A).
The frequency of a sound (i.e. its pitch) is the rate at which its amplitude oscillates over time, or in the digital world, over samples. The number of samples per unit of real time is called the sampling rate; conventional hi-fi systems use a sampling rate of 44.1 kHz (44,100 samples per second), and synthesis/recording software usually supports up to 96 kHz.
Everything in the digital audio domain can be represented as a waveform, with the X-axis representing time (or sample number) and the Y-axis representing amplitude.
Frequency and amplitude of the wave are what make up sound. That is for a single tone. Music, and for that matter most noise, is a composite of multiple simultaneous sound waves superimposed on one another.
The unit for amplitude is the Bel (we use tenths of a Bel, hence the term decibel). The unit for frequency is the Hertz.
That being said, synthesis of music is a large field.
Bitmapped graphics are based on sampling the amplitude of light in a 2D space, where each sample is digitized to a given bit depth and often converted to a logarithmic representation at a different bit depth. The samples are always positive, since you can't be darker than pure black. Each of these samples is called a pixel.
Sound recording is most often based on sampling the magnitude of sound pressure at a microphone, where the samples are taken at constant time intervals. These samples can be positive or negative with respect to perfect silence. Most often these samples are not converted to a logarithm, even though sound is perceived in a logarithmic fashion just as light is. There is no special term to refer to these samples as there is with pixels.
The Bels and Decibels mentioned by others are useful in the context of measuring peak or average sound levels. They are not used to describe the individual sound samples.
You might also find it useful to know how sound file formats compare to image file formats. WAVE is an uncompressed format specific to Windows and is analogous to BMP. MP3 is a lossy compression analogous to JPEG. FLAC is a lossless compression analogous to 24-bit PNG.
If computer graphics are colored dots in 2 dimensional space representing a 3 dimensional space, then sound synthesis is amplitude values regularly partitioned in time representing musical events.
If you want your result to sound like music (the kind of music most people like at least), then you are either going to use some standard synthesis techniques, or literally waste decades of your life reinventing them from scratch.
The most basic techniques are additive synthesis, in which the individual elements are the frequencies, amplitudes, and phases of sine oscillators; subtractive synthesis, where you work with filter coefficients and a complex input waveform; frequency modulation synthesis, where you work with modulation depths and rates of stages of modulation; granular synthesis where short (hundredths to tenths of a second long) enveloped pieces of a recorded sound or an artificial waveform are combined in immense numbers. Each of these in practice uses parameters that evolve over the course of a note, and often you will mix elements of various techniques into a larger instrument.
I recommend this book; though it doesn't include the math for many of the concepts, it at least lays the groundwork for them and gives a nice overview of the techniques.
In practice you wouldn't waste your time going sample by sample to make music, any more than you would go pixel by pixel to render 3D (in other words, yes, go sample by sample if you are making a tool for other people to make music with, but that is far too low a level if your interest is in making music itself).
Probably the envelope. A tone/note has a shape described by attack, decay, sustain, release (ADSR).
The byte, or word, depending on the bit-depth of the sound.

How can I determine how loud a WAV file will sound?

I have a bunch of different audio recordings in WAV format (all different instruments and pitches), and I want to "normalize" them so that they all sound approximately the same volume when played.
I've tried measuring the average sample magnitude (the sum of all absolute values divided by the number of samples), but normalizing by this measurement doesn't work very well. I think this method isn't working because it doesn't take into account the frequency of the sounds, and I know that higher-frequency recordings sound louder than lower-frequency sounds of the same amplitude.
Does anyone know a good method for measuring the loudness of a sound?
Root Mean Square is often used to estimate the loudness of sound files. This is because a sound that is very loud might not be perceived that way if it is very short. Also remember that power increases with the square of amplitude.
The audio geeks at Hydrogen Audio know a ton about this stuff...check out their free Replay Gain software. You may not need to do any programming at all.
EDIT: Included comment feedback on power vs. amplitude.
To add to PeterAllenWebb's response:
Before you calculate the RMS, you should "center" your sample first (think of a 5-minute .wav where each sample has the maximum +amplitude). The best way to do that is to use a highpass filter at a subsonic frequency.
That would still not take into account the frequencies that humans are most sensitive to. To do that, you could use A-weighting. There's a page where you can calculate it online:
http://www.diracdelta.co.uk/science/source/a/w/aweighting/source.html
The code seems to be here:
http://www.diracdelta.co.uk/science/source/a/w/aweighting/multicalc.js
Not being an expert on audio, and adding to the previous comment: decide what you consider the "shortest amount of time for peak power", convert the wave to raw floating point, compute the RMS over successive chunks of that length of time, and take the maximum; that gives you your highest peak power.
To reiterate what some other people have said, use RMS value to estimate the "loudness" of a passage of sound.
But, if you're dealing with impulsive sounds like plucking or drum hits, you'd want to do a sliding RMS value and pick out only the peak RMS value. Measure 100 ms of the sound, slide the window, measure again, etc. and then normalize according to the largest value you find.
Definitely remove any DC value before doing the RMS, and A-weighting will make it more like how we hear. Here's code for A-weighting in MATLAB/Octave and Python.
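A rough Python/numpy sketch of that sliding-window idea (the window and hop sizes and the target RMS are arbitrary placeholders; A-weighting is left out for brevity):

    import numpy as np

    def peak_rms(x, rate, window_s=0.1, hop_s=0.05):
        # x: float samples. Remove the DC value first, as suggested above.
        x = x - np.mean(x)
        win = int(window_s * rate)
        hop = int(hop_s * rate)
        peak = 0.0
        for start in range(0, max(1, len(x) - win), hop):
            chunk = x[start:start + win]
            peak = max(peak, np.sqrt(np.mean(chunk ** 2)))
        return peak

    def normalise(x, rate, target_rms=0.2):
        # Scale so the loudest 100 ms window hits the target RMS.
        p = peak_rms(x, rate)
        return x if p == 0 else x * (target_rms / p)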
I might be way off here, but, if you have wavepad you can load in multiple files and mess with the volumes a little bit so they are all the same. Also, if you have certain sections of a file that are louder, you can select that section and lower the volume for that one section.
EDIT: And sorry, it's not really a "method" for measuring volume, but if you just need to make them all the same this should work fine.

Resources