Search different audio files for equal short samples

Consider multiple (at least two) different audio files, like several different mixes or remixes. Naively I would say it must be possible to detect samples, especially the vocals, that are almost equal in two or more of the files, of course only if the vocal samples aren't modified, stretched, pitched or reverbed too much, etc.
So with what kind of algorithm or technique could this be done? Let's say the user would try to set time markers in all files as well as possible, describing the data windows to compare, which contain the presumably equal sounds, vocals, etc.
I know that no direct approach that tries to compare the WAV data directly in any way is useful. But even if I have the frequency-domain data (e.g. from an FFT), I would have to use a comparison algorithm that shifts the comparison windows through the time scale, since I cannot assume that the samples I want to find are time-synchronized across all files.
Thanks in advance for any suggestions.

Hi, this is possible!
You can use a technique called LSH (locality-sensitive hashing), which is very robust.
Another way to do this is a spectrogram analysis of your audio files (audio fingerprinting):
Construct the song database
1. Record your full song.
2. Transform the sound into a spectrogram.
3. Slice your spectrogram into chunks and take the three or four highest-energy frequencies from each.
4. Store all of those points.
Match the song
1. Record a short sample.
2. Transform the sound into a spectrogram the same way.
3. Slice the spectrogram into chunks and take the three or four highest-energy frequencies from each.
4. Compare the collected frequencies against your song database.
5. Your match is the song with the highest hit count!
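For illustration, here is a minimal sketch of that spectrogram-peak idea in Python with NumPy and SciPy; the chunk length, FFT size and number of peaks per chunk are arbitrary choices of mine, not values prescribed by this answer or the article linked below.

import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

def fingerprint(path, peaks_per_chunk=4):
    # one frozenset of peak frequency bins per spectrogram time chunk
    rate, data = wavfile.read(path)
    if data.ndim > 1:
        data = data.mean(axis=1)          # mix stereo down to mono
    freqs, times, sxx = spectrogram(data, fs=rate, nperseg=4096)
    return [frozenset(np.argsort(sxx[:, i])[-peaks_per_chunk:])
            for i in range(sxx.shape[1])]

def match_score(sample_prints, song_prints):
    # how many chunks of the short sample re-appear anywhere in the full song
    song_set = set(song_prints)
    return sum(1 for p in sample_prints if p in song_set)

# the database song with the highest match_score() is the best candidate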
You can see here how to do it:
http://translate.google.com/translate?hl=EN&sl=pt&u=http://ederwander.wordpress.com/2011/05/09/audio-fingerprint-em-python/
ederwander

Related

How to find what time a part of audio starts and ends in another audio?

I have two audio files in which a sentence is read (like singing a song) by two different people, so they have different lengths. They are vocals only, with no instruments.
A1: Audio File 1
A2: Audio File 2
Sample sentence : "Lorem ipsum dolor sit amet, ..."
I know the time every word starts and ends in A1. And I need to automatically find what time every word starts and ends in A2 (any language, preferably Python or C#).
The times are saved in XML, so I can split the A1 file by word. So, how do I find the sound of a word in another audio file that has a different duration (of the word) and a different voice?
So from what I read, it seems you would want to use Dynamic Time Warping (DTW). Of course, I'll leave the explanation to Wikipedia, but it is generally used to recognize speech patterns without being thrown off by differences in pronunciation or speed.
Sadly, I am more well-versed in C, Java and Python, so I will be suggesting Python libraries.
fastdtw
pydtw
mlpy
rpy2
With rpy2 you can actually use R's library and use their implementation of DTW in your python code. Sadly, I couldn't find any good tutorials for this but there are good examples if you choose to use R.
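For example, with fastdtw the alignment path between the two recordings can be computed in a few lines. Here is a minimal sketch; turning the audio into MFCC feature frames via librosa is my own assumption about the feature step, not something prescribed above, and the file names are placeholders.

import numpy as np
import librosa                                  # assumed here for feature extraction
from fastdtw import fastdtw
from scipy.spatial.distance import euclidean

# describe each recording as a sequence of MFCC feature vectors
y1, sr1 = librosa.load("A1.wav", sr=None)
y2, sr2 = librosa.load("A2.wav", sr=None)
mfcc1 = librosa.feature.mfcc(y=y1, sr=sr1).T    # shape: (frames, coefficients)
mfcc2 = librosa.feature.mfcc(y=y2, sr=sr2).T

# DTW: path is a list of (frame_in_A1, frame_in_A2) index pairs
distance, path = fastdtw(mfcc1, mfcc2, dist=euclidean)
path = np.array(path)

def a1_frame_to_a2_frame(frame):
    # map a known word-boundary frame in A1 to the aligned frame in A2
    idx = np.argmin(np.abs(path[:, 0] - frame))
    return path[idx, 1]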
Please let me know if that doesn't help, Cheers!
My approach for this would be to record the dB volume at a constant interval (such as every 100 milliseconds) and store these readings in a list or array. I found a way of doing this in Java here: Decibel values at specific points in wav file. It is possible in other languages. Meanwhile, take note of the max volume:
double max = 0;
// 'volumes' holds the dB reading taken every 100 ms (see the linked answer)
for (double currentVolume : volumes) {
    if (currentVolume > max) {
        max = currentVolume;
    }
}
Then divide the maximum volume by an editable threshold; in my example I went for 7. Say the maximum volume is 21, then 21/7 = 3 dB; let's call this measure X.
We then use a second threshold, such as 1, and multiply it by X. Whenever the volume is greater than this new value (1*X), we consider that to be the start of a word. When it drops below that value, we consider it to be the end of a word.
Visual explanation
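A minimal sketch of that thresholding idea in Python; the 100 ms interval, the divisor 7 and the factor 1 are the example values from above, while reading the file with scipy.io.wavfile, using RMS for the volume and shifting the dB values so the quietest reading is 0 are assumptions of mine.

import numpy as np
from scipy.io import wavfile

rate, data = wavfile.read("speech.wav")           # assumed mono 16-bit file
data = data.astype(np.float64)

step = int(rate * 0.1)                            # one volume reading every 100 ms
volumes = []
for start in range(0, len(data) - step, step):
    chunk = data[start:start + step]
    rms = np.sqrt(np.mean(chunk ** 2)) + 1e-12
    volumes.append(20 * np.log10(rms))
volumes = np.array(volumes)
volumes -= volumes.min()                          # quietest reading becomes 0 dB

x = volumes.max() / 7.0                           # first threshold: max / 7
limit = 1.0 * x                                   # second threshold * X

words, inside, word_start = [], False, 0.0
for i, v in enumerate(volumes):
    t = i * 0.1
    if v > limit and not inside:                  # rising edge: a word starts
        inside, word_start = True, t
    elif v <= limit and inside:                   # falling edge: the word ends
        inside = False
        words.append((word_start, t))
print(words)                                      # (start, end) times in seconds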
Without knowing how sophisticated your understanding of the problem space is, it isn't easy to know whether to point you in a direction or provide detail on why this problem is non-trivial.
I'd suggest that you start with something like https://cloud.google.com/speech/ and try to convert the speech blocks to text and then perform a similarity comparison on these.
If you really want to try to do the processing yourself you could look at doing some spectrographic analysis. Take the waveform data and perform an FFT to get frequency distributions, then look for marker patterns that align your samples.
With only single-word comparison of different speakers you are probably not going to be able to apply any kind of neural network, unless you can train it on the two speakers' entire speech sets and use the network to then compare the individual word chunks.
It's been a few years since I did any of this so maybe it's easier these days but my recollection is that although this sounds conceptually simple it might prove to be more difficult than you realise.
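If you do go the spectrographic route, here is a minimal sketch of one way to align two clips with SciPy: compute a spectrogram for each, reduce each to an energy envelope, and cross-correlate the envelopes to estimate the offset. This is my own illustration of the idea, not a method this answer prescribes.

import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

def envelope(path):
    rate, data = wavfile.read(path)
    if data.ndim > 1:
        data = data.mean(axis=1)                  # mix to mono
    _, times, sxx = spectrogram(data, fs=rate, nperseg=1024)
    env = sxx.sum(axis=0)                         # total energy per time slice
    return (env - env.mean()) / (env.std() + 1e-12), times

env_a, times_a = envelope("word_a.wav")
env_b, times_b = envelope("word_b.wav")

# the peak of the cross-correlation tells us how far to shift B to line up with A
corr = np.correlate(env_a, env_b, mode="full")
lag = np.argmax(corr) - (len(env_b) - 1)
dt = times_a[1] - times_a[0]                      # seconds per spectrogram slice
print("best alignment offset:", lag * dt, "seconds")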
Dynamic Time Warping looks like the most promising suggestion.
The secret sauce of the approach below: pointA minus pointB is zero if both points have the same value. The idea is to find the file index offset at which this difference, accumulated across the overlap, is as close to zero as possible when comparing the raw audio curves of a pair of input files (or close to zero in a relative sense if the two source audios differ even slightly). In practice you accumulate the absolute value (or square) of each difference, so that positive and negative deviations don't cancel each other out.
The approach is to open both files and pluck out the raw audio curve of each. Define two variables, bestSum and currentSum; set bestSum to MAX_INT_VALUE (any arbitrarily high value) and reset currentSum to zero for each candidate offset. Iterate across both files simultaneously, obtaining the integer value of the current raw audio sample from file A and the same from file B, and for each pair subtract the file B value from the file A value. Continue this loop until you reach the end of one file, adding the (absolute) difference to currentSum on every iteration. After the loop finishes, update bestSum to become currentSum if currentSum < bestSum, and store the current file index offset.
Create an outer loop which repeats all of the above, introducing a time offset into one file before relaunching the inner loop. Your common audio is at the offset that produced the minimum total sum, that is, the offset at which you encountered bestSum.
Do not start coding until you have gained the intuition that the above makes perfect sense.
I highly encourage you to plot the raw audio curve of one file to confirm you are accessing this sequence of integers; do this before attempting the algorithm above.
It helps to visualize this by viewing each input source as a curve: you keep one curve steady and slide the other audio curve left or right until the curve shapes match or get very close to matching.
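A minimal sketch of that sliding-offset comparison in Python; it uses the sum of absolute differences as discussed above, reads the raw curves with scipy.io.wavfile (my choice), assumes both files are mono at the same sample rate, and only tries every 100th offset to keep the example fast.

import numpy as np
from scipy.io import wavfile

rate_a, a = wavfile.read("file_a.wav")
rate_b, b = wavfile.read("file_b.wav")
a = a.astype(np.int64)                        # the raw audio curve of file A
b = b.astype(np.int64)                        # the raw audio curve of file B (the shorter one)

best_sum = None
best_offset = 0
for offset in range(0, len(a) - len(b), 100):
    # inner loop, vectorized: accumulate the per-sample differences at this offset
    current_sum = np.abs(a[offset:offset + len(b)] - b).sum()
    if best_sum is None or current_sum < best_sum:
        best_sum = current_sum
        best_offset = offset

print("common audio most likely starts", best_offset / rate_a, "seconds into file A")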

Finding the frequency per second from an audio file

I am currently making a game, similar to Audiosurf. I am trying to find the frequency of an audio file (like .mp3 or .wav) at every second. Based on those values I will build the level. I have been doing a lot of research on this topic. I have a way to get the samples within the audio file; I am using the Unity engine to make this game. I am thinking about breaking the samples into one-second blocks (using the sample rate), then doing an FFT on each of those and finding the highest frequency within each. Am I on the right path? Can anyone offer any suggestions, or if I am not on the right path, can anyone correct me? Any help would be appreciated.
You are on the right path with the FFT part and splitting your samples into bins. Here is a library for that: http://www.fftw.org/
Where it gets hairy is picking your frequency. Let me tell you off the bat: just throw away the highest frequency in the spectrum, it's part of the static. Maybe you could use the lowest frequency to catch the bassline, but likely the bass drums and even atmospheric sound effects will interfere there.
Now, provided you do find some heuristic that allows you to pick the "frequency" at a given moment in the song, it most likely won't correlate with the music itself. You are really better off reworking your idea to use the frequency spectrum at each moment, not just a single frequency.
EDIT: The Fourier transform will provide you with an array of complex numbers, one per frequency bin; the magnitude of each complex value gives the bin's amplitude and its angle gives the phase.
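A minimal sketch of that per-second analysis with NumPy: split the samples into one-second blocks, FFT each block and keep the magnitude spectrum, which is what you would feed into the level building. Loading via scipy.io.wavfile is my assumption; inside Unity you would pull the samples from an AudioClip instead.

import numpy as np
from scipy.io import wavfile

rate, data = wavfile.read("track.wav")
if data.ndim > 1:
    data = data.mean(axis=1)                      # mix to mono

for start in range(0, len(data) - rate, rate):    # one block per second of audio
    block = data[start:start + rate]
    mags = np.abs(np.fft.rfft(block))             # magnitude per frequency bin
    freqs = np.fft.rfftfreq(len(block), d=1.0 / rate)
    # the single strongest bin, if you still want one number (see the caveats above)
    peak_hz = freqs[np.argmax(mags)]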

How do I combine digital audio?

I have two wave files for which I have the digital samples extracted. I need to play both at the same time. How do I combine the two sample streams to produce an output where both sounds play together? How is this done for N simultaneous sources? Is it as simple as adding the samples and taking the average?
Combining sounds (at the same sample rate) just involves an element-wise addition of the two arrays. You do not need to divide by N unless you have an issue with headroom. If the value of the sum exceeds the maximum output level, this will result in clipping, giving an audible distortion.
Unless you have a large N, or a small N where each of your source sounds are normalised to the maximum output level, you should not have a problem with clipping. If you know the waveforms of the signals in advance, you can simply scale each waveform by the same scalar value beforehand so that the output does not clip. Alternatively, if you are rendering the sound offline, you can just sum your waveforms and then normalise the composite signal so it does not clip.
If you are dealing with a live input stream of N sources, you can minimise clipping using a limiter.
http://en.wikipedia.org/wiki/Dynamic_range_compression#Limiting
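A minimal sketch of the element-wise addition in NumPy, with a scaling step that only kicks in when the sum would actually clip; the 16-bit format and equal-length mono inputs are assumptions.

import numpy as np
from scipy.io import wavfile

rate, a = wavfile.read("a.wav")
_, b = wavfile.read("b.wav")

mix = a.astype(np.float64) + b.astype(np.float64)   # element-wise addition

limit = np.iinfo(np.int16).max                      # 32767 for 16-bit output
peak = np.max(np.abs(mix))
if peak > limit:                                    # scale down only if the sum clips
    mix *= limit / peak

wavfile.write("mix.wav", rate, mix.astype(np.int16))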
Yes, you can simply sum the two, and divide by two.
Indeed, that's the average.
When both samples have the same sample-rate it's really as straightforward as that.
Combine digital audio by adding the individual samples together.
There will be a loudness increase when combining several uncorrelated sound sources, but the relationship between loudness and N number of sources is not linear. Four simultaneous sounds will be approximately twice as loud as one, not four times as loud. (That's a 6dB increase.)
As you suspected you do need to keep in mind the final output volume when playing back multiple sounds simultaneously but dividing by N when combining N simultaneous sources is not the right way to do so.
The easiest way is to add a volume control to your application. The user will turn down your application when it's too loud. This is simple and usually the correct approach when combining a small number of sounds.
A manual volume control is not the right solution for all problems. For example a first person shooter. Imagine running from a quiet corridor out into a raging gun battle. The sound environment will go from very quiet with a few sound sources to very loud with lots of sound sources. In these cases you'll likely need some form of automatic gain control.

What exactly is a "Sample"?

From the OpenAL documentation it looks as if a sample is one single floating-point value, like let's say 1.94422.
Is that correct? Or is a sample an array of a lot of values? What are audio programming dudes talking about when they say "Sample"? Is it the smallest possible snippet of an audio file?
I imagine an uncompressed audio file to look like a giant array with millions of floating point values, where every value is a point in a graph that forms the sound wave. So every little point is a sample?
Exactly. A sample is a value.
When you convert an analog signal to its digital representation, you convert a continuous function to a discrete and quantized one.
It means that you have a grid of vertical and horizontal lines, and all the possible values lie on the intersections of the lines. The gap between vertical lines represents the distance between two consecutive samples; the gap between horizontal lines is the minimum difference you can represent.
On every vertical line you have a sample, which (in linear encoding) is equal to n times k, where k is the quantum, the minimum difference referenced above.
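For example, sampling a 1 kHz sine at 44.1 kHz with 16-bit linear encoding (so the quantum k is 1/32768 of full scale) could be sketched like this; the specific rates are just illustrative:

import numpy as np

sample_rate = 44100                              # horizontal grid: samples per second
k = 1.0 / 32768                                  # vertical grid: the 16-bit quantum

t = np.arange(0, 0.01, 1.0 / sample_rate)        # the sample instants (vertical lines)
analog = 0.5 * np.sin(2 * np.pi * 1000 * t)      # the continuous signal at those instants
n = np.round(analog / k).astype(np.int64)        # each sample becomes an integer n ...
quantized = n * k                                # ... representing the value n * k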
"I imagine an uncompressed audio file to look like a giant array with millions of floating point values, where every value is a point in a graph that forms the sound wave. So every little point is a sample?"
Yes, that is right. A sample is the value calculated by your A/D converter for that particular point in time. There's a sample for each channel (e.g. left and right in stereo mode); together, those samples form a frame.
According to the Wikipedia article on signal processing:
A sample refers to a value or set of values at a point in time and/or space.
So yes, it could just be a single floating point value. Although as Johannes pointed out, if there are multiple channels of audio (EG: right/left), you would expect one value for each channel.
In audio programming, the term "sample" does indeed refer to a single measurement value. Among audio engineers and producers, however, the term "sample" normally refers to an entire snippet of sound taken (or sampled) from a famous song or movie or some other original audio source.

Downsampling and applying a lowpass filter to digital audio

I've got a 44 kHz audio stream from a CD, represented as an array of 16-bit PCM samples. I'd like to cut it down to an 11 kHz stream. How do I do that? From my days of engineering class many years ago, I know that the stream won't be able to describe anything over 5500 Hz accurately anymore, so I assume I want to cut everything above that out too. Any ideas? Thanks.
Update: There is some code on this page that converts from 48 kHz to 8 kHz using a simple algorithm and a coefficient array that looks like { 1, 4, 12, 12, 4, 1 }. I think that is what I need, but I need it for a factor of 4x rather than 6x. Any idea how those constants are calculated? Also, I end up converting the 16-bit samples to floats anyway, so I can do the downsampling with floats rather than shorts, if that helps the quality at all.
Read up on FIR and IIR filters. These are the filters that use a coefficient array.
If you do a Google search for "FIR or IIR filter designer" you will find lots of software and online applets that do the hard job (getting the coefficients) for you.
EDIT:
This page here ( http://www-users.cs.york.ac.uk/~fisher/mkfilter/ ) lets you enter the parameters of your filter and will spit out ready-to-use C code...
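If you would rather compute the coefficients in code than use an online designer, SciPy can do it too; here is a minimal sketch for the 4x case, assuming 'samples' is your float array at 44.1 kHz (the tap count and exact cut-off are my choices):

import numpy as np
from scipy.signal import firwin, lfilter

fs = 44100
taps = firwin(numtaps=101, cutoff=5000.0, fs=fs)   # low-pass FIR coefficients, cut-off below 5512.5 Hz

filtered = lfilter(taps, 1.0, samples)             # apply the filter ...
downsampled = filtered[::4]                        # ... then keep every 4th sample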
You're right in that you need to apply lowpass filtering on your signal. Any signal over 5500 Hz will still be present in your downsampled signal, but 'aliased' to another frequency, so you'll have to remove it before downsampling.
It's a good idea to do the filtering with floats. There are fixed-point filter algorithms too, but those generally involve quality trade-offs. If you've got floats then use them!
Using DFTs for filtering is generally overkill and it makes things more complicated, because DFTs are not a continuous process but work on buffers.
Digital filters generally come in two tastes: FIR and IIR. They're generally the same idea, but IIR filters use feedback loops to achieve a steeper response with far fewer coefficients. This might be a good idea for downsampling because you need a very steep filter slope there.
Downsampling is sort of a special case. Because you're going to throw away 3 out of 4 samples there's no need to calculate them. There is a special class of filters for this called polyphase filters.
Try googling for polyphase IIR or polyphase FIR for more information.
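SciPy ships a ready-made polyphase resampler, so a minimal sketch of the 4x reduction is a one-liner (again assuming 'samples' is your float array):

from scipy.signal import resample_poly

downsampled = resample_poly(samples, up=1, down=4)   # polyphase filter + downsample by 4 in one step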
Notice (in addition to the other comments) that the simple, easy, intuitive approach of "downsample by a factor of 4 by replacing each group of 4 consecutive samples by their average value" is not optimal, but it is nevertheless not wrong, neither practically nor conceptually. That's because the averaging amounts precisely to a low-pass filter (a rectangular window, which corresponds to a sinc in frequency). What would be conceptually wrong is to downsample by just taking one of every 4 samples: that would definitely introduce aliasing.
By the way: practically any software that does some resampling (audio, image or whatever; example for the audio case: sox) takes this into account, and frequently lets you choose the underlying low-pass filter.
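The averaging variant described above is only a couple of lines with NumPy, assuming 'samples' is a float array whose length you trim to a multiple of 4:

import numpy as np

trimmed = samples[:len(samples) // 4 * 4]
downsampled = trimmed.reshape(-1, 4).mean(axis=1)   # each output sample is the average of 4 inputs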
You need to apply a lowpass filter before you downsample the signal to avoid "aliasing". The cutoff frequency of the lowpass filter should be less than the Nyquist frequency of the new, lower rate, which is half the target sample frequency (5.5 kHz in your case).
The "best" solution possible is indeed a DFT, discarding the top 3/4 of the frequencies, and performing an inverse DFT, with the domain restricted to the bottom 1/4th. Discarding the top 3/4ths is a low-pass filter in this case. Padding to a power of 2 number of samples will probably give you a speed benefit. Be aware of how your FFT package stores samples though. If it's a complex FFT (which is much easier to analyze, and generally has nicer properties), the frequencies will either go from -22 to 22, or 0 to 44. In the first case, you want the middle 1/4th. In the latter, the outermost 1/4th.
You can do an adequate job by averaging sample values together. The naïve way of grabbing samples four by four and doing an equal weighted average works, but isn't too great. Instead you'll want to use a "kernel" function that averages them together in a non-intuitive way.
Mathwise, discarding everything outside the low-frequency band is multiplication by a box function in frequency space. The (inverse) Fourier transform turns pointwise multiplication into a convolution of the (inverse) Fourier transforms of the functions, and vice-versa. So, if we want to work in the time domain, we need to perform a convolution with the (inverse) Fourier transform of box function. This turns out to be proportional to the "sinc" function (sin at)/at, where a is the width of the box in the frequency space. So at every 4th location (since you're downsampling by a factor of 4) you can add up the points near it, multiplied by sin (a dt) / a dt, where dt is the distance in time to that location. How nearby? Well, that depends on how good you want it to sound. It's common to ignore everything outside the first zero, for instance, or just take the number of points to be the ratio by which you're downsampling.
Finally there's the piss-poor (but fast) way of just discarding the majority of the samples, keeping just the zeroth, the fourth, and so on.
Honestly, if it fits in memory, I'd recommend just going the DFT route. If it doesn't, use one of the software filter packages that others have recommended to construct the filter for you.
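A minimal sketch of that DFT route with NumPy: keep the bottom quarter of a real FFT and invert at the new length (scipy.signal.resample packages up the same idea). The 'samples' array and a length divisible by 4 are assumptions.

import numpy as np

n = len(samples)
m = n // 4                                       # new length after the 4x downsampling

spectrum = np.fft.rfft(samples)                  # full-rate spectrum
low = spectrum[:m // 2 + 1]                      # discard everything above the new Nyquist
downsampled = np.fft.irfft(low, n=m) * (m / n)   # inverse FFT at the new length, rescaled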
The process you're after is called "decimation".
There are 2 steps:
Applying a low-pass filter to the data (in your case an LPF with cut-off at π/4).
Downsampling (in your case taking 1 out of every 4 samples).
There are many methods to design and apply the Low Pass Filter.
You may start here:
http://en.wikipedia.org/wiki/Filter_design
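In SciPy those two steps are wrapped into a single call; a minimal sketch, with 'samples' again standing in for your data:

from scipy.signal import decimate

downsampled = decimate(samples, q=4)             # anti-aliasing low-pass filter + keep every 4th sample
# decimate(samples, q=4, ftype="fir") uses an FIR filter instead of the default IIR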
You could make use of libsamplerate to do the heavy lifting. Libsamplerate is a C API and takes care of calculating the filter coefficients. It lets you select from different quality filters so that you can trade off quality for speed.
If you would prefer not to write any code, you could just use Audacity to do the sample rate conversion. It offers a powerful GUI and makes use of libsamplerate for its sample rate conversion.
I would try applying a DFT, chopping off the top 3/4 of the result and applying an inverse DFT. I can't tell if it will sound good without actually trying, though.
I recently came across BruteFIR which may already do some of what you're interested in?
You have to apply a low-pass filter (removing frequencies above 5500 Hz) and then apply decimation (keep every Nth sample, every 4th in your case).
For decimation, FIR filters, not IIR filters, are usually employed, because they don't depend on previous outputs and therefore you don't have to calculate anything for the discarded samples. IIRs generally depend on both inputs and outputs, so unless a specific type of IIR is used, you'd have to calculate every output sample before discarding 3/4 of them.
Just googled an intro-level article on the subject: https://www.dspguru.com/dsp/faqs/multirate/decimation

Resources